Webscraping Wikipedia: update

Sunday’s procrastination showed how to webscrape Wikipedia using Emacs.

I’ll quickly present a tidier version here, with Emacs code that scrapes a single page, outputting for each edit in the history the page topic, the user and the time-stamp. Then I’ll show a little bash script that calls the elisp many times.

Unlike the previous version, it handles just one random Wikipedia URL at a time, and it outputs topic, user and timestamp rather than just topic and timestamp. It uses much of the same code:

(defvar history-url
  ;; the edit-history URL; %s is filled in with the page title
  "https://en.wikipedia.org/w/index.php?title=%s&action=history")

(defun getrandompage ()
  ;; Special:Random redirects to a random article; return its title
  (set-buffer (url-retrieve-synchronously
               "https://en.wikipedia.org/wiki/Special:Random"))
  (goto-char (point-min))
  (re-search-forward "<title>\\(.+\\) - Wikipedia" nil t)
  (match-string 1))

(defun gethistory (page)
  ;; doesn't store page-name unlike earlier version, just
  ;; the HTML code of the edit summaries, as a list
  (let (results)
    (set-buffer (url-retrieve-synchronously
                 ;; hexify so titles with spaces still form a valid URL
                 (format history-url (url-hexify-string page))))
    (goto-char (point-min))
    (while (re-search-forward "<li data-mw-revid=.+" nil t)
      (push (match-string 0) results))
    (nreverse results)))

(defun parse-wiki-edit-3 (slug)
  (let (result)
    ;; pull out timestamp, user and topic in turn; pushing in this
    ;; order means the returned list is (topic user timestamp)
    (string-match "mw-changeslist-date[^>]+?>\\([^<]+\\)" slug)
    (push (match-string 1 slug) result)
    (string-match "Special:Contributions/\\([^\"]+\\)" slug)
    (push (match-string 1 slug) result)
    (string-match "title=\"\\([^\"]+\\)\"" slug)
    (push (match-string 1 slug) result)))

The gethistory defun differs in that it returns just the edit-history HTML excerpts (Sunday’s version also returned the topic). The parse-wiki-edit-3 defun then returns a list containing the topic, user and timestamp.
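
To make the parsing concrete, here is what it does to a single history-list item. The <li> excerpt below is made up and heavily trimmed (real history entries carry far more markup), but it exercises the same three regexps:

(parse-wiki-edit-3
 (concat "<li data-mw-revid=\"1\">"
         "<a class=\"mw-changeslist-date\" title=\"Reference electrode\" "
         "href=\"/w/index.php?oldid=1\">17:11, 6 February 2005</a> "
         "<a href=\"/wiki/Special:Contributions/Kaverin\">Kaverin</a></li>"))
;; => ("Reference electrode" "Kaverin" "17:11, 6 February 2005")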

The action code is as follows:

(mapcar (lambda (detail)
          (princ (apply 'format "%s\t%s\t%s\n" 
                        (parse-wiki-edit-3 detail))))
        (gethistory (getrandompage)))

This gets a single random Wikipedia page (see Sunday’s account for details), extracts the edit history, and then for each edit, extracts the topic/user/timestamp. The princ function outputs all this to standard output (with tabs as separators). Outputting to standard output makes this work as a shell command – save the above code to a file (scrapeone.el) and execute the following command:

emacs --batch -l scrapeone.el

and you’ll get output like this:

brendan$ emacs --batch -l scrapeone.el
Contacting host: en.wikipedia.org:443
uncompressing publicsuffix.txt.gz...
uncompressing publicsuffix.txt.gz...done
Reference electrode Kaverin 17:11, 6 February 2005
Reference electrode Kaverin 17:12, 6 February 2005
Reference electrode Kaverin 17:15, 6 February 2005
Reference electrode Kaverin 17:17, 6 February 2005
Reference electrode Henrygb 23:41, 16 February 2005
Reference electrode 01:35, 18 February 2005
. . .

Some of these lines (the first three) are sent to standard error, not standard output, so if we redirect to a file, we get only the good stuff. The following Bash script takes two parameters, a file name for the output and the number of random URLs to check, and redirects standard output into that file. The resulting file is tab-delimited and can be read into almost any processing software:

#!/usr/bin/env bash
echo "Webscraping $2 random Wikipedia topics to $1.csv"

echo -e "topic\tuser\ttimestamp" > "$1.csv"
for i in $(seq "$2"); do
    emacs --batch -l scrapeone.el >> "$1.csv"
done
Save that as scrapemany.sh and run it: bash scrapemany.sh test1 10 will create a file test1.csv with the edit details from 10 random Wikipedia pages.
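
The “Contacting host” and “uncompressing” messages still appear in the terminal, since only standard output goes into the file; if they get in the way, discard standard error when invoking the script:

bash scrapemany.sh test1 10 2>/dev/null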
