Sunday’s procrastination showed how to webscrape Wikipedia using Emacs.
I’ll quickly present a tidier version here, with Emacs code that scrapes a single page and outputs, for each edit in its history, the page topic, the user and the timestamp. Then I’ll show a little bash script that calls the elisp many times.
Unlike the previous version, it does just one random Wikipedia URL at a time, and it outputs topic, user and timestamp rather than just topic and timestamp. It uses much of the same code:
(defvar history-url
  "https://en.wikipedia.org/w/index.php?title=%s&offset=&limit=5000&action=history")
(defun getrandompage ()
  (set-buffer (url-retrieve-synchronously
               "https://en.wikipedia.org/wiki/Special:Random"))
  (re-search-forward "\\(.+\\) - Wikipedia " nil t)
  (match-string 1))
;; unlike the earlier version, this doesn't store the page name; it just
;; returns the HTML of each edit line in the history, as a list
(defun gethistory (page)
  (let (results)
    (save-mark-and-excursion
      (set-buffer (url-retrieve-synchronously
                   (format history-url page)))
      (while (re-search-forward "<li data-mw-revid=.+" nil t)
        (push (match-string 0) results))
      results)))
(defun parse-wiki-edit-3 (slug)
  (let ((result))
    (string-match "mw-changeslist-date[^>]+?>\\([^<]+\\)" slug)
    (push (match-string 1 slug) result)
    (string-match "Special:Contributions/\\([^\"]+\\)" slug)
    (push (match-string 1 slug) result)
    (string-match "title=\"\\([^\"]+\\)\"" slug)
    (push (match-string 1 slug) result)))
The gethistory defun differs in that it returns just the edit-history HTML excerpts (Sunday’s version also returned the topic). The parse-wiki-edit-3 defun then returns a list containing the topic, user and timestamp.
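To make the parsing concrete, here is a rough illustration (not from the original post) with a made-up edit-history <li> excerpt; the real markup is much longer, but these are the only bits the three regexps rely on:

(parse-wiki-edit-3
 (concat "<li data-mw-revid=\"1234\">"
         "<a href=\"/w/index.php?title=Example_page&oldid=1234\""
         " class=\"mw-changeslist-date\" title=\"Example page\">"
         "12:34, 1 January 2005</a> "
         "<a href=\"/wiki/Special:Contributions/SomeUser\">SomeUser</a>"
         "</li>"))
;; => ("Example page" "SomeUser" "12:34, 1 January 2005")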
The action code is as follows:
(mapcar (lambda (detail)
          (princ (apply 'format "%s\t%s\t%s\n"
                        (parse-wiki-edit-3 detail))))
        (gethistory (getrandompage)))
This gets a single random Wikipedia page (see Sunday’s account for details), extracts its edit history, and then, for each edit, extracts the topic, user and timestamp. The princ function writes it all to standard output, with tabs as separators, which means the code works as a shell command: save it to a file (scrapeone.el) and execute the following command:
emacs --batch -l scrapeone.el
and you’ll get output like this:
brendan$ emacs --batch -l scrapeone.el
Contacting host: en.wikipedia.org:443
uncompressing publicsuffix.txt.gz...
uncompressing publicsuffix.txt.gz...done
Reference electrode Kaverin 17:11, 6 February 2005
Reference electrode Kaverin 17:12, 6 February 2005
Reference electrode Kaverin 17:15, 6 February 2005
Reference electrode Kaverin 17:17, 6 February 2005
Reference electrode Henrygb 23:41, 16 February 2005
Reference electrode 66.80.80.118 01:35, 18 February 2005
. . .
Some of these lines (here, the first three) are sent to standard error, not standard output, so if we redirect standard output to a file we keep only the good stuff. The following Bash script takes two parameters, a base name for the output file and the number of random pages to scrape, and handles the redirection. The result is a tab-delimited file that can be read into almost any processing software:
#!/usr/bin/env bash
echo "Webscraping $2 random Wikipedia topics to $1.csv"
echo -e "topic\tuser\ttimestamp" > "$1.csv"
for i in $(seq "$2"); do
    emacs --batch -l scrapeone.el >> "$1.csv"
done
Save that as scrapemany.sh and run it: bash scrapemany.sh test1 10 will create a file test1.csv with the edit histories of 10 random Wikipedia pages.
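Going by the sample run above, the first few lines of test1.csv would look something like this (columns separated by tabs):

topic	user	timestamp
Reference electrode	Kaverin	17:11, 6 February 2005
Reference electrode	Kaverin	17:12, 6 February 2005
. . .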