Sunday’s procrastination showed how to webscrape Wikipedia using Emacs.
I’ll quickly present a tidier version here, with Emacs Lisp code that scrapes a single page, outputting for each edit in its history the page topic, the user and the timestamp. Then I’ll show a little bash script that calls the elisp many times.
Unlike the previous version, it handles one random Wikipedia URL at a time, and outputs topic, user and timestamp rather than just topic and timestamp. It uses much of the same code:
(defvar history-url
  "https://en.wikipedia.org/w/index.php?title=%s&offset=&limit=5000&action=history")

(defun getrandompage ()
  (set-buffer (url-retrieve-synchronously
               "https://en.wikipedia.org/wiki/Special:Random"))
  (re-search-forward "<title>\\(.+\\) - Wikipedia" nil t)
  (match-string 1))

;; Unlike the earlier version, this doesn't store the page name,
;; just the HTML of each edit summary, as a list.
(defun gethistory (page)
  (let (results)
    (save-mark-and-excursion
      (set-buffer (url-retrieve-synchronously (format history-url page)))
      (while (re-search-forward "<li data-mw-revid=.+" nil t)
        (push (match-string 0) results))
      results)))

;; Pull the timestamp, user and topic out of one edit-summary excerpt;
;; push builds the list in reverse, so the result is (topic user timestamp).
(defun parse-wiki-edit-3 (slug)
  (let ((result))
    (string-match "mw-changeslist-date[^>]+?>\\([^<]+\\)" slug)
    (push (match-string 1 slug) result)
    (string-match "Special:Contributions/\\([^\"]+\\)" slug)
    (push (match-string 1 slug) result)
    (string-match "title=\"\\([^\"]+\\)\"" slug)
    (push (match-string 1 slug) result)))
The gethistory defun differs in that it returns just the edit-history HTML excerpts (Sunday’s version also returned the topic). The parse-wiki-edit-3 defun then returns, for each excerpt, a list containing the topic, user and timestamp.
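Since the three string-match calls do all the work, it can help to see them in isolation. Here is the same extraction done with grep -o from the shell, on a simplified, made-up stand-in for one of the <li data-mw-revid=…> excerpts (the real Wikipedia markup carries many more attributes):

```shell
# A simplified, invented stand-in for one edit-summary excerpt;
# real Wikipedia history markup is much busier than this.
slug='<li data-mw-revid="1234"><a href="/w/index.php?title=Reference_electrode" title="Reference electrode">Reference electrode</a> <a href="/wiki/Special:Contributions/Kaverin">Kaverin</a> <span class="mw-changeslist-date">17:11, 6 February 2005</span></li>'

# topic: the first title="..." attribute
echo "$slug" | grep -o 'title="[^"]*"' | head -n 1 | cut -d'"' -f2

# user: whatever follows Special:Contributions/ up to the next quote
echo "$slug" | grep -o 'Special:Contributions/[^"]*' | cut -d/ -f2

# timestamp: the text inside the mw-changeslist-date element
echo "$slug" | grep -o 'mw-changeslist-date[^>]*>[^<]*' | sed 's/.*>//'
```

The elisp uses capture groups rather than grep -o and cut, but the regexes are the same idea.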
The action code is as follows:
(mapcar (lambda (detail)
          (princ (apply 'format "%s\t%s\t%s\n" (parse-wiki-edit-3 detail))))
        (gethistory (getrandompage)))
This gets a single random Wikipedia page (see Sunday’s account for details), extracts the edit history, and then, for each edit, extracts the topic, user and timestamp. The princ function writes all this to standard output, with tabs as separators. Writing to standard output is what makes this work as a shell command: save the above code to a file (scrapeone.el) and execute the following command:
emacs --batch -l scrapeone.el
and you’ll get output like this:
brendan$ emacs --batch -l scrapeone.el
Contacting host: en.wikipedia.org:443
Reference electrode Kaverin 17:11, 6 February 2005
Reference electrode Kaverin 17:12, 6 February 2005
Reference electrode Kaverin 17:15, 6 February 2005
Reference electrode Kaverin 17:17, 6 February 2005
Reference electrode Henrygb 23:41, 16 February 2005
Reference electrode 184.108.40.206 01:35, 18 February 2005
. . .
Some of these lines (the “Contacting host” messages, for instance) are sent to standard error rather than standard output, so if we redirect standard output to a file, we get only the good stuff. The following Bash script takes two parameters, a name stem for the output file and the number of random URLs to check, and handles the redirection. The result is a tab-delimited file that can be read into almost any processing software:
#!/usr/bin/env bash
echo "Webscraping $2 random Wikipedia topics to $1.csv"
echo -e "topic\tuser\ttimestamp" > "$1.csv"
for i in $(seq "$2"); do
    emacs --batch -l scrapeone.el >> "$1.csv"
done
Save that as scrapemany.sh and run it:

bash scrapemany.sh test1 10

This will create a file test1.csv with info from 10 random Wikipedia pages.
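Because the result is plain tab-delimited text, it can be summarised with standard Unix tools before it ever reaches R or a spreadsheet. As a quick sketch, on an invented three-row file in the same format (the rows echo the sample output above), this counts edits per user:

```shell
# Build a tiny file in the same tab-delimited format (invented rows).
printf 'topic\tuser\ttimestamp\n' > example.csv
printf 'Reference electrode\tKaverin\t17:11, 6 February 2005\n' >> example.csv
printf 'Reference electrode\tKaverin\t17:12, 6 February 2005\n' >> example.csv
printf 'Reference electrode\tHenrygb\t23:41, 16 February 2005\n' >> example.csv

# Drop the header row, keep the user column, count occurrences.
tail -n +2 example.csv | cut -f2 | sort | uniq -c | sort -rn
```

With the real test1.csv, substitute the file name; cut -f2 works because the fields are tab-separated.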