Webscraping Wikipedia: update

Sunday’s procrastination showed how to webscrape Wikipedia using Emacs.

I’ll quickly present a tidier version here, with Emacs code that scrapes a single page and outputs the page topic, the user and the timestamp for each edit in its history. Then I’ll show a little bash script that calls the elisp many times.

Unlike the previous version, it handles just one random Wikipedia page at a time, and it outputs topic, user and timestamp rather than just topic and timestamp. It uses much of the same code:

(defvar history-url
  "https://en.wikipedia.org/w/index.php?title=%s&offset=&limit=5000&action=history"
  "URL template for a page's edit history; %s is replaced by the page title.")

(defun getrandompage ()
  "Fetch Special:Random and scrape the title of the resulting page from its HTML."
  (set-buffer (url-retrieve-synchronously
               "https://en.wikipedia.org/wiki/Special:Random"))
  (re-search-forward "\\(.+\\) - Wikipedia" nil t)
  (match-string 1))

(defun gethistory (page)
  "Fetch PAGE's edit history; return the raw <li> HTML line for each edit."
  (let (results)
    (save-mark-and-excursion
      (set-buffer (url-retrieve-synchronously
                   (format history-url page)))
      (while (re-search-forward "<li data-mw-revid=.+" nil t)
        (push (match-string 0) results))
      results)))
;; Unlike the earlier version this doesn't store the page-name, just
;; the HTML code of the edit summaries, as a list.

(defun parse-wiki-edit-3 (slug)
  "Parse one edit-summary HTML line SLUG into (topic user timestamp)."
  (let (result)
    ;; Pushed in reverse order, so the final list reads (topic user timestamp).
    (string-match "mw-changeslist-date[^>]+?>\\([^<]+\\)" slug)
    (push (match-string 1 slug) result)
    (string-match "Special:Contributions/\\([^\"]+\\)" slug)
    (push (match-string 1 slug) result)
    (string-match "title=\"\\([^\"]+\\)\"" slug)
    (push (match-string 1 slug) result)
    result))

The gethistory defun differs in that it returns just the edit-history HTML excerpts (Sunday’s version also returned the topic). The parse-wiki-edit-3 defun then takes one of those excerpts and returns a list containing the topic, user and timestamp.
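To make the parsing concrete, here’s a small illustration. The HTML below is a made-up, heavily simplified edit-summary line (real Wikipedia markup carries far more attributes), but it contains the three pieces the regexps look for, so parse-wiki-edit-3 should return the expected three-element list:

;; Illustrative only: a simplified stand-in for one <li data-mw-revid=...> line.
(parse-wiki-edit-3
 (concat "<li data-mw-revid=\"123\">"
         "<a href=\"/w/index.php?title=Example&oldid=123\" "
         "class=\"mw-changeslist-date\" title=\"Example\">"
         "17:11, 6 February 2005</a> "
         "<a href=\"/wiki/Special:Contributions/Kaverin\">Kaverin</a></li>"))
;; => ("Example" "Kaverin" "17:11, 6 February 2005")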

The action code is as follows:

(mapcar (lambda (detail)
          (princ (apply 'format "%s\t%s\t%s\n" 
                        (parse-wiki-edit-3 detail))))
        (gethistory (getrandompage)))

This gets a single random Wikipedia page (see Sunday’s account for details), extracts its edit history, and then, for each edit, extracts the topic, user and timestamp. The princ function writes all this to standard output, with tabs as separators. Writing to standard output is what makes this work as a shell command: save the above code to a file (scrapeone.el) and execute the following command:

emacs --batch -l scrapeone.el

and you’ll get output like this:

brendan$ emacs --batch -l scrapeone.el
Contacting host: en.wikipedia.org:443
uncompressing publicsuffix.txt.gz...
uncompressing publicsuffix.txt.gz...done
Reference electrode Kaverin 17:11, 6 February 2005
Reference electrode Kaverin 17:12, 6 February 2005
Reference electrode Kaverin 17:15, 6 February 2005
Reference electrode Kaverin 17:17, 6 February 2005
Reference electrode Henrygb 23:41, 16 February 2005
Reference electrode 66.80.80.118 01:35, 18 February 2005
. . .

Some of these lines (here, the first three) are sent to standard error, not standard output, so if we redirect standard output to a file, we get only the good stuff. The following Bash script takes two parameters, a file name for the output (without extension) and the number of random pages to scrape, and handles the redirection. The resulting output is a tab-delimited file that can be read into almost any processing software:

#!/usr/bin/env bash
echo "Webscraping $2 random Wikipedia topics to $1.csv"

# Header row, then one batch of tab-separated rows per random page.
echo -e "topic\tuser\ttimestamp" > "$1.csv"
for i in $(seq "$2"); do
    emacs --batch -l scrapeone.el >> "$1.csv"
done

Save that as scrapemany.sh and run it: bash scrapemany.sh test1 10 will create a file test1.csv with the edit histories of 10 random Wikipedia pages.
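If you’d rather pull the results straight back into Emacs than into other software, a minimal sketch along these lines should do it (read-scrape-file is just an illustrative name, and the file name is assumed to be the test1.csv produced above):

;; Minimal sketch: read the tab-delimited file back into a list of
;; (topic user timestamp) lists, skipping the header row.
(defun read-scrape-file (file)
  (with-temp-buffer
    (insert-file-contents file)
    (mapcar (lambda (line) (split-string line "\t"))
            (cdr (split-string (buffer-string) "\n" t)))))

(read-scrape-file "test1.csv")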
