Webscraping Wikipedia: update

Sunday’s procrastination showed how to webscrape Wikipedia using Emacs.

I’ll quickly present a tidier version here, with Emacs code that scrapes a single page and outputs the page topic, the user and the timestamp for each edit in its history. Then I’ll show a little bash script that calls the elisp many times.

Unlike the previous version, it handles just one random Wikipedia page at a time, and it outputs topic, user and timestamp rather than just topic and timestamp. It uses much of the same code:

(defvar history-url
  "https://en.wikipedia.org/w/index.php?title=%s&offset=&limit=5000&action=history"
  "URL template for a page's edit history; %s is replaced by the page title.")

(defun getrandompage ()
  "Fetch Special:Random and scrape the title of the resulting page from its HTML."
  (set-buffer (url-retrieve-synchronously
               "https://en.wikipedia.org/wiki/Special:Random"))
  (re-search-forward "\\(.+\\) - Wikipedia" nil t)
  (match-string 1))

(defun gethistory (page)
  "Fetch PAGE's edit history; return the raw <li> HTML line for each edit."
  (let (results)
    (save-mark-and-excursion
      (set-buffer (url-retrieve-synchronously
                   (format history-url page)))
      (while (re-search-forward "<li data-mw-revid=.+" nil t)
        (push (match-string 0) results))
      results)))
;; Unlike the earlier version this doesn't store the page-name, just
;; the HTML code of the edit summaries, as a list.

(defun parse-wiki-edit-3 (slug)
  "Parse one edit-summary HTML line SLUG into (topic user timestamp)."
  (let (result)
    ;; Pushed in reverse order, so the final list reads (topic user timestamp).
    (string-match "mw-changeslist-date[^>]+?>\\([^<]+\\)" slug)
    (push (match-string 1 slug) result)
    (string-match "Special:Contributions/\\([^\"]+\\)" slug)
    (push (match-string 1 slug) result)
    (string-match "title=\"\\([^\"]+\\)\"" slug)
    (push (match-string 1 slug) result)
    result))

The gethistory defun differs in that it returns just the edit-history HTML excerpts (Sunday’s version also returned the topic). The parse-wiki-edit-3 defun then takes one of those excerpts and returns a list containing the topic, user and timestamp.
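To make the parsing concrete, here’s a small illustration. The HTML below is a made-up, heavily simplified edit-summary line (real Wikipedia markup carries far more attributes), but it contains the three pieces the regexps look for, so parse-wiki-edit-3 should return the expected three-element list:

;; Illustrative only: a simplified stand-in for one <li data-mw-revid=...> line.
(parse-wiki-edit-3
 (concat "<li data-mw-revid=\"123\">"
         "<a href=\"/w/index.php?title=Example&oldid=123\" "
         "class=\"mw-changeslist-date\" title=\"Example\">"
         "17:11, 6 February 2005</a> "
         "<a href=\"/wiki/Special:Contributions/Kaverin\">Kaverin</a></li>"))
;; => ("Example" "Kaverin" "17:11, 6 February 2005")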

The action code is as follows:

(mapcar (lambda (detail)
          (princ (apply 'format "%s\t%s\t%s\n" 
                        (parse-wiki-edit-3 detail))))
        (gethistory (getrandompage)))

This gets a single random Wikipedia page (see Sunday’s account for details), extracts its edit history, and then, for each edit, extracts the topic, user and timestamp. The princ function writes all this to standard output, with tabs as separators. Writing to standard output is what makes this work as a shell command: save the above code to a file (scrapeone.el) and execute the following command:

emacs --batch -l scrapeone.el

and you’ll get output like this:

brendan$ emacs --batch -l scrapeone.el
Contacting host: en.wikipedia.org:443
uncompressing publicsuffix.txt.gz...
uncompressing publicsuffix.txt.gz...done
Reference electrode Kaverin 17:11, 6 February 2005
Reference electrode Kaverin 17:12, 6 February 2005
Reference electrode Kaverin 17:15, 6 February 2005
Reference electrode Kaverin 17:17, 6 February 2005
Reference electrode Henrygb 23:41, 16 February 2005
Reference electrode 66.80.80.118 01:35, 18 February 2005
. . .

Some of these lines (here, the first three) are sent to standard error, not standard output, so if we redirect standard output to a file, we get only the good stuff. The following Bash script takes two parameters, a file name for the output (without extension) and the number of random pages to scrape, and handles the redirection. The resulting output is a tab-delimited file that can be read into almost any processing software:

#!/usr/bin/env bash
echo "Webscraping $2 random Wikipedia topics to $1.csv"

# Header row, then one batch of tab-separated rows per random page.
echo -e "topic\tuser\ttimestamp" > "$1.csv"
for i in $(seq "$2"); do
    emacs --batch -l scrapeone.el >> "$1.csv"
done

Save that as scrapemany.sh and run it: bash scrapemany.sh test1 10 will create a file test1.csv with the edit histories of 10 random Wikipedia pages.
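If you’d rather pull the results straight back into Emacs than into other software, a minimal sketch along these lines should do it (read-scrape-file is just an illustrative name, and the file name is assumed to be the test1.csv produced above):

;; Minimal sketch: read the tab-delimited file back into a list of
;; (topic user timestamp) lists, skipping the header row.
(defun read-scrape-file (file)
  (with-temp-buffer
    (insert-file-contents file)
    (mapcar (lambda (line) (split-string line "\t"))
            (cdr (split-string (buffer-string) "\n" t)))))

(read-scrape-file "test1.csv")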
