{"id":579,"date":"2019-10-08T20:07:43","date_gmt":"2019-10-08T20:07:43","guid":{"rendered":"http:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=579"},"modified":"2019-10-08T20:07:43","modified_gmt":"2019-10-08T20:07:43","slug":"webscraping-wikipedia-update","status":"publish","type":"post","link":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=579","title":{"rendered":"Webscraping Wikipedia: update"},"content":{"rendered":"<p>Sunday&#8217;s procrastination showed how to <a href=\"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=580\">webscrape Wikipedia using Emacs<\/a>.<\/p>\n<p>I&#8217;ll quickly present a tidier version here, with Emacs code that scrapes a single page, outputting for each edit in the history the page topic, the user and the time-stamp. Then I&#8217;ll show a little bash script that calls the elisp many times.<\/p>\n<p>Unlike the previous version, it just does one random Wikipedia URL at a time, and outputs topic, user and timestamp, not just topic and time-stamp. It uses much of the same code:<br \/>\n<!--more--><\/p>\n<pre>\r\n(defvar history-url \r\n  \"https:\/\/en.wikipedia.org\/w\/index.php?title=%s&amp;offset=&amp;limit=5000&amp;action=history\")\r\n\r\n(defun getrandompage ()\r\n  (set-buffer (url-retrieve-synchronously\r\n               \"https:\/\/en.wikipedia.org\/wiki\/Special:Random\"))\r\n  (re-search-forward \"<title>\\\\(.+\\\\) - Wikipedia<\/title>\" nil t)\r\n  (match-string 1))\r\n\r\n(defun gethistory (page)\r\n  (let (results)\r\n    (save-mark-and-excursion\r\n      (set-buffer (url-retrieve-synchronously\r\n                   (format history-url page)))\r\n      (while (re-search-forward \"&lt;li data-mw-revid=.+<\/li>\" nil t)\r\n        (push (match-string 0) results))\r\n      results))) \r\n;; doesn't store page-name unlike earlier version, just\r\n;; the HTML code of the edit summaries, as a list\r\n\r\n(defun parse-wiki-edit-3 (slug)\r\n  (let ((result))\r\n    (string-match \"mw-changeslist-date[^&gt;]+?&gt;\\\\([^&lt;]+\\\\)&quot; slug)\r\n    (push (match-string 1 slug) result)\r\n    (string-match &quot;Special:Contributions\/\\\\([^\\&quot;]+\\\\)&quot; slug)\r\n    (push (match-string 1 slug) result)\r\n    (string-match &quot;title=\\&quot;\\\\([^\\&quot;]+\\\\)\\&quot;&quot; slug)\r\n    (push (match-string 1 slug) result)))\r\n<\/pre>\n<p>The <code>gethistory<\/code> defun differs in that it returns just the edit-history HTML excerpts (Sunday&#8217;s version also returned the topic). The <code>parse-wiki-edit-3<\/code> defun then returns a list containing the topic, user and timestamp.<\/p>\n<p>The action code is as follows:<\/p>\n<pre>\r\n(mapcar (lambda (detail)\r\n          (princ (apply 'format \"%s\\t%s\\t%s\\n\" \r\n                        (parse-wiki-edit-3 detail))))\r\n        (gethistory (getrandompage)))\r\n<\/pre>\n<p>This gets a single random Wikipedia page (see Sunday&#8217;s account for details), extracts the edit history, and then for each edit, extracts the topic\/user\/timestamp. The <code>princ<\/code> function outputs all this to standard output (with tabs as separators). 
<p>Outputting to standard output makes this work as a shell command &#8211; save the scraper code above to a file (<code>scrapeone.el<\/code>) and execute the following command:<\/p>\n<p><code>emacs --batch -l scrapeone.el<\/code><\/p>\n<p>and you&#8217;ll get output like this:<\/p>\n<p><code>brendan$ emacs --batch -l scrapeone.el<br \/>\nContacting host: en.wikipedia.org:443<br \/>\nuncompressing publicsuffix.txt.gz...<br \/>\nuncompressing publicsuffix.txt.gz...done<br \/>\nReference electrode\tKaverin\t17:11, 6 February 2005<br \/>\nReference electrode\tKaverin\t17:12, 6 February 2005<br \/>\nReference electrode\tKaverin\t17:15, 6 February 2005<br \/>\nReference electrode\tKaverin\t17:17, 6 February 2005<br \/>\nReference electrode\tHenrygb\t23:41, 16 February 2005<br \/>\nReference electrode\t66.80.80.118\t01:35, 18 February 2005<br \/>\n. . .<br \/>\n<\/code><\/p>\n<p>Some of these lines (the first three) are sent to standard error, not standard output, so if we redirect standard output to a file, we get only the good stuff. The following code is a Bash script that takes two parameters, a base name for the output file and the number of random URLs to check, and redirects the output accordingly. The resulting output is a tab-delimited file that can be read into almost any processing software (a rough read-back sketch follows at the end of this post):<\/p>\n<pre>\r\n#!\/usr\/bin\/env bash\r\necho \"Webscraping $2 random Wikipedia topics to $1.csv\"\r\n\r\necho -e \"topic\\tuser\\ttimestamp\" &gt; $1.csv\r\nfor i in $(seq $2); do\r\n    emacs --batch -l scrapeone.el &gt;&gt; $1.csv\r\ndone\r\n<\/pre>\n<p>Save that as <code>scrapemany.sh<\/code> and run it: <code>bash scrapemany.sh test1 10<\/code> will create a file <code>test1.csv<\/code> with info from 10 random Wikipedia pages.<\/p>\n
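<p>As a quick, rough sketch (not part of the original scripts) of reading the tab-delimited output back in, the following pulls the file into Emacs and counts edits per topic; the function name <code>count-edits-per-topic<\/code> and the file name <code>test1.csv<\/code> are only placeholders:<\/p>\n<pre>\r\n;; Rough sketch: read the tab-delimited output back in and count edits per topic.\r\n(defun count-edits-per-topic (file)\r\n  (let ((counts (make-hash-table :test 'equal)))\r\n    (with-temp-buffer\r\n      (insert-file-contents file)\r\n      ;; drop the header line, then take the first tab-separated field of each row\r\n      (dolist (line (cdr (split-string (buffer-string) \"\\n\" t)))\r\n        (let ((topic (car (split-string line \"\\t\"))))\r\n          (puthash topic (1+ (gethash topic counts 0)) counts))))\r\n    (maphash (lambda (topic n)\r\n               (princ (format \"%s\\t%d\\n\" topic n)))\r\n             counts)))\r\n\r\n(count-edits-per-topic \"test1.csv\")\r\n<\/pre>\n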
","protected":false},"excerpt":{"rendered":"<p>Sunday&#8217;s procrastination showed how to webscrape Wikipedia using Emacs. I&#8217;ll quickly present a tidier version here, with Emacs code that scrapes a single page, outputting for each edit in the history the page topic, the user and the time-stamp. Then I&#8217;ll show a little bash script that calls the elisp many times. Unlike the previous &hellip; <a href=\"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=579\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Webscraping Wikipedia: update<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/579"}],"collection":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=579"}],"version-history":[{"count":11,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/579\/revisions"}],"predecessor-version":[{"id":599,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/579\/revisions\/599"}],"wp:attachment":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=579"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=579"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=579"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}