{"id":580,"date":"2019-10-06T16:45:21","date_gmt":"2019-10-06T16:45:21","guid":{"rendered":"http:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=580"},"modified":"2019-10-08T20:08:55","modified_gmt":"2019-10-08T20:08:55","slug":"webscraping-wikipedia-with-emacs","status":"publish","type":"post","link":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=580","title":{"rendered":"Webscraping Wikipedia with Emacs"},"content":{"rendered":"<div id=\"outline-container-org43f87f6\" class=\"outline-2\">\n<h2 id=\"org43f87f6\">Idle hands<\/h2>\n<div class=\"outline-text-2\" id=\"text-1\">\n<p>\nFor the want of something better to do (okay, because procrastination), a pass at webscraping Wikipedia. For fun. I&#8217;m going to use its &#8220;Random Page&#8221; to sample pages, and then extract the edit history (looking at how often each page is edited, and when). Let&#8217;s say we&#8217;re interested in getting an idea of the distribution of interest in editing pages.\n<\/p>\n<p>See update: <a href=\"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=579\">tidier code<\/a>.<\/p>\n<p>\nI&#8217;m going to use Emacs Lisp for the web scraping.\n<\/p>\n<p>\nOK, Wikipedia links to a random page from the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Special:Random\">Random Page<\/a> link in the left-hand menu. This is its URL:\n<\/p>\n<pre class=\"example\">\r\nhttps:\/\/en.wikipedia.org\/wiki\/Special:Random\r\n<\/pre>\n<p>\nHow random is this page? 
See <a href=\"https:\/\/en.wikipedia.org\/wiki\/Wikipedia:FAQ\/Technical#random\">https:\/\/en.wikipedia.org\/wiki\/Wikipedia:FAQ\/Technical#random<\/a>\n<\/p>\n<\/div>\n<\/div>\n<p><!--more--><\/p>\n<div id=\"outline-container-org245c2a7\" class=\"outline-2\">\n<h2 id=\"org245c2a7\">Scraping code<\/h2>\n<div class=\"outline-text-2\" id=\"text-2\">\n<p>\nThe little function below uses <code>url-retrieve-synchronously<\/code> to retrieve the page into an Emacs buffer, and then uses a regular-expression search to extract the title of the page.\n<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-emacs-lisp\">(<span style=\"color: #A52A2A;font-weight: bold\">defun<\/span> <span style=\"color: #00578E;font-weight: bold\">getrandompage<\/span> ()\r\n  (set-buffer (url-retrieve-synchronously <span style=\"color: #4E9A06\">\"https:\/\/en.wikipedia.org\/wiki\/Special:Random\"<\/span>))\r\n  (re-search-forward <span style=\"color: #4E9A06\">\"&lt;title&gt;<\/span><span style=\"color: #4E9A06;font-weight: bold\">\\\\<\/span><span style=\"color: #4E9A06;font-weight: bold\">(<\/span><span style=\"color: #4E9A06\">.+<\/span><span style=\"color: #4E9A06;font-weight: bold\">\\\\<\/span><span style=\"color: #4E9A06;font-weight: bold\">)<\/span><span style=\"color: #4E9A06\"> - Wikipedia&lt;\/title&gt;\"<\/span> nil t)\r\n  (match-string 1))\r\n<\/pre>\n<\/div>\n<p>\nA page&#8217;s edit history is available at this URL:\n<\/p>\n<pre class=\"example\">\r\nhttps:\/\/en.wikipedia.org\/w\/index.php?title=PAGENAME&offset=&limit=5000&action=history\r\n<\/pre>\n<p>\nNormally, there are links for varying numbers of edits shown per page, up to a max of 500, but I&#8217;ve put 5000 in as the limit, and it seems to work.\n<\/p>\n<p>\nThe edits are in an HTML list, and can be picked up with a regexp of the following form:\n<\/p>\n<pre class=\"example\">\r\n\"&lt;li data-mw-revid=.+&lt;\/li&gt;\"\r\n\r\n<\/pre>\n<p>\nThe following function takes a page name, and returns a list of the edit 
entries, with the name of the page at the top of the list.\n<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-emacs-lisp\">(<span style=\"color: #A52A2A;font-weight: bold\">defvar<\/span> <span style=\"color: #0084C8;font-weight: bold\">history-url<\/span> \r\n  <span style=\"color: #4E9A06\">\"https:\/\/en.wikipedia.org\/w\/index.php?title=%s&amp;offset=&amp;limit=5000&amp;action=history\"<\/span>)\r\n(<span style=\"color: #A52A2A;font-weight: bold\">defun<\/span> <span style=\"color: #00578E;font-weight: bold\">gethistory<\/span> (page)\r\n  (<span style=\"color: #A52A2A;font-weight: bold\">let<\/span> (results)\r\n    (<span style=\"color: #A52A2A;font-weight: bold\">save-mark-and-excursion<\/span>\r\n      (set-buffer (url-retrieve-synchronously\r\n                   (format history-url page)))\r\n      (<span style=\"color: #A52A2A;font-weight: bold\">while<\/span> (re-search-forward <span style=\"color: #4E9A06\">\"&lt;li data-mw-revid=.+&lt;\/li&gt;\"<\/span> nil t)\r\n        (<span style=\"color: #A52A2A;font-weight: bold\">push<\/span> (match-string 0) results))\r\n      (cons page results))))\r\n\r\n<\/pre>\n<\/div>\n<p>\nThere is lots of information in the edit entry (date, user, size of resulting file, change in file size, comment, on top of URLs to compare versions). I just want dates, which I capture with this regexp:\n<\/p>\n<pre class=\"example\">\r\n\"mw-changeslist-date[^&gt;]+?&gt;\\\\([^&lt;]+\\\\)\"\r\n\r\n<\/pre>\n<p>\nThe following block puts this all together, writing the results to a tab-delimited file for further processing. Putting everything in a <code>dotimes<\/code> loop is crude and fragile (any error and the whole batch of data is lost; a more professional approach would catch and deal with errors) but it&#8217;s simple. 
It also ties up the Emacs process, so it is probably best done in a separate process (or <code>--batch<\/code> style).\n<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-emacs-lisp\">\r\n(<span style=\"color: #A52A2A;font-weight: bold\">defun<\/span> <span style=\"color: #00578E;font-weight: bold\">parse-wiki-edit<\/span> (slug)\r\n  (string-match <span style=\"color: #4E9A06\">\"mw-changeslist-date[<\/span><span style=\"color: #4E9A06\">^<\/span><span style=\"color: #4E9A06\">&gt;]+?&gt;<\/span><span style=\"color: #4E9A06;font-weight: bold\">\\\\<\/span><span style=\"color: #4E9A06;font-weight: bold\">(<\/span><span style=\"color: #4E9A06\">[<\/span><span style=\"color: #4E9A06\">^<\/span><span style=\"color: #4E9A06\">&lt;]+<\/span><span style=\"color: #4E9A06;font-weight: bold\">\\\\<\/span><span style=\"color: #4E9A06;font-weight: bold\">)<\/span><span style=\"color: #4E9A06\">\"<\/span> slug)\r\n  (match-string 1 slug))\r\n\r\n<span style=\"color: #204A87\">;; <\/span><span style=\"color: #204A87\">Do it 1000 times, store in a tab-delimited CSV<\/span>\r\n(<span style=\"color: #A52A2A;font-weight: bold\">with-temp-buffer<\/span>\r\n  (insert <span style=\"color: #4E9A06\">\"pagetitle\\tedittime\\n\"<\/span>)\r\n  (<span style=\"color: #A52A2A;font-weight: bold\">dotimes<\/span> (i 1000)\r\n    (<span style=\"color: #A52A2A;font-weight: bold\">setq<\/span> page (<span style=\"color: #A52A2A;font-weight: bold\">save-excursion<\/span> (gethistory (getrandompage))))\r\n    (<span style=\"color: #A52A2A;font-weight: bold\">setq<\/span> title (car page))\r\n    (mapc (<span style=\"color: #A52A2A;font-weight: bold\">lambda<\/span> (x) (insert\r\n                         (format\r\n                          <span style=\"color: #4E9A06\">\"%s\\t%s\\n\"<\/span>\r\n                          title\r\n                          (format-time-string <span style=\"color: #4E9A06\">\"%Y-%m-%d-%T\"<\/span>\r\n                                              
(encode-time (parse-time-string\r\n                                                            (parse-wiki-edit x)))))))\r\n            (cdr page)))\r\n  (write-file <span style=\"color: #4E9A06\">\"\/tmp\/wiki.csv\"<\/span>))\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgd6c7cdb\" class=\"outline-2\">\n<h2 id=\"orgd6c7cdb\">Some results<\/h2>\n<div class=\"outline-text-2\" id=\"text-3\">\n<p>\nI switch to Stata to analyse the resulting file (you can use anything you like, as long as it&#8217;s not <a href=\"https:\/\/twitter.com\/BrendanTHalpin\/status\/1171798865361682433?s=20\">Excel<\/a>). We see an average of about 103 edits per page, but the range is from 1 to 2,748:\n<\/p>\n<pre class=\"example\">\r\n Variable |     Obs      Mean  Std. Dev.    Min     Max\r\n----------+--------------------------------------------\r\n        N |     996  103.1697  265.6243       1    2748\r\n<\/pre>\n<div class=\"figure\">\n<p><img decoding=\"async\" src=\"http:\/\/teaching.sociology.ul.ie\/bhalpin\/histogN.png\" alt=\"histogN.png\" \/><\/p>\n<\/div>\n<p>\nAn extract from the page frequency table gives a flavour of the eclecticism of Wikipedia:\n<\/p>\n<pre class=\"example\">\r\n. tab pagetitle, sort\r\n\r\n                              pagetitle |   Freq. Percent    Cum.\r\n----------------------------------------+------------------------\r\n Mitt Romney 2012 presidential campaign |   2,748    2.67    2.67\r\n               Resident Evil: Afterlife |   2,466    2.40    5.07\r\n          Cartoon Network (Philippines) |   2,255    2.19    7.27\r\n                     Adamson University |   2,226    2.17    9.43\r\nWorld War II casualties of the Soviet.. 
|   2,176    2.12   11.55\r\n                                 Cumans |   2,056    2.00   13.55\r\n                         IndyCar Series |   1,950    1.90   15.45\r\n               Syrian Democratic Forces |   1,881    1.83   17.28\r\n                        Australian Army |   1,835    1.79   19.07\r\n                 Padmanabhaswamy Temple |   1,829    1.78   20.85\r\n              List of Cluedo characters |   1,538    1.50   22.34\r\n[ . . . ]\r\n             Podosinovets, Kirov Oblast |       1    0.00  100.00\r\n                     Stuart Fitzsimmons |       1    0.00  100.00\r\n                              Ust-Morzh |       1    0.00  100.00\r\n----------------------------------------+------------------------\r\n                                  Total | 102,757  100.00\r\n<\/pre>\n<p>\nHere is a picture of how long ago each page&#8217;s earliest edit was:\n<\/p>\n<div class=\"figure\">\n<p><img decoding=\"async\" src=\"..\/histogD.png\" alt=\"histogD.png\" \/>\n<\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Idle hands For the want of something better to do (okay, because procrastination), a pass at webscraping Wikipedia. For fun. I&#8217;m going to use its &#8220;Random Page&#8221; to sample pages, and then extract the edit history (looking at how often edited, and when). 
Let&#8217;s say we&#8217;re interested in getting an idea of the distribution of &hellip; <a href=\"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=580\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Webscraping Wikipedia with Emacs<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/580"}],"collection":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=580"}],"version-history":[{"count":6,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/580\/revisions"}],"predecessor-version":[{"id":600,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/580\/revisions\/600"}],"wp:attachment":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=580"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=580"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=580"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\
/{rel}","templated":true}]}}