{"id":186,"date":"2012-06-19T23:05:32","date_gmt":"2012-06-19T23:05:32","guid":{"rendered":"http:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=186"},"modified":"2014-04-19T14:47:57","modified_gmt":"2014-04-19T14:47:57","slug":"discrepancy-analysis-in-stata","status":"publish","type":"post","link":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=186","title":{"rendered":"Discrepancy analysis in Stata"},"content":{"rendered":"<p>In <a href=\"http:\/\/mephisto.unige.ch\/pub\/publications\/gr\/Studer-et-al_SMR11-preprint\">Studer et al (2011)<\/a> an important new tool is introduced to the field of sequence analysis, the idea of &#8220;discrepancy&#8221; as a way of analysing pairwise distances. This quantity is shown to be analogous to variance, and is thus amenable to ANOVA-type analysis, which means it is a very attractive complement to cluster analysis of distance matrices.<\/p>\n<p>This has been implemented in <a href=\"http:\/\/mephisto.unige.ch\/traminer\/\">TraMineR<\/a> (under R), along with a raft of other innovations coming out of Geneva and Lausanne. Up to now it hasn&#8217;t been available elsewhere. I spoke to Matthias Studer at the <a href=\"http:\/\/www3.unil.ch\/wpmu\/sequences2012\/\">LaCOSA<\/a> conference, and he convinced me that it was easy to code, and that all the information required was in the paper. This turned out to be the case, and I have written an initial Stata implementation. <!--more-->The program, below, focuses on calculating the distance of each sequence to the &#8220;centre of gravity&#8221; of the group it is in (we group the sequences according to a categorical variable, either an observed one or a cluster solution). Summing this distance across the data set gives us an analogue of sums of squares for ANOVA: if all the data is in a single group, the SS is the total sum of squares; if grouped by a &#8220;predictor&#8221; variable, it is the residual (within-group) sum of squares. 
This leads directly to a pseudo-R-squared, and an F-test (though the significance of the F-test needs to be calculated by permutation). Another advantage is that we can easily identify a &#8220;type-sequence&#8221; for a group (either observed or clustered, independently of the cluster algorithm): the &#8220;medoid&#8221; is the sequence (or sequences) with the smallest distance from the centre. Depending on the data, the medoid can be a good summary of a cluster; indeed we can use discrepancy to estimate how good, by calculating the mean distance to the centre for each group.<\/p>\n<p>In due course I will incorporate this in my SADI package (<code>net from http:\/\/teaching.sociology.ul.ie\/sadi; net install sadi<\/code>), but I felt like making the initial attempt public (I need to test this a bit more, and decide how to organise it). For now, if you copy and paste the code below (starting &#8220;<code>program define<\/code>&#8221;) into a file <code>getmedoid.ado<\/code> on your adopath, you will get access to a command like this:<\/p>\n<pre>getmedoid groupvar, distmat(dist) medvar(varstub)<\/pre>\n<p>This takes a distance matrix (<code>dist<\/code>) and a grouping variable (<code>groupvar<\/code>), and creates two new variables. If the value of <code>varstub<\/code> is, say, &#8220;med&#8221; it creates <code>med<\/code> and <code>med_d<\/code>, the former identifying medoids and the latter containing the distance to the centre of the group. 
To get the pseudo-R-squared, do the following:<\/p>\n<pre>gen one = 1\r\ngetmedoid one, distmat(dist) medvar(ss)\r\ngetmedoid group, distmat(dist) medvar(g)\r\nsu ss_d, meanonly\r\nlocal SSt `r(mean)'\r\nsu g_d, meanonly\r\nlocal SSw `r(mean)'\r\nqui tab group\r\nlocal ncats `r(r)'\r\ndi \"pseudo-R2: \" (`SSt' - `SSw')\/`SSt'\r\ndi \"pseudo-F: \" ((`SSt' - `SSw')\/(`ncats'-1))\/(`SSw'\/(_N-`ncats'))<\/pre>\n<p>That is, to get the discrepancy for the whole data set as a single group, run the command on a variable that has one value only. The sum of those distances is the TSS equivalent. Running it on the multi-category group variable and summing the distances gives the RSS analogue. (The code above uses means rather than sums; as long as both means are taken over the same N observations, the pseudo-R-squared and pseudo-F are unaffected.)<\/p>\n<p>The permutation analysis to get an idea of the sampling distribution of the F-statistic seems to be relatively straightforward. I&#8217;ve coded something as proof of concept but not yet set it up as easy to use. If you&#8217;re interested, <a href=\"mailto:brendan.halpin@ul.ie\">email me<\/a> or wait until I get around to including it in SADI.<\/p>\n<p>Many thanks to Matthias and his colleagues.<\/p>\n<pre>program define getmedoid\r\n syntax varlist (min=1 max=1), DISTmat(string) MEDvar(string)\r\n\r\n tempvar seqid\r\n tempvar orgseqid\r\n tempname groupN\r\n\r\n \/\/ Get the classification and its size into matrices\r\n qui tab `varlist', matcell(`groupN')\r\n\r\n \/\/ Save the current sort key (possibly empty) in a local\r\n qui des, varlist\r\n local so `r(sortlist)'\r\n \/\/ Sort by the grouping variable, putting the permuted seqid into mata\r\n gen `orgseqid' = _n\r\n sort `varlist'\r\n gen `seqid' = _n\r\n\r\n \/\/ Restore the original order exactly: orgseqid breaks ties in the\r\n \/\/ sort key, which a plain sort would otherwise order arbitrarily\r\n \/\/ di \"Sorting by [`so']\"\r\n sort `so' `orgseqid'\r\n mata: getmedoid(st_matrix(\"`distmat'\"), st_data(.,\"`seqid'\"), st_matrix(\"`groupN'\"), \"`medvar'\")\r\n\r\nend\r\n\r\n mata:\r\n void getmedoid(real matrix dist, real vector seqorder, real vector groupsize, string varstub) {\r\n\r\n real 
scalar ngroups, i, low, high\r\n real vector cumulate, ro, medoid, dg, SS\r\n real matrix distg\r\n\r\n ngroups = rows(groupsize)\r\n cumulate = J(ngroups+1,1,0)\r\n ro     = J(rows(dist),1,.)\r\n medoid = J(rows(dist),1,.)\r\n dg     = J(rows(dist),1,.)\r\n SS = J(ngroups,1,.)\r\n\r\n \/\/ Reorder the distance matrix so that each group occupies a contiguous block\r\n distg = dist[invorder(seqorder),invorder(seqorder)]\r\n\r\n for (i=1; i&lt;=ngroups; i++) {\r\n cumulate[i+1] = cumulate[i]+groupsize[i]\r\n low = cumulate[i]+1\r\n high = cumulate[i+1]\r\n \/\/ Sum of distances from each sequence to all others in its group\r\n ro[low..high] = rowsum(distg[low..high,low..high])\r\n \/\/ Group SS analogue: half the sum of the group's distance block, divided by group size\r\n SS[i] = sum(distg[low..high,low..high])*0.5*(1\/rows(distg[low..high,low..high]))\r\n \/\/ Distance from each sequence to the group's centre of gravity\r\n dg[low..high,1] = (ro[low..high] :- SS[i]) :\/ rows(distg[low..high,1])\r\n\r\n \/\/ Flag the sequence(s) closest to the centre as medoid(s)\r\n medoid[low..high,1] = dg[low..high,1] :== min(dg[low..high,1])\r\n }\r\n\r\n \/\/ Map the results back to the original observation order\r\n medoid = medoid[seqorder]\r\n dg = dg[seqorder]\r\n\r\n idx = st_addvar(\"double\",(varstub, varstub+\"_d\"))\r\n st_view(V=.,.,idx)\r\n V[.,.] = (medoid, dg)\r\n\r\n }\r\n\r\nend<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>In Studer et al (2011) an important new tool is introduced to the field of sequence analysis, the idea of &#8220;discrepancy&#8221; as a way of analysing pairwise distances. 
This quantity is shown to be analogous to variance, and is thus amenable to ANOVA-type analysis, which means it is a very attractive complement to cluster analysis &hellip; <a href=\"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=186\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Discrepancy analysis in Stata<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7,4],"tags":[],"_links":{"self":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/186"}],"collection":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=186"}],"version-history":[{"count":24,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/186\/revisions"}],"predecessor-version":[{"id":294,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/186\/revisions\/294"}],"wp:attachment":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=186"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=186"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_
route=%2Fwp%2Fv2%2Ftags&post=186"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}