Discrepancy analysis in Stata

In Studer et al (2011) an important new tool is introduced to the field of sequence analysis, the idea of “discrepancy” as a way of analysing pairwise distances. This quantity is shown to be analogous to variance, and is thus amenable to ANOVA-type analysis, which means it is a very attractive complement to cluster analysis of distance matrices.

This has been implemented in TraMineR (under R), along with a raft of other innovations coming out of Geneva and Lausanne. Up to now it hasn’t been available elsewhere. I spoke to Matthias Studer at the LaCOSA conference, and he convinced me that it was easy to code, and that all the information required was in the paper. This turned out to be the case, and I have written an initial Stata implementation. The program, below, focuses on calculating the distance of each sequence to the “centre of gravity” of the group it is in (we group the sequences according to a categorical variable, either an observed one or a cluster solution). Summing this distance across the data set gives us an analogue of sums of squares for ANOVA: if all the data is in a single group, the SS is the total sum of squares; if grouped by a “predictor” variable, it is the predicted sum of squares. This leads directly to a pseudo-R-squared, and an F-test (though the significance of the F-test needs to be calculated by permutation).  Another advantage is that we can easily identify a “type-sequence” for a group (either observed or clustered, independently of the cluster algorithm): the “medoid” is the sequence (or sequences) with the smallest distance from the centre. Depending on the data, the medoid can be a good summary of a cluster; indeed we can use discrepancy to estimate how good, by calculating the mean distance to the centre for each group.

In due course I will incorporate this in my SADI package (net from http://teaching.sociology.ul.ie/sadi; net install sadi), but I felt like making the initial attempt public (I need to test this a bit more, and decide how to organise it). For now, if you copy and paste the code below (starting “program define“) into a file getmedoid.ado on your adopath, you will get access to a command like this:

getmedoid groupvar, distmat(dist) medvar(varstub)

This takes a distance matrix (dist) and a grouping variable (groupvar), and creates two new variables. If the value of varstub is, say, “med” it creates med and med_d, the former identifying medoids and  the latter containing the distance to the centre of the group. To get the pseudo-R-squared, do the following:

gen one = 1
getmedoid one, distmat(dist) medvar(ss)
getmedoid group, distmat(dist) medvar(g)
su ss_d, meanonly
local SSt `r(mean)'
su g_d, meanonly
local SSw `r(mean)'
qui tab group
local ncats `r(r)'
di "pseudo-R2: " (`SSt' - `SSw')/`SSt'
di "pseudo-F: " ((`SSt' - `SSw')/(`ncats'-1))/(`SSw'/(_N-`ncats'))

That is, to get the discrepancy for the whole data set as a single group, run the command on a variable that has one value only. The sum of those distances are the TSS equivalent. Running it on the multi-category group variable and summing the distances gives the RSS analogue.

The permutation analysis to get an idea of the sampling distribution of the F-statistic seems to be relatively straightforward. I’ve coded something as proof of concept but not yet set it up as easy to use. If you’re interested, email me or wait until I get around to including it in SADI.

Many thanks to Matthias and his colleagues.

program define getmedoid
 syntax varlist (min=1 max=1), DISTmat(string) MEDvar(string)

 tempvar seqid
 tempvar orgseqid
 tempname groupN

 // Get the classification and its size into matrices
 qui tab `varlist', matcell(`groupN')

 // Save the sort order in a local and a variable (seqid)
 qui des, varlist
 local so `r(sortlist)'
 if ("`so'"=="") local so `orgseqid'
 // Sort by the grouping variable, putting the permuted seqid into mata
 gen `orgseqid' = _n
 sort `varlist'
 gen `seqid' = _n

 // Restore sort order
 // di "Sorting by [`so']"
 sort `so'
 mata: getmedoid(st_matrix("`distmat'"), st_data(.,"`seqid'"), st_matrix("`groupN'"), "`medvar'")

end

 mata:
 void getmedoid(real matrix dist, real vector seqorder, real vector groupsize, string varstub) {

 real scalar ngroups, i, low, high
 real vector cumulate, ro, medoid, dg, SS
 real matrix distg

 ngroups = rows(groupsize)
 cumulate = J(ngroups+1,1,0)
 ro     = J(rows(dist),1,.)
 medoid = J(rows(dist),1,.)
 dg     = J(rows(dist),1,.)
 SS = J(ngroups,1,.)

 distg = dist[invorder(seqorder),invorder(seqorder)]

 for (i=1; i<=ngroups; i++) {
 cumulate[i+1] = cumulate[i]+groupsize[i]
 low = cumulate[i]+1
 high = cumulate[i+1]
 ro[low..high] = rowsum(distg[low..high,low..high])
 SS[i] = sum(distg[low..high,low..high])*0.5*(1/rows(distg[low..high,low..high]))
 dg[low..high,1] = (ro[low..high] :- SS[i]) :/ rows(distg[low..high,1])

 medoid[low..high,1] = dg[low..high,1] :== min(dg[low..high,1])
 }

 medoid = medoid[seqorder]
 dg = dg[seqorder]

 idx = st_addvar("double",(varstub, varstub+"_d"))
 st_view(V=.,.,idx)
 V[.,.] = (medoid, dg)

 }

end