// * Section 2: Working with real data
//
// ** Working with real data
//
// We will work with two main example data sets: a six-year span of labour
// market activity of women who have a birth at the end of year two (drawn
// from the BHPS), and the McVicar/Anyadike-Danes school-leaver data set.
// We will work with the mothers' labour market sequences first:
//
// #+BEGIN_SRC
use http://teaching.sociology.ul.ie/oslo/bsseq
// #+END_SRC
//
// The labour market status variable has the values full-time, part-time,
// unemployed and non-employed, and we have 72 consecutive monthly
// observations per woman. How do we get an overview?
// #+begin_src
tab state1 state72
tab state1 state24
tab state24 state72
// #+end_src
//
// Better to make a chronogram or time-dependent state-distribution graph:
// #+begin_src
chronogram state*
// #+end_src
//
// This loses all individual information but summarises the temporal
// average well. To retain the individual level view, we need to use
// indexplots, provided by the SQ package. SQ requires the data in a
// different format (long, with one observation per person--time-unit):
// #+begin_src
reshape long state, i(pid) j(t)
sqset state pid t
sqindexplot, order(pid)
// #+end_src
//
// This plot orders the data by PID, which is pretty much random with
// respect to the sequences. As a result, it is hard to see much structure.
// By default, =sqindexplot= sorts "lexically": by month 1, then by
// month 2, and so on.
// #+begin_src
sqindexplot
// #+end_src
// This is a little better, and in particular we see that there are
// significant numbers of sequences 100% in the same state. However, for
// those that change state, after the first transition there is the same
// chaos as before.
//
// *** Sort by another time-point (optional)
// <<sorttp>>
// You could also sort by another time-point, e.g., the
// date of the first month in work after the birth. To calculate this
// variable we move back to the wide format temporarily.
// #+begin_src
reshape wide
// Flag never-returners with 99, then record the first post-birth
// month (25 onwards) spent in full- or part-time work
gen retdate = 99
forvalues x = 25/72 {
    replace retdate = `x' if inlist(state`x',1,2) & retdate==99
}
reshape long
sqindexplot, order(retdate)
// #+end_src
//
// ** SQ descriptives
//
// We can use one or two other SQ utilities to get more information on
// the sequences:
// #+begin_src
sqtab
sqdes
// #+end_src
//
// =sqtab= tabulates the sequences (showing a truncated representation),
// which makes it apparent (with numbers) that we have lots of simple
// sequences, with 27.45% being non-employed the whole time. =sqdes= shows
// that there are 417 (of 940) unique patterns. For 392 patterns there is
// only a single example, but the 100% non-employed pattern has 258 examples.
//
// ** Survival curves (optional)
//
// Indexplots and chronograms take advantage of the whole time-span but it
// can be useful to focus on single transitions in a more traditional
// framework such as Kaplan-Meier survival curves. Let's look at the return
// to work again, defined as the first month in full- or part-time work
// after the birth (for some it is month 1; for some it never occurs).
//
// (See [[sorttp][above for definition of =retdate=]])
// #+begin_src
reshape wide
gen returned = retdate<99
stset retdate, failure(returned)
sts graph
// #+end_src
//
// ** School-leavers data: independent work
//
// Download the MVAD data. Note that this has six states (school, training,
// further education, higher education, unemployment, employment). Use the
// tools described above to get an overview of it. Note that this dataset
// has a few useful covariates, so try adding the =by= option to the
// chronogram, indexplot and survival graphs (e.g., =by(male)= or
// =by(gcse5eq)=).
//
// ** Calculating OM and Hamming distances with real data
//
// To calculate OM and Hamming distances on the mothers' data or the MVAD
// data, use the same method as with the small data set above. The only
// difference is setting the substitution and indel costs. For the mothers'
// data, the same substitution matrix can be used:
// #+begin_src
matrix subs = (0,1,2,3\ ///
               1,0,1,2\ ///
               2,1,0,1\ ///
               3,2,1,0)
matrix list subs
// #+end_src
// This puts the four states on a single dimension from full-time work to
// non-employment. For the MVAD data, use the cost matrix they
// used in their paper:
// #+begin_src
matrix mvdanes = (0,1,1,2,1,3 \ ///
                  1,0,1,2,1,3 \ ///
                  1,1,0,2,1,2 \ ///
                  2,2,2,0,1,1 \ ///
                  1,1,1,1,0,2 \ ///
                  3,3,2,1,2,0 )
// #+end_src
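//
// As an aside, the distance =oma= computes is a classic
// Needleman-Wunsch-style dynamic program. Below is a minimal Python
// sketch of the idea (illustrative only, not SADI's C implementation;
// the function and variable names are my own):

```python
def om_distance(s, t, sub, indel):
    """Optimal-matching distance between sequences s and t (a sketch).

    sub is a nested dict of substitution costs; indel is the
    insertion/deletion cost. Cell d[i][j] holds the cheapest cost of
    aligning the first i elements of s with the first j elements of t.
    """
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                        # delete s[i-1]
                d[i][j - 1] + indel,                        # insert t[j-1]
                d[i - 1][j - 1] + sub[s[i - 1]][t[j - 1]],  # substitute
            )
    return d[n][m]

# The mothers' four-state matrix from above: cost is the gap between
# states on the full-time ... non-employment dimension
states = "FPun"
sub = {a: {b: abs(states.index(a) - states.index(b)) for b in states}
       for a in states}
print(om_distance("FFn", "FFF", sub, 1.5))  # 3.0: one n -> F substitution
```

// Note that once the indel cost falls below half the largest
// substitution cost, the algorithm will prefer a delete-insert pair to
// an expensive substitution, which is one reason the indel choice matters.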
//
// For either or both data sets, calculate OM (the example below uses an
// indel cost of 2) and Hamming distances. You may need to increase the
// maximum matrix size first (the default is 800 and there are 940
// mothers' sequences):
// #+begin_src
clear
use http://teaching.sociology.ul.ie/oslo/mvad
matrix mvdanes = (0,1,1,2,1,3 \ ///
                  1,0,1,2,1,3 \ ///
                  1,1,0,2,1,2 \ ///
                  2,2,2,0,1,1 \ ///
                  1,1,1,1,0,2 \ ///
                  3,3,2,1,2,0 )
hamming state1-state72, subsmat(mvdanes) pwd(ham)
oma state1-state72, subsmat(mvdanes) indel(2) pwd(oma) len(72)
// #+end_src
//
// Note that the =hamming= command is relatively slow, as it is coded in
// Mata rather than C. The =oma= command has to make rather more
// calculations, so it is coded in C for speed. However, both commands
// make hundreds of thousands of pairwise sequence comparisons (for the
// mothers' data, $441,330 = \frac{940\times 939}{2}$), so neither is
// instantaneous.
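//
// The comparison count is just the handshake formula; a quick Python
// check (the function name is mine):

```python
def n_comparisons(n):
    # Each unordered pair of sequences is compared exactly once
    return n * (n - 1) // 2

print(n_comparisons(940))  # 441330 for the mothers' data
```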
//
// You can compare the results by inspecting parts of
// the distance matrices like this:
// #+begin_src
matlist ham[1..10,1..10]
matlist oma[1..10,1..10]
// #+end_src
//
// Note that in many cases the distances are the same. You can compare
// pairs visually using a string representation of the sequence:
// #+begin_src
stripe state*, gen(seqstr) symb("EFHSTU")
list seqstr if inlist(_n,1,10)
list seqstr if inlist(_n,6,10)
// #+end_src
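//
// The pattern of agreement is easy to see once you note what the Hamming
// distance does: it sums substitution costs position by position, with no
// alignment step. A hypothetical Python sketch (not the SADI code):

```python
def hamming_distance(s, t, sub):
    # Sum the substitution cost at each time point; sequences must be
    # the same length, and no insertions or deletions are allowed
    assert len(s) == len(t)
    return sum(sub[a][b] for a, b in zip(s, t))

# Same illustrative four-state costs as for the mothers' data
states = "FPun"
sub = {a: {b: abs(states.index(a) - states.index(b)) for b in states}
       for a in states}
print(hamming_distance("FFnn", "FFFn", sub))  # 3: one n vs F mismatch
```

// Since the Hamming alignment is one feasible OM alignment, the OM
// distance can never exceed it, and the two coincide exactly when the
// cheapest OM alignment uses no indels.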
//
// ** Clustering
// When you have large distance matrices, simple inspection is not
// effective. We need to reduce the complexity of the data in some manner.
// The usual one is cluster analysis: group similar sequences together
// in order to create a data-driven classification. This is
// readily carried out in Stata. Given distances in the matrix =oma=:
// #+begin_src
clustermat wards oma, add
cluster generate g8=groups(8)
tab g8
sort g8 seqstr
list g8 seqstr
chronogram state*, by(g8)
// #+end_src
//
// We can also create cluster-specific index plots:
// #+begin_src
cluster generate g999 = groups(800), ties(fewer)
reshape long state, i(id) j(t)
sqset state id t
sqindexplot, by(g8) // by cluster, lexical order within
sqindexplot, order(g999) // by dendrogram order
sqindexplot, by(g8) order(g999) // by cluster, sub-cluster order within
// #+end_src
//
// It can be useful to view the dendrogram, but for more than about 40
// sequences it needs to be "trimmed" using the =cutnumber()= or
// =cutvalue()= options:
// #+begin_src
reshape wide
cluster dendrogram, cutnumber(40)
cluster dendrogram, cutvalue(15)
// #+end_src
//
// ** Comparing distance matrices
// We can run the same analysis on OMA distances and on Hamming distances.
// The results will not be very different but will not be the same. First,
// we can compare the distances against each other, for instance by
// correlation:
// #+begin_src
corrsqm ham oma
// #+end_src
// As we see, the correlation is very high, partly because for many pairs
// the distances are the same. But if we cluster the Hamming distance and
// compare it with the OM cluster solution, small differences are amplified
// a little:
// #+begin_src
clustermat wards ham, add
cluster gen h8=groups(8)
permtab g8 h8
// #+end_src
//
// =permtab= permutes the second classification to maximise its match with
// the first, and then tabulates the result. As we see, a substantial
// minority of sequences are clustered differently.
//
//
// ** Other cluster linkages (optional exercise)
//
// Ward's "linkage" (i.e., method for deciding what elements to combine
// into clusters) is often used in cluster analysis, and it has a nice
// interpretation in terms of minimising intra-cluster variance. There are
// a number of other linkages in hierarchical agglomerative cluster
// analysis, as well as non-hierarchical cluster methods to consider. While
// non-hierarchical methods such as k-means and k-medians are not available
// for distance matrices in Stata, it is easy to replace Ward's with other
// methods. The choices available are:
//
// - =singlelinkage= single-linkage cluster analysis
// - =averagelinkage= average-linkage cluster analysis
// - =completelinkage= complete-linkage cluster analysis
// - =waveragelinkage= weighted-average linkage cluster analysis
// - =medianlinkage= median-linkage cluster analysis
// - =centroidlinkage= centroid-linkage cluster analysis
// - =wardslinkage= Ward's linkage cluster analysis
//
// Try =clustermat median oma, add= and so on, using e.g., =permtab= to compare
// solutions.
//
// In my experience, Ward's is the most likely to yield a fairly even
// distribution of cases across clusters.
//
//
// ** String representations and regular expressions
//
// We have seen the utility of representing sequences as strings:
// #+begin_src
permtab g8 h8, newvar(ph8)
list seqstr if g8==1 & ph8==7 // View cases in disagreement
list seqstr if g8==1 & ph8!=7
// #+end_src
//
// "Regular expressions" are ways of defining patterns in strings. Regexes are
// implemented in many languages, including Stata. These can be very useful
// for exploring the structure of sequences.
//
// - =A*B=: =A= zero or more times followed by a =B=
// - =A+B=: at least one =A= followed by a =B=
// - =A?B=: =A= zero or one time followed by a =B=
// - =A.+B=: an =A= followed by at least one character, followed by a =B=
// - =(AB)+C=: one or more =AB=s followed by a =C=
// - =^ABC=: a string starting with =ABC=
// - =ABC$=: a string ending with =ABC=
// - =^A+$=: a string composed entirely of =A= characters
// - =A[BCD]+E=: =A= followed by at least one of =B=, =C= or =D=, followed by =E=
// - =A[^A]+A=: =A= followed by at least one element that is not =A=, followed by =A=
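//
// If you want to experiment with such patterns outside Stata, Python's
// =re= module uses broadly the same syntax (the toy strings below are my
// own):

```python
import re

# Entirely composed of n: matches "nnnn" but not "nnFn"
assert re.search(r"^n+$", "nnnn")
assert not re.search(r"^n+$", "nnFn")

# F immediately followed by u
assert re.search(r"Fu", "FFun")
assert not re.search(r"Fu", "FFnn")

# Fu, then at least one character, then a return to F or P
assert re.search(r"Fu.+[FP]", "FunnP")
assert not re.search(r"Fu.+[FP]", "Fun")

print("all patterns behave as described")
```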
//
// For instance, to ask how many sequences are 100% in non-employment:
// #+begin_src
clear
use http://teaching.sociology.ul.ie/oslo/bsseq
stripe state*, gen(seqstr) symb("FPun")
count if regexm(seqstr,"^n+$")
// #+end_src
// How many sequences do we observe entering unemployment from full-time:
// #+begin_src
count if regexm(seqstr,"Fu")
// #+end_src
// How many sequences do we observe entering unemployment from full-time,
// then later returning to full- or part-time work:
// #+begin_src
count if regexm(seqstr,"Fu.+[FP]")
// #+end_src
//
// What clusters include sequences with 12 months part-time:
// #+begin_src
// Generate clusters (940 sequences, so matsize must exceed the default 800)
set matsize 1000
oma state1-state72, subsmat(subs) indel(2) pwd(oma) len(72)
clustermat wards oma, add
cluster gen g8=groups(8)
// Test patterns
gen pt12 = regexm(seqstr,"PPPPPPPPPPPP")
tab g8 pt12
// #+end_src
// What clusters include sequences that are never in non-employment:
// #+begin_src
gen nevern = regexm(seqstr,"^[FPu]+$")
tab g8 nevern
// #+end_src
// Which clusters include sequences with at least 12 consecutive months of
// full-time, followed by at least 6 of non-employment and then at least 12
// more of full-time:
// #+begin_src
gen fnf = regexm(seqstr,"FFFFFFFFFFFF.*nnnnnn.*FFFFFFFFFFFF")
tab g8 fnf
// #+end_src
//
// With either data set, explore this method to characterise as many of the
// clusters as possible. It is a very good way of assessing the coherence
// of the cluster solution in terms of substantive meaning. With luck and
// perseverance you may end up defining a set of ideal types that
// characterise the bulk of the data.
//
// ** Other ways of characterising clusters: n-spells, cumdur, entropy, turbulence
//
// The =SADI= package includes a number of other utilities that can be
// useful for summarising clusters or individual sequences.
//
// #+begin_src
cumuldur state*, cd(dur) nstates(4)
table g8, c(mean dur1 mean dur2 mean dur3 mean dur4)
nspells state*, gen(nspells)
table g8, c(mean nspells)
entropy state*, cd(rel) nstates(4) gen(entropy)
table g8, c(mean entropy)
// #+end_src
//
// Cumulated duration simply adds the time-units spent in each state,
// creating one summary variable (=dur1= to =dur4= in this case) per
// state. =nspells= counts spells, treating consecutive runs of the same
// state as a single spell (with missing counted as a state, if present).
// The =entropy= command calculates Shannon entropy, based on cumulated
// duration: $-\sum_i p_i\log_2 p_i$, where $p_i$ is the proportion of
// months in state $i$. It takes no account of order, only of cumulative
// duration, so it is clearly inferior to Elzinga's /turbulence/ measure,
// which is available in R but not in Stata.
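//
// A quick Python sketch of the entropy formula (illustrative only; the
// function name and default state alphabet are mine):

```python
from math import log2

def shannon_entropy(seq, states="FPun"):
    # p_i is the share of time-units spent in state i; order within the
    # sequence is ignored, only cumulated duration counts
    props = [seq.count(s) / len(seq) for s in states]
    return sum(-p * log2(p) for p in props if p > 0)

print(shannon_entropy("n" * 72))             # 0.0: no variety at all
print(shannon_entropy("F" * 36 + "n" * 36))  # 1.0: two states used equally
```

// The maximum, with all four states used equally, is $\log_2 4 = 2$ bits.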
//
//