Sequence analysis workshop in Helsinki, 19 May 2010

Slides from my workshop on sequence analysis as part of the "Pathways to Adulthood" Workshop, Helsinki Collegium for Advanced Studies, 19 May 2010:

Software

Sequence analysis is possible in Stata, using either the SQ-ados or my C plugin, and in R, using the TraMineR package. You can install the SQ add-on by issuing the command

ssc install sq

from within Stata. If you access the internet via a proxy server, set the appropriate proxy options first (see "help netio" within Stata for information). You can install my C plugin using the instructions here, and you can install TraMineR by issuing the command

install.packages("TraMineR")

within R.

Here are some examples, using this artificial data set.

Stata

This example uses the SQ add-on to compute optimal matching distances between all pairs of sequences:

input id len s1 s2 s3 s4 s5
1 5 1 2 3 2 3
2 5 2 3 4 1 4
3 5 4 4 4 4 4
4 5 1 2 3 2 3
5 5 1 1 1 1 1
6 4 4 3 2 1 -1
end

#delimit ;
matrix subsmat = (0,1,2,3 \
                  1,0,1,2 \
                  2,1,0,1 \
                  3,2,1,0  );
#delimit cr

reshape long s, i(id) j(m)
sqset s id m
sqdes
sqtab
sqom if s!=-1, subcost(rawdistance) indelcost(2) full standard(longer)
matrix list SQdist

Use the command

help sqdemo

for more information.

This example runs both my plugin (oma) and SQOM (sqom) on the same data, then clusters the sequences using the plugin's distance matrix:

#delimit ;
adopath + [[ path to where you have installed plugin and helper programs ]];

set matsize 800;

matrix subs = (0,1,2,3 \
               1,0,1,2 \
               2,1,0,1 \
               3,2,1,0 );

use anondata;

oma  state1-state72, pwdist(bs)     subsmat(subs) indel(2) length(72);

reshape long state, i(id) j(m);

sqset state id m;

sqdes;
sqtab;

sqom, subcost(subs) indelcost(2) full standard(longer);

reshape wide;
clustermat wards bs, add;
cluster generate o6=groups(6);
tab o6;

R

This example uses TraMineR on the same data. See the TraMineR manual at http://mephisto.unige.ch/pub/TraMineR/Doc/1.4/TraMineR-1.4-Users-Guide.pdf for an explanation:

library(TraMineR)
library(foreign)
bs <- read.dta("anondata.dta")

attach(bs)
bs.lab <- seqstatl(bs[, 2:74])
bs.scode <- c("F", "p", "U", "n")
bs.seq <- seqdef(bs, 2:74, states=bs.scode, labels=bs.lab)



seqiplot(bs.seq, withlegend = TRUE, title = "Index plot: 10 first sequences")
seqfplot(bs.seq, pbarw = TRUE, withlegend = TRUE, space = 0, title = "Sequence frequency plot")
seqdplot(bs.seq, withlegend = TRUE, title = "State distribution plot")

bs.turb <- seqST(bs.seq)
summary(bs.turb)
hist(bs.turb, col = "cyan")

submat  <- rbind(c(0, 1, 2, 3 ),
                 c(1, 0, 1, 2 ),
                 c(2, 1, 0, 1 ),
                 c(3, 2, 1, 0 ))

dist.om <- seqdist(bs.seq, method = "OM", indel = 2, sm = submat)

library(cluster)
clusterward <- agnes(dist.om, diss = TRUE, method = "ward")

cl.6 <- cutree(clusterward, k = 6)

table(cl.6)
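
To make the optimal matching distance itself more concrete, here is a minimal sketch in base R, applied to the six toy sequences from the first Stata example. It assumes the same costs as the examples above (substitution cost equal to the absolute difference between state codes, indel cost of 2). The function om_dist is a hypothetical helper written for illustration, not part of SQ, the plugin or TraMineR. It returns raw, unstandardised distances, so the values will not necessarily match output produced with standard(longer).

```r
# Toy sequences from the first Stata example (sequence 6 has only four states).
seqs <- list(
  c(1, 2, 3, 2, 3),
  c(2, 3, 4, 1, 4),
  c(4, 4, 4, 4, 4),
  c(1, 2, 3, 2, 3),
  c(1, 1, 1, 1, 1),
  c(4, 3, 2, 1)
)

# Substitution cost |i - j| between states 1..4; this reproduces the
# 4 x 4 substitution matrix used in the examples above.
subs <- abs(outer(1:4, 1:4, "-"))
indel <- 2

# Optimal matching distance between two sequences by dynamic programming.
# Hypothetical helper for illustration only.
om_dist <- function(a, b, subs, indel) {
  n <- length(a)
  m <- length(b)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- (0:n) * indel   # delete the first i elements of a
  d[1, ] <- (0:m) * indel   # insert the first j elements of b
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      d[i + 1, j + 1] <- min(
        d[i, j] + subs[a[i], b[j]],   # substitute (cost 0 if states match)
        d[i, j + 1] + indel,          # delete a[i]
        d[i + 1, j] + indel           # insert b[j]
      )
    }
  }
  d[n + 1, m + 1]
}

# Fill the full pairwise distance matrix.
k <- length(seqs)
dist_om <- matrix(0, k, k)
for (i in seq_len(k)) {
  for (j in seq_len(k)) {
    dist_om[i, j] <- om_dist(seqs[[i]], seqs[[j]], subs, indel)
  }
}
print(dist_om)
```

Sequences 1 and 4 are identical, so their distance is 0. Sequences 3 and 5 differ in all five positions, each needing a substitution of cost 3, which gives a raw distance of 15.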

Brendan Halpin
Department of Sociology
University of Limerick