Slides from my workshop on sequence analysis as part of the "Pathways to Adulthood" Workshop, Helsinki Collegium for Advanced Studies, 19 May 2010:
Sequence analysis is possible in Stata using the SQ-ados, in Stata using my C-plugin, and in R using the TraMineR package. You can install the SQ add-on by issuing the command
ssc install sq
from within Stata (if you access the internet via a proxy server, set the appropriate options first; see "help netio" within Stata for information). You can install my C-plugin using the instructions here, and you can install TraMineR in R using the command
install.packages("TraMineR")
within R.
Here are some examples, using this artificial data set.
This example uses the SQ add-on to calculate the distances between all pairs of sequences:
input id len s1 s2 s3 s4 s5
1 5 1 2 3 2 3
2 5 2 3 4 1 4
3 5 4 4 4 4 4
4 5 1 2 3 2 3
5 5 1 1 1 1 1
6 4 4 3 2 1 -1
end
#delimit ;
matrix subsmat = (0,1,2,3 \
1,0,1,2 \
2,1,0,1 \
3,2,1,0 );
#delimit cr
reshape long s, i(id) j(m)
sqset s id m
sqdes
sqtab
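* subcost(rawdistance) uses the absolute difference between state values as the
* substitution cost, which for states 1-4 gives the same costs as subsmat above
sqom if s!=-1, subcost(rawdistance) indelcost(2) full standard(longer)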
matrix list SQdist
Use the command
help sqdemo
for more information.
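To make clearer what the distance calculation involves, here is a minimal sketch of the optimal matching distance in base R, applied to the six sequences entered above. It uses the textbook dynamic-programming recursion and is not the code that SQ, the plugin or TraMineR actually run. The names om_dist, seqs, dist_raw and dist_std are made up for this illustration, and dividing by the length of the longer sequence is an assumption about what standard(longer) does.

```r
# Substitution costs |i - j| for states 1..4: the same values as subsmat above
subsmat <- abs(outer(1:4, 1:4, "-"))
indel <- 2

# Optimal matching distance between two state sequences (vectors of state codes).
# om_dist is a hypothetical helper, written only for this illustration.
om_dist <- function(a, b, sm, indel) {
  n <- length(a)
  m <- length(b)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- (0:n) * indel   # cost of deleting the first i elements of a
  d[1, ] <- (0:m) * indel   # cost of inserting the first j elements of b
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      d[i + 1, j + 1] <- min(
        d[i, j] + sm[a[i], b[j]],  # substitute (cost 0 if the states match)
        d[i, j + 1] + indel,       # delete a[i]
        d[i + 1, j] + indel        # insert b[j]
      )
    }
  }
  d[n + 1, m + 1]
}

# The six sequences from the Stata example; sequence 6 has length 4,
# so the -1 padding value is simply left out
seqs <- list(
  c(1, 2, 3, 2, 3),
  c(2, 3, 4, 1, 4),
  c(4, 4, 4, 4, 4),
  c(1, 2, 3, 2, 3),
  c(1, 1, 1, 1, 1),
  c(4, 3, 2, 1)
)

# Full matrix of pairwise distances
n <- length(seqs)
dist_raw <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dist_raw[i, j] <- om_dist(seqs[[i]], seqs[[j]], subsmat, indel)
  }
}

# Assumed standardisation: divide by the length of the longer sequence of each pair
len <- lengths(seqs)
dist_std <- dist_raw / outer(len, len, pmax)
print(round(dist_std, 2))
```

Sequences 1 and 4 are identical, so their distance is zero. Sequences 3 and 5 differ by a substitution of cost 3 at each of five positions, giving a raw distance of 15.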
This example runs my plugin and SQOM on the same data:
#delimit ;
adopath + [[ path to where you have installed plugin and helper programs ]];
set matsize 800;
matrix subs = (0,1,2,3 \
1,0,1,2 \
2,1,0,1 \
3,2,1,0 );
use anondata;
oma state1-state72, pwdist(bs) subsmat(subs) indel(2) length(72);
reshape long state, i(id) j(m);
sqset state id m;
sqdes;
sqtab;
sqom, subcost(subs) indelcost(2) full standard(longer);
reshape wide;
clustermat wards bs, add;
cluster generate o6=groups(6);
tab o6;
This example uses TraMineR on the same data. See the TraMineR manual at http://mephisto.unige.ch/pub/TraMineR/Doc/1.4/TraMineR-1.4-Users-Guide.pdf for an explanation:
library(TraMineR)
library(foreign)
bs <- read.dta("anondata.dta")
attach(bs)
bs.lab <- seqstatl(bs[, 2:74])
bs.scode <- c("F", "p", "U", "n")
bs.seq <- seqdef(bs, 2:74, states=bs.scode, labels=bs.lab)
seqiplot(bs.seq, withlegend = TRUE, title = "Index plot: 10 first sequences")
seqfplot(bs.seq, pbarw = TRUE, withlegend = TRUE, space = 0, title = "Sequence frequency plot")
seqdplot(bs.seq, withlegend = TRUE, title = "State distribution plot")
bs.turb <- seqST(bs.seq)
summary(bs.turb)
hist(bs.turb, col = "cyan")
submat <- rbind(c(0, 1, 2, 3 ),
c(1, 0, 1, 2 ),
c(2, 1, 0, 1 ),
c(3, 2, 1, 0 ))
dist.om <- seqdist(bs.seq, method = "OM", indel = 2, sm = submat)
library(cluster)
clusterward <- agnes(dist.om, diss = TRUE, method = "ward")
cl.6 <- cutree(clusterward, k = 6)
table(cl.6)
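The clustering step can be tried without the full data set. The sketch below runs the same agnes and cutree calls on a small hand-made distance matrix. The matrix d is hypothetical, chosen so that the two groups are obvious, and the explicit as.dist() and as.hclust() conversions are a cautious choice for this illustration rather than something the example above requires.

```r
library(cluster)  # recommended package, shipped with standard R installations

# Hypothetical 4 x 4 distance matrix: items 1-2 and items 3-4 form two tight pairs
d <- rbind(c( 0,  1, 10, 10),
           c( 1,  0, 10, 10),
           c(10, 10,  0,  1),
           c(10, 10,  1,  0))

# Ward agglomerative clustering on the dissimilarities
ward <- agnes(as.dist(d), diss = TRUE, method = "ward")

# Convert to an hclust tree and cut it into two groups
cl <- cutree(as.hclust(ward), k = 2)
table(cl)
```

With real sequence data the number of groups is a judgement call; six is the choice made in the example above.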
Brendan Halpin