Given that determining substitution costs in sequence analysis is such a bone of contention, many researchers look for a way to let the data generate the costs. The typical way to do this is to pool transition rates and define the substitution cost as:
2 – p(ij) – p(ji)
where p(ij) is the transition rate from state i to state j. Intuitively, states that are closer to each other will have more transitions between them and therefore a lower substitution cost, and vice versa.
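To make the formula concrete, here is a worked example with made-up transition rates (the numbers are purely illustrative and do not come from any dataset):

```latex
\[
p(ij) = 0.30,\quad p(ji) = 0.10
\;\Rightarrow\;
2 - p(ij) - p(ji) = 2 - 0.30 - 0.10 = 1.60
\]
\[
p(ij) = p(ji) = 0
\;\Rightarrow\;
2 - 0 - 0 = 2
\]
```

So a pair of states with no observed transitions gets the maximum cost of 2, and the cost falls as transitions in either direction become more common.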
I don’t recommend this approach in general, for reasons which I will not go into here, but my Stata package for sequence analysis, SADI, includes a utility that calculates these quantities: trans2subs.
This requires the data in long format, so we reshape first (by default the sequences are in wide format, as variables state1 to stateN):
reshape long state, i(id) j(t)
trans2subs state, id(id) distmat(trpr1)
trans2subs state, id(id) distmat(trpr2) diagincl
reshape wide
The transition rates are calculated by default without the diagonal (i.e., ignoring cases where the sequence remains in the same state from t to t+1), but this can be overridden with the diagincl option, as in the second trans2subs call above.
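The effect of the diagonal can be seen in a small Mata sketch. The 3-state count matrix G below is invented for illustration and the variable names are my own; the arithmetic simply applies the formula 2 – p(ij) – p(ji) to row-wise transition rates and is not the trans2subs source.

```stata
mata:
// Hypothetical transition counts: rows are the state at t, columns the state at t+1
G = (80, 15, 5 \ 10, 70, 20 \ 0, 10, 90)

// Default behaviour: drop the diagonal (spells that stay in the same state)
G0 = G - diag(G)
P0 = G0 :/ rowsum(G0)                  // row-wise transition rates
S_nodiag = J(3, 3, 2) - P0 - P0'       // 2 - p(ij) - p(ji)
S_nodiag = S_nodiag - diag(S_nodiag)   // zero cost for substituting a state with itself

// With the diagonal included in the row totals
P1 = G :/ rowsum(G)
S_diag = J(3, 3, 2) - P1 - P1'
S_diag = S_diag - diag(S_diag)

S_nodiag
S_diag
end
```

With these made-up counts, the cost between states 1 and 2 is about 0.917 when the diagonal is dropped and 1.75 when it is included, because the many "stay" spells shrink the off-diagonal rates.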
The command works by cross-tabulating state with its lag, putting the results in a matrix, and letting Mata do some simple calculations on the result. However, the trans2subs command as distributed is fragile and can break down in certain circumstances, for instance where a row or column has values only on the diagonal (i.e., a state that is only exited or is never exited, such as never-married or retired). Thanks to Anna Manzoni for alerting me to this problem.
As a short-term solution, I present an alternative command here, t2s, which is more robust. I will replace trans2subs with this code when I next update the SADI package, but for now you can access it from this link, or by cutting and pasting from here:
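mata:
void transition_driven_subsmat2(string scalar tabmat, real scalar diagincl) {
    // Read stata matrix into mata
    G=st_matrix(tabmat)
    if (rows(G)!=cols(G)) {
        _error("Table isn't square")
    }
    // Unless diagincl is set, zero the diagonal (spells staying in the same state)
    if (diagincl==0) {
        G = G - diag(G)
    }
    // Row-wise transition rates
    Gr=G:/rowsum(G)
    // 2 - p(ij) - p(ji), rounded to six decimal places
    subsmat= trunc(0.5:+(J(rows(G),rows(G),2) - Gr - Gr'):*1000000):/1000000
    // Zero cost for substituting a state with itself
    subsmat = subsmat - diag(subsmat)
    // Write the result back over the Stata matrix
    st_matrix(tabmat,subsmat)
}
end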
capture program drop t2s
program define t2s
    syntax varlist(min=1 max=1) [if] [in], IDvar(varname) SUBSmat(string) [DIAgincl]
    * Turn the optional diagincl flag into a 0/1 indicator for Mata
    if ("`diagincl'"=="") {
        local diagincl 0
    }
    else {
        local diagincl 1
    }
    marksample touse
    local colvar `varlist'
    tempvar rowvar
    * Lagged state within each sequence; data must be sorted by the ID variable and time
    by `idvar': gen `rowvar'=`colvar'[_n-1] if _n>1
    di "Generating transition-driven substitution matrix"
    * Cross-tabulate lagged state against current state and store the counts
    qui tab `rowvar' `colvar' if `touse', matcell(`subsmat')
    mata: transition_driven_subsmat2("`subsmat'",`diagincl')
end
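Usage mirrors the trans2subs calls earlier, with the option names taken from the syntax line of t2s. The sketch below assumes t2s and its Mata function have already been defined as above; the four toy sequences and the matrix names trpr1 and trpr2 are invented for illustration.

```stata
* Hypothetical toy data: four sequences of length four in wide format
clear
input id state1 state2 state3 state4
1 1 1 2 3
2 1 2 2 3
3 2 3 3 1
4 3 1 1 2
end

* t2s needs long format, sorted by id and time (reshape long leaves it that way)
reshape long state, i(id) j(t)

* Default: transition rates computed without the diagonal
t2s state, idvar(id) subsmat(trpr1)

* Same calculation with the diagonal included
t2s state, idvar(id) subsmat(trpr2) diagincl

matrix list trpr1
matrix list trpr2
reshape wide
```

Every state in the toy data appears both as an origin and as a destination, so the cross-tabulation is square; a state missing from either margin would trigger the "Table isn't square" error.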