A new command
In this blog I introduce a new utility command, sdmksubs, which creates substitution cost matrices for sequence analysis, as part of the SADI Stata add-ons.
Most sequence-analysis distance commands require the specification of substitution costs, which describe the pattern of differences in the state space through which the sequences move. These can be derived from theory, from external data, or can be imposed by researcher fiat. It is also common to derive them from the pattern of transitions in the sequence data, though this is not an unproblematically good idea. The existing trans2subs command in SADI calculates simple transition-based substitution costs. The new sdmksubs calculates this substitution cost structure and a range of others: some simple "theoretical" ones, and some based on the transition pattern but taking more of the data into account than the traditional trans2subs matrix.
SADI is a Stata package for sequence analysis of data such as lifecourse histories, and has been around for quite a while. Recent improvements include fixes for internal changes in Stata 18, lifting limits on sequence length, etc., but here I focus on sdmksubs only.
Simple matrices
The simplest possible parameterisation of substitution costs is to treat each state as equally different from all others. This yields a “flat” substitution cost matrix:
clear all
use mvad, clear
egen maxcheck = rowmax(state*)
su maxcheck
loc nstates = r(max)
sdmksubs, dtype(flat) subsmat(flat) nstates(6)
matlist flat, format(%6.0f)
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    maxcheck |        712    4.724719      1.57425         1          6

Scaling to 0--1 range

             |     c1     c2     c3     c4     c5     c6
-------------+------------------------------------------------
          r1 |      0
          r2 |      1      0
          r3 |      1      1      0
          r4 |      1      1      1      0
          r5 |      1      1      1      1      0
          r6 |      1      1      1      1      1      0
(Explanation: load the example MVAD data that comes with SADI; check the max of the state1-state72 state variables; generate the "flat" and "linear" matrices using this maximum (here 6) in the nstates() option.)
For N states, a flat matrix implies a state-space with N-1 dimensions. Another very simple scheme is to treat the numbers attached to the states as their location on a single dimension. Then the difference or distance is the absolute difference between the state values. This is the “linear” scheme, here scaled so the largest distance is 1.0:
sdmksubs, dtype(linear) subsmat(linear) nstates(6)
matlist linear, format(%6.3f)
Scaling to 0--1 range

             |     c1     c2     c3     c4     c5     c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.200  0.000
          r3 |  0.400  0.200  0.000
          r4 |  0.600  0.400  0.200  0.000
          r5 |  0.800  0.600  0.400  0.200  0.000
          r6 |  1.000  0.800  0.600  0.400  0.200  0.000
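Both of these simple matrices are easy to construct by hand, which makes a useful sanity check on the sdmksubs output. A minimal sketch (the matrix names flatcheck and lincheck are mine, not part of SADI):

matrix flatcheck = J(6,6,1) - I(6)     // flat: 1 everywhere off the diagonal
matrix lincheck = J(6,6,0)
forvalues i = 1/6 {
    forvalues j = 1/6 {
        // linear: |i-j|, rescaled by the largest distance (6-1 = 5)
        matrix lincheck[`i',`j'] = abs(`i'-`j')/5
    }
}
matlist flatcheck, format(%6.0f)
matlist lincheck, format(%6.3f)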
Data-based matrices
We can also generate substitution costs from the sequence data, from the pattern of transitions. This is much more popular than it should be, partly because analysts don't want to add more assumptions (in the form of a theoretically-driven substitution cost scheme) and prefer to let the data talk, and partly because transition tables and substitution cost matrices have a very similar structure (but transitions and substitutions are ENTIRELY different things!). Nonetheless it is reasonably intuitive that states that are more similar to each other will have more transitions between them. (Whether we can recover the implicit spatial relationships from the transition rates is a research exercise that I am currently engaged in, and will write up soon!)
A number of different ways of using the transition data are available. These are based on the transitions in the data, which are tabulated. The data needs to be in long format for this, hence the reshape long command in the example below. A corresponding reshape wide command needs to be issued before running oma, sdhamming, twed, etc.
Traditional transitions-based substitution matrix
The traditional transitions-based substitution matrix (seen from TDA onwards) calculates the substitution cost between states i and j as \(2 - p_{ij} - p_{ji}\), where \(p_{ij}\) is the outflow proportion from i to j. This has a theoretical range of 0 to 2. However, it lacks consistency: in general the diagonal values (i.e., \(2 - 2p_{ii}\)) will be greater than zero, whereas the substitution cost for like with like must be zero. Even if we impose zeros on the diagonal, it is possible for the resulting values to violate the triangle inequality, i.e., to be non-metric (I will demonstrate this in a forthcoming paper using brute-force simulation).
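For concreteness, here is the formula applied by hand to a toy three-state transition count matrix (the counts are invented for illustration, not taken from any real data):

* Toy counts: rows are origin states, columns are destination states
matrix N = (50, 5, 1 \ 4, 60, 6 \ 2, 8, 40)
matrix S = J(3,3,0)
forvalues i = 1/3 {
    local ri = N[`i',1] + N[`i',2] + N[`i',3]     // row total n_i+
    forvalues j = 1/3 {
        if `i' != `j' {
            local rj = N[`j',1] + N[`j',2] + N[`j',3]
            * 2 - p_ij - p_ji, with p_ij = n_ij/n_i+
            matrix S[`i',`j'] = 2 - N[`i',`j']/`ri' - N[`j',`i']/`rj'
        }
    }
}
matlist S, format(%6.3f)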
The following is the result from calculating this on the MVAD dataset that is included in SADI, scaled to have a maximum of 1.0. As is evident, most values are very close to 1 as transitions are quite rare in this data, so the \(p_{ij}\) values are small off the diagonal.
reshape long state, i(id) j(t)
sdmksubs, state(state) dtype(t2s) subsmat(t2s) id(id)
(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations              712   ->   51,264
Number of variables                  87   ->   17
j variable (72 values)                    ->   t
xij variables:
                  state1 state2 ... state72   ->   state
-----------------------------------------------------------------------------

Generating substitution matrix type t2s
Scaling to 0--1 range
matlist t2s, format(%6.3f)
             |     c1     c2     c3     c4     c5     c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.984  0.000
          r3 |  0.994  0.997  0.000
          r4 |  0.993  0.994  0.992  0.000
          r5 |  0.980  0.996  1.000  0.998  0.000
          r6 |  0.976  0.982  0.998  0.993  0.986  0.000
One common strategy to deal with this is to suppress the diagonal when calculating the outflow proportions (i.e., \(p_{ij} = n_{ij}/(n_{i+} - n_{ii})\)). This makes for greater differences among the states.
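To see why suppressing the diagonal spreads the states out, consider one cell of the toy table above: self-transitions dominate the row total, so removing them inflates the off-diagonal proportions. A quick sketch (again with invented counts):

matrix N = (50, 5, 1 \ 4, 60, 6 \ 2, 8, 40)
display "p_12 with diagonal:    " N[1,2]/(N[1,1] + N[1,2] + N[1,3])
display "p_12 without diagonal: " N[1,2]/(N[1,2] + N[1,3])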
sdmksubs, state(state) dtype(t2s) subsmat(t2sn) dropdiag id(id)
matlist t2sn, format(%6.3f)
Generating substitution matrix type t2s
Scaling to 0--1 range

             |     c1     c2     c3     c4     c5     c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.585  0.000
          r3 |  0.543  0.944  0.000
          r4 |  0.839  0.897  0.853  0.000
          r5 |  0.604  0.941  1.000  0.970  0.000
          r6 |  0.616  0.778  0.934  0.920  0.821  0.000
Another strategy is to increase the lag in the transition table, i.e., to tabulate transitions from t-lag to t instead of from t-1 to t. Here, with a lag of 12 months rather than one:
sdmksubs, state(state) dtype(t2s) subsmat(t2s12) lag(12) id(id)
matlist t2s12, format(%6.3f)
Generating substitution matrix type t2s
Scaling to 0--1 range

             |     c1     c2     c3     c4     c5     c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.861  0.000
          r3 |  0.934  0.951  0.000
          r4 |  0.943  0.945  0.893  0.000
          r5 |  0.799  0.959  1.000  0.985  0.000
          r6 |  0.847  0.946  0.988  0.982  0.917  0.000
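The underlying table is just a cross-tabulation of the state twelve periods earlier against the current state. A sketch of that tabulation in the long-format data (state_lag12 is a scratch variable of mine; I am not claiming sdmksubs builds the table exactly this way internally):

sort id t
by id: gen state_lag12 = state[_n-12]     // state 12 months earlier; missing for t<=12
tab state_lag12 state
drop state_lag12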
Other metrics
This measure uses only two of the cells in the transition table. We can formulate a number of others that use more of the table. For instance, we can compare i and j in terms of their outflow patterns to all destinations, not just to each other. This fits the intuition that two states are more similar the more similar their whole transition patterns are. We can define this distance as: \[ D^{pct}_{ij} = \sqrt{\sum_k(p_{ik}-p_{jk})^2} \]
That is, as an N-dimensional Euclidean combination of the differences between the outflows from i to k and from j to k. This is clearly a metric distance, with \(D_{ii} = 0\).
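A minimal sketch of the calculation for one pair of states, given a toy matrix of outflow proportions (rows sum to 1; the numbers are invented):

matrix P = (0.90, 0.08, 0.02 \ 0.06, 0.88, 0.06 \ 0.03, 0.10, 0.87)
local d = 0
forvalues k = 1/3 {
    local d = `d' + (P[1,`k'] - P[2,`k'])^2     // (p_1k - p_2k)^2
}
display "D_pct(1,2) = " sqrt(`d')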
Another approach would be to use residuals between observed and expected values, instead of percentages. Given the conventional definition of the expected value under independence: \[ E_{ij} = \frac{n_{i+}n_{+j}}{n_{++}} = \frac{R_i C_j}{T} \] we can define a relative residual as \(RR = \frac{O-E}{E}\), or a Pearson residual as \(PR = \frac{O-E}{\sqrt{E}}\).
For each of these we can calculate a Euclidean sum of the \(RR_{ik} - RR_{jk}\) or \(PR_{ik} - PR_{jk}\) differences: \[ D^{RR}_{ij} = \sqrt{\sum_k(RR_{ik}-RR_{jk})^2} \] \[ D^{PR}_{ij} = \sqrt{\sum_k(PR_{ik}-PR_{jk})^2} \]
The intuition behind these is that the pattern of transitions is well captured by the residuals, either relative (the residual divided by the expected value) or Pearson (divided by the root of the expected value; squaring this quantity and summing it across the table gives the Pearson chi-square). Residuals incorporate information on both row and column totals, unlike percentages, which have only the row total as denominator (or the column total, for column percentages).
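A sketch of both residuals for a single cell of the toy count matrix used above (invented numbers; O is the observed count, E the expected count under independence):

matrix N = (50, 5, 1 \ 4, 60, 6 \ 2, 8, 40)
local T = 176                                 // grand total n_++
local R1 = N[1,1] + N[1,2] + N[1,3]           // row total n_1+
local C2 = N[1,2] + N[2,2] + N[3,2]           // column total n_+2
local E = `R1'*`C2'/`T'
display "RR(1,2) = " (N[1,2] - `E')/`E'
display "PR(1,2) = " (N[1,2] - `E')/sqrt(`E')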
Finally, there is a measure described explicitly in the literature as a chi-square measure: \[ D^{chisq}_{ij} = \sqrt{\sum_k\frac{(p_{ik}-p_{jk})^2}{n_{+k}}} \] In effect, this is the same as the \(D^{pct}\) measure, but with each \((p_{ik}-p_{jk})^2\) difference divided by the column total for k. Therefore this also incorporates row and column total information.
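And a corresponding sketch for the chi-square distance, reusing the toy outflow proportions with hypothetical column totals:

matrix P = (0.90, 0.08, 0.02 \ 0.06, 0.88, 0.06 \ 0.03, 0.10, 0.87)
matrix ctot = (56 \ 73 \ 47)                  // hypothetical column totals n_+k
local d = 0
forvalues k = 1/3 {
    local d = `d' + (P[1,`k'] - P[2,`k'])^2/ctot[`k',1]
}
display "D_chisq(1,2) = " sqrt(`d')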
sdmksubs, state(state) dtype(pct) subsmat(dpct) id(id)
matlist dpct, format(%6.3f)
sdmksubs, state(state) dtype(relres) subsmat(drelres) id(id)
matlist drelres, format(%6.3f)
sdmksubs, state(state) dtype(pearson) subsmat(dpearson) id(id)
matlist dpearson, format(%6.3f)
sdmksubs, state(state) dtype(chisq) subsmat(dchisq) id(id)
matlist dchisq, format(%6.3f)
Generating substitution matrix type pct
Scaling to 0--1 range

             |     c1     c2     c3     c4     c5     c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.972  0.000
          r3 |  1.000  0.988  0.000
          r4 |  0.979  0.965  0.981  0.000
          r5 |  0.964  0.966  0.988  0.965  0.000
          r6 |  0.939  0.931  0.966  0.940  0.931  0.000

Generating substitution matrix type relres
Scaling to 0--1 range

             |     c1     c2     c3     c4     c5     c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.397  0.000
          r3 |  0.551  0.650  0.000
          r4 |  0.741  0.818  0.899  0.000
          r5 |  0.608  0.700  0.799  0.939  0.000
          r6 |  0.703  0.779  0.875  1.000  0.901  0.000

Generating substitution matrix type pearson
Scaling to 0--1 range

             |     c1     c2     c3     c4     c5     c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.969  0.000
          r3 |  0.977  0.996  0.000
          r4 |  0.960  0.986  0.996  0.000
          r5 |  0.954  0.984  1.000  0.990  0.000
          r6 |  0.921  0.948  0.977  0.964  0.953  0.000

Generating substitution matrix type chisq
Scaling to 0--1 range

             |     c1     c2     c3     c4     c5     c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.609  0.000
          r3 |  0.712  0.822  0.000
          r4 |  0.796  0.894  0.962  0.000
          r5 |  0.726  0.837  0.915  0.980  0.000
          r6 |  0.757  0.856  0.941  1.000  0.942  0.000
Correlations between the measures
How do the measures compare? We can use the corrsqm command to look at the correlation of the distances (using one half of each matrix only, since they are symmetrical).
If we retain the diagonal, we get higher correlations, because each state's distance to itself is zero in every scheme:
matrix corr1 = J(9,9,1)
local measures = "flat linear t2s t2sn t2s12 dpct drelres dpearson dchisq"
forvalues i = 1/9 {
    local row : word `i' of `measures'
    forvalues j = `=`i'+1'/9 {
        local col : word `j' of `measures'
        qui corrsqm `row' `col'
        matrix corr1[`i',`j'] = r(rho)
        matrix corr1[`j',`i'] = r(rho)
    }
}
set linesize 150
matrix rownames corr1 = `measures'
matrix colnames corr1 = `measures'
matlist corr1, nohalf format(%4.3f)
             |   flat  lin~r    t2s   t2sn  t2s12   dpct  dre~s  dpe~n  dch~q
-------------+---------------------------------------------------------------
        flat |  1.000  0.707  1.000  0.946  0.994  0.999  0.933  0.999  0.972
      linear |  0.707  1.000  0.702  0.593  0.677  0.696  0.627  0.687  0.643
         t2s |  1.000  0.702  1.000  0.949  0.995  0.999  0.935  1.000  0.974
        t2sn |  0.946  0.593  0.949  1.000  0.968  0.945  0.964  0.953  0.981
       t2s12 |  0.994  0.677  0.995  0.968  1.000  0.994  0.949  0.995  0.983
        dpct |  0.999  0.696  0.999  0.945  0.994  1.000  0.927  1.000  0.969
     drelres |  0.933  0.627  0.935  0.964  0.949  0.927  1.000  0.934  0.989
    dpearson |  0.999  0.687  1.000  0.953  0.995  1.000  0.934  1.000  0.975
      dchisq |  0.972  0.643  0.974  0.981  0.983  0.969  0.989  0.975  1.000
What is notable here is that the linear matrix differs strongly from most of the others: the linear formulation is not appropriate for this data (it requires at least an ordinal state space). Also notable: the correlation between the flat and t2s matrices is almost perfect, and the dpearson matrix is also very close to the flat one.
However, the correlations are strongly affected by the inclusion of the diagonal, whose entries are all zero. If we drop the diagonal we look only at differences between distinct states. The flat matrix now has no variation, so its correlations cannot be calculated, and the remaining correlations are much farther from 1.0:
matrix corr2 = J(8,8,1)
local measures = "linear t2s t2sn t2s12 dpct drelres dpearson dchisq"
forvalues i = 1/8 {
    local row : word `i' of `measures'
    forvalues j = `=`i'+1'/8 {
        local col : word `j' of `measures'
        qui corrsqm `row' `col'
        matrix corr2[`i',`j'] = r(rho)
        matrix corr2[`j',`i'] = r(rho)
    }
}
set linesize 150
matrix rownames corr2 = `measures'
matrix colnames corr2 = `measures'
matlist corr2, nohalf format(%4.3f)
             |  lin~r    t2s   t2sn  t2s12   dpct  dre~s  dpe~n  dch~q
-------------+--------------------------------------------------------
      linear |  1.000  0.702  0.593  0.677  0.696  0.627  0.687  0.643
         t2s |  0.702  1.000  0.949  0.995  0.999  0.935  1.000  0.974
        t2sn |  0.593  0.949  1.000  0.968  0.945  0.964  0.953  0.981
       t2s12 |  0.677  0.995  0.968  1.000  0.994  0.949  0.995  0.983
        dpct |  0.696  0.999  0.945  0.994  1.000  0.927  1.000  0.969
     drelres |  0.627  0.935  0.964  0.949  0.927  1.000  0.934  0.989
    dpearson |  0.687  1.000  0.953  0.995  1.000  0.934  1.000  0.975
      dchisq |  0.643  0.974  0.981  0.983  0.969  0.989  0.975  1.000
Again the linear distance is the outlier, but there is some disagreement between the measures: relres and pct, for instance, disagree quite strongly about which states are similar and which are different (despite agreeing almost perfectly when the diagonal is included).
Correlations between sequence distances
What do the results look like if we do OMA with the different substitution matrices? Note that while long format is needed to calculate the data-based substitution costs, oma and related commands require the data in wide format.
reshape wide
local measures = "flat linear t2s t2sn t2s12 dpct drelres dpearson dchisq"
foreach measure of local measures {
    oma state*, len(72) subs(`measure') indel(0.5) pwd(pwd`measure')
}
(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations           51,264   ->   712
Number of variables                  17   ->   87
j variable (72 values)                t   ->   (dropped)
xij variables:
                                  state   ->   state1 state2 ... state72
-----------------------------------------------------------------------------

Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations
matrix corr3 = J(9,9,1)
local measures = "flat linear t2s t2sn t2s12 dpct drelres dpearson dchisq"
forvalues i = 1/9 {
    local row : word `i' of `measures'
    forvalues j = `=`i'+1'/9 {
        local col : word `j' of `measures'
        qui corrsqm pwd`row' pwd`col'
        matrix corr3[`i',`j'] = r(rho)
        matrix corr3[`j',`i'] = r(rho)
    }
}
set linesize 150
matrix rownames corr3 = `measures'
matrix colnames corr3 = `measures'
matlist corr3, nohalf format(%3.2f)
             |  flat  line   t2s  t2sn  t2s1  dpct  drel  dpea  dchi
-------------+------------------------------------------------------
        flat |  1.00  0.72  1.00  0.95  0.99  1.00  0.93  1.00  0.97
      linear |  0.72  1.00  0.71  0.69  0.69  0.70  0.77  0.70  0.74
         t2s |  1.00  0.71  1.00  0.95  1.00  1.00  0.93  1.00  0.98
        t2sn |  0.95  0.69  0.95  1.00  0.96  0.94  0.98  0.95  0.99
       t2s12 |  0.99  0.69  1.00  0.96  1.00  1.00  0.95  1.00  0.98
        dpct |  1.00  0.70  1.00  0.94  1.00  1.00  0.93  1.00  0.97
     drelres |  0.93  0.77  0.93  0.98  0.95  0.93  1.00  0.93  0.99
    dpearson |  1.00  0.70  1.00  0.95  1.00  1.00  0.93  1.00  0.98
      dchisq |  0.97  0.74  0.98  0.99  0.98  0.97  0.99  0.98  1.00
The distances between sequences show largely the same pattern as the distances between the states.
Agreement between cluster solutions
From experience, I know that relatively small differences between distances can yield quite different cluster solutions. Let's check, by running a cluster analysis and creating an 8-cluster solution from each pairwise distance matrix:
local measures = "flat linear t2s t2sn t2s12 dpct drelres dpearson dchisq"
foreach measure of local measures {
    clustermat wards pwd`measure', add name(`measure')
    cluster gen g`measure' = groups(8), name(`measure')
}
matrix corr4 = J(9,9,1)
local measures = "flat linear t2s t2sn t2s12 dpct drelres dpearson dchisq"
forvalues i = 1/9 {
    local row : word `i' of `measures'
    forvalues j = `=`i'+1'/9 {
        local col : word `j' of `measures'
        qui ari g`row' g`col'
        matrix corr4[`i',`j'] = r(ari)
        matrix corr4[`j',`i'] = r(ari)
    }
}
set linesize 150
matrix rownames corr4 = `measures'
matrix colnames corr4 = `measures'
matlist corr4, nohalf format(%3.2f)
             |  flat  line   t2s  t2sn  t2s1  dpct  drel  dpea  dchi
-------------+------------------------------------------------------
        flat |  1.00  0.56  0.57  0.63  0.52  0.57  0.62  0.57  0.60
      linear |  0.56  1.00  0.45  0.57  0.48  0.45  0.59  0.45  0.55
         t2s |  0.57  0.45  1.00  0.54  0.79  1.00  0.55  1.00  0.68
        t2sn |  0.63  0.57  0.54  1.00  0.60  0.54  0.81  0.54  0.74
       t2s12 |  0.52  0.48  0.79  0.60  1.00  0.79  0.65  0.79  0.78
        dpct |  0.57  0.45  1.00  0.54  0.79  1.00  0.55  1.00  0.68
     drelres |  0.62  0.59  0.55  0.81  0.65  0.55  1.00  0.55  0.75
    dpearson |  0.57  0.45  1.00  0.54  0.79  1.00  0.55  1.00  0.68
      dchisq |  0.60  0.55  0.68  0.74  0.78  0.68  0.75  0.68  1.00

Agreement at the cluster level is indeed much weaker: apart from t2s, dpct and dpearson, which yield identical solutions, the adjusted Rand indices are mostly well below 0.8.
Row and column focus
By default, all these measures are row-oriented, and use values as illustrated in this table (the T2S measure uses only x42 and x24 here):
|   | 1   | 2   | 3   | 4   | 5   |
|---|-----|-----|-----|-----|-----|
| 1 |     |     |     |     |     |
| 2 | x21 | x22 | x23 | x24 | x25 |
| 3 |     |     |     |     |     |
| 4 | x41 | x42 | x43 | x44 | x45 |
| 5 |     |     |     |     |     |
There is no particular reason not to use column data as in this table (for T2S, x42 and x24 are now calculated as inflow proportions):
|   | 1   | 2   | 3   | 4   | 5   |
|---|-----|-----|-----|-----|-----|
| 1 |     | x12 |     | x14 |     |
| 2 |     | x22 |     | x24 |     |
| 3 |     | x32 |     | x34 |     |
| 4 |     | x42 |     | x44 |     |
| 5 |     | x52 |     | x54 |     |
The rcb() option allows you to use either row data or column data, and offers a "both" option to give the Euclidean sum of the two.
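Spelled out, my reading of that "Euclidean sum" (an assumption about the exact form, before the rescaling to the 0--1 range) is: \[ D^{both}_{ij} = \sqrt{\left(D^{row}_{ij}\right)^2 + \left(D^{col}_{ij}\right)^2} \]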
reshape long
sdmksubs, state(state) dtype(pct) subsmat(dpctr) id(id) rcb(r)
sdmksubs, state(state) dtype(pct) subsmat(dpctc) id(id) rcb(c)
sdmksubs, state(state) dtype(pct) subsmat(dpctb) id(id) rcb(b)
corrsqm dpctr dpctc
corrsqm dpctr dpctb
corrsqm dpctc dpctb
(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations              712   ->   51,264
Number of variables                 123   ->   53
j variable (72 values)                    ->   t
xij variables:
                  state1 state2 ... state72   ->   state
-----------------------------------------------------------------------------

Generating substitution matrix type pct
Scaling to 0--1 range
Generating substitution matrix type pct
Scaling to 0--1 range
Generating substitution matrix type pct
Scaling to 0--1 range
VECH correlation between dpctr and dpctc: 0.9997
VECH correlation between dpctr and dpctb: 0.9999
VECH correlation between dpctc and dpctb: 0.9999
Installation
sdmksubs is part of the latest version of SADI. For now this is available only from http://teaching.sociology.ul.ie/sadi. In due course, a stable update of SADI will be uploaded to SSC.
To install, do the following from within Stata:
net from http://teaching.sociology.ul.ie/sadi
net install sadi
net get sadi