# SDMKSUBS: A new SADI command for substitution costs

## A new command

In this blog I introduce a new utility command, sdmksubs, which creates substitution cost matrices for sequence analysis, as part of the SADI Stata add-ons.

Most sequence-analysis distance commands require the specification of substitution costs, which describe the pattern of differences in the state space through which the sequences move. These can be derived from theory, from external data, or imposed by researcher fiat. It is also common to derive them from the pattern of transitions in the sequence data, though this is not an unproblematically good idea. The existing trans2subs command in SADI calculates simple transition-based substitution costs. The new sdmksubs calculates this substitution cost structure and a range of others: some simple "theoretical" schemes, and some based on the transition pattern but taking more of the data into account than the traditional trans2subs matrix.

SADI is a Stata package for sequence analysis of data such as lifecourse histories, and has been around for quite a while. Recent improvements include fixes for internal changes in Stata 18, lifting limits on sequence length, and so on, but here I focus on sdmksubs only.

## Simple matrices

The simplest possible parameterisation of substitution costs is to treat each state as equally different from all others. This yields a “flat” substitution cost matrix:

clear all
use mvad
egen maxcheck = rowmax(state*)
su maxcheck
local nstates = r(max)
sdmksubs, dtype(flat) subsmat(flat) nstates(`nstates')
matlist flat, format(%6.0f)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    maxcheck |        712    4.724719     1.57425          1          6
Scaling to 0--1 range
             |     c1      c2      c3      c4      c5      c6
-------------+------------------------------------------------
          r1 |      0
          r2 |      1       0
          r3 |      1       1       0
          r4 |      1       1       1       0
          r5 |      1       1       1       1       0
          r6 |      1       1       1       1       1       0


(Explanation: load the example MVAD data that comes with SADI; check the max of the state1-state72 state variables; generate the “flat” and “linear” matrices using this maximum in the nstates() option.)

For N states, a flat matrix implies a state-space with N-1 dimensions. Another very simple scheme is to treat the numbers attached to the states as their location on a single dimension. Then the difference or distance is the absolute difference between the state values. This is the “linear” scheme, here scaled so the largest distance is 1.0:

sdmksubs, dtype(linear) subsmat(linear) nstates(`nstates')
matlist linear, format(%6.3f)

Scaling to 0--1 range
             |     c1      c2      c3      c4      c5      c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.200   0.000
          r3 |  0.400   0.200   0.000
          r4 |  0.600   0.400   0.200   0.000
          r5 |  0.800   0.600   0.400   0.200   0.000
          r6 |  1.000   0.800   0.600   0.400   0.200   0.000
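The arithmetic behind both schemes is easily spelled out. The following Python sketch is purely illustrative (sdmksubs itself is implemented in Stata/Mata); the function names are mine:

```python
# Illustrative sketch, not SADI code: the "flat" and "linear"
# substitution-cost matrices for n states, scaled to the 0-1 range.

def flat_matrix(n):
    """Every pair of distinct states is equally different (cost 1)."""
    return [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]

def linear_matrix(n):
    """Cost is the absolute difference of state numbers, scaled so
    the largest distance (between the first and last state) is 1.0."""
    return [[abs(i - j) / (n - 1) for j in range(n)] for i in range(n)]
```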


## Data based matrices

We can also generate substitution costs from the sequence data, from the pattern of transitions. This is much more popular than it should be, partly because analysts don't want to add more assumptions (in the form of a theoretically driven substitution cost scheme) and prefer to let the data talk, and partly because transition tables and substitution cost matrices have a very similar structure (but transitions and substitutions are ENTIRELY different things!). Nonetheless, it is reasonably intuitive that states that are more similar to each other will have more transitions between them. (Whether we can recover the implicit spatial relationships from the transition rates is a research exercise that I am currently engaged in, and will write up soon!)

A number of different ways of using the transition data are available. These are based on the transitions in the data, which are tabulated. The data needs to be in long format for this, hence the reshape long command in the example below. A corresponding reshape wide command needs to be issued before running oma, sdhamming or twed, etc.

The traditional transition-based substitution matrix (seen from TDA onwards) calculates the substitution cost between states i and j as $$2 - p_{ij} - p_{ji}$$ where $$p_{ij}$$ is the outflow percentage from i to j. This has a theoretical range of 0 to 2. However, it lacks consistency: in general the diagonal values will be greater than zero (i.e., $$2 - 2p_{ii}$$), whereas the substitution cost for like with like must be zero. Even if we impose zeros on the diagonal, it is possible for the resulting values to violate the triangle inequality, i.e., to be non-metric (I will demonstrate this in a forthcoming paper using brute-force simulation).
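The formula is compact enough to state in code. This Python sketch is illustrative only, not the SADI implementation; `counts` is a hypothetical matrix of transition counts, zeros are imposed on the diagonal, and the `dropdiag` flag (computing outflow proportions with the diagonal excluded from row totals) is my own naming:

```python
def t2s_matrix(counts, dropdiag=False):
    """Traditional transition-based costs 2 - p_ij - p_ji, where p_ij
    is the outflow proportion from state i to state j and counts[i][j]
    holds the number of observed i -> j transitions.  Zeros are imposed
    on the diagonal; dropdiag excludes n_ii from the row totals."""
    n = len(counts)
    p = []
    for i in range(n):
        total = sum(counts[i]) - (counts[i][i] if dropdiag else 0)
        p.append([counts[i][j] / total if total else 0.0 for j in range(n)])
    return [[0.0 if i == j else 2.0 - p[i][j] - p[j][i]
             for j in range(n)] for i in range(n)]
```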

The following is the result from calculating this on the MVAD dataset that is included in SADI, scaled to have a maximum of 1.0. As is evident, most values are very close to 1 as transitions are quite rare in this data, so the $$p_{ij}$$ values are small off the diagonal.

reshape long state, i(id) j(t)
sdmksubs, state(state) dtype(t2s) subsmat(t2s) id(id)

(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72)
Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations              712   ->   51,264
Number of variables                  87   ->   17
j variable (72 values)                    ->   t
xij variables:
state1 state2 ... state72   ->   state
-----------------------------------------------------------------------------
Generating substitution matrix type t2s
Scaling to 0--1 range

matlist t2s, format(%6.3f)

             |     c1      c2      c3      c4      c5      c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.984   0.000
          r3 |  0.994   0.997   0.000
          r4 |  0.993   0.994   0.992   0.000
          r5 |  0.980   0.996   1.000   0.998   0.000
          r6 |  0.976   0.982   0.998   0.993   0.986   0.000


One common strategy to deal with this is to suppress the diagonal when calculating the outflow percentages (i.e., $$p_{ij} = n_{ij}/(n_{i+} - n_{ii})$$). This makes for greater differences among the states.

sdmksubs, state(state) dtype(t2s) subsmat(t2sn) dropdiag id(id)
matlist t2sn, format(%6.3f)

Generating substitution matrix type t2s
Scaling to 0--1 range
             |     c1      c2      c3      c4      c5      c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.585   0.000
          r3 |  0.543   0.944   0.000
          r4 |  0.839   0.897   0.853   0.000
          r5 |  0.604   0.941   1.000   0.970   0.000
          r6 |  0.616   0.778   0.934   0.920   0.821   0.000


Another strategy is to increase the lag in the transition table, i.e., t-lag to t, instead of t-1 to t. Here, for a lag of 12 months rather than one:

sdmksubs, state(state) dtype(t2s) subsmat(t2s12) lag(12) id(id)
matlist t2s12, format(%6.3f)

Generating substitution matrix type t2s
Scaling to 0--1 range
             |     c1      c2      c3      c4      c5      c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.861   0.000
          r3 |  0.934   0.951   0.000
          r4 |  0.943   0.945   0.893   0.000
          r5 |  0.799   0.959   1.000   0.985   0.000
          r6 |  0.847   0.946   0.988   0.982   0.917   0.000


### Other metrics

This measure uses only two of the cells in the transition table. We can formulate a number of others that use more of the table. For instance, we can compare i and j in terms of their outflow patterns to all destinations, not just to each other. This fits the intuition that two states are more similar, the more similar their whole transition patterns. We can define this distance as: $D^{pct}_{ij} = \sqrt{\sum_k(p_{ik}-p_{jk})^2}$

That is, as an N-dimensional Euclidean combination of the difference between the outflow for i to k and j to k. This is clearly a metric distance, with $$D_{ii} = 0$$.

Another approach would be to use residuals between observed and expected values, instead of percentages. Given the conventional definition of the expected value under independence: $E_{ij} = \frac{n_{i+}n_{+j}}{n_{++}} = \frac{R_i C_j}{T}$ we can define a relative residual as $$RR = \frac{O-E}{E}$$, or a Pearson residual as $$PR = \frac{O-E}{\sqrt{E}}$$.

For each of these we can calculate a Euclidean sum of the $$RR_{ik} - RR_{jk}$$ or $$PR_{ik} - PR_{jk}$$ differences: $D^{RR}_{ij} = \sqrt{\sum_k(RR_{ik}-RR_{jk})^2}$ $D^{PR}_{ij} = \sqrt{\sum_k(PR_{ik}-PR_{jk})^2}$

The intuition behind these is that the pattern of transitions is well captured by the residuals, either relative (the residual divided by the expected value) or Pearson (divided by the root of the expected value; squaring this quantity and summing it across the table gives the Pearson chi-square). Residuals incorporate information on both row and column totals, unlike percentages, which have as denominator only the row total (or column total, for column percentages).

Finally, there is a measure described explicitly in the literature as a chi-square measure: $D^{chisq}_{ij} = \sqrt{\sum_k\frac{(p_{ik}-p_{jk})^2}{n_{+k}}}$ In effect, this is the same as the $$D^{pct}$$ measure, but with each $$(p_{ik}-p_{jk})^2$$ difference divided by the column total $$n_{+k}$$. Therefore this also incorporates row and column total information.
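The four profile-based measures differ only in what is compared across destinations k. A single Python sketch covers them all; it is illustrative only (not the SADI implementation), and `counts` is a hypothetical transition-count matrix:

```python
import math

def profile_distance(counts, kind):
    """Distances between states based on their full transition
    profiles; kind is one of "pct", "relres", "pearson", "chisq".
    counts[i][k] = observed transitions from state i to state k."""
    n = len(counts)
    rowtot = [sum(r) for r in counts]
    coltot = [sum(counts[i][k] for i in range(n)) for k in range(n)]
    total = sum(rowtot)
    p = [[counts[i][k] / rowtot[i] for k in range(n)] for i in range(n)]
    # expected counts under independence: E_ik = R_i * C_k / T
    e = [[rowtot[i] * coltot[k] / total for k in range(n)] for i in range(n)]

    def profile(i, k):
        if kind == "pct":
            return p[i][k]                                  # outflow proportion
        if kind == "relres":
            return (counts[i][k] - e[i][k]) / e[i][k]       # relative residual
        if kind == "pearson":
            return (counts[i][k] - e[i][k]) / math.sqrt(e[i][k])  # Pearson
        raise ValueError(kind)

    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if kind == "chisq":   # percentage differences weighted by n_+k
                d[i][j] = math.sqrt(sum((p[i][k] - p[j][k]) ** 2 / coltot[k]
                                        for k in range(n)))
            else:
                d[i][j] = math.sqrt(sum((profile(i, k) - profile(j, k)) ** 2
                                        for k in range(n)))
    return d
```

All four are metric by construction: symmetric, zero on the diagonal, and Euclidean in the profile space.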

sdmksubs, state(state) dtype(pct) subsmat(dpct) id(id)
matlist dpct, format(%6.3f)
sdmksubs, state(state) dtype(relres) subsmat(drelres) id(id)
matlist drelres, format(%6.3f)
sdmksubs, state(state) dtype(pearson) subsmat(dpearson) id(id)
matlist dpearson, format(%6.3f)
sdmksubs, state(state) dtype(chisq) subsmat(dchisq) id(id)
matlist dchisq, format(%6.3f)

Generating substitution matrix type pct
Scaling to 0--1 range
             |     c1      c2      c3      c4      c5      c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.972   0.000
          r3 |  1.000   0.988   0.000
          r4 |  0.979   0.965   0.981   0.000
          r5 |  0.964   0.966   0.988   0.965   0.000
          r6 |  0.939   0.931   0.966   0.940   0.931   0.000
Generating substitution matrix type relres
Scaling to 0--1 range
             |     c1      c2      c3      c4      c5      c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.397   0.000
          r3 |  0.551   0.650   0.000
          r4 |  0.741   0.818   0.899   0.000
          r5 |  0.608   0.700   0.799   0.939   0.000
          r6 |  0.703   0.779   0.875   1.000   0.901   0.000
Generating substitution matrix type pearson
Scaling to 0--1 range
             |     c1      c2      c3      c4      c5      c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.969   0.000
          r3 |  0.977   0.996   0.000
          r4 |  0.960   0.986   0.996   0.000
          r5 |  0.954   0.984   1.000   0.990   0.000
          r6 |  0.921   0.948   0.977   0.964   0.953   0.000
Generating substitution matrix type chisq
Scaling to 0--1 range
             |     c1      c2      c3      c4      c5      c6
-------------+------------------------------------------------
          r1 |  0.000
          r2 |  0.609   0.000
          r3 |  0.712   0.822   0.000
          r4 |  0.796   0.894   0.962   0.000
          r5 |  0.726   0.837   0.915   0.980   0.000
          r6 |  0.757   0.856   0.941   1.000   0.942   0.000


## Correlations between the measures

How do the measures compare? We can use the corrsqm command to look at the correlation of the distances (using one half of the matrices only, since they are symmetrical).
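Under the hood this is just a Pearson correlation of the stacked lower triangles (the vech) of the two symmetric matrices. A rough Python equivalent, illustrative only (the `dropdiag` flag is my own; corrsqm's actual interface may differ):

```python
import math

def vech_corr(a, b, dropdiag=False):
    """Pearson correlation of the lower triangles (vech) of two
    symmetric matrices, optionally excluding the diagonal."""
    n = len(a)
    cells = [(i, j) for i in range(n) for j in range(i + 1)
             if not (dropdiag and i == j)]
    xs = [a[i][j] for i, j in cells]
    ys = [b[i][j] for i, j in cells]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)  # fails if either vech is constant
```

Note that with the diagonal dropped, a flat matrix has no variation in its vech, so the correlation is undefined.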

If we retain the diagonal, we get higher correlations, because each state's distance to itself is zero:

matrix corr1 = J(9,9,1)
local measures = "flat linear t2s t2sn t2s12 dpct drelres dpearson dchisq"
forvalues i = 1/9 {
    local row : word `i' of `measures'
    forvalues j = `=`i'+1'/9 {
        local col : word `j' of `measures'
        qui corrsqm `row' `col'
        matrix corr1[`i',`j'] = r(rho)
        matrix corr1[`j',`i'] = r(rho)
    }
}
set linesize 150
matrix rownames corr1 = `measures'
matrix colnames corr1 = `measures'
matlist corr1, nohalf format(%4.3f)

             |  flat  lin~r    t2s   t2sn  t2s12   dpct  dre~s  dpe~n  dch~q
-------------+---------------------------------------------------------------
        flat | 1.000  0.707  1.000  0.946  0.994  0.999  0.933  0.999  0.972
      linear | 0.707  1.000  0.702  0.593  0.677  0.696  0.627  0.687  0.643
         t2s | 1.000  0.702  1.000  0.949  0.995  0.999  0.935  1.000  0.974
        t2sn | 0.946  0.593  0.949  1.000  0.968  0.945  0.964  0.953  0.981
       t2s12 | 0.994  0.677  0.995  0.968  1.000  0.994  0.949  0.995  0.983
        dpct | 0.999  0.696  0.999  0.945  0.994  1.000  0.927  1.000  0.969
     drelres | 0.933  0.627  0.935  0.964  0.949  0.927  1.000  0.934  0.989
    dpearson | 0.999  0.687  1.000  0.953  0.995  1.000  0.934  1.000  0.975
      dchisq | 0.972  0.643  0.974  0.981  0.983  0.969  0.989  0.975  1.000


What is notable here is that the linear matrix differs strongly from most of the others: the linear formulation is not appropriate for this data (it requires at least an ordinal state space). Also notable: the correlation between the flat and t2s matrices is almost perfect, and the dpearson matrix is also very close to the flat one.

However, the correlations are strongly affected by the inclusion of the diagonal, whose entries are all zero. If we drop the diagonal, we look only at differences between distinct states. The flat matrix now has no variation, so its correlations cannot be calculated, and the remaining correlations are much farther from 1.0:

matrix corr2 = J(8,8,1)
local measures = "linear t2s t2sn t2s12 dpct drelres dpearson dchisq"
forvalues i = 1/8 {
    local row : word `i' of `measures'
    forvalues j = `=`i'+1'/8 {
        local col : word `j' of `measures'
        qui corrsqm `row' `col'
        matrix corr2[`i',`j'] = r(rho)
        matrix corr2[`j',`i'] = r(rho)
    }
}
set linesize 150
matrix rownames corr2 = `measures'
matrix colnames corr2 = `measures'
matlist corr2, nohalf format(%4.3f)

             | lin~r    t2s   t2sn  t2s12   dpct  dre~s  dpe~n  dch~q
-------------+--------------------------------------------------------
      linear | 1.000  0.702  0.593  0.677  0.696  0.627  0.687  0.643
         t2s | 0.702  1.000  0.949  0.995  0.999  0.935  1.000  0.974
        t2sn | 0.593  0.949  1.000  0.968  0.945  0.964  0.953  0.981
       t2s12 | 0.677  0.995  0.968  1.000  0.994  0.949  0.995  0.983
        dpct | 0.696  0.999  0.945  0.994  1.000  0.927  1.000  0.969
     drelres | 0.627  0.935  0.964  0.949  0.927  1.000  0.934  0.989
    dpearson | 0.687  1.000  0.953  0.995  1.000  0.934  1.000  0.975
      dchisq | 0.643  0.974  0.981  0.983  0.969  0.989  0.975  1.000


Again the linear distance is the outlier, but there is some disagreement between the measures: relres and pct, for instance, disagree quite strongly about which states are similar and which are different (despite agreeing almost perfectly when the diagonal is included).

## Correlations between sequence distances

What do the results look like if we do OMA with the different substitution matrices? Note that while long format is needed to calculate the data-based substitution costs, oma and related commands require the data in wide format.

reshape wide
local measures = "flat linear t2s t2sn t2s12 dpct drelres dpearson dchisq"
foreach measure of local measures {
    oma state*, len(72) subs(`measure') indel(0.5) pwd(pwd`measure')
}

(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72)
Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations           51,264   ->   712
Number of variables                  17   ->   87
j variable (72 values)                t   ->   (dropped)
xij variables:
state   ->   state1 state2 ... state72
-----------------------------------------------------------------------------
Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations
Not normalising distances with respect to length
557 unique observations

matrix corr3 = J(9,9,1)
local measures = "flat linear t2s t2sn t2s12 dpct drelres dpearson dchisq"
forvalues i = 1/9 {
    local row : word `i' of `measures'
    forvalues j = `=`i'+1'/9 {
        local col : word `j' of `measures'
        qui corrsqm pwd`row' pwd`col'
        matrix corr3[`i',`j'] = r(rho)
        matrix corr3[`j',`i'] = r(rho)
    }
}
set linesize 150
matrix rownames corr3 = `measures'
matrix colnames corr3 = `measures'
matlist corr3, nohalf format(%3.2f)

             | flat  line   t2s  t2sn  t2s1  dpct  drel  dpea  dchi
-------------+------------------------------------------------------
        flat | 1.00  0.72  1.00  0.95  0.99  1.00  0.93  1.00  0.97
      linear | 0.72  1.00  0.71  0.69  0.69  0.70  0.77  0.70  0.74
         t2s | 1.00  0.71  1.00  0.95  1.00  1.00  0.93  1.00  0.98
        t2sn | 0.95  0.69  0.95  1.00  0.96  0.94  0.98  0.95  0.99
       t2s12 | 0.99  0.69  1.00  0.96  1.00  1.00  0.95  1.00  0.98
        dpct | 1.00  0.70  1.00  0.94  1.00  1.00  0.93  1.00  0.97
     drelres | 0.93  0.77  0.93  0.98  0.95  0.93  1.00  0.93  0.99
    dpearson | 1.00  0.70  1.00  0.95  1.00  1.00  0.93  1.00  0.98
      dchisq | 0.97  0.74  0.98  0.99  0.98  0.97  0.99  0.98  1.00


The distances between sequences show largely the same pattern as the distances between the states.

## Agreement between cluster solutions

From experience, I know that relatively small differences between distances can yield quite different cluster solutions. Let’s check, doing a cluster analysis and creating an 8-cluster solution for each pairwise distance matrix:

local measures = "flat linear t2s t2sn t2s12 dpct drelres dpearson dchisq"
foreach measure of local measures {
    clustermat wards pwd`measure', add name(`measure')
    cluster gen g`measure' = groups(8), name(`measure')
}

matrix corr4 = J(9,9,1)
local measures = "flat linear t2s t2sn t2s12 dpct drelres dpearson dchisq"
forvalues i = 1/9 {
    local row : word `i' of `measures'
    forvalues j = `=`i'+1'/9 {
        local col : word `j' of `measures'
        qui ari g`row' g`col'
        matrix corr4[`i',`j'] = r(ari)
        matrix corr4[`j',`i'] = r(ari)
    }
}
set linesize 150
matrix rownames corr4 = `measures'
matrix colnames corr4 = `measures'
matlist corr4, nohalf format(%3.2f)

             | flat  line   t2s  t2sn  t2s1  dpct  drel  dpea  dchi
-------------+------------------------------------------------------
        flat | 1.00  0.56  0.57  0.63  0.52  0.57  0.62  0.57  0.60
      linear | 0.56  1.00  0.45  0.57  0.48  0.45  0.59  0.45  0.55
         t2s | 0.57  0.45  1.00  0.54  0.79  1.00  0.55  1.00  0.68
        t2sn | 0.63  0.57  0.54  1.00  0.60  0.54  0.81  0.54  0.74
       t2s12 | 0.52  0.48  0.79  0.60  1.00  0.79  0.65  0.79  0.78
        dpct | 0.57  0.45  1.00  0.54  0.79  1.00  0.55  1.00  0.68
     drelres | 0.62  0.59  0.55  0.81  0.65  0.55  1.00  0.55  0.75
    dpearson | 0.57  0.45  1.00  0.54  0.79  1.00  0.55  1.00  0.68
      dchisq | 0.60  0.55  0.68  0.74  0.78  0.68  0.75  0.68  1.00
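The ari command used above reports the adjusted Rand index, a standard measure of agreement between two partitions (1 for identical clusterings, around 0 for chance agreement). As a reference point, the index can be sketched in a few lines of Python (illustrative, not the SADI implementation):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(x, y):
    """Adjusted Rand index between two cluster assignments of the
    same observations: (Index - Expected) / (Max - Expected),
    where the index counts pairs grouped together in both."""
    n = len(x)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(x, y)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(x).values())
    sum_cols = sum(comb(c, 2) for c in Counter(y).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)
```

The index is invariant to relabelling of the clusters, which is why it suits comparing solutions derived from different distance matrices.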


## Row and column focus

By default, all these measures are row oriented, and use values as illustrated in this table (the T2S measure only uses x42 and x24 here):

Table 1: Standard row-wise view of data for states I and J

   |    1     2     3     4     5
---+-------------------------------
 1 |
 2 |  x21   x22   x23   x24   x25
 3 |
 4 |  x41   x42   x43   x44   x45
 5 |

There is no particular reason not to use column data as in this table (for T2S, x42 and x24 are now calculated as inflow percentages):

Table 2: Corresponding column-wise view of data for states I and J

   |    1     2     3     4     5
---+-------------------------------
 1 |       x12         x14
 2 |       x22         x24
 3 |       x32         x34
 4 |       x42         x44
 5 |       x52         x54

The rcb() option allows you to use either row data or column data, and offers a “both” option to give the Euclidean sum of the two.
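The three variants can be sketched as follows. This Python sketch is illustrative only; in particular, it assumes "both" means the Euclidean combination sqrt(Dr^2 + Dc^2) of the row-based and column-based distances, and `counts` is a hypothetical transition-count matrix:

```python
import math

def pct_distance_rcb(counts, rcb="r"):
    """Percentage-profile distances using row data ('r'),
    column data ('c'), or both ('b')."""
    n = len(counts)
    rowtot = [sum(r) for r in counts]
    coltot = [sum(counts[k][j] for k in range(n)) for j in range(n)]

    def d_row(i, j):   # compare outflow (row) percentage profiles
        return math.sqrt(sum((counts[i][k] / rowtot[i]
                              - counts[j][k] / rowtot[j]) ** 2
                             for k in range(n)))

    def d_col(i, j):   # compare inflow (column) percentage profiles
        return math.sqrt(sum((counts[k][i] / coltot[i]
                              - counts[k][j] / coltot[j]) ** 2
                             for k in range(n)))

    pick = {"r": d_row, "c": d_col,
            "b": lambda i, j: math.hypot(d_row(i, j), d_col(i, j))}[rcb]
    return [[pick(i, j) for j in range(n)] for i in range(n)]
```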

reshape long
sdmksubs, state(state) dtype(pct) subsmat(dpctr) id(id) rcb(r)
sdmksubs, state(state) dtype(pct) subsmat(dpctc) id(id) rcb(c)
sdmksubs, state(state) dtype(pct) subsmat(dpctb) id(id) rcb(b)
corrsqm dpctr dpctc
corrsqm dpctr dpctb
corrsqm dpctc dpctb

(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72)
Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations              712   ->   51,264
Number of variables                 123   ->   53
j variable (72 values)                    ->   t
xij variables:
state1 state2 ... state72   ->   state
-----------------------------------------------------------------------------
Generating substitution matrix type pct
Scaling to 0--1 range
Generating substitution matrix type pct
Scaling to 0--1 range
Generating substitution matrix type pct
Scaling to 0--1 range
VECH correlation between dpctr and dpctc: 0.9997
VECH correlation between dpctr and dpctb: 0.9999
VECH correlation between dpctc and dpctb: 0.9999


## Installation

sdmksubs is part of the latest version of SADI. For now this is available only from http://teaching.sociology.ul.ie/sadi. In due course, a stable update of SADI will be uploaded to SSC.

To install, do the following from within Stata:

net from http://teaching.sociology.ul.ie/sadi
net install sadi, replace