# Generating transition-based substitution costs: SADI vs SQ

Sequence analysts often use substitution costs based on transition rates. While I believe that using transition rates to define substitution costs is not always a good strategy, it can be useful and is implemented in SADI (via the trans2subs command). It is also available in SQ (via the subs(meanprobdistance) option).

The usual transition-based definition of the cost of a substitution between state $$i$$ and $$j$$ is:
$$d_{ij} = d_{ji} = 2 – p_{ij} – p_{ji}$$
for $$i \ne j$$, and zero for $$i = j$$. It has a maximum of 2 and a theoretical off-diagonal minimum of zero, if $$p_{ij} = p_{ji} = 1$$. In practice values will remain reasonably close to 2 unless transitions are very high.

However, the SQ and SADI implementations do not agree in their results. Examination of the SQ code suggests this is because SQ defines $$p_{ij}$$ as $$n_{ij}/n_{++}$$, i.e., the number of transitions divided by the total number of $$t \rightarrow t+1$$ observations (the grand total), instead of the more conventional $$n_{ij}/n_{i+}$$, the number of transitions divided by the number of pairs starting in $$i$$ (the row total). The difference has two main effects: first, the SQ results will tend to be much closer to 2 as $$p_{ij}$$ is smaller because its denominator is bigger, and second, less common states will have values closer to 2 than more common states, because the denominator is a constant.

We can demonstrate this using the MVAD dataset that comes with SADI and SQ. We start by re-shaping the data into long format, and using trans2subs to generate the transition probability matrix (note the use of the diag option, to include $$t \rightarrow t+1$$ observations that stay in the same state in the denominator; the default is to exclude these).

use mvad
reshape long state, i(id) j(t)
trans2subs state, id(id) subsmat(t2s) diag


In long format, we can view the pattern of transitions by creating the variable last as the lag of state, and tabulating them:

. by id: gen last = state[_n-1]
(712 missing values generated)

. tab last state

|                               state
last |         E          F          H          S          T          U |     Total
-----------+------------------------------------------------------------------+----------
1 |    22,039        115         56         39         58        146 |    22,453
2 |       227      7,927         54          8         33         73 |     8,322
3 |        60          1      5,787          0          3         11 |     5,862
4 |        59         50         74      4,120         19         23 |     4,345
5 |       197         21          0          4      4,973         69 |     5,264
6 |       182        120          9         39         64      3,892 |     4,306
-----------+------------------------------------------------------------------+----------
Total |    22,764      8,234      5,980      4,210      5,150      4,214 |    50,552


We can calculate the transition-based substitution values by hand from this table. Using the conventional definition, the probabilities are the number of transitions divided by the corresponding row totals. The hand-calculated figures correspond with the trans2subs results.

. matlist t2s

|        c1         c2         c3         c4         c5         c6
-------------+------------------------------------------------------------------
r1 |         0
r2 |  1.967601          0
r3 |   1.98727   1.993341          0
r4 |  1.984684   1.987531   1.982969          0
r5 |  1.959993   1.992045   1.999488   1.994867          0
r6 |  1.951231    1.96336   1.996033   1.985649   1.972029          0

. di 2 - 115/22453 - 227/8322 // 1<->2
1.9676011

. di 2 -  56/22453 -  60/5862 // 1<->3
1.9872705

. di 2 -  39/22453 -  59/4345 // 1<->4
1.9846842



We can run sqom with the sub(meanprobdistance) option, and it will show us the substitution matrix it calculates. We can reproduce the values in this matrix using the figures in the transition table, and the $$p_{ij} = n_{ij}/n_{++}$$ formula:

. sqset state id t

element variable:  state, 1 to 6
identifier variable:  id, 1 to 712
order variable:  t, 1 to 72

. sqom, full sub(meanprobdistance) indel(2) standard(none)

symmetric SQsubcost[6,6]
c1         c2         c3         c4         c5         c6
r1          0
r2  1.9932347          0
r3  1.9977053   1.998912          0
r4  1.9980614  1.9988527  1.9985362          0
r5  1.9949557  1.9989318  1.9999407   1.999545          0
r6  1.9935116  1.9961821  1.9996044  1.9987735   1.997369          0
Perform 154846 Comparisons with Needleman-Wunsch Algorithm
Running mata function
Distance matrix saved as SQdist


. di 2 - 115/50552 - 227/50552 // 1<->2
1.9932347

. di 2 -  56/50552 -  60/50552 // 1<->3
1.9977053

. di 2 -  39/50552 -  59/50552 // 1<->4
1.9980614



Thus we see that SADI and SQ use different definitions of $$p_{ij}$$. SADI’s is the more common definition of a transition probability, using the number of cases at risk of a transition as the denominator. The net result of the difference (apart from incompatibility) is that SQ’s transition-based substitution costs will tend to be very undifferentiated (all off-diagonal values very close to 2) while SADI’s will show more variance. (Note that SADI’s default operation of trans2subs without the diag option removes $$i \rightarrow i$$ transitions, i.e, remaining in the same state, from the denominator, yielding values even further from 2.)