Sequence analysts often use substitution costs based on transition rates. While I believe that using transition rates to define substitution costs is not always a good strategy, it can be useful and is implemented in SADI (via the `trans2subs`

command). It is also available in SQ (via the `subs(meanprobdistance)`

option).

The usual transition-based definition of the cost of a substitution between state \(i\) and \(j\) is:

\(

d_{ij} = d_{ji} = 2 – p_{ij} – p_{ji}

\)

for \(i \ne j\), and zero for \(i = j\). It has a maximum of 2 and a theoretical off-diagonal minimum of zero, if \(p_{ij} = p_{ji} = 1\). In practice values will remain reasonably close to 2 unless transitions are very high.

However, the SQ and SADI implementations do not agree in their results. Examination of the SQ code suggests this is because SQ defines \(p_{ij}\) as \(n_{ij}/n_{++}\), i.e., the number of transitions divided by the total number of \(t \rightarrow t+1\) observations (the grand total), instead of the more conventional \(n_{ij}/n_{i+}\), the number of transitions divided by the number of pairs starting in \(i\) (the row total). The difference has two main effects: first, the SQ results will tend to be much closer to 2 as \(p_{ij}\) is smaller because its denominator is bigger, and second, less common states will have values closer to 2 than more common states, because the denominator is a constant.

We can demonstrate this using the MVAD dataset that comes with SADI and SQ. We start by re-shaping the data into long format, and using `trans2subs`

to generate the transition probability matrix (note the use of the `diag`

option, to include \(t \rightarrow t+1\) observations that stay in the same state in the denominator; the default is to exclude these).

use mvad reshape long state, i(id) j(t) trans2subs state, id(id) subsmat(t2s) diag

In long format, we can view the pattern of transitions by creating the variable `last`

as the lag of `state`

, and tabulating them:

. by id: gen last = state[_n-1] (712 missing values generated) . tab last state | state last | E F H S T U | Total -----------+------------------------------------------------------------------+---------- 1 | 22,039 115 56 39 58 146 | 22,453 2 | 227 7,927 54 8 33 73 | 8,322 3 | 60 1 5,787 0 3 11 | 5,862 4 | 59 50 74 4,120 19 23 | 4,345 5 | 197 21 0 4 4,973 69 | 5,264 6 | 182 120 9 39 64 3,892 | 4,306 -----------+------------------------------------------------------------------+---------- Total | 22,764 8,234 5,980 4,210 5,150 4,214 | 50,552

We can calculate the transition-based substitution values by hand from this table. Using the conventional definition, the probabilities are the number of transitions divided by the corresponding row totals. The hand-calculated figures correspond with the `trans2subs`

results.

. matlist t2s | c1 c2 c3 c4 c5 c6 -------------+------------------------------------------------------------------ r1 | 0 r2 | 1.967601 0 r3 | 1.98727 1.993341 0 r4 | 1.984684 1.987531 1.982969 0 r5 | 1.959993 1.992045 1.999488 1.994867 0 r6 | 1.951231 1.96336 1.996033 1.985649 1.972029 0 . di 2 - 115/22453 - 227/8322 // 1<->2 1.9676011 . di 2 - 56/22453 - 60/5862 // 1<->3 1.9872705 . di 2 - 39/22453 - 59/4345 // 1<->4 1.9846842

We can run `sqom`

with the `sub(meanprobdistance)`

option, and it will show us the substitution matrix it calculates. We can reproduce the values in this matrix using the figures in the transition table, and the \(p_{ij} = n_{ij}/n_{++}\) formula:

. sqset state id t element variable: state, 1 to 6 identifier variable: id, 1 to 712 order variable: t, 1 to 72 . sqom, full sub(meanprobdistance) indel(2) standard(none) symmetric SQsubcost[6,6] c1 c2 c3 c4 c5 c6 r1 0 r2 1.9932347 0 r3 1.9977053 1.998912 0 r4 1.9980614 1.9988527 1.9985362 0 r5 1.9949557 1.9989318 1.9999407 1.999545 0 r6 1.9935116 1.9961821 1.9996044 1.9987735 1.997369 0 Perform 154846 Comparisons with Needleman-Wunsch Algorithm Running mata function Distance matrix saved as SQdist

. di 2 - 115/50552 - 227/50552 // 1<->2 1.9932347 . di 2 - 56/50552 - 60/50552 // 1<->3 1.9977053 . di 2 - 39/50552 - 59/50552 // 1<->4 1.9980614

Thus we see that SADI and SQ use different definitions of \(p_{ij}\). SADI’s is the more common definition of a transition probability, using the number of cases at risk of a transition as the denominator. The net result of the difference (apart from incompatibility) is that SQ’s transition-based substitution costs will tend to be very undifferentiated (all off-diagonal values very close to 2) while SADI’s will show more variance. (Note that SADI’s default operation of `trans2subs`

without the `diag`

option removes \(i \rightarrow i\) transitions, i.e, remaining in the same state, from the denominator, yielding values even further from 2.)