{"id":507,"date":"2018-07-10T22:49:48","date_gmt":"2018-07-10T22:49:48","guid":{"rendered":"http:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=507"},"modified":"2018-07-11T09:30:07","modified_gmt":"2018-07-11T09:30:07","slug":"generating-transition-based-substitution-costs-sadi-vs-sq","status":"publish","type":"post","link":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=507","title":{"rendered":"Generating transition-based substitution costs: SADI vs SQ"},"content":{"rendered":"<div class=\"outline-text-2\" id=\"text-1\">\n<p>\nSequence analysts often use substitution costs based on transition rates. While I believe that using transition rates to define substitution costs is not always a good strategy, it can be useful and is implemented in SADI (via the <code>trans2subs<\/code> command). It is also available in SQ (via the <code>subs(meanprobdistance)<\/code> option).\n<\/p>\n<p><!--more--><\/p>\n<p>\nThe usual transition-based definition of the cost of a substitution between states \\(i\\) and \\(j\\) is:<br \/>\n\\(<br \/>\nd_{ij} = d_{ji} = 2 - p_{ij} - p_{ji}<br \/>\n\\)<br \/>\nfor \\(i \\ne j\\), and zero for \\(i = j\\). It has a maximum of 2 and a theoretical off-diagonal minimum of zero, if \\(p_{ij} = p_{ji} = 1\\). In practice values will remain reasonably close to 2 unless transition probabilities are very high.\n<\/p>\n<p>\nHowever, the SQ and SADI implementations do not agree in their results. Examination of the SQ code suggests this is because SQ defines \\(p_{ij}\\) as \\(n_{ij}\/n_{++}\\), i.e., the number of transitions divided by the total number of \\(t \\rightarrow t+1\\) observations (the grand total), instead of the more conventional \\(n_{ij}\/n_{i+}\\), the number of transitions divided by the number of pairs starting in \\(i\\) (the row total). 
The difference has two main effects: first, the SQ results will tend to be much closer to 2, as \\(p_{ij}\\) is smaller because its denominator is bigger; and second, less common states will have values closer to 2 than more common states, because the denominator is the same constant for every state rather than that state&#8217;s own row total.\n<\/p>\n<p>\nWe can demonstrate this using the MVAD dataset that comes with SADI and SQ. We start by re-shaping the data into long format, and using <code>trans2subs<\/code> to generate the transition-based substitution matrix (note the use of the <code>diag<\/code> option, to include \\(t \\rightarrow t+1\\) observations that stay in the same state in the denominator; the default is to exclude these).\n<\/p>\n<pre>\r\nuse mvad\r\nreshape long state, i(id) j(t)\r\ntrans2subs state, id(id) subsmat(t2s) diag\r\n<\/pre>\n<p>\nIn long format, we can view the pattern of transitions by creating the variable <code>last<\/code> as the lag of <code>state<\/code>, and tabulating them:\n<\/p>\n<pre>\r\n. by id: gen last = state[_n-1]\r\n(712 missing values generated)\r\n\r\n. 
tab last state\r\n\r\n           |                               state\r\n      last |         E          F          H          S          T          U |     Total\r\n-----------+------------------------------------------------------------------+----------\r\n         1 |    22,039        115         56         39         58        146 |    22,453 \r\n         2 |       227      7,927         54          8         33         73 |     8,322 \r\n         3 |        60          1      5,787          0          3         11 |     5,862 \r\n         4 |        59         50         74      4,120         19         23 |     4,345 \r\n         5 |       197         21          0          4      4,973         69 |     5,264 \r\n         6 |       182        120          9         39         64      3,892 |     4,306 \r\n-----------+------------------------------------------------------------------+----------\r\n     Total |    22,764      8,234      5,980      4,210      5,150      4,214 |    50,552 \r\n<\/pre>\n<p>\nWe can calculate the transition-based substitution values by hand from this table. Using the conventional definition, the probabilities are the number of transitions divided by the corresponding row totals. The hand-calculated figures correspond with the <code>trans2subs<\/code> results.\n<\/p>\n<pre>\r\n. 
matlist t2s\r\n\r\n             |        c1         c2         c3         c4         c5         c6 \r\n-------------+------------------------------------------------------------------\r\n          r1 |         0                                                        \r\n          r2 |  1.967601          0                                             \r\n          r3 |   1.98727   1.993341          0                                  \r\n          r4 |  1.984684   1.987531   1.982969          0                       \r\n          r5 |  1.959993   1.992045   1.999488   1.994867          0            \r\n          r6 |  1.951231    1.96336   1.996033   1.985649   1.972029          0 \r\n\r\n. di 2 - 115\/22453 - 227\/8322 \/\/ 1<->2\r\n1.9676011\r\n\r\n. di 2 -  56\/22453 -  60\/5862 \/\/ 1<->3\r\n1.9872705\r\n\r\n. di 2 -  39\/22453 -  59\/4345 \/\/ 1<->4\r\n1.9846842\r\n\r\n<\/pre>\n<p>\nWe can run <code>sqom<\/code> with the <code>sub(meanprobdistance)<\/code> option, and it will show us the substitution matrix it calculates. We can reproduce the values in this matrix using the figures in the transition table, and the \\(p_{ij} = n_{ij}\/n_{++}\\) formula:\n<\/p>\n<pre>\r\n. sqset state id t\r\n\r\n       element variable:  state, 1 to 6\r\n       identifier variable:  id, 1 to 712\r\n       order variable:  t, 1 to 72\r\n\r\n. sqom, full sub(meanprobdistance) indel(2) standard(none)\r\n\r\nsymmetric SQsubcost[6,6]\r\n           c1         c2         c3         c4         c5         c6\r\nr1          0\r\nr2  1.9932347          0\r\nr3  1.9977053   1.998912          0\r\nr4  1.9980614  1.9988527  1.9985362          0\r\nr5  1.9949557  1.9989318  1.9999407   1.999545          0\r\nr6  1.9935116  1.9961821  1.9996044  1.9987735   1.997369          0\r\nPerform 154846 Comparisons with Needleman-Wunsch Algorithm\r\nRunning mata function\r\nDistance matrix saved as SQdist\r\n\r\n<\/pre>\n<pre>\r\n. di 2 - 115\/50552 - 227\/50552 \/\/ 1<->2\r\n1.9932347\r\n\r\n. 
di 2 -  56\/50552 -  60\/50552 \/\/ 1<->3\r\n1.9977053\r\n\r\n. di 2 -  39\/50552 -  59\/50552 \/\/ 1<->4\r\n1.9980614\r\n\r\n<\/pre>\n<p>\nThus we see that SADI and SQ use different definitions of \\(p_{ij}\\). SADI&#8217;s is the more common definition of a transition probability, using the number of cases at risk of a transition as the denominator. The net result of the difference (apart from incompatibility) is that SQ&#8217;s transition-based substitution costs will tend to be very undifferentiated (all off-diagonal values very close to 2) while SADI&#8217;s will show more variance. (Note that SADI&#8217;s default operation of <code>trans2subs<\/code> without the <code>diag<\/code> option removes \\(i \\rightarrow i\\) transitions, i.e., remaining in the same state, from the denominator, yielding values even further from 2.)\n<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Sequence analysts often use substitution costs based on transition rates. While I believe that using transition rates to define substitution costs is not always a good strategy, it can be useful and is implemented in SADI (via the trans2subs command). 
It is also available in SQ (via the subs(meanprobdistance) option).<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/507"}],"collection":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=507"}],"version-history":[{"count":18,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/507\/revisions"}],"predecessor-version":[{"id":526,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/507\/revisions\/526"}],"wp:attachment":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=507"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=507"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=507"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}