{"id":796,"date":"2023-07-16T10:07:01","date_gmt":"2023-07-16T10:07:01","guid":{"rendered":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=796"},"modified":"2023-07-16T10:07:01","modified_gmt":"2023-07-16T10:07:01","slug":"sdmksubs-a-new-sadi-command-for-substitution-costs","status":"publish","type":"post","link":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=796","title":{"rendered":"SDMKSUBS: A new SADI command for substitution costs"},"content":{"rendered":"\n<div id=\"content\" class=\"content\">\n<div id=\"table-of-contents\" role=\"doc-toc\">\n<h2>Table of Contents<\/h2>\n<div id=\"text-table-of-contents\" role=\"doc-toc\">\n<ul>\n<li><a href=\"#orgdb4d1df\">A new command<\/a><\/li>\n<li><a href=\"#orgc3620fb\">Simple matrices<\/a><\/li>\n<li><a href=\"#org778c51f\">Data based matrices<\/a>\n<ul>\n<li><a href=\"#orgc2fe8df\">Traditional transitions-based substitution matrix<\/a><\/li>\n<li><a href=\"#orgcdff383\">Other metrics<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"#orgde4c383\">Correlations between the measures<\/a><\/li>\n<li><a href=\"#org7d4d697\">Correlations between sequence distances<\/a><\/li>\n<li><a href=\"#orge3c1436\">Agreement between cluster solutions<\/a><\/li>\n<li><a href=\"#orga7a3dc4\">Row and column focus<\/a><\/li>\n<li><a href=\"#org0c9a9f6\">Installation<\/a><\/li>\n<\/ul>\n<\/div>\n<\/div>\n\n\n<div id=\"outline-container-orgdb4d1df\" class=\"outline-2\">\n<h2 id=\"orgdb4d1df\">A new command<\/h2>\n<div class=\"outline-text-2\" id=\"text-orgdb4d1df\">\n<p>\nIn this blog I introduce a new utility command, <code>sdmksubs<\/code>, which creates substitution cost matrices for sequence analysis, as part of the SADI Stata add-ons.\n<\/p>\n\n<p>\nMost sequence-analysis distance commands require the specification of substitution costs, which describe the pattern of differences in the state space through which the sequences move. 
These can be derived from theory, from external data, or can be imposed by researcher fiat. It is also common to use the pattern of transitions in the sequence data to derive them, though this is not an unproblematically good idea. The existing <code>trans2subs<\/code> command in SADI calculates simple transition-based substitution costs. The new <code>sdmksubs<\/code> calculates this substitution cost structure and a range of others: some simple &#8220;theoretical&#8221; ones, and some that are based on the transition pattern but take more of the data into account than the traditional <code>trans2subs<\/code> matrix.\n<\/p>\n\n<p>\nSADI is a Stata package for sequence analysis of data such as lifecourse histories, and has been around for quite a while. Recent improvements include fixes for internal changes in Stata 18, lifting limits on sequence length, etc., but here I focus on <code>sdmksubs<\/code> only.\n<\/p>\n<\/div>\n<\/div><\/div>\n\n\n\n<!--more-->\n\n\n\n<div id=\"outline-container-orgc3620fb\" class=\"outline-2\">\n<h2 id=\"orgc3620fb\">Simple matrices<\/h2>\n<div class=\"outline-text-2\" id=\"text-orgc3620fb\">\n<p>\nThe simplest possible parameterisation of substitution costs is to treat each state as equally different from all others. This yields a &#8220;flat&#8221; substitution cost matrix:\n<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-stata\"><span style=\"color: #a020f0\">clear<\/span> all\n<span style=\"color: #a020f0\">use<\/span> mvad, <span style=\"color: #a020f0\">clear<\/span>\n<span style=\"color: #a020f0\">egen<\/span> maxcheck = rowmax(state*)\nsu maxcheck\nloc nstates=r(max)\nsdmksubs, dtype(flat) subsmat(flat) nstates(6)\nmatlist flat, <span style=\"color: #a020f0\">format<\/span>(%6.0f)\n<\/pre>\n<\/div>\n<pre class=\"example\" id=\"orgb2086a7\">\n    Variable |        Obs        Mean    Std. dev.       
Min        Max\n-------------+---------------------------------------------------------\n    maxcheck |        712    4.724719     1.57425          1          6\nScaling to 0--1 range\n             |     c1      c2      c3      c4      c5      c6\n-------------+------------------------------------------------\n          r1 |      0\n          r2 |      1       0\n          r3 |      1       1       0\n          r4 |      1       1       1       0\n          r5 |      1       1       1       1       0\n          r6 |      1       1       1       1       1       0\n<\/pre>\n<p>\n(Explanation: load the example MVAD data that comes with SADI; check the max of the <code>state1-state72<\/code> state variables; generate the &#8220;flat&#8221; and &#8220;linear&#8221; matrices using this maximum in the <code>nstates()<\/code> option.)\n<\/p>\n<p>\nFor N states, a flat matrix implies a state-space with N-1 dimensions. Another very simple scheme is to treat the numbers attached to the states as their location on a single dimension. Then the difference or distance is the absolute difference between the state values. 
This is the &#8220;linear&#8221; scheme, here scaled so the largest distance is 1.0:\n<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-stata\">sdmksubs, dtype(linear) subsmat(linear) nstates(6)\nmatlist linear, <span style=\"color: #a020f0\">format<\/span>(%6.3f)\n<\/pre>\n<\/div>\n<pre class=\"example\" id=\"orgb410fa4\">\nScaling to 0--1 range\n             |     c1      c2      c3      c4      c5      c6\n-------------+------------------------------------------------\n          r1 |  0.000\n          r2 |  0.200   0.000\n          r3 |  0.400   0.200   0.000\n          r4 |  0.600   0.400   0.200   0.000\n          r5 |  0.800   0.600   0.400   0.200   0.000\n          r6 |  1.000   0.800   0.600   0.400   0.200   0.000\n<\/pre>\n<\/div>\n<\/div>\n<div id=\"outline-container-org778c51f\" class=\"outline-2\">\n<h2 id=\"org778c51f\">Data based matrices<\/h2>\n<div class=\"outline-text-2\" id=\"text-org778c51f\">\n<p>\nWe can also generate substitution costs from the sequence data, from the pattern of transitions. This is much more popular than it should be, partly because analysts don&#8217;t want to add more assumptions (in the form of a theoretically-driven substitution cost scheme) and prefer to let the data talk, and partly because transition tables and substitution cost matrices have a very similar structure (but transitions and substitutions are <b>ENTIRELY<\/b> different things!). Nonetheless it is reasonably intuitive that states that are more similar to each other will have more transitions between them. (Whether we can recover the implicit spatial relationships from the transition rates is a research exercise that I am currently engaged in, and will write up soon!)\n<\/p>\n<p>\nA number of different ways of using the transition data are available. All of them start from a tabulation of the transitions in the data. The data needs to be in long format for this, hence the <code>reshape long<\/code> command in the example below. 
A corresponding <code>reshape wide<\/code> command needs to be issued before running <code>oma<\/code>, <code>sdhamming<\/code> or <code>twed<\/code>, etc.\n<\/p>\n<\/div>\n<div id=\"outline-container-orgc2fe8df\" class=\"outline-3\">\n<h3 id=\"orgc2fe8df\">Traditional transitions-based substitution matrix<\/h3>\n<div class=\"outline-text-3\" id=\"text-orgc2fe8df\">\n<p>\nThe traditional transitions-based substitution matrix (seen from TDA onwards) calculates the substitution cost between states i and j as \\(2 - p_{ij} - p_{ji}\\), where \\(p_{ij}\\) is the outflow percentage from i to j (expressed as a proportion). This has a theoretical range of 0 to 2. However, it lacks consistency, as in general the diagonal values will be greater than zero (i.e., \\(2 - 2p_{ii}\\)), whereas the substitution cost for like with like must be zero. Even if we impose zeros on the diagonal, it is possible for the resulting values to violate the triangle inequality, i.e., to be non-metric (I will demonstrate this in a forthcoming paper using brute-force simulation). \n<\/p>\n<p>\nThe following is the result from calculating this on the MVAD dataset that is included in SADI, scaled to have a maximum of 1.0. As is evident, most values are very close to 1 because transitions are quite rare in this data, so the \\(p_{ij}\\) values are small off the diagonal. 
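\n<\/p>\n<p>\nTo see why, here is a small worked illustration with made-up proportions (hypothetical values, not taken from MVAD). Suppose \\(p_{ij} = 0.02\\), \\(p_{ji} = 0.01\\) and \\(p_{ii} = 0.97\\). Then, before any rescaling:\n\\[\n2 - p_{ij} - p_{ji} = 2 - 0.02 - 0.01 = 1.97, \\qquad 2 - 2p_{ii} = 2 - 1.94 = 0.06\n\\]\nThe off-diagonal cost is within 1.5% of the theoretical maximum of 2, while the unadjusted diagonal value is small but not zero. 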
\n<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-stata\">reshape <span style=\"color: #228b22\">long<\/span> state, i(id) j(t)\nsdmksubs, state(state) dtype(t2s) subsmat(t2s) id(id)\n<\/pre>\n<\/div>\n<pre class=\"example\" id=\"org9b563af\">\n(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 5\n2 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72)\nData                               Wide   -&gt;   Long\n-----------------------------------------------------------------------------\nNumber of observations              712   -&gt;   51,264\nNumber of variables                  87   -&gt;   17\nj variable (72 values)                    -&gt;   t\nxij variables:\n              state1 state2 ... state72   -&gt;   state\n-----------------------------------------------------------------------------\nGenerating substitution matrix type t2s\nScaling to 0--1 range\n<\/pre>\n<div class=\"org-src-container\">\n<pre class=\"src src-stata\">matlist t2s, <span style=\"color: #a020f0\">format<\/span>(%6.3f)\n<\/pre>\n<\/div>\n<pre class=\"example\">\n             |     c1      c2      c3      c4      c5      c6\n-------------+------------------------------------------------\n          r1 |  0.000\n          r2 |  0.984   0.000\n          r3 |  0.994   0.997   0.000\n          r4 |  0.993   0.994   0.992   0.000\n          r5 |  0.980   0.996   1.000   0.998   0.000\n          r6 |  0.976   0.982   0.998   0.993   0.986   0.000\n<\/pre>\n<p>\nOne common strategy to deal with this bunching of values near 1 is to suppress the diagonal (i.e., \\(p_{ij} = n_{ij}\/(n_{++} - n_{ii})\\)). 
This makes for greater differences among the states.\n<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-stata\">sdmksubs, state(state) dtype(t2s) subsmat(t2sn) dropdiag id(id)\nmatlist t2sn, <span style=\"color: #a020f0\">format<\/span>(%6.3f)\n<\/pre>\n<\/div>\n<pre class=\"example\" id=\"orgddffe2c\">\nGenerating substitution matrix type t2s\nScaling to 0--1 range\n             |     c1      c2      c3      c4      c5      c6\n-------------+------------------------------------------------\n          r1 |  0.000\n          r2 |  0.585   0.000\n          r3 |  0.543   0.944   0.000\n          r4 |  0.839   0.897   0.853   0.000\n          r5 |  0.604   0.941   1.000   0.970   0.000\n          r6 |  0.616   0.778   0.934   0.920   0.821   0.000\n<\/pre>\n<p>\nAnother strategy is to increase the lag in the transition table, i.e., t-lag to t, instead of t-1 to t. Here, for a lag of 12 months rather than one:\n<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-stata\">sdmksubs, state(state) dtype(t2s) subsmat(t2s12) lag(12) id(id)\nmatlist t2s12, <span style=\"color: #a020f0\">format<\/span>(%6.3f)\n<\/pre>\n<\/div>\n<pre class=\"example\" id=\"org920a976\">\nGenerating substitution matrix type t2s\nScaling to 0--1 range\n             |     c1      c2      c3      c4      c5      c6\n-------------+------------------------------------------------\n          r1 |  0.000\n          r2 |  0.861   0.000\n          r3 |  0.934   0.951   0.000\n          r4 |  0.943   0.945   0.893   0.000\n          r5 |  0.799   0.959   1.000   0.985   0.000\n          r6 |  0.847   0.946   0.988   0.982   0.917   0.000\n<\/pre>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgcdff383\" class=\"outline-3\">\n<h3 id=\"orgcdff383\">Other metrics<\/h3>\n<div class=\"outline-text-3\" id=\"text-orgcdff383\">\n<p>\nThis measure uses only two of the cells in the transition table. We can formulate a number of others that use more of the table. 
For instance, we can compare i and j in terms of their outflow patterns to all destinations, not just each other. This fits with the intuition that two states are more similar, the more similar their whole transition pattern. We can define this distance as:\n\\[\nD^{pct}_{ij} = \\sqrt{\\sum_k(p_{ik}-p_{jk})^2}\n\\]\n<\/p>\n<p>\nThat is, as an N-dimensional Euclidean combination of the differences between the outflows from i to k and from j to k. This is clearly a metric distance, with \\(D_{ii} = 0\\). \n<\/p>\n<p>\nAnother approach would be to use residuals between observed and expected values, instead of percentages. Given the conventional definition of the expected value under independence:\n\\[\nE_{ij} = \\frac{n_{i+}n_{+j}}{n_{++}} = \\frac{R_i C_j}{T}\n\\]\nwe can define a relative residual as \\(RR = \\frac{O-E}{E}\\), or a Pearson residual as \\(PR = \\frac{O-E}{\\sqrt{E}}\\).\n<\/p>\n<p>\nFor each of these we can calculate a Euclidean sum of the \\(RR_{ik} - RR_{jk}\\) or \\(PR_{ik} - PR_{jk}\\) differences:\n\\[\nD^{RR}_{ij} = \\sqrt{\\sum_k(RR_{ik}-RR_{jk})^2}\n\\]\n\\[\nD^{PR}_{ij} = \\sqrt{\\sum_k(PR_{ik}-PR_{jk})^2}\n\\]\n<\/p>\n<p>\nThe intuition behind these is that the pattern of transitions is well captured by the residuals, either relative (the residual divided by the expected value) or Pearson (divided by the root of the expected value; squaring this quantity and summing it across the table gives the Pearson Chi-square). Residuals incorporate information on both row and column totals, unlike percentages, which have as denominator only the row total (or column total, for column percentages). \n<\/p>\n<p>\nFinally, there is a measure described explicitly in the literature as a chi-square measure:\n\\[\nD^{chisq}_{ij} = \\sqrt{\\sum_k\\frac{(p_{ik}-p_{jk})^2}{n_{+k}}}\n\\]\nIn effect, this is the same as the \\(D^{pct}\\) measure, but with each squared difference \\((p_{ik}-p_{jk})^2\\) divided by the column total for k. 
Therefore this also incorporates row and column total information. \n<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-stata\">sdmksubs, state(state) dtype(pct) subsmat(dpct) id(id)\nmatlist dpct, <span style=\"color: #a020f0\">format<\/span>(%6.3f)\nsdmksubs, state(state) dtype(relres) subsmat(drelres) id(id)\nmatlist drelres, <span style=\"color: #a020f0\">format<\/span>(%6.3f)\nsdmksubs, state(state) dtype(pearson) subsmat(dpearson) id(id)\nmatlist dpearson, <span style=\"color: #a020f0\">format<\/span>(%6.3f)\nsdmksubs, state(state) dtype(chisq) subsmat(dchisq) id(id)\nmatlist dchisq, <span style=\"color: #a020f0\">format<\/span>(%6.3f)\n<\/pre>\n<\/div>\n<pre class=\"example\" id=\"orgec163bf\">\nGenerating substitution matrix type pct\nScaling to 0--1 range\n             |     c1      c2      c3      c4      c5      c6\n-------------+------------------------------------------------\n          r1 |  0.000\n          r2 |  0.972   0.000\n          r3 |  1.000   0.988   0.000\n          r4 |  0.979   0.965   0.981   0.000\n          r5 |  0.964   0.966   0.988   0.965   0.000\n          r6 |  0.939   0.931   0.966   0.940   0.931   0.000\nGenerating substitution matrix type relres\nScaling to 0--1 range\n             |     c1      c2      c3      c4      c5      c6\n-------------+------------------------------------------------\n          r1 |  0.000\n          r2 |  0.397   0.000\n          r3 |  0.551   0.650   0.000\n          r4 |  0.741   0.818   0.899   0.000\n          r5 |  0.608   0.700   0.799   0.939   0.000\n          r6 |  0.703   0.779   0.875   1.000   0.901   0.000\nGenerating substitution matrix type pearson\nScaling to 0--1 range\n             |     c1      c2      c3      c4      c5      c6\n-------------+------------------------------------------------\n          r1 |  0.000\n          r2 |  0.969   0.000\n          r3 |  0.977   0.996   0.000\n          r4 |  0.960   0.986   0.996   0.000\n          r5 |  0.954   0.984   1.000   
0.990   0.000\n          r6 |  0.921   0.948   0.977   0.964   0.953   0.000\nGenerating substitution matrix type chisq\nScaling to 0--1 range\n             |     c1      c2      c3      c4      c5      c6\n-------------+------------------------------------------------\n          r1 |  0.000\n          r2 |  0.609   0.000\n          r3 |  0.712   0.822   0.000\n          r4 |  0.796   0.894   0.962   0.000\n          r5 |  0.726   0.837   0.915   0.980   0.000\n          r6 |  0.757   0.856   0.941   1.000   0.942   0.000\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgde4c383\" class=\"outline-2\">\n<h2 id=\"orgde4c383\">Correlations between the measures<\/h2>\n<div class=\"outline-text-2\" id=\"text-orgde4c383\">\n<p>\nHow do the measures compare? We can use the <code>corrsqm<\/code> command to look at the correlation of the distances (using one half of the matrices only, since they are symmetrical). \n<\/p>\n<p>\nIf we retain the diagonal, we get higher correlations, because each state&#8217;s distance to itself is zero:\n<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-stata\">matrix corr1 = J(9,9,1)\nlocal<span style=\"color: #a0522d\"> measures<\/span> = <span style=\"color: #8b2252\">\"flat linear t2s t2sn t2s12 dpct drelres dpearson dchisq\"<\/span>\nforvalues i = 1\/9 {\n  local<span style=\"color: #a0522d\"> row<\/span> : word `<span style=\"color: #a0522d\">i<\/span>' of `<span style=\"color: #a0522d\">measures<\/span>'\n  forvalues j = `=`<span style=\"color: #a0522d\">i<\/span>'+1'\/9 {\n    local<span style=\"color: #a0522d\"> col<\/span> : word `<span style=\"color: #a0522d\">j<\/span>' of `<span style=\"color: #a0522d\">measures<\/span>'\n    qui corrsqm `<span style=\"color: #a0522d\">row<\/span>' `<span style=\"color: #a0522d\">col<\/span>'\n    matrix corr1[`<span style=\"color: #a0522d\">i<\/span>',`<span style=\"color: #a0522d\">j<\/span>'] = r(rho)\n    matrix corr1[`<span style=\"color: 
#a0522d\">j<\/span>',`<span style=\"color: #a0522d\">i<\/span>'] = r(rho)\n  }\n}\nset l<span style=\"color: #228b22\">inesize<\/span> 150\nmatrix rownames corr1 = `<span style=\"color: #a0522d\">measures<\/span>'\nmatrix colnames corr1 = `<span style=\"color: #a0522d\">measures<\/span>'\nmatlist corr1, nohalf <span style=\"color: #a020f0\">format<\/span>(%4.3f)\n<\/pre>\n<\/div>\n<pre class=\"example\" id=\"org27e3bee\">\n             |  flat  lin~r    t2s   t2sn  t2s12   dpct  dre~s  dpe~n  dch~q\n-------------+---------------------------------------------------------------\n        flat | 1.000  0.707  1.000  0.946  0.994  0.999  0.933  0.999  0.972\n      linear | 0.707  1.000  0.702  0.593  0.677  0.696  0.627  0.687  0.643\n         t2s | 1.000  0.702  1.000  0.949  0.995  0.999  0.935  1.000  0.974\n        t2sn | 0.946  0.593  0.949  1.000  0.968  0.945  0.964  0.953  0.981\n       t2s12 | 0.994  0.677  0.995  0.968  1.000  0.994  0.949  0.995  0.983\n        dpct | 0.999  0.696  0.999  0.945  0.994  1.000  0.927  1.000  0.969\n     drelres | 0.933  0.627  0.935  0.964  0.949  0.927  1.000  0.934  0.989\n    dpearson | 0.999  0.687  1.000  0.953  0.995  1.000  0.934  1.000  0.975\n      dchisq | 0.972  0.643  0.974  0.981  0.983  0.969  0.989  0.975  1.000\n<\/pre>\n<p>\nWhat is notable here is that the linear matrix differs strongly from most of the others: the linear formulation is not appropriate for this data (it requires at least an ordinal space). Also notable: the correlation between the flat and t2s matrices is almost perfect. The dpearson matrix is also very close to the flat.\n<\/p>\n<p>\nHowever, the correlations are strongly affected by the inclusion of the diagonal, all zero. If we drop the diagonal we look only at differences between different states. 
The flat matrix now has no variation, so the correlation cannot be calculated, and the correlations are much farther from 1.0:\n<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-stata\">matrix corr2 = J(8,8,1)\nlocal<span style=\"color: #a0522d\"> measures<\/span> = <span style=\"color: #8b2252\">\"linear t2s t2sn t2s12 dpct drelres dpearson dchisq\"<\/span>\nforvalues i = 1\/8 {\n  local<span style=\"color: #a0522d\"> row<\/span> : word `<span style=\"color: #a0522d\">i<\/span>' of `<span style=\"color: #a0522d\">measures<\/span>'\n  forvalues j = `=`<span style=\"color: #a0522d\">i<\/span>'+1'\/8 {\n    local<span style=\"color: #a0522d\"> col<\/span> : word `<span style=\"color: #a0522d\">j<\/span>' of `<span style=\"color: #a0522d\">measures<\/span>'\n    qui corrsqm `<span style=\"color: #a0522d\">row<\/span>' `<span style=\"color: #a0522d\">col<\/span>'\n    matrix corr2[`<span style=\"color: #a0522d\">i<\/span>',`<span style=\"color: #a0522d\">j<\/span>'] = r(rho)\n    matrix corr2[`<span style=\"color: #a0522d\">j<\/span>',`<span style=\"color: #a0522d\">i<\/span>'] = r(rho)\n  }\n}\nset l<span style=\"color: #228b22\">inesize<\/span> 150\nmatrix rownames corr2 = `<span style=\"color: #a0522d\">measures<\/span>'\nmatrix colnames corr2 = `<span style=\"color: #a0522d\">measures<\/span>'\nmatlist corr2, nohalf <span style=\"color: #a020f0\">format<\/span>(%4.3f)\n<\/pre>\n<\/div>\n<pre class=\"example\" id=\"org9a17064\">\n             | lin~r    t2s   t2sn  t2s12   dpct  dre~s  dpe~n  dch~q\n-------------+--------------------------------------------------------\n      linear | 1.000  0.702  0.593  0.677  0.696  0.627  0.687  0.643\n         t2s | 0.702  1.000  0.949  0.995  0.999  0.935  1.000  0.974\n        t2sn | 0.593  0.949  1.000  0.968  0.945  0.964  0.953  0.981\n       t2s12 | 0.677  0.995  0.968  1.000  0.994  0.949  0.995  0.983\n        dpct | 0.696  0.999  0.945  0.994  1.000  0.927  1.000  0.969\n     drelres | 0.627  0.935  0.964  
0.949  0.927  1.000  0.934  0.989\n    dpearson | 0.687  1.000  0.953  0.995  1.000  0.934  1.000  0.975\n      dchisq | 0.643  0.974  0.981  0.983  0.969  0.989  0.975  1.000\n<\/pre>\n<p>\nAgain the linear distance is the outlier, but there is some disagreement between the measures: relres and pct, for instance, disagree quite strongly about which states are similar and which are different (despite agreeing almost perfectly when the diagonal is included). \n<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-org7d4d697\" class=\"outline-2\">\n<h2 id=\"org7d4d697\">Correlations between sequence distances<\/h2>\n<div class=\"outline-text-2\" id=\"text-org7d4d697\">\n<p>\nWhat do the results look like if we do OMA with the different substitution matrices? Note that while long format is needed to calculate the data-based substitution costs, <code>oma<\/code> and related commands require the data in wide format. \n<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-stata\">reshape wide\nlocal<span style=\"color: #a0522d\"> measures<\/span> = <span style=\"color: #8b2252\">\"flat linear t2s t2sn t2s12 dpct drelres dpearson dchisq\"<\/span>\nforeach measure of local measures  {\n  oma state*, len(72) subs(`<span style=\"color: #a0522d\">measure<\/span>') indel(0.5) pwd(pwd`<span style=\"color: #a0522d\">measure<\/span>')\n}\n<\/pre>\n<\/div>\n<pre class=\"example\" id=\"orgd399ab9\">\n(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 5\n2 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72)\nData                               Long   -&gt;   Wide\n-----------------------------------------------------------------------------\nNumber of observations           51,264   -&gt;   712\nNumber of variables                  17   -&gt;   87\nj variable (72 values)                t   -&gt;   (dropped)\nxij variables:\n                                  state   -&gt; 
  state1 state2 ... state72\n-----------------------------------------------------------------------------\nNot normalising distances with respect to length\n557 unique observations\nNot normalising distances with respect to length\n557 unique observations\nNot normalising distances with respect to length\n557 unique observations\nNot normalising distances with respect to length\n557 unique observations\nNot normalising distances with respect to length\n557 unique observations\nNot normalising distances with respect to length\n557 unique observations\nNot normalising distances with respect to length\n557 unique observations\nNot normalising distances with respect to length\n557 unique observations\nNot normalising distances with respect to length\n557 unique observations\n<\/pre>\n<div class=\"org-src-container\">\n<pre class=\"src src-stata\">matrix corr3 = J(9,9,1)\nlocal<span style=\"color: #a0522d\"> measures<\/span> = <span style=\"color: #8b2252\">\"flat linear t2s t2sn t2s12 dpct drelres dpearson dchisq\"<\/span>\nforvalues i = 1\/9 {\n  local<span style=\"color: #a0522d\"> row<\/span> : word `<span style=\"color: #a0522d\">i<\/span>' of `<span style=\"color: #a0522d\">measures<\/span>'\n  forvalues j = `=`<span style=\"color: #a0522d\">i<\/span>'+1'\/9 {\n    local<span style=\"color: #a0522d\"> col<\/span> : word `<span style=\"color: #a0522d\">j<\/span>' of `<span style=\"color: #a0522d\">measures<\/span>'\n    qui corrsqm pwd`<span style=\"color: #a0522d\">row<\/span>' pwd`<span style=\"color: #a0522d\">col<\/span>'\n    matrix corr3[`<span style=\"color: #a0522d\">i<\/span>',`<span style=\"color: #a0522d\">j<\/span>'] = r(rho)\n    matrix corr3[`<span style=\"color: #a0522d\">j<\/span>',`<span style=\"color: #a0522d\">i<\/span>'] = r(rho)\n  }\n}\nset l<span style=\"color: #228b22\">inesize<\/span> 150\nmatrix rownames corr3 = `<span style=\"color: #a0522d\">measures<\/span>'\nmatrix colnames corr3 = `<span style=\"color: 
#a0522d\">measures<\/span>'\nmatlist corr3, nohalf <span style=\"color: #a020f0\">format<\/span>(%3.2f)\n<\/pre>\n<\/div>\n<pre class=\"example\" id=\"org6486026\">\n             | flat  line   t2s  t2sn  t2s1  dpct  drel  dpea  dchi\n-------------+------------------------------------------------------\n        flat | 1.00  0.72  1.00  0.95  0.99  1.00  0.93  1.00  0.97\n      linear | 0.72  1.00  0.71  0.69  0.69  0.70  0.77  0.70  0.74\n         t2s | 1.00  0.71  1.00  0.95  1.00  1.00  0.93  1.00  0.98\n        t2sn | 0.95  0.69  0.95  1.00  0.96  0.94  0.98  0.95  0.99\n       t2s12 | 0.99  0.69  1.00  0.96  1.00  1.00  0.95  1.00  0.98\n        dpct | 1.00  0.70  1.00  0.94  1.00  1.00  0.93  1.00  0.97\n     drelres | 0.93  0.77  0.93  0.98  0.95  0.93  1.00  0.93  0.99\n    dpearson | 1.00  0.70  1.00  0.95  1.00  1.00  0.93  1.00  0.98\n      dchisq | 0.97  0.74  0.98  0.99  0.98  0.97  0.99  0.98  1.00\n<\/pre>\n<p>\nThe distances between sequences show largely the same pattern as the distances between the states.\n<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orge3c1436\" class=\"outline-2\">\n<h2 id=\"orge3c1436\">Agreement between cluster solutions<\/h2>\n<div class=\"outline-text-2\" id=\"text-orge3c1436\">\n<p>\nFrom experience, I know that relatively small differences between distances can yield quite different cluster solutions. 
Let&#8217;s check, doing a cluster analysis and creating an 8-cluster solution for each pairwise distance matrix:\n<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-stata\">local<span style=\"color: #a0522d\"> measures<\/span> = <span style=\"color: #8b2252\">\"flat linear t2s t2sn t2s12 dpct drelres dpearson dchisq\"<\/span>\nforeach measure of local measures  {\n  clustermat wards pwd`<span style=\"color: #a0522d\">measure<\/span>', add name(`<span style=\"color: #a0522d\">measure<\/span>')\n  cluster <span style=\"color: #a020f0\">gen<\/span> <span style=\"color: #a020f0\">g<\/span>`<span style=\"color: #a0522d\">measure<\/span>'=groups(8), name(`<span style=\"color: #a0522d\">measure<\/span>')\n}\n<\/pre>\n<\/div>\n<div class=\"org-src-container\">\n<pre class=\"src src-stata\">matrix corr4 = J(9,9,1)\nlocal<span style=\"color: #a0522d\"> measures<\/span> = <span style=\"color: #8b2252\">\"flat linear t2s t2sn t2s12 dpct drelres dpearson dchisq\"<\/span>\nforvalues i = 1\/9 {\n  local<span style=\"color: #a0522d\"> row<\/span> : word `<span style=\"color: #a0522d\">i<\/span>' of `<span style=\"color: #a0522d\">measures<\/span>'\n  forvalues j = `=`<span style=\"color: #a0522d\">i<\/span>'+1'\/9 {\n    local<span style=\"color: #a0522d\"> col<\/span> : word `<span style=\"color: #a0522d\">j<\/span>' of `<span style=\"color: #a0522d\">measures<\/span>'\n    qui ari <span style=\"color: #a020f0\">g<\/span>`<span style=\"color: #a0522d\">row<\/span>' <span style=\"color: #a020f0\">g<\/span>`<span style=\"color: #a0522d\">col<\/span>'\n    matrix corr4[`<span style=\"color: #a0522d\">i<\/span>',`<span style=\"color: #a0522d\">j<\/span>'] = r(ari)\n    matrix corr4[`<span style=\"color: #a0522d\">j<\/span>',`<span style=\"color: #a0522d\">i<\/span>'] = r(ari)\n  }\n}\nset l<span style=\"color: #228b22\">inesize<\/span> 150\nmatrix rownames corr4 = `<span style=\"color: #a0522d\">measures<\/span>'\nmatrix colnames corr4 = `<span style=\"color: 
#a0522d\">measures<\/span>'\nmatlist corr4, nohalf <span style=\"color: #a020f0\">format<\/span>(%3.2f)\n<\/pre>\n<\/div>\n<pre class=\"example\" id=\"org3fc2a01\">\n             | flat  line   t2s  t2sn  t2s1  dpct  drel  dpea  dchi\n-------------+------------------------------------------------------\n        flat | 1.00  0.56  0.57  0.63  0.52  0.57  0.62  0.57  0.60\n      linear | 0.56  1.00  0.45  0.57  0.48  0.45  0.59  0.45  0.55\n         t2s | 0.57  0.45  1.00  0.54  0.79  1.00  0.55  1.00  0.68\n        t2sn | 0.63  0.57  0.54  1.00  0.60  0.54  0.81  0.54  0.74\n       t2s12 | 0.52  0.48  0.79  0.60  1.00  0.79  0.65  0.79  0.78\n        dpct | 0.57  0.45  1.00  0.54  0.79  1.00  0.55  1.00  0.68\n     drelres | 0.62  0.59  0.55  0.81  0.65  0.55  1.00  0.55  0.75\n    dpearson | 0.57  0.45  1.00  0.54  0.79  1.00  0.55  1.00  0.68\n      dchisq | 0.60  0.55  0.68  0.74  0.78  0.68  0.75  0.68  1.00\n<\/pre>\n<\/div>\n<\/div>\n<div id=\"outline-container-orga7a3dc4\" class=\"outline-2\">\n<h2 id=\"orga7a3dc4\">Row and column focus<\/h2>\n<div class=\"outline-text-2\" id=\"text-orga7a3dc4\">\n<p>\nBy default, all these measures are row oriented, and use values as illustrated in this table (the T2S measure only uses x42 and x24 here):\n<\/p>\n<table border=\"2\" cellspacing=\"0\" cellpadding=\"6\" rules=\"all\">\n<caption class=\"t-above\"><span class=\"table-number\">Table 1:<\/span> Standard row-wise view of data for states I and J<\/caption>\n<colgroup>\n<col class=\"org-right\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<\/colgroup>\n<thead>\n<tr>\n<th scope=\"col\" class=\"org-right\">&#xa0;<\/th>\n<th scope=\"col\" class=\"org-left\">1<\/th>\n<th scope=\"col\" class=\"org-left\">2<\/th>\n<th scope=\"col\" class=\"org-left\">3<\/th>\n<th scope=\"col\" class=\"org-left\">4<\/th>\n<th scope=\"col\" 
class=\"org-left\">5<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"org-right\">1<\/td>\n<td class=\"org-left\">&#xa0;<\/td>\n<td class=\"org-left\">&#xa0;<\/td>\n<td class=\"org-left\">&#xa0;<\/td>\n<td class=\"org-left\">&#xa0;<\/td>\n<td class=\"org-left\">&#xa0;<\/td>\n<\/tr>\n<tr>\n<td class=\"org-right\">2<\/td>\n<td class=\"org-left\">x21<\/td>\n<td class=\"org-left\">x22<\/td>\n<td class=\"org-left\">x23<\/td>\n<td class=\"org-left\">x24<\/td>\n<td class=\"org-left\">x25<\/td>\n<\/tr>\n<tr>\n<td class=\"org-right\">3<\/td>\n<td class=\"org-left\">&#xa0;<\/td>\n<td class=\"org-left\">&#xa0;<\/td>\n<td class=\"org-left\">&#xa0;<\/td>\n<td class=\"org-left\">&#xa0;<\/td>\n<td class=\"org-left\">&#xa0;<\/td>\n<\/tr>\n<tr>\n<td class=\"org-right\">4<\/td>\n<td class=\"org-left\">x41<\/td>\n<td class=\"org-left\">x42<\/td>\n<td class=\"org-left\">x43<\/td>\n<td class=\"org-left\">x44<\/td>\n<td class=\"org-left\">x45<\/td>\n<\/tr>\n<tr>\n<td class=\"org-right\">5<\/td>\n<td class=\"org-left\">&#xa0;<\/td>\n<td class=\"org-left\">&#xa0;<\/td>\n<td class=\"org-left\">&#xa0;<\/td>\n<td class=\"org-left\">&#xa0;<\/td>\n<td class=\"org-left\">&#xa0;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nThere is no particular reason not to use column data as in this table (for T2S, x42 and x24 are now calculated as inflow percentages):\n<\/p>\n<table border=\"2\" cellspacing=\"0\" cellpadding=\"6\" rules=\"all\">\n<caption class=\"t-above\"><span class=\"table-number\">Table 2:<\/span> Corresponding column-wise view of data for states I and J<\/caption>\n<colgroup>\n<col class=\"org-right\" \/>\n<col class=\"org-right\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-right\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-right\" \/>\n<\/colgroup>\n<thead>\n<tr>\n<th scope=\"col\" class=\"org-right\">&#xa0;<\/th>\n<th scope=\"col\" class=\"org-right\">1<\/th>\n<th scope=\"col\" class=\"org-left\">2<\/th>\n<th scope=\"col\" class=\"org-right\">3<\/th>\n<th scope=\"col\" 
class=\"org-left\">4<\/th>\n<th scope=\"col\" class=\"org-right\">5<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"org-right\">1<\/td>\n<td class=\"org-right\">&#xa0;<\/td>\n<td class=\"org-left\">x12<\/td>\n<td class=\"org-right\">&#xa0;<\/td>\n<td class=\"org-left\">x14<\/td>\n<td class=\"org-right\">&#xa0;<\/td>\n<\/tr>\n<tr>\n<td class=\"org-right\">2<\/td>\n<td class=\"org-right\">&#xa0;<\/td>\n<td class=\"org-left\">x22<\/td>\n<td class=\"org-right\">&#xa0;<\/td>\n<td class=\"org-left\">x24<\/td>\n<td class=\"org-right\">&#xa0;<\/td>\n<\/tr>\n<tr>\n<td class=\"org-right\">3<\/td>\n<td class=\"org-right\">&#xa0;<\/td>\n<td class=\"org-left\">x32<\/td>\n<td class=\"org-right\">&#xa0;<\/td>\n<td class=\"org-left\">x34<\/td>\n<td class=\"org-right\">&#xa0;<\/td>\n<\/tr>\n<tr>\n<td class=\"org-right\">4<\/td>\n<td class=\"org-right\">&#xa0;<\/td>\n<td class=\"org-left\">x42<\/td>\n<td class=\"org-right\">&#xa0;<\/td>\n<td class=\"org-left\">x44<\/td>\n<td class=\"org-right\">&#xa0;<\/td>\n<\/tr>\n<tr>\n<td class=\"org-right\">5<\/td>\n<td class=\"org-right\">&#xa0;<\/td>\n<td class=\"org-left\">x52<\/td>\n<td class=\"org-right\">&#xa0;<\/td>\n<td class=\"org-left\">x54<\/td>\n<td class=\"org-right\">&#xa0;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nThe <code>rcb()<\/code> option allows you to use either row data or column data, and offers a &#8220;both&#8221; option to give the Euclidean sum of the two.\n<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-stata\">reshape long\nsdmksubs, state(state) dtype(pct) subsmat(dpctr) id(id) rcb(r)\nsdmksubs, state(state) dtype(pct) subsmat(dpctc) id(id) rcb(c)\nsdmksubs, state(state) dtype(pct) subsmat(dpctb) id(id) rcb(b)\ncorrsqm dpctr dpctc\ncorrsqm dpctr dpctb\ncorrsqm dpctc dpctb\n<\/pre>\n<\/div>\n<pre class=\"example\" id=\"org0f6ea87\">\n(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 5\n2 53 54 55 
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72)\nData                               Wide   -&gt;   Long\n-----------------------------------------------------------------------------\nNumber of observations              712   -&gt;   51,264\nNumber of variables                 123   -&gt;   53\nj variable (72 values)                    -&gt;   t\nxij variables:\n              state1 state2 ... state72   -&gt;   state\n-----------------------------------------------------------------------------\nGenerating substitution matrix type pct\nScaling to 0--1 range\nGenerating substitution matrix type pct\nScaling to 0--1 range\nGenerating substitution matrix type pct\nScaling to 0--1 range\nVECH correlation between dpctr and dpctc: 0.9997\nVECH correlation between dpctr and dpctb: 0.9999\nVECH correlation between dpctc and dpctb: 0.9999\n<\/pre>\n<\/div>\n<\/div>\n<div id=\"outline-container-org0c9a9f6\" class=\"outline-2\">\n<h2 id=\"org0c9a9f6\">Installation<\/h2>\n<div class=\"outline-text-2\" id=\"text-org0c9a9f6\">\n<p>\n<code>sdmksubs<\/code> is part of the latest version of SADI. For now this is available only from <a href=\"http:\/\/teaching.sociology.ul.ie\/sadi\">http:\/\/teaching.sociology.ul.ie\/sadi<\/a>. 
In due course, a stable update of SADI will be uploaded to SSC.\n<\/p>\n<p>\nTo install, do the following from within Stata:\n<\/p>\n<pre class=\"example\" id=\"org8ef5f8c\">\nnet from http:\/\/teaching.sociology.ul.ie\/sadi\nnet install sadi\nnet get sadi\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Table of Contents A new command Simple matrices Data based matrices Traditional transitions-based substitution matrix Other metrics Correlations between the measures Correlations between sequence distances Agreement between cluster solutions Row and column focus Installation A new command In this blog I introduce a new utility command, sdmksubs, which creates substitution cost matrices for sequence analysis, &hellip; <a href=\"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/?p=796\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">SDMKSUBS: A new SADI command for substitution costs<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7,4],"tags":[29,30],"_links":{"self":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/796"}],"collection":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=796"}],"version-history":[{"count":2,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/
posts\/796\/revisions"}],"predecessor-version":[{"id":798,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/796\/revisions\/798"}],"wp:attachment":[{"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=796"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=796"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/teaching.sociology.ul.ie\/bhalpin\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=796"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}