# SDMKSUBS: A new SADI command for substitution costs

## A new command

In this blog I introduce a new utility command, sdmksubs, which creates substitution cost matrices for sequence analysis, as part of the SADI Stata add-ons.

Most sequence-analysis distance commands require the specification of substitution costs, which describe the pattern of differences in the state space through which the sequences move. These can be derived from theory, from external data, or can be imposed by researcher fiat. It is also common to use the pattern of transitions in the sequence data to derive them, though this is not an unproblematically good idea. The existing trans2subs command in SADI calculates simple transition-based substitution costs. The new sdmksubs calculates this substitution cost structure, and a range of others, some simple “theoretical” ones, and some based on the transition pattern, but taking more of the data into account than the traditional trans2subs matrix.
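To make the idea of transition-based costs concrete, here is a minimal sketch in Python (not Stata, and not the SADI code). It assumes the conventional formula cost(a, b) = 2 - p(a→b) - p(b→a), where p is the observed transition rate; whether trans2subs or sdmksubs use exactly this form is an assumption here, and the function name and toy sequences are hypothetical.

```python
from collections import Counter

def transition_costs(sequences, states):
    """Substitution costs from transition rates: cost(a, b) = 2 - p(a->b) - p(b->a)."""
    pair_counts = Counter()   # (from, to) -> number of observed transitions
    from_counts = Counter()   # from -> number of transitions starting in that state
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            from_counts[a] += 1

    def rate(a, b):
        # Transition rate p(a -> b); zero if state a is never observed before the end.
        return pair_counts[(a, b)] / from_counts[a] if from_counts[a] else 0.0

    return {
        (a, b): 0.0 if a == b else 2.0 - rate(a, b) - rate(b, a)
        for a in states for b in states
    }

# Hypothetical toy sequences over states E(mployed), U(nemployed), N(on-active)
toy = ["EEEUUE", "EENNNN", "UUEEEE"]
costs = transition_costs(toy, "EUN")
for (a, b), c in costs.items():
    print(a, b, round(c, 3))
```

States that never exchange members (U and N in the toy data) get the maximum cost of 2, while states with frequent transitions between them are treated as closer.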

SADI is a Stata package for sequence analysis of data such as lifecourse histories, and has been around for quite a while. Recent improvements include fixes for internal changes in Stata 18, lifting limits on sequence length, etc., but here I focus on sdmksubs only.

# Writing a Stata Command

In an idle moment this afternoon, I wrote a Stata command.

It was to create a light-weight implementation of the “percentogram” described at https://statmodeling.stat.columbia.edu/2023/04/13/the-percentogram-a-histogram-binned-by-percentages-of-the-cumulative-distribution-rather-than-using-fixed-bin-widths/, and I like the result, but it struck me that it is a good example of how practical and useful it can be to engage in Stata programming. Also, it’s a good example of how writing code in Stata (in a programmable command language) is very different from writing code in a stats-capable programming language like R, Python or Julia.
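For readers who have not met the percentogram, the core computation can be sketched in a few lines of Python. This is only an illustration of the idea (bins that each hold an equal share of the data, so bar height is inversely proportional to bin width), not the Stata command described here; the function name and the linear-interpolation quantile rule are assumptions.

```python
def percentogram(values, n_bins=10):
    """Return (left, right, density) per bin, each bin holding about 1/n_bins of the data."""
    xs = sorted(values)
    n = len(xs)

    def quantile(p):
        # Linear interpolation between adjacent order statistics.
        pos = p * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    edges = [quantile(k / n_bins) for k in range(n_bins + 1)]
    bins = []
    for left, right in zip(edges, edges[1:]):
        width = right - left
        # Equal mass per bin, so density = mass / width; tied edges give zero width.
        density = (1 / n_bins) / width if width > 0 else float("inf")
        bins.append((left, right, density))
    return bins

for left, right, density in percentogram(range(101), n_bins=4):
    print(left, right, density)
```

Plotting each tuple as a bar from `left` to `right` with height `density` gives the percentogram; narrow bars mark regions where the data are dense.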


# Twitter activity after the ElMuskalypse

## Premise

Did Twitter lose activity since the ElMuskalypse? Is my timeline quieter than it used to be when I was reading it? I mothballed my account around the end of November, after Elon Musk took over. How much activity have I been missing? To what extent have the people I followed also stepped back from Twitter?

How would you measure activity of your Twitter (ex-)timeline? Using the Twitter API, how would you assess whether the people you follow are more or less active? The simplest idea is to download the tweeting history of everyone you followed, and tot up their tweets by day. In principle that’s easy to do, if you have access to the Twitter API (and it still works). But it turns out it’s a bit more complicated than that, if you want to use this data to characterise how your timeline would have behaved in the interim.
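The "tot up their tweets by day" step might look like the following Python sketch. It assumes the tweet timestamps have already been downloaded and are available as ISO 8601 strings; the sample values and the function name are hypothetical, and nothing here calls the Twitter API.

```python
from collections import Counter
from datetime import datetime

def tweets_per_day(timestamps):
    """Count tweets per calendar day from ISO 8601 timestamp strings."""
    days = Counter(datetime.fromisoformat(ts).date() for ts in timestamps)
    return dict(sorted(days.items()))

# Hypothetical timestamps pooled from all the accounts followed
sample = ["2022-11-01T09:15:00", "2022-11-01T17:40:00", "2022-11-03T08:05:00"]
daily = tweets_per_day(sample)
for day, count in daily.items():
    print(day, count)
```

Days with no tweets simply do not appear in the result, so they would need to be filled in with zeros before plotting a daily series.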

# Pinging Mastodon instances

Mastodon is less discoverable than Twitter, particularly because it doesn’t have full-text search, and because it has multiple instances. How do we know what’s going on, or which instances are particularly active?

There is a webpage of instance data somewhere. That is a good starting point.

But what about propagation of individual posts? How do they get seen, how quickly do other instances pick them up, which other instances? Here I summarise a quick experiment that exploits a feature of the Mastodon software’s behaviour to map out some details of how a post is seen by other instances.

# Viral Variants

## Scary mutant COVID-19

Since the news of a potentially more transmissible strain of C19 broke in the UK, I’ve been thinking about the mechanics.

I was initially sceptical because it seemed to help politicians evade blame. Given the earlier fuss about a strain spread by holidaymakers returning from Spain in the summer, how much of the emergence of a new strain was down to network effects (common because it was, for example, present in a number of super-spreader events) versus inherent infectiousness?

So I simulated, with a workbench of 160000 agents in a 2D grid network, 1600 seed infections, and daily contact between 23 of the nearest 24 neighbours, and one random remote case. Base infectiousness, death and recovery rates are set to yield an R0 of about 3 and a SIR plot as in Figure 1.
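A much smaller version of this kind of workbench can be sketched in Python. This is not the simulation behind the post: the grid wraps around at the edges, death and recovery are merged into one "removed" state, and the parameter values are assumptions chosen so that 24 contacts × 0.005 / 0.04 gives a naive R0 of about 3.

```python
import random

def simulate(side=40, seeds=16, p_infect=0.005, p_remove=0.04, days=120, seed=1):
    """Grid SIR sketch: returns a list of daily (susceptible, infectious, removed) counts."""
    rng = random.Random(seed)
    n = side * side
    S, I, R = 0, 1, 2
    state = [S] * n
    for i in rng.sample(range(n), seeds):
        state[i] = I
    # Offsets of the 24 nearest neighbours: the 5x5 block minus the centre.
    offsets = [(dx, dy) for dx in range(-2, 3) for dy in range(-2, 3) if (dx, dy) != (0, 0)]
    history = []
    for _ in range(days):
        new_state = state[:]
        for i, s in enumerate(state):
            if s != I:
                continue
            x, y = divmod(i, side)
            # Contact 23 of the 24 neighbours (grid wraps around: an assumption).
            local = rng.sample(offsets, 23)
            contacts = [((x + dx) % side) * side + (y + dy) % side for dx, dy in local]
            contacts.append(rng.randrange(n))  # one random remote contact
            for j in contacts:
                if state[j] == S and rng.random() < p_infect:
                    new_state[j] = I
            if rng.random() < p_remove:
                new_state[i] = R  # recovery and death merged into "removed"
        state = new_state
        history.append((state.count(S), state.count(I), state.count(R)))
    return history

history = simulate()
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print("peak infectious:", history[peak_day][1], "on day", peak_day + 1)
```

Comparing strains would mean tracking which variant each infectious agent carries and giving the variants different `p_infect` values, which is left out of this sketch.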

# Genetics and Sociology

I was googling terms related to social science and genetics, and I happened on TCD’s research website, where they have a theme for “Genetics and Society”, selling their research in the area. Mostly anodyne boosterism but I tripped up at what they had to say about sociology:

In Trinity, multi-disciplinary teams work together to address key questions;

[…]

In Sociology: how will individuals and society benefit from this information?

Such weak sauce. They obviously didn’t talk to any knowledgeable sociologists.

It triggered a rant (though to be fair, about points I’d been thinking about already, concerning the relationship between the social sciences and the new biology, so this isn’t all really at TCD’s expense).

# Correlations, smoothed time-series and sewage sludge

A very nice idea: search for evidence of COVID-19 RNA in municipal wastewater, as a cheap and fast form of public health surveillance. A pre-print shows that this works well, in a trial in Connecticut. I think the evidence is in their favour, but they commit two cardinal errors: first, they report a correlation (well, a squared correlation) between time-series and second, they do it on smoothed data. Autocorrelation means time-series may have vastly inflated and/or spurious correlations, and stripping the noise out of variables removes the noise from the comparison, making it seem, well, much less noisy than it is.
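The effect of smoothing on a correlation is easy to demonstrate with simulated data. The Python sketch below uses entirely synthetic series (a shared slow sine wave plus independent noise, with a 7-point moving average); none of it is the wastewater data, and the window length and noise level are arbitrary assumptions.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def moving_average(x, window=7):
    """Simple unweighted moving average; output is window - 1 points shorter."""
    return [sum(x[i:i + window]) / window for i in range(len(x) - window + 1)]

rng = random.Random(42)
signal = [math.sin(2 * math.pi * t / 200) for t in range(2000)]
x = [s + rng.gauss(0, 1) for s in signal]   # shared signal + independent noise
y = [s + rng.gauss(0, 1) for s in signal]

r_raw = pearson(x, y)
r_smooth = pearson(moving_average(x), moving_average(y))
print(f"raw r = {r_raw:.2f}, smoothed r = {r_smooth:.2f}")
```

The underlying relationship is identical in both comparisons; smoothing only removes the noise that would otherwise pull the correlation down, so the smoothed figure says little about how well one raw series predicts the other.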

This is one of their key results: the smoothed RNA curve looks just like the smoothed hospital admissions curve, with a lead of about 3 days:

They report an R² of 0.99 for this relationship.

However, they also show the data. Given there are 2 series for 44 days, we can pick this off the graph without too much effort:

(This is prompted by @lycraolaoghaire’s tweets: https://twitter.com/lycraolaoghaire/status/1265251252239286272?s=20).

It turns out that the correlation between the RNA measurement and hospital admissions is 0.357 (R² = 0.13). If we lag by one day, the R² rises to a very respectable 0.45, but declines again if we lag by 2 (0.22) or 3 (0.22) days. In other words, there is a real signal here, but it is vastly overstated by R² = 0.99, and the lead it gives is not as big as claimed.
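The lag comparison amounts to shifting one series against the other and recomputing R² each time. A minimal Python sketch follows; the two short series are made up for illustration (the second is the first shifted by one day with a small perturbation) and are not the screen-picked data.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_r2(lead, target, lag):
    """R-squared between lead[t - lag] and target[t]."""
    if lag:
        lead, target = lead[:-lag], target[lag:]
    return pearson(lead, target) ** 2

# Hypothetical series: admissions follow the RNA signal one day later
rna = [3, 5, 4, 8, 9, 7, 12, 11, 15, 14, 18, 16]
admissions = [2] + [v + (1 if i % 2 else -1) for i, v in enumerate(rna[:-1])]

r2_by_lag = {lag: lagged_r2(rna, admissions, lag) for lag in range(4)}
for lag, r2 in r2_by_lag.items():
    print(f"lag {lag}: R2 = {r2:.2f}")
```

Each extra day of lag drops one observation from the comparison, which matters with a series as short as 44 days.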

Predicting hospital admissions using lagged RNA values, with lags of 1 to 5, and then all five lags together (green line) looks like this:

This is a much less impressive graph than the original, but it is picking up something. Most of the work is done by the one-day lag, which has a clear effect, and the combined 5-lag model isn’t better (by LR-test) than the L1 model only. However, using this technique very widely as a passive surveillance technique is going to pick up unexpected large shifts in disease RNA, which is much more important than being able to predict moderate changes in hospitalisation from moderate changes in RNA presence in sewage sludge.

Screen-picked data available here, no warranties.

# COVID-19 deaths: NI and IRL compared

Mike Tomlinson has created a certain amount of controversy by asserting that Northern Ireland’s COVID-19 death rate is disproportionate with that of the Republic (see article). In particular, he notes that the per capita rate of deaths in hospital settings (which is all that is reported for NI) is higher than that for the Republic (which normally reports all deaths, but for which the hospital deaths figure is also available). For instance, yesterday’s data says the cumulative figure for hospital deaths in the RoI is 386, while for NI it was 250.

Scaling by the relative populations, that suggests an expected NI hospital death count of 386 * 1.891 / 4.904 = 148.8. 250 is a lot more than 149, even allowing for some incomparability in how the stats are collected.
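The scaling can be checked in a few lines of Python, using only the figures quoted above (populations in millions); the variable names are just labels for this sketch.

```python
roi_hospital_deaths = 386
ni_hospital_deaths = 250
ni_population_m = 1.891    # millions, as used in the text
roi_population_m = 4.904   # millions, as used in the text

# NI hospital deaths expected if NI had the RoI per-capita rate
expected_ni = roi_hospital_deaths * ni_population_m / roi_population_m
ratio = ni_hospital_deaths / expected_ni

print(f"expected NI hospital deaths at the RoI rate: {expected_ni:.1f}")
print(f"observed / expected: {ratio:.2f}")
```

On these figures the observed NI count is roughly 1.7 times what the RoI rate would imply.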

# Shiny apps for distributions

For years I have taught students to read printed statistical tables: the Standard Normal Distribution, the t-Distribution, the chi-square Distribution. I want them to do certain tasks (e.g., construct a confidence interval) “by hand” a few times, rather than in Stata, so that they understand what it is doing. I also want them to be able to do it with no more than a calculator, in the final exam.

For the past few years I’ve been working with R-Shiny to develop web-apps, which allow exploration of a concept, self-learning exercises and self-marking assessments. I also use it increasingly in class to demonstrate ideas. I’ve been tempted to replace the paper distribution tables with online versions, but have been holding back because of the pen and paper exam.

# Bicycle schemes need big cities

In larger cities such as Lyon or even Dublin, bikeshare schemes are quite successful. In smaller ones like Limerick they struggle. I am convinced the problem is critical mass. As a scheme gets bigger, it provides disproportionately more possible useful journeys (as long as there is the population density to support it).

I want to model this. Let’s start by imagining cities that are big enough to sustain a square grid of bike stations, and let’s count the number of possible A-B journeys it provides (of different distances).
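As a first pass at that count, the Python sketch below enumerates every unordered pair of stations in a square grid and tallies them by distance. Measuring distance in grid steps (Manhattan distance) and treating A-B and B-A as one journey are assumptions of this sketch, not necessarily the choices made in the full post.

```python
from collections import Counter

def journeys_by_distance(side):
    """Count unordered station pairs in a side x side grid by Manhattan distance."""
    stations = [(x, y) for x in range(side) for y in range(side)]
    counts = Counter()
    for i, (x1, y1) in enumerate(stations):
        for x2, y2 in stations[i + 1:]:
            counts[abs(x1 - x2) + abs(y1 - y2)] += 1
    return dict(sorted(counts.items()))

for side in (2, 3, 4, 5):
    by_dist = journeys_by_distance(side)
    print(side * side, "stations:", sum(by_dist.values()), "journeys", by_dist)
```

With n stations there are n(n-1)/2 possible journeys, so the number of journeys grows roughly with the square of the number of stations, which is the disproportion the argument relies on.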