Sequence Analysis software and materials ─ Brendan Halpin

See below for papers and talk slides

Sequence analysis utilities for Stata

September 2017: SADI in the Stata Journal

The most up-to-date version of SADI is now described in the Stata Journal. A number of commands have changed names in order to be compatible with StataCorp requirements, but otherwise there are no large changes. (Preprint at http://teaching.sociology.ul.ie/bhalpin/sadisjmain-local.pdf)

Autumn 2015: SADI on SSC

SADI is now available directly from the main Stata add-on archive, SSC:


  ssc install sadi

February 2015: Updated SADI

SADI has been updated to by-pass Stata's limitation on matrix size, meaning that now more than 11,000 sequences can be compared.

April 2014: Updated SADI

An update to the SADI package was released on 3 April. It includes many small improvements, among them more stable plugins, and adds Studer et al's discrepancy measure.

Nov 9 2011: New Version

A new version of the sequence analysis add-ons for Stata is now available from http://teaching.sociology.ul.ie/sadi/sadi.pkg. (Use Stata to download.) There are two main differences from the previous implementation: first, the plugin it contains is now compiled for 32- and 64-bit Windows, and 32-bit Linux; second, duplicate sequences are not used in the pairwise distance calculations, though complete N times N matrices are still created. The latter change substantially reduces the time taken if there are many duplicates.
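The duplicate-skipping idea can be illustrated with a short Python sketch. This is not the plugin's code: the function name `pairwise_with_duplicates` and the `dist` argument are hypothetical, and the sketch only shows the general approach of computing distances among distinct sequences and then expanding the result to the full N times N matrix.

```python
def pairwise_with_duplicates(seqs, dist):
    """Build a full N x N distance matrix, calling dist() only on distinct sequences.

    Illustrative only: `dist` is any symmetric distance function supplied by the caller.
    """
    index_of = {}        # distinct sequence -> its index among the distinct sequences
    row_to_unique = []   # for each original row, the index of its distinct sequence
    for s in seqs:
        key = tuple(s)
        if key not in index_of:
            index_of[key] = len(index_of)
        row_to_unique.append(index_of[key])

    uniques = list(index_of)  # dicts keep insertion order
    u = len(uniques)
    small = [[0.0] * u for _ in range(u)]
    for i in range(u):
        for j in range(i + 1, u):
            small[i][j] = small[j][i] = dist(uniques[i], uniques[j])

    # Expand the small matrix back to one row and column per original sequence.
    return [[small[a][b] for b in row_to_unique] for a in row_to_unique]
```

With N sequences of which only U are distinct, this makes U(U-1)/2 distance calculations rather than N(N-1)/2.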

Three distance measures are provided in this package: Hamming distance, standard OM and my OMv (see Halpin 'Optimal Matching Analysis and Life Course Data: the importance of duration', Sociological Methods and Research, 38 (3), 2010). Note that OMv is not guaranteed to generate metric distances.
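For readers unfamiliar with the first two measures, here is a minimal Python sketch of the textbook definitions. It is not the SADI implementation (which runs as a compiled plugin), OMv is not covered, and the function names, the `flat_cost` helper and the cost values are assumptions made for the example.

```python
def hamming(a, b, subs_cost):
    """Sum of substitution costs position by position; needs equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(subs_cost(x, y) for x, y in zip(a, b))


def optimal_matching(a, b, subs_cost, indel):
    """Standard OM distance by dynamic programming (Needleman-Wunsch style)."""
    n, m = len(a), len(b)
    # d[i][j] = cheapest way to turn the first i tokens of a into the first j tokens of b
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                             # delete a[i-1]
                d[i][j - 1] + indel,                             # insert b[j-1]
                d[i - 1][j - 1] + subs_cost(a[i - 1], b[j - 1]),  # substitute or match
            )
    return d[n][m]


def flat_cost(x, y):
    """Hypothetical substitution cost: 0 for identical states, 2 otherwise."""
    return 0.0 if x == y else 2.0
```

With these costs, `hamming("AABBC", "ABBCC", flat_cost)` is 4.0 (two mismatched positions), while `optimal_matching("AABBC", "ABBCC", flat_cost, 1.0)` is 2.0, because one deletion and one insertion align the sequences more cheaply than two substitutions.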

Several utility functions are also included:

trans2subs creates substitution-cost matrices based on the observed pattern of transitions
stripe creates string representations of the sequences, which allows you to use Stata's regular-expression functions to summarise them
metricp tests the pairwise distance matrix for the triangle inequality (note OMv will often fail this test!)
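The ideas behind these three utilities can be sketched in Python. The sketch is illustrative only and does not reproduce the Stata commands: the function names are hypothetical, and the cost formula 2 - p(a->b) - p(b->a) is one common convention for transition-based substitution costs, assumed here rather than taken from trans2subs.

```python
import re
from collections import Counter


def transition_costs(seqs):
    """Substitution costs from observed transition rates (assumed formula, see above)."""
    counts = Counter()   # (origin, destination) -> number of observed transitions
    origins = Counter()  # origin -> number of transitions leaving it
    for s in seqs:
        for x, y in zip(s, s[1:]):
            counts[(x, y)] += 1
            origins[x] += 1
    states = sorted({x for s in seqs for x in s})

    def rate(x, y):
        return counts[(x, y)] / origins[x] if origins[x] else 0.0

    return {(x, y): 0.0 if x == y else 2.0 - rate(x, y) - rate(y, x)
            for x in states for y in states}


def as_string(seq):
    """One character per time point, so regular expressions can summarise the sequence."""
    return "".join(seq)


def triangle_violations(d, tol=1e-9):
    """All (i, j, k) for which d[i][j] > d[i][k] + d[k][j], i.e. the matrix is not metric."""
    n = len(d)
    return [(i, j, k) for i in range(n) for j in range(n) for k in range(n)
            if d[i][j] > d[i][k] + d[k][j] + tol]
```

For example, `re.search(r"U{3,}", as_string(["E", "U", "U", "U", "E"]))` finds a spell of three or more consecutive periods in state U, and an empty list from `triangle_violations` means the distances satisfy the triangle inequality.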
Finally, permtab, permtabga and ari allow us to compare cluster solutions. ari calculates the Adjusted Rand Index, which indexes the level of agreement between two unlabelled classifications of the same size, while permtab and permtabga permute the values of one of the classifications to maximise the agreement, and return the permutation. permtabga uses a genetic algorithm to provide an approximate solution, as permutations of more than 8-10 elements take infeasibly long.
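To make the two ideas concrete, here is a minimal Python sketch of the Adjusted Rand Index and of an exhaustive search over relabellings. It is an illustration under stated assumptions, not the code behind ari or permtab: the function names are hypothetical, and the search simply tries every permutation, which is why it becomes infeasible beyond a handful of categories.

```python
from collections import Counter
from itertools import permutations
from math import comb


def adjusted_rand_index(x, y):
    """Adjusted Rand Index between two classifications of the same cases."""
    n = len(x)
    if n < 2:
        return 1.0
    pairs_both = sum(comb(c, 2) for c in Counter(zip(x, y)).values())
    pairs_x = sum(comb(c, 2) for c in Counter(x).values())
    pairs_y = sum(comb(c, 2) for c in Counter(y).values())
    expected = pairs_x * pairs_y / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_x + pairs_y) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both classifications have a single cluster
    return (pairs_both - expected) / (maximum - expected)


def best_permutation(x, y):
    """Relabel y to maximise the number of cases on which it agrees with x (brute force)."""
    labels = sorted(set(x) | set(y))
    best_map, best_agree = None, -1
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        agree = sum(a == mapping[b] for a, b in zip(x, y))
        if agree > best_agree:
            best_map, best_agree = mapping, agree
    return best_map, best_agree
```

With k categories the search visits k! relabellings (3,628,800 for k = 10), which is the motivation for an approximate method such as a genetic algorithm.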

Installation

net from http://teaching.sociology.ul.ie/sadi
net install sadi

This code uses functions from the moremata package, so you may need to do ssc install moremata, and restart Stata, before using the sadi commands.

If you have any problems installing or running these utilities, please let me know at brendan.halpin@ul.ie.

Papers, talks and lectures

University of Umeå, March 2017

Notes from a short course on sequence analysis given at the University of Umeå, Sweden.

Academia Sinica, Taipei, August 2016

Notes from a short course on sequence analysis given to students of the Institute of Sociology, Academia Sinica, Taipei, Taiwan.

Universities of Bergen & Oslo, June 2015

Notes from a short course on sequence analysis given to students of the Universities of Bergen and Oslo, in Oslo.

RSS/SLSS June 9 2015: MICT

Talk about Multiple Imputation for Categorical Time-series.

Non-self-identical missing values, Oxford Feb 2015

Slides from my talk to the Workshop on Algorithmic Social Research, Nuffield College, Oxford, Feb 27 2015.

Stata German User Group, Hamburg June 2014

Slides from my talk to the Hamburg SUG.

More on imputing sequence data

In this paper I describe my imputation of missing data in sequences in greater detail.

Lausanne Conference on Sequence Analysis

I presented this paper on multiple imputation for gaps in lifecourse sequences at the Lausanne Conference on Sequence Analysis, in June 2012.

Journées Trajectoires, Paris: "Simulating Sequences"

Slides from a one-day conference at the Université Paris-I in October 2011. There is also a recording of the presentation.

Helsinki talk, May 2010

Slides and other details from a presentation to the Helsinki Collegium for Advanced Studies, May 2010, are here.

OFPR Course, Paris May 2009

In May 2009 I spent a month in Paris as a guest of CREST, presenting an occasional course to PhD students from institutions across Paris, under the "Option Formation par la Recherche" scheme. Slides for my lectures are here.

QMSS2, Oslo

Slides from my talk to the QMSS2 conference in Oslo, October 2008.

RC33, Naples

I gave two papers to the RC33 Conference in Naples, September 2008:

One arguing about substitution costs and
One presenting time-warping.

Frontiers in Social and Economic Mobility, Cornell, 2003

My paper to the "Frontiers in Social and Economic Mobility" conference in Cornell in March 2003 is available as Departmental Working Paper WP2003-01. I had the honour of sharing the session with Andrew Abbott and Larry Wu.

Older material: for reference only

Here I make available a number of utilities for Stata related to sequence analysis (including optimal matching). Some of this material relates to the short course I gave in the Essex Summer School in July 2007.

Copy the relevant files to a directory in Stata's "adopath". (All the relevant files in a single zip file are here.)

Older material: Duration-adjusted Optimal Matching

The adapted Needleman-Wunsch algorithm used by the omav command is designed to treat tokens differently according to the length of the spell in which they occur. This is intended to give better results than conventional OM when used with life-course data. A preliminary discussion of the algorithm is available in these talk slides.
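As a small illustration of the ingredient such an algorithm needs, the Python sketch below attaches to each token the length of the spell it belongs to. It does not reproduce omav: the function name is hypothetical, and how the costs are then adjusted by spell length is left to the talk slides and the 2010 paper.

```python
from itertools import groupby


def spell_lengths(seq):
    """For each token, the length of the run of identical states it belongs to."""
    out = []
    for _, run in groupby(seq):
        length = len(list(run))
        out.extend([length] * length)  # every token in the spell gets the spell's length
    return out
```

For the sequence "AAABBC" this gives [3, 3, 3, 2, 2, 1], so a duration-aware algorithm can cost an edit to a token in the long A spell differently from one to the single C.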

Older material: Stata ado and help files

NB THESE FILES ARE OUTDATED

oma.ado (help file): Code to run optimal matching
omav.ado (help file): Implements optimal matching with a correction for continuous spells
combin.ado (help file): Implements Elzinga's X/T method (experimental implementation using a different algorithm from Elzinga's CHESA software)
degenne.ado (help file): Implements so-called "Degenne" methods
hamming.ado (help file): Implements Hamming distance
permtab.ado (help file): Utility to compare pairs of cluster solutions
trans2subs.ado (help file): Utility to generate substitution matrices from observed transition rates

All the relevant files are available in a single zip file here.

Older material: Plugin files

The oma, omav and combin commands are based on C plugins for speed. Implementing them in C rather than in Stata's Mata matrix language yields a 40-fold speed increase. Two versions of each plugin are provided, X.linux.plugin and X.w32.plugin. If you are using 32-bit Windows, copy each X.w32.plugin to X.plugin. For Linux, do the same with X.linux.plugin. If you have another operating system, you may be able to compile the code yourself (see below).

Files related to creating the C plugins

Makefile: Commands to compile the plugins
elzspell.c: C code for combin command
omamatv3.c: C code for oma and omav commands
stplugin.c and stplugin.h: Stata's C code for compiling plugins. Check their site for more up-to-date versions, and for helpful information on compiling plugins
uthash.h: Code used by elzspell.c, from Troy Hanson

Older material: Essex course

The Essex course is described here, and files related to the course are available here. See in particular labs.pdf.

Brendan Halpin
Department of Sociology
University of Limerick