Sequence Analysis

Most methods for the analysis of longitudinal data focus on narrow aspects of the picture
- Wave-on-wave transitions
- Starting point versus current position
- single spells (perhaps pooled)
- summaries such as cumulated duration
It is inherently difficult to model the whole long trajectory
- potentially many spells in different states
- the outcome of continuously operating processes over an extended period
There are many reasons we might want to analyse or summarise whole trajectories
- Mostly exploratory: to get an overview of longitudinal data, much as means and distributions give an overview of cross-sectional data
We may want to know the incidence of different sorts of trajectories, where the types are
- defined a priori by researcher
- defined interactively by researcher
- defined by some formal method
We can simply define rules to classify all sequences. This is useful where the issue is clear, but complicated where the state space is large, and it may rule out finding unexpected sequence types.
We can define groups interactively: first look at the overall distribution of all sequences, and then group them according to their distribution
- Good where the average length is short and the state space is small
- Impractical with long, variable sequences with many states
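As a minimal sketch of that first step, the distribution of whole sequences can be tabulated before any grouping is attempted. The data and the function name `sequence_distribution` are hypothetical, assuming each sequence is stored as a string with one character per wave.

```python
from collections import Counter

# Hypothetical data: one string per person, one character per wave
# (E = employed, U = unemployed, N = not in the labour force).
sequences = ["EEEE", "EEEE", "UEEE", "EEEE", "UUEE", "UEEE", "NNNN"]

def sequence_distribution(seqs):
    """Return (sequence, count, share) tuples, most common first."""
    counts = Counter(seqs)
    total = len(seqs)
    return [(s, n, n / total) for s, n in counts.most_common()]

for seq, n, share in sequence_distribution(sequences):
    print(f"{seq}  {n:3d}  {share:6.1%}")
```

With short sequences and few states the table stays small enough to read and group by eye; with long, varied sequences almost every row would have a count of one.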
Or we can use some automatic algorithm:
- most often some means of defining a similarity score between pairs of sequences,
- and using the pairwise similarity matrix to conduct a cluster analysis.
This allows us to create an `empirical typology' of sequences
We can think of many ways of computing similarity scores between sequences
- count matches on an element-by-element basis
- compare cumulated duration in all states
- look for longest common subsequence
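The three measures just listed can be sketched as follows. This is an illustration only: the function names are hypothetical, sequences are assumed to be strings with one character per time point, and the duration comparison is written as a difference (lower means more similar) rather than a similarity.

```python
from collections import Counter

def matching_elements(a, b):
    """Count positions at which the two sequences are in the same state."""
    return sum(x == y for x, y in zip(a, b))

def duration_difference(a, b):
    """Sum over states of the absolute difference in time spent in each state."""
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[s] - cb[s]) for s in set(ca) | set(cb))

def longest_common_subsequence(a, b):
    """Length of the longest subsequence (not necessarily contiguous) shared by a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

a, b = "EEUUE", "EUUEE"
print(matching_elements(a, b))           # element-by-element matches
print(duration_difference(a, b))         # difference in cumulated durations
print(longest_common_subsequence(a, b))  # length of longest common subsequence
```

The example pair shows how the measures can disagree: the two sequences spend identical time in each state, yet they differ at two positions.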

One general method is `optimal matching', which originated in computer science and is much used in molecular biology
- an efficient algorithm which counts how many operations are needed to turn one sequence into another
- by insertions and deletions: ABCDE becomes ABDE by deletion of one element; it becomes ABDDE by a deletion followed by an insertion
- Each insertion or deletion (`indel') needed `costs' a unit, and the distance between pairs of sequences is the total cost of the cheapest route between them
- Substitutions are also allowed: we may wish to consider given pairs of states as particularly similar, such that a substitution of one for the other should be `cheaper' than insertion of one and deletion of the other
- Depending on the cost settings, the algorithm can allow gaps which permit matching subsequences in different parts of the sequences
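The procedure described above can be sketched with the standard dynamic-programming recursion for edit distance. This is a minimal illustration, not the implementation of any particular package: the names `om_distance`, `indel` and `subst_cost` are hypothetical, and the default substitution cost of twice the indel cost is an assumption, chosen so that a substitution is no cheaper than a deletion plus an insertion unless a cheaper cost is supplied.

```python
def om_distance(a, b, indel=1.0, subst_cost=None):
    """Cheapest total cost of turning sequence a into sequence b."""
    if subst_cost is None:
        # Assumed default: substituting costs the same as one deletion plus one insertion.
        subst_cost = lambda x, y: 2.0 * indel
    # prev[j] holds the cost of turning the part of a seen so far into b[:j].
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel]
        for j, y in enumerate(b, start=1):
            sub = 0.0 if x == y else subst_cost(x, y)
            curr.append(min(prev[j] + indel,       # delete x
                            curr[j - 1] + indel,   # insert y
                            prev[j - 1] + sub))    # match, or substitute y for x
        prev = curr
    return prev[-1]

print(om_distance("ABCDE", "ABDE"))    # one deletion
print(om_distance("ABCDE", "ABDDE"))   # a deletion followed by an insertion
# Treating C and D as similar states makes the substitution the cheaper route.
print(om_distance("ABCDE", "ABDDE", subst_cost=lambda x, y: 1.5))
```

Lowering the substitution cost relative to the indel cost favours element-by-element comparison; raising it favours shifting matching subsequences along the time axis.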
Given a matrix of pairwise costs or distances, cluster analysis will generate a set of groups: an empirical typology
- This typology can then be investigated in interaction with other covariates
- The clusters can also be viewed, which gives a valuable overview of the sample of sequences; there are too many to view unless some order is imposed
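A minimal sketch of the clustering step, assuming average-linkage agglomerative clustering (one of several possible algorithms; the text above does not specify one). The data and the function names `mismatches` and `average_linkage` are hypothetical, and a simple count of mismatching positions stands in for whatever pairwise distance is actually used.

```python
# Hypothetical data: five short sequences of equal length.
sequences = ["EEEE", "EEEU", "UUUU", "UUUE", "EEEE"]

def mismatches(a, b):
    """Stand-in distance: number of positions at which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

# Symmetric matrix of pairwise distances.
dist = [[mismatches(a, b) for b in sequences] for a in sequences]

def average_linkage(dist, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]

    def between(c1, c2):
        # Average distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters)))
        p, q = min(pairs, key=lambda pq: between(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

for members in average_linkage(dist, 2):
    print([sequences[i] for i in members])
```

The resulting cluster membership is an ordinary categorical variable, so it can be cross-tabulated with covariates or used to order the sequences for display.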
Optimal matching analysis has been used for many different types of sequence
- Most used in molecular biology: comparing DNA and proteins
- Used to analyse bird song
- Analysis of careers of baroque musicians
- Analysis of morris dancing
Relatively little software can do this with social science data: TDA has a sequence module, which includes optimal matching analysis (OMA)
Work by Andrew Abbott has popularised the approach in sociology (including historical sociology); see the debate in Sociological Methods and Research (2000) vol 29, no 1, for an overview and bibliographic details
Halpin and Chan, `Class careers as sequences', European Sociological Review 14, 2, 1998
Scherer, `Early career patterns: A comparison of Great Britain and West Germany', European Sociological Review 17, 2, 2001

© Brendan Halpin, 23-Apr-2012
Department of Sociology, University of Limerick