Most methods for the analysis of longitudinal data focus on
narrow aspects of the picture
Wave-on-wave transitions
Starting point versus current position
single spells (perhaps pooled)
summaries such as cumulated duration
It is inherently difficult to model the whole long trajectory
potentially many spells in different states
the outcome of continuously operating processes over an
extended period
There are many reasons we might want to analyse or
summarise whole trajectories
Mostly exploratory: to get an overview of longitudinal data
like means/distributions give an overview of cross-sectional
data
Incidence of different sorts of trajectories
defined a priori by researcher
defined interactively by researcher
defined by some formal method
We can simply define rules to classify all sequences -
useful where the issue is clear, but complicated where the
state space is large, and may rule out finding unexpected
sequence types.
We can define groups interactively: first look at the
overall distribution of all sequences, and then group them
according to their distribution
Good where the average length is short and the state
space is small
Impractical with long, variable sequences with many states
Or we can use some automatic algorithm:
most often some
means of defining a similarity score between pairs of
sequences,
and using the pairwise similarity matrix to conduct
a cluster analysis.
This allows us to create an `empirical typology' of sequences
We can think of many ways of computing similarity scores
between sequences
count matches on an element-by-element basis
compare cumulated duration in all states
look for longest common subsequence
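As a rough illustration of the three ideas just listed, the following sketch computes each score for a pair of sequences written as strings of state codes. The function names and the example sequences are hypothetical, chosen only for this illustration.

```python
from collections import Counter


def positional_matches(a, b):
    """Count positions at which both sequences are in the same state."""
    return sum(x == y for x, y in zip(a, b))


def duration_distance(a, b):
    """Sum over states of the absolute difference in cumulated duration."""
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[s] - cb[s]) for s in set(ca) | set(cb))


def lcs_length(a, b):
    """Length of the longest common subsequence (order kept, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Hypothetical example sequences, one character per time point
s1, s2 = "ABCDE", "ABDDE"
print(positional_matches(s1, s2), duration_distance(s1, s2), lcs_length(s1, s2))
```

Note that the first two are very different kinds of score: positional matching is sensitive to timing but ignores shifted patterns, while cumulated duration ignores order entirely.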
One general method is `optimal matching', originating in
computer science, much used in molecular biology
an efficient algorithm which counts how many operations are
needed to turn one sequence into another
by insertions and deletions: ABCDE becomes
ABDE by deletion of one element; it becomes
ABDDE by a deletion followed by an insertion
Each insertion or deletion (`indel') needed `costs' a unit, and the
distance between pairs of sequences is the total cost of the
cheapest route between them
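The indel-only distance described above can be sketched with a standard dynamic-programming recursion. This is a minimal illustration assuming unit indel costs and sequences written as strings; the function name `indel_distance` is hypothetical.

```python
def indel_distance(a, b, indel=1):
    """Cheapest total cost of turning a into b using only insertions
    and deletions, each costing `indel`."""
    # prev[j] = cost of turning the first i-1 elements of a into b[:j]
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel]  # delete the first i elements of a
        for j, y in enumerate(b, start=1):
            best = min(prev[j], curr[j - 1]) + indel  # delete x, or insert y
            if x == y:
                best = min(best, prev[j - 1])  # matching elements cost nothing
            curr.append(best)
        prev = curr
    return prev[-1]


print(indel_distance("ABCDE", "ABDE"))   # one deletion
print(indel_distance("ABCDE", "ABDDE"))  # one deletion plus one insertion
```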
Substitutions are also allowed: we may wish to consider
given pairs of states as particularly similar, such that a
substitution of one for the other should be `cheaper' than
insertion of one and deletion of the other
Depending on the cost settings, this can allow gaps which permit
matching subsequences in different parts of the sequences
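A sketch of the fuller recursion, with substitutions as well as indels, may make the role of the cost settings clearer. The function name, the default costs and the substitution-cost table are assumptions made for illustration, not settings taken from any particular package.

```python
def om_distance(a, b, indel=1.0, sub_cost=None, default_sub=2.0):
    """Cheapest total cost of turning a into b by indels and substitutions.

    sub_cost maps pairs of states to a substitution cost; pairs not
    listed cost default_sub. All values here are illustrative assumptions.
    """
    sub_cost = sub_cost or {}

    def sub(x, y):
        if x == y:
            return 0.0
        return sub_cost.get((x, y), sub_cost.get((y, x), default_sub))

    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel,          # delete x
                curr[j - 1] + indel,      # insert y
                prev[j - 1] + sub(x, y),  # substitute x by y (free if equal)
            ))
        prev = curr
    return prev[-1]


# Treating C and D as similar states makes ABCDE and ABDDE close
print(om_distance("ABCDE", "ABDDE", sub_cost={("C", "D"): 0.5}))
# A single insertion lines up the shifted common subsequence ABCD
print(om_distance("ABCD", "XABCD"))
```

With the default substitution cost set equal to two indels, a substitution is never cheaper than a deletion plus an insertion, so the result reduces to the indel-only distance; lowering it for selected pairs is what expresses similarity between states.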
Given a matrix of pairwise costs or distances, cluster
analysis will generate a set of groups: an empirical typology
This typology can then be investigated in interaction with
other covariates
The clusters can also be viewed, which gives a valuable
overview of the sample of sequences - too many to view without
some order imposed
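The step from a pairwise distance matrix to groups can be sketched as follows. This is a minimal average-linkage agglomerative clustering written in plain Python; the toy sequences are hypothetical, and a simple count of positional mismatches stands in for the optimal matching cost to keep the example short. In practice a statistical package's clustering routine would normally be used.

```python
def average_linkage(dist, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    dist is a symmetric matrix (list of lists) of pairwise distances;
    the result is a list of clusters, each a list of row indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def between(c1, c2):
        # average distance over all cross-cluster pairs
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [(p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        p, q = min(pairs, key=lambda pq: between(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Hypothetical toy sequences (E = employed, U = unemployed)
seqs = ["EEEEE", "EEEEU", "UUUUU", "UUUEU"]
# Stand-in distance: number of positions at which two sequences differ
dist = [[sum(x != y for x, y in zip(s, t)) for t in seqs] for s in seqs]

groups = average_linkage(dist, n_clusters=2)
for g in groups:
    print([seqs[i] for i in g])
```

Listing the sequences cluster by cluster, as in the final loop, is one simple way of imposing the order needed to view them.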
Optimal matching analysis has been used for many different
types of sequence
Mostly used in molecular biology: comparing DNA and proteins
Used to analyse bird song
Analysis of careers of baroque musicians
Analysis of morris dancing
Relatively little software can do this with social science
data: TDA has a sequence module, which includes OMA
Work by Andrew Abbott has popularised the approach in
sociology (including historical sociology); see debate in
Sociological Methods and Research (2000) vol 29, no 1, for
an overview and
bibliographic details
Halpin and Chan, `Class careers as sequences', European Sociological Review
14, 2, 1998
Scherer, `Early career patterns: A comparison of Great
Britain and West Germany', European Sociological Review
17, 2, 2001