Discussion
I had the great pleasure of acting as opponent for Aleksi Karhula’s successful PhD defence in Turku last Friday. Two of the papers presented in his PhD use sequence analysis to compare siblings’ lifecourses (Karhula, Erola, Raab and Fasang, and Raab, Fasang, Karhula and Erola, the latter already published in Demography), and I naturally found these particularly interesting. I have never done dyadic sequence analysis in the course of research, but I have written some code to deal with it using SADI, for teaching purposes. This note arises out of that experience, in reaction to Aleksi and colleagues’ approach.
In the papers, similarity between siblings' family formation and early socio-economic careers is analysed by comparing distances between their trajectories with distances between a pair composed of the reference sibling and a single randomly chosen other. The inter-career distance is calculated by a sequence analysis algorithm. Since this algorithm calculates all pairwise distances between individuals, comparing each sibling dyad with only one matched unrelated dyad throws away potentially useful information. One alternative would be to compare each individual's distance to their sibling with that individual's distance to all other members of the sample. This presents two issues: how to compare a single distance to a set of distances, and how to deal with the re-use of individuals across cases. Rather than deal with these issues (which are not necessarily intractable), I propose an intermediate approach: to look at a small random sample of all alters for each individual ego. The approach in the two papers in effect samples one alter per ego, and can be thought of as a sort of hot-deck single imputation. In those terms, I am suggesting multiple rather than single imputation, and propose using multiple imputation estimates of the difference between sibling and non-sibling distance.
Using the multiple imputation framework is one way of generating a single coherent estimate of the sibling effect from the multiple comparisons. Other plausible ways of taking account of the multiple observations would be to treat each individual ego as a cluster, and to use regression with clustered standard errors, or cluster random effects.
Simulation can demonstrate how this works, in terms of both the code required and the statistical performance.
In the code below, I first generate simulated data where siblings are correlated and non-siblings are independent on a single quantitative variable. I then calculate a matrix of all pairwise distances on this variable, which puts us in a position analogous to obtaining pairwise distances from a sequence analysis algorithm.
The code then retains the distances between each ego and all alters (of whom one is the sibling). Thus each family provides one set of observations (and the framework will also work for directed dyads such as heterosexual spouse pairs as well as symmetric dyads such as sibling pairs). I then sample a pre-defined number, M, of non-sibling distances per ego, and restructure the data with M cases per individual, each holding the sibling and one non-sibling distance. I then test the difference in a number of ways:
- A simple t-test, taking no account of nesting
- A simple regression, which is equivalent to the t-test
- A regression with standard errors clustered on family
- A panel regression with family-level random effects
- A regression within Stata's mi estimate framework
- A simple regression in each of the M pairs
Results from an example run:
Summary:
-------
N:                            1000
N-matches:                      20
Correlation:                 0.175
Difference:                 -0.215

T-stats:
-------
T-test:                    -13.095
Regression:                -13.095
Cluster regression:         -4.075
Random effects regression:  -4.075
Multiple imputation:        -2.421

1 match at a time across 20 matches:
-------
Average diff:               -0.215
Minimum diff:               -0.373
Maximum diff:               -0.138
Average t:                  -2.923
Minimum t:                  -4.941
Maximum t:                  -1.925
The simulation creates 1000 families, and matches 20 times. The correlation between siblings on the underlying variable is 0.175, and the average difference between sibling distances and unrelated distances is -0.215. (To reproduce this result, set the random seed with set seed 12345 before running the simulation.)
The t-test and simple regression (where only the constant is estimated) take no account of the repeated measurement or clustering, and yield very large (and identical) t-statistics. Taking account of the clustering via regress diff, cluster(famid) or xtreg diff, i(famid) also yields convergent results, but with rather smaller t-statistics. Using mi estimate: reg diff gives the most conservative result, with a t-statistic of -2.421.
To compare this result to the single-match approach, the final block summarises the results from doing a simple regression or t-test on each of the matches, one at a time. The average difference is the same at -0.215, but it ranges from -0.373 to -0.138, showing the single-match results are somewhat volatile.
I think the volatility clearly suggests the multiple-match approach is more reliable than the single-match one. However, at present I can't say whether the multiple imputation approach is in any way better than the clustered regressions; both deal with the multiple nature of the match.
In defence of Aleksi and his colleagues: they are working with quite large data sets, so they can probably afford to throw away some information. I think this approach might be more valuable with smaller data sets.
The code is intended to be run multiple times with different settings (changing sample size, number of matches, strength of sibling correlation). I haven’t encapsulated it in Stata programs or Mata functions, so that the intermediate data objects are available for inspection.
Annotated code
This section explains the simulation code step by step.
The first three lines are macros which set parameters: these values can be changed, and the simulation re-run.
local size 2000      // N of families
local ndists 10      // N of unrelated cases to match
local noise 2        // Higher values mean lower correlation between siblings

set obs `size'
Set up a sibling-pair data structure (pairs of cases are siblings).
gen famid = _n
gen x = rnormal()
expand 2
sort famid
by famid: gen sib = _n
Create the outcome variable y from x, which is the same for both siblings in a family, plus rnormal(), which varies by individual. The noise macro sets the standard deviation of the individual component, and higher values give a lower sibling correlation.
gen y = x + rnormal(0,`noise')
Temporarily reshape the data to wide format to get the sibling correlation on the y variable.
reshape wide y, i(famid) j(sib)
corr y1 y2
local correlation = r(rho)
reshape long
Generate the matrix of pairwise Euclidean distances. Since this is 1-dimensional data, L2 distances are simply absolute differences. (Note that matrix dissim dd = y, L2 is equivalent and more general Stata code for the Mata calculation, but is limited to 11,000 cases.)
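For reference, a minimal sketch of that Stata-native alternative (illustrative only; it assumes the sample fits within Stata's matrix limits, and uses st_matrix() to pass the result into Mata):

* Stata-native alternative to the Mata calculation below (illustrative)
matrix dissimilarity dd = y, L2
mata: dd = st_matrix("dd")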
// Put the y-variable into Mata
putmata y

// Make a matrix of pairwise distances from y
mata
size = rows(y)
y1 = y # J(1,size,1)
y2 = y' # J(size,1,1)
dd = abs(y1 :- y2)
mata drop y1 y2
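As an aside, Mata's colon operators broadcast an n x 1 vector against a 1 x n vector, so the Kronecker expansion above could be replaced by a one-liner (an equivalent construction, shown here only for comparison, and runnable in the same Mata session):

// Equivalent broadcast construction; should match dd exactly
dd_alt = abs(y :- y')
assert(dd_alt == dd)
mata drop dd_alt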
Remaining in Mata, drop the redundant second cases, keeping the distances from each family's first sibling (rows) to each family's second sibling (columns). This command keeps odd-numbered rows and even-numbered columns.
dd1 = dd[range(1,size-1,2), range(2,size,2)]
nfams = rows(dd1)
The diagonal of the resulting matrix represents sibling distances. This extracts them into a separate vector.
sibdist = diagonal(dd1)
Snip out the diagonal, creating an nfams*(nfams-1) matrix of distances to non-siblings. It is possibly unnecessary to exclude siblings from the pool to sample from.
dd2 = lowertriangle(dd1,0)[1..nfams,1..nfams-1] + ///
      (uppertriangle(dd1[1..nfams-1,2..nfams]) \ J(1,nfams-1,0))
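To see what this construction does, a toy check can be run in the same Mata session (illustrative only; A is a hypothetical 3x3 distance matrix whose diagonal plays the role of the sibling distances):

// Toy check of the diagonal-snipping construction
A = (0, 1, 2 \ 3, 0, 4 \ 5, 6, 0)
n = rows(A)
A2 = lowertriangle(A,0)[1..n,1..n-1] + ///
     (uppertriangle(A[1..n-1,2..n]) \ J(1,n-1,0))
A2    // gives (1, 2 \ 3, 4 \ 5, 6): each row keeps only the off-diagonal entries
mata drop A n A2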
Sample distances from unrelated alters. This finishes the Mata section.
ndists = `ndists'
// Sample (with replacement) ndists column indices per ego, then pick out
// the corresponding unrelated distances row by row
samp = runiformint(nfams, ndists, 1, nfams-1)
unrel = J(nfams, ndists, .)
for (i=1; i<=nfams; i++) {
    unrel[i,.] = dd2[i, samp[i,.]]
}
end // End of Mata section
Back in Stata code, we reduce the data to one case per family, to correspond with what has already been done with the distance matrices above.
keep if sib==1
Get the distances from Mata as Stata variables. The variable sibdist will hold the true dyad distances, and unrel1 to unrelM the M imputed dyads.
getmata sibdist = sibdist
getmata (unrel*) = unrel
Reshape to one case per match, and generate a variable diff holding the difference between sibdist and each matched dyad distance.
reshape long unrel, i(famid) j(case)
gen diff = sibdist - unrel
To set up for MI, this code duplicates the first case to stand for the original (unimputed) data, setting case to zero to mark it as unimputed.
expand 2 if case == 1
sort famid case
by famid: replace case = 0 if _n == 1
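Before testing, it can be worth eyeballing the resulting structure for a single family (an optional inspection, not part of the simulation proper):

* Optional: inspect the long structure for the first family
list famid case sibdist unrel diff if famid==1, sepby(famid) noobs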
We can now test the difference, using the following tests:
1: t-test
2: regression, equivalent to t-test, not taking account of repetitions
3: regression, taking account of repetitions via cluster() option
4: regression, taking account of repetitions via random effects
5: MI regression
As we can see from the summary results above, 1 & 2 are clearly too optimistic, 3 & 4 are practically identical and more conservative, with 5 clearly the most conservative.
ttest diff == 0 if case>0
local ttestt = r(t)
reg diff if case>0
local regt = _b[_cons]/_se[_cons]
reg diff if case>0, cluster(famid)
local regclt = _b[_cons]/_se[_cons]
xtreg diff if case>0, i(famid) re
local regxtt = _b[_cons]/_se[_cons]
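As an aside, in current Stata syntax the clustered and random-effects models can equivalently be written as follows (a stylistic variant only; xtset replaces the old i() option):

* Equivalent modern syntax for the clustered and random-effects models
reg diff if case>0, vce(cluster famid)
xtset famid
xtreg diff if case>0, re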
Stata multiple imputation can deal with data in many "styles". Style flong contains the observed data (which will normally have missing values to be imputed) in one case or row, and the M imputations in other cases. Import the data into Stata's MI framework, in flong style (one case for observed data, one case for each of M imputations):
mi import flong, id(famid) m(case) imputed(unrel) clear
mi estimate: reg diff
mat V = e(V_mi)
mat B = e(b_mi)
local mit = B[1,1]/sqrt(V[1,1])
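To check that the import went as intended, mi describe reports the style, the number of imputations M, and the registered imputed variables (an optional inspection step):

* Optional: verify the MI setup after import
mi describe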
Finally, assemble some more summary information and print it out:
qui {
    su diff if case>0
    local meandiff = r(mean)

    // Capture results from analysing one match at a time across
    // the multiple matches
    local bsum = 0
    local bmax = -99999
    local bmin = 99999
    local tsum = 0
    local tmax = -99999
    local tmin = 99999
    forvalues x = 1/`ndists' {
        reg diff if case==`x'
        local t = _b[_cons]/_se[_cons]
        local tsum = `tsum' + `t'
        if (`tmax'<`t') {
            local tmax = `t'
        }
        if (`tmin'>`t') {
            local tmin = `t'
        }
        local b = _b[_cons]
        local bsum = `bsum' + `b'
        if (`bmax'<`b') {
            local bmax = `b'
        }
        if (`bmin'>`b') {
            local bmin = `b'
        }
    }
    local meant = `tsum'/`ndists'
    local meanb = `bsum'/`ndists'
}
di _newline "Summary:" _newline "-------" ///
   _newline "N:                         " %7.0f `size' ///
   _newline "N-matches:                 " %7.0f `ndists' ///
   _newline "Correlation:               " %7.3f `correlation' ///
   _newline "Difference:                " %7.3f `meandiff' ///
   _newline _newline "T-stats:" _newline "-------" ///
   _newline "T-test:                    " %7.3f `ttestt' ///
   _newline "Regression:                " %7.3f `regt' ///
   _newline "Cluster regression:        " %7.3f `regclt' ///
   _newline "Random effects regression: " %7.3f `regxtt' ///
   _newline "Multiple imputation:       " %7.3f `mit' ///
   _newline _newline "1 match at a time across `ndists' matches:" _newline "-------" ///
   _newline "Average diff:              " %7.3f `meanb' ///
   _newline "Minimum diff:              " %7.3f `bmin' ///
   _newline "Maximum diff:              " %7.3f `bmax' ///
   _newline "Average t:                 " %7.3f `meant' ///
   _newline "Minimum t:                 " %7.3f `tmin' ///
   _newline "Maximum t:                 " %7.3f `tmax'