Comparing sibling and unrelated dyads: one or many?

Discussion

I had the great pleasure of acting as opponent for Aleksi Karhula’s successful PhD defence in Turku last Friday. Two of the papers presented in his PhD use sequence analysis to compare siblings’ lifecourses (Karhula, Erola, Raab and Fasang, and Raab, Fasang, Karhula and Erola, the latter already published in Demography), and I naturally found these particularly interesting. I have never done dyadic sequence analysis in the course of research, but I have written some code to deal with it using SADI, for teaching purposes. This note arises out of that experience, in reaction to Aleksi and colleagues’ approach.

In the papers, similarity between siblings’ family formation and early socio-economic careers is analysed by comparing distances between their trajectories with distances between a pair composed of the reference sibling and a single randomly chosen other. The inter-career distance is calculated by a sequence analysis algorithm. Since this algorithm calculates all pairwise distances between individuals, comparing each sibling dyad with only one matched unrelated dyad throws away potentially useful information. One alternative would be to compare each individual’s distance to their sibling with that individual’s distance to all other members of the sample. This presents two issues: how to compare a single distance to a set of distances, and how to deal with the re-use of individuals across cases. Rather than deal with these issues (which are not necessarily intractable), I propose an intermediate approach: to look at a small random sample of all alters for each individual ego. The approach in the two papers in effect samples one alter per ego, and can be thought of as a sort of hot-deck single imputation. In those terms, I am suggesting multiple rather than single imputation, and propose using multiple imputation estimates of the difference between sibling and non-sibling distance.

Using the multiple imputation framework is one way of generating a single coherent estimate of the sibling effect from the multiple comparisons. Other plausible ways of taking account of the multiple observations would be to treat each individual ego as a cluster, and to use regression with clustered standard errors, or cluster random effects.
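In outline, assuming a difference variable diff and a family identifier famid as constructed in the code below, these alternatives are:

reg diff, cluster(famid)   // regression with standard errors clustered on family
xtreg diff, i(famid) re    // panel regression with family-level random effects

Both appear, alongside the MI estimate, in the simulation below.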

Simulation can demonstrate how this works, in terms of both the code required and the statistical performance.

In the code below, I first generate simulated data where siblings are correlated and non-siblings are independent on a single quantitative variable. I then calculate a matrix of all pairwise distances on this variable, which puts us in a position analogous to obtaining pairwise distances from a sequence analysis algorithm.

The code then retains the distances between each ego and all alters (of whom one is the sibling). Thus each family provides one set of observations (and the framework will work for directed dyads such as heterosexual spouse pairs as well as for symmetric dyads such as sibling pairs). I then sample a pre-defined number, M, of non-sibling pairs, and structure the data with M cases per individual, each holding the sibling distance and one non-sibling distance. I then test the difference in a number of ways:

  • A simple t-test, taking no account of nesting
  • A simple regression, which is equivalent to the t-test
  • A regression with standard errors clustered on family
  • A panel regression with family-level random effects
  • A regression within Stata’s mi estimate framework
  • A simple regression in each of the M pairs

Results from an example run:

Summary:
-------
N:              1000
N-matches:        20
Correlation:   0.175
Difference:   -0.215

T-stats:
-------
T-test:                     -13.095
Regression:                 -13.095
Cluster regression:          -4.075
Random effects regression:   -4.075
Multiple imputation:         -2.421

1 match at a time across 20 matches:
-------
Average diff:  -0.215
Minimum diff:  -0.373
Maximum diff:  -0.138
Average t:     -2.923
Minimum t:     -4.941
Maximum t:     -1.925

The simulation creates 1000 families, and matches 20 times. The correlation between siblings on the underlying variable is 0.175, and the average difference between sibling distances and unrelated distances is -0.215. (To reproduce this result, set the random seed with set seed 12345 before running the simulation.)

The t-test and simple regression (where only the constant is estimated) take no account of the repeated measurement or clustering, and yield very large (and identical) t-statistics. Taking account of the clustering, via reg diff, cluster(famid) or xtreg diff, i(famid) re, also yields convergent results, but with rather smaller t-statistics. Using mi estimate: reg diff gives the most conservative result, with a t-statistic of -2.421.
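The relative conservatism of the MI estimate follows from Rubin’s combining rules: with point estimates $\hat\theta_m$ from each of the $M$ matches, the pooled estimate and its total variance are

\[
\bar\theta = \frac{1}{M}\sum_{m=1}^{M}\hat\theta_m, \qquad
T = \bar W + \left(1+\frac{1}{M}\right)B,
\]

where $\bar W$ is the average within-match variance and $B = \frac{1}{M-1}\sum_{m=1}^{M}(\hat\theta_m-\bar\theta)^2$ is the between-match variance. The volatility across single matches thus feeds directly into the MI standard error via $B$.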

To compare this result to the single-match approach, the final block summarises the results of a simple regression (equivalent to a t-test) on each of the matches, one at a time. The average difference is the same at -0.215, but it ranges from -0.373 to -0.138, showing that the single-match results are somewhat volatile.

I think the volatility clearly suggests the multiple-match approach is more reliable than the single-match one. However, at present I can’t say whether the multiple imputation approach is in any way better than the clustered regressions; both deal with the multiple nature of the match.

In defence of Aleksi and his colleagues: they are working with quite large data sets, so they can probably afford to throw away some information. I think this approach might be more valuable with smaller data sets.

The code is intended to be run multiple times with different settings (changing sample size, number of matches, strength of sibling correlation). I haven’t encapsulated it in Stata programs or Mata functions, so that the intermediate data objects are available for inspection.

Annotated code

Link to Stata code

This section explains the simulation code step by step.

The first three lines define local macros that set the parameters: these values can be changed, and the simulation re-run.

local size 2000 // N of families
local ndists 10 // N of unrelated cases to match
local noise   2 // Higher values mean lower correlation between siblings

set obs `size'

Set up a sibling-pair data structure (pairs of cases are siblings).

gen famid = _n
gen x = rnormal()
expand 2
sort famid
by famid: gen sib = _n

Create the outcome variable y, based on x, which is the same for both siblings in a family, plus an rnormal() component which varies by individual. The noise macro sets the standard deviation of the individual component, and higher values give lower sibling correlation.

gen y =  x + rnormal(0,`noise')
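Under this data-generating process the expected sibling correlation can be worked out directly, since siblings share only the x component:

\[
y_{fi} = x_f + e_{fi}, \quad x_f \sim N(0,1), \quad e_{fi} \sim N(0,\sigma^2)
\qquad\Longrightarrow\qquad
\operatorname{corr}(y_{f1},y_{f2}) = \frac{1}{1+\sigma^2},
\]

where $\sigma$ is the noise macro. With noise set to 2 the expected correlation is 1/5 = 0.2, in line with the 0.175 observed in the example run above.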

Temporarily reshape the data to wide format to get the sibling correlation on the y variable.

reshape wide y, i(famid) j(sib)
corr y1 y2
local correlation = r(rho)
reshape long

Generate the matrix of pairwise Euclidean distances. Since this is one-dimensional data, L2 distances are simple absolute differences. (Note that matrix dissim dd = y, L2 is equivalent, and more general, Stata code for the Mata calculation, but is limited to 11,000 cases by the maximum matrix size.)
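As a toy check of this equivalence (hypothetical values, run separately from the simulation since it clears the data): for three observations, the one-dimensional L2 distances are just the absolute differences.

clear
set obs 3
gen y = cond(_n==1, 1, cond(_n==2, 4, 6))  // y = (1, 4, 6)
matrix dissim dd = y, L2
matrix list dd  // expect |1-4|=3, |1-6|=5, |4-6|=2 off the diagonal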

// Put the y-variable into Mata
putmata y

// Make a matrix of pairwise distances from y
mata
size = rows(y)
y1 = y # J(1,size,1)    // size x size matrix: y1[i,j] = y[i]
y2 = y' # J(size,1,1)   // size x size matrix: y2[i,j] = y[j]
dd = abs(y1 :- y2)      // so dd[i,j] = |y[i] - y[j]|
mata drop y1 y2

Remaining in Mata, drop the redundant second case, keeping each individual’s distance to the other sibling. This command keeps odd-numbered rows and even-numbered columns.

dd1 = dd[range(1,size-1,2), range(2,size,2)]  // sib-1 rows, sib-2 columns
nfams = rows(dd1)

The diagonal of the resulting matrix represents sibling distances. This extracts them into a separate vector.

sibdist = diagonal(dd1)

Snip out the diagonal, creating an nfams x (nfams-1) matrix of distances to non-siblings. (It is possibly unnecessary to exclude siblings from the pool sampled from.)

dd2 = lowertriangle(dd1,0)[1..nfams,1..nfams-1] + ///
      (uppertriangle(dd1[1..nfams-1,2..nfams]) \ J(1,nfams-1,0))
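An optional sanity check (not in the original code) can confirm, while still in Mata, that the construction has the intended shape:

// dd2 should have one row per family and one column per non-sibling alter
assert(rows(dd2) == nfams)
assert(cols(dd2) == nfams - 1)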

Sample distances from unrelated alters. This finishes the Mata section.

ndists = `ndists'
samp = runiformint(nfams, ndists, 1, nfams-1)  // nfams x ndists matrix of random column indices
unrel = J(nfams, ndists, .)                    // will hold the sampled unrelated distances
for (i=1; i<=nfams; i++) {
  unrel[i,] = dd2[i, samp[i,]]                 // ndists non-sibling distances for ego i
}
end // End of Mata section

Back in Stata code, we reduce the data to one case per family, to correspond with what has already been done with the distance matrices above.

keep if sib==1

Get the distances from Mata as Stata variables. The variable sibdist will hold the true dyad distances, and unrel1 to unrelM the M imputed dyads.

getmata sibdist = sibdist
getmata unrel* = unrel

Reshape to one case per match, and generate a variable diff holding the difference between sibdist and each matched dyad distance.

reshape long unrel, i(famid) j(case)
gen diff = sibdist-unrel

To set up for MI, this code duplicates the first case to stand for the original (unimputed) data, setting case to zero to mark it as unimputed.

expand 2 if case == 1
sort famid case
by famid: replace case=0 if _n==1

We can now test the difference, using the following tests:

1: t-test
2: regression, equivalent to t-test, not taking account of repetitions
3: regression, taking account of repetitions via cluster() option
4: regression, taking account of repetitions via random effects
5: MI regression

As we can see from the summary results above, 1 & 2 are clearly too optimistic, 3 & 4 are practically identical and more conservative, with 5 clearly the most conservative.

ttest diff == 0 if case>0
local ttestt = r(t)
reg diff if case>0
local regt = _b[_cons]/_se[_cons]
reg diff if case>0, cluster(famid)
local regclt = _b[_cons]/_se[_cons]
xtreg diff if case>0, i(famid) re
local regxtt = _b[_cons]/_se[_cons]

Stata multiple imputation can deal with data in several "styles". Style flong contains the observed data (which would normally have missing values to be imputed) as one case or row, and the M imputations as further cases. Import the data into Stata's MI framework in flong style:

mi import flong, id(famid) m(case) imputed(unrel) clear
mi estimate: reg diff
mat V = e(V_mi)
mat B = e(b_mi)
local mit = B[1,1]/sqrt(V[1,1])

Assemble some more summary information, and print the summary:

qui {
  su diff if case>0
  local meandiff = r(mean)
  // Capture results from analysing one match at a time across
  // the multiple matches
  local bsum = 0
  local bmax = -99999
  local bmin =  99999
  local tsum = 0
  local tmax = -99999
  local tmin =  99999
  forvalues x = 1/`ndists' {
    reg diff if case==`x'
    local t    = _b[_cons]/_se[_cons]
    local tsum = `tsum' + `t'
    if (`tmax'<`t') {
      local tmax = `t'
    }
    if (`tmin'>`t') {
      local tmin = `t'
    }
    local b    = _b[_cons]
    local bsum = `bsum' + `b'
    if (`bmax'<`b') {
      local bmax = `b'
    }
    if (`bmin'>`b') {
      local bmin = `b'
    }
  }
  local meant = `tsum'/`ndists'
  local meanb = `bsum'/`ndists'
}

di _newline "Summary:" _newline "-------" ///
 _newline "N:           " %7.0f `size' ///
 _newline "N-matches:   " %7.0f `ndists' ///
 _newline "Correlation: " %7.3f `correlation' ///
 _newline "Difference:  " %7.3f `meandiff' ///
 _newline _newline "T-stats:" _newline "-------" ///
 _newline "T-test:                     " %7.3f `ttestt' ///
 _newline "Regression:                 " %7.3f `regt' ///
 _newline "Cluster regression:         " %7.3f `regclt' ///
 _newline "Random effects regression:  " %7.3f `regxtt' ///
 _newline "Multiple imputation:        " %7.3f `mit' ///
 _newline _newline "1 match at a time across `ndists' matches:" _newline "-------" ///
 _newline "Average diff: " %7.3f `meanb' ///
 _newline "Minimum diff: " %7.3f `bmin' ///
 _newline "Maximum diff: " %7.3f `bmax' ///
 _newline "Average t:    " %7.3f `meant' ///
 _newline "Minimum t:    " %7.3f `tmin' ///
 _newline "Maximum t:    " %7.3f `tmax'

Link to Stata code
