# Handling dyadic data in Stata

Sometimes when you are working with nested data (such as household surveys, with data on all individuals in the household), analysis focuses on dyads (such as spouse pairs) rather than individual cases. This means you need to link data in one observation with that in another. As long as the data includes information in ego’s record about where alter’s record is (e.g., by holding alter’s ID as a variable), the simplest way to do this is to create a separate data file, where the alter ID variable is renamed to ID, and the substantive variables are also renamed, and to match it back in to the original data. This is not terribly difficult, but it is messy, so I present here a more convenient method.
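To make the "simplest way" concrete, here is a minimal sketch in Python rather than Stata. It is an illustration of the rename-and-match-back idea only, not the more convenient method mentioned above. The records, the variable names (`id`, `spouse_id`, `age`) and the helper `attach_alter` are all hypothetical.

```python
# Hypothetical household data: each record holds ego's ID and the ID of
# ego's spouse (alter), as assumed in the text.
people = [
    {"id": 1, "spouse_id": 2, "age": 44},
    {"id": 2, "spouse_id": 1, "age": 41},
    {"id": 3, "spouse_id": None, "age": 19},
]

def attach_alter(records, alter_key, variables, prefix="alter_"):
    """Return copies of records with alter's variables added under a prefix."""
    # Equivalent of the separate file keyed on ID.
    by_id = {rec["id"]: rec for rec in records}
    linked = []
    for rec in records:
        alter = by_id.get(rec[alter_key])  # None if there is no alter
        out = dict(rec)
        for var in variables:
            # Equivalent of renaming the substantive variable and matching it in.
            out[prefix + var] = alter[var] if alter is not None else None
        linked.append(out)
    return linked

dyads = attach_alter(people, "spouse_id", ["age"])
for row in dyads:
    print(row)
```

Each ego record now carries `alter_age` alongside its own `age`, with a missing value where there is no spouse.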

# Logistic Regression vs the Linear Probability Model

## Logit vs LPM with differing ranges of observation of X

The linear probability model (LPM) is increasingly being recommended as a robust alternative that avoids the shortcomings of logistic regression. (See Jake Westfall’s blog for a good summary of some of the arguments, from a pro-logistic point of view.) However, while the LPM may be more robust in some senses, it is well known that it does not deal with the fact that probability is restricted to the 0–1 range. This is not just a technical problem: as a result, its estimates will differ if the range of X differs, even when the underlying process generating the data is the same. For instance, if X makes the outcome more likely, and we observe a moderate range of X, we will get a certain positive slope coefficient from the LPM. If we supplement the sample with observations from a higher range of X (sufficiently high that the observed proportion with the outcome is close to 100%), the slope coefficient will tend to be depressed, as it must be to accommodate observations with higher X but an at-most marginally higher proportion with the outcome. The same is not true of the logistic model.

(I have already blogged about inconsistencies in the LPM in the face of changes in the data generating process; here, I am talking about inconsistencies of the LPM where the observed range of X changes under an unchanging data generation model.)

In other words, if there are different sub-populations where the true relationship between X and the outcome is the same, but the range of X is different, the LPM will give less consistent results than the logistic regression model.
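A small numerical sketch of this point, in Python. It assumes a true model logit(p) = x and X spread evenly over each range; both are illustrative assumptions, not the setup of the original analysis. Rather than simulating 0/1 outcomes, it regresses the true probabilities on x, which gives the slope that an LPM fitted to a very large sample would approach.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def lpm_slope(x_min, x_max, n=2001):
    """OLS slope of P(Y=1|x) on x over an evenly spaced grid of x values."""
    xs = [x_min + (x_max - x_min) * i / (n - 1) for i in range(n)]
    ps = [logistic(x) for x in xs]
    x_bar = sum(xs) / n
    p_bar = sum(ps) / n
    sxy = sum((x - x_bar) * (p - p_bar) for x, p in zip(xs, ps))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Same data generating process, two observed ranges of X.
slope_moderate = lpm_slope(-1.0, 1.0)
slope_extended = lpm_slope(-1.0, 5.0)
print(round(slope_moderate, 3), round(slope_extended, 3))
```

The slope over the extended range is smaller, although the process generating the outcome is identical. The logit coefficient is 1 in both cases by construction.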

# The triangle inequality

I have a Stata program, metricp, which takes a distance matrix (square, symmetric, zero diagonal) and tests for the triangle inequality (that for all i, j, there is no k such that d[i,j] > d[i,k] + d[j,k]).

I needed something similar today in R, and found Matthew Vavrek’s fossil package, which includes a tri.ineq() function. However, it turned out to be very slow on the large matrix I threw at it, so I decided to speed it up with some of the techniques in metricp.ado.
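For reference, the condition being tested can be sketched as a brute-force check in Python. This is neither the code in metricp.ado nor fossil’s tri.ineq(), and it uses none of the speed-ups mentioned above. The function name `triangle_violations` and the small tolerance `tol` are my own choices for illustration.

```python
def triangle_violations(d, tol=1e-12):
    """Return (i, j, k) triples where d[i][j] > d[i][k] + d[j][k] + tol."""
    n = len(d)
    violations = []
    for i in range(n):
        for j in range(i + 1, n):  # symmetric matrix: each pair once
            for k in range(n):
                if k != i and k != j and d[i][j] > d[i][k] + d[j][k] + tol:
                    violations.append((i, j, k))
    return violations

# Two toy distance matrices (square, symmetric, zero diagonal).
metric = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
non_metric = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]

print(triangle_violations(metric))      # []
print(triangle_violations(non_metric))  # [(0, 2, 1)]
```

The triple loop is O(n³), so it is only practical for small matrices.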

# What sort of Tweep am I? Who do people who follow me follow?

Twitter is a goldmine of relational info: who follows whom, who retweets whom, who replies to whom, and so on. And a lot of this data is available for analysis (though some of it is on a drip-feed). I decided to try to understand my Twitter identity by looking at who the people who follow me follow.

# Comparing sibling and unrelated dyads: one or many?

## Discussion

I had the great pleasure of acting as opponent for Aleksi Karhula’s
successful PhD defence in Turku last Friday. Two of the papers presented
in his PhD use sequence analysis to compare siblings’ lifecourses
(Karhula, Erola, Raab and Fasang, and Raab, Fasang, Karhula and Erola,
the latter already published in Demography), and I naturally found these
particularly interesting. I have never done dyadic sequence analysis in
the course of research, but I have written some code to deal with it
using SADI, for teaching purposes. This note arises out of that
experience, in reaction to Aleksi and colleagues’ approach.

# Logit, Probit and the LPM

### Simulating and modelling binary outcomes

When we have a binary outcome and want to fit a regression model,
fitting a linear regression with the binary outcome (the so-called Linear
Probability Model) is deprecated, and logistic and probit regression are
the standard practice.

But how well or poorly does the linear probability model function
relative to logistic or probit regression?

# Comparative perspective on migration from Eurostat population data

Migration is a big issue at the moment in Europe, not least in the UK
where public fears about migration (stoked by a racist press and a Home
Secretary and now PM who, let’s say, seems to have a visceral unreasoned
belief that the presence of foreigners in some way harms Britain) lie
behind the Brexit vote.

There are other reasons for being interested in migration, of course.
For instance, I’ve been wondering a bit about gender differentials in the age profile
of migration and re-migration. So I started looking for data, initially
on the Irish CSO site, where I found annual year-of-age
population estimates.

# Bug in Stata’s dendrogram code

Dendrograms are diagrams that have a tree-like structure, and they’re often used to represent the structure of clustering in a hierarchical (agglomerative) cluster analysis. Agglomerative clustering starts from the bottom up, joining the nearest pairs of objects into clusters, and then clusters with objects and finally clusters with clusters, until eventually everything is a single cluster. The single cluster is the root, the objects are the leaves, and in between is a binary tree, where objects and clusters are combined depending on their distance from each other.

This process depends on being able to define a distance between an object and a cluster, and between pairs of clusters, and there are various ways to do this. However, some algorithms may cause the distance between an object/cluster and another cluster to change after the amalgamation of other clusters. This permits “reversals”, which are difficult or impossible to represent in a dendrogram-like structure. But clustering algorithms or “linkages” such as Ward’s are not subject to this problem.

OK, so far so good. What’s the problem? Stata’s dendrogram code is slightly buggy, and can give an error:

    currently can't handle dendrogram reversals

even when you are using a linkage that is not subject to reversals. The explanation is that it is comparing distances between pairs of clusters where one must be greater than or equal to the other for the dendrogram to be drawable (otherwise it’s a reversal), and due to numeric precision is finding pairs where one is fractionally less. The correct code should instead test that any shortfall is no larger than a very small number (e.g., 10^-7), to take account of precision.
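The difference between the two comparisons can be sketched in Python. This is not Stata’s dendrogram code; the function `is_reversal`, its arguments and the default tolerance are hypothetical, chosen only to illustrate a tolerance-aware test.

```python
def is_reversal(child_height, parent_height, tol=1e-7):
    """True only if the parent merge is lower than the child by more than tol."""
    return parent_height < child_height - tol

child = 0.1 + 0.2  # 0.30000000000000004 in floating point
parent = 0.3       # mathematically equal to child

print(parent < child)              # True: naive test sees a spurious reversal
print(is_reversal(child, parent))  # False: tolerant test does not
```

A genuine reversal, where the parent height is clearly below the child’s, is still detected.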

I have had a number of people report this error to me, in connection with my SADI Stata ado package, and have been able to reproduce it. Working with some of them, I have been able to get a workaround from Stata Technical Support (this bug persists into Stata 14).

The following command displays the information that Stata holds about the current clustering (apologies for the linewrapping):

    . char list _dta[_clus_1_info]
      _dta[_clus_1_info]: `"t hierarchical"' `"m wards"' `"d user matrix omd"' `"v id _clus_1_id"' `"v order _clus_1_ord"' `"v height _clus_1_hgt"'

A small change to this will allow the dendrogram to be drawn:

    . char _dta[_clus_1_info] `"`"t hierarchical"' `"m wards"' `"d user matrix omd"' `"v id _clus_1_id"' `"v order _clus_1_ord"' `"v height _clus_1_hgt"'"'

# Gender change in Irish political representation

## Estimating the effect of the gender quota: The initial question

A gender quota for candidates was imposed in the 2016 Irish election (see, e.g., http://www.thejournal.ie/readme/gender-quota). The question arises whether it had an impact. As a first pass consider the following table:

               |        gender
          dail |         f          m |     Total
    -----------+----------------------+----------
          2011 |        21        145 |       166
               |     12.65      87.35 |    100.00
    -----------+----------------------+----------
          2016 |        32        116 |       148
               |     21.62      78.38 |    100.00
    -----------+----------------------+----------
         Total |        53        261 |       314
               |     16.88      83.12 |    100.00

This is clearly a big rise. Is it statistically significant?
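One conventional first check is a Pearson chi-square test on the 2×2 table, sketched here in Python using only the counts shown above. It treats the two Dáils as independent samples, which is an assumption made for illustration and may not be the right framing for what is arguably a full population.

```python
import math

# Counts from the table: rows are Dáil (2011, 2016), columns are (f, m).
a, b = 21, 145
c, d = 32, 116
n = a + b + c + d

# Pearson chi-square for a 2x2 table (no continuity correction).
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# With 1 degree of freedom, chi2 is a squared standard normal, so
# P(chi2_1 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 2), round(p_value, 3))
```

Under those assumptions the p-value falls a little below the conventional 5% threshold.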

# Pollsters’ 3% Margin of Error

The 3% margin of error often quoted in polling has the following logic. It uses a sample size of 1000, and calculates a confidence interval as follows: $$\hat \pi \pm 1.96 \times SE$$ where $$\hat\pi$$ is the sample proportion, and SE, the standard error, is the standard deviation divided by the square root of the sample size.
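The arithmetic can be sketched in a few lines of Python. It assumes the standard deviation of a binary variable, $$\sqrt{\hat\pi(1-\hat\pi)}$$, evaluated at $$\hat\pi = 0.5$$, where it is largest; that worst-case choice is my assumption about how the 3% figure is arrived at. The function name `margin_of_error` is just for illustration.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Half-width of the confidence interval for a sample proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # SD divided by sqrt(n)
    return z * se

moe = margin_of_error(0.5, 1000)
print(round(100 * moe, 1))  # 3.1
```

For proportions further from 0.5 the same formula gives a smaller margin.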