All posts by Brendan Halpin

Stata and Make

J. Scott Long has written an interesting book, The Workflow of Data Analysis Using Stata. It’s good stuff, but I was disappointed to see that he makes no mention of make and Makefiles.

What’s make? It is a simple and powerful way of describing projects, designed initially for building complex C programs on Unix, but adaptable to many other uses. One is the data analysis workflow, where there are many steps between the raw data and the final paper.

So I was pleased to see (from a Gary King tweet) that the current newsletter of the Political Methodology section of the APSA is devoted to workflow matters, and that it mentions using make to manage data analysis projects in Fredrickson, Testa and Weidmann’s article (though in the context of R and LaTeX rather than Stata).


Relative rates, odds ratios and the complementary log-log model

In a previous note, I used Stata to simulate 2×2 tables of a one-off outcome. The simulation shows that odds ratios (ORs) are a much better estimate of the underlying causal effect or statistical association than relative rates are, given certain assumptions. One key assumption is that the outcome is one-off, so that it is reasonable to model the propensity for the event with a normal or logistic distribution. Where the outcome is the result of potentially repeated exposure to a risk (such as being ever-married or ever infected with a particular pathogen), the resulting propensity is not likely to be normal. That is, if you are exposed to many opportunities to marry, saying yes once means you are ever-married for ever after. Even if the propensity to marry at a specific opportunity is normally distributed, the combined propensity to be ever-married after an unknown number of opportunities is unlikely to be well described as normal.
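The excerpt does not give the derivation, but one standard way to see why the complementary log-log link suits repeated exposure is the following sketch. It assumes, purely for illustration, that each of k opportunities is independent with the same per-opportunity probability p, so that the probability of ever experiencing the event is 1 - (1 - p)^k. The group probabilities 0.02 and 0.04 and the function names are hypothetical.

```python
import math

def ever_probability(p, k):
    """Probability of at least one event in k independent opportunities."""
    return 1.0 - (1.0 - p) ** k

def cloglog(p):
    """Complementary log-log transform: log(-log(1 - p))."""
    return math.log(-math.log(1.0 - p))

def logit(p):
    """Log odds."""
    return math.log(p / (1.0 - p))

def group_differences(p_a, p_b, k):
    """Difference between groups on the cloglog and logit scales after k opportunities."""
    big_a = ever_probability(p_a, k)
    big_b = ever_probability(p_b, k)
    return cloglog(big_b) - cloglog(big_a), logit(big_b) - logit(big_a)

if __name__ == "__main__":
    # Hypothetical per-opportunity probabilities for two groups.
    for k in (1, 5, 20, 50):
        d_cloglog, d_logit = group_differences(0.02, 0.04, k)
        print(f"k={k:2d}  cloglog difference={d_cloglog:.4f}  log odds ratio={d_logit:.4f}")
```

Under these assumptions the cloglog difference between the groups is the same for every k, because cloglog of the ever-probability equals log(k) plus cloglog of p, while the log odds ratio changes with the number of opportunities.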


Relative rates and odds ratios

A frequent theme in the medical statistics and epidemiological literature is that odds ratios (ORs) as effect measures for binary outcomes are counterintuitive and an impediment to understanding. Barros and Hirakata (2003), for instance, refer to the relative rate as the “measure of choice” and complain that the OR will “overestimate” the RR as the baseline probability rises. Clearly, ORs are less intuitive than relative rates (RRs), but in this note I take issue with the conclusion sometimes drawn, that models with relative-rate interpretations should be used instead of logistic regression and other OR models. This is because RRs are not measures of the size of the statistical association between a variable and an outcome (since they also vary inversely with the baseline probability), and because, under certain assumptions, ORs and related measures are. That is, RRs may feel more real but they are likely to be misleading.
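A small worked example may make the dependence on the baseline concrete. The sketch below holds the odds ratio fixed at a hypothetical value of 2 and computes the relative rate it implies at several baseline probabilities; the function names and the chosen values are illustrative assumptions, not taken from the note.

```python
def risk_from_odds_ratio(p0, odds_ratio):
    """Probability in the exposed group implied by baseline p0 and a given OR."""
    odds0 = p0 / (1.0 - p0)
    odds1 = odds0 * odds_ratio
    return odds1 / (1.0 + odds1)

def relative_rate(p0, odds_ratio):
    """Relative rate implied by baseline probability p0 and a fixed odds ratio."""
    return risk_from_odds_ratio(p0, odds_ratio) / p0

if __name__ == "__main__":
    fixed_or = 2.0  # hypothetical constant association on the odds scale
    for p0 in (0.05, 0.10, 0.25, 0.50, 0.75, 0.90):
        p1 = risk_from_odds_ratio(p0, fixed_or)
        print(f"baseline={p0:.2f}  exposed={p1:.3f}  RR={relative_rate(p0, fixed_or):.3f}")
```

With the association held constant on the odds scale, the implied RR shrinks towards 1 as the baseline probability rises, which is the sense in which the RR varies inversely with the baseline.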


Heatmaps in gnuplot and Stata

I use gnuplot and Stata to generate a heatmap representation of a square matrix containing a measure of closeness between 26 departments in a university. gnuplot is a general-purpose plotting program, and can be wheedled into doing a lot of things, but Stata’s graphics routines are also very general. Given data in i, j, n format (in blocks, that is, with a blank line inserted before every change of value of i), gnuplot can generate a heatmap with code like the following:
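The gnuplot script itself is not reproduced in this excerpt. As a sketch of the input format described above only, the following Python function turns a square matrix into i, j, n lines with a blank line before every change of i. The function name, the 0-based indices and the toy 2×2 matrix are assumptions made for illustration.

```python
def to_gnuplot_blocks(matrix):
    """Return 'i j n' lines, with a blank line before each change of i."""
    lines = []
    for i, row in enumerate(matrix):
        if i > 0:
            lines.append("")  # blank line separates one block (value of i) from the next
        for j, n in enumerate(row):
            lines.append(f"{i} {j} {n}")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Toy closeness matrix; the real data would be 26 by 26.
    closeness = [[1, 2], [3, 4]]
    print(to_gnuplot_blocks(closeness), end="")
```

The resulting text can be written to a data file and read by gnuplot; which plotting style is applied to it is left open here.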
