Pseudo-R2 is pseudo

People like the R2 statistic from linear regression so much that they re-invent it in places where it doesn’t naturally arise, such as logistic regression. The true R2 has nice clean interpretations: the proportion of variation explained, or the square of the correlation between observed and predicted values. The fake or pseudo-R2 statistics are typically based on relating the log-likelihood of the current model to that of the null model (intercept only) in some way. There is a good overview at UCLA.
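Those two interpretations of R2 coincide for least-squares fits with an intercept, which is easy to check numerically. A minimal sketch with simulated data (numpy assumed; not part of the original post):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

# Fit a simple linear regression by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta

# R2 as the proportion of variation explained ...
ss_res = np.sum((y - yhat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# ... agrees with the squared correlation of observed and fitted values.
r2_corr = np.corrcoef(y, yhat)[0, 1] ** 2
print(abs(r2 - r2_corr) < 1e-10)  # True
```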

One of the most popular pseudo-R2 statistics is McFadden’s. This is defined as 1 - LLm/LL0, where LLm is the log-likelihood of the current model and LL0 that of the null model. It has range 0 to 1, though 1 is never reached in practice.
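In code this is a one-liner; the log-likelihood values below are made-up numbers purely for illustration:

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo-R2: 1 - LLm / LL0, where LLm and LL0 are the
    log-likelihoods of the fitted and null (intercept-only) models."""
    return 1 - ll_model / ll_null

# Hypothetical log-likelihoods from a logistic regression:
print(mcfadden_r2(-85.2, -120.6))  # about 0.294
```

A model no better than the null gives 0; pushing LLm towards 0 would push the statistic towards 1, which is why 1 is never reached in practice.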

It is well known that if we fit linear regressions by maximum-likelihood, we get exactly the same parameter estimates as if we fit by ordinary least squares. We can demonstrate this in Stata:

. sysuse auto
. reg price headroom mpg
. glm price headroom mpg

Since ML estimation of the linear regression gives us log-likelihoods, we can calculate pseudo-R2 and true R2 for the same model. This code does it for a range of simple models with Stata’s demonstration “auto” data set:

sysuse auto, clear
glm price                    // null model: intercept only
local basell = e(ll)
local vars "mpg rep78 headroom trunk weight length turn displacement gear_ratio foreign"
local rhs = ""

gen r2 = .
gen mcf = .

local i 0
foreach var in `vars' {
    local i = `i' + 1
    local rhs = "`rhs' `var'"
    qui glm price `rhs'
    local mcfad = 1 - (e(ll) / `basell')
    qui reg price `rhs'
    di %6.3f `=e(r2)' %6.3f `mcfad' " : `rhs'"
    qui replace r2 = `=e(r2)' in `i'
    qui replace mcf = `mcfad' in `i'
}

label var mcf "McFadden Pseudo-R2"
label var r2 "R-squared"
scatter mcf r2

This generates the following graph, in which we see that there is a monotonic but non-linear relationship between the two measures. We can also see very clearly that pseudo-R2 is always substantially lower than R2. Thus it should be clear that while it emulates R2 in spirit, it doesn’t actually approximate it. So when people talk about proportion of variation explained in a logistic regression, shoot them down.

pseudo-R2 vs R2
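The same experiment can be run outside Stata. The sketch below is not the auto data but simulated regressors (numpy assumed): it fits a sequence of nested linear models, computes the Gaussian ML log-likelihood in closed form, and shows McFadden’s statistic trailing well below R2.

```python
import numpy as np

def r2_and_loglik(X, y):
    """R-squared and Gaussian ML log-likelihood of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = len(y)
    ssr = resid @ resid
    # Profile log-likelihood at the MLE sigma^2 = SSR/n.
    ll = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    r2 = 1 - ssr / np.sum((y - y.mean()) ** 2)
    return r2, ll

rng = np.random.default_rng(1)
n = 74                                   # same size as Stata's auto data
Z = rng.normal(size=(n, 5))
y = Z @ np.array([1.0, 0.8, 0.5, 0.3, 0.1]) + rng.normal(size=n)

_, ll0 = r2_and_loglik(np.ones((n, 1)), y)   # null, intercept-only model
results = []
for k in range(1, 6):
    X = np.column_stack([np.ones(n), Z[:, :k]])
    r2, ll = r2_and_loglik(X, y)
    mcf = 1 - ll / ll0
    results.append((r2, mcf))
    print(f"{k} regressors: R2 = {r2:.3f}  McFadden = {mcf:.3f}")
```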

UCAS, ethnicity and admission rates

UCAS, the UK university admissions clearing house, have released data relating to ethnicity and admissions to English universities, in part in response to Vikki Boliver’s research in Sociology suggesting that members of ethnic minorities are less likely to be admitted to Russell Group universities.

The analysis note with the release is sober and correct, showing a mostly consistent pattern of offer rates for ethnic minority students being lower (but not far lower) than expected. However, UCAS’s press release seems to have suggested that the effect is almost explained away, and attributes it to ethnic minority students disproportionately applying to courses with low acceptance rates. This does not seem to be the case.

Update: see also next blog entry.
Continue reading

Substitution costs from transition rates

Given that determining substitution costs in sequence analysis is such a bone of contention, many researchers look for a way to let the data generate the costs. The typical way to do this is by pooling transition rates and defining the substitution cost between states i and j to be:

2 - p(ij) - p(ji)

where p(ij) is the transition rate from state i to state j. Intuitively, states that are closer to each other will have higher transition rates between them, and hence lower substitution costs. Continue reading
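As a concrete sketch (the transition counts below are invented; numpy assumed), the whole cost matrix can be built in a few lines:

```python
import numpy as np

def substitution_costs(counts):
    """Data-driven substitution costs from pooled transition counts.

    counts[i, j] is the number of observed transitions from state i
    to state j; p(ij) is the row-normalised transition rate, and the
    cost of substituting i for j is 2 - p(ij) - p(ji)."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum(axis=1, keepdims=True)
    cost = 2 - p - p.T
    np.fill_diagonal(cost, 0)   # substituting a state for itself is free
    return cost

# Toy example with three states; states 0 and 1 exchange often,
# so substituting one for the other is comparatively cheap.
counts = np.array([[50, 40, 10],
                   [45, 50,  5],
                   [ 5,  5, 90]])
print(substitution_costs(counts))
```

Since p(ij) and p(ji) are each at most 1, the costs fall between 0 and 2, with 2 for pairs of states that never follow one another; the matrix is symmetric by construction.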

Using Emacs to send mail later

There are lots of ways to schedule mail to be sent some time in the future, but it is easy, for those of us who write and send mail from Emacs, to use that program and the Unix atd batch system to do it. If you use message-mode to write messages, this approach means that creating mails for delayed sending is the same as for normal sending.

Continue reading

Mapping with Python and Stata

Elevation data for large swathes of the planet have been collected by NASA and are available to download from NASA’s servers.

The data come in binary files, each representing a 1-degree by 1-degree “square” as a 1201 × 1201 grid of big-endian 16-bit integers. Here are a few lines of Python and four lines of Stata that will turn the data into a simple graph:

import struct

with open("data/N52W011.hgt", "rb") as f:
    for y in range(1201):
        for x in range(1201):
            # Each cell is a big-endian signed 16-bit elevation in metres.
            height, = struct.unpack(">h",
            print(y, x, height)

Redirect the script’s output to a file, e.g. python > /tmp/ext.dat (the name must match the infile command below). Then run this Stata code:

infile i j height using /tmp/ext.dat
gen h2 = int(sqrt(height))
replace h2 = 30 if h2<=0
hmap j i h2, nosc

Low res version of map

(Hi-res version.)

You may need to install Stata’s hmap add-on, which is available from the usual locations; Python’s struct module is part of the standard library, so it needs no installation.

There are better ways of doing this, of course: it’s slow, the aspect ratio is wrong, the colours are not ideal and the axis labelling is bad. Even worse, it is a complete abuse of the hmap add-on. It’s a quick and dirty way to turn binary data into pictures, all the same.

Hedstrom’s Desires-Believes-Acts model in Emacs lisp

Emacs-lisp is a pretty functional language for managing Emacs and automating complex tasks within it, particularly to do with text processing. It’s probably not wise to use it for more general programming or analytical tasks, but every now and then (when I need to procrastinate, mostly) I get carried away.

A few years ago I was reading Peter Hedstrom’s book, Dissecting the Social, and realised his Desires-Believes-Acts model (a kind of cellular automaton) would be easy enough to implement. More recently, I noticed that Emacs’ tools for displaying simple games like Tetris (do “M-x tetris”) would permit a clean display.

In Hedstrom’s model, every cell in a grid may desire an outcome, and may believe it is able to achieve it. A cell that both desires and believes acts. Belief and desire depend on the beliefs and desires of a cell’s neighbours. Generally, even starting from random and sparse distributions of belief and desire, stable configurations emerge within a number of iterations, with systematic segregation; often everyone acts in the end, but sometimes stable oscillating systems emerge.
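The update rule below is only my sketch of the idea, not Hedstrom’s exact specification (and not the Emacs-lisp version): a cell acquires a desire or belief when a majority of its eight neighbours, on a grid that wraps at the edges, already hold it, and it acts when it holds both.

```python
import random

def step(desire, believe):
    """One synchronous update of a toy Desires-Believes-Acts grid.
    desire and believe are n x n lists of booleans; the grid wraps
    around at the edges (a torus)."""
    n = len(desire)

    def majority(grid, i, j):
        # True if most of the eight Moore neighbours hold the trait.
        neigh = [grid[(i + di) % n][(j + dj) % n]
                 for di in (-1, 0, 1) for dj in (-1, 0, 1)
                 if (di, dj) != (0, 0)]
        return sum(neigh) > len(neigh) // 2

    new_d = [[desire[i][j] or majority(desire, i, j)
              for j in range(n)] for i in range(n)]
    new_b = [[believe[i][j] or majority(believe, i, j)
              for j in range(n)] for i in range(n)]
    acts = [[new_d[i][j] and new_b[i][j] for j in range(n)] for i in range(n)]
    return new_d, new_b, acts

random.seed(2)
n = 10
desire = [[random.random() < 0.3 for _ in range(n)] for _ in range(n)]
believe = [[random.random() < 0.3 for _ in range(n)] for _ in range(n)]
for _ in range(20):
    desire, believe, acts = step(desire, believe)
print(sum(map(sum, acts)), "of", n * n, "cells act")
```

Because desires and beliefs are sticky in this sketch (once acquired, never lost), it always settles to a fixed configuration; oscillating systems need rules that also let cells lose a trait.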

Continue reading