Category Archives: Uncategorized

Bug in Stata’s dendrogram code

September 23, 2016 | Uncategorized | brendan

Dendrograms are diagrams that have a tree-like structure, and they’re often used to represent the structure of clustering in a hierarchical (agglomerative) cluster analysis. Agglomerative clustering starts from the bottom up, joining the nearest pairs of objects into clusters, and then clusters with objects and finally clusters with clusters, until eventually everything is a single cluster. The single cluster is the root, the objects are the leaves, and in between is a binary tree, where objects and clusters are combined depending on their distance from each other.

This process depends on being able to define a distance between an object and a cluster, and between pairs of clusters, and there are various ways to do this. However, some algorithms may cause the distance between an object/cluster and another cluster to change after the amalgamation of other clusters. This permits “reversals”, where a later merge happens at a smaller distance than an earlier one; these are difficult or impossible to represent in a dendrogram-like structure. Clustering algorithms or “linkages” such as Ward’s are not subject to this problem.

OK, so far so good. What’s the problem? Stata’s dendrogram code is slightly buggy, and can give an error:
currently can't handle dendrogram reversals
even when you are using a linkage that is not subject to reversals. The explanation is that it is comparing distances between pairs of clusters where one must be greater than or equal to the other for the dendrogram to be drawable (otherwise it’s a reversal), and due to numeric precision it is finding pairs where one is fractionally less. The correct code should instead test that the difference in the distances is not less than a very small negative tolerance (e.g., -10^-7), so that differences due only to precision are not treated as reversals.
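The difference between the two tests can be sketched in a few lines of Python. This is only an illustration of the comparison described above, not Stata's actual code: the function names, the variable names and the tolerance of 1e-7 are all assumptions made for the example.

```python
# Illustrative sketch only: names and tolerance are hypothetical,
# and this is not how Stata's dendrogram code is actually written.
TOLERANCE = 1e-7


def is_reversal_naive(child_height, parent_height):
    """Flag a reversal whenever the parent merge is lower at all."""
    return parent_height < child_height


def is_reversal_tolerant(child_height, parent_height, tol=TOLERANCE):
    """Flag a reversal only if the parent is lower by more than tol."""
    return parent_height - child_height < -tol


# Two heights that are equal in exact arithmetic but not in floating point.
child = 0.1 + 0.2   # stored as 0.30000000000000004
parent = 0.3

print(is_reversal_naive(child, parent))     # spurious reversal
print(is_reversal_tolerant(child, parent))  # treated as equal heights
print(is_reversal_tolerant(0.5, 0.3))       # a genuine reversal is still caught
```

The tolerant version still reports real reversals; it only ignores differences small enough to be rounding noise.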

I have had a number of people report this error to me, in connection with my SADI Stata ado package, and have been able to reproduce it. In cooperation with some of these correspondents, I have been able to get a work-around from Stata Technical Support (this bug persists into Stata 14).

The following command displays the information that Stata holds about the current clustering (apologies for the linewrapping):

. char list _dta[_clus_1_info]
_dta[_clus_1_info]: `"t hierarchical"' `"m wards"' `"d user matrix omd"' `"v id _clus_1_id"' `"v order _clus_1_ord"' `"v height _clus_1_hgt"'

A small change to this will allow the dendrogram to be drawn:

. char _dta[_clus_1_info] `"`"t hierarchical"' `"m wards"' `"d user matrix omd"' `"v id _clus_1_id"' `"v order _clus_1_ord"' `"v height _clus_1_hgt"'"'

Gender change in Irish political representation

February 29, 2016 | Uncategorized | brendan

Estimating the effect of the gender quota: The initial question

A gender quota for candidates was imposed in the 2016 Irish election (see e.g., http://www.thejournal.ie/readme/gender-quota). The question arises whether it had an impact. As a first pass consider the following table:¹

           |        gender
      dail |         f          m |     Total
-----------+----------------------+----------
      2011 |        21        145 |       166 
           |     12.65      87.35 |    100.00 
-----------+----------------------+----------
      2016 |        32        116 |       148 
           |     21.62      78.38 |    100.00 
-----------+----------------------+----------
     Total |        53        261 |       314 
           |     16.88      83.12 |    100.00

This is clearly a big rise. Is it statistically significant?
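One conventional first check on the table above is a Pearson chi-squared test of independence. The sketch below is not necessarily the approach taken in the full post; it assumes no continuity correction and uses only the four cell counts from the table. The function name is made up for the example.

```python
import math


def chi2_2x2(table):
    """Pearson chi-squared test (1 df, no continuity correction) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: 2011 and 2016; columns: female and male, as in the table above.
chi2, p_value = chi2_2x2([[21, 145], [32, 116]])
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

On these counts the statistic is about 4.5, with a p-value a little over 0.03.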

Continue reading Gender change in Irish political representation →

Pollsters’ 3% Margin of Error

February 24, 2016 | Uncategorized | brendan

The 3% margin of error often quoted in polling has the following logic. It uses a sample size of 1000, and calculates a confidence interval as follows: \( \hat \pi \pm 1.96 \times SE \) where \(\hat\pi\) is the sample proportion, and SE, the standard error, is the standard deviation divided by the square-root of the sample size.
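As a worked check of that logic, take the worst case \(\hat\pi = 0.5\), where the binomial standard deviation \(\sqrt{\hat\pi(1-\hat\pi)}\) is at its largest (this value is an assumption made here, not stated above), together with the sample size of 1000:

\[ SE = \sqrt{\frac{\hat\pi(1-\hat\pi)}{n}} = \sqrt{\frac{0.5 \times 0.5}{1000}} \approx 0.0158, \qquad 1.96 \times SE \approx 0.031 \]

which rounds to the familiar 3 percentage points.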

Continue reading Pollsters’ 3% Margin of Error →

Pseudo-R2 is pseudo

January 17, 2016 | Uncategorized | brendan

People like the R² stat from linear regression so much that they re-invent it in places it doesn’t naturally arise, such as logistic regression. The true R² has nice clean interpretations, as the proportion of variation explained or the square of the correlation between observed and predicted values. The fake or pseudo-R² statistics are often based on relating the loglikelihood of the current model against that of the null model (intercept only) in some way. There is a good overview at UCLA.

One of the most popular pseudo-R² statistics is McFadden’s. This is defined as 1 – LLm/LL0, where LLm is the log-likelihood of the current model and LL0 that of the null model. This appears to have the range 0-1, though 1 will never be reached in practice.

It is well known that if we fit linear regressions by maximum-likelihood, we get exactly the same parameter estimates as if we fit by ordinary least squares. We can demonstrate this in Stata:

. sysuse auto
. reg price headroom mpg
. glm price headroom mpg

Since the ML estimation of the linear regression gives us loglikelihoods, we can calculate pseudo-R2 and true R2 for the same model. This code does it for a range of simple models with Stata’s demonstration “auto” data set:

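sysuse auto, clear
* Null (intercept-only) model: its log-likelihood is the McFadden baseline
glm price
local basell = e(ll)
local vars "mpg rep78 headroom trunk weight length turn displacement gear_ratio foreign"
local rhs = ""
* Variables to hold the two statistics, one row per model
gen r2 = .
gen mcf = .
local i 0
* Add one predictor at a time and store both statistics in row `i'
* (rep78 has missing values, so models that include it are fit on fewer observations than the null model)
foreach var in `vars' {
  local i = `i'+1
  local rhs = "`rhs' `var'"
  qui glm price `rhs'
  local mcfad =  1 - (e(ll)/ `basell')
  qui reg price `rhs'
  di %6.3f `=e(r2)' %6.3f `mcfad' " : `rhs'"
  qui replace r2 = `=e(r2)' in `i'
  qui replace mcf = `mcfad' in `i'
}
label var mcf "McFadden Pseudo-R2"
label var r2 "R-squared"
scatter mcf r2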

This generates the following graph, in which we see that there is a monotonic but non-linear relationship between the two measures. We can also see very clearly that pseudo-R2 is always substantially lower than R2. Thus it should be clear that while it emulates R2 in spirit, it doesn’t actually approximate it. So when people talk about proportion of variation explained in a logistic regression, shoot them down.

[Figure: pseudo-R2 vs R2]
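The shape of this relationship can also be checked outside Stata. The Python sketch below fits a one-predictor linear regression to made-up data and computes both statistics. It assumes the Gaussian log-likelihood is evaluated at the maximum-likelihood variance estimate RSS/n, which may not be exactly how Stata's glm computes e(ll). The data and function names are purely illustrative. Under that assumption, McFadden's statistic equals (n/2) log(1 – R2) / LL0, a monotonic but non-linear function of R2.

```python
import math


def gaussian_loglik(rss, n):
    """Gaussian log-likelihood at the ML variance estimate rss / n (an assumption)."""
    sigma2 = rss / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def r2_and_mcfadden(x, y):
    """Return (R2, McFadden pseudo-R2, null log-likelihood) for y regressed on x."""
    n = len(y)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss_model = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    rss_null = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1 - rss_model / rss_null
    ll_model = gaussian_loglik(rss_model, n)
    ll_null = gaussian_loglik(rss_null, n)
    mcfadden = 1 - ll_model / ll_null
    return r2, mcfadden, ll_null


# Made-up data: a linear trend plus a deterministic wobble standing in for noise.
x = list(range(30))
y = [5000 + 100 * i + 1500 * math.sin(3 * i) for i in x]

r2, mcfadden, ll_null = r2_and_mcfadden(x, y)
print(f"R2 = {r2:.3f}, McFadden = {mcfadden:.3f}")
```

Because the null log-likelihood is large in absolute value whenever the outcome has a large variance, the pseudo-R2 is pulled far below R2, which matches the pattern in the graph.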

Update on UCAS post

October 3, 2015 | Uncategorized | brendan

I had an interesting exchange by e-mail with Maggie Smith, the analyst at UCAS responsible for the note discussed in my previous blog entry. She tells me that the conclusion I derived from the released data (the small but significant ethnicity effect) is very much attenuated when the data is broken down by provider, and the published UCAS analysis is based on such disaggregated data.

Continue reading Update on UCAS post →

Multi-processor Stata without Stata-MP

October 13, 2014 | Uncategorized | brendan

Exploit your cores!

If you don’t have Stata-MP, it can be difficult to benefit from all the cores on your computer. However, if your problem can be split up in parts that can run in parallel, it is easy to run multiple instances of Stata. In this note I demonstrate a simple case, using the example of a simulation I wish to run many times.

Continue reading Multi-processor Stata without Stata-MP →

Substitution costs from transition rates

September 24, 2014 | Uncategorized | brendan

Given that determining substitution costs in sequence analysis is such a bone of contention, many researchers look for a way for the data to generate the costs. The typical way to do this is by pooling transition rates and defining the substitution cost to be:

2 – p(ij) – p(ji)

where p(ij) is the transition rate from state i to state j. Intuitively, states that are closer to each other will have higher transition rates between them, and therefore lower substitution costs, and vice versa.

Continue reading Substitution costs from transition rates →
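As a minimal sketch of the definition above, the following Python computes pooled transition rates from a handful of toy sequences and turns them into substitution costs. The sequences and function names are invented for illustration. Two further assumptions are made: rates are pooled over all time points, and the cost of substituting a state for itself is set to zero.

```python
from collections import Counter


def transition_rates(sequences):
    """Pooled rate p(ij): share of transitions out of state i that go to state j."""
    counts = Counter()
    origin_totals = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
            origin_totals[a] += 1
    return {pair: c / origin_totals[pair[0]] for pair, c in counts.items()}


def substitution_costs(sequences):
    """Cost 2 - p(ij) - p(ji) for i != j; zero on the diagonal (an assumption)."""
    rates = transition_rates(sequences)
    states = sorted({s for seq in sequences for s in seq})
    costs = {}
    for i in states:
        for j in states:
            if i == j:
                costs[(i, j)] = 0.0
            else:
                costs[(i, j)] = 2 - rates.get((i, j), 0.0) - rates.get((j, i), 0.0)
    return costs


# Toy data: each string is one sequence, each character one state.
toy_sequences = ["AABB", "ABBC", "AACC"]
costs = substitution_costs(toy_sequences)
for pair in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(pair, round(costs[pair], 3))
```

Pairs of states with no observed transitions in either direction get the maximum cost of 2, and the resulting matrix is symmetric by construction.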

New Sequence Analysis Tools

April 3, 2014 | Uncategorized | brendan

I last released SADI, my sequence analysis tools for Stata, in November 2011. Since then I’ve made various improvements and additions, relating to ongoing work such as that reported in Dept Working Paper WP2012-02 and WP2013-05 (the latter is an early version of a paper that is coming out in the book of the LaCOSA conference, due shortly).
Continue reading New Sequence Analysis Tools →

Hardline materialism in the Irish Times letter page

January 27, 2014 | Uncategorized | Brendan Halpin

The text of my letter published in the Irish Times today (at http://www.irishtimes.com/debate/letters/philosophy-and-science-1.1667425):

Sir, – William Reville (Science, January 16th) criticises materialism as excluding, without evidence, the possibility of the supernatural.

Continue reading Hardline materialism in the Irish Times letter page →

Using Emacs to send mail later

January 27, 2014 | Uncategorized | brendan

There are lots of ways to schedule mail to be sent some time in the future, but it is easy, for those of us who write and send mail from Emacs, to use that program and the Unix atd batch system to do it. If you use message-mode to write messages, this approach means that creating mails for delayed sending is the same as for normal sending.

Continue reading Using Emacs to send mail later →


Sociology, Statistics and Software

Thoughts on computers, data analysis and the social sciences