Bug in Stata’s dendrogram code

Dendrograms are diagrams that have a tree-like structure, and they’re often used to represent the structure of clustering in a hierarchical (agglomerative) cluster analysis. Agglomerative clustering starts from the bottom up, joining the nearest pairs of objects into clusters, then clusters with objects, and finally clusters with clusters, until eventually everything is a single cluster. The single cluster is the root, the objects are the leaves, and in between is a binary tree, where objects and clusters are combined depending on their distance from each other.

This process depends on being able to define a distance between an object and a cluster, and between pairs of clusters, and there are various ways to do this. However, some algorithms may cause the distance between an object/cluster and another cluster to change after the amalgamation of other clusters. This permits “reversals”, where a later join happens at a smaller distance than an earlier one, and these are difficult or impossible to represent in a dendrogram-like structure. Other clustering algorithms or “linkages”, such as Ward’s, are not subject to this problem.
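To make the idea of a reversal concrete, here is a minimal sketch in Python (not Stata). It assumes centroid linkage with plain Euclidean distance as an example of a method that permits reversals; the three points are made up for illustration.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Three hypothetical objects: a and b are the closest pair,
# c sits above the midpoint of a and b.
a, b, c = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

# Step 1: join the nearest pair. That is (a, b), at distance 2.0.
first_height = min(dist(a, b), dist(a, c), dist(b, c))

# Step 2: under centroid linkage the new cluster is represented by
# the centroid of its members, here the midpoint of a and b.
centroid_ab = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
second_height = dist(centroid_ab, c)

# The second join happens at a SMALLER distance than the first:
# that is a reversal, and the tree cannot be drawn with heights
# that increase towards the root.
is_reversal = second_height < first_height
print(first_height, second_height, is_reversal)
```

Here c is further than 2.0 from both a and b individually, but closer than 2.0 to their centroid, so the second join is lower than the first.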

OK, so far so good. What’s the problem? Stata’s dendrogram code is slightly buggy, and can give an error:
currently can't handle dendrogram reversals
even when you are using a linkage that is not subject to reversals. The explanation is that the code compares pairs of cluster distances where one must be greater than or equal to the other for the dendrogram to be drawable (otherwise it’s a reversal), and, because of limited numeric precision, it finds pairs where that distance is fractionally less. Correct code would allow for precision by treating a pair as a reversal only when the shortfall exceeds a very small number (e.g., 10^-7).
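The difference between the two tests can be sketched as follows. This is an illustration in Python of the general floating-point issue, not Stata's actual code; the function names and the tolerance constant are hypothetical, with 10^-7 taken from the suggestion above.

```python
TOLERANCE = 1e-7  # hypothetical value, as suggested in the text

def is_reversal_strict(child_height, parent_height):
    """Naive test: any excess at all counts as a reversal."""
    return child_height > parent_height

def is_reversal_tolerant(child_height, parent_height, tol=TOLERANCE):
    """Tolerant test: only an excess larger than tol counts."""
    return child_height - parent_height > tol

# Two heights that should be equal, but differ in the last bits
# because of floating-point rounding.
child = 0.1 + 0.2   # stored as 0.30000000000000004
parent = 0.3

spurious = is_reversal_strict(child, parent)      # flags a false reversal
tolerated = is_reversal_tolerant(child, parent)   # correctly ignores it
genuine = is_reversal_tolerant(2.0, 1.8)          # a real reversal is still caught
print(spurious, tolerated, genuine)
```

The tolerance has to be small relative to the distances in the data; 10^-7 is only a plausible default, and an appropriate value would depend on the scale of the distance matrix.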

A number of people have reported this error to me in connection with my SADI Stata ado package, and I have been able to reproduce it. Working with some of them, I have obtained a workaround from Stata Technical Support (the bug persists into Stata 14).

The following command displays the information that Stata holds about the current clustering (apologies for the line wrapping):

. char list _dta[_clus_1_info]
_dta[_clus_1_info]: `"t hierarchical"' `"m wards"' `"d user matrix omd"' `"v id _clus_1_id"' `"v order _clus_1_ord"' `"v height _clus_1_hgt"'

A small change to this will allow the dendrogram to be drawn:

. char _dta[_clus_1_info] `"`"t hierarchical"' `"m wards"' `"d user matrix omd"' `"v id _clus_1_id"' `"v order _clus_1_ord"' `"v height _clus_1_hgt"'"'
