Cluster analysis is unstable, we knew that!

Experience tells me that small changes in the data can lead to substantial changes in the solution of a cluster analysis. This is especially true when the space is sparsely populated, as is the case with sequence analysis of lifecourses. Small changes in parameterisation (e.g., substitution costs) can lead to substantial differences in the cluster solution.

However, recently I came across an extreme case of sensitivity. I was setting up a comparison of SQOM (ssc install sqom) and my Stata plugins (to try to understand how SQOM deals with duplicates) and wanted a starting point to show that under basic conditions (no duplicates in the data) both the pairwise distance matrices and the cluster solutions were identical for both implementations. Fundamentally the pairwise distances calculated are identical, but SQOM defaults to giving the raw (integer) distances whereas my plugins standardise by dividing by length of the longer sequence. Since all sequences were the same length this should make no difference.

However, I was puzzled to find that with a moderately small data set, the resulting cluster solutions were similar but distinctly not identical. Further investigation showed that the distance matrices were strictly proportional, and more, that dividing the SQOM matrix by the sequence length resulted in one that was strictly equal to the plugin one.

So I had a situation where an integer distance matrix yielded one solution, but if I divided it by a constant the resulting floating-point distance matrix yielded another. A puzzle that prompted me to e-mail Statalist. While I got no answer, mailing the list worked, as it caused me to think. Even there were no duplicates in the data, there are a lot of ties (pairs whose distance was exactly the same as other pairs). While the original distances have strictly the same rank when raw (integer) as when scaled (floating point), once the process of agglomeration into clusters starts, it is likely that there are also ties between pairs of pairs of clusters, where the tiny imprecision of floating point causes inconsistencies. These inter-cluster differences may be calculated by different sequences of operations, so it is possible that when done on integers the results are exact ties, but that the results on floating-point variables yield results that have tiny differences. This will break the tie (incorrectly) and therefore one particular agglomeration will occur first with floating-point distances, while the tied agglomerations are done at random with integer distances.

The following code demonstrates the problem, downloading a pre-prepared set of distances:

use http://teaching.sociology.ul.ie/bhalpin/dist                                                                   

mkmat d1-d42, mat(D)                                                                                               

clustermat wards D, name(D) add                                                                                    
cluster generate a4=groups(4)                                                                                      

mat D2 = D / 40                                                                                                    

clustermat wards D2, name(D2) add                                                                                  
cluster generate b4=groups(4)                                                                                      

tab a4 b4

This yields the following table:

           |                     b4                                                                                
        a4 |         1          2          3          4 |     Total                                                
-----------+--------------------------------------------+----------                                                
         1 |        17          0          0          0 |        17                                                
         2 |         1          0         12          0 |        13                                                
         3 |         0          8          0          0 |         8                                                
         4 |         0          0          0          4 |         4                                                
-----------+--------------------------------------------+----------                                                
     Total |        18          8         12          4 |        42

The inconsistency is small: only one case out of place (the ordering of the categories is irrelevant), but an utterly trivial difference in input has generated inconsistent outputs.

This illustrates that cluster analysis can be unstable. But is it unstable to an absurd degree? It has a problem with ties, such that (floating-point issues aside) the order of the data can affect the solution; this is well known. Here, the floating-point/integer contrast interacts with the ties problem, in an unexpected way. Moreover, I have observed this only with very small data sets. Larger data sets are less sparse, so that individual sequences have less impact, and the distance between pairs of pairs of larger clusters is less likely to be tied.