With tabulated data, it is clear that adding a new variable
to the table changes the dataset. The number of cells rises by a
factor equal to the number of categories in the new variable, and
the counts are spread more thinly over this larger number of cells.
Because of the risk of sparsity, and loss of parsimony, we
may seek ways to exclude variables from the modelling process.
But there are two ways of excluding a variable: dropping it from
the model (i.e., from the /DESIGN statement) while retaining it in
the table, and not using it in constructing the table at all.
The criterion for excluding a variable from a model is clear:
its exclusion does not raise the deviance significantly. When can
we take the bigger step of excluding it from the table? That is,
when can we collapse the table along that variable?
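The comparison between nested models is a likelihood-ratio test: the rise in deviance when a term is dropped is itself chi-square distributed, with df equal to the difference in the models' df. A minimal sketch in Python (the figures passed in at the end are purely illustrative, not from the example below):

```python
from scipy.stats import chi2

def deviance_test(g2_small, df_small, g2_big, df_big):
    """Compare two nested log-linear models by their deviances (G^2).

    The more restrictive model has the larger deviance; the drop in
    deviance is tested against chi-square on the difference in df.
    """
    drop = g2_small - g2_big
    ddf = df_small - df_big
    p = chi2.sf(drop, ddf)  # upper-tail probability
    return drop, ddf, p

# Illustrative figures: dropping a term costs 5.99 on 2 df, which
# sits right at the conventional 5% boundary.
drop, ddf, p = deviance_test(15.0, 6, 9.01, 4)
```

If p is large, the simpler model fits essentially as well and the term can be excluded from the model statement.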
In general we can collapse a table along a variable that has no
significant interaction with the other terms in the model. For
instance, in an A*B*C table, if C has no significant interaction
with the A*B association, then the marginal association is not
different from the conditional association, and we lose no
important information in modelling the marginal A*B table. As
demonstrated earlier, if such an interaction is present the
marginal table may be entirely misleading.
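The danger can be seen in a toy 2x2x2 example with invented counts: the A-B odds ratio is the same in each layer of C, yet collapsing over C reverses the apparent direction of the association.

```python
import numpy as np

# Invented counts for a 2x2x2 table: one 2x2 A-by-B layer per level of C.
layer0 = np.array([[20,  5], [80, 40]])
layer1 = np.array([[40, 80], [ 5, 20]])

def odds_ratio(t):
    """Cross-product ratio of a 2x2 table."""
    return (t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0])

or0 = odds_ratio(layer0)                    # conditional OR at C=0
or1 = odds_ratio(layer1)                    # conditional OR at C=1
or_marginal = odds_ratio(layer0 + layer1)   # OR in the collapsed table
```

Here both conditional odds ratios equal 2, but the marginal odds ratio is about 0.5: collapsing over a variable that is associated with both A and B can reverse the association entirely.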
If one variable is clearly understood as a dependent
variable, it may be appropriate to collapse the table along
dimensions not related to the dependent variable, even when
interactions are present between that dimension and other
independent variables.
We use an example from Lindsey (p32ff) which he draws from
Fingleton: a survey of shoppers, in particular a 3-way table of
shopping trips by size of town visited (small, medium, large),
mode of transport (walk, bus, car) and frequency (often, seldom).
(See
http://teaching.sociology.ul.ie/~brendan/CDA/datasets/travel.sps.)
If we consider frequency as the dependent variable, we can
think in terms of a logit model: independence in this framework
is given by /DESIGN freq mode*size. The model doesn't
fit, with a deviance of 132.9 for 8 df. Adding a freq*size term
drops the deviance by about 50 for 2 df, which is a significant
improvement. Adding the freq*mode term instead drops the
deviance to 9.02 for 6 df: this is a far more important
interaction. The combined model with both interactions has a
deviance of 4.14 for 4 df: adding freq*size at this stage
is only marginally significant.
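These judgements can be checked directly against the chi-square distribution; for instance, adding freq*size to the model that already has freq*mode drops the deviance by 9.02 - 4.14 = 4.88 on 2 df, a p-value of roughly 0.09:

```python
from scipy.stats import chi2

# Upper-tail chi-square probabilities for the deviances quoted above.
p_indep    = chi2.sf(132.9, 8)        # independence model: fits very badly
p_freqmode = chi2.sf(9.02, 6)         # freq*mode added: acceptable fit
p_both     = chi2.sf(4.14, 4)         # both interactions: good fit
p_gain     = chi2.sf(9.02 - 4.14, 2)  # gain from adding freq*size
```

The last figure is the "marginally significant" improvement: noticeable, but short of the conventional 5% level.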
So we ask the question: should we exclude size from
both the model statement and the table? First we fit the model
without the freq*size interaction on the uncollapsed table.
We find the fit changing from 132.9 for 8 df to 123.9 for 2 df.
Not only does the fit not improve, but we now face the problem
opposite to sparsity: a table that is too small to fit
interesting models to. Therefore we retain
size as a classifying variable.
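For a collapsed two-way freq-by-mode table, the independence fit and its deviance are easy to compute by hand. A sketch with invented counts (the real counts are in the dataset linked above):

```python
import numpy as np

# Invented 2x3 freq-by-mode table (rows: often/seldom; cols: walk/bus/car).
obs = np.array([[50.0, 30.0, 20.0],
                [10.0, 25.0, 40.0]])

# Under independence the fitted count is (row total * column total) / N.
fitted = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()

# Deviance G^2 = 2 * sum(obs * log(obs / fitted)), on (r-1)(c-1) df.
g2 = 2 * np.sum(obs * np.log(obs / fitted))
df = (obs.shape[0] - 1) * (obs.shape[1] - 1)
```

With only (2-1)(3-1) = 2 df, there is very little room between the independence model and the saturated model, which is what "too small to fit interesting models to" means in practice.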