With tabulated data, it is clear that adding a new variable
to the table changes the dataset. The number of cells rises by a
factor equal to the number of categories in the new variable, and
the counts are spread more thinly over this larger number of cells.
Because of the risk of sparsity, and loss of parsimony, we
may seek ways to exclude variables from the modelling process.
But there are two ways of excluding a variable: dropping it from
the model (i.e., from the /DESIGN statement) while retaining it in
the table, and not using it in constructing the table at all.
The criterion for excluding a variable from a model is clear:
its exclusion does not raise the deviance significantly. When can
we take the bigger step of excluding it from the table? That is,
when can we collapse the table along that variable?
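The comparison between nested models is a likelihood-ratio test: the rise in deviance when a term is dropped is itself chi-square distributed, with df equal to the difference in the models' df. A minimal sketch in Python (the figures passed in at the end are purely illustrative, not from the example below):

```python
from scipy.stats import chi2

def deviance_test(g2_small, df_small, g2_big, df_big):
    """Compare two nested log-linear models by their deviances (G^2).

    The more restrictive model has the larger deviance; the drop in
    deviance is tested against chi-square on the difference in df.
    """
    drop = g2_small - g2_big
    ddf = df_small - df_big
    p = chi2.sf(drop, ddf)  # upper-tail probability
    return drop, ddf, p

# Illustrative figures: dropping a term costs 5.99 on 2 df, which
# sits right at the conventional 5% boundary.
drop, ddf, p = deviance_test(15.0, 6, 9.01, 4)
```

If p is large, the simpler model fits essentially as well and the term can be excluded from the model statement.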
In general we can collapse a table along a variable that has no
significant interaction with the other terms in the model. For
instance, in an A*B*C table, if C has no significant interaction
with the A*B association, then the marginal association is not
different from the conditional association, and we lose no
important information in modelling the marginal A*B table. As
demonstrated earlier, if such an interaction is present the
marginal table may be entirely misleading.
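The danger can be seen in a toy 2x2x2 example with invented counts: the A-B odds ratio is the same in each layer of C, yet collapsing over C reverses the apparent direction of the association.

```python
import numpy as np

# Invented counts for a 2x2x2 table: one 2x2 A-by-B layer per level of C.
layer0 = np.array([[20,  5], [80, 40]])
layer1 = np.array([[40, 80], [ 5, 20]])

def odds_ratio(t):
    """Cross-product ratio of a 2x2 table."""
    return (t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0])

or0 = odds_ratio(layer0)                    # conditional OR at C=0
or1 = odds_ratio(layer1)                    # conditional OR at C=1
or_marginal = odds_ratio(layer0 + layer1)   # OR in the collapsed table
```

Here both conditional odds ratios equal 2, but the marginal odds ratio is about 0.5: collapsing over a variable that is associated with both A and B can reverse the association entirely.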
If one variable is clearly understood as a dependent
variable, it may be appropriate to collapse the table along
dimensions not related to the dependent variable, even when
interactions are present between that dimension and other
independent variables.
We use an example from Lindsey (p32ff) which he draws from
Fingleton: a survey of shoppers, in particular a 3-way table of
shopping trips by size of town visited (small, medium, large),
mode of transport (walk, bus, car) and frequency (often, seldom).
(See
http://teaching.sociology.ul.ie/~brendan/CDA/datasets/travel.sps.)
If we consider frequency as the dependent variable, we can
think in terms of a logit model: independence in this framework
is given by /DESIGN freq mode*size. The model doesn't
fit, with a deviance of 132.9 for 8 df. Adding a freq*size term
drops the deviance by about 50 for 2 df, which is a significant
improvement. Adding the freq*mode term instead drops the
deviance to 9.02 for 6 df: this is a far more important
interaction. The combined model with both interactions has a
deviance of 4.14 for 4 df: adding freq*size at this stage
is only marginally significant.
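These judgements can be checked directly against the chi-square distribution; for instance, adding freq*size to the model that already has freq*mode drops the deviance by 9.02 - 4.14 = 4.88 on 2 df, a p-value of roughly 0.09:

```python
from scipy.stats import chi2

# Upper-tail chi-square probabilities for the deviances quoted above.
p_indep    = chi2.sf(132.9, 8)        # independence model: fits very badly
p_freqmode = chi2.sf(9.02, 6)         # freq*mode added: acceptable fit
p_both     = chi2.sf(4.14, 4)         # both interactions: good fit
p_gain     = chi2.sf(9.02 - 4.14, 2)  # gain from adding freq*size
```

The last figure is the "marginally significant" improvement: noticeable, but short of the conventional 5% level.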
So we ask the question: should we exclude size from
both the model statement and the table? First we fit the model
without the freq*size interaction on the uncollapsed table.
We find the fit changing from 132.9 for 8 df to 123.9 for 2 df.
Not only does the fit not improve, but we now face the problem
opposite to sparsity: a table that is too small to fit
interesting models to. Therefore we retain
size as a classifying variable.
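For a collapsed two-way freq-by-mode table, the independence fit and its deviance are easy to compute by hand. A sketch with invented counts (the real counts are in the dataset linked above):

```python
import numpy as np

# Invented 2x3 freq-by-mode table (rows: often/seldom; cols: walk/bus/car).
obs = np.array([[50.0, 30.0, 20.0],
                [10.0, 25.0, 40.0]])

# Under independence the fitted count is (row total * column total) / N.
fitted = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()

# Deviance G^2 = 2 * sum(obs * log(obs / fitted)), on (r-1)(c-1) df.
g2 = 2 * np.sum(obs * np.log(obs / fitted))
df = (obs.shape[0] - 1) * (obs.shape[1] - 1)
```

With only (2-1)(3-1) = 2 df, there is very little room between the independence model and the saturated model, which is what "too small to fit interesting models to" means in practice.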