First, we introduce the use of color and shading to represent sign and magnitude of standardized residuals from a specified model. For unordered categorical variables, we show how the perception of patterns of association can be enhanced by reordering the categories. Second, we introduce sequential mosaics of marginal subtables, together with sequential models for these tables. For a class of sequential models of joint independence, the individual mosaics provide a graphic representation of a partition of the overall G ² for complete independence in the full table into portions attributable to hypotheses about the marginal subtables. These methods are illustrated for multidimensional tables and for models of quasi-independence and quasi-symmetry in square tables.
Keywords: categorical data, marginal tables, graphical display, contingency tables
Table 1 shows data on the relation between hair color and eye color among 592 subjects (students in a statistics course) collected by Snee (1974). The Pearson chi² for these data is 138.3 with 9 degrees of freedom, indicating substantial departure from independence. The question is how to understand the nature of the association between hair and eye color.
                       Hair Color
Eye Color    BLACK   BROWN     RED   BLOND  | Total
--------------------------------------------+------
Brown           68     119      26       7  |   220
Blue            20      84      17      94  |   215
Hazel           15      54      14      10  |    93
Green            5      29      14      16  |    64
--------------------------------------------+------
Total          108     286      71     127  |   592
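The fit statistics quoted for Table 1 can be reproduced with a few lines of code. The following is a minimal sketch in plain Python, not the author's SAS program; the names `COUNTS` and `independence_fit` are illustrative assumptions.

```python
import math

# Counts from Table 1: rows are eye colors, columns are hair colors.
HAIR = ["BLACK", "BROWN", "RED", "BLOND"]
EYE = ["Brown", "Blue", "Hazel", "Green"]
COUNTS = [
    [68, 119, 26, 7],
    [20, 84, 17, 94],
    [15, 54, 14, 10],
    [5, 29, 14, 16],
]


def independence_fit(table):
    """Return (Pearson chi-square, likelihood-ratio G2, df) for independence."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected count under independence
            chi2 += (obs - exp) ** 2 / exp
            if obs > 0:
                g2 += 2.0 * obs * math.log(obs / exp)
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return chi2, g2, df


chi2, g2, df = independence_fit(COUNTS)
print(f"Pearson chi2 = {chi2:.1f}, G2 = {g2:.2f}, df = {df}")
```

The Pearson statistic should agree with the value of 138.3 on 9 df reported in the text; the likelihood-ratio G² from the same fit is the quantity that is partitioned later in the paper.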
A mosaic represents each cell of the table by a rectangle (or "tile") whose area is proportional to the cell count. Figure 1 is a mosaic of the data in Table 1. The mosaic is constructed by dividing a unit square vertically by hair color, then horizontally by eye color within each hair color.
Figure 1: Column proportion mosaic for hair color and eye color. Column widths show the marginal proportions of hair colors. The heights of tiles show the conditional frequency of eye color given hair color. The area of each tile is proportional to cell frequency.
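The construction of the column proportion mosaic can be sketched as follows. This is an illustrative assumption about the geometry only (no spacing between tiles, no drawing); the function name `mosaic_tiles` is hypothetical and is not taken from the software cited in the paper.

```python
# Counts from Table 1: rows are eye colors, columns are hair colors.
HAIR = ["BLACK", "BROWN", "RED", "BLOND"]
EYE = ["Brown", "Blue", "Hazel", "Green"]
COUNTS = [
    [68, 119, 26, 7],
    [20, 84, 17, 94],
    [15, 54, 14, 10],
    [5, 29, 14, 16],
]


def mosaic_tiles(table):
    """Return {(row, col): (x, y, width, height)} for tiles in the unit square.

    Column widths are marginal column proportions; tile heights are the
    conditional proportions of the row categories within each column.
    """
    n = sum(map(sum, table))
    tiles = {}
    x = 0.0
    for j in range(len(table[0])):
        col_total = sum(row[j] for row in table)
        width = col_total / n
        y = 0.0
        for i in range(len(table)):
            height = table[i][j] / col_total
            tiles[(i, j)] = (x, y, width, height)
            y += height
        x += width
    return tiles


tiles = mosaic_tiles(COUNTS)
for (i, j), (x, y, w, h) in sorted(tiles.items()):
    print(f"{EYE[i]:5s} {HAIR[j]:5s} area = {w * h:.4f}")
```

Because width times height equals the cell proportion, the tile areas sum to one and each area is proportional to the cell count, as the caption of Figure 1 states.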
Further variables are introduced by recursively subdividing each tile by the conditional proportions of the categories of the next variable in each cell, alternating on the vertical and horizontal dimensions of the display. Note that like the scatterplot matrix, this scheme allows an arbitrary number of variables to be represented, with the main limitation being resolution in the display. The final tiles are separated slightly, with greater spacing at the earlier subdivisions, to make small counts more visible and to allow visual aggregation of the tiles for nested divisions on the horizontal and vertical dimensions. See Friendly (1992c) and Wang (1985) for detailed descriptions of algorithms for constructing the basic mosaic display.
The present paper extends the use of the mosaic display as a data-analytic tool in two ways. First, for a given display we can fit a baseline model of independence or partial independence and use color and shading of the tiles to reflect departures from that model. For unordered categorical variables, perception of the pattern of association can be enhanced by reordering the categories to put residuals of like signs in opposite corners. A general scheme for reordering categories is based on a singular value decomposition (SVD) of residuals from independence. Second, for multiway tables we find it useful to examine a sequence of mosaic displays of marginal subtables as successive variables are brought into the cross-classification. While any log-linear model can be fit to the full table, a class of sequential models of joint independence provides a graphic representation of a partition of the overall likelihood-ratio G ² for complete independence in the full table into portions attributable to hypotheses about the marginal subtables.
Figure 2: Enhanced mosaic, reordered and shaded. Deviations from independence are shown by color and shading. Positive deviations have solid outlines and are shaded blue. Negative deviations have dashed outlines and are shaded red. The two levels of shading density correspond to standardized deviations greater than 2 and 4 in absolute value. This form of the display generalizes readily to multi-way tables.
-----------------------
(1) Mosaic displays are most effective in color. However, the sign information is lost if the figure is reproduced in monochrome, so we represent the sign of the deviation redundantly. One solution is to add a stick-on dot with a + sign to cells with positive residuals, which seems to work well. The software for mosaic displays (Friendly, 1992c) provides a number of grayscale and cross-hatch schemes as well.
-----------------------
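The shading rule described for Figure 2 can be sketched as a mapping from standardized (Pearson) residuals to a color and a density level. The function names and the return encoding below are assumptions made for illustration; only the thresholds of 2 and 4 and the blue/red convention come from the caption.

```python
import math

# Counts from Table 1: rows are eye colors, columns are hair colors.
HAIR = ["BLACK", "BROWN", "RED", "BLOND"]
EYE = ["Brown", "Blue", "Hazel", "Green"]
COUNTS = [
    [68, 119, 26, 7],
    [20, 84, 17, 94],
    [15, 54, 14, 10],
    [5, 29, 14, 16],
]


def pearson_residuals(table):
    """Residuals (observed - expected) / sqrt(expected) under independence."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return [
        [(obs - row_tot[i] * col_tot[j] / n) / math.sqrt(row_tot[i] * col_tot[j] / n)
         for j, obs in enumerate(row)]
        for i, row in enumerate(table)
    ]


def shade(residual):
    """Map a residual to (color, density): density 1 for |d| > 2, 2 for |d| > 4."""
    size = abs(residual)
    density = 2 if size > 4 else 1 if size > 2 else 0
    if density == 0:
        return ("none", 0)
    return ("blue" if residual > 0 else "red", density)


residuals = pearson_residuals(COUNTS)
shading = {(EYE[i], HAIR[j]): shade(d)
           for i, row in enumerate(residuals) for j, d in enumerate(row)}
for cell, style in shading.items():
    print(cell, style)
```

Encoding sign and magnitude separately keeps the sign recoverable when a figure is reproduced in monochrome, since the sign can then be mapped to outline style instead of hue.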
While the categories in small tables can often be rearranged by inspection, a more general approach is based on correspondence analysis (CA) (e.g., Greenacre, 1984), which assigns scores to the categories so that the Pearson correlation of the optimally scaled variables is maximized. For a two-way table the scores for the row categories, $x_{im}$, and column categories, $y_{jm}$, on dimension $m = 1, \ldots, M$ are derived from the SVD of the Pearson residuals $d_{ij}$ to account for the largest proportion of the chi² in a small number of dimensions. This decomposition may be expressed as $d_{ij} / \sqrt{n} = \sum_{m=1}^{M} \lambda_m x_{im} y_{jm}$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_M$ and $M = \min(I-1, J-1)$. In $M$ dimensions, the decomposition is exact. A rank-$d$ approximation is obtained from the first $d$ singular values, and the proportion of chi² accounted for by this approximation is $n \sum_{m=1}^{d} \lambda_m^2 / \chi^2$.
Thus, assigning the scores $x_{i1}$ to the row categories and $y_{j1}$ to the column categories provides the maximum Pearson (canonical) correlation, equal to $\lambda_1$, between these optimally scaled variables. Therefore, rearranging row or column categories according to the CA scores $x_{i1}$ or $y_{j1}$ on the first (largest) dimension should provide an ordering for the mosaic display to best reveal the pattern of association. This ordering captures the nature of the association to the extent that $\lambda_1^2 / \sum_{m=1}^{M} \lambda_m^2$ is large. For the hair-eye data, for example, the singular values are .456 (89%), .149, and .051. A plot of the row and column points for the first two dimensions, shown in Figure 3, confirms that the order of the scores for eye colors on the first dimension is precisely the order determined by inspection from Figure 1.
Figure 3: Correspondence analysis plot. Positions of points for hair color (all caps) and eye colors (initial cap) on the first (largest) dimension are used to rearrange categories in the mosaic display.
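The reordering step can be sketched without a linear algebra library by extracting only the first singular triplet of the scaled residual matrix with power iteration. This is a stand-in technique chosen for brevity, not the method used by the author's software, and it orders categories by the raw singular vectors, which is an assumption; correspondence analysis programs often rescale these scores by the marginal proportions.

```python
import math

# Counts from Table 1: rows are eye colors, columns are hair colors.
HAIR = ["BLACK", "BROWN", "RED", "BLOND"]
EYE = ["Brown", "Blue", "Hazel", "Green"]
COUNTS = [
    [68, 119, 26, 7],
    [20, 84, 17, 94],
    [15, 54, 14, 10],
    [5, 29, 14, 16],
]


def scaled_residuals(table):
    """Matrix of Pearson residuals divided by sqrt(n)."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    out = []
    for i, row in enumerate(table):
        exp_row = [row_tot[i] * col_tot[j] / n for j in range(len(row))]
        out.append([(obs - exp) / math.sqrt(exp * n) for obs, exp in zip(row, exp_row)])
    return out


def first_singular_triplet(matrix, iterations=200):
    """Power iteration on D'D; returns (singular value, row vector, column vector)."""
    rows, cols = len(matrix), len(matrix[0])
    v = [float(j + 1) for j in range(cols)]  # arbitrary starting vector
    for _ in range(iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v


sigma, row_scores, col_scores = first_singular_triplet(scaled_residuals(COUNTS))
eye_order = [EYE[i] for i in sorted(range(len(EYE)), key=lambda i: row_scores[i])]
hair_order = [HAIR[j] for j in sorted(range(len(HAIR)), key=lambda j: col_scores[j])]
print(f"first singular value = {sigma:.3f}")
print("eye order: ", eye_order)
print("hair order:", hair_order)
```

The sign of a singular vector is arbitrary, so the resulting order may come out reversed; for a mosaic only the relative order of the categories matters.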
For example, the model of complete independence, the log-linear model [A] [B] [C] for a three-way table, puts all higher terms, and hence all association among the variables, into the residuals. Another possibility is to fit the model in which variable C is jointly independent of variables A and B, the log-linear model [ A B ] [ C ]. Residuals from this model show the extent to which variable C is related to the combinations of variables A and B, but they do not show any association between A and B. The simplest extension of joint independence to four variables is the model [ A B C ] [ D ].
For example, with the data from Table 1 broken down by sex, fitting the model [HairEye][Sex] allows us to see the extent to which the joint distribution of hair color and eye color is associated with sex. For this model, the likelihood-ratio G ² is 29.35 on 15 df (p = .015), indicating some lack of fit. The three-way mosaic, shown in Figure 4, highlights two cells: males are underrepresented among people with brown hair and brown eyes, and overrepresented among people with brown hair and blue eyes. Females in these cells have the opposite patterns, with residuals just shy of ±2. The squared residuals $d_{ij}^2$ for these four cells account for 15.3 of the chi² for the model [HairEye] [Sex]. Hence, except for these cells, hair color and eye color appear unassociated with sex.
Figure 4: Mosaic display for hair color, eye color, and sex. Each tile from Figure 2 is divided in proportion to the frequencies of males and females in that cell (the division by sex is fictitious). Residuals from the model [Hair Eye] [Sex] are shown by shading.
In particular, the series of mosaic plots fitting models of joint independence to the marginal subtables can be viewed as partitioning the hypothesis of complete independence in the full table. For a three-way table, the hypothesis of complete independence, $H_{A \otimes B \otimes C}$, can be expressed as
$H_{A \otimes B \otimes C} = H_{A \otimes B} \cap H_{AB \otimes C}$    (1)
where $H_{A \otimes B}$ denotes the hypothesis that A and B are independent in the marginal subtable formed by collapsing over variable C, and $H_{AB \otimes C}$ denotes the hypothesis of joint independence of C from the AB combinations. When expected frequencies under each hypothesis are estimated by maximum likelihood, the likelihood ratio G ²s are additive:
$G^2_{A \otimes B \otimes C} = G^2_{A \otimes B} + G^2_{AB \otimes C}$    (2)
This partitioning scheme extends readily to higher-way tables. For the hair-eye data, the mosaic displays for the [Hair] [Eye] marginal table and the [HairEye] [Sex] table in Figure 2 and Figure 4 can be viewed as representing the partition
Model                  df       G²
------------------------------------------
[Hair] [Eye]            9   146.44
[Hair, Eye] [Sex]      15    29.35
------------------------------------------
[Hair] [Eye] [Sex]     24   175.79
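The additivity of the likelihood-ratio statistics can be verified from the closed-form maximum likelihood estimates of the three models. In the sketch below the notation is an assumption introduced for illustration: $n_{ijk}$ is a cell count, a plus sign in a subscript denotes summation over that index, and $n$ is the grand total. The fitted values are $n_{i++} n_{+j+} n_{++k} / n^2$ under complete independence, $n_{ij+} n_{++k} / n$ under joint independence, and $n_{i++} n_{+j+} / n$ for independence in the marginal AB table.

```latex
\begin{aligned}
G^2_{A \otimes B \otimes C}
  &= 2 \sum_{i,j,k} n_{ijk} \log \frac{n_{ijk}\, n^2}{n_{i++}\, n_{+j+}\, n_{++k}} \\
  &= 2 \sum_{i,j,k} n_{ijk} \log \frac{n_{ijk}\, n}{n_{ij+}\, n_{++k}}
   + 2 \sum_{i,j} n_{ij+} \log \frac{n_{ij+}\, n}{n_{i++}\, n_{+j+}} \\
  &= G^2_{AB \otimes C} + G^2_{A \otimes B}
\end{aligned}
```

The second line splits the logarithm into two factors; the second factor does not depend on $k$, so summing over $k$ replaces $n_{ijk}$ by $n_{ij+}$. The degrees of freedom add in the same way.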
This sequence of models of joint independence has another interpretation when the ordering of the variables is based on a set of ordered causal hypotheses regarding the relationships among variables (Goodman, 1973). Suppose, for example, that the causal ordering of four variables is A --> B --> C --> D, where the arrow means "is antecedent to". Goodman suggests that the conditional joint probabilities of B, C, and D given A can be characterized by the recursive logit models which treat B as a response to A, C as a response to A and B jointly, and D as a response to A, B and C. These are equivalent to the log-linear models which we fit as the sequential baseline models of joint independence, namely [A] [B], [AB] [C], and [ABC] [D]. The combination of these models with the marginal probabilities of A gives a characterization of the joint probabilities of all four variables.
Two versions of the program have been developed. The generalized version applies to any log-linear model which can be estimated by iterative proportional fitting (IPF). It fits either a user-specified model or any of the sequential baseline models of mutual independence, joint independence, or conditional independence, calculates Pearson or deviance residuals, and produces mosaic displays of sequential subsets of variables up to the full n-way table. This version of the program (mosaics.sas) is available for ftp transfer, together with a User's Guide. See my home page for further information.
A specialized version of the program, under development, produces the mosaic display from tables of counts and residuals calculated externally. This version is designed to be used with other software for fitting models which cannot be estimated by IPF, such as models for ordered variables or square tables which require Newton-Raphson iteration.
              Pre-      Extra-                  Still
Gender      Marital    Marital    Divorced    Married
------------------------------------------------------
Men            N          N           68         130
               N          Y           17           4
               Y          N           60          42
               Y          Y           28          11
Women          N          N          214         322
               N          Y           36           4
               Y          N           54          25
               Y          Y           17           4
------------------------------------------------------
In this analysis we consider the variables in the order G, P, E, and M. That is, in the first stage, we treat P as a response to G and examine the [Gender][Pre] mosaic to assess whether gender has an effect on pre-marital sex. The second stage treats E as a response to G and P jointly; we examine the mosaic for [Gender, Pre] [Extra] for evidence that extra-marital sex is related to either gender or pre-marital sex. Finally, the mosaic for [Gender, Pre, Extra] [Marital] is examined for evidence of the dependence of marital status on the three previous variables jointly.
Each stage results in a fitted model for the corresponding marginal table. As noted in Section 3, these models are equivalent to the recursive logit models whose path diagram is G --> P --> E --> M. The G ² values for these models shown below provide a decomposition of the G ² for the model of complete independence fit to the full table.
Model              df        G²
-----------------------------------
[G] [P]             1    75.259
[GP] [E]            3    48.929
[GPE] [M]           7   107.956
-----------------------------------
[G][P][E][M]       11   232.142
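The decomposition in this table can be reproduced from the counts in the marital status table, because each sequential model has closed-form fitted values (products of marginal totals). The sketch below is one way to organize the calculation; the helper names `margin` and `g2_independent_blocks` are illustrative assumptions.

```python
import math

# Counts from the marital status table: (gender, pre, extra, divorced, married).
ROWS = [
    ("Men", "N", "N", 68, 130), ("Men", "N", "Y", 17, 4),
    ("Men", "Y", "N", 60, 42), ("Men", "Y", "Y", 28, 11),
    ("Women", "N", "N", 214, 322), ("Women", "N", "Y", 36, 4),
    ("Women", "Y", "N", 54, 25), ("Women", "Y", "Y", 17, 4),
]
COUNTS = {}
for g, p, e, divorced, married in ROWS:
    COUNTS[(g, p, e, "Divorced")] = divorced
    COUNTS[(g, p, e, "Married")] = married
G, P, E, M = 0, 1, 2, 3  # positions of the variables in each cell key


def margin(table, axes):
    """Sum a table (dict keyed by tuples) over all positions not in axes."""
    out = {}
    for cell, n in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0) + n
    return out


def g2_independent_blocks(counts, blocks):
    """G2 for mutual independence of variable blocks in their marginal table."""
    axes = [a for block in blocks for a in block]
    table = margin(counts, axes)
    total = sum(table.values())
    block_margins, start = [], 0
    for block in blocks:
        idx = tuple(range(start, start + len(block)))
        block_margins.append((idx, margin(table, idx)))
        start += len(block)
    g2 = 0.0
    for cell, n in table.items():
        expected = total
        for idx, marg in block_margins:
            expected *= marg[tuple(cell[i] for i in idx)] / total
        if n > 0:
            g2 += 2.0 * n * math.log(n / expected)
    return g2


stage_g2 = {
    "[G] [P]": g2_independent_blocks(COUNTS, [(G,), (P,)]),
    "[GP] [E]": g2_independent_blocks(COUNTS, [(G, P), (E,)]),
    "[GPE] [M]": g2_independent_blocks(COUNTS, [(G, P, E), (M,)]),
}
complete = g2_independent_blocks(COUNTS, [(G,), (P,), (E,), (M,)])
for model, value in stage_g2.items():
    print(f"{model:10s} {value:8.3f}")
print(f"{'[G][P][E][M]':10s} {complete:8.3f}")
```

The three stage statistics should sum to the statistic for complete independence up to floating-point error, which is the partition the table reports.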
The [Gender] [Pre] mosaic is shown in Figure 5. G ² for the model [G] [P] indicates that gender and reported pre-marital sex are highly associated. The mosaic shows that men are much more likely to report pre-marital sex than are women; the sample odds ratio is 3.7. We also see that women are about twice as prevalent as men in this sample.
Figure 5: Mosaic display for gender and pre-marital sexual experience. Residuals from the model [G] [P] are shown by shading.
For the second stage, the [Gender, Pre] [Extra] mosaic is shown in Figure 6. G ² for the model [G P][E] is 48.93 on 3 df, indicating that extra-marital sex is not independent of gender and pre-marital sex jointly. From the pattern of deviations in Figure 6 we see that men and women who have reported pre-marital sex are far more likely to report extra-marital sex than those who have not. From the marginal totals for the [GP] [E] table, the conditional odds ratio of extra-marital sex is 3.61 for men and 3.56 for women. The pattern of deviations in the mosaic suggests the need for a [P E] term in an explanatory model for extra-marital sex, but a [G E] term incorporating an association between gender and extra-marital sex, given pre-marital sex, appears unnecessary.
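The odds ratios quoted for the first two stages follow directly from marginal totals of the marital status table. A minimal sketch, with the helper names `total` and `odds_ratio` introduced here for illustration:

```python
# Counts from the marital status table: (gender, pre, extra, divorced, married).
ROWS = [
    ("Men", "N", "N", 68, 130), ("Men", "N", "Y", 17, 4),
    ("Men", "Y", "N", 60, 42), ("Men", "Y", "Y", 28, 11),
    ("Women", "N", "N", 214, 322), ("Women", "N", "Y", 36, 4),
    ("Women", "Y", "N", 54, 25), ("Women", "Y", "Y", 17, 4),
]


def total(gender, pre, extra=None):
    """Number of respondents in a gender x pre (x extra) cell, summed over marital status."""
    return sum(d + m for g, p, e, d, m in ROWS
               if g == gender and p == pre and (extra is None or e == extra))


def odds_ratio(a, b, c, d):
    """Odds ratio (a / b) / (c / d) for a 2 x 2 table laid out as [[a, b], [c, d]]."""
    return (a * d) / (b * c)


# Stage 1: odds of reporting pre-marital sex, men relative to women.
or_gender_pre = odds_ratio(total("Men", "Y"), total("Men", "N"),
                           total("Women", "Y"), total("Women", "N"))

# Stage 2: odds of reporting extra-marital sex, pre-marital yes relative to no,
# computed separately for each gender.
or_extra = {
    g: odds_ratio(total(g, "Y", "Y"), total(g, "Y", "N"),
                  total(g, "N", "Y"), total(g, "N", "N"))
    for g in ("Men", "Women")
}

print(f"gender x pre-marital odds ratio: {or_gender_pre:.1f}")
for g, value in or_extra.items():
    print(f"extra-marital odds ratio, {g}: {value:.2f}")
```

The near equality of the two conditional odds ratios is what makes a [G E] term unnecessary once [P E] is in the model.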
Figure 6: Mosaic display for gender, pre-marital and extra-marital sexual experience. Deviations from the model of joint independence, [GP] [E] are shown.
The mosaic for the model [Gender, Pre, Extra] [Marital] for the final stage is shown in Figure 7. G ² for this model is 107.96 on 7 df, indicating that marital status depends strongly on gender, pre-marital sex, and extra-marital sex jointly. The relationship displayed by the pattern of deviations in the mosaic is more complex than a single interaction. Among those reporting no pre-marital sex (bottom part of Figure 7), there is a similar pattern of cell sizes and deviations for marital status in relation to gender and extra-marital sex: given that people did not report pre-marital sexual experience, they are more likely to still be married if they report no extra-marital sex and more likely to be divorced if they did. Among those who do report pre-marital sex (top part of Figure 7), there is also a similar pattern of sign of deviations, positive for those who are divorced, negative for those who are married.
Figure 7: Four-way mosaic for gender, pre-marital sex, extra-marital sex and marital status. Deviations from the model of joint independence, [GPE] [M] are shown by shading. The pattern of residuals suggests some terms to be included in an explanatory model.
The four 2 x 2 blocks of the mosaic show the conditional relation of extra-marital sex to marital status. Comparing these, we see that the odds ratios of divorce in relation to reported extra-marital sex are considerably larger for men and women who also reported pre-marital sex. These observations imply the need to incorporate effects [PM] and [EM] of pre-marital and extra-marital sex on marital status, and probably the interaction [PEM] into an explanatory model.
Thus, Figure 7 suggests the relationship between marital status and gender, pre-marital sex and extra-marital sex can be modelled by including the two-way associations [PM] and [EM], or the three-way term [PEM]. Since this stage considers marital status as a response to gender, pre-marital sex and extra-marital sex, we would normally fit the [GPE] marginal table, and consider the models [GPE] [PM] [EM] or [GPE] [PEM] for the complete table.
The model [GPE] [PM] [EM] does not fit particularly well, producing G ² = 18.16 on 5 df ( p = .0028 ). To see why, we display the residuals from this model in Figure 8. Only one cell has a standardized residual which exceeds 2: there are more still-married men who reported both pre-marital sex and extra-marital sex than the model predicts. The contribution to chi² from this cell is d ² = 6.92, which is not large enough to account for the lack of fit. Examining the signs of residuals in each of the four corner blocks in Figure 8, we see that the relationship of extra-marital sex to marital status runs in opposite directions in the NW and SE blocks, and likewise in the SW and NE blocks, suggesting an interaction between P and E in their effects on marital status, that is, the model [GPE] [PEM]. This model does indeed fit quite well, G ² = 5.25 with 4 df ( p = .26 ). Further examination reveals a small but significant main effect of gender, resulting in G ² = 0.70 on 3 df ( p = .87 ) for the model [GPE] [PEM] [GM].
Figure 8: Four-way mosaic for gender, pre-marital sex, extra-marital sex and marital status. Shading shows residuals from the model [GPE] [PM] [EM]. The signs of residuals suggest the need for a [PEM] term in the model.
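The explanatory models discussed above can be fit by iterative proportional fitting. The following is a minimal sketch of IPF for a table stored as a dictionary, assuming each model is specified by the sets of variables whose margins are fitted; it is not the mosaics.sas implementation, and the function names are hypothetical.

```python
import math

# Counts from the marital status table: (gender, pre, extra, divorced, married).
ROWS = [
    ("Men", "N", "N", 68, 130), ("Men", "N", "Y", 17, 4),
    ("Men", "Y", "N", 60, 42), ("Men", "Y", "Y", 28, 11),
    ("Women", "N", "N", 214, 322), ("Women", "N", "Y", 36, 4),
    ("Women", "Y", "N", 54, 25), ("Women", "Y", "Y", 17, 4),
]
COUNTS = {}
for g, p, e, divorced, married in ROWS:
    COUNTS[(g, p, e, "Divorced")] = divorced
    COUNTS[(g, p, e, "Married")] = married
G, P, E, M = 0, 1, 2, 3  # positions of the variables in each cell key


def margin(table, axes):
    """Sum a table (dict keyed by tuples) over all positions not in axes."""
    out = {}
    for cell, n in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + n
    return out


def ipf(counts, margins, tol=1e-8, max_iter=500):
    """Fit the log-linear model whose fitted margins are listed in `margins`."""
    fitted = {cell: 1.0 for cell in counts}
    for _ in range(max_iter):
        largest_change = 0.0
        for axes in margins:
            observed = margin(counts, axes)
            current = margin(fitted, axes)
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                new = fitted[cell] * observed[key] / current[key]
                largest_change = max(largest_change, abs(new - fitted[cell]))
                fitted[cell] = new
        if largest_change < tol:
            break
    return fitted


def g2(counts, fitted):
    """Likelihood-ratio statistic 2 * sum(n * log(n / m))."""
    return 2.0 * sum(n * math.log(n / fitted[cell]) for cell, n in counts.items() if n > 0)


models = {
    "[GPE] [PM] [EM]": [(G, P, E), (P, M), (E, M)],
    "[GPE] [PEM]": [(G, P, E), (P, E, M)],
    "[GPE] [PEM] [GM]": [(G, P, E), (P, E, M), (G, M)],
}
fit_g2 = {name: g2(COUNTS, ipf(COUNTS, margins)) for name, margins in models.items()}
for name, value in fit_g2.items():
    print(f"{name:18s} G2 = {value:6.2f}")
```

Pearson residuals from any of these fits, (n - m) / sqrt(m), are what the shading in the corresponding mosaic displays.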
The process of finding an acceptable model for these data could clearly be carried out numerically, by fitting all possible models, or using a method of forward- or backward-selection. For multi-dimensional tables with higher-order associations, the interpretation of the log-linear parameters for these associations is often difficult. The sequence of mosaic displays reveals the pattern of these associations as each variable is included. As we move from a baseline fit to an explanatory model these associations are eliminated from the mosaic. Hence, we can think of the process of finding an acceptable model as "cleaning the mosaic".
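Having found an acceptable model, other displays may be useful for presentation purposes. If we focus solely on predicting marital status, the minimal adequate log-linear model [GPE] [PEM] is equivalent to the logit model in which the log odds of marital status depend on pre-marital sex, extra-marital sex, and their interaction:
$\mathrm{logit}(M \mid G, P = i, E = j) = \alpha + \beta^{P}_{i} + \beta^{E}_{j} + \beta^{PE}_{ij}$    (3)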
To interpret this model we plot the observed and predicted logits for the combinations of gender, pre-marital and extra-marital sex in Figure 9. The interaction of pre-marital and extra-marital sex is clear: Those who report extra-marital sex are much more likely to be divorced than those who do not. However, pre-marital sex with another partner makes the faithful spouses more likely to be divorced, while it reduces the odds of divorce for those who strayed.
Figure 9: Effects of Pre- and Extra-marital sex on Marital status. Points for men are filled, points for women are unfilled.
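The observed logits behind a plot of this kind can be computed directly from the marital status table. The sketch below covers only the observed log odds of divorce (divorced versus still married); the predicted logits would come from a fitted model and are not computed here.

```python
import math

# Counts from the marital status table: (gender, pre, extra, divorced, married).
ROWS = [
    ("Men", "N", "N", 68, 130), ("Men", "N", "Y", 17, 4),
    ("Men", "Y", "N", 60, 42), ("Men", "Y", "Y", 28, 11),
    ("Women", "N", "N", 214, 322), ("Women", "N", "Y", 36, 4),
    ("Women", "Y", "N", 54, 25), ("Women", "Y", "Y", 17, 4),
]

# Observed log odds of being divorced for each gender x pre x extra combination.
logits = {(g, p, e): math.log(divorced / married) for g, p, e, divorced, married in ROWS}

for (g, p, e), value in logits.items():
    print(f"{g:6s} pre={p} extra={e}  logit(divorced) = {value:6.2f}")
```

Reading the output by rows shows the crossover described in the text: among those reporting no extra-marital sex the logit rises with pre-marital sex, while among those reporting extra-marital sex it falls.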
While these methods are useful for unstructured tables, they do not serve to display the residuals from models less restrictive than complete independence that apply to square tables and tables with ordered categories. We illustrate the use of mosaic displays for assessing models of independence and symmetry in square tables.
Table 3 presents visual acuity data compiled from 7477 women employees, aged 30-39, in Britain's Royal Ordnance factories in 1943-1946 from Stuart (1953), analyzed by Bishop, Fienberg and Holland (1975) and many others. For each person the left and right eyes were classified into vision grades, from 1 (highest) to 4 (lowest).
Right Eye            Left Eye Grade
Grade            1       2       3       4
---------------------------------------------
1             1520     266     124      66
2              234    1512     432      78
3              117     362    1772     205
4               36      82     179     492
Figure 10: Fit of independence model for vision data
It is more reasonable to ask whether eye grades are independent for the off-diagonal cells, that is, when the two eyes differ. The model of quasi-independence ignores the diagonal cells. Even so there is considerable lack of fit, G ² = 199 on 5 df. The mosaic display for this model (Figure 11) shows mostly positive residuals along the principal sub-diagonals: when eye grades differ, they are most likely to differ by one eye grade.
Figure 11: Fit of quasi-independence model for vision data
Figure 12: Fit of symmetry model for vision data
The symmetry model, however, assumes marginal homogeneity: that the marginal distributions of eye grades are the same for both eyes. A model of quasi-symmetry tests for symmetry ignoring marginal differences. The fit of this model is just acceptable, G ² = 7.27 on 3 df (p = .06). The mosaic display (Figure 13), however, shows a suspiciously consistent pattern of signs on the off-diagonal cells: there is still a small indication of right-eye superiority, even allowing for marginal inhomogeneity.
Figure 13: Fit of quasi-symmetry model for vision data
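The quasi-independence and quasi-symmetry fits for the vision data can be sketched with a small generalization of proportional fitting: the fitted values are rescaled in turn so that their sums over each group of cells match the observed sums. This is an illustrative stand-in, not the Newton-Raphson approach the paper mentions for such models; the function `fit_by_scaling` and the grouping functions are assumptions. Quasi-independence fits row and column totals over the off-diagonal cells only, and quasi-symmetry fits row totals, column totals, and the sums of each symmetric pair of cells.

```python
import math

# Vision data: rows are right-eye grades 1-4, columns are left-eye grades 1-4.
VISION = [
    [1520, 266, 124, 66],
    [234, 1512, 432, 78],
    [117, 362, 1772, 205],
    [36, 82, 179, 492],
]
K = len(VISION)
ALL_CELLS = [(i, j) for i in range(K) for j in range(K)]
OFF_DIAGONAL = [(i, j) for i, j in ALL_CELLS if i != j]


def by_row(i, j):
    return ("row", i)


def by_col(i, j):
    return ("col", j)


def by_pair(i, j):
    return ("pair", min(i, j), max(i, j))


def fit_by_scaling(table, cells, groupings, tol=1e-8, max_iter=5000):
    """Rescale fitted values until group sums match observed sums for every grouping."""
    fitted = {cell: 1.0 for cell in cells}
    for _ in range(max_iter):
        largest_change = 0.0
        for group_of in groupings:
            observed, current = {}, {}
            for i, j in cells:
                key = group_of(i, j)
                observed[key] = observed.get(key, 0.0) + table[i][j]
                current[key] = current.get(key, 0.0) + fitted[(i, j)]
            for i, j in cells:
                key = group_of(i, j)
                new = fitted[(i, j)] * observed[key] / current[key]
                largest_change = max(largest_change, abs(new - fitted[(i, j)]))
                fitted[(i, j)] = new
        if largest_change < tol:
            break
    return fitted


def g2(table, fitted):
    """Likelihood-ratio statistic over the cells that were fitted."""
    return 2.0 * sum(table[i][j] * math.log(table[i][j] / m) for (i, j), m in fitted.items())


g2_quasi_independence = g2(VISION, fit_by_scaling(VISION, OFF_DIAGONAL, [by_row, by_col]))
g2_quasi_symmetry = g2(VISION, fit_by_scaling(VISION, ALL_CELLS, [by_row, by_col, by_pair]))
print(f"quasi-independence G2 = {g2_quasi_independence:.1f}")
print(f"quasi-symmetry     G2 = {g2_quasi_symmetry:.2f}")
```

Excluding the diagonal cells from the fit is equivalent to fitting them exactly, which is why they contribute nothing to the quasi-independence statistic.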
The sequence of mosaic displays in Figure 10 to Figure 13 resembles standard residual plots for linear models in that, as gross structure in the data is moved from the residuals to the fit, systematic patterns in what remains may suggest more subtle patterns of association that we may wish to explain. More generally, the mosaic display provides a natural and direct graphic adjunct to log-linear modelling. Fitting a model shows whether or not variables are associated, while mosaic displays reveal how those variables are related.