The parallel z-plot is a graphical procedure for visualizing a contingency table with an ordinal response variable. The z-plot, defined previously by the second author, is obtained as follows: take the difference between the observed cumulative function of an ordinal variable and the theoretical cumulative function under some null hypothesis, and plot this difference against the ordered categories of the variable. A parallel z-plot generalizes this procedure to visualize an ordinal response variable that depends on a nominal variable. It consists of several modified z-plots.
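The quantity plotted in a basic z-plot can be sketched as follows. This is a minimal illustration of the definition given above, not the authors' implementation: the function name `z_plot_values`, the example counts and the uniform null distribution are all assumptions, and the "modified" z-plots used in the parallel version may involve a standardization that is not shown here.

```python
import numpy as np

def z_plot_values(observed_counts, null_probs):
    """Observed minus null cumulative distribution, one value per ordered category."""
    counts = np.asarray(observed_counts, dtype=float)
    probs = np.asarray(null_probs, dtype=float)
    observed_cdf = np.cumsum(counts) / counts.sum()
    null_cdf = np.cumsum(probs) / probs.sum()
    return observed_cdf - null_cdf

# Hypothetical counts over five ordered categories, compared with a uniform null.
diffs = z_plot_values([10, 20, 40, 20, 10], [0.2] * 5)
```

Plotting `diffs` against the ordered categories gives one z-plot; repeating the computation for each level of a nominal variable would give the set of curves that a parallel display places side by side.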
Tree-based methods for classification are a statistical procedure for automatic learning from data. Their main characteristics are the simplicity of the results obtained and the fact that the constructed rules closely follow the human decision-making process. This is the reason for their increasing use in companies, banks, etc. Their virtue is also their fault: since the tree-growing process is very dependent on the data, small fluctuations in the data may cause a big change in the tree-growing process. We distinguish internal stability from external stability. External stability is assessed by means of a test sample. Our purpose is to define data diagnostics to prevent internal instability in the tree-growing process. We follow the CART methodology for tree-growing. This implies measuring the homogeneity of a node by an impurity index. The stability of a tree is related to the measure of impurity adopted. We will present a general formulation for the impurity of a node as a function of the proximity between the individuals of the node and its representative, with proximity measured in a broad sense. It is then easy to compute a stability measure of a node and to build trees that are not only optimal but also stable, and hence to increase their predictive power.
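To make the notion of an impurity index concrete, the sketch below uses the Gini index, one standard impurity measure in CART. It is an assumption chosen for illustration and is not the general proximity-based formulation announced in the abstract; the function names and the toy labels are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Reduction in impurity obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    return (gini_impurity(parent)
            - len(left) / n * gini_impurity(left)
            - len(right) / n * gini_impurity(right))

# Hypothetical node with two classes and a split that separates them perfectly.
parent = ["a", "a", "b", "b"]
gain = impurity_decrease(parent, ["a", "a"], ["b", "b"])
```

Recomputing `impurity_decrease` on slightly perturbed data is one simple way to see how sensitive the choice of split is to small fluctuations.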
In this communication we propose a new methodology for the automatic clustering of answers to open-ended survey questions, using statistical methods for textual data (Lebart and Salem 1988) and quasi-segments. The quasi-segments (Bécue 1993) are identified automatically, taking into account multiple word co-occurrences. Then the contingency table response x quasi-segment is built and submitted to correspondence analysis; finally, individual answers are classified into clusters on the basis of their factorial coordinates.
The answer classes obtained point out the different kinds of discourse. Their determinant lexical features are identified; then the characteristics of the individuals are analysed to detect significant categories associated with each class. The methodology is applied to questionnaire answers from a study carried out by Baudelot (1990) on how selection based on marks in mathematics, decisive in the French school system, leads to a sexual and social selection.
The voters' study in connection with the 1991 general election, for which 2,691 interviews were conducted in the Flemish region in December 1991 and the early months of 1992, contains information about eight electorates and thirteen attitudes and value orientations (post materialism, individualism, powerlessness, social isolation, nationalism and nine attitudes dealing with aspects of right-wing extremism). What attitudes are the best predictors for belonging to which electorates (or political parties)? Logistic regression analysis is compared with a biplot representation which is based on an original view of canonical correlation analysis as a combination of projection and rotation methods. In this method, one set of categorical (dummy) variables (the electorates) visualizes the group structure in the data. The rotation to one of the groups against the others improves the visualization of the contrasting characteristics (attitudes). The group means as well as the individual values (the voters' positions) can be displayed, or a combination of both. This provides the opportunity to focus on one or two parties (e.g. extreme right and the ecologists). In this method, the linear relationships between the continuous variables are visualized. A biplot for complete categorical data (groups and attitudes) may be generated via a decent transformation of the attitudes into categorical variables, using the CANALS method (Gifi). The biplots are programmed under X-WINDOWS. This interactive application uses all the tools of the GUI. It was developed on an IBM RS6000 under AIX and has also been tested on other UNIX platforms.
Usually the identification of non-substantive responses is not considered to be a problem, although the subsequent treatment of such responses in the analysis always raises difficulties. We explore the distinction between substantive and non-substantive responses using the 1984 Canadian National Election study data. Our analysis focuses on a battery of political party images. Respondents were asked which political party would be best (and which one worst) in dealing with a variety of issues, such as pollution, controlling inflation, and relations with Quebec. A large minority of respondents chose a "don't know" response or a "no difference" response. We use MCA to explore the relation of these two responses to substantive responses in which a specific political party was named. The graphical presentation of the results shows that a "no difference" response is quite distinct from a "don't know" response; the former lies on the same axis as the substantive responses, while the latter is located in a different dimension.
Whilst clustering methods are quite commonly applied to data and there is a well-established set of clustering criteria and algorithms (at least for the case of "classical" data types), the problem of finding a suitable visual representation of classification results is not yet fully solved or investigated, and the many isolated suggestions found in the literature allow no general theory or guidelines at the moment. Since suggestive visualizations are urgently needed for the evaluation and interpretation of data, this domain of research will certainly expand in the future, and even more so now that there exist very sophisticated graphical software tools.
This paper presents a -- by no means exhaustive -- view on this topic. It surveys and suggests a series of methods for representing classifications and clustering results.
In a first section we consider various tools for the (a posteriori) analysis of partitional classifications, e.g. tools for a visual cluster diagnosis, graphical displays of their relative locations in an attribute space, Euclidean embeddings and projection plots, representations showing the overlap structure between different classifications etc.
The second section considers combined approaches for clustering and (PCA- or MDS-like) data representation, either in terms of feature vectors or of dissimilarity matrices. This approach leads, for instance, to criterion-guided projection pursuit clustering methods as variants of the well-known k-means algorithm.
Our third section deals with hierarchical classifications, e.g. ordinary, pseudo- and generalized hierarchies, tree representations etc. and related representations of classes.
Finally, the last section combines various graphical methods related to spatial data, coloured displays, block clustering approaches, network and lattice representations and fuzzy clustering.
Over the years, latent structure models and other finite mixture models have proved to be a versatile class of models in the analysis of panel, survey, and experimentally derived choice data. Numerous applications in a marketing context demonstrate that this class of models can provide useful insights into market structure and brand perceptions by simultaneously segmenting and structuring a market at a particular point in time, or over time with the use of panel data.
A core assumption of latent structure choice models is that individual decisions are independent. In this talk I will discuss this constraint and show that it is too restrictive for most types of choice data. Instead, it seems natural to consider the outcome of previous choices as possible explanatory variables for current choices and to treat observed non-stationarities in the choice probabilities as a result of feedback effects from previous choices. Similarly, the different response formats in paired-comparison or rank-order tasks introduce local dependencies that should be taken into account in a latent-class analysis. Re-analyses of several published data sets demonstrate that by considering these dependencies many difficulties can be avoided in the interpretation and estimation of latent structure choice models.
Interpretations in MDS are typically for dimensions. Dimensions arise as special cases of regional interpretations. Within facet theory (FT), regional approaches are put on a systematic content-driven basis (Borg & Shye 1995). FT proposes to first design the objects of interest in terms of a coding system. Once objects are coded in this design and represented as points in MDS, one looks for a correspondence of the design to regions of the space. The hypothesis is that the space can be partitioned, facet by facet, so that points classified differently by the facet under consideration fall into distinct regions. Two examples are shown: a radex-cylindrex partitioning of intelligence test items, and a multiplex partitioning of political protest behavior.
Facets typically play either an axial, a polar, or a modular role in partitioning MDS spaces. Various combinations of these facets lead to prototypical structures often found in real data (simplex of regions, radex, conex, cylindrex etc.). The usual Euclidean dimension system combines two ordered axial facets. The polar coordinate system is the limiting case of a radex.
Technically, falsifiability conditions for regional interpretations are discussed. Analytical methods for partitioning of MDS spaces are sketched. Ways to achieve a confirmatory MDS that maximizes partitionability (of some sort) are outlined.
Methods such as Correspondence Analysis or Fitting and Analysis of Log-linear Models aim at describing data tables which as a rule depart from independence. These departures, however, may be confined to a few cells of the table, that is, the data may be fitted by a multiplicative model with a few outliers (Feinberg's Quasi-additivity model). When possible, this is a simple and elegant description of the data, perhaps a most preferable one. Likewise, a rank two model may be fitted to a data table, up to a few outliers. Whenever this is possible, the rank two approximation may be studied by, say, correspondence analysis, and a final elegant description may be thus obtained.
We propose two methods of outlier identification in two-way tables, one for the rank one model, another for the rank two model. Both are based on robust versions of Hoeffding's U statistics, and can be extended to the case of a rank larger than two, provided more difficult computational problems can be solved. We may also mention two desirable features of our proposal: (1) the zero frequencies in a contingency table are not a difficulty, and (2) in arbitrary tables, our proposal can lead to identification of outliers from any rank two model, such as an Additive-, Mandel's Row or Column Regression-, or Tukey Concurrent Model.
Different extensions of correspondence analysis have been proposed in the past, but only a few of them (Dequier, Choulakian) consider three-way tables in a symmetric way. The object of this paper is to illustrate the extension proposed by Carlier and Kroonenberg, which is presented at this conference in another paper. In this symmetrical approach, a preprocessed 3-way contingency table is decomposed and approximated using the Tucker-3 or the CANDECOMP decomposition and displayed by special biplots. These graphical representations differ from the symmetrical approach by choosing a reference way in the table, presenting the two other ways in different manners.
We illustrate the method on a real data set that gives the number of male and female students in France in 1968 with socio-economic origin i (34 levels) attending the type of study k (7 levels). Using this example we will
To find out, to analyse, to make visible the structure of a data matrix two approaches are available:
Rudas, Clogg and Lindsay (1994) and Clogg, Rudas and Xi (1995) presented a new definition of residuals based on a mixture representation of a given model for a contingency table. For a given table P and a given model H, P is decomposed as a two-point mixture. In the first component or latent class model H holds; and the second component is unrestricted. The matrix of probabilities for the second latent class summarizes local structure not captured by H, and the relative size of this latent class is the fraction of the population where H does not apply. Simple tabular displays can sometimes serve as descriptions of model misfit. Simple graphical displays can convey the model misfit, or the local structure, also. We consider histogram-type plots that describe model misfit in terms of margins, cell combinations and sufficient statistics under model H. In this approach, the structure captured in either the tabular displays or the graphical displays, depends on the model chosen as a baseline model. Sometimes this model will be a simple baseline, such as independence or quasi-independence. But the model might also be a plausible behavioral or structural model for the data, in which case the approach sheds light on possible sources of unmodeled heterogeneity.
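The two-point mixture described above can be written compactly. The symbols $\Phi$, $\Psi$ and $\pi$ are labels introduced here for illustration; the abstract itself does not name them.

```latex
P = (1 - \pi)\,\Phi + \pi\,\Psi, \qquad \Phi \in H, \quad 0 \le \pi \le 1
```

Here $\Phi$ is the component in which model $H$ holds, $\Psi$ is the unrestricted component summarizing the local structure not captured by $H$, and $\pi$ is the relative size of the second latent class. In the mixture index of fit approach, the reported fraction is usually taken to be the smallest $\pi$ for which such a decomposition exists.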
Suppose that we have a finite set of n objects and two nxn distance matrices, obtained from the observation of p and q (categorical) variables, respectively. For example, suppose that we want a graphic display of n researchers with respect to p different topics and at the same time, we have a similarity between them (e.g. the number of works published jointly).
Our aim is to obtain a simultaneous representation of both metrics by means of multidimensional scaling and related methods. The first matrix can be represented using a standard technique, e.g. correspondence analysis or (metric) multidimensional scaling. Several ways are proposed to represent the second matrix, which plays a secondary role. If the objects are grouped in clusters of a few elements each, an intuitive approach would be to connect them with curves proportional to the distances.
In general, a modified Gower's add-a-point technique is used to join the objects by means of lines proportional to the distances contained in the second matrix. Alternatively, we study another joint representation by using canonical correlation analysis to relate the principal coordinates derived from both distance matrices. The visualization of variables is also studied and some examples are given.
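The principal coordinates mentioned above can be obtained from a distance matrix by classical (metric) multidimensional scaling. The sketch below shows only this standard step, under the assumption that the distances are approximately Euclidean; it does not implement the modified add-a-point technique or the canonical correlation step, and the function name and example points are hypothetical.

```python
import numpy as np

def principal_coordinates(distances, n_dims=2):
    """Classical MDS: coordinates whose Euclidean distances approximate `distances`."""
    D = np.asarray(distances, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # doubly centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_dims] # largest eigenvalues first
    lam = np.clip(eigvals[order], 0.0, None)   # guard against small negative values
    return eigvecs[:, order] * np.sqrt(lam)

# Hypothetical example: distances between three points in the plane.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
coords = principal_coordinates(D)
```

Applying the same function to each of the two distance matrices gives the two sets of principal coordinates that a joint analysis could then relate.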
In this presentation I shall discuss joint spatial representation of domain and mappings of general multivariables. This leads to far-reaching generalizations of regression analysis and principal components analysis, along the lines of a thoroughly geometrized version of the Gifi system. Techniques such as canonical analysis and redundancy analysis also fit in painlessly.
This approach is contrasted with more classical ones such as factor analysis, in which the representation of the individuals is lost, and multidimensional scaling/cluster analysis in which the representation of the variables gets lost.
The key concept, as in the Gifi system, is homogeneity, in which various types of restrictions operate on the representation of the domain and/or the variables. Various definitions of homogeneity, based on different metrics, are discussed, with their implications for data analysis.
Strategic intent has been conceptualised as sustaining obsession to be the best at all levels of organisation (Hamel & Prahalad 1989) and has only recently been operationalised (Dimovski 1994). To further verify the developed operationalised construct of strategic intent, in this paper multiple correspondence analysis (MCA) has been used.
An investigation has been conducted on the population of credit unions in Ohio, USA, using originally developed measures of strategic intent within the broader framework of studying an organisational learning concept. The construct of strategic intent has been operationalised by 13 variables that have been designated as different aspects of strategic intent (e.g. innovation, quality etc.). The variables are ordinal and have been measured on a 5-point Likert scale.
Using MCA, the original 13-dimensional variable space has been reduced to a two-dimensional subspace explaining about 85% of the total variance. The categories of the investigated variables have been represented as points in the two-dimensional display, according to the values of the first two principal coordinates. Given the ordinal nature of the variables used, it was meaningful to join the points belonging to the same variable with broken lines. The display has shown a strong bipolarisation, with most of the points representing categories with low strategic intent on one side and those with high strategic intent on the other side. The general orientation of the majority of the broken lines on the graph is parallel to the first principal axis.
The parallel pattern of the broken lines supports our initial hypothesis of a strong relationship among the variables of strategic intent. Such a result can also imply that most of the analysed variables can be used as surrogate variables for strategic intent.
Correspondence analysis is well suited to provide the frame for a relational sociological analysis that takes into account a multidimensional space of positions as well as the position taking agents (Bourdieu). We have studied the political field trying to understand the voting behaviour in its social and historical context.
Several separate correspondence analyses on different sorts of datasets (polling results, opinion surveys and census data) lead to different views of the political field of the Grand-Duchy of Luxembourg, which clarify one another.
As part of the UK Analysis of Large and Complex Datasets initiative, this research examines the issues and complexities in visualizing individual work and life histories. Statistical techniques are beginning to be developed to analyse and model work and life histories using software packages such as BUGS, SABRE and EGRET. However, good statistical practice emphasizes the need to look at the data before analysis, but the complexity of life history data has made this difficult in the past. This complexity includes multivariate changes of state over time, missing data and external events.
Previous approaches to the graphical exploration of life histories will be presented, with a discussion of their limitations. The solution adopted here is to display a work history as a multi-faceted object in three-dimensional space, with changes of state throughout an individual's history being represented by changes in colour, texture and height. Viewing a single object allows a single detailed history to be examined and the relationships between changes of state in many variables to be examined.
The advantages and disadvantages of using a modern visualization system such as SGI EXPLORER will be presented. The advantages are numerous and include a powerful interaction capability, modularity and extensibility, and a graphics toolkit providing useful features such as slicing, texturising, rendering and animation. Disadvantages are less obvious, but include a steep learning curve, the preprocessing needed to provide the software with data in the appropriate format, and the inability of the software to deal with changes of state (non-differentiable functions).
The problem of extending these ideas to the viewing of multiple work histories will be discussed, and possible solutions described. Additional new tools for the visualization toolkit are needed to implement these ideas.
Emphasis will be placed throughout on the use of visualization as a tool for gaining insight into data before statistical analysis. An example from a UK Social Survey will be presented.
Planning of hospitals' bed capacities is often done by relatively simple algorithms such as the famous "Hill-Burton" formula: population size, hospitalization rates, average length of stay and capacity utilization are regarded as the determinants of future demand, and all are handled in an age-, sex- and region-specific manner.
This procedure disregards different levels of know-how, technological and staff equipment, staff experience and other infrastructural aspects of the hospitals. To remedy this methodological shortcoming for the purpose of a concrete prognosis of future demand for neonatal care facilities in the town of Vienna, we developed a questionnaire on infrastructural aspects of neonatal wards. All hospitals and wards caring for neonatal patients (n=12) in Vienna answered this questionnaire.
Formal Concept Analysis was used to visualize the multidimensional aspects of the current situation of neonatal care in the town of Vienna. We will present the results of
This paper reviews several graphical methods for visualizing n-way contingency table data: parquet diagrams (Reidwyl & Schupach 1994), mosaic displays (Friendly 1994) and fourfold displays (Friendly 1995a), which all display counts by area or observation density, and asks why visualization methods for categorical data are relatively ill-developed, while analogous methods for quantitative data are both well-developed and highly used.
An answer is suggested by a physical model for categorical data (Friendly 1995b) which likens categorical observations to gas molecules in pressure chambers, and provides a conceptual model for the use of area or observation density to display frequency data. The physical model provides concrete interpretations of a surprising number of results in the analysis of categorical data, from degrees of freedom and likelihood ratios, to iterative proportional fitting and Newton-Raphson iteration.
Correspondence Analysis (CA) and Formal Concept Analysis (FCA) are data evaluation methods representing categorical data using diagrams in the plane. In CA the data are transformed into data in a Euclidean metric space and projected onto a suitable subspace. In FCA the data are transformed into a concept lattice, which can be represented by a hierarchical line diagram drawn in the plane without using the distance in the plane. The reduction of dimensionality is done in CA by linear projections, in FCA by projections onto lattices which are embedded in the original lattice in a meaningful way. We compare both methods on the same data and discuss the advantages and disadvantages of CA and FCA.
The use of Biplots of logarithms of proportions in contingency tables is proposed as a tool for visual diagnosis of models of independence and description of deviations from such models. For two- and three-way tables, this paper shows the Biplot patterns that indicate complete, multiple and partial independence and provides graphical approximate tests of significance for them. An iterative method of fitting the Biplot is provided, in which the variability of the proportions under a Poisson or multinomial model is taken into account for the best fit. Some illustrative examples are given: it appears that independence models usually give only a partial description of data, but the Biplots are also useful in showing what patterns of dependence can actually describe the data.
We use cluster analysis to describe the regional linkages that arise through funding of research contract networks in the EU. We find five significantly different kinds of networks that we label as follows: Technological development, Basic research, Quasi-elite, Elite and Southern. These networks are described in terms of three basic dimensions: quality, type of partners and size combined with cost of the project. In terms of variables constructed along these dimensions, we find that the networks are homogeneously formed and that regions of similar technological capabilities are linked together. We discuss this empirical fact by means of a model in which researchers are matched by skill.
In many situations data analysis methods from different areas have to be combined to provide decision support desired for solving underlying problems. In this connection, questions whether a sequential or simultaneous performance of suitable data analysis methods should be preferred, can be discussed within a knowledge-based framework. If data analysis methods are specified by their input/output data, the term proposal (for the solution of an underlying problem) is used for a sequence of data analysis methods, where the methods are ordered in such a way that - according to rules for their application - the output data of predecessor methods correspond to input data of successor methods and where the performance of all methods in the sequence solves the interesting problem. With a-priori values as some kind of a measure of the quality for methods and their input data a-priori judgements of proposals for the same problem are possible. After solutions have been computed by application of selected proposals a-posteriori judgements can be based on goodness of fit criteria derived from chosen outputs. Against this background, a new simultaneous approach using conjoint data is also presented.
Just as continuous variables may be represented by axes in multidimensional space and samples represented relative to these axes, so may categorical variables be represented by category-level-points (CLPs) and samples by points relative to the CLPs. In both cases the value of a particular variable to be associated with a sample is derived from its nearest marker, implying orthogonal projection for linear axes but defining neighbour-regions for categorical variables. In low-dimensional MDS approximations, the neighbour-regions may be approximated by prediction-regions. Prediction regions partition the whole of the approximation space into convex tiles, one for each category-level; samples lying in a particular tile will have the corresponding category-level predicted, so defining a categorical variable version of the Eckart-Young theorem. This framework is very general, but will be illustrated by comparing prediction regions for (i) multiple correspondence analysis (chi-square distance) and (ii) for a similar kind of analysis using the Extended Matching dissimilarity Coefficient (EMC). Incidentally, many matching coefficients for multi-level categorical variables are monotonically related to the EMC, so the displays of the corresponding prediction regions are invariant when non-metric MDS is used to approximate the sample distances.
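The nearest-marker rule that defines neighbour-regions can be sketched as below. This only illustrates assigning each sample the category level of its nearest CLP in the space where the CLPs are defined; it does not construct the prediction regions of a low-dimensional approximation, which need not coincide with nearest-marker regions in the display. The function name, coordinates and level names are hypothetical.

```python
import numpy as np

def predict_levels(sample_points, clp_points, level_names):
    """Assign each sample the category level of its nearest category-level-point."""
    samples = np.asarray(sample_points, dtype=float)
    clps = np.asarray(clp_points, dtype=float)
    # Pairwise Euclidean distances, shape (n_samples, n_levels).
    dists = np.linalg.norm(samples[:, None, :] - clps[None, :, :], axis=2)
    return [level_names[i] for i in dists.argmin(axis=1)]

# Hypothetical two-level variable with CLPs at (0, 0) and (1, 1).
predicted = predict_levels([[0.1, 0.0], [0.9, 1.0]],
                           [[0.0, 0.0], [1.0, 1.0]],
                           ["low", "high"])
```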
The role of numerical diagnostics in supporting the graphical interpretation of correspondence analysis is discussed. The usual partitioning of inertia leads to various contributions which are adequate for simple correspondence analysis, but need modification in the case of multiple correspondence analysis. Alternative diagnostics which rely more on scalar products and interpoint distances in the display are also discussed and it is demonstrated in an application to sociological data that these are useful in both simple and multiple correspondence analysis.
Following the assumptions of ecological socialisation research, adequate analysis of socialisation conditions must consider the multilevel and multivariate structure of social factors that impact human development (Bronfenbrenner 1979, 1986; Garbarino 1992). This implies that complex models of family configurations or of socialisation factors are needed to explain the variance in developmental paths and outcomes (e.g. Baumrind 1980). Models of multivariate and multilevel structure of socialisation conditions often failed in analysis because of the methodological problems raised by different measurement levels and multivariations of the variables. Thus, the analysis of social constraints on individual and personal development is reduced to single facets of family, school or peer interactions without anchoring them in the macro- or exostructures of society. The paper illustrates the usefulness of the linearization approach as described by Gifi (1980, 1992) and Van de Geer (1993) to solve this problem. The usefulness of the method is illustrated by an empirical example: the effects of socialisation conditions on the development of anxiety and cognitive development. The data is derived from a longitudinal study of 120 Icelandic children.
A least-squares strategy is proposed for representing a two-mode proximity matrix containing similarity or dissimilarity values between pairs of objects chosen from two distinct sets, as an approximate sum of a small number of matrices having the same size as the original but which satisfy certain simple order constraints on their entries. The primary class of constraints considered are those defining a Q-form (or anti-Q-form) for a two-mode matrix, where after suitable and separate row and column reorderings of the latter matrix, the entries within each row and within each column are nondecreasing (or nonincreasing) to a maximum (or minimum) and thereafter nonincreasing (or nondecreasing). A Q-form or anti-Q-form can also be restricted to hold only within rows or only within columns. Matrices satisfying either the Q-form or anti-Q-form can be viewed as defining a joint ordering of the row and column objects along a continuum, where the original proximity information is reflected by the location of each object in the ordering in relation to the location of the objects from the other set. A number of empirical examples based on published data sets are used as illustrations of how such reconstructions might be carried out and interpreted. Finally, several other types of order constraints are mentioned briefly, along with a few corresponding numerical examples, to show how alternative structures can be considered using the same general type of computational strategy as in the fitting of (anti-)Q-forms.
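The Q-form order constraint can be checked mechanically for a matrix whose rows and columns are already in a given order. The sketch below tests only whether that given ordering satisfies the constraint; it does not search over reorderings or perform the least-squares fitting described in the abstract, and the function names and example matrices are hypothetical.

```python
def is_unimodal(values):
    """True if values are nondecreasing up to a maximum and nonincreasing afterwards."""
    n = len(values)
    i = 0
    while i + 1 < n and values[i] <= values[i + 1]:
        i += 1                      # climb the nondecreasing part
    while i + 1 < n and values[i] >= values[i + 1]:
        i += 1                      # descend the nonincreasing part
    return i >= n - 1               # reached the end without a second rise

def has_q_form(matrix):
    """True if every row and every column of `matrix` is unimodal as ordered."""
    rows_ok = all(is_unimodal(row) for row in matrix)
    cols_ok = all(is_unimodal(col) for col in zip(*matrix))
    return rows_ok and cols_ok

# Hypothetical similarity matrix already in a suitable row and column order.
q_matrix = [[1, 3, 2],
            [2, 4, 3],
            [1, 2, 1]]
```

An anti-Q-form check follows by negating the entries, and the row-only or column-only restriction by keeping just one of the two conditions.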
During the year 1994, many elections took place in the Federal Republic of Germany. This frequency of elections offered a good opportunity to analyse the process of agenda setting and agenda building by different political actors.
On the basis of textual data, the development of themes which were mentioned in the course of the different election campaigns is depicted. A sample of 610 news items from wire services was used. These data records cover the period between June 1993 and October 1994 and all refer to the common key word Wahlkampfthemen (themes of the election campaigns). The news items were categorized according to a coding scheme that includes different political fields of interest, contexts, political actors and other relevant thematic issues. The resulting textual data were analysed by correspondence analysis in order to give an insight into connections between the different themes and actors. Additionally, the dynamic aspects of agenda setting and agenda building can be shown by multiple correspondence analysis in a comprehensive way.
Various studies of risk perception have examined people's risk-taking behaviour by characterizing and evaluating their hazardous activities and technologies. MDS offers the possibility of regarding several judging structures describing the subjective risk classification of an individual. The method of paired comparisons has the advantage of not requiring direct quantitative judgements, only ordinal judgements of perceived differences.
In this special context, we studied the classification of 26 different risk outcomes with paired comparisons using a sample of 86 adolescents in Munich (age: 16-24). The risk outcomes can be seen as a representative selection of various quantitative and qualitative studies. The main interest was the influence of context variation on the judgement process. With a computer supported program, the subjects - randomized in three wording groups - conducted the comparisons according to one of the following wordings:
Correspondence analysis can be conceived as a technique in which a contingency table of counts is first processed so that it will only contain the dependence between the row and column variables. Then the singular value decomposition is applied to find a low-rank approximation which is to be displayed in a graph.
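The two steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names and the counts are made up, the "dependence only" matrix is taken to be the usual matrix of standardized residuals, and a simple power iteration stands in for a full singular value decomposition (it recovers only the leading singular value and assumes all row and column totals are positive).

```python
import math

def standardized_residuals(table):
    """Return S with entries (p_ij - r_i c_j) / sqrt(r_i c_j)."""
    n = sum(sum(row) for row in table)
    P = [[count / n for count in row] for row in table]
    r = [sum(row) for row in P]        # row masses
    c = [sum(col) for col in zip(*P)]  # column masses
    return [[(P[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j])
             for j in range(len(c))] for i in range(len(r))]

def total_inertia(S):
    """Sum of squared residuals; equals Pearson's chi-square divided by n."""
    return sum(x * x for row in S for x in row)

def leading_singular_value(S, iterations=200):
    """Approximate the largest singular value of S by power iteration on S^T S."""
    n_rows, n_cols = len(S), len(S[0])
    v = [1.0 + 0.1 * j for j in range(n_cols)]  # arbitrary asymmetric start vector
    for _ in range(iterations):
        u = [sum(S[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]  # u = S v
        w = [sum(S[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # w = S^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no dependence left, or an unlucky start vector
            return 0.0
        v = [x / norm for x in w]
    u = [sum(S[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return math.sqrt(sum(x * x for x in u))

table = [[20, 10, 5], [5, 10, 20]]     # made-up counts
S = standardized_residuals(table)
sigma = leading_singular_value(S)
share = sigma ** 2 / total_inertia(S)  # share of inertia carried by the first axis
```

Because the example table has only two rows, the residual matrix has rank one and the first axis carries all of the inertia; larger tables would generally need further axes.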
The major aim of the presentation is to extend correspondence analysis to three-way tables over and above earlier generalisations by Dequier (1973), Choulakian (1988) and Kroonenberg (1989) using three-way generalisations of the singular value decomposition. Moreover, it will be shown that the combination of these generalisations and Lancaster's (1951) additive decomposition of chi-square provides a powerful vehicle for the decomposition of dependence in a three-way table.
To evaluate interactions in three-way contingency tables with three-way correspondence analysis, biplots can fruitfully be used. It is not possible, however, to display the information of all three modes simultaneously, as biplots are essentially based on two sets of markers. Proposals for dealing with this situation will be presented. An example from child development is used to illustrate the theoretical developments.
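The additive decomposition of chi-square mentioned above can be checked numerically. The sketch below assumes the form in which the decomposition is commonly stated for the three-way independence model: the total (divided by n) splits into three two-way marginal terms and one three-way interaction term. Whether this is exactly the variant used in the presentation is an assumption; the function name, the dictionary keys and the counts are made up.

```python
def three_way_partition(table):
    """Partition chi-square / n for a three-way table; table[i][j][k] are counts."""
    I, J, K = len(table), len(table[0]), len(table[0][0])
    cells = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]
    n = sum(table[i][j][k] for i, j, k in cells)
    p = [[[table[i][j][k] / n for k in range(K)] for j in range(J)] for i in range(I)]
    # One-way margins.
    pi = [sum(p[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
    pj = [sum(p[i][j][k] for i in range(I) for k in range(K)) for j in range(J)]
    pk = [sum(p[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]
    # Two-way margins.
    pij = [[sum(p[i][j][k] for k in range(K)) for j in range(J)] for i in range(I)]
    pik = [[sum(p[i][j][k] for j in range(J)) for k in range(K)] for i in range(I)]
    pjk = [[sum(p[i][j][k] for i in range(I)) for k in range(K)] for j in range(J)]

    def two_way(joint, a, b):
        # Chi-square / n of a two-way marginal table against independence.
        return sum((joint[x][y] - a[x] * b[y]) ** 2 / (a[x] * b[y])
                   for x in range(len(a)) for y in range(len(b)))

    total = 0.0
    interaction = 0.0
    for i, j, k in cells:
        e = pi[i] * pj[j] * pk[k]  # expected proportion under full independence
        total += (p[i][j][k] - e) ** 2 / e
        alpha = (p[i][j][k] - pij[i][j] * pk[k] - pik[i][k] * pj[j]
                 - pjk[j][k] * pi[i] + 2 * e)
        interaction += alpha ** 2 / e
    return {"total": total,
            "AB": two_way(pij, pi, pj),
            "AC": two_way(pik, pi, pk),
            "BC": two_way(pjk, pj, pk),
            "ABC": interaction}

parts = three_way_partition([[[10, 5], [3, 8]], [[4, 9], [12, 6]]])  # made-up counts
```

The four components add up to the total, so each can be read as the share of the overall dependence due to one pair of variables or to the three-way interaction.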
Correspondence Analysis (Benzécri 1973) is one of the most popular ways of analysing the interdependence in a contingency table by decomposing Pearson's chi-square. When the aim is to study the dependence of the row variable, say, on the column variable, Lauro & D'Ambra (1984) propose Non Symmetrical Correspondence Analysis (NSCA), based on the decomposition of Goodman & Kruskal's predictability index.
The choice of different constraints for the solutions has important consequences. In visualizing results, the usual joint representation in principal coordinates (Greenacre 1993) can be read only in terms of angles. The aim of this paper is to explore suitable transformations of the coordinates in order to ease the interpretation of NSCA displays, taking into account their effect on the stability of factorial configurations.
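For readers unfamiliar with the predictability index on which NSCA is based, the following sketch computes Goodman and Kruskal's tau for a two-way table, with the row variable as the response and the column variable as the predictor. The function name and the example tables are made up for illustration, and the sketch assumes the response has at least two non-empty categories.

```python
def goodman_kruskal_tau(table):
    """Tau for predicting the row category from the column category."""
    n = sum(sum(row) for row in table)
    P = [[count / n for count in row] for row in table]
    row_mass = [sum(row) for row in P]
    col_mass = [sum(col) for col in zip(*P)]
    # Predictability of the rows once the column is known ...
    explained = sum(P[i][j] ** 2 / col_mass[j]
                    for i in range(len(row_mass))
                    for j in range(len(col_mass)) if col_mass[j] > 0)
    # ... compared with the predictability from the row margin alone.
    baseline = sum(r * r for r in row_mass)
    return (explained - baseline) / (1.0 - baseline)

tau_perfect = goodman_kruskal_tau([[10, 0], [0, 10]])       # column determines row
tau_independent = goodman_kruskal_tau([[10, 20], [20, 40]])  # rows proportional
```

Tau ranges from 0 (the column variable does not help to predict the row variable) to 1 (perfect prediction); NSCA decomposes the numerator of this index.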
The modelling and interpretation of relationships between nutritional characteristics and hypothetical explanatory variables is of common interest in nutritional epidemiology. The method of classification and regression trees (CART) (Breiman et al. 1984) is one approach to modelling the relationship between a classification, response or dependent variable and factors or independent variables possibly measured on different scales. CART is a valuable tool for exploratory data analysis, especially in a consultancy situation, since it allows a simple presentation of multivariate relationships to the client. We discuss the CART model as a special linear model and compare it with models applied to event history data in social sciences (Martens 1994). Moreover, we discuss and use improved split criteria defined for binary, ordinal, quantitative and categorical factors (Lausen et al. 1994).
We apply and illustrate our ideas by modelling and analysing the influence of social and anamnestic factors on the duration of exclusive breast feeding observed in the Dortmunder Längsschnittstudie zur Ernährung von Säuglingen (Schöch et al. 1991). Finally, we compare the results computed using CART with a loglinear model and a Cox regression model applied to the breast feeding study.
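As a reminder of how a CART split on a quantitative factor is chosen, here is a minimal sketch using the standard Gini impurity. It does not implement the improved split criteria of Lausen et al. (1994); the function names and the toy data are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(xs, ys):
    """Return (threshold, impurity decrease) of the best binary split on xs."""
    pairs = sorted(zip(xs, ys), key=lambda pair: pair[0])
    n = len(pairs)
    parent = gini([y for _, y in pairs])
    best = (None, 0.0)
    for cut in range(1, n):
        if pairs[cut - 1][0] == pairs[cut][0]:
            continue  # no threshold can separate tied predictor values
        left = [y for _, y in pairs[:cut]]
        right = [y for _, y in pairs[cut:]]
        decrease = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        if decrease > best[1]:
            threshold = (pairs[cut - 1][0] + pairs[cut][0]) / 2
            best = (threshold, decrease)
    return best

threshold, decrease = best_split([1, 2, 3, 4], ["a", "a", "b", "b"])  # toy data
```

Growing a tree amounts to applying such a search to every factor at a node, splitting on the best one and repeating the procedure in each child node.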
The contribution deals with the problem of associations between texts (such as responses to open-ended questions) and domains (areas of research in Automatic Information Retrieval (AIR), or categories of respondents in a survey). The use of words or segments as basic statistical units leads to characterizing each sentence or response by a large sparse vector. To assign a response (a text) to a specific category, a discriminant analysis is needed. The purpose of a preliminary eigen-analysis (generally a correspondence analysis) is twofold: first, to visualize the statistical associations between words and categories and, second, to regularize the discriminant analysis (often "poorly posed") by discarding the last principal axes (Benzécri 1981; Deerwester et al. 1990). To assess the quality of this discrimination, a resampling procedure leads to a confusion matrix, giving an intrinsic measurement of association between the target categorical variable and the texts. This enables us to compare the predictive power of responses in different languages (three in our example, dealing with an international sample survey) over the same categorical variable.
We present new graphical displays which allow researchers to visualize and readily interpret the statistical effects (i.e. odds ratios) in categorical outcome data. Each display is provided from a user-selected reference perspective corresponding to the specific "coding" assigned to the predictor variable (e.g. dummy, effects, quantitative continuous coding). The effects are estimated from the class of "M-component" association models (Goodman 1991). The simplest models (M=1) correspond to logit models when the outcome variable is dichotomous, and to general "ordinal logit" models in the case of a polytomous outcome or multiple categorical outcome variables; hence the acronym GOLD (Graphical Ordinal Logit Display). In the case of multiple predictor variables, a single joint effects plot is presented along with separate "partial" GOLD plots associated with each predictor. We also justify the use of GOLD plots with continuous outcome data under the general "Universal Regression Model" in which case the statistical effects represent generalized odds ratios.
Latent class models (LCMs) can be useful in a variety of comparative social research contexts. To examine panel data, for example, two or more LCMs for the same cases may be examined as a special case of the available latent loglinear models (McCutcheon and Hagenaars 1994). Also, multi-sample, or simultaneous, latent class models (Clogg and Goodman 1984, 1985, 1986) can be used to compare the latent structures when samples are available for more than a single group; the groups may be different nations, states, regions, cultural groups, or, to examine social change, separate samples drawn from the same population at two or more time points (see e.g. McCutcheon 1987a, 1987b; Hagenaars 1990; McCutcheon and Hagenaars forthcoming). When there is a large number of such samples, however, conventional approaches to exploring multi-sample LCMs may become unwieldy.
In this paper, I show that there are cases in which correspondence analysis (CA) can be used as a complement to such LCMs. Three cases are explored: the first uses CA in an exploratory framework to examine patterns among the latent classes and the samples when there are several samples available for analysis; the second examines the use of CA to decompose the difference between two latent loglinear models with panel data; and the third considers multiple CA as an exploratory framework to examine patterns of residuals between two latent loglinear models with panel data for multiple groups. Applications of these cases will be illustrated using data from the International Social Survey Programme (ISSP) and the Political Action panel study.
Visualization of categorical data will be highlighted using two major concepts: discrimination and classification.
Discrimination in low-dimensional space is pursued by canonical discriminant analysis, where a set of given predictor variables is used to maximize the distances between a given set of known groups. Results include a classification table that compares the known group structure to the one obtained after objects are reassigned on the basis of posterior probabilities. It is assumed that a good classification depends on how well the groups can be discriminated.
Homogeneity analysis can be viewed as a simultaneous discriminant analysis. The standard joint representation, with category points in the centroid of the objects that belong to the category, can then be used for reassigning the objects. Correct classification can be compared to the size of the discrimination measure, the average of which is the basic goodness-of-fit index of the analysis. We will see that discrimination and classification do not always go hand-in-hand.
In this paper we introduce the use of nonsymmetric correspondence analysis in the field of non-parametric discrimination tree procedures. Suppose we have observed a response variable with J categories (corresponding to the possible classes) and M categorical predictors on a sample of N cases. Since there is a dependency of the response variable on each predictor, nonsymmetric correspondence analysis, which decomposes the predictability tau index, is preferable to correspondence analysis. Here, we deal with this factorial model in a recursive partitioning procedure, when the sample size is very large. The aim is to get insight into the nature of the dependency of the J classes on the predictor categories for continuously changing partitions of the N cases. We define a discrimination criterion which is based on the geometric properties of the factorial model. In particular, we can visualize the discrimination power as well as the predictive strength of the predictor categories on factorial axes.
As heated discussions between two leading camps (Carroll, Green & Schaffer 1989; Greenacre 1989) have indicated, graphical display of multiple-choice data leaves a lot of room for further improvement. First, if subjects and options of multiple-choice questions are displayed in the same graph, symmetric or nonsymmetric, one can hardly tell which options of n questions each subject has chosen. This is a serious problem for interpretability of input data from a graphical display. Secondly, another problematic aspect of such a graph is the fact that coordinates of subjects are typically based on the same number of responses, n, while the coordinates of response options are generally based on different numbers of responses. This discrepancy makes the interpretation of graphed data somewhat complicated, though not impossible, because each coordinate is a reflection of its frequency (popularity) as well. Thirdly, the standard procedure of graphing data in terms of two components (solutions, dimensions), projected on orthogonal axes, may not be the best way to elucidate the input data since the data are not always interpretable in terms of orthogonal components, but perhaps interpreting clusters in multidimensional space may be more meaningful. A practical method that satisfies desiderata of graphical display (Nishisato and Nishisato 1994) will be presented.
Data visualization techniques proved useful in analysing Bulgarian politics and public opinion since 1989. With traditional concepts such as left and right not applicable to the realities of political life and meaningless to interviewees, visualization techniques helped understand the way Bulgarians think about parties, politicians or institutions. Displays proved fairly stable over time or even with respect to the choice of technique indicating the presence of a strong structure in the data. They helped explore Bulgarian political space and diagnose or illustrate certain problems such as the difficulty of forming a stable political centre. Visualization of data from the elections is broadly consistent with the displays produced from public opinion data. There are some interesting similarities between structures found in data on political and ethnic attitudes.
Global political change in the last five years has most likely altered individuals' perceptions of similarities amongst a set of nations. With the fall of the communist planned economy, particularly in the former USSR, the distinction of nations according to a dimension of political alignment is expected to be less salient. Other dimensions of perception found in previous analysis (Wish et al. 1972) such as economic development and ethnic/racial differences may prove to be more significant.
The expected change in the perception of the differences between nations is evaluated by comparing similarity judgements from 1968 with those collected a quarter of a century later (i.e. 1993). The data collection process is intended to make no presumptive distinction in the importance of nations. Therefore, the set of comparisons amongst the given nations (as stimuli or objects) is treated as nominally scaled data. For better comparison with the original results of the 1972 study, an INDSCAL approach will be used to analyze the data and to visualize individual perceptions.
We investigated the judged similarity structure among 15 English kinship terms. We collected two judged similarity data sets, triads and paired comparison ratings from each of 82 subjects. We then produced two 15 by 15 data matrices for each subject and stacked the resulting 164 matrices and analyzed them with correspondence analysis. This allowed us to visualize where each subject placed each kinship term for both data collection methods in a single spatial representation. We were then able to contrast two theoretical models of cultural sharing. The strong model says that each subject shares the same cultural definition of the position of a term as every other subject. The weak model says that each subject has their own definition of the position of a term and that the cultural definition is simply an average of these positions. We find that each normal member of a culture shares the same cognitive structure for the semantic domain of kinship terms. Implications are discussed.
Observational data typically treated by Correspondence Analysis (CA) and more specifically by Multiple Correspondence Analysis (MCA) generally comprise both explanatory variables and explained variables. The purpose of this paper is to show how concepts and techniques stemming from sophisticated Multivariate Analysis of Variance - such as main effects, within effects and interaction effects for non-balanced data - may be adapted to Multiple CA, so as to enrich the usual aids to interpretation. The basic tools of our method of contributions of points and deviations, with its back and forth process between visualization and computation, will be presented. In particular, procedures for investigating axes of higher order by means of specific questions will be stressed. The new methods that we are developing, together with software implementing our language for interrogating data, will be illustrated in the analysis of a questionnaire (Rouanet and Le Roux 1993).
This paper uses correspondence analysis to look for differences in health-related lifestyles. It investigates differences in health-related behavior between groups, sociodemographics, states, etc. in relation to participation in sports and exercise, alcohol consumption, medical checkups etc. Changes in lifestyle and behavior over time are also examined. To identify comprehensive patterns of interrelationships between all categories of a set of variables, to investigate all variables in one step and to plot these results, the statistical procedures are based on simple correspondence analysis and multiple correspondence analysis. The data for the analyses were collected by telephone interviews for the Behavioral Risk Factor Surveillance System (BRFSS) of the Centers for Disease Control and Prevention (CDCP), Atlanta, USA.
The purpose of this methodological work is to give a training basis for interpreting loglinear models.
For each typical model in the class of hierarchical three-dimensional loglinear models (see Santner 1989), real data sets well fitted by this model were chosen. The loglinear model was calculated for each data set and correspondence analysis was applied to the contingency table estimated by the fitted loglinear model. We could clearly expose the nature of associations between variables in appropriate types of loglinear models by interpreting the principal axes and the coordinate values of row and column points obtained by correspondence analysis. We concluded that the parallel application of the two complicated statistical methods simplifies the understanding of both. The illustrative data samples are taken from an Estonian life course study database, and partly from the literature. The correspondence analysis was performed by SimCA (Greenacre 1986).
Ideal Point Discriminant Analysis (IPDA) fits the conditional probability of column j given row i to the corresponding observed frequency of a contingency table. This conditional probability is assumed to be a negative exponential function of the squared Euclidean distance between the row and the column of the table, which are both represented as points in a joint multidimensional space. The spatial representation facilitates a holistic understanding of the relationships between rows and columns of the contingency table. Constraints representing a variety of hypotheses about the contingency table can be imposed on distance parameters, and can be tested. This allows analytic investigations of the relationships in the contingency table. In this talk, we demonstrate the use of IPDA and illustrate through examples which aspects of spatial representations can legitimately be interpreted and which cannot. Some graphical techniques useful for IPDA will also be discussed.
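The distance model described above can be illustrated with a small sketch. It assumes that the negative exponential terms are simply normalized over the columns so that the probabilities for a row sum to one; the full IPDA model may include further parameters not shown here. The function name and the coordinates are made up for illustration.

```python
import math

def ipda_probabilities(row_point, column_points):
    """Return p(j | i), proportional to exp(-squared distance from row i to column j)."""
    weights = []
    for column_point in column_points:
        d2 = sum((a - b) ** 2 for a, b in zip(row_point, column_point))
        weights.append(math.exp(-d2))
    total = sum(weights)
    return [w / total for w in weights]

columns = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]   # made-up column points
probs = ipda_probabilities((0.0, 0.0), columns)  # row point coincides with column 0
```

The closer a column point lies to the row point, the larger its conditional probability, which is what makes the joint display readable in terms of distances.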
The contribution presents results of the study Art Worlds: A Comparative Study, based on surveys in the artworlds of Austria and northern Germany. The data were collected in random samples of professional and non-professional visitors of exhibitions with international avantgarde art in Hamburg (n=580) and Vienna (n=620) in 1993 and 1994. The question whether the producers of visual avantgarde art and the intellectuals supporting them (e.g. critics, curators) are still subcultures with specific values and preferences (cf. Bourdieu), or are rather integrated into middle class culture (cf. Crane, Gablik), is explored with the help of latent-class and correspondence analysis, two techniques that allow visualizations of data. Latent class analysis for ordinal data, as suggested by Rost, is applied to a scale measuring traditionalism (identification with family, religion, nation and mainstream status symbols). In both samples, three latent classes of low, middle and high traditionalism are differentiated in a similar way and displayed graphically. The probabilities of the individuals to belong to the classes lead to three subclasses for each of these classes. Correspondence analysis displays are used to explore whether associations exist between latent classes and subclasses (considered as supplementary points), residence (living in Hamburg or Vienna vs. living in smaller towns) and position in the artworlds (center-periphery). Differences regarding value orientations can be found in the artworld of Vienna, whereas in the Hamburg sample, there is no convincing evidence for the subculture thesis.
Should the graphical display resulting from (multiple) correspondence analysis (MCA) be interpreted as a standard linear biplot or as an ideal-point plot? I show that MCA conforms both to a linear reconstitution formula and to an ideal-point model. In the standard linear biplot interpretation of correspondence analysis (Greenacre 1993), the ideal-point features of the data go unnoticed. The biplot interpretation works if the eigenvalues are small. However, if the eigenvalues are large, the ideal-point feature is likely to be prominent in the data. It is then more informative to interpret the diagram as an unfolding diagram or, even better, to restrict interpretation to logratios of probabilities as in Aitchison's (1990) biplot of compositional data. The potential of canonical correspondence analysis (i.e. MCA linearly constrained by external predictors) for visualizing categorical data is emphasized.
Log-linear models are usually used to study the relationships among categorical/categorized variables under the assumption of the Poisson model or the consequent multinomial model obtained by conditioning on the totals. For many data sets, the variation of the means among the cells may be greater than that provided by a Poisson or multinomial model. In these situations the two-parameter Negative Binomial (NB) model, which allows "extra-Poisson" variation, or the consequent multinomial NB model is often used as an alternative to the Poisson model. Overdispersion is a phenomenon often caused by latent heterogeneity, meaning that the sample arises from a population consisting of different subpopulations.
This paper presents an approach to extending the NB model to Correlated Generalized Linear Models. In this environment, effects of factors and covariates on overdispersion, as well as on transition probabilities, can be investigated in a flexible manner. An application to data is discussed in some detail and the main features of a maximization routine based on the iteratively re-weighted least squares algorithm are also outlined.
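The notion of "extra-Poisson" variation can be made concrete with a short sketch. It assumes the common two-parameter NB form in which the variance equals mu + mu^2 / k, and it uses a crude method-of-moments estimate of k rather than the iteratively re-weighted least squares routine described in the abstract. The function names and the counts are made up for illustration.

```python
from statistics import mean, variance

def nb_variance(mu, k):
    """Variance of a negative binomial with mean mu and dispersion parameter k."""
    return mu + mu * mu / k

def moment_estimate_k(counts):
    """Method-of-moments estimate of k; None if the counts are not overdispersed."""
    m = mean(counts)
    v = variance(counts)  # sample variance with divisor n - 1
    if v <= m:
        return None       # no extra-Poisson variation to explain
    return m * m / (v - m)

counts = [0, 0, 2, 2, 10, 10]  # made-up counts: mean 4, sample variance 22.4
k_hat = moment_estimate_k(counts)
```

Under a Poisson model the variance would equal the mean, so a sample variance well above the mean, as in these counts, signals overdispersion; the smaller the estimated k, the stronger the departure from the Poisson model.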
Latent class analysis aims to explain the relations between manifest categorical variables by assuming the existence of a categorical latent variable. The categories of this latent variable are called latent classes. Given a latent class, the manifest variables are independent. For the purpose of classification, interest usually centres on the conditional probabilities of falling into each of the latent classes, given some profile of responses on the manifest variables.
In many practical situations, people are not only interested in membership of latent classes, but also in the relation between explanatory variables and latent class membership. For this purpose, the earlier mentioned classification is sometimes used. It is better, however, not to use such a two-step procedure (first step: determine classification; second step: relate classification to explanatory variables), but instead to use a procedure that fulfils both objectives simultaneously.
If the explanatory variables are categorical, simultaneous latent class analysis can be used. If the explanatory variables are continuous, the concomitant variable latent class model can be used. In this presentation we will give attention to graphical aspects of both latent class analysis and concomitant latent class analysis.
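The classification probabilities mentioned above follow from Bayes' rule together with the assumption that the manifest variables are independent within a latent class. The sketch below shows the computation for known model parameters; the function name and all parameter values are made up for illustration, and estimating the parameters is not shown.

```python
def posterior_class_probabilities(class_sizes, conditional_probs, profile):
    """Return P(class | response profile) under local independence.

    class_sizes[t] is the proportion of latent class t,
    conditional_probs[t][m][c] is P(variable m takes category c | class t),
    profile[m] is the observed category of manifest variable m.
    """
    joint = []
    for t, size in enumerate(class_sizes):
        likelihood = size
        for m, category in enumerate(profile):
            likelihood *= conditional_probs[t][m][category]
        joint.append(likelihood)
    total = sum(joint)
    return [value / total for value in joint]

class_sizes = [0.5, 0.5]                 # hypothetical two-class model
conditional_probs = [
    [[0.1, 0.9], [0.2, 0.8]],            # class 0: two binary items
    [[0.8, 0.2], [0.9, 0.1]],            # class 1
]
posterior = posterior_class_probabilities(class_sizes, conditional_probs, (1, 1))
```

A respondent answering category 1 on both items is assigned to class 0 with probability 0.36 / 0.37, which is the kind of quantity a two-step procedure would carry forward into the second step.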
Our previously published work on the analysis of sociological AIDS research between 1980 and 1990, based on entries in Sociological Abstracts, has shown that the evolutions of research themes, authors, and journals in this domain are quite distinct from each other. Indeed, they each form a separate set of categorical data generated from the same data base. The research themes seem to divide into solid, less flexible themes which change little with time, and those that are more flexible, even adaptive, and change over time. The set of authors also seems to divide between those that move in a clearly designated direction and those that evolve. But coauthorship analysis shows an entirely different aspect involving small networks or cliques that tend to stay together over time, even independently of thematic changes. The third type of categorical data is the journals in which this research is published. In an expanding field such as sociological AIDS research between 1980 and 1990, some older journals try to adapt and compete with the new thematic journals which appear. Other journals may try to avoid the new field or try to dominate a specific aspect of it. Each of these three types of data generates its own problems of visualization of results. Moreover, there is the additional problem of confronting and synthesizing these different visualizations, which are all based on the same original data set.
Keywords: AIDS, Visualization of Results, Multiple Analysis, Journals, Authors, Themes.
In this paper, a process of typification that uses different exploratory statistical procedures and includes categorical and metric variables from a longitudinal data set will be presented. The data set consists of the Bremen Longitudinal Social Assistance Sample (LSA), a 10% random sample of recipients from the city of Bremen, whose welfare use has been observed over a period of 75 months. These data are used to illustrate how a typology of welfare careers and career dynamics can be constructed with exploratory techniques of categorization; how appropriate techniques of visualization can be employed to identify and to avoid artefacts; and how methods derived from tree analysis and modelling techniques for categorical data can be used to construct hypotheses explaining the typology. This process will be presented in two steps:
1. The typification of career dynamics
First, the construction of a preliminary typology of welfare careers will be presented. This typology is based on variables that can be measured on at least an interval scale, and its construction employs methods of cluster analysis. Further, it will be shown how artefacts can be identified by graphical techniques of visualization. Then, the procedures that were used to refine the typology by including information contained in categorical variables will be presented.
2. Exploratory steps towards an explanation of the typology
In a second step, exploratory techniques based on tree analysis (CHAID) will be employed as heuristics for the identification of interaction between the typology developed in the first step and explanatory variables which seem of theoretical interest. By estimating the parameters of these variables using multinomial logit models, the explanatory capacity of the hypothesis will be evaluated.
The responses from many surveys take values in finite sets, rather than in the real numbers, and consequently are not so easily displayed. The results are often summarised in multi-way contingency tables. Here we show how two aspects of survey data have a ready visualization using the output of a statistical modelling exercise. The first is concerned with the relationships between the variables cross-classifying the table: in many studies these exhibit conditional independence which leads to fitting a graphical model and to simple informative displays via the independence graph. The standard graph, with vertices for variables and edges for dependence, needs enhancing by informative choices of location of the vertices and a suitable weighting of the edges to convey strength.
A byproduct of the model is an estimate of how probable each observation or sampled individual is in the population. Residual plots of this information can identify improbable individuals and improbable responses to particular questions, given the responses to others.
Cluster Analysis is a widely used data analysis technique which is applied to problems of different character and size. The task can vary from classifying objects with respect to a number of given patterns to identifying objects with similar behaviour. Many of the algorithms presented in the literature deal with problem sizes of up to 50 or even 100 objects but a small number of features. In the problem we present, we are dealing with about 2.5 million objects and up to 50 features. The data in question originates from a long term cooperation of the Institute of Mathematics and the Public Loan Banks (Öffentliche Bausparkassen). It is the major aim of the cooperation to develop a tool for simulation of future behaviour.
The choice of the algorithms used depends not only on the problem size but also on the presumed data distribution. Choosing a valid similarity or dissimilarity measure is another important question. It depends on the structure of the data and the aim of the cluster analysis. Most important for the practical use of cluster analysis is the evaluation of its results. In our case, we determine the quality of the clusters using the results of the simulation. We use the knowledge gained from the cluster analysis to simulate future behaviour of real Loan Banking Collectives (Bausparkollektive). Moreover, it is used to gain insight into real-life saving patterns.