Multivariate Data Analysis
Psychology 6140

Final Graded Assignment

Several research problems involving ANOVA, MANOVA, repeated measures, logistic regression, or factor analysis are described below. There is also a "free choice" topic. Your goal in this assignment is to demonstrate what you have learned about the methods of analysis covered in the second half of the course.

The SAS input files are linked on this page; some R and SPSS versions are available on the Hebb server or the web.

For TWO of these problems,

Place the Results/Discussion section first, followed by answers to the questions, as seems appropriate for the problem. To avoid unnecessary duplication, in answering the questions you may refer to results already described in your results section. Except for the "free choice" question, you need not provide any general introduction, methods, or conclusions sections.

For ease of reading, please format your paper with figures and tables presented inline where possible, rather than as a manuscript submission, where figures and tables generally appear at the end. If you use R and RStudio, you may find it convenient to write your reports using R Markdown, which allows you to mix normal writing with R code and output.

1. Weight-Loss Clinics

A behavioural manual was developed to supplement the treatments offered by weight-loss clinics. The manual described techniques for self-monitoring, developing effective coping strategies, changing eating habits, and avoiding regaining the lost weight over time. Two clinics were selected for study and each ran two groups at different times during the same evening of each week. Within each clinic one group was given the behavioural manual in addition to the regular package while another group was not given the manual. In addition, it was thought that the length of time an individual had been trying to lose weight might affect the outcome, so the volunteers were classified as "experienced slimmers" or "novice slimmers". The between-S design is thus a 3-way factorial, Condition x Slimming Status x Clinic.

Weight and body girth measures were taken on three occasions: 9 weeks, 3 months, and 1 year. The weight measures were first expressed as a percentage overweight value, taking the person's height and age into account. These overweight percentages were then expressed as a percentage change on each occasion, relative to the initial baseline value taken prior to the start of the course. For example, a value of -5.5 for OW9 means a 5.5% decrease in the overweight percentage at 9 weeks, relative to the overweight percentage at baseline. Similarly, the girth measures (the 3-month values are to be analyzed here) were expressed as percentage change from the baseline values.
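As a small worked illustration of how such a change score could arise, the Python sketch below computes a percentage change relative to baseline. The formula 100 x (follow-up - baseline) / baseline is an assumption about how the scores were derived, and the participant's values are hypothetical.

```python
def percent_change(baseline, followup):
    """Percentage change of a follow-up value relative to its baseline (assumed formula)."""
    return 100.0 * (followup - baseline) / baseline

# Hypothetical participant: 40% overweight at baseline, 37.8% overweight at 9 weeks.
ow9 = percent_change(40.0, 37.8)
print(round(ow9, 1))  # -5.5
```

A negative value therefore indicates that the participant was less overweight at follow-up than at baseline.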


The data are contained in the SAS file slim.sas in the N:\psy6140\data directory on the Hebb server. An SPSS version is available in N:\psy6140\lib\spss\slim.sav, or slim.sav on the web. There is also a CSV file, slim.csv, that can be read in R or other software. The variables are:
 COND        - Condition:       1=Experimental, 2=Control
 STATUS      - Slimming status: 1=Experienced, 2=Novice
 CLINIC      - A or B
 OW9 OW3 OW1 - change in overweight percentage, at 9 weeks, 3 months, 1 year, 
               relative to baseline.
 BUST -- ARM - percentage change in various girth measures at 3 months vs. baseline.


  1. The researchers' first concern was with the overweight variables. Carry out a multivariate analysis to determine if mean differences exist in the OW variables according to the between-S variables. Perform a parallel analysis treating the OW variables as a repeated-measures factor (for this purpose, assume the measures were equally spaced in time). Summarize and contrast the results of these analyses.
  2. The researchers next wished to assess the impact of condition and status on the girth change measures; in particular, they wished to know if the behavioural manual or slimming status had differential effects on the measures at different body locations. Perform an appropriate analysis to answer these questions. Summarize these results and describe how your analyses relate to the questions asked.
  3. A final question is whether weight change measures add anything to the analysis of treatment and status effects. Repeat the analysis performed for the previous question, but enter the overweight measures as covariates (predictors).
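For item 1, it may help to recall that a repeated-measures analysis works with contrasts among the occasions. The Python sketch below (an illustration, not part of the course software) applies the standard orthogonal polynomial coefficients for three equally spaced occasions, linear (-1, 0, 1) and quadratic (1, -2, 1), to one hypothetical participant's OW9, OW3, OW1 values. The function name and the data are assumptions made for the example.

```python
# Orthogonal polynomial coefficients for three equally spaced occasions.
LINEAR = (-1, 0, 1)
QUADRATIC = (1, -2, 1)

def contrast_score(values, coefs):
    """Weighted sum of one subject's repeated measures."""
    return sum(c * v for c, v in zip(coefs, values))

# Hypothetical participant: (OW9, OW3, OW1) percentage changes from baseline.
ow = (-5.5, -8.0, -6.5)
linear_score = contrast_score(ow, LINEAR)        # OW1 - OW9
quadratic_score = contrast_score(ow, QUADRATIC)  # OW9 - 2*OW3 + OW1
print(linear_score, quadratic_score)  # -1.0 4.0
```

Between-S effects on such contrast scores correspond to interactions of the between-S factors with occasion.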

2. Treatment for Alzheimer's Disease

Recent work suggests that Alzheimer's may involve pathological changes in the central cholinergic system which result in deterioration in memory. If so, it might be possible to halt or slow down the memory impairment by long-term dietary supplements of lecithin, a chemical precursor of choline.

A study was carried out with two randomly assigned groups of Alzheimer's patients, one group being given lecithin and the other given a placebo over a 6-month period. To assess memory functioning in a sensitive way, two types of free recall tests were given to each subject at each of five times: 0, 1, 2, 4, and 6 months. In the first type the same words were repeated at each test; in the second, different but equivalent words were used each time. Hence, differences in performance on the two types of tests should be attributable to long-term learning.

The design therefore has one between-S factor and two within-S factors. The major question is whether the difference in performance on the two test types is the same or not for the two groups.


The data are contained in the file lecithin.sas on the server. An SPSS .sav file version is available in N:\psy6140\lib\spss\lecithin.sav or lecithin.sav on the web. The CSV version is lecithin.csv.

Scores on the repeated test are denoted A1 - A5; scores on the non-repeated test are B1 - B5. Each score is the number of words recalled out of 30. Group is coded 1 = Placebo, 2 = Lecithin.
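Because each score is a count out of 30, it can be re-expressed as a proportion. One commonly used transformation for proportions is the arcsine square root; the Python sketch below shows it on a few hypothetical scores. Whether this (or any) transformation is actually needed for these data is for you to judge; the function name is an assumption made for the example.

```python
import math

MAX_WORDS = 30  # each recall score is the number of words recalled out of 30

def arcsine_sqrt(count, n=MAX_WORDS):
    """Arcsine square-root transform of a count out of n, in radians."""
    return math.asin(math.sqrt(count / n))

# Hypothetical recall scores spanning the possible range.
scores = [0, 6, 15, 24, 30]
transformed = [arcsine_sqrt(s) for s in scores]
for s, t in zip(scores, transformed):
    print(f"{s:2d} -> {t:.3f}")
```

The transformed values run from 0 (no words recalled) to pi/2 (all 30 recalled) and stretch the scale near both extremes.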


  1. Examine the data for multivariate outliers, and examine the need for a transformation of these variables to approximate symmetry. [Since the data are counts out of a maximum, they are analogous to proportions.]
  2. Carry out the complete repeated measures analysis for the 3-way design for these data, with appropriate tests for (a) whether assumptions of the univariate (mixed model) analysis are met; and (b) polynomial trends for the TIME factor. Note that the time points are unequally spaced, so you will have to include the time values in the REPEATED statement.
  3. In a data step, construct new variables:
     ABAR = mean( of A1-A5 );
     BBAR = mean( of B1-B5 );
     SBAR = mean( of A1--B5 );
     ABDIF = ABAR - BBAR;
    Carry out a univariate analysis of group differences on each of the variables ABAR, BBAR, SBAR, and ABDIF, and show how these relate to the analyses carried out in step (2).
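Item 2 notes that the time points (0, 1, 2, 4, and 6 months) are unequally spaced, so the familiar integer polynomial coefficients for equal spacing do not apply. As a minimal sketch of what supplying the time values accomplishes, the Python below builds orthogonal linear and quadratic contrasts for these spacings by Gram-Schmidt orthogonalization of the powers of time. The function name is an assumption made for the example, and SAS may scale its coefficients differently.

```python
def orthogonal_polynomials(times, degree):
    """Unit-length orthogonal polynomial contrasts of degree 1..degree for the given times."""
    n = len(times)
    basis = [[1.0] * n]  # start with the constant term
    for d in range(1, degree + 1):
        v = [float(t) ** d for t in times]
        # Remove the part of t**d explained by each lower-order vector.
        for b in basis:
            proj = sum(x * y for x, y in zip(v, b)) / sum(y * y for y in b)
            v = [x - proj * y for x, y in zip(v, b)]
        basis.append(v)
    contrasts = []
    for v in basis[1:]:  # drop the constant term
        norm = sum(x * x for x in v) ** 0.5
        contrasts.append([x / norm for x in v])
    return contrasts

times = [0, 1, 2, 4, 6]  # months
linear, quadratic = orthogonal_polynomials(times, 2)
print([round(c, 3) for c in linear])
print([round(c, 3) for c in quadratic])
```

The linear contrast is proportional to each time minus the mean time (2.6 months), so later occasions receive more extreme weights than they would under equal spacing.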

3. Survival in the ICU

This question uses logistic regression to study the survival of patients following admission to an adult intensive care unit (ICU). The data consist of a sample of 200 subjects who were part of a much larger study. The data are taken from Applied Logistic Regression by Hosmer and Lemeshow.


The ICU data set has 200 rows and 21 variables. The rows correspond to the 200 patients, while the columns correspond to the input and response variables. Column 1 is an identification code (ID) unique to each patient; it can be ignored in the analysis, but should be used to identify individual patients.

The response variable Y = died indicates whether the subject was alive (0) or dead (1) when he/she left the ICU. The remaining variables in columns 3-21 are the predictor variables. The binary predictor variables are all coded so that the value 1 corresponds to a possible risk factor. In addition, the 3-level variable race has been supplemented by a binary variable white, and the variable coma by a binary variable uncons = (coma>0). A code sheet for the variables is provided in Table 1. The data are available as a SAS input file, N:\data\icu.sas; as an SPSS input file, N:\data\icu.sps; as an SPSS system file, icu.sav; and as a plain ASCII data file, N:\data\icu.dat (for use with any other statistics package). In R, the data set ICU is contained in the vcdExtra package.

Table 1: Code Sheet for ICU Data
 No.  Description                                               Codes / Values                                 Name
   1  Identification code                                       ID number                                      id
   2  Vital status                                              0 = Lived, 1 = Died                            died
   3  Age                                                       Years                                          age
   4  Sex                                                       0 = Male, 1 = Female                           sex
   5  Race                                                      1 = White, 2 = Black, 3 = Other                race
   6  Service at ICU admission                                  0 = Medical, 1 = Surgical                      service
   7  Cancer part of present problem                            0 = No, 1 = Yes                                cancer
   8  History of chronic renal failure                          0 = No, 1 = Yes                                renal
   9  Infection probable at ICU admission                       0 = No, 1 = Yes                                infect
  10  CPR prior to ICU admission                                0 = No, 1 = Yes                                cpr
  11  Systolic blood pressure at ICU admission                  mm Hg                                          systolic
  12  Heart rate at ICU admission                               beats/min                                      hrtrate
  13  Previous ICU admission within 6 months                    0 = No, 1 = Yes                                previcu
  14  Type of admission                                         0 = Elective, 1 = Emergency                    admit
  15  Fracture: long bone, multiple, neck, single area, or hip  0 = No, 1 = Yes                                fracture
  16  PO2 from initial blood gases                              0 = >60, 1 = ≤60                               po2
  17  pH from initial blood gases                               0 = ≥7.25, 1 = <7.25                           ph
  18  PCO2 from initial blood gases                             0 = ≤45, 1 = >45                               pco
  19  Bicarbonate from initial blood gases                      0 = ≥18, 1 = <18                               bic
  20  Creatinine from initial blood gases                       0 = ≤2.0, 1 = >2.0                             creatin
  21  Level of consciousness at ICU admission                   0 = No coma or stupor, 1 = Deep stupor, 2 = Coma  coma
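The two supplementary indicators described above can be derived from the coded variables. The Python sketch below shows one way to do so on hypothetical patient records. The definition uncons = (coma>0) comes from the text; taking white to mean race equal to 1 is an assumption based on the code sheet.

```python
def add_indicators(patient):
    """Return a copy of a patient record with the binary indicators added."""
    out = dict(patient)
    out["white"] = int(patient["race"] == 1)  # assumed: race 1 = White
    out["uncons"] = int(patient["coma"] > 0)  # deep stupor or coma
    return out

# Hypothetical records, for illustration only (not taken from the ICU data).
patients = [
    {"id": 1, "race": 1, "coma": 0},
    {"id": 2, "race": 2, "coma": 2},
    {"id": 3, "race": 3, "coma": 1},
]
coded = [add_indicators(p) for p in patients]
for p in coded:
    print(p["id"], p["white"], p["uncons"])
```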
Note: In SAS/INSIGHT, you can do a logistic regression from the Fit Y X menu, by selecting Options -> Response Dist binomial, Link function: logit.


  1. Run a logistic regression using all 19 predictor variables --
    age sex white service cancer renal infect
    cpr systolic hrtrate previcu admit fracture po2 ph pco bic
    creatin uncons
    Which variables appear to be strong predictors of survival? Which variables appear to be unnecessary? Are there any variables which, on logical grounds, should be included in any model?
  2. Use forward, backward, or stepwise selection to determine a minimal model where all predictors are individually significant at the 0.05 level, but perhaps forcing any variables you consider necessary to remain. Compare this model to your model with all variables in terms of goodness of fit and lack of fit.
  3. Investigate whether any quadratic terms (in a quantitative predictor) or interaction terms are necessary among those predictors in your model from the previous step.
  4. Without further analysis (or biological background), does the model seem reasonable? Which terms did you expect to see?
  5. In a sample of this size and heterogeneity, it may be considered likely that there are one or more cases of high leverage or influence. Find the 3-6 cases with the largest value of Cook's D (the C statistic in SAS) and interpret the nature of their influence on your model. [Hint: Use the %inflogis macro.]
  6. Investigate the predicted probability of death while in the ICU for the patients in this sample. Use your final model to obtain predicted probabilities. Plot Pr(died) as a function of age, with other variables held fixed at high or low values of some of your important risk factors.
  7. Given the level of success of your model in predicting a patient's outcome, how might this study be re-done to increase the accuracy of prediction? Is there an alternative design, or other variables which should be measured?
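For item 6, a fitted logistic model gives a linear predictor on the logit scale, and the predicted probability is obtained with the inverse logit. The Python sketch below evaluates Pr(died) over a grid of ages at two levels of one risk factor. The coefficients are HYPOTHETICAL placeholders, not estimates from the ICU data; substitute the values from your own final model.

```python
import math

def inv_logit(eta):
    """Convert a linear predictor on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# HYPOTHETICAL coefficients, for illustration only.
COEFS = {"intercept": -5.0, "age": 0.04, "uncons": 3.0}

def pr_died(age, uncons):
    eta = COEFS["intercept"] + COEFS["age"] * age + COEFS["uncons"] * uncons
    return inv_logit(eta)

ages = range(20, 91, 10)
low = [pr_died(a, 0) for a in ages]   # uncons = 0
high = [pr_died(a, 1) for a in ages]  # uncons = 1
for a, p0, p1 in zip(ages, low, high):
    print(f"age {a:2d}: uncons=0 -> {p0:.3f}, uncons=1 -> {p1:.3f}")
```

Plotting the two lists against age gives one curve per level of the risk factor, which is the kind of display the question asks for.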

4. Factor Analysis of Twenty-Four Ability Variables

Holzinger & Swineford (1939) gave 24 tests of a variety of psychological abilities to junior high school students at two schools. These data are typical of the kinds of ability tests which have been used throughout the history of factor analysis and are one of the most widely studied sets of correlations in the factor analysis literature. The factor analytic problem is to determine the number and kind of dimensions or latent abilities which may be used to describe the correlations among these tests. Sample test items from the tests are given in HolzingerSwineford.pdf. The orienting questions here are more detailed than in other problems; you are free to choose a reasonable subset.

The data for this problem are available in several forms on the Hebb N: disk:

The raw data for both samples are contained in the file psych24r.sas, in SPSS format as psych24r.sps, and in CSV format as psych24r.csv. An R script, psych24r.R, is also provided for reading in the raw data from the CSV file. The raw data also give sex and chronological age for all subjects. The two samples are distinguished by the variable GRP.

The correlations for the Grant-White sample (N=145) are contained in the file psych24c.sas. The means, standard deviations and measures of reliability(2) are also contained in the correlation file for the Grant-White data. The same correlations are also provided in R format, in psych24c.R.
(2) Gorsuch (1974) reports that "the raw data does not always agree with the published statistics". Here we will assume that the correlations are correct.

The common practice when the Grant-White sample alone is analyzed is to include variables 25 and 26 but not variables 3 and 4 (25 and 26 were attempts to develop better tests for variables 3 and 4). However, in order to compare the results of the two schools, we will ignore variables 25 and 26 here. Moreover, to reduce the size of the problem somewhat, we will use only the first 18 variables, named V1-V18 in the SAS files. (For those wishing to try the AMOS program, two raw data files, GRANT AMD and PASTEUR AMD, containing only V1-V18 are available on the Hebb server.)

Gedanken Experiment

From the descriptions of the tests, attempt to develop some theory, however vague, of the manifest content of these 18 tests. Which ones should tend to tap the same underlying abilities? How many different abilities are there? What results are reported in previous analyses? Do they make sense?

Data screening

Examine the raw data in the Grant-White sample (GRP=1) for:
  1. Univariate outliers or unusual observations on individual variables (Proc UNIVARIATE with PLOT option).
  2. Univariate normality (skewness, kurtosis)
  3. Multivariate outliers: observations whose distance from the centroid is large (see OUTLIER SAS on the class disk)
  4. Do the means and standard deviations agree with the published data?
Also, is there any evidence that the means of the two samples differ substantially for any of the variables?
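For item 3, one common measure of distance from the centroid is the squared Mahalanobis distance, D² = (x - mean)' S⁻¹ (x - mean), where S is the sample covariance matrix. The Python sketch below computes it with the standard library only, on hypothetical two-variable data. It is an illustration of the idea, not a description of what the class outlier program computes.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1)."""
    n, m = len(rows), mean_vector(rows)
    p = len(m)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of each row from the centroid."""
    m, s = mean_vector(rows), covariance_matrix(rows)
    out = []
    for r in rows:
        d = [ri - mi for ri, mi in zip(r, m)]
        out.append(sum(di * zi for di, zi in zip(d, solve(s, d))))
    return out

# Hypothetical scores on two tests; the last case runs against the positive trend.
data = [(10, 12), (12, 14), (14, 15), (16, 18), (18, 19), (20, 22), (11, 21)]
d2 = mahalanobis_sq(data)
for row, d in zip(data, d2):
    print(row, round(d, 2))
```

Under multivariate normality, D² values are often compared with chi-square quantiles on p degrees of freedom (p = number of variables), so cases with unusually large D² stand out.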

Exploratory Factor Analysis

Use the correlation matrix from the Grant-White data (in psych24c.sas) for this part.(3)
(3) Since the data set contains the standard deviations, the analysis can be done using either the correlation or covariance matrix.
  1. Determine the number of factors necessary to adequately explain the correlations among these tests. Do various criteria tend to converge or do they indicate different numbers of factors?
  2. Find a rotated factor solution which provides an interpretable description of the correlations among the tests. Are orthogonal or oblique factors more reasonable? If oblique, are all factors correlated, or can some pairs be considered independent?

Confirmatory Factor Analysis

This part of the problem is optional, for extra credit. To test your hypothesized factor structure there are two possibilities:
  1. The simplest is to perform a Procrustes target rotation, specifying a target factor pattern determined from your exploratory analyses. [The target factor pattern is a matrix of 0s and 1s where the 1s specify the variables hypothesized to load on each factor. For PROC FACTOR, this matrix is read in with columns specifying the variables.]
  2. A stronger test is available by fitting a restricted factor model using PROC CALIS, LISREL, AMOS, or the R packages sem and lavaan.
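To make the layout of a target factor pattern concrete, the Python sketch below builds a small 0/1 matrix with one row per factor and one column per variable, as described in item 1. The grouping of variables into factors is HYPOTHETICAL and uses only six variable names for brevity; it is not a proposed solution for these data.

```python
def target_pattern(assignment, variables):
    """Rows are factors, columns are variables; 1 marks a hypothesized loading."""
    return [[1 if v in members else 0 for v in variables]
            for members in assignment.values()]

variables = [f"V{i}" for i in range(1, 7)]  # six variables only, for brevity

# HYPOTHETICAL grouping, for illustration only.
hypothesis = {
    "F1": {"V1", "V2", "V3"},
    "F2": {"V4", "V5", "V6"},
}
target = target_pattern(hypothesis, variables)
for name, row in zip(hypothesis, target):
    print(name, *row)
# F1 1 1 1 0 0 0
# F2 0 0 0 1 1 1
```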

5. Free Choice

Find an interesting set of data for which the methods of the second half of the course (ANOVA/MANOVA, logistic regression, PCA/FA) would be appropriate and which is neither too simple nor too complex (e.g., for ANOVA: 2-3 between-S factors, 2-5 response measures or repeated measures). If you choose this option, you should make up your own questions, comparable in scope to those in the earlier problems. Answer them, and prepare Methods and Results sections suitable for a research report. You will also need to provide a brief introduction to give the context.