stat2dat: Convert summary dataset to raw data equivalent

SAS Macro Programs: stat2dat

Version: 1.1 (2 Apr 1999)
Michael Friendly
York University


Convert summary dataset to raw data equivalent

The stat2dat macro takes a dataset containing summary statistics (N, mean, std dev) for a between-groups design and produces a dataset from which PROC GLM can be run, giving an ANOVA summary table equivalent to the one calculated from the raw data. This is useful for re-analysis of published data, where only the summary information is available.
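
The equivalence rests on the fact that, for a one-way between-groups layout, the ANOVA sums of squares depend on the raw data only through the group sizes, means, and standard deviations. As a sketch (the symbols $k$, $n_j$, $\bar{y}_j$, $s_j$, and $n_{+}$ are notation introduced here, not taken from the macro), with $k$ groups:

```latex
\begin{align*}
n_{+} &= \sum_{j=1}^{k} n_j, \qquad
\bar{y} = \frac{1}{n_{+}} \sum_{j=1}^{k} n_j \bar{y}_j \\
SS_{\text{between}} &= \sum_{j=1}^{k} n_j (\bar{y}_j - \bar{y})^2,
  \qquad df_{\text{between}} = k - 1 \\
SS_{\text{error}} &= \sum_{j=1}^{k} (n_j - 1)\, s_j^2,
  \qquad df_{\text{error}} = n_{+} - k \\
MSE &= \frac{SS_{\text{error}}}{n_{+} - k}, \qquad
F = \frac{SS_{\text{between}} / (k - 1)}{MSE}
\end{align*}
```

In a factorial design the error term is the same pooled within-cell sum of squares, and the effect sums of squares are functions of the cell means and cell sizes, so any dataset that reproduces each cell's N, mean, and standard deviation yields the same table. When a single MSE is supplied in place of the $s_j$, every cell is given $s_j^2 = MSE$, so that $SS_{\text{error}} = (n_{+} - k)\,MSE$.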

The input dataset contains one observation for each group. Supply the names of the variables containing the N, MEAN, and standard deviation (STD) for each group (see the argument list below). The mean square error (MSE) from a reported ANOVA can be supplied instead of individual STD values. The sample size per cell can be supplied as a constant rather than a dataset variable if all groups are of the same size.

The output dataset can then be used with PROC GLM or PROC ANOVA (balanced designs). It contains all variables from the input dataset plus a constructed dependent variable ('Y' by default) and a constructed frequency variable ('FREQ' by default).


See David Larsen, "Analysis of Variance With Just Summary Statistics as Input", The American Statistician, May 1992, Vol. 46(2), 151-152.


STAT2DAT is a macro program. Values must be supplied for the N= and MEAN= parameters, and for either the STD= or the MSE= parameter.
   %stat2dat(data=inputdataset, out=outputdataset, ..., 
      depvar=Y, freq=freq)

   proc glm data=outputdataset;
      class classvars;
      freq freq;
      model Y = modelterms;
   run;

DATA=    The name of the input dataset. If not specified, the most recently created dataset is used.
OUT=     The name of the output dataset. If not specified, the new dataset is named DATAn.
CLASS=   The names of one or more class variables, which form the factors in the experimental design.
N=       The dataset variable containing the group sample size, or a constant if all group sizes are equal.
MEAN=    The dataset variable containing the mean for each group.
STD=     The dataset variable containing the group standard deviation. If omitted, supply MSE= instead.
MSE=     A constant or the name of a dataset variable containing the MSE (mean squared error).
Non-zero to print the computed means, to verify the result.
FREQ=    The name of the constructed frequency variable (default: FREQ).
DEPVAR=  The name of the constructed dependent variable (default: Y).
Y to expand the output dataset so that the FREQ= variable need not be used.
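
A minimal sketch of a call that uses STD= rather than MSE=. The dataset SUMMARY, its variables (GROUP, SIZE, AVG, SD), and all of the numbers are hypothetical, made up for illustration; only the macro parameter names come from the description above.

```sas
/* Hypothetical summary statistics: one observation per group */
data summary;
   input group $ size avg sd;
datalines;
A 10 5.2 1.1
B 12 6.0 1.3
C  9 4.7 0.9
;
run;

/* N=, MEAN= and STD= each name a variable in SUMMARY */
%stat2dat(data=summary, class=group,
   n=size, mean=avg, std=sd,
   out=raw, depvar=y, freq=freq);

/* One-way ANOVA on the constructed data, weighted by the FREQ variable */
proc glm data=raw;
   class group;
   freq freq;
   model y = group;
run;
```

Because the group sizes differ here, the design is unbalanced, so PROC GLM rather than PROC ANOVA is the appropriate procedure.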


Keppel, "Design and Analysis" (1991), gives cell sums and the MSE for a three-way ANOVA; the corresponding cell means are entered in the data step below.
title 'Effect of feedback on long-term recall';
* Keppel, 1991, p.434;
data learning;
	do grade = 5,12;
	do words = 'low ', 'high';
	do feedback = 'Control', 'Pos', 'Neg';
		input mean @@;
		output;
		end; end; end;
datalines;
8.8 8.0 7.6
8.0 4.4 3.8
9.0 8.4 8.0
7.8 7.4 7.2
;
To reproduce his analysis, use STAT2DAT to generate equivalent raw data. Note that both N and MSE are given as constants, while MEAN refers to the dataset variable.
%stat2dat(data=learning, class=grade words feedback,
	out=raw, mean=mean, n=5, MSE=1.75, depvar=recall);
proc print data=raw;
run;
The output dataset contains two "pseudo-observations" for each group, calculated so they have the same mean and standard deviation as the raw data.

      OBS   GRADE     WORDS    FEEDBACK    MEAN     RECALL   FREQ
        1       5     low      Control      8.8    9.39161      4
        2       5     low      Control      8.8    6.43357      1
        3       5     low      Pos          8.0    8.59161      4
        4       5     low      Pos          8.0    5.63357      1
        5       5     low      Neg          7.6    8.19161      4
        6       5     low      Neg          7.6    5.23357      1
        7       5     high     Control      8.0    8.59161      4
        8       5     high     Control      8.0    5.63357      1
        9       5     high     Pos          4.4    4.99161      4
       10       5     high     Pos          4.4    2.03357      1
       11       5     high     Neg          3.8    4.39161      4
       12       5     high     Neg          3.8    1.43357      1
       13      12     low      Control      9.0    9.59161      4
       14      12     low      Control      9.0    6.63357      1
       15      12     low      Pos          8.4    8.99161      4
       16      12     low      Pos          8.4    6.03357      1
       17      12     low      Neg          8.0    8.59161      4
       18      12     low      Neg          8.0    5.63357      1
       19      12     high     Control      7.8    8.39161      4
       20      12     high     Control      7.8    5.43357      1
       21      12     high     Pos          7.4    7.99161      4
       22      12     high     Pos          7.4    5.03357      1
       23      12     high     Neg          7.2    7.79161      4
       24      12     high     Neg          7.2    4.83357      1
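
The values in this listing are consistent with the following construction, given here as a sketch inferred from the listing rather than taken from the macro source. For a cell with size $n$, mean $m$, and standard deviation $s$ (here $s = \sqrt{MSE}$, since the MSE was supplied), the two pseudo-observations $y_1$, $y_2$ and their frequencies $f_1$, $f_2$ are:

```latex
\begin{align*}
y_1 &= m + \frac{s}{\sqrt{n}}, & f_1 &= n - 1 \\
y_2 &= m - \frac{(n-1)\,s}{\sqrt{n}}, & f_2 &= 1 \\[4pt]
\frac{f_1 y_1 + f_2 y_2}{n} &= m, &
\frac{f_1 (y_1 - m)^2 + f_2 (y_2 - m)^2}{n - 1}
  &= \frac{(n-1)s^2/n + (n-1)^2 s^2/n}{n - 1} = s^2
\end{align*}
```

With n = 5 and MSE = 1.75, s/sqrt(n) = sqrt(1.75/5) ≈ 0.591608, so the first cell (mean 8.8) gets 8.8 + 0.591608 ≈ 9.39161 with frequency 4 and 8.8 - 4(0.591608) ≈ 6.43357 with frequency 1, matching observations 1 and 2.
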
The analysis is carried out with PROC GLM as follows:
proc glm data=raw;
	class feedback words grade;
	model recall = feedback|words|grade;
	freq freq;
run;
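
As a sanity check on the constructed data, the cell statistics can be recomputed from the pseudo-observations. The sketch below assumes the RAW dataset and the RECALL and FREQ variables created in the example above; it uses only standard PROC MEANS statements.

```sas
/* Recompute N, mean and standard deviation per cell,      */
/* counting each pseudo-observation FREQ times             */
proc means data=raw n mean std maxdec=4;
   class grade words feedback;
   freq freq;
   var recall;
run;
```

Each of the 12 cells should show N = 5, a mean equal to the MEAN value that was entered, and a standard deviation of sqrt(1.75), about 1.3229.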

See also

rpower Retrospective power analysis for univariate GLMs
meanplot Plot means for factorial designs