datachk: Perform basic data checking on a dataset

SAS Macro Programs: datachk

Michael Friendly
York University


Perform basic data checking on a dataset

The DATACHK macro performs basic data screening/checking on numeric variables in a dataset, and is designed to give a compact overview of many variables. The output is a small subset of that produced by the UNIVARIATE procedure. For each variable, a few statistics and the lowest/highest observations are shown in a format like:

 Variable             Stat    Value       Extremes Id
 ATBAT                N            322          16 Tony Armas
                      Miss           0          19 Cliff Johnson
 Times at Bat         Mean    380.9286          19 Terry Kennedy
                      Std      153.405          20 Mike Schmidt
                      Skew    -0.07806
                                                663 Joe Carter
                                                677 Don Mattingly
                                                680 Kirby Puckett
                                                687 T Fernandez
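
For comparison, similar statistics can be requested directly from the standard procedures. The following is only a sketch, not the macro's own code; it assumes the BASEBALL data set and its NAME variable, as used in the example call further on.

```sas
/* Sketch only: the five statistics shown in the Stat column,
   computed with PROC MEANS for two of the variables */
proc means data=baseball n nmiss mean std skewness;
   var atbat hits;
run;

/* The full PROC UNIVARIATE output also lists the lowest and highest
   observations; the ID statement labels them with the NAME variable */
proc univariate data=baseball;
   var atbat hits;
   id name;
run;
```

The procedure output runs to several pages per variable, which is the bulk that DATACHK condenses.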

In addition, when the SPLOT=Y option is given, a schematic plot (boxplot) of the standard scores is shown for all variables (assuming the SPLOT macro is available).

If a CLASS= variable is specified, this output is produced for each value of the CLASS= variable.

The macro also prints a table of standardized (Z) scores for all observations that have at least NOUT z-scores greater than ZOUT in absolute value.
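
The selection rule for this table can be illustrated outside the macro. The sketch below is an assumption about how such a rule could be coded, not the macro's implementation. The cutoff values, the data set names ZSCORES and UNUSUAL, and the counter NBIG are all hypothetical.

```sas
/* Sketch only: hypothetical cutoffs, not the macro's defaults */
%let zout = 2;   /* |z| beyond which a score counts as unusual    */
%let nout = 2;   /* number of unusual scores needed to print a row */

/* Replace each variable by its standard score (mean 0, std 1) */
proc standard data=baseball mean=0 std=1 out=zscores;
   var salary runs hits rbi atbat;
run;

/* Keep observations with at least &nout scores beyond &zout */
data unusual;
   set zscores;
   array z{*} salary runs hits rbi atbat;
   nbig = 0;
   do i = 1 to dim(z);
      /* a missing score never satisfies the comparison */
      if abs(z{i}) > &zout then nbig = nbig + 1;
   end;
   if nbig >= &nout;
   drop i;
run;

proc print data=unusual;
   id name;
   var salary runs hits rbi atbat nbig;
run;
```

Counting across variables rather than flagging on a single large score keeps the listing short when many variables are screened at once.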



The DATACHK macro takes 10 keyword arguments. The VAR= variable list is required.

The arguments may be listed within parentheses in any order, separated by commas. For example:

   %datachk(data=baseball, var=salary runs hits rbi atbat, ..., )


Default values are shown after the name of each parameter.
Name of input data set
Variable(s) to be screened. You may use any of the shorthands for variable lists, e.g., X1-X20, STATA--STATZ, _NUMERIC_.
Class/grouping variable
Name of id variable, used to label observations in the output.
Name of the output dataset, containing the highest and lowest observations on each variable.
Output linesize, mainly used for the boxplots produced by the SPLOT option.
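Number of low/high observations printed [not yet implemented]
Z-score cutoff (ZOUT) for treating an observation as unusual
Number of z-scores exceeding ZOUT in absolute value (NOUT) needed to print an observation
Whether to print boxplots of the standard scores (SPLOT=Y)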
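
As an illustration of how several of these arguments combine, the calls below are a sketch under stated assumptions: LEAGUE is assumed to be a grouping variable in the BASEBALL data set, and the ZOUT= and NOUT= values are arbitrary rather than the macro's defaults.

```sas
/* Sketch only: LEAGUE is an assumed grouping variable;
   the ZOUT= and NOUT= values are illustrative */
%datachk(data=baseball,
         var=salary runs hits rbi atbat,
         id=name,
         class=league,
         zout=2.5,
         nout=2,
         splot=Y);

/* Variable-list shorthand: screen every numeric variable */
%datachk(data=baseball, var=_numeric_, id=name);
```

Because the arguments are keywords, they can be given in any order and any that are omitted take their default values.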


The following call screens the variables SALARY RUNS HITS RBI ATBAT in the Baseball data set.

   %include macros(datachk);        *-- or include in an autocall library;

   %datachk(data=baseball, id=name, var=salary runs hits rbi atbat, ls=89);

This produces the output shown below, starting with the variable summaries:

Variable             Stat    Value       Extremes Id

ATBAT                N            322          16 Tony Armas
                     Miss           0          19 Cliff Johnson
Times at Bat         Mean    380.9286          19 Terry Kennedy
                     Std      153.405          20 Mike Schmidt
                     Skew    -0.07806
                                              663 Joe Carter
                                              677 Don Mattingly
                                              680 Kirby Puckett
                                              687 T Fernandez

HITS                 N            322           1 Mike Schmidt
                     Miss           0           2 Tony Armas
Hits                 Mean    101.0248           3 Doug Baker
                     Std     46.45474           4 Terry Kennedy
                     Skew    0.291154
                                              211 Tony Gwynn
                                              213 T Fernandez
                                              223 Kirby Puckett
                                              238 Don Mattingly
 ..(others omitted)...

SALARY               N            263          68 B Robidoux
                     Miss          59          68 Mike Kingery
Salary (in 1000$)    Mean    535.9658          70 Al Newman
                     Std      451.104          70 Curt Ford
                     Skew    1.589077
                                             1975 Don Mattingly
                                             2127 Mike Schmidt
                                             2413 Jim Rice
                                             2460 Eddie Murray

Adding SPLOT=Y to the call gives schematic plots of the standard scores:

                                     Schematic Plots                                     
Variable=COL1          Standard score                                                    
          5 +                                                                            
            |                                                            0               
          4 +                                                                            
            |                                                            0               
            |                                                            0               
          3 +                        |                       0           0               
            |                        |           0                       0               
            |                        |           |           |           0               
            |                        |           |           |           0               
          2 +            |           |           |           |           |               
            |            |           |           |           |           |               
            |            |           |           |           |           |               
            |            |           |           |           |           |               
          1 +            |           |           |           |           |               
            |         +-----+     +-----+     +-----+     +-----+        |               
            |         |     |     |     |     |     |     |     |     +-----+            
            |         |     |     |     |     |     |     |     |     |     |            
          0 +         *--+--*     *--+--*     |  +  |     *--+--*     |  +  |            
            |         |     |     |     |     *-----*     |     |     *-----*            
            |         |     |     |     |     |     |     |     |     |     |            
            |         +-----+     +-----+     +-----+     +-----+     +-----+            
         -1 +            |           |           |           |           |               
            |            |           |           |           |                           
            |            |           |           |           |                           
            |            |           |           |           |                           
         -2 +            |           |                       |                           
            |            |           |                                                   
            |            |                                                               
         -3 +                                                                            
     _NAME_             ATBAT        HITS         RBI        RUNS      SALARY            

See also

boxplot Box-and-whisker plots
nqplot Normal QQ plot
outlier Robust multivariate outlier detection
symplot Diagnostic plots for transformations to symmetry