SAS Macro Programs: datachk

Michael Friendly
York University



The datachk macro (get datachk.sas)

Perform basic data checking on a dataset

The DATACHK macro performs basic data screening and checking on the numeric variables in a dataset, and is designed to give a compact overview of many variables. The output is a small subset of that produced by the UNIVARIATE procedure. For each variable, a few statistics and the lowest and highest observations are shown in a format like:

 Variable             Stat    Value       Extremes Id
 ATBAT                N            322          16 Tony Armas
                      Miss           0          19 Cliff Johnson
 Times at Bat         Mean    380.9286          19 Terry Kennedy
                      Std      153.405          20 Mike Schmidt
                      Skew    -0.07806
                                                663 Joe Carter
                                                677 Don Mattingly
                                                680 Kirby Puckett
                                                687 T Fernandez
                    --------------------------------------------
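
For comparison, the fuller output that this display condenses could be obtained from PROC UNIVARIATE directly. A minimal sketch, assuming the BASEBALL data set with the ATBAT variable and a NAME id variable as in the listing above:

```sas
/* Sketch for comparison only; this is not part of the DATACHK macro.  */
/* NEXTROBS=4 requests the 4 lowest and 4 highest observations, and    */
/* the ID statement labels them with the NAME variable.                */
proc univariate data=baseball nextrobs=4;
   var atbat;
   id name;
run;
```

This prints the complete set of moments, quantiles, and extreme observations for a single variable, which is the material DATACHK reduces to a few lines per variable.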

In addition, a schematic plot (boxplot) of the standard scores of all variables is shown when the SPLOT=Y option is given, provided the SPLOT macro is available.

If a CLASS= variable is specified, this output is produced for each value of the CLASS= variable.

The macro also prints a table of standardized (Z) scores for every observation that has at least NOUT z-scores exceeding ZOUT in absolute value.
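
The selection rule for this table can be sketched in a few lines of SAS. This is an illustration under stated assumptions, not the macro's actual source: it assumes the BASEBALL data set and the NAME id variable from the example, and the data set names ZSCORES and UNUSUAL and the counter NBIG are made up for the sketch.

```sas
/* Sketch only: not the source of the DATACHK macro. */
%let zout = 2;   /* |z| threshold, playing the role of ZOUT=        */
%let nout = 2;   /* minimum count of extreme z-scores, as in NOUT=  */

/* Replace each listed variable by its standard score (mean 0, std 1) */
proc standard data=baseball mean=0 std=1 out=zscores;
   var salary runs hits rbi atbat;
run;

/* Keep observations with at least &nout z-scores beyond &zout */
data unusual;
   set zscores;
   array z{*} salary runs hits rbi atbat;
   nbig = 0;
   do i = 1 to dim(z);
      /* missing values are skipped rather than counted as extreme */
      if not missing(z{i}) and abs(z{i}) > &zout then nbig = nbig + 1;
   end;
   drop i;
   if nbig >= &nout;   /* subsetting IF: output only the unusual rows */
run;

proc print data=unusual;
   id name;
   var salary runs hits rbi atbat nbig;
run;
```

Raising either macro variable makes the rule stricter, so fewer observations are printed.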

Method

Usage

The DATACHK macro takes 10 keyword arguments. The VAR= variable list is required.

The arguments may be listed within parentheses in any order, separated by commas. For example:

   %datachk(data=baseball, var=salary runs hits rbi atbat, ...);

Parameters

Default values are shown after the name of each parameter.
DATA=_last_
Name of input data set
VAR=
Variable(s) to be screened. You may use any of the shorthands for variable lists, e.g., X1-X20, STATA--STATZ, _NUMERIC_.
CLASS=
Class/grouping variable
ID=
Name of id variable, used to label observations in the output.
OUT=_hilo_
Name of the output dataset, containing the highest and lowest observations on each variable.
LS=80
Output linesize, mainly used for the boxplots produced by the SPLOT option.
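LOHI=4
Number of low/high observations printed [not yet implemented]
ZOUT=2
Absolute z-score beyond which an observation is treated as unusual
NOUT=2
Minimum number of z-scores with |z| > &ZOUT needed for an observation to be printed
SPLOT=yes
Whether to print boxplots of the standard scores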
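
As an illustration of how these parameters combine, the hypothetical call below screens the same variables separately for each level of a grouping variable and tightens the rule for flagging unusual observations. LEAGUE is an assumed variable name used only for this sketch; it is not mentioned elsewhere on this page.

```sas
/* Hypothetical call: LEAGUE is an assumed class variable.            */
/* An observation is listed only if at least 3 of its z-scores        */
/* exceed 2.5 in absolute value.                                      */
%datachk(data=baseball, id=name, class=league,
         var=salary runs hits rbi atbat,
         zout=2.5, nout=3);
```

All other parameters keep the defaults listed above.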

Example

The following call screens the variables SALARY, RUNS, HITS, RBI, and ATBAT in the Baseball data set.

   %include macros(datachk);        *-- or include in an autocall library;

   %datachk(data=baseball, id=name, var=salary runs hits rbi atbat, ls=89);

This produces the output shown below, starting with the variable summaries:

Variable             Stat    Value       Extremes Id


ATBAT                N            322          16 Tony Armas
                     Miss           0          19 Cliff Johnson
Times at Bat         Mean    380.9286          19 Terry Kennedy
                     Std      153.405          20 Mike Schmidt
                     Skew    -0.07806
                                              663 Joe Carter
                                              677 Don Mattingly
                                              680 Kirby Puckett
                                              687 T Fernandez
                     -------------------------------------------


HITS                 N            322           1 Mike Schmidt
                     Miss           0           2 Tony Armas
Hits                 Mean    101.0248           3 Doug Baker
                     Std     46.45474           4 Terry Kennedy
                     Skew    0.291154
                                              211 Tony Gwynn
                                              213 T Fernandez
                                              223 Kirby Puckett
                                              238 Don Mattingly
                     -------------------------------------------
 ..(others omitted)...

SALARY               N            263          68 B Robidoux
                     Miss          59          68 Mike Kingery
Salary (in 1000$)    Mean    535.9658          70 Al Newman
                     Std      451.104          70 Curt Ford
                     Skew    1.589077
                                             1975 Don Mattingly
                                             2127 Mike Schmidt
                                             2413 Jim Rice
                                             2460 Eddie Murray
                     -------------------------------------------

The SPLOT option gives:

                                     Schematic Plots                                     
                                                                                         
Variable=COL1          Standard score                                                    
                                                                                         
            |                                                                            
          5 +                                                                            
            |                                                                            
            |                                                                            
            |                                                            0               
          4 +                                                                            
            |                                                                            
            |                                                            0               
            |                                                            0               
          3 +                        |                       0           0               
            |                        |           0                       0               
            |                        |           |           |           0               
            |                        |           |           |           0               
          2 +            |           |           |           |           |               
            |            |           |           |           |           |               
            |            |           |           |           |           |               
            |            |           |           |           |           |               
          1 +            |           |           |           |           |               
            |         +-----+     +-----+     +-----+     +-----+        |               
            |         |     |     |     |     |     |     |     |     +-----+            
            |         |     |     |     |     |     |     |     |     |     |            
          0 +         *--+--*     *--+--*     |  +  |     *--+--*     |  +  |            
            |         |     |     |     |     *-----*     |     |     *-----*            
            |         |     |     |     |     |     |     |     |     |     |            
            |         +-----+     +-----+     +-----+     +-----+     +-----+            
         -1 +            |           |           |           |           |               
            |            |           |           |           |                           
            |            |           |           |           |                           
            |            |           |           |           |                           
         -2 +            |           |                       |                           
            |            |           |                                                   
            |            |                                                               
            |                                                                            
         -3 +                                                                            
             ------------+-----------+-----------+-----------+-----------+-----------    
     _NAME_             ATBAT        HITS         RBI        RUNS      SALARY            

See also

boxplot Box-and-whisker plots
nqplot Normal QQ plot
outlier Robust multivariate outlier detection
scatmat
splot
symplot Diagnostic plots for transformations to symmetry