data trunc stem | leaf 400.0 --> 400 40 | 0 86.3 --> 86 8 | 6 59.6 --> 59 5 | 9Using the tens' digit for the leaf:
data trunc stem | leaf 400.0 --> 40 4 | 0 86.3 --> 8 0 | 8 59.6 --> 5 0 | 5
0 | 1 | 2 | 3 | 4 | 5 | 6 |
0-0111111111111111112222222222222333344444455555555666666677777888 1-000001222222333445555666778889 2-0015 3-0 4-0 5- 6-5
UNIT: 10 LPS: 1 64) 0-0111111111111111112222222222222333344444455555555666666677777888 37 1-000001222222333445555666778889 7 2-0015 3 3-0 2 4-0 1 5- 1 6-5
UNIT: 10 LPS: 5 18 0-011111111111111111 35 0-22222222222223333 49 0-44444455555555 12) 0-666666677777 40 0-888 37 1-000001 31 1-222222333 22 1-445555 16 1-66677 11 1-8889 7 2-001 4 2- 4 2-5 HIGH: 300 400 650The display above uses five lines for each stem (LPS=5), so each line corresponds to a class interval of 20 in a histogram. Note that leaves of 0 & 1 go on one line, 2 & 3 on the next, etc.
Two lines per stem is another possibility:
UNIT: 10 LPS: 2 41 0-01111111111111111122222222222223333444444 23) 0-55555555666666677777888 37 1-00000122222233344 20 1-5555666778889 7 2-001 4 2-5 3 3-0 2 3- 2 4-0 1 4- 1 5- 1 5- 1 6- 1 6-5
PROC UNIVARIATE in SAS constructs stem and leaf displays for each variable when you use the PLOT option. If the infant mortality rates are the variable IMR in the data set NATION, these lines give (among other things) a stem and leaf display:
proc univariate data=nations plot; var imr;Figure 1 shows the stem and leaf portion of the PROC UNIVARIATE output for the infant mortality data.
VARIABLE=IMR Infant Mortality Rate STEM LEAF # BOXPLOT 6 5 1 * 6 5 5 4 4 0 1 0 3 3 0 1 0 2 6 1 | 2 002 3 | 1 555566666778899 15 | 1 000011222233344 15 +-----+ 0 55555556666666667778888889 26 *--+--* 0 11111111112222222222222233333333344444 38 +-----+ ----+----+----+----+----+----+----+--- MULTIPLY STEM.LEAF BY 10**+02Figure 1: PROC UNIVARIATE Stem and leaf display for infant mortality data
depth of median = d ( M ) = n + 1 / 2where n is the number of observations. If n is even, so d ( M ) is a half integer (e.g., 61.5) we average the two adjacent values (at depth 61 and 62) to get the median, M .
The median is a letter value , marked "M", defined as the middle observation in a distribution. Other letter values are the hinges , marked "H", which are located at the middle of each of the upper and lower halves of the ordered values. (They are essentially the same as the quartiles , but the definitions differ slightly.)
The hinges are also defined in terms of their depth:
depth of hinges = d ( H ) = [ d ( M ) ] + 1 / 2where [ ] means to drop the fractional part, if any.
Other letter values, called eighths (E), sixteenths (D), etc. are defined similarly as the observation half way, in number of observations, between the previous letter value and the nearer extreme.
d ( M ) = n + 1 / 2
d ( H ) = [ d ( M ) ] + 1 / 2
d ( E ) = [ d ( H ) ] + 1 / 2
d ( D ) = [ d ( E ) ] + 1 / 2This process can be continued until we obtain the extremes at depth 1. The calculations can be done directly from the stem and leaf display, but you should take the original data values (before truncation) as the letter values.
UNIT: 10 LPS: 5 18 0|011111111111111111 E= 16.9 35 0|22222222222223333 H=20 -> 25.7 49 0|44444455555555 12) 0|666666677777 D(M)= 101+1/2 = 51 M=60 -> 60.6 40 0|888 37 1|000001 31 1|222222333 D(H)= [51]+1/2= 26 H=120->129.4 22 1|445555 16 1|66677 D(E)= [26]+1/2= 13.5 E=(162.5+170)/2 11 1|8889 = 166.25 7 2|001 D(D)= [13.5]+1/2=7 D=200 4 2| 4 2|5 HIGH: 300 400 650
The batch can be summarized by a letter-value display :
LETRVALS IMR DEPTH LOWER UPPER SPREAD MIDPT M 51 - 60.6 60.6 - 60.6 H 26 - 26.2 129.4 - 103.2 77.8 E 13H - 16.9 166.3 - 149.4 91.6 D 7 - 12.8 200 - 187.2 106.4 1 1 - 9.6 650 - 640.4 329.8
For each pair of letter values, the average of the two, called a midsummary , contributes information about the central tendency of the distribution. The difference, called a spread , indicates the variability of the data.
Note that:
letter normal spread ------ ------------- H 1.349 E 2.301 D 3.068 C 3.726
H L - 1.5 x H-spr
H U + 1.5 x H-spr
For the infant mortality data, the calculation of fences is (from the letter value display)
inner fences: 25.7 - 1.5 * 103.7 = -129.85 129.4 + 1.5 * 103.7 = 284.95 (outside: 300, 400) outer fences: 25.7 - 3 * 103.7 = -285.4 129.4 + 3 * 103.7 = 440.5 (far out: 650-Saudi Arabia) adjacent: 259Here is the boxplot, showing two outside, and one far-out values:
BOXPLOT IMR 9.60 223.07 436.53 650.00 +-------------------+-------------------+-------------------+ +--------+ <-| | |-----------> o o X +--------+ LIBYA AFGHAN. S. ARABIANote that the skewness of the data is apparent in (a) the location of the median relative to the hinges, (b) the relative lengths of the whisker lines, and (c) the outside values on the high end.
The PLOT option of PROC UNIVARIATE also gives a small boxplot. When you have a grouping variable, you can produce full-page, side-by-side boxplots for each group on the printer with PROC UNIVARIATE.
The countries in the NATIONS data set are classified by REGION. Adding the statement BY REGION to the previous example gives side-by-side boxplots. (The data must first be sorted by REGION.)
proc univariate data=nations plot; var imr; by region;
Univariate Procedure Schematic Plots Variable=IMR Infant Mortality Rate | 700 + | | * | 600 + | | | 500 + | | | 400 + 0 | | | 300 + 0 | | | | | 200 + | | | * +-----+ | | *--+--* | | | | +-----+ 100 + +-----+ | + | | +-----+ | | | | *--+--* | 0 *-----* | | *--+--* +-----+ 0 + +-----+ | ------------+-----------+-----------+-----------+----------- REGION Americas Africa Europe Asia/OceFigure 2: Boxplots for IMR by region. The side-by-side display allows visually comparing the groups in central value, spread, shape, and unusual values.
Figure 3 shows boxplots of Infant mortality rate by region for the NATIONS data. This plot is drawn by the statements below. The option I=BOXT draws horizontal lines at the tips of the whisker lines.
title h=1.5 'Boxplot with I=BOX'; symbol i=boxt mode=include /* V6.06 */ v=+ color=black h=1.2; proc gplot data=nations ; plot imr * region / frame vaxis=axis1 haxis=axis2 vminor=1 hminor=0; axis1 label=(h=1.5 a=90 r=0) value=(h=1.3); axis2 label=(h=1.5) value=(h=1.3) offset=(8 pct);Figure 3: Simple boxplot with I=BOX option
The notches to give approximate 95% comparison intervals are calculated as:
Median ± 1.58 ( H-spr / sqrt n )
/* Description of Parameters: */ %macro BOXPLOT( /* ------------------------- */ data=_LAST_, /* Input dataset */ class=, /* Grouping variable(s) */ var=, /* Ordinate variable */ id=, /* Observation ID variable */ width=.5, /* Box width as proportion of maximum */ notch=0, /* =0|1, 1=draw notched boxes */ connect=0, /* =0 or line style to connect medians*/ f=0.5, /* Notch depth, fraction of halfwidth */ fn=1, /* Box width proportional to &FN */ varfmt=, /* Format for ordinate variable */ classfmt=, /* Format for class variable(s) */ varlab=, /* Label for ordinate variable */ classlab=, /* Label for class variable(s) */ yorder=, /* Tick marks, range for ordinate */ anno=, /* Addition to ANNOTATE set */ out=boxstat, /* Output data set: quartiles, etc. */ name=BOXPLOT /* Name for graphic catalog entry */ );The figure below shows notched boxplots for the IMR data produced using the BOXPLOT macro. The width of each box is made proportional to sqrt n .
%include boxplot; title1 h=1.5 'Infant Mortality Rate by Region'; %boxplot(data=nations, class=region, var=imr,id=nation, notch=1, fn=sqrt(n), classfmt=region., varlab=Infant Mortality Rate, classlab=Region);
A particularly useful family of transformations is what Tukey calls the ladder of powers , where a variable x is transformed to t p ( x ) by a power p according to
t p ( x ) = left { lpile x p -1 over p , above log 10 x , lpile p != 0 above p = 0 right "" . (1)Although the family of transformations given by (1) is defined for all powers p , it is common to use only the simple integer and half-integer powers listed in Table 1. (In some situations, we may add the cube root, p = 1 / 3 to this list.) Table 1: Simple transformations in the ladder of powers
+---------------------------------------------+ | Power Transformation Re-expression | |---------------------------------------------| | 3 Cube x 3 | | | | 2 Square x 2 | | | | 1 NONE (Raw) x | | | | 1/2 Square root sqrt x | | | | 0 Log log10 x | | | | -1/2 Reciprocal root -1/sqrt x | | | | -1 Reciprocal -1 / x | +---------------------------------------------+The important properties of this family of transformations are:
log X X X**2 0 1 1 1 { } 9 } 99 1 10 100 1 { } 90 } 9900 2 100 10000 1 { }900 } 990000 3 1000 1000000
Since the power transformations preserve the order of the data values, we can find a suitable transformation by looking at the letter values and their midsummaries transformed to various powers. In a symmetric distribution, the midsummaries do not systematically change as we go out toward the extremes.
LETRVALS IMR * .5 DEPTH LOWER UPPER SPREAD MIDPT M 51 - 7.8 7.8 - 7.8 H 26 - 5.1 11.4 - 6.3 8.2 E 13H - 4.1 12.9 - 8.8 8.5 D 7 - 3.6 14.1 - 10.6 8.9 1 1 - 3.1 25.5 - 22.4 14.3 LETRVALS 10zIMR DEPTH LOWER UPPER SPREAD MIDPT M 51 - 1.8 1.8 - 1.8 H 26 - 1.4 2.1 - .7 1.8 E 13H - 1.2 2.2 - 1 1.7 D 7 - 1.1 2.3 - 1.2 1.7 1 1 - 1 2.8 - 1.8 1.9The boxplots for sqrt IMR and log IMR show the effect of these transformations on the shape of the data. The distribution becomes more symmetric as the power decreases.
BOXPLOT IMR * .5 3.10 10.56 18.03 25.50 +-------------------+-------------------+-------------------+ +----------------+ ----| | |---------------------- +----------------+ BOXPLOT 10zIMR .98 1.59 2.20 2.81 +-------------------+-------------------+-------------------+ +----------------------+ -------------| | |---------------------- +----------------------+
Graphical methods can also be used to find a power transformation for symmetry. The simplest, useful plots are based on the idea that in a symmetric distribution, the distances of points at the lower end to the median should match the distances of corresponding points in the upper end to the median. In general, if the distribution were perfectly symmetric the observations at depth i should satisfy,
Lower distance to median = Upper distance to median
fwd 1.25i M - x (i) = x (n+1-i) - M
mid identical ( x (n+1-i) + x (i) ) / 2 vs. x (n+1-i) - x (i) identical spread .This plot is based on the idea that in a symmetric distribution, each mid value should equal the median,
( x (n+1-i) + x (i) ) / 2 = M .
Since the spreads increase as we go from the center out to the tails, this plot depicts how the pattern of mid values changes from the middle of the distribution to the extremes. Because the plot is untilted (slope = 0) when the distribution is symmetric, expansion of the vertical scale allows us to see systematic departures from flatness far more clearly.
x (i) + x (n+1-i) over 2 - Mas the vertical coordinate, against a squared measure of spread,
Lower 2 + Upper 2 over 4 M = ( M - x (i) ) 2 + ( x (n+1-i) - M ) 2 over 4 Mas the vertical coordinate. If this graph is approximately linear, with a slope, b , then p = 1 - b is the indicated power for a transformation to approximate symmetry.
proc sort data=analyze out=sortup; by X; proc sort data=analyze out=sortdn; by descending X; data symplot; merge sortup(rename=(X=frombot)) /* frombot = x(i) */ sortdn(rename=(X=fromtop)); /* fromtop = x(n+1-i) */ mid = (fromtop + frombot) / 2; spread = fromtop - frombot;These plots are all implemented in the SYMPLOT macro, which takes the following parameters:
%macro SYMPLOT( data=_LAST_, /* data to be analyzed */ var=, /* variable to be plotted */ plot=MIDSPR, /* Type of plot(s): NONE, or any of */ /* UPLO, MIDSPR, MIDZSQ, or POWER */ trim=0, /* # or % of extreme obs. to be trimmed */ symbol=dot, /* plotting symbol */ out=symplot, /* output data set */ name=SYMPLOT); /* name for graphic catalog entry */For example, the symmetry power plot of the IMR data is produced by these statements:
%include symplot; title h=1.6 'IMR data: Symmetry transformation plot'; %symplot(data=nations,var=IMR,plot=POWER,trim=5 pct);Trimming 5% of the observations in each tail ensures that extreme observations do not influence the choice of the power transformation.
The dot chart , developed by Cleveland (1984) is quite effective for this purpose, as long as the number of observations is not too large. If the observations are sorted by the response variable, the dot chart portrays the empirical cumulative distribution of the sample.
Simple dotplots can be produced on a printer with the TIMEPLOT procedure. The observations are arranged on the vertical axis in their order in the data set, so the data set should be sorted in descending order of the response variable. For example, the statements below produce the printer plot shown in Figure 4. The options REF=0 and JOINREF produce the dashed line connecting the plotting symbol to the vertical axis.
%include data(duncan); proc sort data=duncan; by descending prestige; proc timeplot data=duncan; plot prestige = '*' / ref=0 joinref; id job;
Duncan Occupational Prestige Data JOB Prestige min max 0 97 *---------------------------------------------* Physician 97 ||-------------------------------------------*| Professor 93 ||------------------------------------------* | Banker 92 ||------------------------------------------* | Architect 90 ||-----------------------------------------* | Chemist 90 ||-----------------------------------------* | Dentist 90 ||-----------------------------------------* | Lawyer 89 ||----------------------------------------* | Civil Eng. 88 ||----------------------------------------* | Minister 87 ||---------------------------------------* | Pilot 83 ||--------------------------------------* | Accountant 82 ||-------------------------------------* | Factory Owner 81 ||-------------------------------------* | Author 76 ||----------------------------------* | Contractor 76 ||----------------------------------* | PS Teacher 73 ||---------------------------------* | RR Engineer 67 ||------------------------------* | Welfare Wrkr. 59 ||--------------------------* | Undertaker 57 ||-------------------------* | Machinist 57 ||-------------------------* | Electrician 53 ||------------------------* | Reporter 52 ||-----------------------* | Store Manager 45 ||--------------------* | Insur. Agent 41 ||------------------* | Policeman 41 ||------------------* | Bookkeeper 39 ||-----------------* | RR Conductor 38 ||-----------------* | Mail carrier 34 ||---------------* | Carpenter 33 ||--------------* | Plumber 29 ||------------* | Auto repair 26 ||-----------* | Machine opr. 24 ||----------* | Barber 20 ||--------* | Motorman 19 ||--------* | Store clerk 16 ||------* | Cook 16 ||------* | Coal miner 15 ||------* | Truck driver 13 ||-----* | Watchman 11 ||----* | Gas stn attn 10 ||----* | Taxi driver 10 ||----* | Waiter 10 ||----* | Janitor 8 ||---* | Bartender 7 ||--* | Soda clerk 6 ||--* | Shoe-shiner 3 ||* | *---------------------------------------------*Figure 4: Dot chart of Occupational Prestige from PROC TIMEPLOT
%macro dotplot( data=_LAST_, /* input data set */ xvar=, /* horizontal variable (response) */ xorder=, /* plotting range of response */ xref=, /* reference lines for response variable */ yvar=, /* vertical variable (observation label) */ ysortby=&xvar, /* how to sort observations */ ylabel=, /* label for y variable */ group=, /* vertical grouping variable */ gpfmt=, /* format for printing group variable */ /* value (include the . at the end) */ connect=DOT, /* draw lines to ZERO, DOT, AXIS, or NONE */ dline=2, /* style of horizontal lines */ dcolor=BLACK, /* color of horizontal lines */ errbar=, /* variable giving length of error bar */ /* for each observation */ name=DOTPLOT); /* Name for graphic catalog entry */The high-resolution dot chart of occupational prestige is produced by these statements:
%include DUNCAN ; * get the data; %include DOTPLOT; * get the macro; %dotplot( data=duncan, yvar=job, xvar=prestige, connect=DOT);
Quantile comparison plots or Q - Q plots compare the empirical cumulative distribution of the data
G HAT ( x ) = number ( X <= x ) over n = estimated Prob ( X <= x )with the cumulative probability distribution (CDF) of the reference distribution,
F ( z ) = Prob ( Z <= z )
A general, graphical approach is to plot the ordered data values, x (i) , against the expected values of those observations if they followed the particular reference distribution. When the data follows the reference distribution, the points in such a plot will follow a straight line. Departures from the line shows how the data differ from the assumed distribution.
p i = i - 1/2 over n(Other definitions are used also; this one splits each observation, assigning half of it above, and half below x (i) )
z i = F -1 ( p i )
expected = Median ( x ) + ( H-spr / 1.349 ) z i
One way to do this is to calculate an estimated standard error, s Hat ( z i ) , of the ordinate z i and plot curves showing the interval z i ± 2 s Hat ( z sub i ) to give approximate 95% confidence intervals. Chambers et al. (1983) give formulas for estimating the standard errors for most of the common comparison distributions. For the normal Q-Q plot, the standard error is estimated by
s Hat ( z i ) = s hat over f ( z i ) sqrt p i ( 1 - p i ) over n (2)where f ( z i ) is the normal density function, and s hat estimates the slope of the comparison line, the standard deviation of x . Chambers et al. (1983) stress using a robust estimate of s hat , such as the measure based on the hinge-spread, so that the standard errors are not distorted by outliers or straggling tails.
To produce your own plot with SAS/GRAPH software, you can sort the data and calculate the normal quantiles in a DATA step with the PROBIT function as shown below. This is the most general method, since you can easily adjust the calculation of p i , use a different F -1 function for Q-Q plots based on other distributions, and add calculations for standard errors, as described above.
**- Normal Q-Q plot using PROBIT --; proc sort data=nations; by imr; proc univariate noprint data=nations; var imr; output out=stats n=nobs median=median qrange=hspr mean=mean std=std; data quantile; set nations; if _n_=1 then set stats; if imr ^= . ; *-- remove missing data!; i + 1; p = (i - .5) / nobs; z = probit(p); normal = mean + z * std; robust = median + z * hspr/1.349;
proc gplot data=quantile; plot imr * z = 1 normal* z = 2 robust* z = 3 / overlay frame; symbol1 i=none h=1.1 v=- color=black; symbol2 i=join l=20 v=none color=blue; symbol3 i=join l=1 v=none color=red;
%macro nqplot ( data=_LAST_, /* input data set */ var=x, /* variable to be plotted */ out=nqplot, /* output data set */ mu=MEDIAN, /* est of mean of normal distribution: */ /* MEAN, MEDIAN or literal value */ s=HSPR, /* est of std deviation of normal: */ /* STD, HSPR, or literal value */ stderr=YES, /* plot std errors around curves? */ detrend=YES, /* plot detrended version? */ lh=1.5, /* height for axis labels */ anno=, /* name of input annotate data set */ name=NQPLOT, /* name of graphic catalog entries */ gout=); /* name of graphic catalog */Only the name of the data set and the variable to be plotted need be supplied. The normal Q-Q plot of the IMR data would be produced with this statement:
%nqplot(data=nations,var=imr);