The next slide shows a scatterplot matrix of four measurements on three species of Iris flowers: the length and width of the sepal, and the length and width of the petal. Iris species is reflected in the color and shape of the plotting symbol (which unfortunately do not remain visually distinct when photoreduced).
%macro SCATMAT(
   data =_LAST_,          /* data set to be plotted           */
   var  =_NUMERIC_,       /* variables to be plotted - can be */
                          /* a list or X1-X4 or VARA--VARB    */
   group=,                /* grouping variable (plot symbol)  */
   symbols=%str(- + : $ = X _ Y),
   colors=BLACK RED GREEN BLUE BROWN YELLOW ORANGE PURPLE,
   gcat=GSEG);            /* graphic catalog for plot matrix  */

The scatterplot matrix for the Iris data is produced by these statements:
%include scatmat;
%scatmat(data=iris,
         var=SEPALLEN SEPALWID PETALLEN PETALWID,
         group=spec_no);
Star plots differ from glyph plots in that all variables are used to construct the plotted star figure; there is no separation into foreground and background variables. Instead, the star-shaped figures are usually arranged in a rectangular array on the page.
The main use of star plots is in detecting groups of observations that cluster together; such groups appear as stars with similar shape.
If the variables are measured on different scales, there are two ways the data are typically displayed, neither very satisfactory:
If the data values are denoted y_ij, we first transform each variable to standard scores,

   y*_ij = (y_ij - ybar_j) / s_j

where ybar_j is the mean of variable j, and s_j is its standard deviation. This allows us to equate variables with very different ranges, such as the crime variables.
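For readers following along outside SAS, the standardization can be sketched in a few lines of Python with NumPy (the data values here are made up for illustration, not the crime data):

```python
import numpy as np

# Toy data: 5 cases by 3 variables on very different scales
y = np.array([[1.0, 100.0, 0.01],
              [2.0, 150.0, 0.03],
              [3.0, 120.0, 0.02],
              [4.0, 180.0, 0.05],
              [5.0, 130.0, 0.04]])

# y*_ij = (y_ij - ybar_j) / s_j, with s_j the sample standard deviation
ystar = (y - y.mean(axis=0)) / y.std(axis=0, ddof=1)
```

After this transformation every column has mean 0 and standard deviation 1, so no single variable dominates the display.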
Say the variables are positioned on the horizontal scale at positions x_j, the values we wish to find. Then if each case has a linear profile, that means that

   y*_ij = a_i + b_i x_j

with slope b_i and intercept a_i. Hartigan (1975) shows that the solution can be expressed in terms of the first two eigenvectors of the correlation matrix C.
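Once the positions x_j are fixed, the slope and intercept for each case follow from ordinary least squares. A minimal NumPy sketch (the positions and scores below are hypothetical, chosen to be exactly linear; this illustrates the profile model, not Hartigan's eigenvector solution itself):

```python
import numpy as np

# Hypothetical variable positions x_j (in practice these come from the
# first two eigenvectors of the correlation matrix, per Hartigan 1975)
x = np.array([-1.0, 0.0, 2.0])

# Standardized scores for two cases that are exactly linear in x
ystar = np.array([0.5 + 1.0 * x,     # case 1: a=0.5, b=1.0
                  2.0 - 0.5 * x])    # case 2: a=2.0, b=-0.5

# Fit y*_ij = a_i + b_i x_j for each case by least squares
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, ystar.T, rcond=None)
a, b = coef[0], coef[1]   # intercepts and slopes, one per case
```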
The linear profiles display of the crimes data is shown in the next slide. Note that property crimes are positioned on the left, crimes of violence on the right. Cities with a positive slope (Dallas, Chicago) have a greater prevalence of violent crime than property crime; for cities with a negative slope (Los Angeles, Denver, Honolulu) the reverse is true.
Overall crime rate for each city is represented by the intercept of each line: cities at the top (Los Angeles, New York, New Orleans) have the most crimes and cities at the bottom the least.
The biplot is based on the idea that any data matrix, Y (n x p), can be represented approximately as the product of a two-column matrix, A (n x 2), and a two-row matrix, B' (2 x p),

   Y = A B'                                      (13)

so that any score y_ij of Y is approximately the inner product of row i of A and column j of B',

   y_ij = a_i' b_j = a_i1 b_1j + a_i2 b_2j       (14)

Thus, the rows of A represent the observations in a two-dimensional space, and the columns of B' represent the variables in the same space.
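The rank-2 approximation behind the biplot can be sketched with a truncated singular value decomposition. Here is a NumPy illustration using the symmetric (SYM) split of the singular values between rows and columns (random data, not the tribes example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Column-centered data matrix Y (n x p); centering is done before the SVD
Y = rng.normal(size=(10, 4))
Y = Y - Y.mean(axis=0)

# Truncated SVD: keep the two largest singular values
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
# Symmetric (SYM) split of the singular values between rows and columns;
# the GH and JK factorizations assign them all to one side instead
A  = U[:, :2] * np.sqrt(s[:2])          # n x 2 observation coordinates
Bt = np.sqrt(s[:2])[:, None] * Vt[:2]   # 2 x p variable coordinates

Y2 = A @ Bt   # best rank-2 approximation of Y (Eckart-Young)
```

The residual Y - Y2 carries only the discarded singular values, which is the sense in which the two biplot dimensions account for the greatest possible variance.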
The approximation used in the biplot is like that in principal components analysis : the two biplot dimensions account for the greatest possible variance of the original data matrix. In the biplot display,
   OBS  TRIBE     SCHOOL  POVERTY  ECONOMIC
    1   SHOSHONE   10.3     29.0      9.08
    2   APACHE      8.9     46.8     10.02
    3   SIOUX      10.2     46.3     10.75
    4   NAVAJOS     5.4     60.2      9.26
    5   HOPIS      11.3     44.7     11.25

For the biplot, the data is usually first expressed in terms of deviations from the mean of each variable.
Deviations from Means

   Y          SCHOOL  POVERTY  ECONOMIC
   SHOSHONE    1.08    -16.4    -0.992
   APACHE     -0.32      1.4    -0.052
   SIOUX       0.98      0.9     0.678
   NAVAJOS    -3.82     14.8    -0.812
   HOPIS       2.08     -0.7     1.178

The row and column dimensions are found from the singular value decomposition of this matrix. (In this case, the fit is perfect, since there are only two linearly independent variables.)
Biplot coordinates

                    Dim1     Dim2
   OBS  SHOSHONE  -3.4579  -0.9040
   OBS  APACHE     0.3024  -0.0619
   OBS  SIOUX      0.1567   0.6842
   OBS  NAVAJOS    3.2110  -0.9216
   OBS  HOPIS     -0.2122   1.2034
   VAR  SCHOOL    -0.7305   1.5996
   VAR  POVERTY    4.6791   0.2435
   VAR  ECONOMIC   0.0296   0.9841

Algebraically, this expresses each score as the product of the observation row and the variable row:
   Shoshone-school = (-3.46)(-0.73) + (-0.904)(1.60) = 1.08

Geometrically, this corresponds to dropping a perpendicular line from the Shoshone point to the School vector in the biplot.
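The arithmetic can be checked directly from the coordinate table; a small Python sketch:

```python
import numpy as np

# Coordinates from the biplot coordinate table
a_shoshone = np.array([-3.4579, -0.9040])   # row point for SHOSHONE
b_school   = np.array([-0.7305,  1.5996])   # column point for SCHOOL

# y_ij = a_i' b_j reproduces the mean-deviation score of 1.08
score = a_shoshone @ b_school
```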
[Printer biplot of the tribes data: Dimension 1 (horizontal, -4 to 6) against Dimension 2 (vertical, -1.5 to 2), showing the observation points SHOSHONE, APACHE, SIOUX, NAVAJOS, and HOPIS and the variable points school, poverty, and economic.]
As a substantive example of the biplot technique, we consider data on the rates of various crimes (per 100,000 population) in the U.S. states.
The horizontal dimension in the plot is interpreted as overall crime rate. The states at the right (Nevada, California, NY) are high in crime, those at the left (North Dakota, South Dakota) are low. The vertical dimension is a contrast between property crime vs. violent crime. The angles between the variable vectors represent the correlations of the variables.
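The claim about angles can be illustrated numerically. In the GH factorization (A = U, B' = diag(s)V'), the inner products of the variable vectors reproduce Y'Y exactly, so when all dimensions are retained the cosines of the angles equal the correlations; in a 2-D biplot this holds only approximately. A NumPy sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(20, 3))
Y = Y - Y.mean(axis=0)                  # column-centered

# GH factorization: A = U, B' = diag(s) V', so b_j . b_k = (Y'Y)_jk
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
B = (np.diag(s) @ Vt).T                 # p x r variable coordinates

def cosine(u, v):
    """Cosine of the angle between two variable vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

r_01 = np.corrcoef(Y[:, 0], Y[:, 1])[0, 1]   # correlation of vars 1, 2
c_01 = cosine(B[0], B[1])                    # angle between their vectors
```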
%macro BIPLOT(
   data=_LAST_,    /* Data set for biplot      */
   var =_NUMERIC_, /* Variables for biplot     */
   id  =ID,        /* Observation ID variable  */
   dim =2,         /* Number of dimensions     */
   factype=SYM,    /* factor type: GH|SYM|JK   */
   scale=1,        /* Scale factor for vars    */
   out =BIPLOT,    /* Biplot coordinates       */
   anno=BIANNO,    /* Annotate labels          */
   std=MEAN,       /* Standardize: NO|MEAN|STD */
   pplot=YES);     /* Produce printer plot?    */

The number of biplot dimensions is specified by the DIM= parameter and the type of biplot factorization by the FACTYPE= value. The BIPLOT macro constructs two output data sets, identified by the parameters OUT= and ANNO=. The macro will produce a printed plot (if PPLOT=YES), but leaves it to the user to construct a PROC GPLOT plot, since the scaling of the axes should be specified to achieve an appropriate geometry in the plot.
   d_i^2 = (x_i - xbar)' S^{-1} (x_i - xbar)        (15)

where S is the p x p sample variance-covariance matrix. This is the multivariate analog of the square of the standard score for a single variable,

   z_i^2 = (x_i - xbar)^2 / s^2
With p variables, d_i^2 is distributed approximately as chi² with p degrees of freedom for large samples from the multivariate normal distribution. Therefore, a Q-Q plot of the ordered distance values, d²_(i), against the corresponding quantiles of the chi²(p) distribution should yield a straight line through the origin for multivariate normal data.
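A NumPy sketch of the distance computation (random data; the chi-square quantiles for the Q-Q plot would come from scipy.stats.chi2.ppf, noted in a comment since SciPy is not needed for the distances themselves):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)             # p x p sample covariance (divisor n-1)
Sinv = np.linalg.inv(S)

dev = X - xbar
# d_i^2 = (x_i - xbar)' S^{-1} (x_i - xbar) for every case at once
d2 = np.einsum('ij,jk,ik->i', dev, Sinv, dev)

# Q-Q plot pairs: sorted d2 against chi-square(p) quantiles, e.g.
# scipy.stats.chi2.ppf((np.arange(1, n + 1) - 0.5) / n, p)
```

A useful check: with the sample mean and covariance, the d² values always sum to (n - 1)p, which constrains how extreme any one distance can be.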
Unfortunately, like all classical (e.g., least squares) techniques, the chi² plot for multivariate normality is not resistant to the effects of outliers. A few discrepant observations not only affect the mean vector, but also inflate the variance-covariance matrix. Thus, the effect of the few wild observations is spread through all the d² values. Moreover, the inflation tends to decrease the range of the d² values, making it harder to detect extreme ones.
One reasonably general solution is to use a multivariate trimming procedure (Gnanadesikan & Kettenring, 1972; Gnanadesikan, 1977), to calculate squared distances which are not affected by potential outliers.
This is an iterative process where, on each iteration, some proportion of the observations with the largest d² values are temporarily set aside, and the trimmed mean, xbar_(-), and trimmed variance-covariance matrix, S_(-), are computed from the remaining observations. Then new d² values are computed from

   d_i^2 = (x_i - xbar_(-))' S_(-)^{-1} (x_i - xbar_(-))     (16)

The effect of trimming is that observations with large distances do not contribute to the calculations for the remaining observations.
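The trimming iteration can be sketched as follows (a simplified Python version of the idea, not the OUTLIER macro itself; the trimming fraction and the planted outlier are arbitrary choices for illustration):

```python
import numpy as np

def trimmed_d2(X, passes=2, trim=0.1):
    """Multivariate trimming: on each pass, set aside the fraction `trim`
    of cases with the largest d^2 and recompute mean and covariance
    from the remaining cases, per equation (16)."""
    keep = np.ones(len(X), dtype=bool)
    for _ in range(passes):
        xbar = X[keep].mean(axis=0)
        S = np.cov(X[keep], rowvar=False)
        dev = X - xbar
        d2 = np.einsum('ij,jk,ik->i', dev, np.linalg.inv(S), dev)
        keep = d2 <= np.quantile(d2, 1 - trim)
    return d2

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
X[0] = [8.0, -8.0]            # one planted outlier among standard normals
d2 = trimmed_d2(X)
```

Because the planted outlier no longer inflates the covariance matrix after the first pass, its distance stands out far more sharply than in a single-pass calculation.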
Observations trimmed in calculating Mahalanobis distance

   OBS  MODEL             PASS  CASE     DSQ        PROB
    1   AMC PACER           1     2    22.7827   0.0189645
    2   CAD. SEVILLE        1    14    23.8780   0.0132575
    3   CHEV. CHEVETTE      1    15    23.5344   0.0148453
    4   VW RABBIT DIESEL    1    66    25.3503   0.0080987
    5   VW DASHER           1    68    36.3782   0.0001464
    6   AMC PACER           2     2    34.4366   0.0003068
    7   CAD. SEVILLE        2    14    42.1712   0.0000151
    8   CHEV. CHEVETTE      2    15    36.7623   0.0001263
    9   PLYM. CHAMP         2    52    20.9623   0.0337638
   10   VW RABBIT DIESEL    2    66    44.2961   0.0000064
   11   VW DASHER           2    68    78.5944   0.0000000
%macro OUTLIER(
   data=_LAST_,    /* Data set to analyze          */
   var=_NUMERIC_,  /* input variables              */
   id=,            /* ID variable for observations */
   out=CHIPLOT,    /* Output dataset for plotting  */
   pvalue=.1,      /* Prob < pvalue -> weight=0    */
   passes=2,       /* Number of passes             */
   print=YES);     /* Print OUT= data set?         */

By default, two passes (PASSES=) are made through the iterative procedure. A printer plot of DSQ * EXPECTED is produced automatically, and you can use PROC GPLOT afterward with the OUT= data set to give a high-resolution plot with labelled outliers, as shown in the example below.
The outlier procedure is run on the auto data with these statements:
%include OUTLIER;                 * get the macro;
%include AUTO;                    * get the data;
data AUTO;
   set AUTO;
   repair = mean(of REP77 REP78);
   if repair=. then delete;
%OUTLIER(data=AUTO,
         var=PRICE MPG REPAIR HROOM RSEAT TRUNK WEIGHT
             LENGTH TURN DISPLA GRATIO,
         pvalue=.05, id=MODEL, out=CHIPLOT, print=NO);

The plot is produced from the OUT= data set as shown below:
data labels;
   set chiplot;
   if prob < .05;                 * only label outliers;
   xsys='2'; ysys='2';
   y = dsq;
   function = 'LABEL';
   text = model;
   size = 1.4;
proc gplot data=chiplot;
   plot dsq      * expected = 1
        expected * expected = 2
        / overlay anno=labels vaxis=axis1 haxis=axis2;
   symbol1 f=special v=K h=1.5 i=none c=black;
   symbol2 v=none i=join c=red;
   label dsq     ='Squared Distance'
         expected='Chi-square quantile';
   axis1 label=(a=90 r=0 h=1.5 f=duplex) value=(h=1.3);
   axis2 order=(0 to 30 by 10) label=(h=1.5 f=duplex) value=(h=1.3);
   title h=1.5 'Outlier plot for Auto data';
run;
For p variables, x' = (x_1, ..., x_p), a set of weights, c = (c_1, ..., c_p), is found so as to maximize the F statistic for the derived canonical variable, CAN1 = Σ c_i x_i. That is, the canonical weights indicate the linear combination of the variables which best discriminates among groups, in the sense of having the greatest univariate F. When there are more than two groups, it is possible to find additional linear combinations, CAN2, CAN3, ..., CANs (s is the smaller of p and g - 1) which
Canonical discriminant analysis is performed with PROC CANDISC. Since there are g = 3 groups, s = g - 1 = 2 dimensions are sufficient to display all possible discrimination among groups, and the printed output indicates that both dimensions are highly significant.
Canonical Discriminant Analysis

                       Adjusted     Approx      Squared
         Canonical    Canonical    Standard    Canonical
        Correlation  Correlation    Error     Correlation
    1    0.909389     0.906426     0.014418    0.826988
    2    0.625998     0.618252     0.050677    0.391874

   Test of H0: The canonical correlations in the current row
               and all that follow are zero

        Likelihood Ratio   Approx F   Num DF   Den DF   Pr > F
    1      0.10521315       57.4891     10      276     0.0001
    2      0.60812612       22.3928      4      139     0.0001
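For intuition about what the canonical weights maximize, consider the special case of g = 2 groups, where the single canonical variate has a closed form: c proportional to W^{-1}(xbar_1 - xbar_2), with W the pooled within-group covariance matrix. By construction its F statistic is at least that of any single variable. A NumPy sketch with simulated data (not the diabetes example or PROC CANDISC output):

```python
import numpy as np

rng = np.random.default_rng(4)
# Two groups, p = 2 variables (so s = 1 canonical dimension)
X1 = rng.normal(loc=[0.0, 0.0], size=(30, 2))
X2 = rng.normal(loc=[1.0, 0.5], size=(30, 2))

# Pooled within-group covariance W
W = (np.cov(X1, rowvar=False) * 29 + np.cov(X2, rowvar=False) * 29) / 58
# Canonical weights c ~ W^{-1}(xbar1 - xbar2) maximize the F statistic
c = np.linalg.solve(W, X1.mean(axis=0) - X2.mean(axis=0))

def F_stat(z1, z2):
    """One-way ANOVA F for two groups of a single derived variable."""
    n1, n2 = len(z1), len(z2)
    zbar = np.concatenate([z1, z2]).mean()
    between = n1 * (z1.mean() - zbar) ** 2 + n2 * (z2.mean() - zbar) ** 2
    within = ((z1 - z1.mean()) ** 2).sum() + ((z2 - z2.mean()) ** 2).sum()
    return between / (within / (n1 + n2 - 2))

F_can = F_stat(X1 @ c, X2 @ c)                       # F for CAN1
F_raw = max(F_stat(X1[:, j], X2[:, j]) for j in range(2))  # best single variable
```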
The canonical discriminant plot below shows that the three groups are ordered on CAN1 from Normal to Overt Diabetic with Chemical Diabetics in the middle. The two glucose measures and the measure SSPG of insulin resistance are most influential in separating the groups along the first canonical dimension. The Chemical Diabetics are apparently higher than the other groups on these three variables. Among the remaining variables, INSTEST is most influential, and contributes mainly to the second canonical dimension. The wide separation of the 99% confidence circles portrays the strength of group differences on the canonical dimensions and the variable vectors help to interpret the meaning of these dimensions.
title 'Canonical Discriminant analysis plot: Diabetes data';
%include canplot;    *-- unnecessary in PsychLab;
%include diabetes;   *-- %include data(diabetes) in PsychLab;
%canplot( data=diabetes, class=group,
          var=relwt glufast glutest instest sspg,
          scale=3.5);

The SCALE parameter sets the scaling of the variable vectors relative to the points in the plot.