One idea is to add an elliptical confidence region ( data ellipse ) around the mean to highlight the joint relationship. With data for several groups this helps show the extent to which the relationship is the same for all groups. (Better than individual regression lines.)
( x - x bar )T S size +2 -1 ( x - x bar ) <= chi 2 2 ( 1 - a ) ' ' . (3)where x bar = ( x bar , y bar ) are the sample means, S ( 2 × 2 ) is the covariance matrix of (x , y ) , and chi 2 2 ( 1 - a ) is the (1 - a ) percentage point of the chi² distribution with two degrees of freedom.
%macro CONTOUR( data=_LAST_, /* input data set */ x=, /* X variable */ y=, /* Y variable */ group=, /* Group variable */ pvalue= .5, /* probability value */ out=CONTOUR, /* output data set */ colors=RED GREEN BLUE BLACK);For example, a set of 50% data ellipses for Weight vs. Price of automobiles grouped by region of origin is produced by these statements:
%include contour; %contour(data=auto, x=price, y=weight, group=origin);
One set of ideas involves dividing the data into roughly equal-sized strips according to their ordered x-values. For each strip, we can plot:
The plot below shows the lottery numbers assigned to each birthday (Day of Year) in the 1970 US Draft Lottery carried out by the Selective Service to determine the order in which elligible men would be drafted into the US Army. It was supposed to be completely random, and looks that way from the scatter plot
SCATTER DRAFT 366 - ** * * 2 * ** * L - *** * * **2 ** 2 2* * * O - *** * * 2 * ** * * *2 * * * T -*** * * * 2* * ** * * * 2* T - * 2*2** * * ** * ** ** ** E 275 o * 2 **** * * 2 ** * * ** * R -* *2* * *2 * * * * * * * ** Y - ** * * * * * *******2 * * -*** **** * **2 ** * * * * * N - 2 * * * ** * * * * 2* * 2 * U 184 * * ** * ** * * 2 *** ** ** M -* * 2 * * * ** * * * 2** 2 B - * * 2 * ** * * **2 ** * 3 E - * ** * 2 * * * * 3 ** 2 * R - * * * ** *** *2* * * * * * * 92 + * **** * * * ** * * * * *2* - * * 2 * * * ** * ** * *2* ** - *2 * * * ** * 2 2 * * * * * - * * 2 *** * * * * * 2 * *2 - * * * ** * **** * * * * * ** * 1 - * * 2* ** * * * -------------------+------------------+------------------- 1 123 244 366 DAY OF YEARThe next slide shows the trace line plot of the medians (M), upper (H), and lower (L) hinges of the lottery numbers against day of the year. It is clear, even with fairly coarse resolution, that the lottery numbers decrease towards the end of the year, indicating a defect in the method used to select these "random numbers".
342 + H H L - H HH HHH H O - HH HHH H HHH HH H H T -H H H HHHHHHHHHH HHH HHHHH HH T -H HHHHH HHHH HH H HHHHH HH H H E 268 + HHHHH H H HH H HH HHH H R - HHHH M HHH H HHHHHH HHHH H Y - HH MMMMMMMM MM HH H HHHH HH - MMM MMMMMMMM MM MMM H H H H H N -M MMM MMMM MMMMMM H M HHHHHHHHH U 194 +MM M L MM MM MMM MMMMM HH H M - MMMM LL MMMMM MM MMMMMM MM H H B - M L L M MMMMM MMMM M MM E - L LLLL MM MM L MMM R -L LLLL LLLLL L LLLLL MMMMMMM 121 + L LLLL L L L LL LL LL LL MM M - L L LLLLL LLLL L LL LL LLL LL - LLLL LLLLL LLLLL LLL LL LLL LLLLLL LL - LL LL LLL LLLLLL LL L LL LL LLLLL LLL - L L LLL LLL L LLLLL L 47 + L LL LL L +-------------------+-------------------+-------------------+ 21 129 237 346 DAY OF YEAR
Another generally useful technique is robust, locally weighted regression smoothing , often called lowess . The procedure finds a smoothed fitted value, y hat i , for each x i by fitting a weighted regression to the points in the neighborhood of x i . The points closest to x i receive the greatest weight. This is the "locally weighted regression" part of the procedure, and the weights are called neighborhood weights .
The robust part works as follows: Once the fitted values, y hat i , have been found, the residuals, r i = y sub i - y hat i are used to determine a new set of weights ( robustness weights ) so that points which have large residuals are down-weighted, and the locally weighted regression is repeated.
In practice, lowess depends on the choice of a smoothing parameter, f , 0 < f <= 1 , the fraction of the data points to be considered in the calculation of y hat i . Choosing f = .5 means that only the r = [ .5 n ] points closest to x i have non-zero weights. Increasing the value of f makes the fitted curve smoother, decreasing f lets the curve follow the data more closely.
h i = max | x i - x ij |
w ij = W left "[" x ij - x i over h i right "]"where W ( bullet ) is the tricube weight function,
W ( z ) = left "{" lpile 0 above ( 1 - | z | 3 ) 3 lpile for | z | >= 1 above for | z | < 1 right " "
y ij = a i + b i x ij + e ij
delta i = B left "[" e i over 6 Mdn ( | e | ) right "]"where B ( bullet ) is the bisquare weight function,
B ( z ) = left "{" lpile 0 above ( 1 - z 2 ) 2 lpile for | z | >= 1 above for | z | < 1 right " "
The figure below shows the lowess smoothed fit for the draft lottery data. The steady decline in Lottery Number over the last two-thirds of the year is now quite clear. (The plot is over-dramatic, and perhaps misleading, since the vertical scale covers a smaller range than the plots of the raw data and trace lines.)
SMOOTH ~ DRAFT LOWESS }366 SCATTER SMOOTH 213 - 26666663 L o *666666674 366 O -46665 66 T - *65 T - *66 E 189 66* R - 563 Y - 3664 - 366663 N - 3666* U 165 + 56* M - 55 B - 26* E - 54 R - 26* 141 + 54 - 26 - 62 - 44 - 26 117 - 4 -------------------+------------------+------------------- 1 123 244 366 DAY OF YEAR
The LOWESS macro reads the input data set into PROC IML, calculates the smoothed y hat values, and creates an output data set containing the original and smoothed data. The macro takes the following parameters:
/*---------------------------------------------------------------* * LOWESS SAS Locally weighted robust scatterplot smoothing * *---------------------------------------------------------------* %macro LOWESS( data=_LAST_, /* name of input data set */ out=SMOOTH, /* name of output data set */ x = X, /* name of independent variable */ y = Y, /* name of Y variable to be smoothed */ id=, /* optional row ID variable */ f = .50, /* lowess window width */ p = 1, /* 1=linear fit, 2=quadratic */ iter=2, /* total number of iterations */ plot=NO, /* draw the plot? */ gplot=NO, /* draw the plot? */ pplot=NO, /* draw a printer plot? */ symbol=circle, htext=1.5, hsym=1.5, name=LOWESS); /* name for graphic catalog entry */e.g.,
%include lowess; %lowess( data=auto, x=weight, y=mpg, f=.3, gplot=YES );
y = f ( x ) + residual = fit + residual
All other things equal, we prefer a "simple" f ( x ) like a linear function, to a more complex one. If a scatterplot of y against x appears substantially non-linear, there are two choices:
y'= a + b x'+ residual
Once again, the ladder of powers provides a scheme for understanding the effect of various power transformations. Visual examination of the scatterplot of the raw data, possiblly enhanced with a lowess smoothed curve, can suggest the direction to move on the ladder for transforming x , or y , or both.
As a preliminary example, consider the simple case where
y = size -2 1 over 5 x 2for x = 0, 1, 2, 3, 4, 5. Here, the relation can be straightened by transforming x :
xT= x 2 --> y = size -2 1 over 5 xTor y :
yT= sqrt y --> y = sqrt size -2 1 over 5 xTThe data and the transformed values are:
X Y X} Y.9 0 0 0 0 1 0.2 1 0.4472 2 0.8 4 0.8944 3 1.8 9 1.342 4 3.2 16 1.789 5 5 25 2.236These values are plotted below:
SCAT X, Y SCAT XP, Y 5- f 5- f - - - - - - - f - f - - - f - f - - - f - f - - 0f----f-------------------- 0ff------------------------ 0 X 5 0 X' 25 SCAT X , YP 2.5- - f - - f - - f - f - - f - 0 f------------------------- 0 X 5
The arrow points in the direction to move along the ladder of powers for x or for y . That is, if the arrow points toward smaller values of x (or y ), move down the ladder of powers, toward sqrt x or log ( x ) (or sqrt y , log ( y ) ). The greater the curvature, the further from 1 (raw data) should be the power.
x --> x p
y --> y q
Since the power transformations are order-preserving, we can try different power transformations on a few well chosen points, to see what effect they would have on all the data.
Tukey (1977) suggests finding three summary points which are the median ( x , y ) values in three equal-sized strips based on the ordered x -values. Use subscripts L , M , H to denote the ( x , y ) values for the Low, Middle, and High strip.
If the relation is linear, these three points will fall on a line, or equivalently, the slope of the line from (x , y ) sub L to (x , y ) M will be the same as the slope of the line from (x , y ) M to (x , y ) sub H .
Y- - - Y- - - H - - - - - - - H - - - - - - M - - - M - - - L - - - L - - - - - - - - 0+------------------------- 0+------------------------- 0 X 0 XOn the other hand, if the relation is not linear, the lines connecting the pairs of points will have different slopes.
This division is modified when equal x -values straddle the dividing line: they are all kept in the same third. Also, neither end portion should cover more than half the range of the x -values, which can happen if the distribution of x is highly skewed.
This section uses data on literacy rates and gross national product for 22 nations to show the details of the calculations. (The plot indicates the relation is highly nonlinear. The arrow rule indicates we should transform x to lower powers or y to higher powers.)
NEPAL 45 5 BURMA 57 47.5 UGANDA 64 27.5 S. VIETNAM 76 17.5 SCAT GNP THAILAND 96 68 100- f f f f HAITI 105 10.5 - f INDONESIA 131 17.5 - f S. KOREA 144 77 L - f GHANA 172 22.5 I - f PERU 179 47.5 T - f f EL SALVADOR 219 39.4 E - f BR.GUIANA 235 74 R - f HONG KONG 272 57.5 A -f f f PANAMA 329 65.7 C - f LEBANON 362 47.5 Y - SINGAPORE 400 50 -f ARGENTINA 490 86.4 -fff ICELAND 572 98.5 - f CZECHOSLOVAK 680 97.5 -f FRANCE 943 96.4 0------------------------------------- NEW ZEALAND 1310 98.5 0 GNP 2000 CANADA 1947 97.5Since the distribution of GNP is so skewed, the range rule suggests allocating only two points to the upper third. The summary points are found to be:
2 RSTAB GNP SUMPTS: SPLIT IS 7 13 2 SUMMARY POINTS L 76 17.5 M 329 65.7 H 1629 98
r = m HM over m ML = ( y H - y M ) / ( x H - x M ) over ( y M - y L ) / ( x M - x L ) (4)As shown earlier, a linear relation implies r approx 1 (or log r approx 0 ).
For the GNP data, this ratio is quite far from 1:
1 1 SLOPES S SUMMARY HALF SLOPE POINTS SLOPES RATIO [X*1] [Y*1] 1629 98 0.02485 329 65.7 0.1304 0.1905 76 17.5
Since we are using medians, and the ladder of powers is order-preserving (as long as all values are positive), we can transform the summary points to various powers, and the ratio of slopes for the transformed summary points will tell what effect that transformation will have on the whole data set.
If we transform x --> x p , and y --> y q , then the ratio of slopes for that pair of transformations is
r < (p,q) = m HM < (p,q) over m ML < (p,q) = ( y H q - y M q ) / ( x H p - x M p ) over ( y M q - y L q ) / ( x M p - x L p ) (5)Since we know we must go down the scale of powers for x , we try log x , then - 1 / sqrt x . The first is not far enough, but the second is too far. p = - 1/3 is an in-between power, and seems to work quite well.
0 1 SLOPES S SUMMARY HALF SLOPE POINTS SLOPES RATIO [X*0] [Y*1] 3.212 98 46.49 2.517 65.7 0.6138 75.74 1.881 17.5 H.5 1 SLOPES S SUMMARY HALF SLOPE POINTS SLOPES RATIO [X*H0.5] [Y*1] H0.02478 98 1064 H0.05513 65.7 1.315 809 H0.1147 17.5 H.33 1 SLOPES S SUMMARY HALF SLOPE POINTS SLOPES RATIO [X*H0.33] [Y*1] H0.08711 98 533.3 H0.1477 65.7 1.016 524.9 H0.2395 17.5
RATIOS OF SLOPES : YE [H2 ] [H1 ] [ H.5] [LOG ] [ .5] [RAW ] [ 2 ] X$ [H2 ] .778 2.213 3.575 5.590 8.459 12.394 24.385 [H1 ] .175 .499 .806 1.261 1.908 2.796 5.500 [ H.5] .083 .235 .379 .593 .898 1.315 2.588 [LOG ] .039 .110 .177 .277 .419 .614 1.208 [ .5] .018 .051 .082 .128 .194 .284 .559 [RAW ] .008 .023 .038 .059 .089 .130 .257 [ 2 ] .002 .005 .008 .012 .018 .027 .053
The figure below shows the data and scatterplot when GNP is transformed to the reciprocal cube root.
NEG. CUBE ROOT OF X IS CLOSEST TO BEING STRAIGHT GNP3 ~ GNP GNP3[;1]~GNP[;1] POWER H.333 GNP3 TRANSFORMED X H0.2815 5 H0.2602 47.5 H0.2503 27.5 SCAT GNP3 H0.2364 17.5 100 ff f f - H0.2187 68 f - H0.2123 10.5 f - H0.1972 17.5 L f - H0.1911 77 I f - H0.1801 22.5 T f f - H0.1777 47.5 E f - H0.1662 39.4 R f - H0.1623 74 A f f f - H0.1546 57.5 C f - H0.1451 65.7 Y - H0.1406 47.5 f - H0.136 50 f f f - H0.1271 86.4 f - H0.1207 98.5 f - H0.114 97.5 0------------------------------------ H0.1022 96.4 H0.3 GNP*H1 3 H0.05 H0.09161 98.5 H0.08029 97.5The plot shows no signs of nonlinearity. However, some find "reciprocal cube root" of GNP hard to interpret, and might prefer a slightly curved relationship with log GNP .
The plot below shows data generated from a mixture of three bivariate normal distributions. It is hard to see that there are three regions of high density.
PLOT OF Y*X LEGEND: A = 1 OBS, B = 2 OBS, ETC. Y | AA A | A BA 50 + A A A A | B A ABACD ABB CAB | A AACBCBAA BA | AA B BAAACACA B | A AA B ABAABA A A A 40 + AA A B D | A AA B | A CA AA | A B BBC BBA AB A | A B BAB E AAAA AA A 30 + A A ADAAC A A A A | A A AB A BA BAAAA | A A AB A A A A A | A A AAA AA | A A A AAA 20 + A AABAAA A A A | AA A BACBB A | A ABA ABB DAA BAA B A | AAA AAB CACAAB A | B AA AF AAB B A 10 + AA AB B | A B | B | | 0 + --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-- -3 0 3 6 9 12 15 18 21 24 27 30 33 36 39 X
To jitter, add a uniformly distributed random quantity, scaled so that it breaks up the overlap, but does not corrupt the data. Let u i = u [ -1, 1 ] be uniformly distributed from -1 to 1. Then, to jitter x , calculate
xTi = x i + s u iwhere s is the scale factor, which might be some small fraction (e.g., .02) of the range of the data, or 1/2 the rounding interval if the data have been rounded. If the y variable is also discrete, the same process can be applied to jitter y --> yT.
Then, in the scatter plot of y (or yT) vs. xT it should be easier to see the density of points in different regions.