$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon \qquad (6) $$

In matrix form, for a sample of n observations, the model is
$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad (7) $$
In addition to the assumption that model (6) correctly specifies the relation between $y$ and the $x_i$, the usual statistical inference requires the following assumptions on the residual errors: the $\varepsilon_i$ are independent, normally distributed, with mean 0 and constant variance $\sigma_e^2$.
The usual least squares estimators can be expressed in matrix notation as:
$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} $$
$$ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} = \mathbf{H}\mathbf{y} $$

where $\mathbf{H}$ is the $(n \times n)$ matrix called the hat matrix (it puts the "hat" on $\mathbf{y}$).
$$ \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\,\mathbf{y} $$
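These formulas can be verified numerically. Below is a minimal PROC IML sketch, assuming a dataset duncan with variables Income, Educ, and Prestige as in the example later in this section:

   proc iml;
      use duncan;                          /* assumed input dataset              */
      read all var {income educ} into X;   /* predictor columns                  */
      read all var {prestige}    into y;   /* response                           */
      n = nrow(X);
      X = j(n, 1, 1) || X;                 /* prepend a column of 1s (intercept) */
      bhat = inv(X`*X) * X`*y;             /* least squares estimates            */
      H    = X * inv(X`*X) * X`;           /* hat matrix                         */
      yhat = H * y;                        /* fitted values                      */
      e    = y - yhat;                     /* residuals, (I - H) y               */
      print bhat;
   quit;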
Plots of the residuals can help to answer the following questions:
A better way to plot residuals to assess changes in spread is to plot the absolute value of the residual, $|e_i|$, against $\hat{y}_i$ or $x$. This has the effect of folding the plot along a horizontal line at zero. Adding a smoothed (e.g., lowess) curve to the configuration of points helps considerably.
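A sketch of this folded plot, assuming the diagnostics dataset diagnose (with variables RESIDUAL and YHAT) created by the OUTPUT statement of the PROC REG step shown later:

   data absres;
      set diagnose;             /* assumed diagnostics dataset */
      abse = abs(residual);     /* fold the plot at zero       */
   run;

   proc plot data=absres;
      plot abse * yhat;         /* |e[i]| vs. fitted values    */
   run;

A lowess curve would then be added to these points in a separate smoothing step.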
One idea is to make a diagnostic plot of $\log_{10}|e_i^*|$ against $\log_{10}(x_i)$. If the points in this plot are approximately linear with slope $b$, it can be shown that a power transformation, $y \rightarrow y^{(1-b)}$, where $y^{(0)}$ is interpreted to mean $\log(y)$, will make the residual variance more nearly equal.
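This diagnostic amounts to fitting a line in the log-log plot. A sketch, again assuming the diagnose dataset with studentized residuals in RSTUDENT and using INCOME as the $x$ variable; the fitted slope is the $b$ of the rule above:

   data spread;
      set diagnose;
      if rstudent ne . & rstudent ne 0 & income > 0;  /* logs need nonzero args */
      logabse = log10(abs(rstudent));                 /* log10 |e*[i]|          */
      logx    = log10(income);                        /* log10 x[i]             */
   run;

   proc reg data=spread;
      model logabse = logx;    /* slope b suggests the transform y -> y**(1-b) */
   run;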
$$ \text{Influence} = \text{Leverage} \times \text{y-Discrepancy} $$

so an observation must have both high leverage and a large residual to be influential.
$$ h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_k (x_k - \bar{x})^2} $$

In general, it can be shown that
$$ 0 \le h_i \le 1 \qquad \text{and} \qquad \sum_{i=1}^{n} h_i = p + 1 $$

so the average value of $h_i$ is $(p+1)/n$. In practice, observations are typically considered high leverage if $h_i > 2(p+1)/n$.
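For example, in the Duncan data analyzed below ($n = 45$ observations, $p = 2$ predictors), this cutoff works out to $2(2+1)/45 \approx .133$, the value used in the leverage plot later in this section.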
$$ \mathrm{RSTUDENT} \equiv e_i^* = \frac{e_i}{s_{(-i)}\sqrt{1 - h_i}} \qquad (9) $$
Two influence measures based on these ideas are commonly used. Both are functions of the product of leverage and y-discrepancy.
$$ \mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{(-i)}}{s_{(-i)}\sqrt{h_i}} = \frac{e_i}{s_{(-i)}} \times \sqrt{\frac{h_i}{1 - h_i}} \qquad (10) $$

Belsley et al. (1980) suggest using $|\mathrm{DFFITS}| > 2\sqrt{(p+1)/n}$ as a criterion for declaring an influential outlier.
$$ \mathrm{COOKD}_i = \frac{\mathbf{d}_i^\top(\mathbf{X}^\top\mathbf{X})\,\mathbf{d}_i}{(p+1)\,s^2} = \frac{e_i^2}{(p+1)\,s^2} \times \frac{h_i}{(1 - h_i)^2} \qquad (11) $$

where $\mathbf{d}_i = \hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(-i)}$ is the change in the coefficient vector when observation $i$ is deleted. Cook's D is approximately $\mathrm{DFFITS}^2/(p+1)$, so a cutoff of $4/n$ might be used to detect influential outliers.
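These cutoffs are easy to apply once the diagnostics have been saved. A sketch, assuming the diagnose dataset created by the OUTPUT statement in the PROC REG step shown later, with $p = 2$ and $n = 45$ as in the Duncan data:

   data influential;
      set diagnose;
      if abs(rstudent) > 2                  /* large studentized residual */
       | hatvalue > 2*(2+1)/45              /* h[i] > 2(p+1)/n            */
       | abs(dffits) > 2*sqrt((2+1)/45)     /* |DFFITS| > 2 sqrt((p+1)/n) */
       | cookd > 4/45;                      /* COOKD > 4/n                */
   run;

   proc print data=influential;
      var case rstudent hatvalue dffits cookd;
   run;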
The results of a regression analysis predicting prestige from income and education are shown below. From these results, it would appear that both income and education contribute significantly to the prediction of prestige.
                Duncan Occupational Prestige Data

   DEP VARIABLE: PRESTIGE  Prestige
   ROOT MSE   13.36903     R-SQUARE   0.8282
   DEP MEAN   47.68889     ADJ R-SQ   0.8200

                        PARAMETER ESTIMATES

                    PARAMETER     STANDARD     T FOR H0:
   VARIABLE   DF    ESTIMATE      ERROR        PARAMETER=0   PROB > |T|
   INTERCEP    1   -6.06466292    4.27194117     -1.420        0.1631
   INCOME      1    0.59873282    0.11966735      5.003        0.0001
   EDUC        1    0.54583391    0.09825264      5.555        0.0001

However, plots of the influence measures for each case indicate that several observations have a great deal of influence on this conclusion. One way to understand the separate effects of discrepancy and leverage is to plot RSTUDENT against HATVALUE.
The plot below identifies individual observations which either have high leverage ($h_i > 2(p+1)/n = .133$) or have large studentized residuals ($|\mathrm{RSTUDENT}| > 2.02$, the critical value of $t(41)$ at $\alpha = .05$).
[Plot: studentized residuals (RSTUDENT) vs. leverage (H LEVERAGE) for the Duncan data. Labeled points include Minister and Contractor (large positive residuals), Reporter and RR Conductor (large negative residuals), and RR Engineer (high leverage).]
   %include inflplot;
   %inflplot(data=duncan, y=Prestige, x=Income Educ, id=Job,
             bsize=14, color=red);
The partial regression plot for variable $x_k$ is defined as follows. Let $\mathbf{X}_{[k]}$ be the matrix of predictors omitting variable $k$,
$$ \mathbf{X}_{[k]} = [\,\mathbf{1}, \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{k-1}, \mathbf{x}_{k+1}, \ldots, \mathbf{x}_p\,] $$

and let $\mathbf{y}_{k*}$ be the part of $\mathbf{y}$ which cannot be predicted from $\mathbf{X}_{[k]}$, that is, the residuals from the regression of $\mathbf{y}$ on $\mathbf{X}_{[k]}$:
$$ \mathbf{y}_{k*} = \mathbf{y} - \hat{\mathbf{y}}_{\mathbf{X}_{[k]}} $$

Similarly, let $\mathbf{x}_{k*}$ be the residuals from the regression of $\mathbf{x}_k$ on $\mathbf{X}_{[k]}$. Then the partial regression residual plot for $x_k$ is the scatterplot of $\mathbf{y}_{k*}$ vs. $\mathbf{x}_{k*}$.
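The partial residuals can also be obtained directly from two auxiliary regressions. A sketch for $x_k = \mathrm{INCOME}$ in the Duncan data (the %PARTIAL macro described later automates this):

   proc reg data=duncan noprint;
      model prestige = educ;        /* y on X[k], omitting income          */
      output out=yres r=ystar;      /* y[k*]: part of y unexplained        */
   run;

   proc reg data=duncan noprint;
      model income = educ;          /* x[k] on X[k]                        */
      output out=xres r=xstar;      /* x[k*]                               */
   run;

   data partres;
      merge yres xres;              /* merged by observation order         */
   run;

   proc plot data=partres;
      plot ystar * xstar;           /* partial regression plot for income  */
   run;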
From these plots, it was decided to eliminate the three influential cases (observations 6, 16, and 27) and repeat the regression analysis. The regression results are shown below.
                Duncan Occupational Prestige Data

   DEP VARIABLE: PRESTIGE  Prestige
   ROOT MSE   11.49227     R-SQUARE   0.8762
   DEP MEAN   46.52381     ADJ R-SQ   0.8699

                        PARAMETER ESTIMATES

                    PARAMETER     STANDARD     T FOR H0:
   VARIABLE   DF    ESTIMATE      ERROR        PARAMETER=0   PROB > |T|
   INTERCEP    1   -6.31736312    3.67962246     -1.717        0.0939
   INCOME      1    0.93066285    0.15374847      6.053        0.0001
   EDUC       1    0.28464102    0.12136228      2.345        0.0242

The prediction of occupational prestige has improved somewhat ($R^2 = .88$), and the relative regression weights for income and education have changed considerably.
The PARTIAL option of PROC REG produces partial regression plots, one for each predictor given in the MODEL statement (plus one for the intercept). In these plots, points are identified by the value of the (character or numeric) variable given in the ID statement.
For the Duncan occupational prestige data, the partial regression plots were produced by the statements,
   proc reg data=duncan;
      model prestige=income educ / influence partial r;
      output out=diagnose p=yhat r=residual rstudent=rstudent
             h=hatvalue cookd=cookd dffits=dffits;
      id case;
   run;

The plot for INCOME in Figure 5 shows that three observations (6, 16, and 27) have great influence on the regression coefficient for income. Without these three observations, the partial effect (slope) of income on prestige would be a good deal greater.
[Figure 5: Partial regression plot for income: partial Prestige residual vs. partial Income residual. Observations 6, 16, and 27 stand well apart from the linear trend of the remaining points.]
   %macro PARTIAL(
      data=_LAST_,     /* name of input data set              */
      yvar=,           /* name of dependent variable          */
      xvar=,           /* list of independent variables       */
      id=,             /* ID variable                         */
      label=INFL,      /* label ALL, NONE, or INFLuential obs */
      out=,            /* output data set: partial residuals  */
      htext=1.5,       /* height of text labels in the plots  */
      plots=&xvar,     /* which partial plots to produce      */
      gout=gseg,       /* name of graphic catalog             */
      name=PARTIAL);   /* name of graphic catalog entries     */

An example is shown in Figure 6. This plot, with all points labelled with CASE number, is produced by the statements,
   %include partial;
   %partial(data=duncan, yvar=prestige, xvar=educ income,
            id=case, label=ALL);
$$ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \ne 0 \\ \log y, & \lambda = 0 \end{cases} \qquad (12) $$

with the general linear model, $y^{(\lambda)} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\lambda$, the power of $y$, is an additional parameter to be estimated.
Box and Cox proposed a maximum likelihood procedure to estimate the power ($\hat{\lambda}$), the regression coefficients ($\hat{\boldsymbol{\beta}}$), and the error variance ($\sigma_e^2$) simultaneously by maximizing the likelihood,
$$ L(\lambda, \boldsymbol{\beta}, \sigma_e^2) = \Pr(\text{data} \mid \text{model}) $$

This gives a statistical test of whether a transformation is necessary as the test of $H_0\colon \lambda = 1$, which can be carried out with a likelihood ratio test. The test statistic is the likelihood ratio $G^2$,
$$ G^2 = -2 \log \left[ \frac{L(\boldsymbol{\beta}, \sigma_e^2 \mid \lambda = 1)}{L(\boldsymbol{\beta}, \sigma_e^2 \mid \lambda = \hat{\lambda})} \right] \;\sim\; \chi^2_{(1)} $$
It turns out that $\log L(\lambda)$ is a decreasing function of $\sqrt{\mathrm{MSE}}$, the root mean square error of the regression using $y^{(\lambda)}$ as the criterion, so maximizing $\log L$ is equivalent to finding the value of $\lambda$ which minimizes the MSE.
Hence, a simple (but computationally intensive) method is to fit the model $y^{(\lambda)} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ for a range of $\lambda$ values (say, $-2$ to $+2$) and find the value $\hat{\lambda}$ which minimizes the MSE. The maximum likelihood method also provides a 95% confidence interval for $\lambda$ as those values for which
$$ \log L(\lambda) > \log L(\hat{\lambda}) - \frac{3.84}{2} $$

where 3.84 is the 95th percentile of the chi-square distribution with 1 df. As an added benefit, one can plot the values of the partial $t$ or $F$ statistic for each regressor vs. $\lambda$ to see how the contribution of each predictor to the model varies with the power.
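As an alternative to the %boxcox macro used below, this grid search can be sketched with PROC TRANSREG, which evaluates the Box-Cox likelihood over a list of powers (a sketch, assuming the auto dataset of the next example with variables MPG, Weight, Displa, and Gratio):

   proc transreg data=auto;
      model boxcox(mpg / lambda=-2 to 2 by 0.25)   /* grid of candidate powers */
            = identity(weight displa gratio);      /* untransformed predictors */
   run;

TRANSREG reports the likelihood at each value of $\lambda$, which can be compared with the confidence interval described above.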
For the auto data, the previous plots were constructed with the following statements:
   %include data(auto);
   %include boxcox;     *-- unnecessary in PsychLab;
   %boxcox(data=auto, resp=MPG, model=Weight Displa Gratio,
           id=model, out=coxout,
           gplot=RMSE EFFECT INFL, pplot=RMSE EFFECT INFL,
           lopower=-2.5, hipower=2.5, conf=.99);

The OUT= parameter specifies the name of the dataset to contain the transformed response.