Modern Model Selection Methods

Robert Stine, University of Pennsylvania

This workshop includes five lectures on statistical model selection. The lecture topics are listed below, with some elaboration following each. The problem of variable selection in regression is used throughout to unify the methods, though the ideas generalize to many other types of models.

(1) An overview of model selection

Automated data collection, data warehousing, and ever-faster computing combine to make it possible to fit many variations of complex models. The combination of many predictors, large samples, and powerful software makes it easy to build models that hold the potential to reveal hidden structure. The problem for the statistician is to decide whether what was found is meaningful.

Well-known criteria like Mallows' Cp, AIC, and BIC often produce very different solutions. Which is right? All three originated to solve different problems with simpler models, and one must ask whether they remain useful in judging models with so many parameters. Two or three examples illustrate the use of these methods. This first lecture reviews these classical selection criteria and contrasts them with some recent innovations in model selection. New tools for model selection have originated from various perspectives, including risk inflation, minimax estimation, empirical Bayes, multiple hypothesis testing, and information theory. The trend has been to make the selection criteria adaptive to the problem at hand, and the following lectures explore some of these developments more carefully.
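As a concrete illustration of how the three classical criteria can be computed and compared, here is a minimal Python sketch (my own, not part of the workshop materials) that scores every subset of a handful of simulated predictors; the data and coefficient values are hypothetical.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 6
    X = rng.normal(size=(n, p))
    y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)   # only two real signals

    def rss(cols):
        """Residual sum of squares for OLS on the chosen columns (plus intercept)."""
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r = y - Z @ beta
        return r @ r

    sigma2_full = rss(range(p)) / (n - p - 1)        # error variance from the full model

    best = {}
    for cols in itertools.chain.from_iterable(
            itertools.combinations(range(p), k) for k in range(p + 1)):
        k = len(cols) + 1                            # parameters, counting the intercept
        r = rss(cols)
        scores = {
            "Cp":  r / sigma2_full - n + 2 * k,          # Mallows' Cp
            "AIC": n * np.log(r / n) + 2 * k,            # Akaike
            "BIC": n * np.log(r / n) + np.log(n) * k,    # Schwarz
        }
        for name, s in scores.items():
            if name not in best or s < best[name][0]:
                best[name] = (s, cols)

    for name, (s, cols) in best.items():
        print(f"{name}: selects predictors {cols} (score {s:.1f})")

On a clean example like this the criteria often agree; their rankings diverge as predictors multiply and the penalties (2k versus k log n) pull in different directions.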

(2) Thresholding and multiple shrinkage

A classical problem in model selection is regression with just as many orthogonal predictors as observations, such as in the use of Fourier methods (trigonometric regression) in time series analysis. Recently, Donoho and Johnstone have shown how a simple thresholding procedure provides the means to decide which coefficients are important to retain for the final model. The solution has certain optimal properties from a minimax perspective.
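To make the idea concrete, here is a small Python sketch (my own illustration, not taken from the lectures) of hard and soft thresholding of orthogonal regression coefficients using the universal threshold sigma * sqrt(2 log n) associated with Donoho and Johnstone; the simulated signal is hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 128
    theta = np.zeros(n)
    theta[:5] = [4, -3, 5, -4, 3]                # a few large coefficients, the rest zero
    sigma = 1.0
    z = theta + sigma * rng.normal(size=n)       # observed coefficients (orthogonal design)

    lam = sigma * np.sqrt(2 * np.log(n))         # the "universal" threshold

    hard = np.where(np.abs(z) > lam, z, 0.0)                 # keep or kill
    soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)     # shrink toward zero

    print("threshold:", round(lam, 2))
    print("coefficients kept (hard):", np.count_nonzero(hard))
    print("coefficients kept (soft):", np.count_nonzero(soft))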

One arrives at a similar variable selection method through multiple shrinkage in a Bayesian setting and through a decision-theoretic idea known as risk inflation. This lecture explores these dissimilar, but convergent, approaches to variable selection. Examples use wavelet regression for nonlinear smoothing.

(3) Adaptive methods

Thresholding methods are appropriate when only a few of the underlying parameters are large. In other problems, many of the coefficients are important to retain, and thresholding misses important parts of the ``signal.'' Several recent variations on thresholding methods address this problem, and they originate in diverse areas, including empirical Bayes and multiple hypothesis testing. Time permitting, we will also discuss some more computational approaches to this problem, such as those based on cross-validation.
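One way to picture the multiple-testing flavor of such adaptive rules (a hedged sketch of my own, not the specific procedures covered in the lecture) is to threshold coefficients with a step-up rule in the spirit of Benjamini and Hochberg, so that the cutoff adapts to how many coefficients look large.

    import numpy as np
    from scipy.stats import norm

    def fdr_threshold(z, sigma=1.0, q=0.05):
        """Step-up rule: keep coefficients whose two-sided p-values pass the B-H cutoff."""
        p = 2 * norm.sf(np.abs(z) / sigma)      # two-sided p-value for each coefficient
        order = np.argsort(p)
        m = len(z)
        passed = np.nonzero(p[order] <= q * np.arange(1, m + 1) / m)[0]
        keep = np.zeros(m, dtype=bool)
        if passed.size:
            keep[order[: passed.max() + 1]] = True
        return keep

    rng = np.random.default_rng(2)
    m = 128
    theta = np.zeros(m)
    theta[:20] = 3.0                             # many moderate signals, not just a few
    z = theta + rng.normal(size=m)

    keep = fdr_threshold(z)
    print("coefficients retained:", keep.sum())  # retains more when more look real

Unlike the fixed universal threshold, the cutoff here moves with the data, which is the essential point of the adaptive methods.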

Examples extending those from the previous lecture are used to see the value of these improvements.

(4) Information theory and statistics

Information theory offers an alternative view of many statistical problems, in particular model selection. Coding theory, a part of information theory, concerns the efficient compression of data into a minimal length message. These same ideas can be applied to statistical model selection. The connection is relatively intuitive: a good-fitting statistical model is able to compress its data well. Think of how a regression model reduces the initial sum of squares down to the residual sum of squares.
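As a back-of-the-envelope illustration (a sketch under the usual Gaussian-coding assumption; the sums of squares below are made up), the idealized code length for n residuals with variance SS/n is about (n/2) log2(2 pi e SS/n) bits, so reducing the sum of squares translates directly into a shorter message.

    import numpy as np

    def gaussian_code_length_bits(ss, n):
        """Idealized bits to encode n values treated as N(0, ss/n) noise (precision details ignored)."""
        return 0.5 * n * np.log2(2 * np.pi * np.e * ss / n)

    n = 200
    total_ss = 500.0      # initial sum of squares about the mean (hypothetical)
    residual_ss = 80.0    # residual sum of squares after the regression (hypothetical)

    print("bits without the model:", round(gaussian_code_length_bits(total_ss, n)))
    print("bits for the residuals:", round(gaussian_code_length_bits(residual_ss, n)))
    # The difference is the compression bought by the fit, before paying for parameters.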

Before we can talk about these applications to selection in any depth, we need to lay some foundations. This lecture introduces the key results from coding theory needed to see the relationship between information theory and modeling that is developed in the next lecture. Anyone who has ever wondered how file compression tools like WinZip work will find the answer here as well.
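A minimal sketch of the central coding-theory fact (my illustration, using a hypothetical message): a symbol with probability p can be coded in about -log2(p) bits, so frequent symbols get short codes, which is where compressors find their savings.

    import numpy as np
    from collections import Counter

    message = "abracadabra abracadabra abracadabra"   # hypothetical text to compress
    counts = Counter(message)
    n = len(message)

    ideal_bits = sum(-c * np.log2(c / n) for c in counts.values())   # Shannon code length
    print("plain 8-bit encoding:", 8 * n, "bits")
    print("ideal entropy coding:", round(ideal_bits), "bits")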

(5) Information theory and model selection

Coding automatically protects against over-fitting. When coding, the compressed message must include enough information for the receiver to recover the original data. For a statistical model, this means that the message must include the parameters that identify the model used to compress the data. A good fit alone is not enough, since the number of bits needed to encode the parameters may exceed the bits saved in compressing the data. The resulting two-part codes (model parameters, then the data encoded with the model) suggest a means to model selection: pick the model with the shortest message. This is the approach of Rissanen's minimum description length (MDL) criterion.
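A rough two-part-code calculation in this spirit (a sketch, not Rissanen's exact formulation; the residual sums of squares are hypothetical): charge (n/2) log2(RSS/n) bits for the data given the model, dropping constants that cancel when comparing models, and (1/2) log2(n) bits per fitted parameter, then pick the model with the shorter total.

    import numpy as np

    def two_part_bits(rss, n, k):
        """Crude two-part code length: data given the model, plus the k fitted parameters."""
        data_bits = 0.5 * n * np.log2(rss / n)   # bits to encode the residuals (constants dropped)
        param_bits = 0.5 * k * np.log2(n)        # bits to transmit each parameter
        return data_bits + param_bits

    n = 200
    candidates = {"2 predictors": (300.0, 3), "8 predictors": (270.0, 9)}   # (RSS, parameters)

    for name, (rss, k) in candidates.items():
        print(name, "->", round(two_part_bits(rss, n, k)), "bits")
    # The larger model fits a little better but does not repay the extra parameter bits.

The parameter charge of (1/2) log2(n) bits per coefficient is what makes this crude version behave like BIC; the lecture develops sharper codes.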

Going further, all of the previous approaches to model selection (AIC, BIC, thresholding, empirical Bayes, adaptive thresholding) can be cast as particular methods for coding a model. This commonality reveals further connections among the methods and suggests how they can be customized and improved.

Instructor

Robert Stine received his PhD in Statistics from Princeton University and currently teaches Statistics at the Wharton School of the University of Pennsylvania. Professor Stine has published widely on resampling methods for assessing statistical variation, and maintains interests in exploratory data analysis, statistical graphics, and statistical computing.