This paper reviews the principle of Minimum Description Length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This approach began with Kolmogorov's theory of algorithmic complexity, matured in the literature on information theory, and has recently received renewed interest within the statistics community. In the pages that follow, we review both the practical and the theoretical aspects of MDL as a tool for model selection, emphasizing the rich connections between information theory and statistics. At the boundary between these two disciplines, we find many interesting interpretations of popular frequentist and Bayesian procedures. As we will see, MDL provides an objective umbrella under which rather disparate approaches to statistical modeling can co-exist and be compared. We illustrate the MDL principle by considering problems in regression, nonparametric curve estimation, cluster analysis, and time series analysis. Because model selection in linear regression is an extremely common problem that arises in many applications, we present detailed derivations of several MDL criteria in this context and discuss their properties through a number of examples. Our emphasis