
## Consistency of cross validation for comparing regression procedures. Annals of Statistics, Accepted paper

Citations: 25 (3 self)

### Citations

5785
Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984
Citation Context: ...divide the data into r subgroups and do prediction one at a time for each subgroup based on estimation using the rest of the subgroups (this is called r-fold CV; see Breiman, Friedman, Olshen and Stone [3]). When multiple splittings are used, there are two natural ways to proceed. One is to first average the prediction errors over the different splittings and then select the procedure that minimizes th...
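The r-fold CV scheme described in this context is easy to make concrete. The sketch below is a minimal illustration, not code from the paper; `rfold_cv_compare`, `fit_mean`, and `fit_linear` are hypothetical names, and squared error is assumed as the loss:

```python
import random

def fit_mean(train):
    """Trivial procedure delta_1: predict the training mean everywhere."""
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

def fit_linear(train):
    """Procedure delta_2: least-squares simple linear regression."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    b = sum((x - mx) * (y - my) for x, y in train) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def rfold_cv_compare(data, fit1, fit2, r=5, seed=0):
    """Divide the data into r subgroups; predict each subgroup from a fit
    on the other r - 1 subgroups; return (winner, total CV errors)."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::r] for k in range(r)]  # r disjoint subgroups
    err = [0.0, 0.0]
    for k in range(r):
        held = [data[i] for i in folds[k]]
        rest = [data[i] for j in range(r) if j != k for i in folds[j]]
        for m, fit in enumerate((fit1, fit2)):
            pred = fit(rest)  # estimate using the other subgroups
            err[m] += sum((y - pred(x)) ** 2 for x, y in held)
    return (1 if err[0] <= err[1] else 2), err
```

On data generated from a line, the linear procedure has the smaller total CV error, so the comparison selects procedure 2.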

2781
Information theory and an extension of the maximum likelihood principle
- Akaike
- 1973
Citation Context: ...ing parameter selection for nonparametric regression. In linear regression, it has been shown that delete-1 and generalized CVs are asymptotically equivalent to the Akaike Information Criterion (AIC) [1] and they are all inconsistent in the sense that the probability of selecting the true model does not converge to 1 as n goes to ∞ (see Li [15]). In addition, interestingly, the analysis of Shao [19] ...

1830
Spline Models for Observational Data
- Wahba
- 1990
Citation Context: ...δ1 may be simple linear regression and δ2 may be a local polynomial regression procedure (see, e.g., Fan and Gijbels [8]). For another example, δ1 may be a spline estimation procedure (see, e.g., Wahba [29]) and δ2 may be a wavelet estimation procedure (see, e.g., Donoho and Johnstone [7]). Based on a sample (Xi, Yi), i = 1, ..., n, the regression procedures δ1 and δ2 yield estimators fn,1(x) and fn,2(x), re...

1150
Cross-validatory choice and assessment of statistical prediction
- Stone
- 1974
Citation Context: ...dominating in size. Furthermore, it can even be of a smaller order than the proportion for estimation while not affecting the consistency property. 1. Introduction. Cross validation (e.g., Allen [2], Stone [25] and Geisser [9]) is one of the most commonly used model selection criteria. Basically, based on a data splitting, part of the data is used for fitting each competing model (or procedure) and the rest...

903
Local polynomial modelling and its applications
- Fan, Gijbels
- 1996
Citation Context: ...there are two regression procedures, say δ1 and δ2, that are considered. For example, δ1 may be simple linear regression and δ2 may be a local polynomial regression procedure (see, e.g., Fan and Gijbels [8]). For another example, δ1 may be a spline estimation procedure (see, e.g., Wahba [29]) and δ2 may be a wavelet estimation procedure (see, e.g., Donoho and Johnstone [7]). Based on a sample (Xi, Yi), i = 1, ..., n, the regres...

814
Convergence of Stochastic Processes
- Pollard
- 1984
Citation Context: ... = ‖f − f̂n1,j‖₄⁴, where the subscript Z1 in VarZ1 and EZ1 is used to denote the conditional expectation given Z1. Thus, conditional on Z1, on Hn1, by Bernstein's inequality (see, e.g., Pollard [18], page 193), for each x > 0, we have

PZ1( Σᵢ₌ₙ₁₊₁ⁿ (f(Xi) − f̂n1,1(Xi))² − n2‖f − f̂n1,1‖₂² ≥ x ) ≤ exp( −(1/2)x² / (n2‖f − f̂n1,1‖₄⁴ + 2(An1,ǫ)²x/3) ). T...
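For context, the inequality invoked from Pollard has the standard Bernstein form for independent mean-zero variables ξi with |ξi| ≤ M (stated here for the reader's convenience, not as the paper's exact display):

```latex
\Pr\!\Big(\sum_i \xi_i \ge x\Big)
  \;\le\; \exp\!\left( -\,\frac{\tfrac{1}{2}\,x^2}{\sum_i \mathbb{E}\,\xi_i^2 \;+\; Mx/3} \right),
  \qquad x > 0,
```

with the role of the variance term Σᵢ E ξi² played above by n2‖f − f̂n1,1‖₄⁴ and the boundedness constant M by a multiple of (An1,ǫ)².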

714
Smoothing Noisy Data with Spline Functions: estimating the correct degree of smoothing by the method of generalized cross validation
- Craven, Wahba
- 1979
Citation Context: ...overall performance is selected. There are a few different versions of cross-validation (CV) methods, including delete-1 CV, delete-k (k > 1) CV and also generalized CV methods (e.g., Craven and Wahba [6]). Cross validation can be applied to various settings, including parametric and nonparametric regression. There can be different primary goals when Received March 2006; revised February 2007. 1 Suppo...
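The delete-1 CV mentioned in this context can be sketched in a few lines. This is a minimal illustration with hypothetical helper names, assuming squared-error loss; `fit` is any procedure mapping a training sample to a predictor:

```python
def fit_mean(train):
    """Trivial regression procedure: predict the training mean everywhere."""
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

def delete1_cv(data, fit):
    """Delete-1 (leave-one-out) CV score: each (x, y) is predicted from a
    fit on the remaining n - 1 points; return the average squared error."""
    n = len(data)
    total = 0.0
    for i in range(n):
        held_x, held_y = data[i]
        predictor = fit(data[:i] + data[i + 1:])  # refit without point i
        total += (held_y - predictor(held_x)) ** 2
    return total / n
```

Delete-k CV is the same idea with k points held out at a time; generalized CV replaces the n refits with an algebraic shortcut available for linear smoothers.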

343
Optimal Global Rates of Convergence for Nonparametric Estimators
- Stone
- 1982
Citation Context: ...0 < α ≤ 1 and α = β − m; also ‖f‖∞ is bounded, or for Sobolev classes, the rates of convergence under the sup-norm distance and Lp (p < ∞) are different only by a logarithmic factor (see, e.g., Stone [24] and Nemirovski [16]). If one takes an optimal or near-optimal estimator under the L∞ loss, Condition 3 is satisfied typically with Mn being a logarithmic term. 3.2. The main theorem. Let I∗ = 1 if δ...

327
Smoothing Methods in Statistics
- Simonoff
- 1996
Citation Context: ...example, for the rather famous Old Faithful Geyser data (Weisberg [31]), there were several analyses related to the comparison of linear regression with nonparametric alternatives (see, e.g., Simonoff [21] and Hart [13]). For simplicity, suppose that there are two regression procedures, say δ1 and δ2, that are considered. For example, δ1 may be simple linear regression and δ2 may be a local polynomial ...

316
Minimax estimation via wavelet shrinkage
- Donoho, Johnstone
- 1998
Citation Context: ...(see, e.g., Fan and Gijbels [8]). For another example, δ1 may be a spline estimation procedure (see, e.g., Wahba [29]) and δ2 may be a wavelet estimation procedure (see, e.g., Donoho and Johnstone [7]). Based on a sample (Xi, Yi), i = 1, ..., n, the regression procedures δ1 and δ2 yield estimators fn,1(x) and fn,2(x), respectively. We need to select the better one of them. Though for simplicity we assumed that the...

316
A Distribution-Free Theory of Nonparametric Regression
- Györfi, Kohler, et al.
- 2002
Citation Context: ...regression estimation (see, e.g., Speckman [22] and Burman [5] for spline estimation, Härdle, Hall and Marron [12], Hall and Johnstone [11] and references therein for kernel estimation). Györfi et al. [10] gave risk bounds for kernel and nearest-neighbor regression with bandwidth or neighbor size selected by delete-1 CV. See Opsomer, Wang and Yang [17] for a review and references related to the use of ...

229
The relationship between variable selection and data augmentation and a method for prediction
- Allen
- 1974
Citation Context: ...to be dominating in size. Furthermore, it can even be of a smaller order than the proportion for estimation while not affecting the consistency property. 1. Introduction. Cross validation (e.g., Allen [2], Stone [25] and Geisser [9]) is one of the most commonly used model selection criteria. Basically, based on a data splitting, part of the data is used for fitting each competing model (or procedure) ...

215
Linear-model selection by cross-validation
- Shao
- 1993
Citation Context: ...(AIC) [1] and they are all inconsistent in the sense that the probability of selecting the true model does not converge to 1 as n goes to ∞ (see Li [15]). In addition, interestingly, the analysis of Shao [19] showed that in order for delete-k CV to be consistent, k needs to be dominatingly large in the sense that k/n → 1 (and n − k → ∞). Zhang [35] proved that delete-k CV is asymptotically equivalent to t...

198
The predictive sample reuse method with applications
- Geisser
- 1975
Citation Context: ...Furthermore, it can even be of a smaller order than the proportion for estimation while not affecting the consistency property. 1. Introduction. Cross validation (e.g., Allen [2], Stone [25] and Geisser [9]) is one of the most commonly used model selection criteria. Basically, based on a data splitting, part of the data is used for fitting each competing model (or procedure) and the rest of the data is ...

189
Optimal rates of convergence for nonparametric estimators
- Stone
- 1980
Citation Context: ...for comparing two general regression procedures. Let {an} be a sequence of positive numbers approaching zero. The following simple definition concerns the rate of convergence in probability (cf. Stone [23]). Definition 3. A procedure δ (or {θ̂n}, n = 1, 2, ...) is said to converge exactly at rate {an} in probability under the loss L if L(θ, θ̂n) = Op(an), and for every 0 < ǫ < 1, there exists cǫ > 0 such th...
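The Op(an) condition in the definition quoted above is the usual bounded-in-probability statement; spelled out (a standard fact, added here for readability):

```latex
L(\theta,\hat\theta_n) = O_p(a_n)
\quad\Longleftrightarrow\quad
\forall\,\varepsilon>0\ \ \exists\,C_\varepsilon>0:\ \ 
\limsup_{n\to\infty}\ \Pr\big(L(\theta,\hat\theta_n) > C_\varepsilon a_n\big) \le \varepsilon .
```

The truncated remainder of Definition 3 adds a matching lower bound, making {an} the exact rate rather than merely an upper bound.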

165
Nonparametric smoothing and lack-of-fit tests
- Hart
- 1997
Citation Context: ...the rather famous Old Faithful Geyser data (Weisberg [31]), there were several analyses related to the comparison of linear regression with nonparametric alternatives (see, e.g., Simonoff [21] and Hart [13]). For simplicity, suppose that there are two regression procedures, say δ1 and δ2, that are considered. For example, δ1 may be simple linear regression and δ2 may be a local polynomial regression pro...

114
An asymptotic theory for linear model selection (with discussion)
- Shao
- 1997
Citation Context: ...large in the sense that k/n → 1 (and n − k → ∞). Zhang [35] proved that delete-k CV is asymptotically equivalent to the Final Prediction Error (FPE) criterion when k → ∞. The readers are referred to Shao [20] for more asymptotic results and references on model selection for linear regression. In the context of nonparametric regression, delete-1 CV for smoothing parameter selection leads to consistent regr...

106
Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set
- Li
- 1987
Citation Context: ...asymptotically equivalent to the Akaike Information Criterion (AIC) [1] and they are all inconsistent in the sense that the probability of selecting the true model does not converge to 1 as n goes to ∞ (see Li [15]). In addition, interestingly, the analysis of Shao [19] showed that in order for delete-k CV to be consistent, k needs to be dominatingly large in the sense that k/n → 1 (and n − k → ∞). Zhang [35] p...

85
A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods
- Burman
- 1989
Citation Context: ...same ratio (this is called multifold CV; see, e.g., Zhang [35]); (2) the same as in (1) but consider only a sample of all possible splits (this is called repeated learning-testing; see, e.g., Burman [4]); (3) divide the data into r subgroups and do prediction one at a time for each subgroup based on estimation using the rest of the subgroups (this is called r-fold CV; see Breiman, Friedman, Olshen a...

81
How far are automatically chosen regression smoothing parameters from their optimum?
- Härdle, Hall, et al.
- 1988
Citation Context: ...leads to asymptotically optimal or rate-optimal choice of smoothing parameters and/or optimal regression estimation (see, e.g., Speckman [22] and Burman [5] for spline estimation, Härdle, Hall and Marron [12], Hall and Johnstone [11] and references therein for kernel estimation). Györfi et al. [10] gave risk bounds for kernel and nearest-neighbor regression with bandwidth or neighbor size selected by dele...

60
Adaptive regression by mixing
- Yang
- 2001
Citation Context: ...are hard to distinguish, the forced action of choosing a single winner can substantially damage the accuracy of estimating the regression function. An alternative is to average the estimates. See Yang [33, 34] for references and theoretical results on combining models/procedures and simulation results that compare CV and a model-combining procedure, Adaptive Regression by Mixing (ARM). It should be poin...

56
Nonparametric regression with correlated errors. Stat. Science 16, 134–153
- Opsomer, Wang, et al.
- 1999
Citation Context: ...references therein for kernel estimation). Györfi et al. [10] gave risk bounds for kernel and nearest-neighbor regression with bandwidth or neighbor size selected by delete-1 CV. See Opsomer, Wang and Yang [17] for a review and references related to the use of CV for bandwidth selection for nonparametric regression with dependent errors. In real-world applications of regression, in pursuing a better estimat...

56
Model selection via multifold cross-validation
- Zhang
- 1993
Citation Context: ...Li [15]). In addition, interestingly, the analysis of Shao [19] showed that in order for delete-k CV to be consistent, k needs to be dominatingly large in the sense that k/n → 1 (and n − k → ∞). Zhang [35] proved that delete-k CV is asymptotically equivalent to the Final Prediction Error (FPE) criterion when k → ∞. The readers are referred to Shao [20] for more asymptotic results and references on mode...

42
Spline smoothing and optimal rates of convergence in nonparametric regression models. The Annals of Statistics
- Speckman
- 1985
Citation Context: ...kernel regression and Li [14] for the nearest-neighbor method) and leads to asymptotically optimal or rate-optimal choice of smoothing parameters and/or optimal regression estimation (see, e.g., Speckman [22] and Burman [5] for spline estimation, Härdle, Hall and Marron [12], Hall and Johnstone [11] and references therein for kernel estimation). Györfi et al. [10] gave risk bounds for kernel and nearest-n...

41
Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. Working Paper 130, U.C. Berkeley Division of Biostatistics
- van der Laan, Dudoit
- 2003
Citation Context: ...cross validation is also applicable to choose one of them. Recently, a general CV methodology has been advocated by van der Laan, Dudoit, van der Vaart and their co-authors (e.g., van der Laan and Dudoit [26], van der Laan, Dudoit and van der Vaart [27] and van der Vaart, Dudoit and van der Laan [28]), which can be applied in other contexts (e.g., survival function estimation). Risk bounds for estimating ...

41
Model selection in nonparametric regression
- Wegkamp
- 2003
Citation Context: ...consistency in selection. For example, with two nested models, if both models are correct, then reasonable selection rules will be consistent in estimation but not necessarily so in selection. See Wegkamp [30] for risk bounds for a modified CV (with an extra complexity penalty) for estimating the regression function. 3.1. Definitions and conditions. We first give some useful definitions for comparing estim...

36
Regression with multiple candidate models: selecting or mixing? Statist. Sinica 13
- Yang
- 2003
Citation Context: ...are hard to distinguish, the forced action of choosing a single winner can substantially damage the accuracy of estimating the regression function. An alternative is to average the estimates. See Yang [33, 34] for references and theoretical results on combining models/procedures and simulation results that compare CV and a model-combining procedure, Adaptive Regression by Mixing (ARM). It should be poin...

33
Topics in Non-Parametric Statistics. In: Lectures on Probability Theory and Statistics. Ecole d'Eté de Probabilités de St
- Nemirovski
- 2000
Citation Context: ...α = β − m; also ‖f‖∞ is bounded, or for Sobolev classes, the rates of convergence under the sup-norm distance and Lp (p < ∞) are different only by a logarithmic factor (see, e.g., Stone [24] and Nemirovski [16]). If one takes an optimal or near-optimal estimator under the L∞ loss, Condition 3 is satisfied typically with Mn being a logarithmic term. 3.2. The main theorem. Let I∗ = 1 if δ1 is asymptotically ...

22
Empirical Functionals and Efficient Smoothing Parameter Selection (with discussion)
- Hall, Johnstone
- 1992
Citation Context: ...asymptotically optimal or rate-optimal choice of smoothing parameters and/or optimal regression estimation (see, e.g., Speckman [22] and Burman [5] for spline estimation, Härdle, Hall and Marron [12], Hall and Johnstone [11] and references therein for kernel estimation). Györfi et al. [10] gave risk bounds for kernel and nearest-neighbor regression with bandwidth or neighbor size selected by delete-1 CV. See Opsomer, Wan...

22
Applied Linear Regression, 3rd ed.
- Weisberg
- 2005
Citation Context: ...whether a nonparametric method can provide a better estimate to capture a questionable slight curvature seen in the data. For a specific example, for the rather famous Old Faithful Geyser data (Weisberg [31]), there were several analyses related to the comparison of linear regression with nonparametric alternatives (see, e.g., Simonoff [21] and Hart [13]). For simplicity, suppose that there are two regre...

12
On the Consistency of Cross-Validation in Kernel Nonparametric Regression. The Annals of Statistics
- Wong
- 1983
Citation Context: ...and references on model selection for linear regression. In the context of nonparametric regression, delete-1 CV for smoothing parameter selection leads to consistent regression estimators (e.g., Wong [32] for kernel regression and Li [14] for the nearest-neighbor method) and leads to asymptotically optimal or rate-optimal choice of smoothing parameters and/or optimal regression estimation (see, e.g., ...

9
Estimation of optimal transformations using v-fold cross-validation and repeated learning testing methods
- Burman
- 1990
Citation Context: ...and Li [14] for the nearest-neighbor method) and leads to asymptotically optimal or rate-optimal choice of smoothing parameters and/or optimal regression estimation (see, e.g., Speckman [22] and Burman [5] for spline estimation, Härdle, Hall and Marron [12], Hall and Johnstone [11] and references therein for kernel estimation). Györfi et al. [10] gave risk bounds for kernel and nearest-neighbor regress...

7
Consistency for Cross-Validated Nearest Neighbor Estimates in Nonparametric Regression
- Li
- 1985
Citation Context: ...for linear regression. In the context of nonparametric regression, delete-1 CV for smoothing parameter selection leads to consistent regression estimators (e.g., Wong [32] for kernel regression and Li [14] for the nearest-neighbor method) and leads to asymptotically optimal or rate-optimal choice of smoothing parameters and/or optimal regression estimation (see, e.g., Speckman [22] and Burman [5] for s...

6
The cross-validated adaptive epsilon-net estimator
- van der Laan, Dudoit, et al.
- 2006
Citation Context: ...one of them. Recently, a general CV methodology has been advocated by van der Laan, Dudoit, van der Vaart and their co-authors (e.g., van der Laan and Dudoit [26], van der Laan, Dudoit and van der Vaart [27] and van der Vaart, Dudoit and van der Laan [28]), which can be applied in other contexts (e.g., survival function estimation). Risk bounds for estimating the target function were derived and their im...

2
Oracle inequalities for multi-fold cross validation
- van der Vaart, Dudoit, et al.
- 2006
Citation Context: ...has been advocated by van der Laan, Dudoit, van der Vaart and their co-authors (e.g., van der Laan and Dudoit [26], van der Laan, Dudoit and van der Vaart [27] and van der Vaart, Dudoit and van der Laan [28]), which can be applied in other contexts (e.g., survival function estimation). Risk bounds for estimating the target function were derived and their implications on adaptive estimation and asymptotic...