Towards Robust Model Selection using Estimation and Approximation Error Bounds
- Proc. 9th Annual Conference on Computational Learning Theory, p.57, ACM
, 1996
Cited by 10 (8 self)
In this paper we extend previous work [17] and introduce a novel model selection criterion, based on combining two recent chains of thought. In particular, we make use of the powerful framework of uniform convergence of empirical processes pioneered by Vapnik and Chervonenkis [23], combined with recent results concerning the approximation ability of non-linear manifolds of functions, focusing in particular on feedforward neural networks. The main contributions of this work are twofold: (i) Conceptual: elucidating a coherent and robust framework for model selection; (ii) Technical: a lower bound on the approximation error (Theorem 10), which holds in a well-specified sense for most functions of interest. As far as we are aware, this result is new in the field of function approximation. The remainder of the paper is organized as follows. In
Aggregation for regression learning
- Laboratoire de Probabilités, Université Paris VI, 2004, http://www.proba.jussieu.fr/mathdoc/preprints/index.html# 2004. L. Birgé
, 2004
Cited by 8 (1 self)
This paper studies statistical aggregation procedures in a regression setting. A motivating factor is the existence of many different methods of estimation, leading to possibly competing estimators. We consider here three different types of aggregation: model selection (MS) aggregation, convex (C) aggregation and linear (L) aggregation. The objective of (MS) is to select the optimal single estimator from the list; that of (C) is to select the optimal convex combination of the given estimators; and that of (L) is to select the optimal linear combination of the given estimators. We are interested in evaluating the rates of convergence of the excess risks of the estimators obtained by these procedures. Our approach is motivated by recent minimax results in Nemirovski (2000) and Tsybakov (2003). There exist competing aggregation procedures achieving optimal convergence separately for each one of the (MS), (C) and (L) cases. Since the bounds in these results are not directly comparable with each other, we suggest an alternative solution. We prove that all three optimal bounds can be nearly achieved via a single “universal” aggregation procedure. We propose such a procedure, which consists of mixing the initial estimators with weights obtained by penalized least squares. Two different penalties are considered: one of them is related to hard thresholding techniques, the second one is a data-dependent L1-type penalty.
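The three aggregation targets named in this abstract can be illustrated with a minimal sketch for the special case of two fixed estimators. This is not the penalized "universal" procedure the paper proposes. It only computes, on one sample, the best single estimator (MS), the best convex combination (C) and the best linear combination (L) by plain least squares. The data, the two estimators `f1` and `f2`, and all function names are hypothetical.

```python
import random

def sq_risk(y, pred):
    """Empirical mean squared error of predictions against responses."""
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)

def combine(w, f1, f2):
    """Predictions of the weighted combination w[0]*f1 + w[1]*f2."""
    return [w[0] * a + w[1] * b for a, b in zip(f1, f2)]

def ms_weights(y, f1, f2):
    """(MS): put all weight on the single estimator with smaller empirical risk."""
    return (1.0, 0.0) if sq_risk(y, f1) <= sq_risk(y, f2) else (0.0, 1.0)

def convex_weights(y, f1, f2):
    """(C): best weights (w, 1 - w) with 0 <= w <= 1, by clipped 1-D least squares."""
    d = [a - b for a, b in zip(f1, f2)]
    denom = sum(v * v for v in d)
    if denom == 0.0:  # the two estimators coincide on this sample
        return (1.0, 0.0)
    w = sum((yi - bi) * di for yi, bi, di in zip(y, f2, d)) / denom
    w = min(1.0, max(0.0, w))
    return (w, 1.0 - w)

def linear_weights(y, f1, f2):
    """(L): unconstrained least squares, solving the 2x2 normal equations."""
    a11 = sum(a * a for a in f1)
    a12 = sum(a * b for a, b in zip(f1, f2))
    a22 = sum(b * b for b in f2)
    b1 = sum(a * yi for a, yi in zip(f1, y))
    b2 = sum(b * yi for b, yi in zip(f2, y))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:  # collinear estimators: fall back to the convex solution
        return convex_weights(y, f1, f2)
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det)

# Hypothetical data: quadratic truth plus Gaussian noise.
random.seed(0)
xs = [random.random() for _ in range(200)]
y = [x * x + random.gauss(0.0, 0.05) for x in xs]
f1 = list(xs)          # hypothetical estimator 1: predicts x
f2 = [0.3] * len(xs)   # hypothetical estimator 2: predicts a constant

risk_ms = sq_risk(y, combine(ms_weights(y, f1, f2), f1, f2))
risk_c = sq_risk(y, combine(convex_weights(y, f1, f2), f1, f2))
risk_l = sq_risk(y, combine(linear_weights(y, f1, f2), f1, f2))
print(risk_ms, risk_c, risk_l)
```

Because the three weight sets are nested, the empirical risks on the fitting sample are ordered L ≤ C ≤ MS. The abstract is concerned with the excess risk on new data, where the larger classes pay a larger estimation price, which is why the three problems have different optimal rates.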
A Sharp Concentration Inequality With Applications
, 1999
Cited by 7 (1 self)
We derive a new general concentration-of-measure inequality. The concentration inequality applies, among others, to configuration functions as defined by Talagrand and also to combinatorial entropies such as the logarithm of the number of increasing subsequences in a random permutation and to Vapnik-Chervonenkis (VC) entropies. The results find direct applications in statistical learning theory, substantiating the possibility to use the empirical VC-entropy in penalization techniques.
Adaptive Estimation in Pattern Recognition by Combining Different Procedures
- Statistica Sinica
Cited by 6 (3 self)
We study a problem of adaptive estimation of a conditional probability function in a pattern recognition setting. In many applications, for more flexibility, one may want to consider various estimation procedures targeted at different scenarios and/or under different assumptions. For example, when the feature dimension is high, to overcome the familiar curse of dimensionality, one may seek a good parsimonious model among a number of candidates such as CART, neural nets, additive models, and others. For such a situation, one wishes to have an automated final procedure that always performs as well as the best candidate. In this work, we propose a method to combine a countable collection of procedures for estimating the conditional probability. We show that the combined procedure has the property that its statistical risk is bounded above by that of any of the procedures being considered plus a small penalty. Thus in an asymptotic sense, the strengths of the different estimation procedures i...
On Learning Multicategory Classification with Sample Queries
, 2003
Cited by 1 (1 self)
Consider the pattern recognition problem of learning multi-category classification from a labeled sample, for instance, the problem of learning character recognition where a category corresponds to an alphanumeric letter. The classical theory of pattern recognition assumes labeled examples appear according to the unknown underlying pattern-class conditional probability distributions where the pattern classes are picked randomly according to their a priori probabilities. In this paper we pose the following question: Can the learning accuracy be improved if labeled examples are independently randomly drawn according to the underlying class conditional probability distributions but the pattern classes are chosen not necessarily according to their a priori probabilities? We answer this in the affirmative by showing that there exists a tuning of the subsample proportions which minimizes a loss criterion. The tuning is relative to the intrinsic complexity of the Bayes-classifier. As this complexity depends on the underlying probability distributions which are assumed to be unknown, we provide an algorithm which learns the proportions in an on-line manner utilizing sample querying which asymptotically minimizes the criterion. In practice, this algorithm may be used to boost the performance of existing learning classification algorithms by apportioning better subsample proportions.
How accurate can any regression procedure be?
- Iowa State University
, 2000
Cited by 1 (0 self)
Various parametric and nonparametric regression procedures have been constructed according to different possible characteristics of the underlying regression function. To reduce the dependence on subjective assumptions, the theme of adaptive estimation is to construct a procedure that provides an accurate estimate of the regression function for various scenarios without knowing which one describes the data well. A closely related question is: Given a regression procedure, how many regression functions are estimated accurately? In this work, for a given sequence of prescribed estimation accuracy (in sample size), we give an upper bound (in terms of metric entropy) on the number of regression functions for which the accuracy is achieved. A consequence is that if one demands near optimal performance for a target class of regression functions, then the same accuracy cannot be achieved for many additional regression functions. This has a negative implication for adaptive estimation. The main result is also applied to show that as far as polynomial rates of convergence are concerned, any regression procedure is essentially no better than a method based on sparse approximation.
A permutation approach to validation
- In Proc. 10th SIAM International Conference on Data Mining (SDM)
, 2010
Cited by 1 (1 self)
We give a permutation approach to validation (estimation of out-sample error). One typical use of validation is model selection. We establish the legitimacy of the proposed permutation complexity by proving a uniform bound on the out-sample error, similar to a VC-style bound. We extensively demonstrate this approach experimentally on synthetic data, standard data sets from the UCI-repository, and a novel diffusion data set. The out-of-sample error estimates are comparable to cross validation (CV); yet, the method is more efficient and robust, being less susceptible to overfitting during model selection.
Minimax Nonparametric Classification - Part II: Model Selection for Adaptation
- IEEE Transactions on Information Theory
, 1998
We study nonparametric estimation of a conditional probability for classification based on a collection of finite-dimensional models. For the sake of flexibility, different types of models, linear or nonlinear, are allowed as long as each satisfies a dimensionality assumption. We show that with a suitable model selection criterion, the penalized maximum likelihood estimator has risk bounded by an index of resolvability expressing a good trade-off among approximation error, estimation error, and model complexity. The bound does not require any assumption on the target conditional probability and can be used to demonstrate the adaptivity of estimators based on model selection. Examples are given with both splines and neural nets, and problems of high-dimensional estimation are considered. The resulting adaptive estimator is shown to behave optimally or near optimally over Sobolev classes (with unknown orders of interaction and smoothness) and classes of integrable Fourier transform of gr...
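The trade-off described in this abstract can be sketched with a toy penalized-likelihood model selection. It is a minimal illustration under stated assumptions, not the paper's criterion: the candidate models are hypothetical histogram estimates of P(Y=1 | X) with k equal-width bins on [0, 1), the penalty is a BIC-style (k/2)·log n chosen only for illustration, and the add-half smoothing is likewise an assumption.

```python
import math
import random

def fit_histogram(xs, ys, k):
    """Estimate P(Y=1 | X in bin j) on k equal-width bins of [0, 1)."""
    ones = [0] * k
    total = [0] * k
    for x, label in zip(xs, ys):
        j = min(int(x * k), k - 1)
        ones[j] += label
        total[j] += 1
    # Add-half smoothing (an assumption) keeps estimates strictly inside (0, 1).
    return [(o + 0.5) / (t + 1.0) for o, t in zip(ones, total)]

def neg_log_lik(xs, ys, probs):
    """Negative conditional log-likelihood of the labels under the histogram model."""
    k = len(probs)
    nll = 0.0
    for x, label in zip(xs, ys):
        p = probs[min(int(x * k), k - 1)]
        nll -= math.log(p if label == 1 else 1.0 - p)
    return nll

def penalized_criterion(xs, ys, k):
    """Fit term plus a complexity penalty that grows with model dimension k."""
    probs = fit_histogram(xs, ys, k)
    return neg_log_lik(xs, ys, probs) + 0.5 * k * math.log(len(xs))

# Hypothetical data: P(Y=1 | x) is 0.1 below x = 0.5 and 0.9 above it.
random.seed(1)
n = 2000
xs = [random.random() for _ in range(n)]
ys = [1 if random.random() < (0.9 if x >= 0.5 else 0.1) else 0 for x in xs]

candidates = [1, 2, 4, 8, 16]
scores = {k: penalized_criterion(xs, ys, k) for k in candidates}
best_k = min(candidates, key=lambda k: scores[k])
print(best_k, scores[best_k])
```

A one-bin model has large approximation error here, while many bins reduce the fit term only slightly and pay a growing penalty, so the criterion balances the two terms in the spirit of the resolvability bound mentioned above.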
On the Learnability of Rich Function Classes
, 1983
In this paper we present an extension of the PAC framework with which rich function classes with possibly infinite pseudo-dimension may be learned with a finite number of examples and a finite amount of partial information. As an example we consider learning a family of infinite dimensional Sobolev classes. © 1999 Academic Press. Key Words: PAC learning; computational learning theory; information-based complexity; VC-theory; approximation theory; partial information

