Results 1–10 of 57
Aggregation by exponential weighting and sharp oracle inequalities
Cited by 57 (5 self)
Abstract. In the present paper, we study the problem of aggregation under the squared loss in the model of regression with deterministic design. We obtain sharp oracle inequalities for convex aggregates defined via exponential weights, under general assumptions on the distribution of errors and on the functions to aggregate. We show how these results can be applied to derive a sparsity oracle inequality.
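The exponential-weights aggregate discussed in this and several of the following abstracts admits a short generic sketch. The function below is an illustration under the listed assumptions (uniform prior by default, squared empirical loss, temperature β); the names and parameterization are ours, not any particular paper's:

```python
import numpy as np

def exponential_weights(preds, y, beta, prior=None):
    """Aggregate candidate predictions by exponential weighting.

    preds: (M, n) array, row j = predictions of candidate f_j at the n
           design points; y: (n,) observations; beta: temperature.
    Weights are proportional to prior * exp(-n * empirical_risk / beta).
    """
    M, n = preds.shape
    if prior is None:
        prior = np.full(M, 1.0 / M)            # uniform prior over candidates
    risks = np.mean((preds - y) ** 2, axis=1)  # empirical squared risks
    logits = np.log(prior) - n * risks / beta
    logits -= logits.max()                     # stabilize the softmax
    w = np.exp(logits)
    w /= w.sum()
    return w, w @ preds                        # weights and aggregated predictions
```

As β decreases, the weights concentrate on the empirically best candidates, so the aggregate interpolates between the prior mean and empirical risk minimization over the candidate set.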
Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity
 Machine Learning
Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms
, 2010
Cited by 50 (8 self)
In this paper we present a generic algorithmic framework, namely, the accelerated stochastic approximation (AC-SA) algorithm, for solving strongly convex stochastic composite optimization (SCO) problems. While the classical stochastic approximation (SA) algorithms are asymptotically optimal for solving differentiable and strongly convex problems, the AC-SA algorithm, when employed with proper stepsize policies, can achieve optimal or nearly optimal rates of convergence for solving different classes of SCO problems during a given number of iterations. Moreover, we investigate these AC-SA algorithms in more detail, for example by establishing large-deviation results associated with the convergence rates and by introducing an efficient validation procedure to check the accuracy of the generated solutions.
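The classical SA baseline that this abstract contrasts with — not the accelerated AC-SA scheme itself — can be sketched for a μ-strongly convex objective with the standard O(1/(μt)) stepsizes. Everything here (the function name, the `grad` oracle interface) is illustrative:

```python
import numpy as np

def sgd_strongly_convex(grad, x0, mu, T, rng):
    """Classical stochastic approximation baseline: SGD with 1/(mu*t)
    stepsizes for a mu-strongly convex objective.

    grad(x, rng) returns a stochastic gradient at x. This is a simplified
    stand-in for the classical SA scheme, not the accelerated AC-SA method.
    """
    x = np.asarray(x0, dtype=float)
    for t in range(1, T + 1):
        x = x - grad(x, rng) / (mu * t)   # stepsize 1/(mu*t)
    return x
```

For the toy objective f(x) = ||x||²/2 with additive Gaussian gradient noise, the iterates converge to the minimizer at the familiar O(1/√T) stochastic rate.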
Sparse Regression Learning by Aggregation and Langevin Monte-Carlo
, 2009
Cited by 38 (5 self)
We consider the problem of regression learning for deterministic design and independent random errors. We start by proving a sharp PAC-Bayesian type bound for the exponentially weighted aggregate (EWA) under the expected squared empirical loss. For a broad class of noise distributions the presented bound is valid whenever the temperature parameter β of the EWA is larger than or equal to 4σ², where σ² is the noise variance. A remarkable feature of this result is that it is valid even for unbounded regression functions and the choice of the temperature parameter depends exclusively on the noise level. Next, we apply this general bound to the problem of aggregating the elements of a finite-dimensional linear space spanned by a dictionary of functions φ1,...,φM. We allow M to be much larger than the sample size n but we assume that the true regression function can be well approximated by a sparse linear combination of functions φj. Under this sparsity scenario, we propose an EWA with a heavy-tailed prior and we show that it satisfies a sparsity oracle inequality with leading constant one. Finally, we propose several Langevin Monte-Carlo algorithms to approximately compute such an EWA when the number M of aggregated functions can be large. We discuss in some detail the convergence of these algorithms and present numerical experiments that confirm our theoretical findings.
PAC-Bayesian bounds for sparse regression estimation with exponential weights
 Electronic Journal of Statistics
Cited by 31 (5 self)
Abstract. We consider the sparse regression model where the number of parameters p is larger than the sample size n. The difficulty when considering high-dimensional problems is to propose estimators achieving a good compromise between statistical and computational performance. The BIC estimator, for instance, performs well from the statistical point of view [11] but can only be computed for values of p of at most a few tens. The Lasso estimator is the solution of a convex minimization problem, hence computable for large values of p. However, stringent conditions on the design are required to establish fast rates of convergence for this estimator. Dalalyan and Tsybakov [19] propose a method achieving a good compromise between the statistical and computational aspects of the problem. Their estimator can be computed for reasonably large p and satisfies nice statistical properties under weak assumptions on the design. However, [19] proposes sparsity oracle inequalities in expectation for the empirical excess risk only. In this paper, we propose an aggregation procedure similar to that of [19] but with improved statistical performance. Our main theoretical result is a sparsity oracle inequality in probability for the true excess risk for a version of the exponential weights estimator. We also propose an MCMC method to compute our estimator for reasonably large values of p.
Progressive mixture rules are deviation suboptimal
 Advances in Neural Information Processing Systems
Cited by 29 (4 self)
We consider the learning task consisting in predicting as well as the best function in a finite reference set G up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least-squares loss when the output is bounded), it is known that the progressive mixture rule ĝ satisfies E R(ĝ) ≤ min_{g∈G} R(g) + Cst · (log |G|)/n, (1) where n denotes the size of the training set and E denotes the expectation w.r.t. the training-set distribution. This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is no better than Cst/√n: it fails to achieve the expected Cst/n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates.
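The progressive mixture rule mentioned in this abstract averages exponentially weighted mixtures computed on growing prefixes of the sample. A minimal sketch, assuming squared loss and a finite candidate set whose predictions are given at all n points (names and learning-rate parameterization are ours):

```python
import numpy as np

def progressive_mixture(preds, y, eta):
    """Progressive mixture over a finite reference set (squared loss).

    preds: (M, n) candidate predictions; y: (n,) targets; eta: learning rate.
    At round i the mixture weights each candidate by exp(-eta * cumulative
    loss on examples 0..i-1); the final predictor is the average of the
    per-round mixture predictors.
    """
    M, n = preds.shape
    cum_loss = np.zeros(M)
    mixtures = []
    for i in range(n):
        logits = -eta * cum_loss
        logits -= logits.max()            # numerical stabilization
        w = np.exp(logits)
        w /= w.sum()
        mixtures.append(w @ preds)        # mixture predictor at round i
        cum_loss += (preds[:, i] - y[i]) ** 2
    return np.mean(mixtures, axis=0)      # Cesàro average over the rounds
```

The first-round mixture is uniform regardless of the data, which is one intuition for why the rule's deviations can be of order 1/√n even when its expected excess risk is of order (log M)/n.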
Suboptimality of penalized empirical risk minimization in classification
 In Proceedings of the 20th Annual Conference on Computational Learning Theory (COLT), Lecture Notes in Computer Science 4539, 142–156
, 2007
Cited by 21 (3 self)
Abstract. Let F be a set of M classification procedures with values in [−1, 1]. Given a loss function, we want to construct a procedure which mimics the best procedure in F at the best possible rate. This fastest rate is called the optimal rate of aggregation. Considering a continuous scale of loss functions with various types of convexity, we prove that optimal rates of aggregation can be either ((log M)/n)^{1/2} or (log M)/n. We prove that, if all the M classifiers are binary, the (penalized) Empirical Risk Minimization procedures are suboptimal (even under the margin/low noise condition) when the loss function is somewhat more than convex, whereas, in that case, aggregation procedures with exponential weights achieve the optimal rate of aggregation.
Model selection for density estimation with L2-loss. ArXiv e-prints
, 2008
Cited by 16 (0 self)
We consider here estimation of an unknown probability density s belonging to L2(µ) where µ is a probability measure. We have at hand n i.i.d. observations with density s and use the squared L2-norm as our loss function. The purpose of this paper is to provide an abstract but completely general method for estimating s by model selection, allowing us to handle arbitrary families of finite-dimensional (possibly nonlinear) models and any s ∈ L2(µ). We shall, in particular, consider the cases of unbounded densities and bounded densities with unknown L∞-norm and investigate how the L∞-norm of s may influence the risk. We shall also provide applications to adaptive estimation and aggregation of preliminary estimators. Although of a purely theoretical nature, our method leads to results that cannot presently be reached by more concrete ones.
Sparse estimation by exponential weighting
 Statist. Sci
, 2012
Cited by 16 (1 self)
Abstract. Consider a regression model with fixed design and Gaussian noise where the regression function can potentially be well approximated by a function that admits a sparse representation in a given dictionary. This paper resorts to exponential weights to exploit this underlying sparsity by implementing the principle of sparsity pattern aggregation. This model-selection take on sparse estimation allows us to derive sparsity oracle inequalities in several popular frameworks, including ordinary sparsity, fused sparsity and group sparsity. One striking aspect of these theoretical results is that they hold under no condition on the dictionary. Moreover, we describe an efficient implementation of the sparsity pattern aggregation principle that compares favorably to state-of-the-art procedures on some basic numerical examples. Key words and phrases: High-dimensional regression, exponential weights, sparsity, fused sparsity, group sparsity, sparsity oracle inequalities, sparsity pattern aggregation, sparsity prior, sparse regression.
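Sparsity pattern aggregation, as summarized above, fits a least-squares estimator on every support pattern and aggregates the fits with exponential weights. A brute-force toy sketch for small p (the risk estimate, prior, and all names below are illustrative, not the paper's efficient implementation):

```python
import itertools
import numpy as np

def sparsity_pattern_aggregate(X, y, sigma2, beta):
    """Toy sparsity pattern aggregation (fixed design, Gaussian noise).

    For each support pattern S, fit least squares restricted to the columns
    in S, then aggregate the fitted values with exponential weights based on
    an unbiased risk estimate (RSS + 2*sigma2*|S|) and a prior penalizing
    large patterns. Enumerates all 2**p patterns, so only for small p.
    """
    n, p = X.shape
    fits, scores = [], []
    for r in range(p + 1):
        for S in itertools.combinations(range(p), r):
            S = list(S)
            if S:
                coef, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
                f = X[:, S] @ coef
            else:
                f = np.zeros(n)
            rss = np.sum((y - f) ** 2)
            fits.append(f)
            # risk estimate scaled by the temperature, plus a sparsity prior
            scores.append(-(rss + 2 * sigma2 * len(S)) / beta
                          - len(S) * np.log(p + 1))
    scores = np.array(scores)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return np.array(fits).T @ w           # aggregated fitted values, shape (n,)
```

When the true regression function is exactly sparse and the noise is small, the weights concentrate on the patterns containing the true support, so the aggregate nearly interpolates the signal.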
Sparsity regret bounds for individual sequences in online linear regression
 JMLR Workshop and Conference Proceedings, 19 (COLT 2011 Proceedings):377–396
, 2011
Cited by 16 (0 self)
We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T. We introduce the notion of a sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online learning algorithm called SeqSEW, based on exponential weighting and data-driven truncation. In a second part, we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) that solve two questions left open therein. In particular, our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design.
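The combination of exponential weighting with truncation that this abstract describes can be illustrated by a generic sequential forecaster over a finite set of experts. This is a simplification, not the SeqSEW algorithm itself: here the truncation level `clip` is a fixed input rather than data-driven, and all names are ours:

```python
import numpy as np

def seq_exp_weights(expert_preds, y, eta, clip):
    """Sequential exponentially weighted forecaster with truncation.

    expert_preds: (M, T) expert forecasts; y: (T,) targets revealed online;
    eta: inverse temperature; clip: truncation level B (forecasts are
    clipped to [-B, B] before mixing). Returns the forecaster's predictions.
    """
    M, T = expert_preds.shape
    cum_loss = np.zeros(M)
    out = np.empty(T)
    for t in range(T):
        logits = -eta * cum_loss
        logits -= logits.max()             # numerical stabilization
        w = np.exp(logits)
        w /= w.sum()
        p = np.clip(expert_preds[:, t], -clip, clip)
        out[t] = w @ p                     # forecast before y[t] is revealed
        cum_loss += (p - y[t]) ** 2        # then update on the squared loss
    return out
```

After a few rounds the weights concentrate on the experts with the smallest cumulative squared loss, so the forecaster tracks the best expert on the observed prefix of the sequence.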