Results 1–10 of 37
The tradeoffs of large scale learning
 In: Advances in Neural Information Processing Systems 20
, 2008
Abstract

Cited by 254 (4 self)
This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation–estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in nontrivial ways.
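A minimal sketch of the computational side of this tradeoff (an illustration only, not taken from the paper): under a fixed budget of single-example gradient evaluations, full-batch gradient descent affords only a handful of expensive steps, while stochastic gradient descent spends the same budget on many cheap, noisy steps. All problem details below (the 1-D least-squares task, step sizes, budget) are made up for the demo.

```python
import random

random.seed(0)

# Synthetic 1-D least-squares problem: minimize mean((w*x - y)^2)/2
# with y = 3*x plus small noise; the optimum is near w = 3.
data = [(x, 3.0 * x + random.gauss(0.0, 0.1))
        for x in (random.uniform(-1, 1) for _ in range(1000))]

def grad(w, x, y):
    return (w * x - y) * x  # single-example gradient

budget = 5000  # total single-example gradient evaluations allowed

# Full-batch GD: each step costs len(data) evaluations -> only 5 steps.
w_gd, lr = 0.0, 1.0
for _ in range(budget // len(data)):
    g = sum(grad(w_gd, x, y) for x, y in data) / len(data)
    w_gd -= lr * g

# SGD: each step costs 1 evaluation -> 5000 steps with a decaying rate.
w_sgd, lr0 = 0.0, 0.5
for t in range(budget):
    x, y = random.choice(data)
    w_sgd -= lr0 / (1 + 0.01 * t) * grad(w_sgd, x, y)

print(w_gd, w_sgd)  # both move toward 3 on the same compute budget
```

With this budget SGD gets much closer to the optimum than batch GD, which is the large-scale regime the abstract describes: the binding constraint is computation, not the number of available examples.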
Lectures on the central limit theorem for empirical processes
 Probability and Banach Spaces
, 1986
Abstract

Cited by 135 (9 self)
Concentration inequalities are used to derive some new inequalities for ratio-type suprema of empirical processes. These general inequalities are used to prove several new limit theorems for ratio-type suprema and to recover a number of the results from [1] and [2]. As a statistical application, an oracle inequality for nonparametric regression is obtained via ratio bounds.
Fast rates for regularized objectives
 In Neural Information Processing Systems
, 2008
Abstract

Cited by 40 (8 self)
We study convergence properties of empirical minimization of a stochastic strongly convex objective, where the stochastic component is linear. We show that the value attained by the empirical minimizer converges to the optimal value with rate 1/n. The result applies, in particular, to the SVM objective. Thus, we obtain a rate of 1/n on the convergence of the SVM objective (with fixed regularization parameter) to its infinite-data limit. We demonstrate how this is essential for obtaining certain types of oracle inequalities for SVMs. The results also extend to approximate minimization and to strong convexity with respect to an arbitrary norm, and hence to objectives regularized using other ℓp norms.
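As a schematic reminder of the mechanism behind the 1/n rate (a standard strong-convexity statement, not a quotation from the paper; constants and conditions are simplified):

```latex
% For a $\lambda$-strongly convex objective $F(w) = \mathbb{E}[\ell(w; Z)]$
% whose stochastic component is linear (e.g. the SVM objective with a
% fixed regularization parameter $\lambda$), the empirical minimizer
% $\hat{w}_n$ satisfies an excess-risk bound of order
\[
  F(\hat{w}_n) - \min_{w} F(w) \;=\; O\!\left(\frac{1}{\lambda n}\right),
\]
% in contrast with the $1/\sqrt{n}$ rate available without strong convexity.
```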
Concentration inequalities and asymptotic results for ratio type empirical processes
 Ann. Probab.
, 2006
Abstract

Cited by 40 (5 self)
Let F be a class of measurable functions on a measurable space (S, S) with values in [0, 1] and let P_n = n^{-1} ∑_{i=1}^n δ_{X_i} be the empirical measure based on an i.i.d. sample (X_1, ..., X_n) from a probability distribution P on (S, S). We study the behavior of suprema of the following type: sup_{r_n < σ_P f ≤ δ_n} |P_n f − Pf| / φ(σ_P f), where σ_P f ≥ Var_P^{1/2} f and φ is a continuous, strictly increasing function with φ(0) = 0. Using Talagrand’s concentration inequality for empirical processes, we establish concentration inequalities for such suprema and use them to derive several results about their asymptotic behavior, expressing the conditions in terms of expectations of localized suprema of empirical processes. We also prove new bounds for expected values of sup-norms of empirical processes in terms of the largest σ_P f and the L2(P) norm of the envelope of the function class, which are especially suited for estimating localized suprema. With this technique, we extend to function classes most of the known results on ratio-type suprema of empirical processes, including some of Alexander’s results for VC classes of sets. We also consider applications of these results to several important problems in nonparametric statistics and in learning theory (including general excess risk bounds in empirical risk minimization and their versions for L2-regression and classification, and ratio-type bounds for margin distributions in classification).
Statistical properties of kernel principal component analysis
 Machine Learning
, 2004
Abstract

Cited by 36 (3 self)
The main goal of this paper is to prove inequalities on the reconstruction error for Kernel Principal Component Analysis. With respect to previous work on this topic, our contribution is twofold: (1) we give bounds that explicitly take into account the empirical centering step in this algorithm, and (2) we show that a “localized” approach allows one to obtain more accurate bounds. In particular, we show faster rates of convergence towards the minimum reconstruction error; more precisely, we prove that the convergence rate can typically be faster than n^{-1/2}. We also obtain a new relative bound on the error. A secondary goal, for which we present similar contributions, is to obtain convergence bounds for the partial sums of the largest or smallest eigenvalues of the kernel Gram matrix towards the eigenvalues of the corresponding kernel operator. These quantities are naturally linked to the KPCA procedure; furthermore, these results can have applications to the study of various other kernel algorithms. The results are presented in a functional analytic framework, which is suited to deal rigorously with reproducing kernel Hilbert spaces of infinite dimension.
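A short numpy sketch of the empirical centering step the bounds explicitly account for (an illustration under invented data, not the paper's code): the Gram matrix is double-centered before its eigendecomposition, and the empirical reconstruction error of d-dimensional KPCA is the sum of the discarded eigenvalues of the (scaled) centered Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # n = 200 samples in R^5 (toy data)

# RBF kernel Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / 2).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)

# Empirical centering: Kc = H K H with H = I - (1/n) 11^T is the Gram
# matrix of the feature vectors after subtracting their empirical mean.
n = len(X)
H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ K @ H

# Eigenvalues of Kc / n estimate the eigenvalues of the centered
# kernel integral operator (descending order).
evals = np.linalg.eigvalsh(Kc / n)[::-1]

# Empirical reconstruction error of d-dimensional KPCA: the mass
# left in the discarded eigenvalues.
d = 3
recon_error = evals[d:].sum()
print(recon_error)
```

The quantities the abstract mentions appear directly: partial sums of `evals` estimate partial sums of operator eigenvalues, and the bounds in the paper control how fast `recon_error` approaches its population counterpart.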
Smoothness, low noise and fast rates
 In NIPS
, 2010
Abstract

Cited by 30 (9 self)
We establish an excess risk bound of Õ(H R_n² + √(H L*) R_n) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity R_n, where L* is the best risk achievable by the hypothesis class. For typical hypothesis classes where R_n = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L* = 0) case and Õ(RH/n + √(L* RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth nonnegative objective.
Distance-based classification with Lipschitz functions
 Journal of Machine Learning Research
, 2003
Abstract

Cited by 30 (2 self)
The goal of this article is to develop a framework for large margin classification in metric spaces. We want to find a generalization of linear decision functions for metric spaces and define a corresponding notion of margin such that the decision function separates the training points with a large margin. It turns out that when Lipschitz functions are used as decision functions, the inverse of the Lipschitz constant can be interpreted as the size of a margin. In order to construct a clean mathematical setup, we isometrically embed the given metric space into a Banach space and the space of Lipschitz functions into its dual space. To analyze the resulting algorithm, we prove several representer theorems. They state that there always exist solutions of the Lipschitz classifier which can be expressed in terms of distance functions to training points. We provide generalization bounds for Lipschitz classifiers in terms of the Rademacher complexities of some Lipschitz function classes. The generality of our approach can be seen from the fact that several well-known algorithms are special cases of the Lipschitz classifier, among them the support vector machine, the linear programming machine, and the 1-nearest-neighbor classifier.
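The 1-nearest-neighbor special case can be sketched directly (a toy illustration, not the paper's algorithm; the data and metric are invented): the decision function is the difference of distances to the closest training point of each class, and since each min-distance term is 1-Lipschitz, the decision function is 2-Lipschitz, tying its Lipschitz constant to a margin scale as described above.

```python
import math

# Toy 2-D training set: class +1 near (1, 1), class -1 near (-1, -1).
train = [((1.0, 1.2), +1), ((0.8, 1.0), +1), ((1.1, 0.9), +1),
         ((-1.0, -1.1), -1), ((-0.9, -1.0), -1), ((-1.2, -0.8), -1)]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def decision(x):
    # f(x) = d(x, nearest negative) - d(x, nearest positive):
    # positive where x is closer to the +1 class. Each min of
    # 1-Lipschitz distance functions is 1-Lipschitz, so f is
    # 2-Lipschitz -- a decision function expressed, as in the
    # representer theorems, via distances to training points.
    d_pos = min(dist(x, p) for p, y in train if y == +1)
    d_neg = min(dist(x, p) for p, y in train if y == -1)
    return d_neg - d_pos

print(decision((1.0, 1.0)) > 0)   # classified +1
print(decision((-1.0, -1.0)) < 0) # classified -1
```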
Empirical minimization
 Probability Theory and Related Fields, 135(3):311–334
, 2003
Abstract

Cited by 27 (10 self)
We investigate the behavior of the empirical minimization algorithm using various methods. We first analyze it by comparing the empirical (random) structure with the original structure on the class, either in an additive sense, via the uniform law of large numbers, or in a multiplicative sense, using isomorphic coordinate projections. We then show that a direct analysis of the empirical minimization algorithm yields a significantly better bound, and that the estimates we obtain are essentially sharp. The method of proof we use is based on Talagrand’s concentration inequality for empirical processes.
On the Performance of Kernel Classes
 Journal of Machine Learning Research
, 2003
Abstract

Cited by 26 (3 self)
We present sharp bounds on the localized Rademacher averages of the unit ball in a reproducing kernel Hilbert space in terms of the eigenvalues of the integral operator associated with the kernel. We use this result to estimate the performance of the empirical minimization algorithm when the base class is the unit ball of the reproducing kernel Hilbert space.
Stochastic Gradient Descent Tricks
Abstract

Cited by 14 (0 self)
Chapter 1 strongly advocates the stochastic backpropagation method to train neural networks. This is in fact an instance of a more general technique called stochastic gradient descent (SGD). This chapter provides background material, explains why SGD is a good learning algorithm when the training set is large, and provides useful recommendations.
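A minimal SGD loop in the spirit of the chapter's recommendations (a sketch, not its pseudocode; the toy data, hyperparameters, and the step-size schedule γ_t = γ₀ / (1 + γ₀ λ t), which is one schedule commonly recommended for strongly convex objectives, are assumptions of this demo):

```python
import random

random.seed(1)

# Toy 1-D SVM objective: lam/2 * w^2 + mean(max(0, 1 - y*w*x)),
# on linearly separable data y = sign(x).
data = [(x, 1 if x > 0 else -1)
        for x in (random.uniform(-2, 2) for _ in range(500))]

lam, gamma0, w = 0.01, 1.0, 0.0
for t in range(20000):
    x, y = random.choice(data)
    gamma = gamma0 / (1 + gamma0 * lam * t)  # decaying step size
    # Subgradient of the regularized hinge loss at one example.
    g = lam * w - (y * x if y * w * x < 1 else 0.0)
    w -= gamma * g

errors = sum(1 for x, y in data if y * w * x <= 0)
print(w, errors)  # w should be positive; training errors near zero
```

Each update touches a single example, which is why the method scales to large training sets: the cost per step is independent of n.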