Results 1  10
of
176
Online Learning with Kernels
, 2003
"... Kernel based algorithms such as support vector machines have achieved considerable success in various problems in the batch setting where all of the training data is available in advance. Support vector machines combine the socalled kernel trick with the large margin idea. There has been little u ..."
Abstract

Cited by 2831 (123 self)
 Add to MetaCart
Kernel based algorithms such as support vector machines have achieved considerable success in various problems in the batch setting where all of the training data is available in advance. Support vector machines combine the socalled kernel trick with the large margin idea. There has been little use of these methods in an online setting suitable for realtime applications. In this paper we consider online learning in a Reproducing Kernel Hilbert Space. By considering classical stochastic gradient descent within a feature space, and the use of some straightforward tricks, we develop simple and computationally efficient algorithms for a wide range of problems such as classification, regression, and novelty detection. In addition to allowing the exploitation of the kernel trick in an online setting, we examine the value of large margins for classification in the online setting with a drifting target. We derive worst case loss bounds and moreover we show the convergence of the hypothesis to the minimiser of the regularised risk functional. We present some experimental results that support the theory as well as illustrating the power of the new algorithms for online novelty detection. In addition
Pegasos: Primal Estimated subgradient solver for SVM
"... We describe and analyze a simple and effective stochastic subgradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy ɛ is Õ(1/ɛ), where each iteration operates on a singl ..."
Abstract

Cited by 542 (20 self)
 Add to MetaCart
We describe and analyze a simple and effective stochastic subgradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy ɛ is Õ(1/ɛ), where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require Ω(1/ɛ2) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total runtime of our method is Õ(d/(λɛ)), where d is a bound on the number of nonzero features in each example. Since the runtime does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to nonlinear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an orderofmagnitude speedup over previous SVM learning methods.
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
, 2010
"... Stochastic subgradient methods are widely used, well analyzed, and constitute effective tools for optimization and online learning. Stochastic gradient methods ’ popularity and appeal are largely due to their simplicity, as they largely follow predetermined procedural schemes. However, most common s ..."
Abstract

Cited by 311 (3 self)
 Add to MetaCart
(Show Context)
Stochastic subgradient methods are widely used, well analyzed, and constitute effective tools for optimization and online learning. Stochastic gradient methods ’ popularity and appeal are largely due to their simplicity, as they largely follow predetermined procedural schemes. However, most common subgradient approaches are oblivious to the characteristics of the data being observed. We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradientbased learning. The adaptation, in essence, allows us to find needles in haystacks in the form of very predictive but rarely seenfeatures. Ourparadigmstemsfromrecentadvancesinstochasticoptimizationandonlinelearning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. In a companion paper, we validate experimentally our theoretical analysis and show that the adaptive subgradient approach outperforms stateoftheart, but nonadaptive, subgradient algorithms. 1
(Online) Subgradient Methods for Structured Prediction
"... Promising approaches to structured learning problems have recently been developed in the maximum margin framework. Unfortunately, algorithms that are computationally and memory efficient enough to solve large scale problems have lagged behind. We propose using simple subgradientbased techniques for ..."
Abstract

Cited by 89 (15 self)
 Add to MetaCart
(Show Context)
Promising approaches to structured learning problems have recently been developed in the maximum margin framework. Unfortunately, algorithms that are computationally and memory efficient enough to solve large scale problems have lagged behind. We propose using simple subgradientbased techniques for optimizing a regularized risk formulation of these problems in both online and batch settings, and analyze the theoretical convergence, generalization, and robustness properties of the resulting techniques. These algorithms are are simple, memory efficient, fast to converge, and have small regret in the online setting. We also investigate a novel convex regression formulation of structured learning. Finally, we demonstrate the benefits of the subgradient approach on three structured prediction problems. 1
A secondorder perceptron algorithm
, 2005
"... Kernelbased linearthreshold algorithms, such as support vector machines and Perceptronlike algorithms, are among the best available techniques for solving pattern classification problems. In this paper, we describe an extension of the classical Perceptron algorithm, called secondorder Perceptr ..."
Abstract

Cited by 83 (23 self)
 Add to MetaCart
(Show Context)
Kernelbased linearthreshold algorithms, such as support vector machines and Perceptronlike algorithms, are among the best available techniques for solving pattern classification problems. In this paper, we describe an extension of the classical Perceptron algorithm, called secondorder Perceptron, and analyze its performance within the mistake bound model of online learning. The bound achieved by our algorithm depends on the sensitivity to secondorder data information and is the best known mistake bound for (efficient) kernelbased linearthreshold classifiers to date. This mistake bound, which strictly generalizes the wellknown Perceptron bound, is expressed in terms of the eigenvalues of the empirical data correlation matrix and depends on a parameter controlling the sensitivity of the algorithm to the distribution of these eigenvalues. Since the optimal setting of this parameter is not known a priori, we also analyze two variants of the secondorder Perceptron algorithm: one that adaptively sets the value of the parameter in terms of the number of mistakes made so far, and one that is parameterless, based on pseudoinverses.
Composite Objective Mirror Descent
"... We present a new method for regularized convex optimization and analyze it under both online and stochastic optimization settings. In addition to unifying previously known firstorder algorithms, such as the projected gradient method, mirror descent, and forwardbackward splitting, our method yields n ..."
Abstract

Cited by 67 (8 self)
 Add to MetaCart
(Show Context)
We present a new method for regularized convex optimization and analyze it under both online and stochastic optimization settings. In addition to unifying previously known firstorder algorithms, such as the projected gradient method, mirror descent, and forwardbackward splitting, our method yields new analysis and algorithms. We also derive specific instantiations of our method for commonly used regularization functions, such as ℓ1, mixed norm, and tracenorm. 1
NoRegret Reductions for Imitation Learning and Structured Prediction
 In AISTATS
, 2011
"... Sequential prediction problems such as imitation learning, where future observations depend on previous predictions (actions), violate the common i.i.d. assumptions made in statistical learning. This leads to poor performance in theory and often in practice. Some recent approaches (Daumé III et al., ..."
Abstract

Cited by 67 (13 self)
 Add to MetaCart
(Show Context)
Sequential prediction problems such as imitation learning, where future observations depend on previous predictions (actions), violate the common i.i.d. assumptions made in statistical learning. This leads to poor performance in theory and often in practice. Some recent approaches (Daumé III et al., 2009; Ross and Bagnell, 2010) provide stronger guarantees in this setting, but remain somewhat unsatisfactory as they train either nonstationary or stochastic policies and require a large number of iterations. In this paper, we propose a new iterative algorithm, which trains a stationary deterministic policy, that can be seen as a no regret algorithm in an online learning setting. We show that any such no regret algorithm, combined with additional reduction assumptions, must find a policy with good performance under the distribution of observations it induces in such sequential settings. We demonstrate that this new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem. 1
Aggregation by exponential weighting and sharp oracle inequalities
"... Abstract. In the present paper, we study the problem of aggregation under the squared loss in the model of regression with deterministic design. We obtain sharp oracle inequalities for convex aggregates defined via exponential weights, under general assumptions on the distribution of errors and on t ..."
Abstract

Cited by 54 (5 self)
 Add to MetaCart
(Show Context)
Abstract. In the present paper, we study the problem of aggregation under the squared loss in the model of regression with deterministic design. We obtain sharp oracle inequalities for convex aggregates defined via exponential weights, under general assumptions on the distribution of errors and on the functions to aggregate. We show how these results can be applied to derive a sparsity oracle inequality. 1
Aggregation by exponential weighting, sharp PACBayesian bounds and sparsity
 MACH LEARN
"... ..."
(Show Context)