Results 1  10
of
103
Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
"... Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVR ..."
Abstract

Cited by 70 (6 self)
 Add to MetaCart
Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning. 1
Minimizing Finite Sums with the Stochastic Average Gradient
, 2013
"... We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method’s iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradie ..."
Abstract

Cited by 42 (2 self)
 Add to MetaCart
We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method’s iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values the SAG method achieves a faster convergence rate than blackbox SG methods. The convergence rate is improved from O(1 / √ k) to O(1/k) in general, and when the sum is stronglyconvex the convergence rate is improved from the sublinear O(1/k) to a linear convergence rate of the form O(ρ k) for ρ < 1. Further, in many cases the convergence rate of the new method is also faster than blackbox deterministic gradient methods, in terms of the number of gradient evaluations. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of nonuniform sampling strategies. 1
Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization.
 Mathematical Programming,
, 2015
"... Abstract We introduce a proximal version of the stochastic dual coordinate ascent method and show how to accelerate the method using an innerouter iteration procedure. We analyze the runtime of the framework and obtain rates that improve stateoftheart results for various key machine learning op ..."
Abstract

Cited by 36 (2 self)
 Add to MetaCart
(Show Context)
Abstract We introduce a proximal version of the stochastic dual coordinate ascent method and show how to accelerate the method using an innerouter iteration procedure. We analyze the runtime of the framework and obtain rates that improve stateoftheart results for various key machine learning optimization problems including SVM, logistic regression, ridge regression, Lasso, and multiclass SVM. Experiments validate our theoretical findings.
SAGA: A Fast Incremental Gradient Method With Support for NonStrongly Convex Composite Objectives
, 2014
"... In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and ha ..."
Abstract

Cited by 30 (3 self)
 Add to MetaCart
In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports nonstrongly convex problems directly, and is adaptive to any inherent strong convexity of the problem. We give experimental results showing the effectiveness of our method. 1
A proximal stochastic gradient method with progressive variance reduction.
 SIAM Journal on Optimization,
, 2014
"... Abstract We consider the problem of minimizing the sum of two convex functions: one is the average of a large number of smooth component functions, and the other is a general convex function that admits a simple proximal mapping. We assume the whole objective function is strongly convex. Such probl ..."
Abstract

Cited by 30 (6 self)
 Add to MetaCart
(Show Context)
Abstract We consider the problem of minimizing the sum of two convex functions: one is the average of a large number of smooth component functions, and the other is a general convex function that admits a simple proximal mapping. We assume the whole objective function is strongly convex. Such problems often arise in machine learning, known as regularized empirical risk minimization. We propose and analyze a new proximal stochastic gradient method, which uses a multistage scheme to progressively reduce the variance of the stochastic gradient. While each iteration of this algorithm has similar cost as the classical stochastic gradient method (or incremental gradient method), we show that the expected objective value converges to the optimum at a geometric rate. The overall complexity of this method is much lower than both the proximal full gradient method and the standard proximal stochastic gradient method.
Proximal stochastic dual coordinate ascent
 CoRR
"... We introduce a proximal version of dual coordinate ascent method. We demonstrate how the derived algorithmic framework can be used for numerous regularized loss minimization problems, including `1 regularization and structured output SVM. The convergence rates we obtain match, and sometimes improve, ..."
Abstract

Cited by 29 (4 self)
 Add to MetaCart
We introduce a proximal version of dual coordinate ascent method. We demonstrate how the derived algorithmic framework can be used for numerous regularized loss minimization problems, including `1 regularization and structured output SVM. The convergence rates we obtain match, and sometimes improve, stateoftheart results. 1
CommunicationEfficient Algorithms for Statistical Optimization
"... We study two communicationefficient algorithms for distributed statistical optimization on largescale data. The first algorithm is an averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provid ..."
Abstract

Cited by 27 (5 self)
 Add to MetaCart
(Show Context)
We study two communicationefficient algorithms for distributed statistical optimization on largescale data. The first algorithm is an averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves meansquared error that decays as O(N −1 +(N/m) −2). Wheneverm ≤ √ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of the bootstrap. Requiring only a single round of communication, it has meansquared error that decays asO(N −1 +(N/m) −3), and so is more robust to the amount of parallelization. We complement our theoretical results with experiments on largescale problems from the internet search domain. In particular, we show that our methods efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which consists ofN ≈ 2.4×10 8 samples andd ≥ 700,000 dimensions. 1
Minibatch primal and dual methods for SVMs
 In 30th International Conference on Machine Learning
, 2013
"... We address the issue of using minibatches in stochastic optimization of SVMs. We show that the same quantity, the spectral norm of the data, controls the parallelization speedup obtained for both primal stochastic subgradient descent (SGD) and stochastic dual coordinate ascent (SCDA) methods and us ..."
Abstract

Cited by 26 (7 self)
 Add to MetaCart
We address the issue of using minibatches in stochastic optimization of SVMs. We show that the same quantity, the spectral norm of the data, controls the parallelization speedup obtained for both primal stochastic subgradient descent (SGD) and stochastic dual coordinate ascent (SCDA) methods and use it to derive novel variants of minibatched SDCA. Our guarantees for both methods are expressed in terms of the original nonsmooth primal problem based on the hingeloss. 1.
Accelerated minibatch stochastic dual coordinate ascent
"... Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the minibatch setting that is often used in practice. Our main contribution is to introduce an accelerated mini ..."
Abstract

Cited by 24 (3 self)
 Add to MetaCart
Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the minibatch setting that is often used in practice. Our main contribution is to introduce an accelerated minibatch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007].