Results 1  10
of
34
Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
"... Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVR ..."
Abstract

Cited by 52 (4 self)
 Add to MetaCart
Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning. 1
A proximal stochastic gradient method with progressive variance reduction. arXiv preprint arXiv:1403.4699
, 2014
"... ar ..."
(Show Context)
SAGA: A Fast Incremental Gradient Method With Support for NonStrongly Convex Composite Objectives
, 2014
"... In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and ha ..."
Abstract

Cited by 16 (2 self)
 Add to MetaCart
(Show Context)
In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports nonstrongly convex problems directly, and is adaptive to any inherent strong convexity of the problem. We give experimental results showing the effectiveness of our method. 1
Incremental majorizationminimization optimization with application to largescale machine learning. arXiv preprint arXiv:1402.4419
, 2014
"... Abstract. Majorizationminimization algorithms consist of successively minimizing a sequence of upper bounds of the objective function. These upper bounds are tight at the current estimate, and each iteration monotonically drives the objective function downhill. Such a simple principle is widely app ..."
Abstract

Cited by 14 (0 self)
 Add to MetaCart
Abstract. Majorizationminimization algorithms consist of successively minimizing a sequence of upper bounds of the objective function. These upper bounds are tight at the current estimate, and each iteration monotonically drives the objective function downhill. Such a simple principle is widely applicable and has been very popular in various scientific fields, especially in signal processing and statistics. We propose an incremental majorizationminimization scheme for minimizing a large sum of continuous functions, a problem of utmost importance in machine learning. We present convergence guarantees for nonconvex and convex optimization when the upper bounds approximate the objective up to a smooth error; we call such upper bounds “firstorder surrogate functions.” More precisely, we study asymptotic stationary point guarantees for nonconvex problems, and for convex ones, we provide convergence rates for the expected objective function value. We apply our scheme to composite optimization and obtain a new incremental proximal gradient algorithm with linear convergence rate for strongly convex functions. Our experiments show that our method is competitive with the state of the art for solving machine learning problems such as logistic regression when the number of training samples is large enough, and we demonstrate its usefulness for sparse estimation with nonconvex penalties.
An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization
, 2014
"... ar ..."
(Show Context)
Randomized dual coordinate ascent with arbitrary sampling. arXiv:1411.5873
, 2014
"... We study the problem of minimizing the average of a large number of smooth convex functions penalized with a strongly convex regularizer. We propose and analyze a novel primaldual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according ..."
Abstract

Cited by 6 (4 self)
 Add to MetaCart
We study the problem of minimizing the average of a large number of smooth convex functions penalized with a strongly convex regularizer. We propose and analyze a novel primaldual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according to an arbitrary distribution. In contrast to typical analysis, we directly bound the decrease of the primaldual error (in expectation), without the need to first analyze the dual error. Depending on the choice of the sampling, we obtain efficient serial, parallel and distributed variants of the method. In the serial case, our bounds match the best known bounds for SDCA (both with uniform and importance sampling). With standard minibatching, our bounds predict initial dataindependent speedup as well as additional datadriven speedup which depends on spectral and sparsity properties of the data. We calculate theoretical speedup factors and find that they are excellent predictors of actual speedup in practice. Moreover, we illustrate that it is possible to design an efficient minibatch importance sampling. The distributed variant of Quartz is the first distributed SDCAlike method with an analysis for nonseparable data. 1
Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm
 Advances in Neural Information Processing Systems
, 2014
"... ar ..."
(Show Context)
Smoothed Gradients for Stochastic Variational Inference
 In Advances in Neural Information Processing Systems 27
, 2014
"... The field of statistical machine learning has seen a rapid progress in complex hierarchical Bayesian models. In Stochastic Variational Inference (SVI), the inference problem is mapped to an optimization problem involving stochastic gradients. While this scheme was shown to scale up to massive data ..."
Abstract

Cited by 4 (1 self)
 Add to MetaCart
(Show Context)
The field of statistical machine learning has seen a rapid progress in complex hierarchical Bayesian models. In Stochastic Variational Inference (SVI), the inference problem is mapped to an optimization problem involving stochastic gradients. While this scheme was shown to scale up to massive data sets, the intrinsic noise of the stochastic gradients impedes a fast convergence. Inspired by gradient averaging methods from stochastic optimization, we propose a variance reduction scheme tailored to SVI by averaging successively over the sufficient statistics of the local variational parameters. Its simplicity comes at the cost of biased stochastic gradients. We show that we can eliminate large parts of the bias while obtaining the same variance reduction as in simple gradient averaging schemes. We explore the tradeoff between variance and bias based on the example of Latent Dirichlet Allocation. 1
S2CD: SemiStochastic Coordinate Descent
, 2014
"... We propose a novel reduced variance method—semistochastic coordinate descent (S2CD)—for the problem of minimizing a strongly convex function represented as the average of a large number of smooth convex functions: f(x) = 1n i fi(x). Our method first performs a deterministic step (computation of th ..."
Abstract

Cited by 4 (3 self)
 Add to MetaCart
(Show Context)
We propose a novel reduced variance method—semistochastic coordinate descent (S2CD)—for the problem of minimizing a strongly convex function represented as the average of a large number of smooth convex functions: f(x) = 1n i fi(x). Our method first performs a deterministic step (computation of the gradient of f at the starting point), followed by a large number of stochastic steps. The process is repeated a few times, with the last stochastic iterate becoming the new starting point where the deterministic step is taken. The novelty of our method is in how the stochastic steps are performed. In each such step, we pick a random function fi and a random coordinate j—both using nonuniform distributions—and update a single coordinate of the decision vector only, based on the computation of the jth partial derivative of fi at two different points. Each random step of the method constitutes an unbiased estimate of the gradient of f and moreover, the squared norm of the steps goes to zero in expectation, meaning that the method enjoys a reduced variance property. The complexity of the method is the sum of two terms: O(n log(1/)) evaluations of gradients ∇fi and O(κ ̂ log(1/)) evaluations of partial derivatives∇jfi, where κ ̂ is a novel condition number. 1
Coordinate descent with arbitrary sampling I: Algorithms and complexity
, 2014
"... The design and complexity analysis of randomized coordinate descent methods, and in particular of variants which update a random subset (sampling) of coordinates in each iteration, depends on the notion of expected separable overapproximation (ESO). This refers to an inequality involving the objec ..."
Abstract

Cited by 3 (1 self)
 Add to MetaCart
(Show Context)
The design and complexity analysis of randomized coordinate descent methods, and in particular of variants which update a random subset (sampling) of coordinates in each iteration, depends on the notion of expected separable overapproximation (ESO). This refers to an inequality involving the objective function and the sampling, capturing in a compact way certain smoothness properties of the function in a random subspace spanned by the sampled coordinates. ESO inequalities were previously established for special classes of samplings only, almost invariably for uniform samplings. In this paper we develop a systematic technique for deriving these inequalities for a large class of functions and for arbitrary samplings. We demonstrate that one can recover existing ESO results using our general approach, which is based on the study of eigenvalues associated with samplings and the data describing the function. 1