Results 1 - 10 of 36
SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives
, 2014
Cited by 30 (3 self)
In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports non-strongly convex problems directly, and is adaptive to any inherent strong convexity of the problem. We give experimental results showing the effectiveness of our method.
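For orientation, the kind of update the abstract refers to can be sketched as follows. This is a minimal pure-Python illustration and not the authors' code: the lasso-style objective, the names `saga_lasso` and `soft_threshold`, and the step size used in the usage note are all assumptions made for the example.

```python
import random

def soft_threshold(v, t):
    """Proximal operator of t * |v| for a single scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def saga_lasso(A, b, lam, step, epochs, seed=0):
    """Sketch of SAGA for (1/n) * sum_i 0.5*(a_i.x - b_i)^2 + lam*||x||_1.

    A is a list of rows a_i, b a list of targets (hypothetical toy interface).
    """
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = [0.0] * d
    # Table holding the last gradient evaluated for each f_i (initially at x = 0).
    table = [[-b[i] * A[i][k] for k in range(d)] for i in range(n)]
    avg = [sum(table[i][k] for i in range(n)) / n for k in range(d)]
    for _ in range(epochs * n):
        j = rng.randrange(n)
        resid = sum(A[j][k] * x[k] for k in range(d)) - b[j]
        g_new = [resid * A[j][k] for k in range(d)]
        g_old = table[j]
        # Variance-reduced gradient estimate, followed by the proximal step
        # on the regulariser.
        x = [
            soft_threshold(x[k] - step * (g_new[k] - g_old[k] + avg[k]), step * lam)
            for k in range(d)
        ]
        # Keep the running average of the table consistent with the new entry.
        avg = [avg[k] + (g_new[k] - g_old[k]) / n for k in range(d)]
        table[j] = g_new
    return x
```

With `lam = 0` the sketch reduces to plain least squares; a step of `1 / (3 * L)`, where `L` is the largest squared row norm of `A`, is a conservative choice for this toy setting.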
Iteration complexity of feasible descent methods for convex optimization.
 The Journal of Machine Learning Research,
, 2014
Cited by 18 (2 self)
In many machine learning problems such as the dual form of SVM, the objective function to be minimized is convex but not strongly convex. This fact causes difficulties in obtaining the complexity of some commonly used optimization algorithms. In this paper, we proved the global linear convergence on a wide range of algorithms when they are applied to some non-strongly convex problems. In particular, we are the first to prove $O(\log(1/\epsilon))$ time complexity of cyclic coordinate descent methods on dual problems of support vector classification and regression.
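As a concrete picture of the setting, the following is a minimal sketch of cyclic coordinate descent on the dual of a linear L1-loss SVM without a bias term. It is an illustration under assumptions, not the code analysed in the paper: the function name `dual_cd_svm`, the list-based interface, and the fixed number of sweeps are hypothetical.

```python
def dual_cd_svm(X, y, C=1.0, sweeps=100):
    """Cyclic coordinate descent on
        min_alpha 0.5 * alpha^T Q alpha - sum(alpha),  0 <= alpha_i <= C,
    with Q_ij = y_i * y_j * <x_i, x_j>. Returns (w, alpha)."""
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d  # maintained as w = sum_i alpha_i * y_i * x_i
    q = [sum(v * v for v in X[i]) for i in range(n)]  # diagonal of Q
    for _ in range(sweeps):
        for i in range(n):  # cyclic order over dual coordinates
            if q[i] == 0.0:
                continue
            # Partial derivative of the dual objective with respect to alpha_i.
            grad = y[i] * sum(w[k] * X[i][k] for k in range(d)) - 1.0
            # Exact one-dimensional minimisation, clipped to the box [0, C].
            new_alpha = min(max(alpha[i] - grad / q[i], 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                for k in range(d):
                    w[k] += delta * y[i] * X[i][k]
                alpha[i] = new_alpha
    return w, alpha
```

The dual here is convex but not strongly convex whenever the rows of `X` are linearly dependent, which is the situation the abstract addresses.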
Stochastic primal-dual coordinate method for regularized empirical risk minimization.
, 2014
Cited by 12 (2 self)
We consider a generic convex optimization problem associated with regularized empirical risk minimization of linear predictors. The problem structure allows us to reformulate it as a convex-concave saddle point problem. We propose a stochastic primal-dual coordinate (SPDC) method, which alternates between maximizing over a randomly chosen dual variable and minimizing over the primal variable. An extrapolation step on the primal variable is performed to obtain accelerated convergence rate. We also develop a mini-batch version of the SPDC method which facilitates parallel computing, and an extension with weighted sampling probabilities on the dual variables, which has a better complexity than uniform sampling on unnormalized data. Both theoretically and empirically, we show that the SPDC method has comparable or better performance than several state-of-the-art optimization methods.
An Accelerated Proximal Coordinate Gradient Method
, 2014
Cited by 11 (2 self)
We develop an accelerated randomized proximal coordinate gradient (APCG) method for solving a broad class of composite convex optimization problems. In particular, our method achieves faster linear convergence rates for minimizing strongly convex functions than existing randomized proximal coordinate gradient methods. We show how to apply the APCG method to solve the dual of the regularized empirical risk minimization (ERM) problem, and devise efficient implementations that avoid full-dimensional vector operations. For ill-conditioned ERM problems, our method obtains better convergence rates than the state-of-the-art stochastic dual coordinate ascent (SDCA) method.
A universal catalyst for first-order optimization.
 In Advances in Neural Information Processing Systems,
, 2015
Cited by 6 (0 self)
We introduce a generic scheme for accelerating first-order optimization methods in the sense of Nesterov, which builds upon a new analysis of the accelerated proximal point algorithm. Our approach consists of minimizing a convex objective by approximately solving a sequence of well-chosen auxiliary problems, leading to faster convergence. This strategy applies to a large class of algorithms, including gradient descent, block coordinate descent, SAG, SAGA, SDCA, SVRG, Finito/MISO, and their proximal variants. For all of these methods, we provide acceleration and explicit support for non-strongly convex objectives. In addition to theoretical speed-up, we also show that acceleration is useful in practice, especially for ill-conditioned problems where we measure significant improvements.
Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. arXiv preprint arXiv:1506.07512
, 2015
Cited by 5 (0 self)
We develop a family of accelerated stochastic algorithms that optimize sums of convex functions. Our algorithms improve upon the fastest running time for empirical risk minimization (ERM), and in particular linear least-squares regression, across a wide range of problem settings. To achieve this, we establish a framework, based on the classical proximal point algorithm, useful for accelerating recent fast stochastic algorithms in a black-box fashion. Empirically, we demonstrate that the resulting algorithms exhibit notions of stability that are advantageous in practice. Both in theory and in practice, the provided algorithms reap the computational benefits of adding a large strongly convex regularization term, without incurring a corresponding bias to the original ERM problem.
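The classical proximal point idea that this framework builds on can be sketched as below. This is an unaccelerated toy version under assumptions, not the algorithm from the paper: the name `approx_proximal_point`, the gradient-descent inner solver, and all parameters are hypothetical. Each outer step approximately minimises the objective plus a quadratic term centred at the current iterate, so the added regularisation makes the subproblem well conditioned without moving the final solution.

```python
def approx_proximal_point(grad, x0, lam, outer_iters, inner_iters, inner_step):
    """Approximate proximal point iteration.

    Each outer step approximately solves
        min_z f(z) + (lam / 2) * ||z - center||^2
    by a few gradient steps, where grad(z) returns the gradient of f at z.
    """
    x = list(x0)
    for _ in range(outer_iters):
        center = list(x)  # the regulariser is re-centred at the current iterate
        z = list(x)
        for _ in range(inner_iters):
            g = grad(z)
            z = [
                z[k] - inner_step * (g[k] + lam * (z[k] - center[k]))
                for k in range(len(z))
            ]
        x = z
    return x
```

Because the quadratic term is re-centred every outer iteration, the minimiser of `f` is a fixed point of the scheme, which is the "no bias" property the abstract describes.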
Coordinate descent with arbitrary sampling I: Algorithms and complexity
, 2014
Cited by 5 (1 self)
The design and complexity analysis of randomized coordinate descent methods, and in particular of variants which update a random subset (sampling) of coordinates in each iteration, depends on the notion of expected separable overapproximation (ESO). This refers to an inequality involving the objective function and the sampling, capturing in a compact way certain smoothness properties of the function in a random subspace spanned by the sampled coordinates. ESO inequalities were previously established for special classes of samplings only, almost invariably for uniform samplings. In this paper we develop a systematic technique for deriving these inequalities for a large class of functions and for arbitrary samplings. We demonstrate that one can recover existing ESO results using our general approach, which is based on the study of eigenvalues associated with samplings and the data describing the function.
S2CD: Semi-Stochastic Coordinate Descent
, 2014
Cited by 4 (3 self)
We propose a novel reduced variance method, semi-stochastic coordinate descent (S2CD), for the problem of minimizing a strongly convex function represented as the average of a large number of smooth convex functions: $f(x) = \frac{1}{n} \sum_{i} f_i(x)$. Our method first performs a deterministic step (computation of the gradient of $f$ at the starting point), followed by a large number of stochastic steps. The process is repeated a few times, with the last stochastic iterate becoming the new starting point where the deterministic step is taken. The novelty of our method is in how the stochastic steps are performed. In each such step, we pick a random function $f_i$ and a random coordinate $j$, both using nonuniform distributions, and update a single coordinate of the decision vector only, based on the computation of the $j$th partial derivative of $f_i$ at two different points. Each random step of the method constitutes an unbiased estimate of the gradient of $f$ and moreover, the squared norm of the steps goes to zero in expectation, meaning that the method enjoys a reduced variance property. The complexity of the method is the sum of two terms: $O(n \log(1/\epsilon))$ evaluations of gradients $\nabla f_i$ and $O(\hat{\kappa} \log(1/\epsilon))$ evaluations of partial derivatives $\nabla_j f_i$, where $\hat{\kappa}$ is a novel condition number.
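The loop structure described in this abstract can be sketched roughly as follows. This is a simplified illustration and not the authors' method as published: it samples the function index and the coordinate uniformly rather than from nonuniform distributions, uses a fixed number of inner steps, and works on a toy least-squares objective. The name `s2cd_least_squares` and every parameter are hypothetical.

```python
import random

def s2cd_least_squares(A, b, step, epochs, inner_steps, seed=0):
    """Semi-stochastic coordinate descent sketch for
    f(x) = (1/n) * sum_i 0.5 * (a_i . x - b_i)^2, with uniform sampling."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])

    def partial(i, j, x):
        # j-th partial derivative of f_i at x.
        return (sum(A[i][k] * x[k] for k in range(d)) - b[i]) * A[i][j]

    y = [0.0] * d
    for _ in range(epochs):
        # Deterministic step: full gradient of f at the starting point y.
        full = [sum(partial(i, j, y) for i in range(n)) / n for j in range(d)]
        x = list(y)
        for _ in range(inner_steps):
            i = rng.randrange(n)  # random function f_i
            j = rng.randrange(d)  # random coordinate j
            # Partial derivative of f_i at two points (x and y), combined with
            # the stored full gradient to reduce variance.
            g = full[j] + partial(i, j, x) - partial(i, j, y)
            x[j] -= step * g  # update a single coordinate only
        y = x  # the last stochastic iterate becomes the new starting point
    return y
```

Each inner step touches one coordinate of one function, so its cost is a pair of partial-derivative evaluations rather than a full gradient.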
Adding vs. averaging in distributed primal-dual optimization
, 2015
Cited by 4 (1 self)
Distributed optimization methods for large-scale machine learning suffer from a communication bottleneck. It is difficult to reduce this bottleneck while still efficiently and accurately aggregating partial work from different machines. In this paper, we present a novel generalization of the recent communication-efficient primal-dual framework (COCOA) for distributed optimization. Our framework, COCOA+, allows for additive combination of local updates to the global parameters at each iteration, whereas previous schemes with convergence guarantees only allow conservative averaging. We give stronger (primal-dual) convergence rate guarantees for both COCOA as well as our new variants, and generalize the theory for both methods to cover non-smooth convex loss functions. We provide an extensive experimental comparison that shows the markedly improved performance of COCOA+ on several real-world distributed datasets, especially when scaling up the number of machines.