Results 1  10
of
55
Stochastic Dual Coordinate Ascent Methods
, 2013
"... Stochastic Gradient Descent (SGD) has become popular for solving large scale supervised machine learning optimization problems such as SVM, due to their strong theoretical guarantees. While the closely related Dual Coordinate Ascent (DCA) method has been implemented in various software packages, it ..."
Abstract

Cited by 103 (13 self)
 Add to MetaCart
(Show Context)
Stochastic Gradient Descent (SGD) has become popular for solving large scale supervised machine learning optimization problems such as SVM, due to their strong theoretical guarantees. While the closely related Dual Coordinate Ascent (DCA) method has been implemented in various software packages, it has so far lacked good convergence analysis. This paper presents a new analysis of Stochastic Dual Coordinate Ascent (SDCA) showing that this class of methods enjoy strong theoretical guarantees that are comparable or better than SGD. This analysis justifies the effectiveness of SDCA for practical applications.
Revisiting frankwolfe: Projectionfree sparse convex optimization
 In ICML
, 2013
"... We provide stronger and more general primaldual convergence results for FrankWolfetype algorithms (a.k.a. conditional gradient) for constrained convex optimization, enabled by a simple framework of duality gap certificates. Our analysis also holds if the linear subproblems are only solved approxi ..."
Abstract

Cited by 86 (2 self)
 Add to MetaCart
(Show Context)
We provide stronger and more general primaldual convergence results for FrankWolfetype algorithms (a.k.a. conditional gradient) for constrained convex optimization, enabled by a simple framework of duality gap certificates. Our analysis also holds if the linear subproblems are only solved approximately (as well as if the gradients are inexact), and is proven to be worstcase optimal in the sparsity of the obtained solutions. On the application side, this allows us to unify a large variety of existing sparse greedy methods, in particular for optimization over convex hulls of an atomic set, even if those sets can only be approximated, including sparse (or structured sparse) vectors or matrices, lowrank matrices, permutation matrices, or maxnorm bounded matrices. We present a new general framework for convex optimization over matrix factorizations, where every FrankWolfe iteration will consist of a lowrank update, and discuss the broad application areas of this approach. 1.
Minimizing Finite Sums with the Stochastic Average Gradient
, 2013
"... We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method’s iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradie ..."
Abstract

Cited by 42 (2 self)
 Add to MetaCart
We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method’s iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values the SAG method achieves a faster convergence rate than blackbox SG methods. The convergence rate is improved from O(1 / √ k) to O(1/k) in general, and when the sum is stronglyconvex the convergence rate is improved from the sublinear O(1/k) to a linear convergence rate of the form O(ρ k) for ρ < 1. Further, in many cases the convergence rate of the new method is also faster than blackbox deterministic gradient methods, in terms of the number of gradient evaluations. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of nonuniform sampling strategies. 1
Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization
"... We introduce a proximal version of the stochastic dual coordinate ascent method and show how to accelerate the method using an innerouter iteration procedure. We analyze the runtime of the framework and obtain rates that improve stateoftheart results for various key machine learning optimizat ..."
Abstract

Cited by 33 (2 self)
 Add to MetaCart
(Show Context)
We introduce a proximal version of the stochastic dual coordinate ascent method and show how to accelerate the method using an innerouter iteration procedure. We analyze the runtime of the framework and obtain rates that improve stateoftheart results for various key machine learning optimization problems including SVM, logistic regression, ridge regression, Lasso, and multiclass SVM. Experiments validate our theoretical findings. 1.
Proximal stochastic dual coordinate ascent
 CoRR
"... We introduce a proximal version of dual coordinate ascent method. We demonstrate how the derived algorithmic framework can be used for numerous regularized loss minimization problems, including `1 regularization and structured output SVM. The convergence rates we obtain match, and sometimes improve, ..."
Abstract

Cited by 29 (4 self)
 Add to MetaCart
We introduce a proximal version of dual coordinate ascent method. We demonstrate how the derived algorithmic framework can be used for numerous regularized loss minimization problems, including `1 regularization and structured output SVM. The convergence rates we obtain match, and sometimes improve, stateoftheart results. 1
Optimization with firstorder surrogate functions
 In Proceedings of the International Conference on Machine Learning (ICML
, 2013
"... In this paper, we study optimization methods consisting of iteratively minimizing surrogates of an objective function. By proposing several algorithmic variants and simple convergence analyses, we make two main contributions. First, we provide a unified viewpoint for several firstorder optimization ..."
Abstract

Cited by 23 (2 self)
 Add to MetaCart
In this paper, we study optimization methods consisting of iteratively minimizing surrogates of an objective function. By proposing several algorithmic variants and simple convergence analyses, we make two main contributions. First, we provide a unified viewpoint for several firstorder optimization techniques such as accelerated proximal gradient, block coordinate descent, or FrankWolfe algorithms. Second, we introduce a new incremental scheme that experimentally matches or outperforms stateoftheart solvers for largescale optimization problems typically arising in machine learning. 1.
Trading computation for communication: Distributed stochastic dual coordinate ascent
 in NIPS
, 2013
"... We present and study a distributed optimization algorithm by employing a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often have better performances than stochastic gradient descent methods in optimizing regularized lo ..."
Abstract

Cited by 16 (2 self)
 Add to MetaCart
(Show Context)
We present and study a distributed optimization algorithm by employing a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often have better performances than stochastic gradient descent methods in optimizing regularized loss minimization problems. It still lacks of efforts in studying them in a distributed framework. We make a progress along the line by presenting a distributed stochastic dual coordinate ascent algorithm in a star network, with an analysis of the tradeoff between computation and communication. We verify our analysis by experiments on real data sets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performances. 1
On optimal probabilities in stochastic coordinate descent methods. arXiv:1310.3438
, 2013
"... Abstract We propose and analyze a new parallel coordinate descent method'NSyncin which at each iteration a random subset of coordinates is updated, in parallel, allowing for the subsets to be chosen nonuniformly. We derive convergence rates under a strong convexity assumption, and comment on ..."
Abstract

Cited by 16 (4 self)
 Add to MetaCart
(Show Context)
Abstract We propose and analyze a new parallel coordinate descent method'NSyncin which at each iteration a random subset of coordinates is updated, in parallel, allowing for the subsets to be chosen nonuniformly. We derive convergence rates under a strong convexity assumption, and comment on how to assign probabilities to the sets to optimize the bound. The complexity and practical performance of the method can outperform its uniform variant by an order of magnitude. Surprisingly, the strategy of updating a single randomly selected coordinate per iterationwith optimal probabilitiesmay require less iterations, both in theory and practice, than the strategy of updating all coordinates at every iteration.
Efficient Image and Video Colocalization with FrankWolfe Algorithm
 In ECCV
"... Abstract. In this paper, we tackle the problem of performing efficient colocalization in images and videos. Colocalization is the problem of simultaneously localizing (with bounding boxes) objects of the same class across a set of distinct images or videos. Building upon recent stateoftheart m ..."
Abstract

Cited by 13 (0 self)
 Add to MetaCart
(Show Context)
Abstract. In this paper, we tackle the problem of performing efficient colocalization in images and videos. Colocalization is the problem of simultaneously localizing (with bounding boxes) objects of the same class across a set of distinct images or videos. Building upon recent stateoftheart methods, we show how we are able to naturally incorporate temporal terms and constraints for video colocalization into a quadratic programming framework. Furthermore, by leveraging the FrankWolfe algorithm (or conditional gradient), we show how our optimization formulations for both images and videos can be reduced to solving a succession of simple integer programs, leading to increased efficiency in both memory and speed. To validate our method, we present experimental results on the PASCAL VOC 2007 dataset for images and the YouTubeObjects dataset for videos, as well as a joint combination of the two. 1
A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization
, 2013
"... Linear optimization is many times algorithmically simpler than nonlinear convex optimization. Linear optimization over matroid polytopes, matching polytopes and path polytopes are example of problems for which we have simple and efficient combinatorial algorithms, but whose nonlinear convex count ..."
Abstract

Cited by 11 (2 self)
 Add to MetaCart
(Show Context)
Linear optimization is many times algorithmically simpler than nonlinear convex optimization. Linear optimization over matroid polytopes, matching polytopes and path polytopes are example of problems for which we have simple and efficient combinatorial algorithms, but whose nonlinear convex counterpart is harder and admits significantly less efficient algorithms. This motivates the computational model of convex optimization, including the offline, online and stochastic settings, using a linear optimization oracle. In this computational model we give several new results that improve over the previous stateoftheart. Our main result is a novel conditional gradient algorithm for smooth and strongly convex optimization over polyhedral sets that performs only a single linear optimization step over the domain on each iteration and enjoys a linear convergence rate. This gives an exponential improvement in convergence rate over previous results. Based on this new conditional gradient algorithm we give the first algorithms for online convex optimization over polyhedral sets that perform only a single linear optimization step over the domain while having optimal regret guarantees, answering an open question of Kalai and Vempala, and Hazan and Kale. Our online algorithms also imply conditional gradient algorithms for nonsmooth and stochastic convex optimization with the same convergence rates as projected (sub)gradient methods. Key words. frankwolfe algorithm; conditional gradient methods; linear programming; firstorder methods; online convex optimization; online learning; stochastic optimization AMS subject classifications. 65K05; 90C05; 90C06; 90C25; 90C30; 90C27; 90C15