Minimizing Finite Sums with the Stochastic Average Gradient
, 2013
Abstract

Cited by 42 (2 self)
We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values, the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from O(1/√k) to O(1/k) in general, and when the sum is strongly convex the convergence rate is improved from the sublinear O(1/k) to a linear convergence rate of the form O(ρ^k) for ρ < 1. Further, in many cases the convergence rate of the new method is also faster than black-box deterministic gradient methods, in terms of the number of gradient evaluations. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies.
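The gradient-memory idea in this abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes least-squares terms f_i(w) = 0.5 (x_i · w - y_i)^2, uniform sampling, and a constant step size of 1/(16 L) with L the largest squared row norm. The function name `sag_least_squares` and its parameters are hypothetical.

```python
import random


def sag_least_squares(X, y, step, iters, seed=0):
    """Sketch of SAG on (1/n) * sum_i 0.5 * (x_i . w - y_i)^2 (assumed objective)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    memory = [[0.0] * d for _ in range(n)]  # last gradient evaluated for each term
    grad_sum = [0.0] * d                    # running sum of all stored gradients
    for _ in range(iters):
        i = rng.randrange(n)                # pick one term uniformly at random
        residual = sum(X[i][j] * w[j] for j in range(d)) - y[i]
        for j in range(d):
            g_new = residual * X[i][j]      # j-th component of the fresh gradient of term i
            grad_sum[j] += g_new - memory[i][j]  # swap the old stored gradient for the new one
            memory[i][j] = g_new
        for j in range(d):
            w[j] -= step * grad_sum[j] / n  # step along the average of stored gradients
    return w


# Tiny consistent system whose exact solution is w = (1, 2).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
L = max(sum(v * v for v in row) for row in X)
w = sag_least_squares(X, y, step=1.0 / (16.0 * L), iters=20000)
```

Each iteration evaluates a single term's gradient, so its cost does not grow with the number of terms; the price is the `memory` table holding one stored gradient per term.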
Incremental majorization-minimization optimization with application to large-scale machine learning
, 2015
Abstract

Cited by 23 (1 self)
Majorization-minimization algorithms consist of successively minimizing a sequence of upper bounds of the objective function. These upper bounds are tight at the current estimate, and each iteration monotonically drives the objective function downhill. Such a simple principle is widely applicable and has been very popular in various scientific fields, especially in signal processing and statistics. We propose an incremental majorization-minimization scheme for minimizing a large sum of continuous functions, a problem of utmost importance in machine learning. We present convergence guarantees for nonconvex and convex optimization when the upper bounds approximate the objective up to a smooth error; we call such upper bounds “first-order surrogate functions.” More precisely, we study asymptotic stationary point guarantees for nonconvex problems, and for convex ones, we provide convergence rates for the expected objective function value. We apply our scheme to composite optimization and obtain a new incremental proximal gradient algorithm with linear convergence rate for strongly convex functions. Our experiments show that our method is competitive with the state of the art for solving machine learning problems such as logistic regression when the number of training samples is large enough, and we demonstrate its usefulness for sparse estimation with nonconvex penalties.
Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
, 2011
Abstract

Cited by 9 (3 self)
We consider supervised learning problems where the features are embedded in a graph, such as gene expressions in a gene network. In this context, it is of much interest to take into account the problem structure, and automatically select a subgraph with a small number of connected components. By exploiting prior knowledge, one can indeed improve the prediction performance and/or obtain more interpretable results. Regularization or penalty functions for selecting features in graphs have recently been proposed, but they raise new algorithmic challenges. For example, they typically require solving a combinatorially hard selection problem among all connected subgraphs. In this paper, we propose computationally feasible strategies to select a sparse and “well-connected” subset of features sitting on a directed acyclic graph (DAG). We introduce structured sparsity penalties over paths on a DAG called “path coding” penalties. Unlike existing regularization functions, path coding penalties can both model long-range interactions between features in the graph and be tractable. The penalties and their proximal operators involve path selection problems, which we efficiently solve by leveraging network flow optimization. We experimentally show on synthetic, image, and genomic data that our approach is scalable and leads to more connected subgraphs than other regularization functions for graphs.
Finito: A Faster, Permutable Incremental Gradient Method for Big Data Problems
Abstract

Cited by 6 (1 self)
Recent advances in optimization theory have shown that smooth strongly convex finite sums can be minimized faster than by treating them as a black-box “batch” problem. In this work we introduce a new method in this class with a theoretical convergence rate four times faster than existing methods, for sums with sufficiently many terms. This method is also amenable to a sampling-without-replacement scheme that in practice gives further speedups. We give empirical results showing state-of-the-art performance.
A stochastic coordinate descent primal-dual algorithm and applications to large-scale composite optimization
, 2014
Abstract

Cited by 5 (2 self)
Based on the idea of randomized coordinate descent of α-averaged operators, a randomized primal-dual optimization algorithm is introduced, where a random subset of coordinates is updated at each iteration. The algorithm builds upon a variant of a recent (deterministic) algorithm proposed by Vũ and Condat that includes the well-known ADMM as a particular case. The obtained algorithm is used to solve asynchronously a distributed optimization problem. A network of agents, each having a separate cost function containing a differentiable term, seeks a consensus on the minimum of the aggregate objective. The method yields an algorithm where, at each iteration, a random subset of agents wake up, update their local estimates, exchange some data with their neighbors, and go idle. Numerical results demonstrate the attractive performance of the method. The general approach can be naturally adapted to other situations where coordinate descent convex optimization algorithms are used with a random choice of the coordinates.
Parallel successive convex approximation for nonsmooth nonconvex optimization
, 2014
Abstract

Cited by 4 (2 self)
Consider the problem of minimizing the sum of a smooth (possibly nonconvex) and a convex (possibly nonsmooth) function involving a large number of variables. A popular approach to solving this problem is the block coordinate descent (BCD) method, whereby at each iteration only one variable block is updated while the remaining variables are held fixed. With recent advances in multicore parallel processing technology, it is desirable to parallelize the BCD method by allowing multiple blocks to be updated simultaneously at each iteration of the algorithm. In this work, we propose an inexact parallel BCD approach where, at each iteration, a subset of the variables is updated in parallel by minimizing convex approximations of the original objective function. We investigate the convergence of this parallel BCD method for both randomized and cyclic variable selection rules. We analyze the asymptotic and nonasymptotic convergence behavior of the algorithm for both convex and nonconvex objective functions. The numerical experiments suggest that, for a special case of the Lasso minimization problem, the cyclic block selection rule can outperform the randomized rule.
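For orientation, the classical sequential baseline that this abstract starts from (one block updated at a time under a cyclic rule) can be sketched for the Lasso. This is not the parallel, inexact method the paper proposes. It assumes single-coordinate blocks and the objective 0.5 ||Xw - y||^2 + lam ||w||_1; the names `soft_threshold` and `lasso_cd` are hypothetical.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, sweeps):
    """Cyclic coordinate descent sketch for 0.5 * ||Xw - y||^2 + lam * ||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    resid = list(y)  # maintained as y - Xw; equals y because w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(d)]
    for _ in range(sweeps):
        for j in range(d):  # cyclic rule: visit coordinates in a fixed order
            if col_sq[j] == 0.0:
                continue    # an all-zero column never influences the fit
            # Correlation of column j with the residual that excludes coordinate j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * w[j]
            w_new = soft_threshold(rho, lam) / col_sq[j]
            delta = w_new - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta  # keep resid equal to y - Xw
                w[j] = w_new
    return w


# With an orthonormal design, the solution is soft-thresholding applied to y.
w = lasso_cd([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0, sweeps=5)
```

A randomized rule would replace the inner `for j in range(d)` loop with random draws of `j`; the coordinate update itself stays the same.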
Fast Stochastic Alternating Direction Method of Multipliers
Abstract

Cited by 4 (0 self)
We propose a new stochastic alternating direction method of multipliers (ADMM) algorithm, which incrementally approximates the full gradient in the linearized ADMM formulation. Besides having a low per-iteration complexity as existing stochastic ADMM algorithms, it improves the convergence rate on convex problems from O(1/√T) to O(1/T), where T is the number of iterations. This matches the convergence rate of the batch ADMM algorithm, but without the need to visit all the samples in each iteration. Experiments on the graph-guided fused lasso demonstrate that the new algorithm is significantly faster than state-of-the-art stochastic and batch ADMM algorithms.
A Greedy Framework for First-Order Optimization
Abstract

Cited by 1 (0 self)
Introduction. Recent work has shown many connections between conditional gradient and other first-order optimization methods, such as herding [3] and subgradient descent [2]. By considering a type of proximal conditional method, which we call boosted mirror descent (BMD), we are able to unify all of these algorithms into a single framework, which can be interpreted as taking successive argmins of a sequence of surrogate functions. Using a standard online learning analysis based on
Semistochastic quadratic bound methods
 In ICLR Workshop
, 2014
Abstract

Cited by 1 (1 self)
Partition functions arise in a variety of settings, including conditional random fields, logistic regression, and latent Gaussian models. In this paper, we consider semistochastic quadratic bound (SQB) methods for maximum likelihood estimation based on partition function optimization. Batch methods based on the quadratic bound were recently proposed for this class of problems, and performed favorably in comparison to state-of-the-art techniques. Semistochastic methods fall in between batch algorithms, which use all the data, and stochastic gradient type methods, which use small random selections at each iteration. We build semistochastic quadratic bound-based methods, and prove both global convergence (to a stationary point) under very weak assumptions, and linear convergence rate under stronger assumptions on the objective. To make the proposed methods faster and more stable, we consider inexact subproblem minimization and batch-size selection schemes. The efficacy of SQB methods is demonstrated via comparison with several state-of-the-art techniques on commonly used datasets.
Iterative Splits of Quadratic Bounds for Scalable Binary Tensor Factorization
Abstract

Cited by 1 (0 self)
Binary matrices and tensors are popular data structures that need to be efficiently approximated by low-rank representations. A standard approach is to minimize the logistic loss, which is well suited for binary data. In many cases, the number m of nonzero elements in the tensor is much smaller than the total number n of possible entries in the tensor. This creates a problem for large tensors because computing the logistic loss has time complexity linear in n. In this work, we show that an alternative approach is to minimize the quadratic loss (root mean square error), which leads to algorithms with a training time complexity that is reduced from O(n) to O(m), as proposed earlier in the restricted case of alternating least-squares algorithms. In addition, we propose and study a greedy algorithm that partitions the tensor into smaller tensors, each approximated by a quadratic upper bound. This technique provides a time-accuracy tradeoff between a fast but approximate algorithm and an accurate but slow algorithm. We show that this technique leads to a considerable speedup in learning of real-world tensors.
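One standard way to see why the quadratic loss avoids a pass over all n entries is the following identity for the matrix (order-2) case, given here as an illustrative sketch and not necessarily the paper's derivation. For a binary matrix with nonzero set Ω and a rank-r model with rows u_i and v_j, Σ_{i,j} (x_ij - u_i·v_j)^2 = |Ω| - 2 Σ_{(i,j)∈Ω} u_i·v_j + Σ_{a,b} (UᵀU)_ab (VᵀV)_ab. The function names below are hypothetical, and `nonzeros` is assumed to list each nonzero position once.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dense_loss(nonzeros, U, V):
    """Squared error summed over every entry: touches all n = rows * cols cells."""
    nz = set(nonzeros)
    total = 0.0
    for i, u in enumerate(U):
        for j, v in enumerate(V):
            x = 1.0 if (i, j) in nz else 0.0
            total += (x - dot(u, v)) ** 2
    return total


def gram(M):
    """r x r Gram matrix M^T M of a factor matrix stored as a list of rows."""
    r = len(M[0])
    return [[sum(row[a] * row[b] for row in M) for b in range(r)] for a in range(r)]


def sparse_loss(nonzeros, U, V):
    """Same loss computed from the nonzero positions and two small Gram matrices."""
    r = len(U[0])
    GU, GV = gram(U), gram(V)
    # Sum over all (i, j) of (u_i . v_j)^2, obtained without enumerating the pairs.
    all_pairs = sum(GU[a][b] * GV[a][b] for a in range(r) for b in range(r))
    cross = sum(dot(U[i], V[j]) for i, j in nonzeros)
    return len(nonzeros) - 2.0 * cross + all_pairs


U = [[1.0, 0.0], [0.5, 0.5]]
V = [[1.0, 1.0], [0.0, 2.0], [0.5, 0.0]]
nonzeros = [(0, 0), (1, 1)]
gap = abs(dense_loss(nonzeros, U, V) - sparse_loss(nonzeros, U, V))
```

In this sketch the cost of `sparse_loss` scales with the number of nonzeros and the size of the factor matrices, not with the number of cells, which matches the O(n) to O(m) reduction the abstract describes up to rank-dependent factors.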