Incremental majorization-minimization optimization with application to large-scale machine learning, 2015
Cited by 23 (1 self)
Majorization-minimization algorithms consist of successively minimizing a sequence of upper bounds of the objective function. These upper bounds are tight at the current estimate, and each iteration monotonically drives the objective function downhill. Such a simple principle is widely applicable and has been very popular in various scientific fields, especially in signal processing and statistics. We propose an incremental majorization-minimization scheme for minimizing a large sum of continuous functions, a problem of utmost importance in machine learning. We present convergence guarantees for nonconvex and convex optimization when the upper bounds approximate the objective up to a smooth error; we call such upper bounds “first-order surrogate functions.” More precisely, we study asymptotic stationary point guarantees for nonconvex problems, and for convex ones, we provide convergence rates for the expected objective function value. We apply our scheme to composite optimization and obtain a new incremental proximal gradient algorithm with linear convergence rate for strongly convex functions. Our experiments show that our method is competitive with the state of the art for solving machine learning problems such as logistic regression when the number of training samples is large enough, and we demonstrate its usefulness for sparse estimation with nonconvex penalties.
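The majorization-minimization principle described in this abstract can be illustrated with a small sketch. It shows the plain batch scheme with a quadratic upper bound, not the incremental algorithm the paper proposes. The toy data, the regularization weight, and all function names are assumptions made for illustration only.

```python
import math

# Hypothetical toy data: (feature, label) pairs for 1-D regularized logistic regression.
DATA = [(1.0, 1), (2.0, -1), (-1.5, 1), (0.5, 1)]
LAM = 0.1  # assumed regularization weight


def objective(x):
    """Average logistic loss plus a squared penalty."""
    loss = sum(math.log1p(math.exp(-y * a * x)) for a, y in DATA) / len(DATA)
    return loss + 0.5 * LAM * x * x


def gradient(x):
    """Derivative of the objective with respect to x."""
    g = sum(-y * a / (1.0 + math.exp(y * a * x)) for a, y in DATA) / len(DATA)
    return g + LAM * x


def mm_minimize(x0, iters=200):
    # L upper-bounds the curvature of the objective, so the quadratic
    # g_k(x) = f(x_k) + f'(x_k)(x - x_k) + (L/2)(x - x_k)^2
    # lies above f everywhere and touches it at x_k.
    L = max(a * a for a, _ in DATA) / 4.0 + LAM
    x, history = x0, [objective(x0)]
    for _ in range(iters):
        x = x - gradient(x) / L  # exact minimizer of the surrogate g_k
        history.append(objective(x))
    return x, history


x_hat, values = mm_minimize(0.0)
```

Because each surrogate is an upper bound that is tight at the current iterate, the recorded objective values should never increase, which is the monotone descent property the abstract refers to.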
Randomized dual coordinate ascent with arbitrary sampling, 2014
Cited by 7 (4 self)
We study the problem of minimizing the average of a large number of smooth convex functions penalized with a strongly convex regularizer. We propose and analyze a novel primal-dual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according to an arbitrary distribution. In contrast to typical analysis, we directly bound the decrease of the primal-dual error (in expectation), without the need to first analyze the dual error. Depending on the choice of the sampling, we obtain efficient serial, parallel and distributed variants of the method. In the serial case, our bounds match the best known bounds for SDCA (both with uniform and importance sampling). With standard mini-batching, our bounds predict initial data-independent speedup as well as additional data-driven speedup which depends on spectral and sparsity properties of the data. We calculate theoretical speedup factors and find that they are excellent predictors of actual speedup in practice. Moreover, we illustrate that it is possible to design an efficient mini-batch importance sampling. The distributed variant of Quartz is the first distributed SDCA-like method with an analysis for non-separable data.
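To make the serial setting concrete, the following is a minimal sketch of a dual coordinate ascent loop in which the dual variable to update is drawn from a user-supplied distribution. It uses the standard closed-form SDCA update for the squared loss, not the Quartz update itself, and it is restricted to one feature so that the result can be checked against the closed-form ridge solution. The data, the function name `sdca_ridge`, and the particular importance weights are assumptions for illustration.

```python
import random


def sdca_ridge(xs, ys, lam, iters, probs=None, seed=0):
    """Dual coordinate ascent for 1-D ridge regression:
    min_w (1/n) * sum_i 0.5 * (x_i * w - y_i)^2 + 0.5 * lam * w^2.
    `probs` holds (unnormalized) sampling weights over the dual variables.
    """
    rng = random.Random(seed)
    n = len(xs)
    weights = probs if probs is not None else [1.0] * n  # uniform by default
    alpha = [0.0] * n
    w = 0.0  # invariant: w == sum(alpha_i * x_i) / (lam * n)
    for _ in range(iters):
        i = rng.choices(range(n), weights=weights)[0]
        # Exact maximization of the dual objective over coordinate i.
        delta = (ys[i] - xs[i] * w - alpha[i]) / (1.0 + xs[i] ** 2 / (lam * n))
        alpha[i] += delta
        w += delta * xs[i] / (lam * n)
    return w, alpha


# Hypothetical toy problem.
XS = [1.0, 2.0, -1.0, 0.5, 3.0]
YS = [1.0, 2.5, -0.5, 0.0, 2.0]
LAM = 0.5
N = len(XS)

w_closed_form = sum(x * y for x, y in zip(XS, YS)) / (sum(x * x for x in XS) + LAM * N)
w_uniform, _ = sdca_ridge(XS, YS, LAM, iters=5000)
# Assumed importance weights: larger for examples with larger norm.
importance = [1.0 + x * x / (LAM * N) for x in XS]
w_importance, _ = sdca_ridge(XS, YS, LAM, iters=5000, probs=importance)
```

Both sampling choices should reach the same minimizer; what the sampling distribution changes is the convergence speed, which is the quantity the paper's bounds describe.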
A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015
Cited by 6 (0 self)
We introduce a generic scheme for accelerating first-order optimization methods in the sense of Nesterov, which builds upon a new analysis of the accelerated proximal point algorithm. Our approach consists of minimizing a convex objective by approximately solving a sequence of well-chosen auxiliary problems, leading to faster convergence. This strategy applies to a large class of algorithms, including gradient descent, block coordinate descent, SAG, SAGA, SDCA, SVRG, Finito/MISO, and their proximal variants. For all of these methods, we provide acceleration and explicit support for non-strongly convex objectives. In addition to theoretical speedup, we also show that acceleration is useful in practice, especially for ill-conditioned problems where we measure significant improvements.
Unregularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. arXiv preprint arXiv:1506.07512, 2015
Cited by 5 (0 self)
We develop a family of accelerated stochastic algorithms that optimize sums of convex functions. Our algorithms improve upon the fastest running time for empirical risk minimization (ERM), and in particular linear least-squares regression, across a wide range of problem settings. To achieve this, we establish a framework, based on the classical proximal point algorithm, useful for accelerating recent fast stochastic algorithms in a black-box fashion. Empirically, we demonstrate that the resulting algorithms exhibit notions of stability that are advantageous in practice. Both in theory and in practice, the provided algorithms reap the computational benefits of adding a large strongly convex regularization term, without incurring a corresponding bias to the original ERM problem.
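The idea of adding a strongly convex term without biasing the solution can be sketched with the classical proximal point iteration, solved only approximately at each step. This is the plain, unaccelerated iteration on a one-dimensional least-squares problem, not the accelerated algorithms of the paper. The data, the inner solver (a few gradient steps), and the name `approx_prox_point` are assumptions for illustration.

```python
def approx_prox_point(a, y, lam, outer=60, inner=5):
    """Approximate proximal point for f(x) = (1/n) * sum_i 0.5 * (a_i * x - y_i)^2.

    Each outer step approximately minimizes f(x) + (lam / 2) * (x - center)^2
    with a few gradient steps, then re-centers the regularizer at the result.
    """
    n = len(a)
    curv = sum(ai * ai for ai in a) / n               # second derivative of f
    lin = sum(ai * yi for ai, yi in zip(a, y)) / n    # f'(x) = curv * x - lin
    step = 1.0 / (max(ai * ai for ai in a) + lam)     # conservative step size
    x = 0.0
    for _ in range(outer):
        center = x
        for _ in range(inner):
            grad = curv * x - lin + lam * (x - center)
            x -= step * grad
    return x


# Hypothetical toy problem.
A = [1.0, 2.0, 0.5]
Y = [2.0, 3.9, 1.1]
LAM = 1.0

x_pp = approx_prox_point(A, Y, LAM)
x_erm = sum(ai * yi for ai, yi in zip(A, Y)) / sum(ai * ai for ai in A)
# Minimizer of f(x) + (LAM / 2) * x^2, i.e. with the regularizer fixed at zero.
x_biased = sum(ai * yi for ai, yi in zip(A, Y)) / (sum(ai * ai for ai in A) + LAM * len(A))
```

Re-centering the quadratic term at every outer step is what removes the bias: the iterates approach the unregularized minimizer `x_erm`, whereas keeping the regularizer centered at zero would yield `x_biased`.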
Coordinate descent with arbitrary sampling I: Algorithms and complexity, 2014
Cited by 5 (1 self)
The design and complexity analysis of randomized coordinate descent methods, and in particular of variants which update a random subset (sampling) of coordinates in each iteration, depends on the notion of expected separable overapproximation (ESO). This refers to an inequality involving the objective function and the sampling, capturing in a compact way certain smoothness properties of the function in a random subspace spanned by the sampled coordinates. ESO inequalities were previously established for special classes of samplings only, almost invariably for uniform samplings. In this paper we develop a systematic technique for deriving these inequalities for a large class of functions and for arbitrary samplings. We demonstrate that one can recover existing ESO results using our general approach, which is based on the study of eigenvalues associated with samplings and the data describing the function.
S2CD: Semi-Stochastic Coordinate Descent, 2014
Cited by 4 (3 self)
We propose a novel reduced variance method, semi-stochastic coordinate descent (S2CD), for the problem of minimizing a strongly convex function represented as the average of a large number of smooth convex functions: $f(x) = \frac{1}{n}\sum_i f_i(x)$. Our method first performs a deterministic step (computation of the gradient of $f$ at the starting point), followed by a large number of stochastic steps. The process is repeated a few times, with the last stochastic iterate becoming the new starting point where the deterministic step is taken. The novelty of our method is in how the stochastic steps are performed. In each such step, we pick a random function $f_i$ and a random coordinate $j$, both using nonuniform distributions, and update a single coordinate of the decision vector only, based on the computation of the $j$th partial derivative of $f_i$ at two different points. Each random step of the method constitutes an unbiased estimate of the gradient of $f$ and moreover, the squared norm of the steps goes to zero in expectation, meaning that the method enjoys a reduced variance property. The complexity of the method is the sum of two terms: $O(n \log(1/\varepsilon))$ evaluations of gradients $\nabla f_i$ and $O(\hat{\kappa} \log(1/\varepsilon))$ evaluations of partial derivatives $\nabla_j f_i$, where $\hat{\kappa}$ is a novel condition number.
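The loop structure described in this abstract (one full gradient, then many single-coordinate stochastic steps that compare a partial derivative at two points) can be sketched as follows. This is a simplified illustration that samples both the function and the coordinate uniformly, whereas the paper uses nonuniform distributions, and the step size is picked by hand rather than taken from the paper's analysis. The ridge-regression data and the name `s2cd_sketch` are assumptions.

```python
import random


def s2cd_sketch(A, y, lam, step, epochs, inner_steps, seed=0):
    """Semi-stochastic coordinate steps for
    f(x) = (1/n) * sum_i [0.5 * (a_i . x - y_i)^2 + 0.5 * lam * ||x||^2]."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])

    def partial(i, j, x):
        # j-th partial derivative of the i-th summand at x.
        residual = sum(a * xk for a, xk in zip(A[i], x)) - y[i]
        return residual * A[i][j] + lam * x[j]

    x = [0.0] * d
    for _ in range(epochs):
        anchor = list(x)  # starting point of this epoch
        # Deterministic step: full gradient of f at the anchor.
        full_grad = [sum(partial(i, j, anchor) for i in range(n)) / n for j in range(d)]
        for _ in range(inner_steps):
            i = rng.randrange(n)  # uniform here; nonuniform in the paper
            j = rng.randrange(d)
            # Partial derivative of f_i at two points, corrected by the full gradient.
            direction = full_grad[j] + partial(i, j, x) - partial(i, j, anchor)
            x[j] -= step * direction
        # The last stochastic iterate becomes the next anchor.
    return x


# Hypothetical toy problem whose minimizer is known in closed form:
# (1/n) * A^T A = 0.75 * I, so x* = (A^T y / n) / (0.75 + lam).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
Y = [1.0, 2.0, 3.0, 0.0]
x_s2cd = s2cd_sketch(A, Y, lam=0.1, step=0.02, epochs=60, inner_steps=300)
x_exact = [1.0 / 0.85, 1.25 / 0.85]
```

As the iterate and the anchor both approach the minimizer, the correction term drives the update direction to zero, which is the reduced variance property mentioned in the abstract.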
Communication-efficient distributed optimization of self-concordant empirical loss. arXiv preprint arXiv:1501.00263, 2015
Cited by 4 (1 self)
We consider distributed convex optimization problems originated from sample average approximation of stochastic optimization, or empirical risk minimization in machine learning. We assume that each machine in the distributed computing system has access to a local empirical loss function, constructed with i.i.d. data sampled from a common distribution. We propose a communication-efficient distributed algorithm to minimize the overall empirical loss, which is the average of the local empirical losses. The algorithm is based on an inexact damped Newton method, where the inexact Newton steps are computed by a distributed preconditioned conjugate gradient method. We analyze its iteration complexity and communication efficiency for minimizing self-concordant empirical loss functions, and discuss the results for distributed ridge regression, logistic regression and binary classification with a smoothed hinge loss. In a standard setting for supervised learning, the required number of communication rounds of the algorithm does not increase with the sample size, and only grows slowly with the number of machines.
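The damped Newton step that this abstract builds on can be shown on a one-dimensional self-concordant function. The sketch below uses exact Newton directions on a single machine; the inexact, distributed preconditioned conjugate gradient solver of the paper is not reproduced. The test function $f(x) = x - \log x$ and the name `damped_newton` are assumptions for illustration.

```python
import math


def damped_newton(grad, hess, x0, iters=30):
    """Damped Newton iteration in one dimension:
    x <- x - f'(x) / ((1 + decrement) * f''(x)),
    where decrement = |f'(x)| / sqrt(f''(x)) is the Newton decrement.
    """
    x = x0
    for _ in range(iters):
        g, h = grad(x), hess(x)
        decrement = abs(g) / math.sqrt(h)
        x -= g / ((1.0 + decrement) * h)
    return x


# f(x) = x - log(x) on x > 0 is self-concordant with minimizer x = 1.
x_star = damped_newton(lambda x: 1.0 - 1.0 / x, lambda x: 1.0 / x ** 2, x0=0.05)
```

The damping factor shrinks the step while the Newton decrement is large, which keeps the iterates inside the domain $x > 0$ here, and it approaches a full Newton step as the decrement goes to zero.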
A linearly-convergent stochastic L-BFGS algorithm. arXiv preprint arXiv:1508.02087, 2015
Cited by 3 (0 self)
We propose a new stochastic L-BFGS algorithm and prove a linear convergence rate for strongly convex and smooth functions. Our algorithm draws heavily from a recent stochastic variant of L-BFGS proposed in Byrd et al.
Stochastic dual coordinate ascent with adaptive probabilities. ICML 2015.
Cited by 3 (2 self)
This paper introduces AdaSDCA: an adaptive variant of stochastic dual coordinate ascent (SDCA) for solving regularized empirical risk minimization problems. Our modification consists in allowing the method to adaptively change the probability distribution over the dual variables throughout the iterative process. AdaSDCA achieves a provably better complexity bound than SDCA with the best fixed probability distribution, known as importance sampling. However, it is of a theoretical character as it is expensive to implement. We also propose AdaSDCA+: a practical variant which in our experiments outperforms existing non-adaptive methods.
Online low-rank subspace clustering by basis dictionary pursuit. arXiv preprint arXiv:1503.08356, 2015
Cited by 1 (1 self)
Low-Rank Representation (LRR) has been a significant method for segmenting data that are generated from a union of subspaces. It is also known that solving LRR is challenging in terms of time complexity and memory footprint, in that the size of the nuclear norm regularized matrix is n-by-n (where n is the number of samples). In this paper, we thereby develop a novel online implementation of LRR that reduces the memory cost from O(n^2) to O(pd), with p being the ambient dimension and d being some estimated rank (d < p ≪ n). We also establish the theoretical guarantee that the sequence of solutions produced by our algorithm converges to a stationary point of the expected loss function asymptotically. Extensive experiments on synthetic and realistic datasets further substantiate that our algorithm is fast, robust and memory efficient.