Results 1  10
of
27
Parallel Coordinate Descent Methods for Big Data Optimization
, 2012
"... In this work we show that randomized (block) coordinate descent methods can be accelerated by parallelization when applied to the problem of minimizing the sum of a partially separable smooth convex function and a simple separable convex function. The theoretical speedup, as compared to the serial m ..."
Abstract

Cited by 74 (4 self)
 Add to MetaCart
(Show Context)
In this work we show that randomized (block) coordinate descent methods can be accelerated by parallelization when applied to the problem of minimizing the sum of a partially separable smooth convex function and a simple separable convex function. The theoretical speedup, as compared to the serial method, and referring to the number of iterations needed to approximately solve the problem with high probability, is a simple expression depending on the number of parallel processors and a natural and easily computable measure of separability of the smooth component of the objective function. In the worst case, when no degree of separability is present, there may be no speedup; in the best case, when the problem is separable, the speedup is equal to the number of processors. Our analysis also works in the mode when the number of blocks being updated at each iteration is random, which allows for modeling situations with busy or unreliable processors. We show that our algorithm is able to solve a LASSO problem involving a matrix with 20 billion nonzeros in 2 hours on a large memory node with 24 cores.
Accelerated, parallel and proximal coordinate descent. arXiv preprint arXiv:1312.5799
, 2013
"... We propose a new stochastic coordinate descent method for minimizing the sum of convex functions each of which depends on a small number of coordinates only. Our method (APPROX) is simultaneously Accelerated, Parallel and PROXimal; this is the first time such a method is proposed. In the special cas ..."
Abstract

Cited by 25 (3 self)
 Add to MetaCart
We propose a new stochastic coordinate descent method for minimizing the sum of convex functions each of which depends on a small number of coordinates only. Our method (APPROX) is simultaneously Accelerated, Parallel and PROXimal; this is the first time such a method is proposed. In the special case when the number of processors is equal to the number of coordinates, the method converges at the rate 2ω̄L̄R2/(k+1)2, where k is the iteration counter, ω ̄ is an average degree of separability of the loss function, L ̄ is the average of Lipschitz constants associated with the coordinates and individual functions in the sum, and R is the distance of the initial point from the minimizer. We show that the method can be implemented without the need to perform fulldimensional vector operations, which is the major bottleneck of accelerated coordinate descent. The fact that the method depends on the average degree of separability, and not on the maximum degree of separability, can be attributed to the use of new safe large stepsizes, leading to improved expected separable overapproximation (ESO). These are of independent interest and can be utilized in all existing parallel stochastic coordinate descent algorithms based on the concept of ESO. 1
Inexact Coordinate Descent: Complexity and Preconditioning
, 2013
"... In this paper we consider the problem of minimizing a convex function using a randomized block coordinate descent method. One of the key steps at each iteration of the algorithm is determining the update to a block of variables. Existing algorithms assume that in order to compute the update, a parti ..."
Abstract

Cited by 18 (3 self)
 Add to MetaCart
In this paper we consider the problem of minimizing a convex function using a randomized block coordinate descent method. One of the key steps at each iteration of the algorithm is determining the update to a block of variables. Existing algorithms assume that in order to compute the update, a particular subproblem is solved exactly. In his work we relax this requirement, and allow for the subproblem to be solved inexactly, leading to an inexact block coordinate descent method. Our approach incorporates the best known results for exact updates as a special case. Moreover, these theoretical guarantees are complemented by practical considerations: the use of iterative techniques to determine the update as well as the use of preconditioning for further acceleration.
Trading computation for communication: Distributed stochastic dual coordinate ascent
 in NIPS
, 2013
"... We present and study a distributed optimization algorithm by employing a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often have better performances than stochastic gradient descent methods in optimizing regularized lo ..."
Abstract

Cited by 14 (2 self)
 Add to MetaCart
(Show Context)
We present and study a distributed optimization algorithm by employing a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often have better performances than stochastic gradient descent methods in optimizing regularized loss minimization problems. It still lacks of efforts in studying them in a distributed framework. We make a progress along the line by presenting a distributed stochastic dual coordinate ascent algorithm in a star network, with an analysis of the tradeoff between computation and communication. We verify our analysis by experiments on real data sets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performances. 1
Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv:1309.5885
, 2013
"... We study the performance of a family of randomized parallel coordinate descent methods for minimizing the sum of a nonsmooth and separable convex functions. The problem class includes as a special case L1regularized L1 regression and the minimization of the exponential loss (“AdaBoost problem”). We ..."
Abstract

Cited by 13 (1 self)
 Add to MetaCart
We study the performance of a family of randomized parallel coordinate descent methods for minimizing the sum of a nonsmooth and separable convex functions. The problem class includes as a special case L1regularized L1 regression and the minimization of the exponential loss (“AdaBoost problem”). We assume the input data defining the loss function is contained in a sparse m × n matrix A with at most ω nonzeros in each row. Our methods need O(nβ/τ) iterations to find an approximate solution with high probability, where τ is the number of processors and β = 1 + (ω − 1)(τ − 1)/(n − 1) for the fastest variant. The notation hides dependence on quantities such as the required accuracy and confidence levels and the distance of the starting iterate from an optimal point. Since β/τ is a decreasing function of τ, the method needs fewer iterations when more processors are used. Certain variants of our algorithms perform on average only O(nnz(A)/n) arithmetic operations during a single iteration per processor and, because β decreases when ω does, fewer iterations are needed for sparser problems. 1
Separable Approximations and Decomposition Methods for the Augmented Lagrangian
, 2013
"... In this paper we study decomposition methods based on separable approximations for minimizing the augmented Lagrangian. In particular, we study and compare the Diagonal Quadratic Approximation Method (DQAM) of Mulvey and Ruszczyński [13] and the Parallel Coordinate Descent Method (PCDM) of Richtárik ..."
Abstract

Cited by 11 (0 self)
 Add to MetaCart
In this paper we study decomposition methods based on separable approximations for minimizing the augmented Lagrangian. In particular, we study and compare the Diagonal Quadratic Approximation Method (DQAM) of Mulvey and Ruszczyński [13] and the Parallel Coordinate Descent Method (PCDM) of Richtárik and Takáč [23]. We show that the two methods are equivalent for feasibility problems up to the selection of a single stepsize parameter. Furthermore, we prove an improved complexity bound for PCDM under strong convexity, and show that this bound is at least 8(L ′ / ¯ L)(ω − 1) 2 times better than the best known bound for DQAM, where ω is the degree of partial separability and L ′ and ¯ L are the maximum and average of the block Lipschitz constants of the gradient of the quadratic penalty appearing in the augmented Lagrangian. 1
Randomized dual coordinate ascent with arbitrary sampling. arXiv:1411.5873
, 2014
"... We study the problem of minimizing the average of a large number of smooth convex functions penalized with a strongly convex regularizer. We propose and analyze a novel primaldual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according ..."
Abstract

Cited by 6 (4 self)
 Add to MetaCart
We study the problem of minimizing the average of a large number of smooth convex functions penalized with a strongly convex regularizer. We propose and analyze a novel primaldual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according to an arbitrary distribution. In contrast to typical analysis, we directly bound the decrease of the primaldual error (in expectation), without the need to first analyze the dual error. Depending on the choice of the sampling, we obtain efficient serial, parallel and distributed variants of the method. In the serial case, our bounds match the best known bounds for SDCA (both with uniform and importance sampling). With standard minibatching, our bounds predict initial dataindependent speedup as well as additional datadriven speedup which depends on spectral and sparsity properties of the data. We calculate theoretical speedup factors and find that they are excellent predictors of actual speedup in practice. Moreover, we illustrate that it is possible to design an efficient minibatch importance sampling. The distributed variant of Quartz is the first distributed SDCAlike method with an analysis for nonseparable data. 1
Stochastic gradient descent with differentially private updates
"... Abstract—Differential privacy is a recent framework for computation on sensitive data, which has shown considerable promise in the regime of large datasets. Stochastic gradient methods are a popular approach for learning in the datarich regime because they are computationally tractable and scalabl ..."
Abstract

Cited by 4 (0 self)
 Add to MetaCart
(Show Context)
Abstract—Differential privacy is a recent framework for computation on sensitive data, which has shown considerable promise in the regime of large datasets. Stochastic gradient methods are a popular approach for learning in the datarich regime because they are computationally tractable and scalable. In this paper, we derive differentially private versions of stochastic gradient descent, and test them empirically. Our results show that standard SGD experiences high variability due to differential privacy, but a moderate increase in the batch size can improve performance significantly. I.
Coordinate descent with arbitrary sampling I: Algorithms and complexity
, 2014
"... The design and complexity analysis of randomized coordinate descent methods, and in particular of variants which update a random subset (sampling) of coordinates in each iteration, depends on the notion of expected separable overapproximation (ESO). This refers to an inequality involving the objec ..."
Abstract

Cited by 3 (1 self)
 Add to MetaCart
(Show Context)
The design and complexity analysis of randomized coordinate descent methods, and in particular of variants which update a random subset (sampling) of coordinates in each iteration, depends on the notion of expected separable overapproximation (ESO). This refers to an inequality involving the objective function and the sampling, capturing in a compact way certain smoothness properties of the function in a random subspace spanned by the sampled coordinates. ESO inequalities were previously established for special classes of samplings only, almost invariably for uniform samplings. In this paper we develop a systematic technique for deriving these inequalities for a large class of functions and for arbitrary samplings. We demonstrate that one can recover existing ESO results using our general approach, which is based on the study of eigenvalues associated with samplings and the data describing the function. 1
Stochastic dual coordinate ascent with adaptive probabilities. ICML 2015. [2] Shai ShalevShwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss
"... This paper introduces AdaSDCA: an adaptive variant of stochastic dual coordinate ascent (SDCA) for solving the regularized empirical risk minimization problems. Our modification consists in allowing the method adaptively change the probability distribution over the dual variables throughout the ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
This paper introduces AdaSDCA: an adaptive variant of stochastic dual coordinate ascent (SDCA) for solving the regularized empirical risk minimization problems. Our modification consists in allowing the method adaptively change the probability distribution over the dual variables throughout the iterative process. AdaSDCA achieves provably better complexity bound than SDCA with the best fixed probability distribution, known as importance sampling. However, it is of a theoretical character as it is expensive to implement. We also propose AdaSDCA+: a practical variant which in our experiments outperforms existing nonadaptive methods. 1.