Results 1–10 of 24
Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv:1309.5885, 2013
Abstract

Cited by 13 (1 self)
We study the performance of a family of randomized parallel coordinate descent methods for minimizing the sum of a nonsmooth and a separable convex function. The problem class includes as a special case L1-regularized L1 regression and the minimization of the exponential loss (“AdaBoost problem”). We assume the input data defining the loss function is contained in a sparse m × n matrix A with at most ω nonzeros in each row. Our methods need O(nβ/τ) iterations to find an approximate solution with high probability, where τ is the number of processors and β = 1 + (ω − 1)(τ − 1)/(n − 1) for the fastest variant. The O(·) notation hides dependence on quantities such as the required accuracy and confidence levels and the distance of the starting iterate from an optimal point. Since β/τ is a decreasing function of τ, the method needs fewer iterations when more processors are used. Certain variants of our algorithms perform on average only O(nnz(A)/n) arithmetic operations during a single iteration per processor and, because β decreases when ω does, fewer iterations are needed for sparser problems.
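The iteration bound above is concrete enough to tabulate. A minimal sketch (the function name `beta` and the sample values of n, ω, τ are illustrative, not taken from the paper) computing β = 1 + (ω − 1)(τ − 1)/(n − 1) and the per-processor factor β/τ:

```python
def beta(n, omega, tau):
    """beta = 1 + (omega - 1)(tau - 1)/(n - 1) for the fastest PCDM variant."""
    return 1.0 + (omega - 1) * (tau - 1) / (n - 1)

# beta/tau shrinks as the number of processors tau grows,
# so more processors means fewer iterations overall.
n, omega = 1001, 11
for tau in (1, 10, 100, 1000):
    print(tau, beta(n, omega, tau) / tau)
```

Note that β itself grows with τ (parallel updates can interfere on shared nonzeros), but more slowly than τ, which is why β/τ still falls.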
Separable Approximations and Decomposition Methods for the Augmented Lagrangian, 2013
Abstract

Cited by 11 (0 self)
In this paper we study decomposition methods based on separable approximations for minimizing the augmented Lagrangian. In particular, we study and compare the Diagonal Quadratic Approximation Method (DQAM) of Mulvey and Ruszczyński [13] and the Parallel Coordinate Descent Method (PCDM) of Richtárik and Takáč [23]. We show that the two methods are equivalent for feasibility problems up to the selection of a single stepsize parameter. Furthermore, we prove an improved complexity bound for PCDM under strong convexity, and show that this bound is at least 8(L′/L̄)(ω − 1)² times better than the best known bound for DQAM, where ω is the degree of partial separability and L′ and L̄ are the maximum and average of the block Lipschitz constants of the gradient of the quadratic penalty appearing in the augmented Lagrangian.
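The improvement factor is a simple closed-form expression and can be evaluated directly. A minimal sketch (function name and sample constants are illustrative, not from the paper):

```python
def pcdm_vs_dqam_ratio(L_max, L_avg, omega):
    """Lower bound on how much better the PCDM complexity bound is than
    the best known DQAM bound: 8 * (L_max / L_avg) * (omega - 1)**2."""
    return 8.0 * (L_max / L_avg) * (omega - 1) ** 2

# Example: max block Lipschitz constant 4, average 2, partial separability 5.
print(pcdm_vs_dqam_ratio(4.0, 2.0, 5))  # 8 * 2 * 16 = 256.0
```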
Asynchronous stochastic coordinate descent: Parallelism and convergence properties
Randomized dual coordinate ascent with arbitrary sampling. arXiv:1411.5873, 2014
Abstract

Cited by 6 (4 self)
We study the problem of minimizing the average of a large number of smooth convex functions penalized with a strongly convex regularizer. We propose and analyze a novel primal-dual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according to an arbitrary distribution. In contrast to typical analysis, we directly bound the decrease of the primal-dual error (in expectation), without the need to first analyze the dual error. Depending on the choice of the sampling, we obtain efficient serial, parallel and distributed variants of the method. In the serial case, our bounds match the best known bounds for SDCA (both with uniform and importance sampling). With standard mini-batching, our bounds predict initial data-independent speedup as well as additional data-driven speedup which depends on spectral and sparsity properties of the data. We calculate theoretical speedup factors and find that they are excellent predictors of actual speedup in practice. Moreover, we illustrate that it is possible to design an efficient mini-batch importance sampling. The distributed variant of Quartz is the first distributed SDCA-like method with an analysis for non-separable data.
Coordinate descent with arbitrary sampling I: Algorithms and complexity, 2014
Abstract

Cited by 3 (1 self)
The design and complexity analysis of randomized coordinate descent methods, and in particular of variants which update a random subset (sampling) of coordinates in each iteration, depends on the notion of expected separable over-approximation (ESO). This refers to an inequality involving the objective function and the sampling, capturing in a compact way certain smoothness properties of the function in a random subspace spanned by the sampled coordinates. ESO inequalities were previously established for special classes of samplings only, almost invariably for uniform samplings. In this paper we develop a systematic technique for deriving these inequalities for a large class of functions and for arbitrary samplings. We demonstrate that one can recover existing ESO results using our general approach, which is based on the study of eigenvalues associated with samplings and the data describing the function.
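For concreteness, a commonly used form of the ESO inequality for a uniform sampling can be sketched as follows (this is the standard uniform-sampling formulation, not a formula quoted from this abstract; $v$ is a vector of ESO parameters):

```latex
% ESO(v): for all x, h \in \mathbb{R}^n and a uniform sampling \hat{S}
% with \mathbb{E}|\hat{S}| = \tau,
\mathbb{E}\left[ f\!\left(x + h_{[\hat{S}]}\right) \right]
  \;\le\; f(x) + \frac{\tau}{n}\left( \langle \nabla f(x), h \rangle
  + \tfrac{1}{2}\|h\|_v^2 \right),
\qquad \|h\|_v^2 := \sum_{i=1}^n v_i h_i^2,
```

where $h_{[\hat{S}]}$ denotes the vector that keeps only the coordinates of $h$ indexed by $\hat{S}$ and zeroes out the rest.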
Stochastic dual coordinate ascent with adaptive probabilities. ICML 2015
Abstract

Cited by 2 (2 self)
This paper introduces AdaSDCA: an adaptive variant of stochastic dual coordinate ascent (SDCA) for solving regularized empirical risk minimization problems. Our modification consists in allowing the method to adaptively change the probability distribution over the dual variables throughout the iterative process. AdaSDCA achieves a provably better complexity bound than SDCA with the best fixed probability distribution, known as importance sampling. However, it is of a theoretical character as it is expensive to implement. We also propose AdaSDCA+: a practical variant which in our experiments outperforms existing non-adaptive methods.
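The idea of sampling dual coordinates adaptively can be sketched on a toy ridge-regression instance. This is a simplified stand-in, not the paper's algorithm: here each coordinate is drawn with probability proportional to its current dual residual, and all names and constants are illustrative.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def adaptive_sdca_ridge(A, y, lam, iters, seed=0):
    """SDCA for ridge regression: min_w (1/n) sum_i 0.5*(a_i.w - y_i)^2
    + (lam/2)*||w||^2, with coordinates sampled in proportion to their
    current dual residual (a simplified adaptive heuristic)."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    alpha = [0.0] * n   # dual variables
    w = [0.0] * d       # maintained as w = (1/(lam*n)) * sum_i alpha_i * a_i
    for _ in range(iters):
        # Dual residuals; recomputing all of them is O(n*d) per step,
        # fine for a sketch but not for a serious implementation.
        kappa = [y[i] - dot(A[i], w) - alpha[i] for i in range(n)]
        i = rng.choices(range(n), weights=[abs(k) + 1e-12 for k in kappa])[0]
        # Closed-form exact coordinate maximization for the squared loss.
        delta = kappa[i] / (1.0 + dot(A[i], A[i]) / (lam * n))
        alpha[i] += delta
        for j in range(d):
            w[j] += delta * A[i][j] / (lam * n)
    return w, alpha
```

Progress can be monitored via the duality gap P(w) − D(α), with P(w) = (1/n) Σᵢ ½(aᵢᵀw − yᵢ)² + (λ/2)‖w‖² and D(α) = (1/n) Σᵢ (yᵢαᵢ − αᵢ²/2) − (λ/2)‖w(α)‖², which shrinks to zero as the method converges.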
Adding vs. Averaging in Distributed Primal-Dual Optimization
Abstract

Cited by 2 (0 self)
Distributed optimization methods for large-scale machine learning suffer from a communication bottleneck. It is difficult to reduce this bottleneck while still efficiently and accurately aggregating partial work from different machines. In this paper, we present a novel generalization of the recent communication-efficient primal-dual framework (COCOA) for distributed optimization. Our framework, COCOA+, allows for additive combination of local updates to the global parameters at each iteration, whereas previous schemes only allow conservative averaging. We give stronger (primal-dual) convergence rate guarantees for both COCOA and our new variants, and generalize the theory for both methods to cover nonsmooth convex loss functions. We provide an extensive experimental comparison that shows the markedly improved performance of COCOA+ on several real-world distributed datasets, especially when scaling up the number of machines.
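The adding-vs-averaging distinction is just two aggregation rules for the per-machine updates. A minimal sketch (the function name and the plain scaling factor `nu` are illustrative; `nu` stands in for the paper's safe aggregation parameter, which is chosen from the analysis rather than fixed by hand):

```python
def aggregate(w, local_updates, rule="add", nu=1.0):
    """Combine per-machine updates delta_w_k into the global iterate.

    rule="average": conservative COCOA-style rule, w + (1/K) * sum_k delta_w_k.
    rule="add":     COCOA+-style additive rule,    w + nu * sum_k delta_w_k.
    """
    K = len(local_updates)
    scale = 1.0 / K if rule == "average" else nu
    totals = [sum(u[j] for u in local_updates) for j in range(len(w))]
    return [wj + scale * t for wj, t in zip(w, totals)]

w = [0.0, 0.0]
updates = [[1.0, -2.0]] * 4            # 4 machines, identical local updates
print(aggregate(w, updates, "average"))  # [1.0, -2.0]
print(aggregate(w, updates, "add"))      # [4.0, -8.0]
```

With identical local updates, averaging discards a factor of K of the local work, which is the conservatism the additive rule removes.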
Stochastic Dual Coordinate Ascent with Alternating Direction Method of Multipliers
Abstract

Cited by 2 (0 self)
We propose a new stochastic dual coordinate ascent technique that can be applied to a wide range of regularized learning problems. Our method is based on the alternating direction method of multipliers (ADMM) to deal with complex regularization functions such as structured regularizations. Although the original ADMM is a batch method, the proposed method offers a stochastic update rule where each iteration requires only one or a few sample observations. Moreover, our method naturally accommodates mini-batch updates, which speed up convergence. We show that, under mild assumptions, our method converges exponentially. Numerical experiments show that our method performs efficiently in practice.
A Novel, Simple Interpretation of Nesterov’s Accelerated Method as a Combination of Gradient and Mirror Descent. arXiv e-prints, abs/1407.1537, 2014
Abstract

Cited by 2 (1 self)
First-order methods play a central role in large-scale convex optimization. Even though many variations exist, each suited to a particular problem form, almost all such methods fundamentally rely on two types of algorithmic steps and two corresponding types of analysis: gradient-descent steps, which yield primal progress, and mirror-descent steps, which yield dual progress. In this paper, we observe that the performance of these two types of steps is complementary, so that faster algorithms can be designed by coupling the two steps and combining their analyses. In particular, we show how to obtain a conceptually simple interpretation of Nesterov’s accelerated gradient method [Nes83, Nes04, Nes05], a cornerstone algorithm in convex optimization. Nesterov’s method is the optimal first-order method for the class of smooth convex optimization problems. However, to the best of our knowledge, the proof of the fast convergence of Nesterov’s method has not found a clear interpretation and is still regarded by many as crucially relying on an “algebraic trick” [Jud13]. We apply our novel insights to express Nesterov’s algorithm as a natural coupling of gradient descent and mirror descent and to write its proof of convergence as a simple combination of the convergence analyses of the two underlying steps. We believe that the complementary view of gradient descent and mirror descent proposed in this paper will prove very useful in the design of first-order methods as it allows us to design fast algorithms in a conceptually easier way. For instance, our view greatly facilitates the adaptation of nontrivial variants of Nesterov’s method to specific scenarios, such as packing and covering problems [AO14b, AO14a].
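The coupling described above can be sketched in the Euclidean case, where the mirror step reduces to another gradient-like update on a separate sequence. This is a sketch under standard linear-coupling step-size choices; the quadratic objective and all names are illustrative, not taken from the paper.

```python
def accelerated_by_coupling(grad, L, x0, iters):
    """Nesterov-style acceleration as a coupling of a gradient step (on y)
    and a Euclidean mirror step (on z), with gradients queried at their
    convex combination x."""
    y = list(x0)
    z = list(x0)
    for k in range(iters):
        tau = 2.0 / (k + 2)                              # coupling weight
        x = [tau * zi + (1 - tau) * yi for zi, yi in zip(z, y)]
        g = grad(x)
        y = [xi - gi / L for xi, gi in zip(x, g)]        # gradient step: primal progress
        alpha = (k + 2) / (2.0 * L)
        z = [zi - alpha * gi for zi, gi in zip(z, g)]    # mirror step: dual progress
    return y

# Illustrative smooth objective: f(x) = 0.5 * sum_i a_i * (x_i - c_i)^2,
# minimized at c, with L = max(a).
a, c = [1.0, 10.0], [3.0, -2.0]
grad = lambda x: [ai * (xi - ci) for ai, xi, ci in zip(a, x, c)]
w = accelerated_by_coupling(grad, L=max(a), x0=[0.0, 0.0], iters=1000)
```

Neither sequence alone converges this fast: the gradient steps make steady but slow primal progress, the aggressive mirror steps accumulate dual progress, and the combination recovers the accelerated O(1/k²) rate.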