
## Parallel and Distributed Block-Coordinate Frank-Wolfe Algorithms (2016)

### Citations

595 | Max-margin Markov networks - Taskar, Guestrin, et al. - 2003
Citation Context ... Then τκ is roughly a constant regardless how τ is chosen. 4. Experiments In this section, we experimentally demonstrate performance gains from the three key features of our algorithm: minibatches of data, parallel workers, and asynchrony. 4.1. Minibatches of Data We conduct simulations to study the effect of mini-batch size τ , where larger τ implies greater degrees of parallelism as each worker can solve one or more subproblems in a mini-batch. In our simulation, we re-use the structural SVM setup from Lacoste-Julien et al. (2013) for a sequence labeling task on a subset of the OCR dataset (Taskar et al., 2004) (n = 6251, d = 4082). The dual problem has block-separable probability simplex constraint therefore allowing us to run AP-BCFW, and each subproblem can be solved efficiently using the Viterbi algorithm (more details are included in Appendix C). The speedup on this dataset is shown in Figure 2(a). For this dataset, we use λ = 1 with weighted averaging and line-search throughout (no delay is allowed). We measure the speedup for a particular τ > 1 in terms of the number of iterations (Algorithm 1) required to converge relative to τ = 1, which corresponds to BCFW. Figure 2(a) shows that AP-BCFW a... |

408 | Distributed asynchronous deterministic and stochastic gradient optimization algorithms - Tsitsiklis, Bertsekas, et al. - 1986
Citation Context ...lel computation. Our analysis follows the structure in (Lacoste-Julien et al., 2013), but uses different stepsizes that must be carefully chosen. Our results contain BCFW as a special case. Lacoste-Julien et al. (2013) primarily focus on more explicit (and stronger) guarantee for BCFW on structural SVM, while we mainly focus on a more general class of problems; the particular subroutine needed by structural SVM requires special treatment though (see Appendix C). Parallelization of sequential algorithms. The idea of parallelizing sequential optimization algorithms is not new. It dates back to (Tsitsiklis et al., 1986) for stochastic gradient methods; more recently Lee et al. (2014); Liu et al. (2014); Richtarik & Takac (2015) study parallelization of BCD. The conditions under which these parallel BCD methods succeed, e.g., expected separable overapproximation (ESO), and coordinate Lipschitz conditions, bear a close resemblance to our conditions in Section 3.2, but are not the same due to differences in how solutions are updated and what subproblems arise. In particular, our conditions are affine invariant. We provide detailed comparisons to parallel coordinate descent in Appendix D.5. Asynchronous algor... |

308 | An algorithm for quadratic programming - Frank, Wolfe - 1956
Citation Context ...lock-Coordinate Frank-Wolfe (BCFW) method (Lacoste-Julien et al., 2013), but our analysis subsumes BCFW and reveals problemdependent quantities that govern the speedups of our methods over BCFW. A notable feature of our algorithms is that they do not depend on worst-case bounded delays, but only (mildly) on expected delays, making them robust to stragglers and faulty worker threads. We present experiments on structural SVM and Group Fused Lasso, and observe significant speedups over competing state-of-the-art (and synchronous) methods. 1. Introduction The classical Frank-Wolfe (FW) algorithm (Frank & Wolfe, 1956) has witnessed a huge surge of interest recently (Ahipasaoglu et al., 2008; Clarkson, 2010; Jaggi, 2011; 2013). The FW algorithm iteratively minimizes a smooth function f (typically convex) over a compact convex set M ⊂ Rm. Unlike methods based on projection, FW uses just a linear oracle that solves minx∈M 〈x, g〉, which can be much simpler and faster than projection. This feature underlies the great popularity of FW, which Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). has by now witnessed s... |
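The excerpt above describes the classical FW step: a linear oracle minₓ∈M ⟨x, g⟩ replaces projection. A minimal sketch of that loop, assuming (our illustrative choice, not from the paper) that M is the probability simplex, where the oracle is just the vertex at the most negative gradient coordinate:

```python
import numpy as np

def fw_simplex(grad, x0, iters=200):
    """Classical Frank-Wolfe over the probability simplex (illustrative)."""
    x = x0.copy()
    for k in range(iters):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0            # linear oracle: best simplex vertex
        gamma = 2.0 / (k + 2.0)          # standard FW step size
        x = (1 - gamma) * x + gamma * s  # convex combination stays feasible
    return x

# Example: minimize f(x) = 0.5 * ||x - b||^2 over the simplex.
b = np.array([0.2, 0.5, 0.3])
x = fw_simplex(lambda x: x - b, np.array([1.0, 0.0, 0.0]))
```

Note that no projection is ever computed: every iterate is a convex combination of oracle answers, which is the feature the excerpt highlights.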

215 | Learning structural SVMs with latent variables - Yu, Joachims - 2009 |

158 | Efficiency of coordinate descent methods on huge-scale optimization problems - Nesterov - 2012
Citation Context ...Such problems arise in many applications, notably, structural SVMs (Lacoste-Julien et al., 2013), routing (LeBlanc et al., 1975), group fused lasso (Alaız et al., 2013; Bleakley & Vert, 2011), trace-norm based tensor completion (Liu et al., 2013), reduced rank nonparametric regression (Foygel et al., 2012), and structured submodular minimization (Jegelka et al., 2013), among others. A standard approach to solve (1) is via block-coordinate (gradient) descent (BCD), which forms a local quadratic model for a block of variables, and then solves a projection subproblem (Beck & Tetruashvili, 2013; Nesterov, 2012; Richtarik & Takac, 2015). However, for many problems, including the ones noted above, projection can be expensive (e.g., projecting onto the trace norm ball, onto base polytopes Fujishige & Isotani, 2011), and in some cases even computationally intractable (Collins et al., 2008). Frank-Wolfe methods excel in such scenarios as they rely Parallel and Distributed Block-Coordinate Frank-Wolfe Algorithms only on linear oracles that solve mins∈M〈s,∇f(·)〉. For M = ∏ iMi, this breaks into the n independent problems min s(i)∈Mi 〈s(i),∇(i)f(x)〉, 1 ≤ i ≤ n, (2) where∇(i) denotes the gradient w.r.t. ... |

156 | Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. arXiv preprint arXiv:1106.5730 - Niu, Recht, et al. - 2011 |

90 | Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks - Collins, Globerson, et al. - 2008
Citation Context ...egression (Foygel et al., 2012), and structured submodular minimization (Jegelka et al., 2013), among others. A standard approach to solve (1) is via block-coordinate (gradient) descent (BCD), which forms a local quadratic model for a block of variables, and then solves a projection subproblem (Beck & Tetruashvili, 2013; Nesterov, 2012; Richtarik & Takac, 2015). However, for many problems, including the ones noted above, projection can be expensive (e.g., projecting onto the trace norm ball, onto base polytopes Fujishige & Isotani, 2011), and in some cases even computationally intractable (Collins et al., 2008). Frank-Wolfe methods excel in such scenarios as they rely Parallel and Distributed Block-Coordinate Frank-Wolfe Algorithms only on linear oracles that solve mins∈M〈s,∇f(·)〉. For M = ∏ iMi, this breaks into the n independent problems min s(i)∈Mi 〈s(i),∇(i)f(x)〉, 1 ≤ i ≤ n, (2) where∇(i) denotes the gradient w.r.t. coordinates x(i). It is obvious that these n subproblems can be solved in parallel (an idea dating back to at least as early as LeBlanc et al., 1975). However, having to update all the coordinates at each iteration is expensive, hampering the use of FW on big-data problems. This draw... |
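The decomposition in Eq. (2) quoted above — the linear oracle over M = ∏ᵢ Mᵢ splitting into n independent per-block problems — can be sketched as follows. This is our own illustration (each block Mᵢ taken as a probability simplex, an assumption not fixed by the paper); the point is that the per-block calls share no state, so a mini-batch of them can be handed to parallel workers:

```python
import numpy as np

def block_oracle(g_block):
    """min over one simplex block of <s, g>: vertex at the smallest entry."""
    s = np.zeros_like(g_block)
    s[np.argmin(g_block)] = 1.0
    return s

def separable_oracle(grad_blocks):
    """Solve the n independent subproblems of Eq. (2); each call is
    independent, so these could run on separate workers."""
    return [block_oracle(g) for g in grad_blocks]

# Two blocks of different sizes, as allowed by M = M_1 x M_2.
grads = [np.array([0.3, -0.1]), np.array([-0.4, 0.2, 0.1])]
S = separable_oracle(grads)  # each S[i] is a vertex of its own simplex
```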

89 | An efficient approach to solving the road network equilibrium traffic assignment problem - LeBlanc, Morlok, et al. - 1975 |

84 | Revisiting Frank-Wolfe: Projection-free sparse convex optimization - Jaggi - 2013
Citation Context ... we show stronger results using results from load-balancing on max-load bounds. • Insightful deterministic conditions under which minibatching provably improves the convergence rate for a class of problems (sometimes by orders of magnitude). • Experiments that demonstrate on real data how our algorithm solves a structural SVM problem several times faster than the state-of-the-art. In short, our results contribute towards making FW more attractive for big-data applications. To add perspective, we compare our methods to closely related works below; we refer the reader to Freund & Grigas (2014); Jaggi (2013); Lacoste-Julien et al. (2013); Zhang et al. (2012) for additional notes and references. BCFW and Structural SVM. Our algorithm AP-BCFW extends and generalizes BCFW to parallel computation. Our analysis follows the structure in (Lacoste-Julien et al., 2013), but uses different stepsizes that must be carefully chosen. Our results contain BCFW as a special case. Lacoste-Julien et al. (2013) primarily focus on more explicit (and stronger) guarantee for BCFW on structural SVM, while we mainly focus on a more general class of problems; the particular subroutine needed by structural SVM requires spe... |

84 | Tensor completion for estimating missing values in visual data - Liu, Musialski, et al. - 2013
Citation Context ...hms, for the particular setting where the constraint setM is block-separable; thus, we solve min x f(x) s.t. x = [x(1), ..., x(n)] ∈ n∏ i=1 Mi, (1) whereMi ⊂ Rmi (1 ≤ i ≤ n) is a compact convex set and x(i) are coordinate partitions of x. This setting for FW was considered in Lacoste-Julien et al. (2013), who introduced the Block-Coordinate Frank-Wolfe (BCFW) method. Such problems arise in many applications, notably, structural SVMs (Lacoste-Julien et al., 2013), routing (LeBlanc et al., 1975), group fused lasso (Alaız et al., 2013; Bleakley & Vert, 2011), trace-norm based tensor completion (Liu et al., 2013), reduced rank nonparametric regression (Foygel et al., 2012), and structured submodular minimization (Jegelka et al., 2013), among others. A standard approach to solve (1) is via block-coordinate (gradient) descent (BCD), which forms a local quadratic model for a block of variables, and then solves a projection subproblem (Beck & Tetruashvili, 2013; Nesterov, 2012; Richtarik & Takac, 2015). However, for many problems, including the ones noted above, projection can be expensive (e.g., projecting onto the trace norm ball, onto base polytopes Fujishige & Isotani, 2011), and in some cases even... |

81 | Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm - Clarkson - 2010
Citation Context ...es BCFW and reveals problemdependent quantities that govern the speedups of our methods over BCFW. A notable feature of our algorithms is that they do not depend on worst-case bounded delays, but only (mildly) on expected delays, making them robust to stragglers and faulty worker threads. We present experiments on structural SVM and Group Fused Lasso, and observe significant speedups over competing state-of-the-art (and synchronous) methods. 1. Introduction The classical Frank-Wolfe (FW) algorithm (Frank & Wolfe, 1956) has witnessed a huge surge of interest recently (Ahipasaoglu et al., 2008; Clarkson, 2010; Jaggi, 2011; 2013). The FW algorithm iteratively minimizes a smooth function f (typically convex) over a compact convex set M ⊂ Rm. Unlike methods based on projection, FW uses just a linear oracle that solves minx∈M 〈x, g〉, which can be much simpler and faster than projection. This feature underlies the great popularity of FW, which Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). has by now witnessed several extensions such as regularized FW (Bredies et al., 2009; Harchaoui et al., 2015; Zh... |

76 | Parallel coordinate descent methods for big data optimization. arXiv preprint arXiv:1212.0873 - Richtárik, Takáč - 2012 |

56 | Block-coordinate Frank-Wolfe optimization for structural SVMs. arXiv preprint arXiv:1207.4747 - Lacoste-Julien, Jaggi, et al. - 2012
Citation Context ...ai † WDAI@CS.CMU.EDU Willie Neiswanger † WILLIE@CS.CMU.EDU Suvrit Sra ‡ SUVRIT@MIT.EDU Eric P. Xing † EPXING@CS.CMU.EDU † Carnegie Mellon University, 5000 Forbes Ave, PA 15213, USA ‡ Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA Abstract We study parallel and distributed Frank-Wolfe algorithms; the former on shared memory machines with mini-batching, and the latter in a delayed update framework. In both cases, we perform computations asynchronously whenever possible. We assume block-separable constraints as in Block-Coordinate Frank-Wolfe (BCFW) method (Lacoste-Julien et al., 2013), but our analysis subsumes BCFW and reveals problemdependent quantities that govern the speedups of our methods over BCFW. A notable feature of our algorithms is that they do not depend on worst-case bounded delays, but only (mildly) on expected delays, making them robust to stragglers and faulty worker threads. We present experiments on structural SVM and Group Fused Lasso, and observe significant speedups over competing state-of-the-art (and synchronous) methods. 1. Introduction The classical Frank-Wolfe (FW) algorithm (Frank & Wolfe, 1956) has witnessed a huge surge of interest recently (A... |

34 | A generalized conditional gradient method and its connection to an iterative shrinkage method - Bredies, Lorenz, et al. - 2009
Citation Context ...cently (Ahipasaoglu et al., 2008; Clarkson, 2010; Jaggi, 2011; 2013). The FW algorithm iteratively minimizes a smooth function f (typically convex) over a compact convex set M ⊂ Rm. Unlike methods based on projection, FW uses just a linear oracle that solves minx∈M 〈x, g〉, which can be much simpler and faster than projection. This feature underlies the great popularity of FW, which Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). has by now witnessed several extensions such as regularized FW (Bredies et al., 2009; Harchaoui et al., 2015; Zhang et al., 2013), linearly convergent special cases (Garber & Hazan, 2013; Lacoste-Julien & Jaggi, 2015), stochastic versions (Hazan & Kale, 2012; Lafond et al., 2015; Ouyang & Gray, 2010), and a randomized block-coordinate FW (Lacoste-Julien et al., 2013). Despite this progress, parallel and distributed FW variants are barely known. We fill this gap and develop new asynchronous FW algorithms, for the particular setting where the constraint setM is block-separable; thus, we solve min x f(x) s.t. x = [x(1), ..., x(n)] ∈ n∏ i=1 Mi, (1) whereMi ⊂ Rmi (1 ≤ i ≤ n) is a ... |

30 | Accelerated, parallel and proximal coordinate descent. arXiv preprint arXiv:1312.5799 - Fercoq, Richtárik - 2013 |

28 | On the convergence of block coordinate descent type methods - Beck, Tetruashvili - 2013
Citation Context ...Frank-Wolfe (BCFW) method. Such problems arise in many applications, notably, structural SVMs (Lacoste-Julien et al., 2013), routing (LeBlanc et al., 1975), group fused lasso (Alaız et al., 2013; Bleakley & Vert, 2011), trace-norm based tensor completion (Liu et al., 2013), reduced rank nonparametric regression (Foygel et al., 2012), and structured submodular minimization (Jegelka et al., 2013), among others. A standard approach to solve (1) is via block-coordinate (gradient) descent (BCD), which forms a local quadratic model for a block of variables, and then solves a projection subproblem (Beck & Tetruashvili, 2013; Nesterov, 2012; Richtarik & Takac, 2015). However, for many problems, including the ones noted above, projection can be expensive (e.g., projecting onto the trace norm ball, onto base polytopes Fujishige & Isotani, 2011), and in some cases even computationally intractable (Collins et al., 2008). Frank-Wolfe methods excel in such scenarios as they rely Parallel and Distributed Block-Coordinate Frank-Wolfe Algorithms only on linear oracles that solve mins∈M〈s,∇f(·)〉. For M = ∏ iMi, this breaks into the n independent problems min s(i)∈Mi 〈s(i),∇(i)f(x)〉, 1 ≤ i ≤ n, (2) where∇(i) denotes the ... |

26 | The group fused lasso for multiple change-point detection. arXiv - Bleakley, Vert - 2011
Citation Context ...wn. We fill this gap and develop new asynchronous FW algorithms, for the particular setting where the constraint setM is block-separable; thus, we solve min x f(x) s.t. x = [x(1), ..., x(n)] ∈ n∏ i=1 Mi, (1) whereMi ⊂ Rmi (1 ≤ i ≤ n) is a compact convex set and x(i) are coordinate partitions of x. This setting for FW was considered in Lacoste-Julien et al. (2013), who introduced the Block-Coordinate Frank-Wolfe (BCFW) method. Such problems arise in many applications, notably, structural SVMs (Lacoste-Julien et al., 2013), routing (LeBlanc et al., 1975), group fused lasso (Alaız et al., 2013; Bleakley & Vert, 2011), trace-norm based tensor completion (Liu et al., 2013), reduced rank nonparametric regression (Foygel et al., 2012), and structured submodular minimization (Jegelka et al., 2013), among others. A standard approach to solve (1) is via block-coordinate (gradient) descent (BCD), which forms a local quadratic model for a block of variables, and then solves a projection subproblem (Beck & Tetruashvili, 2013; Nesterov, 2012; Richtarik & Takac, 2015). However, for many problems, including the ones noted above, projection can be expensive (e.g., projecting onto the trace norm ball, onto base polyt... |

26 | An asynchronous parallel stochastic coordinate descent algorithm. arXiv preprint arXiv:1311.1873 - Liu, Wright, et al. - 2013 |

25 | Projection-free online learning - Hazan, Kale - 2012
Citation Context ...⊂ Rm. Unlike methods based on projection, FW uses just a linear oracle that solves minx∈M 〈x, g〉, which can be much simpler and faster than projection. This feature underlies the great popularity of FW, which Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). has by now witnessed several extensions such as regularized FW (Bredies et al., 2009; Harchaoui et al., 2015; Zhang et al., 2013), linearly convergent special cases (Garber & Hazan, 2013; Lacoste-Julien & Jaggi, 2015), stochastic versions (Hazan & Kale, 2012; Lafond et al., 2015; Ouyang & Gray, 2010), and a randomized block-coordinate FW (Lacoste-Julien et al., 2013). Despite this progress, parallel and distributed FW variants are barely known. We fill this gap and develop new asynchronous FW algorithms, for the particular setting where the constraint setM is block-separable; thus, we solve min x f(x) s.t. x = [x(1), ..., x(n)] ∈ n∏ i=1 Mi, (1) whereMi ⊂ Rmi (1 ≤ i ≤ n) is a compact convex set and x(i) are coordinate partitions of x. This setting for FW was considered in Lacoste-Julien et al. (2013), who introduced the Block-Coordinate Frank-Wolf... |

22 | Conditional gradient algorithms for norm-regularized smooth convex optimization. arXiv preprint arXiv:1302.2325 - Harchaoui, Juditsky, et al. - 2013
Citation Context ... al., 2008; Clarkson, 2010; Jaggi, 2011; 2013). The FW algorithm iteratively minimizes a smooth function f (typically convex) over a compact convex set M ⊂ Rm. Unlike methods based on projection, FW uses just a linear oracle that solves minx∈M 〈x, g〉, which can be much simpler and faster than projection. This feature underlies the great popularity of FW, which Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). has by now witnessed several extensions such as regularized FW (Bredies et al., 2009; Harchaoui et al., 2015; Zhang et al., 2013), linearly convergent special cases (Garber & Hazan, 2013; Lacoste-Julien & Jaggi, 2015), stochastic versions (Hazan & Kale, 2012; Lafond et al., 2015; Ouyang & Gray, 2010), and a randomized block-coordinate FW (Lacoste-Julien et al., 2013). Despite this progress, parallel and distributed FW variants are barely known. We fill this gap and develop new asynchronous FW algorithms, for the particular setting where the constraint setM is block-separable; thus, we solve min x f(x) s.t. x = [x(1), ..., x(n)] ∈ n∏ i=1 Mi, (1) whereMi ⊂ Rmi (1 ≤ i ≤ n) is a compact convex set and x... |

18 | A submodular function minimization algorithm based on the minimum-norm base - Fujishige, Isotani - 2011
Citation Context ...ce-norm based tensor completion (Liu et al., 2013), reduced rank nonparametric regression (Foygel et al., 2012), and structured submodular minimization (Jegelka et al., 2013), among others. A standard approach to solve (1) is via block-coordinate (gradient) descent (BCD), which forms a local quadratic model for a block of variables, and then solves a projection subproblem (Beck & Tetruashvili, 2013; Nesterov, 2012; Richtarik & Takac, 2015). However, for many problems, including the ones noted above, projection can be expensive (e.g., projecting onto the trace norm ball, onto base polytopes Fujishige & Isotani, 2011), and in some cases even computationally intractable (Collins et al., 2008). Frank-Wolfe methods excel in such scenarios as they rely Parallel and Distributed Block-Coordinate Frank-Wolfe Algorithms only on linear oracles that solve mins∈M〈s,∇f(·)〉. For M = ∏ iMi, this breaks into the n independent problems min s(i)∈Mi 〈s(i),∇(i)f(x)〉, 1 ≤ i ≤ n, (2) where∇(i) denotes the gradient w.r.t. coordinates x(i). It is obvious that these n subproblems can be solved in parallel (an idea dating back to at least as early as LeBlanc et al., 1975). However, having to update all the coordinates at each iter... |

18 | Sparse convex optimization methods for machine learning - Jaggi - 2011 |

16 | Accelerated training for matrix-norm regularization: A boosting approach - Zhang, Yu, et al. - 2012
Citation Context ... load-balancing on max-load bounds. • Insightful deterministic conditions under which minibatching provably improves the convergence rate for a class of problems (sometimes by orders of magnitude). • Experiments that demonstrate on real data how our algorithm solves a structural SVM problem several times faster than the state-of-the-art. In short, our results contribute towards making FW more attractive for big-data applications. To add perspective, we compare our methods to closely related works below; we refer the reader to Freund & Grigas (2014); Jaggi (2013); Lacoste-Julien et al. (2013); Zhang et al. (2012) for additional notes and references. BCFW and Structural SVM. Our algorithm AP-BCFW extends and generalizes BCFW to parallel computation. Our analysis follows the structure in (Lacoste-Julien et al., 2013), but uses different stepsizes that must be carefully chosen. Our results contain BCFW as a special case. Lacoste-Julien et al. (2013) primarily focus on more explicit (and stronger) guarantee for BCFW on structural SVM, while we mainly focus on a more general class of problems; the particular subroutine needed by structural SVM requires special treatment though (see Appendix C). Paralleliza... |

10 | A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666 - Garber, Hazan - 2013
Citation Context ...imizes a smooth function f (typically convex) over a compact convex set M ⊂ Rm. Unlike methods based on projection, FW uses just a linear oracle that solves minx∈M 〈x, g〉, which can be much simpler and faster than projection. This feature underlies the great popularity of FW, which Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). has by now witnessed several extensions such as regularized FW (Bredies et al., 2009; Harchaoui et al., 2015; Zhang et al., 2013), linearly convergent special cases (Garber & Hazan, 2013; Lacoste-Julien & Jaggi, 2015), stochastic versions (Hazan & Kale, 2012; Lafond et al., 2015; Ouyang & Gray, 2010), and a randomized block-coordinate FW (Lacoste-Julien et al., 2013). Despite this progress, parallel and distributed FW variants are barely known. We fill this gap and develop new asynchronous FW algorithms, for the particular setting where the constraint setM is block-separable; thus, we solve min x f(x) s.t. x = [x(1), ..., x(n)] ∈ n∏ i=1 Mi, (1) whereMi ⊂ Rmi (1 ≤ i ≤ n) is a compact convex set and x(i) are coordinate partitions of x. This setting for FW was considered in Laco... |

8 | Parameter server for distributed machine learning - Li, Zhou, et al. - 2013 |

8 | Fast stochastic Frank-Wolfe algorithms for nonlinear SVMs - Ouyang, Gray - 2010
Citation Context ... FW uses just a linear oracle that solves minx∈M 〈x, g〉, which can be much simpler and faster than projection. This feature underlies the great popularity of FW, which Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). has by now witnessed several extensions such as regularized FW (Bredies et al., 2009; Harchaoui et al., 2015; Zhang et al., 2013), linearly convergent special cases (Garber & Hazan, 2013; Lacoste-Julien & Jaggi, 2015), stochastic versions (Hazan & Kale, 2012; Lafond et al., 2015; Ouyang & Gray, 2010), and a randomized block-coordinate FW (Lacoste-Julien et al., 2013). Despite this progress, parallel and distributed FW variants are barely known. We fill this gap and develop new asynchronous FW algorithms, for the particular setting where the constraint setM is block-separable; thus, we solve min x f(x) s.t. x = [x(1), ..., x(n)] ∈ n∏ i=1 Mi, (1) whereMi ⊂ Rmi (1 ≤ i ≤ n) is a compact convex set and x(i) are coordinate partitions of x. This setting for FW was considered in Lacoste-Julien et al. (2013), who introduced the Block-Coordinate Frank-Wolfe (BCFW) method. Such problems arise in man... |

6 | Petuum: A framework for iterative-convergent distributed ML. arXiv preprint arXiv:1312.7651 - Dai, Wei, et al. - 2013
Citation Context ...ate x(k) = x(k−1) + γk ∑i∈S (s[i] − x(k−1)[i]) with γk = 2nτ/(τ²k + 2n), or via line-search. 3. Broadcast x(k) (or just x(k) − x(k−1)) to O. 4. Break if converged. end for Output: x(k). For the shared-memory model, the computational work is divided amongst worker threads, each of which has access to a pool of coordinates that it may work on, as well as to the shared parameters. This setup matches the system assumptions in (Liu et al., 2014; Niu et al., 2011; Richtarik & Takac, 2015), and most modern multicore machines permit such an arrangement. On a distributed system, the parameter server (Dai et al., 2013; Li et al., 2013) broadcasts the most recent parameter vector periodically to each worker and workers keep sending updates to the parameter vector after solving the subroutines corresponding to a randomly chosen parameter. In either setting, we do not wait for slower workers or synchronize the parameters at any point of the algorithm, therefore many updates sent from the workers could be calculated based on a delayed parameter. For convenience, we treat the pool of all workers as a single “cloud” oracle O that keeps sending updates of form {i, s(i)} to the server, where i selects a block and ... |
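The server-side mini-batch step quoted in this excerpt can be sketched as below. All names are ours, not the paper's code; the step size γk = 2nτ/(τ²k + 2n) is the one stated for Algorithm 1, with line-search as the alternative the excerpt mentions:

```python
import numpy as np

def server_update(x_blocks, updates, k, n, tau):
    """Apply one mini-batch of oracle answers {(i, s_i)} at iteration k.

    Each selected block moves toward its oracle point with the
    mini-batch step size gamma_k = 2*n*tau / (tau^2 * k + 2*n).
    """
    gamma = 2.0 * n * tau / (tau ** 2 * k + 2.0 * n)
    for i, s_i in updates:  # updates arriving from the "cloud" oracle O
        x_blocks[i] = x_blocks[i] + gamma * (s_i - x_blocks[i])
    return x_blocks

# Toy run: n = 4 blocks, mini-batch of tau = 2 oracle answers.
n, tau = 4, 2
x = [np.zeros(2) for _ in range(n)]
updates = [(0, np.array([1.0, 0.0])), (3, np.array([0.0, 1.0]))]
x = server_update(x, updates, k=3, n=n, tau=tau)  # gamma = 16/20 = 0.8 here
```

Because updates are applied as they arrive, an answer s_i may have been computed from a stale x — the delayed-parameter situation the excerpt describes; untouched blocks are left exactly as they were.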

5 | Coordinate descent with arbitrary sampling i: Algorithms and complexity. arXiv preprint arXiv:1412.8060, - Qu, Richtarik - 2014 |

4 | New Analysis and Results for the Frank-Wolfe Method. ArXiv e-prints - Freund, Grigas - 2013
Citation Context ...lay is actually bounded, we show stronger results using results from load-balancing on max-load bounds. • Insightful deterministic conditions under which minibatching provably improves the convergence rate for a class of problems (sometimes by orders of magnitude). • Experiments that demonstrate on real data how our algorithm solves a structural SVM problem several times faster than the state-of-the-art. In short, our results contribute towards making FW more attractive for big-data applications. To add perspective, we compare our methods to closely related works below; we refer the reader to Freund & Grigas (2014); Jaggi (2013); Lacoste-Julien et al. (2013); Zhang et al. (2012) for additional notes and references. BCFW and Structural SVM. Our algorithm AP-BCFW extends and generalizes BCFW to parallel computation. Our analysis follows the structure in (Lacoste-Julien et al., 2013), but uses different stepsizes that must be carefully chosen. Our results contain BCFW as a special case. Lacoste-Julien et al. (2013) primarily focus on more explicit (and stronger) guarantee for BCFW on structural SVM, while we mainly focus on a more general class of problems; the particular subroutine needed by structural SV... |

4 | On model parallelization and scheduling strategies for distributed machine learning - Lee, Kim, et al. - 2014
Citation Context ...t al., 2013), but uses different stepsizes that must be carefully chosen. Our results contain BCFW as a special case. Lacoste-Julien et al. (2013) primarily focus on more explicit (and stronger) guarantee for BCFW on structural SVM, while we mainly focus on a more general class of problems; the particular subroutine needed by structural SVM requires special treatment though (see Appendix C). Parallelization of sequential algorithms. The idea of parallelizing sequential optimization algorithms is not new. It dates back to (Tsitsiklis et al., 1986) for stochastic gradient methods; more recently Lee et al. (2014); Liu et al. (2014); Richtarik & Takac (2015) study parallelization of BCD. The conditions under which these parallel BCD methods succeed, e.g., expected separable overapproximation (ESO), and coordinate Lipschitz conditions, bear a close resemblance to our conditions in Section 3.2, but are not the same due to differences in how solutions are updated and what subproblems arise. In particular, our conditions are affine invariant. We provide detailed comparisons to parallel coordinate descent in Appendix D.5. Asynchronous algorithms. Asynchronous algorithms that allow delayed parameter updat... |

4 | Polar operators for structured sparse estimation - Zhang, Yu, et al. - 2013
Citation Context ...10; Jaggi, 2011; 2013). The FW algorithm iteratively minimizes a smooth function f (typically convex) over a compact convex set M ⊂ Rm. Unlike methods based on projection, FW uses just a linear oracle that solves minx∈M 〈x, g〉, which can be much simpler and faster than projection. This feature underlies the great popularity of FW, which Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). has by now witnessed several extensions such as regularized FW (Bredies et al., 2009; Harchaoui et al., 2015; Zhang et al., 2013), linearly convergent special cases (Garber & Hazan, 2013; Lacoste-Julien & Jaggi, 2015), stochastic versions (Hazan & Kale, 2012; Lafond et al., 2015; Ouyang & Gray, 2010), and a randomized block-coordinate FW (Lacoste-Julien et al., 2013). Despite this progress, parallel and distributed FW variants are barely known. We fill this gap and develop new asynchronous FW algorithms, for the particular setting where the constraint setM is block-separable; thus, we solve min x f(x) s.t. x = [x(1), ..., x(n)] ∈ n∏ i=1 Mi, (1) whereMi ⊂ Rmi (1 ≤ i ≤ n) is a compact convex set and x(i) are coordinate pa... |

3 | Group fused lasso - Alaíz, Barbero, et al. - 2013

3 | On the Global Linear Convergence of Frank-Wolfe Optimization Variants. - Lacoste-Julien, Jaggi - 2015 |

3 | The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems - Mitzenmacher - 2001