Results 1–9 of 9
Distributed Frank-Wolfe algorithm: A unified framework for communication-efficient sparse learning. arXiv preprint arXiv:1404.2644, 2014
Abstract

Cited by 2 (1 self)
Learning sparse combinations is a frequent theme in machine learning. In this paper, we study its associated optimization problem in the distributed setting where the elements to be combined are not centrally located but spread over a network. We address the key challenges of balancing communication costs and optimization errors. To this end, we propose a distributed Frank-Wolfe (dFW) algorithm. We obtain theoretical guarantees on the optimization error and communication cost that do not depend on the total number of combining elements. We further show that the communication cost of dFW is optimal by deriving a lower bound on the communication cost required to construct an approximate solution. We validate our theoretical analysis with empirical studies on synthetic and real-world data, which demonstrate that dFW outperforms both baselines and competing methods. We also study the performance of dFW when the conditions of our analysis are relaxed, and show that dFW is fairly robust.
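The communication pattern this abstract describes can be sketched on a toy problem: a least-squares objective over the probability simplex, with the atoms (columns) partitioned across simulated nodes. All names, the objective, and the single-coordinator topology are illustrative assumptions, not the paper's exact setup; in particular the gradient is computed centrally here for brevity.

```python
import numpy as np

# Toy dFW-style simulation: minimize f(α) = ½‖Aα − b‖² over the simplex,
# with the atoms (columns of A) split across simulated nodes. Each round,
# every node reports only its best local atom — one (index, value) pair —
# so per-round communication is independent of the number of atoms.
rng = np.random.default_rng(0)
d, n_atoms, n_nodes = 20, 30, 3
A = rng.normal(size=(d, n_atoms))
b = rng.normal(size=d)
node_atoms = np.array_split(np.arange(n_atoms), n_nodes)

alpha = np.zeros(n_atoms)
alpha[0] = 1.0                                  # start at a simplex vertex
for t in range(100):
    grad = A.T @ (A @ alpha - b)                # ∇f(α), centralized for brevity
    # each node's local linear-minimization oracle over its own atoms
    local_best = [(idx[np.argmin(grad[idx])], grad[idx].min())
                  for idx in node_atoms]
    j_star = min(local_best, key=lambda p: p[1])[0]  # coordinator picks winner
    step = 2.0 / (t + 2)                        # standard Frank-Wolfe step size
    alpha *= 1.0 - step
    alpha[j_star] += step
```

Because each update mixes the current iterate with a single vertex, the iterate stays in the simplex and its support grows by at most one atom per round.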
Iteration bounds for finding ε-stationary points of structured nonconvex optimization. Working Paper, 2014
Abstract

Cited by 1 (0 self)
In this paper we study proximal conditional-gradient (CG) and proximal gradient-projection type algorithms for a block-structured constrained nonconvex optimization model, which arises naturally from tensor data analysis. First, we introduce a new notion of stationarity, which is suitable for the structured problem under consideration. We then propose two types of first-order algorithms for the model, based on the proximal conditional-gradient (CG) method and the proximal gradient-projection method respectively. If the nonconvex objective function is in the form of a mathematical expectation, we then discuss how to incorporate randomized sampling to avoid computing the expectations exactly. For the general block optimization model, the proximal subroutines are performed for each block according to either the block-coordinate-descent (BCD) or the maximum-block-improvement (MBI) updating rule. If the gradient of the nonconvex part of the objective f satisfies ‖∇f(x) − ∇f(y)‖_q ≤ M‖x − y‖_p^δ, where δ = p/q with 1/p + 1/q = 1, then we prove that the new algorithms have an overall iteration complexity bound of O(1/ε^q) for finding an ε-stationary solution. If f is concave, then the iteration complexity reduces to O(1/ε). Our numerical experiments for tensor approximation problems show promising performance of the new solution algorithms.
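One standard way to make "ε-stationarity" concrete for a constrained problem is the Frank-Wolfe gap, which has a closed form over the simplex. This is a generic measure used for illustration, not necessarily the new stationarity notion the abstract introduces:

```python
import numpy as np

def fw_gap_simplex(grad, x):
    """Frank-Wolfe gap over the probability simplex Δ:
    g(x) = max_{s ∈ Δ} ⟨grad, x − s⟩ = ⟨grad, x⟩ − min_j grad_j.
    Always nonnegative; g(x) ≤ ε certifies an ε-stationary point."""
    return float(grad @ x - grad.min())

# x = e_0 is stationary exactly when coordinate 0 has the smallest gradient entry
fw_gap_simplex(np.array([1.0, 3.0]), np.array([1.0, 0.0]))   # 0.0
fw_gap_simplex(np.array([3.0, 1.0]), np.array([1.0, 0.0]))   # 2.0
```

The closed form follows because a linear function over the simplex is minimized at a vertex, so the inner maximization reduces to picking the smallest gradient coordinate.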
Hybrid conditional gradient-smoothing algorithms with applications to sparse and low-rank regularization. In Regularization, Optimization, Kernels, and Support Vector Machines, 2014
Abstract

Cited by 1 (0 self)
Conditional gradient methods are old and well-studied optimization algorithms. Their origin dates at least to the 1950s and the Frank-Wolfe algorithm for quadratic programming [18], but they apply to much more general optimization problems. General formulations of conditional gradient algorithms have been studied in the ...
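The classical Frank-Wolfe iteration for a quadratic program that this abstract refers to can be sketched with an exact line search; the problem data below is randomly generated for illustration:

```python
import numpy as np

# Frank-Wolfe for the quadratic program min ½xᵀQx − cᵀx over the simplex,
# with exact line search (random positive-definite data for illustration).
rng = np.random.default_rng(1)
n = 10
M = rng.normal(size=(n, n))
Q = M @ M.T + np.eye(n)                         # positive definite
c = rng.normal(size=n)

x = np.full(n, 1.0 / n)                         # start at the simplex centre
for _ in range(200):
    g = Q @ x - c                               # gradient
    s = np.zeros(n)
    s[np.argmin(g)] = 1.0                       # linear minimization oracle: best vertex
    d = s - x
    curv = d @ Q @ d
    # exact line search: minimize the 1-D quadratic f(x + γd) over γ ∈ [0, 1]
    gamma = np.clip(-(g @ d) / curv, 0.0, 1.0) if curv > 0 else 1.0
    x = x + gamma * d                           # convex combination stays feasible
```

With exact line search the objective is nonincreasing at every iteration, which is what makes the method attractive when projection onto the feasible set is expensive but linear minimization is cheap.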
Solving Variational Inequalities with Monotone Operators on Domains Given by Linear Minimization Oracles, 2015
Forward-Backward Greedy Algorithms for Atomic Norm Regularization, 2014
Abstract
In many signal processing applications, one aims to reconstruct a signal that has a simple representation with respect to a certain basis or frame. Fundamental elements of the basis known as "atoms" allow us to define "atomic norms" that can be used to construct convex regularizers for the reconstruction problem. Efficient algorithms are available to solve the reconstruction problems in certain special cases, but an approach that works well for general atomic norms remains to be found. This paper describes an optimization algorithm called CoGEnT, which produces solutions with succinct atomic representations for reconstruction problems, generally formulated with atomic-norm constraints. CoGEnT combines a greedy selection scheme based on the conditional gradient approach with a backward (or "truncation") step that exploits the quadratic nature of the objective to reduce the basis size. We establish convergence properties and validate the algorithm via extensive numerical experiments on a suite of signal processing applications. Our algorithm and analysis are also novel in that they allow for inexact forward steps. In practice, CoGEnT significantly outperforms the basic conditional gradient method, and indeed many methods that are tailored to specific applications, when the truncation steps are defined appropriately. We also introduce several novel applications that are enabled by the atomic-norm framework, including tensor completion, moment problems in signal processing, and graph deconvolution.
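The forward-selection-plus-truncation structure the abstract describes can be sketched on sparse least squares with coordinate atoms. This is an orthogonal-matching-pursuit-style simplification for illustration, with made-up data and thresholds, not CoGEnT itself:

```python
import numpy as np

# Forward: add the atom most correlated with the residual (a conditional-
# gradient-style selection); refit on the active set; backward: drop an
# atom if removing it barely increases the objective.
rng = np.random.default_rng(2)
n, p, k = 40, 60, 3
A = rng.normal(size=(n, p))
x_true = np.zeros(p)
x_true[:k] = [3.0, -2.0, 1.5]                   # sparse ground truth
b = A @ x_true + 0.01 * rng.normal(size=n)

def refit(cols):
    """Least-squares refit restricted to the active atom set."""
    sol, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
    z = np.zeros(p)
    z[cols] = sol
    return z

active, x = [], np.zeros(p)
for _ in range(10):
    grad = A.T @ (A @ x - b)
    j = int(np.argmax(np.abs(grad)))            # forward: most violated atom
    if j not in active:
        active.append(j)
    x = refit(active)
    obj = 0.5 * np.linalg.norm(A @ x - b) ** 2
    for a in list(active):                      # backward: try dropping atoms
        rest = [i for i in active if i != a]
        if not rest:
            continue
        x_r = refit(rest)
        if 0.5 * np.linalg.norm(A @ x_r - b) ** 2 - obj < 1e-4:
            active, x = rest, x_r               # drop atom: objective barely hurt
            break
```

The truncation threshold (1e-4 here) is tuned to this toy data; the point is only that a cheap backward pass keeps the basis small without undoing the forward progress.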
Abstract
We extend the well-known BFGS quasi-Newton method and its limited-memory variant L-BFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We apply the resulting subLBFGS algorithm to L2-regularized risk minimization with the binary hinge loss. To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. We prove its worst-case time complexity bounds, and show that it can also extend a recently developed bundle method to the multiclass and multilabel settings. We also apply the direction-finding component of our algorithm to L1-regularized risk minimization with logistic loss. In all these contexts our methods perform comparably to or better than specialized state-of-the-art solvers on a number of publicly available datasets. Open source software implementing our algorithms is freely available for download.
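The basic ingredient such methods work with is a subgradient of the regularized hinge risk. A sketch, paired with plain diminishing-step subgradient descent on made-up separable data; this is the simple baseline such quasi-Newton methods are compared against, not the subLBFGS method itself:

```python
import numpy as np

def hinge_risk_subgrad(w, X, y, lam):
    """J(w) = (λ/2)‖w‖² + (1/m) Σ_i max(0, 1 − y_i xᵢᵀw) and one
    subgradient of J at w (choosing 0 at the hinge's kink)."""
    margins = y * (X @ w)
    inside = margins < 1.0                      # examples inside the margin
    risk = 0.5 * lam * (w @ w) + np.maximum(0.0, 1.0 - margins).mean()
    g = lam * w - (X[inside] * y[inside, None]).sum(axis=0) / len(y)
    return risk, g

# Diminishing-step subgradient descent on separable toy data.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = np.sign(X @ rng.normal(size=5))             # labels from a random hyperplane
w = np.zeros(5)
for t in range(300):
    risk, g = hinge_risk_subgrad(w, X, y, lam=0.01)
    w -= 0.5 / np.sqrt(t + 1.0) * g             # step size ∝ 1/√t
```

The nondifferentiability at margin 1 is exactly what forces the subdifferential generalizations the abstract mentions: at the kink any convex combination of the two one-sided gradients is a valid subgradient.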
A Distributed Frank-Wolfe Algorithm for Communication-Efficient Sparse Learning
Abstract
Learning sparse combinations is a frequent theme in machine learning. In this paper, we study its associated optimization problem in the distributed setting where the elements to be combined are not centrally located but spread over a network. We address the key challenges of balancing communication costs and optimization errors. To this end, we propose a distributed Frank-Wolfe (dFW) algorithm. We obtain theoretical guarantees on the optimization error and communication cost that do not depend on the total number of combining elements. We further show that the communication cost of dFW is optimal by deriving a lower bound on the communication cost required to construct an approximate solution. We validate our theoretical analysis with empirical studies on synthetic and real-world data, which demonstrate that dFW outperforms both baselines and competing methods. We also study the performance of dFW when the conditions of our analysis are relaxed, and show that dFW is fairly robust.
Similarity Learning for High-Dimensional Sparse Data
Abstract
A good measure of similarity between data points is crucial to many tasks in machine learning. Similarity and metric learning methods learn such measures automatically from data, but they do not scale well with respect to the dimensionality of the data. In this paper, we propose a method that can efficiently learn a similarity measure from high-dimensional sparse data. The core idea is to parameterize the similarity measure as a convex combination of rank-one matrices with specific sparsity structures. The parameters are then optimized with an approximate Frank-Wolfe procedure to maximally satisfy relative similarity constraints on the training data. Our algorithm greedily incorporates one pair of features at a time into the similarity measure, providing an efficient way to control the number of active features and thus reduce overfitting. It enjoys very appealing convergence guarantees, and its time and memory complexity depend on the sparsity of the data instead of the dimension of the feature space. Our experiments on real-world high-dimensional datasets demonstrate its potential for classification, dimensionality reduction and data exploration.
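The parameterization the abstract describes, a similarity built from sparse rank-one matrices, can be sketched with a dictionary of active feature pairs; the storage format and example values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def similarity(x, xp, atoms):
    """Bilinear similarity s(x, x′) = xᵀ S x′ with
    S = Σ α_{ij} · e_i e_jᵀ stored sparsely as {(i, j): α_{ij}}.
    Cost scales with the number of active feature pairs,
    not with the ambient feature dimension."""
    return sum(a * x[i] * xp[j] for (i, j), a in atoms.items())

# Two active feature pairs; weights form a convex combination (they sum to 1),
# as produced by a Frank-Wolfe procedure over the simplex of rank-one atoms.
atoms = {(0, 1): 0.7, (2, 2): 0.3}
x  = np.array([1.0, 0.0, 2.0, 0.0])
xp = np.array([0.0, 3.0, 1.0, 0.0])
similarity(x, xp, atoms)                        # 0.7·1·3 + 0.3·2·1 = 2.7
```

A greedy Frank-Wolfe step in this setting adds one (i, j) pair at a time to the dictionary, which is why the number of active features stays small regardless of the ambient dimension.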