Results 1–10 of 16
Revisiting Frank-Wolfe: Projection-free sparse convex optimization
In ICML, 2013
Abstract

Cited by 86 (2 self)
We provide stronger and more general primal-dual convergence results for Frank-Wolfe-type algorithms (a.k.a. conditional gradient) for constrained convex optimization, enabled by a simple framework of duality gap certificates. Our analysis also holds if the linear subproblems are only solved approximately (as well as if the gradients are inexact), and is proven to be worst-case optimal in the sparsity of the obtained solutions. On the application side, this allows us to unify a large variety of existing sparse greedy methods, in particular for optimization over convex hulls of an atomic set, even if those sets can only be approximated, including sparse (or structured sparse) vectors or matrices, low-rank matrices, permutation matrices, or max-norm bounded matrices. We present a new general framework for convex optimization over matrix factorizations, where every Frank-Wolfe iteration will consist of a low-rank update, and discuss the broad application areas of this approach.
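The Frank-Wolfe iteration with the duality-gap certificate described in the abstract above can be sketched over the probability simplex, where the linear subproblem is solved by picking a vertex (the objective and sizes below are illustrative, not from the paper):

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, n_iters=200):
    """Frank-Wolfe over the probability simplex with a duality-gap certificate.

    The linear subproblem argmin_{s in simplex} <grad(x), s> is solved by
    picking the vertex e_i with the smallest gradient coordinate.
    """
    x = x0.copy()
    for k in range(n_iters):
        g = grad(x)
        i = np.argmin(g)                 # LMO: best simplex vertex
        s = np.zeros_like(x)
        s[i] = 1.0
        gap = g @ (x - s)                # duality-gap certificate (>= primal gap)
        if gap < 1e-8:
            break
        gamma = 2.0 / (k + 2)            # standard step size
        x = (1 - gamma) * x + gamma * s  # convex update: iterate stays feasible
    return x, gap

# Illustrative use: project a point b onto the simplex, i.e. min ||x - b||^2
b = np.array([0.2, 0.5, 0.9, -0.1])
x, gap = frank_wolfe_simplex(lambda t: 2 * (t - b), np.ones(4) / 4)
```

Note the iterate is always a sparse convex combination of vertices, which is the sparsity property the abstract refers to.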
A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization
, 2013
Abstract

Cited by 11 (2 self)
Linear optimization is often algorithmically simpler than nonlinear convex optimization. Linear optimization over matroid polytopes, matching polytopes and path polytopes are examples of problems for which we have simple and efficient combinatorial algorithms, but whose nonlinear convex counterparts are harder and admit significantly less efficient algorithms. This motivates the computational model of convex optimization, including the offline, online and stochastic settings, using a linear optimization oracle. In this computational model we give several new results that improve over the previous state-of-the-art. Our main result is a novel conditional gradient algorithm for smooth and strongly convex optimization over polyhedral sets that performs only a single linear optimization step over the domain on each iteration and enjoys a linear convergence rate. This gives an exponential improvement in convergence rate over previous results. Based on this new conditional gradient algorithm we give the first algorithms for online convex optimization over polyhedral sets that perform only a single linear optimization step over the domain while having optimal regret guarantees, answering an open question of Kalai and Vempala, and Hazan and Kale. Our online algorithms also imply conditional gradient algorithms for nonsmooth and stochastic convex optimization with the same convergence rates as projected (sub)gradient methods. Key words. Frank-Wolfe algorithm; conditional gradient methods; linear programming; first-order methods; online convex optimization; online learning; stochastic optimization AMS subject classifications. 65K05; 90C05; 90C06; 90C25; 90C30; 90C27; 90C15
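The abstract's premise, that linear optimization over combinatorial polytopes is easy, can be made concrete: minimizing a linear function over the spanning-tree (matroid base) polytope reduces to the greedy algorithm. A minimal sketch (the graph below is illustrative):

```python
def kruskal_min_tree(n, weighted_edges):
    """Return a minimum-weight spanning tree of an n-vertex graph: exactly the
    vertex of the spanning-tree polytope minimizing <w, x> for edge weights w."""
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    tree = []
    for w, u, v in sorted(weighted_edges):  # greedy: cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                        # edge keeps independence (no cycle)
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Edges as (weight, u, v) on a small 4-vertex graph
edges = [(4, 0, 1), (1, 1, 2), (3, 0, 2), (2, 2, 3)]
tree = kruskal_min_tree(4, edges)
```

This linear oracle is the only primitive the conditional gradient algorithms in the paper require.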
The complexity of large-scale convex programming under a linear optimization oracle.
, 2013
Abstract

Cited by 11 (1 self)
This paper considers a general class of iterative optimization algorithms, referred to as linear-optimization-based convex programming (LCP) methods, for solving large-scale convex programming (CP) problems. The LCP methods, covering the classic conditional gradient (CG) method (a.k.a. Frank-Wolfe method) as a special case, can only solve a linear optimization subproblem at each iteration. In this paper, we first establish a series of lower complexity bounds for the LCP methods to solve different classes of CP problems, including smooth, nonsmooth and certain saddle-point problems. We then formally establish the theoretical optimality or near-optimality, in the large-scale case, of the CG method and its variants for solving different classes of CP problems. We also introduce several new optimal LCP methods, obtained by properly modifying Nesterov's accelerated gradient method, and demonstrate their possible advantages over the classic CG for solving certain classes of large-scale CP problems.
Duality between subgradient and conditional gradient methods
HAL-00861118, version 1, 12 Sep 2013
, 2013
Abstract

Cited by 8 (2 self)
Given a convex optimization problem and its dual, there are many possible first-order ...
Optimally-Weighted Herding is Bayesian Quadrature
Abstract

Cited by 7 (0 self)
Herding and kernel herding are deterministic methods of choosing samples which summarise a probability distribution. A related task is choosing samples for estimating integrals using Bayesian quadrature. We show that the criterion minimised when selecting samples in kernel herding is equivalent to the posterior variance in Bayesian quadrature. We then show that sequential Bayesian quadrature can be viewed as a weighted version of kernel herding which achieves performance superior to any other weighted herding method. We demonstrate empirically a rate of convergence faster than O(1/N). Our results also imply an upper bound on the empirical error of the Bayesian quadrature estimate.
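A numerical sketch of the equivalence the abstract states: for a kernel with a closed-form mean embedding, the Bayesian quadrature weights w = K⁻¹z minimize exactly the quadratic (posterior variance) that herding minimizes with uniform weights. The kernel, target and sample locations below are illustrative assumptions, not from the paper:

```python
import numpy as np

# 1-D setup: RBF kernel with lengthscale ell, target p = N(0, s2). The kernel
# mean embedding z_i = E_p[k(x_i, X)] has a closed form for this pair.
ell, s2 = 1.0, 1.0
X = np.array([-2.0, -0.5, 0.0, 0.7, 1.8])           # fixed sample locations

K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell**2)
z = ell / np.sqrt(ell**2 + s2) * np.exp(-0.5 * X**2 / (ell**2 + s2))

w = np.linalg.solve(K, z)                           # BQ weights (non-uniform)

# Posterior variance of the integral estimate; kernel herding minimizes the
# same quantity but is restricted to uniform weights 1/N.
kk = ell / np.sqrt(ell**2 + 2 * s2)                 # E_p E_p k(X, X')
var_bq = kk - z @ w
var_herding = kk - 2 * np.mean(z) + np.mean(K)
```

Since w minimizes the quadratic over all weight vectors, var_bq ≤ var_herding for any sample set, which is the "superior to any weighted herding" claim in miniature.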
Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels. ICML
, 2014
Abstract

Cited by 4 (1 self)
We consider the problem of improving the efficiency of randomized Fourier feature maps to accelerate training and testing speed of kernel methods on large data sets. These approximate feature maps arise as Monte Carlo approximations to integral representations of shift-invariant kernel functions (e.g., the Gaussian kernel). In this paper, we propose to use Quasi-Monte Carlo (QMC) approximations instead, where the relevant integrands are evaluated on a low-discrepancy sequence of points as opposed to random point sets as in the Monte Carlo approach. We derive a new discrepancy measure called box discrepancy based on theoretical characterizations of the integration error with respect to a given sequence. We then propose to learn QMC sequences adapted to our setting based on explicit box discrepancy minimization. Our theoretical analyses are complemented with empirical results that demonstrate the effectiveness of classical and adaptive QMC techniques for this problem.
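A sketch of the construction under the stated idea, assuming a Gaussian kernel, a scrambled Halton sequence, and paired cosine/sine features (all illustrative choices, not the paper's learned sequences):

```python
import numpy as np
from scipy.stats import norm, qmc

# QMC feature map for the Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2):
# draw frequencies from a low-discrepancy Halton sequence pushed through the
# Gaussian inverse CDF, instead of i.i.d. normal samples.
d, D = 2, 512
halton = qmc.Halton(d=d, scramble=True, seed=0).random(D)  # low-discrepancy points
W = norm.ppf(halton)                                       # map to N(0, I) frequencies

def features(x):
    # paired cos/sin features estimate the kernel without random phase shifts
    wx = W @ x
    return np.concatenate([np.cos(wx), np.sin(wx)]) / np.sqrt(D)

x, y = np.array([0.3, -0.4]), np.array([-0.2, 0.5])
approx = features(x) @ features(y)
exact = np.exp(-0.5 * np.sum((x - y) ** 2))
```

The inner product of the feature maps averages cos(wᵀ(x − y)) over the frequency points, i.e. it is a QMC estimate of the kernel's Fourier integral.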
Conditional gradient sliding for convex optimization
, 2014
Abstract

Cited by 2 (0 self)
In this paper, we present a new conditional gradient type method for convex optimization by utilizing a linear optimization (LO) oracle to minimize a series of linear functions over the feasible set. Different from the classic conditional gradient method, the conditional gradient sliding (CGS) algorithm developed herein can skip the computation of gradients from time to time, and as a result, can achieve the optimal complexity bounds in terms of not only the number of calls to the LO oracle, but also the number of gradient evaluations. More specifically, we show that the CGS method requires O(1/√ε) and O(log(1/ε)) gradient evaluations, respectively, for solving smooth and strongly convex problems, while still maintaining the optimal O(1/ε) bound on the number of calls to the LO oracle. We also develop variants of the CGS method which can achieve the optimal complexity bounds for solving stochastic optimization problems and an important class of saddle point optimization problems. To the best of our knowledge, this is the first time that these types of projection-free optimal first-order methods have been developed in the literature. Some preliminary numerical results have also been provided to demonstrate the advantages of the CGS method.
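The gradient-skipping idea rests on an inner primitive: approximately minimizing a linear-plus-proximal model with Frank-Wolfe steps until the duality (Wolfe) gap drops below a tolerance. A sketch of that primitive over the simplex (parameters illustrative; this is the inner loop only, not the full CGS method):

```python
import numpy as np

def approx_projection(g, u, beta, eta, max_iters=2000):
    """Approximately minimize phi(x) = <g, x> + (beta/2) ||x - u||^2 over the
    probability simplex with Frank-Wolfe, stopping once the Wolfe gap <= eta.
    This is the projection surrogate used inside sliding-type methods; the
    fixed gradient g is why no new gradient evaluations are needed here."""
    x = np.ones_like(u) / u.size
    for k in range(max_iters):
        grad = g + beta * (x - u)            # gradient of the quadratic model
        i = np.argmin(grad)
        s = np.zeros_like(x)
        s[i] = 1.0
        if grad @ (x - s) <= eta:            # gap certificate met: stop early
            break
        gamma = 2.0 / (k + 2)
        x = (1 - gamma) * x + gamma * s
    return x

x = approx_projection(np.array([1.0, -1.0, 0.5]), np.ones(3) / 3,
                      beta=1.0, eta=1e-2)
```

Each call answers a "projection" query using only LO-oracle steps, which is how CGS trades gradient evaluations for cheap linear minimizations.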
A Greedy Framework for FirstOrder Optimization
Abstract

Cited by 1 (0 self)
Introduction. Recent work has shown many connections between conditional gradient and other first-order optimization methods, such as herding [3] and subgradient descent [2]. By considering a type of proximal conditional method, which we call boosted mirror descent (BMD), we are able to unify all of these algorithms into a single framework, which can be interpreted as taking successive argmins of a sequence of surrogate functions. Using a standard online learning analysis based on ...
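Boosted mirror descent itself is not spelled out in this snippet; as background for the mirror-descent ingredient it names, here is standard entropic mirror descent (exponentiated gradient) on the simplex, with an illustrative objective and step size (not taken from the paper):

```python
import numpy as np

def entropic_mirror_descent(grad, x0, eta=1.0, n_iters=3000):
    """Mirror descent with the entropy mirror map on the probability simplex:
    multiplicative update followed by renormalization (exponentiated gradient)."""
    x = x0.copy()
    for _ in range(n_iters):
        x = x * np.exp(-eta * grad(x))   # mirror step in the entropy geometry
        x /= x.sum()                     # Bregman projection back to the simplex
    return x

# Illustrative objective f(x) = 0.5 * ||x - p||^2 with p already in the simplex,
# so the minimizer over the simplex is p itself.
p = np.array([0.1, 0.6, 0.3])
x = entropic_mirror_descent(lambda t: t - p, np.ones(3) / 3)
```

The multiplicative form keeps iterates strictly positive and feasible, which is the property the surrogate-function view in the abstract exploits.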
Measuring Sample Quality with Stein's Method
, 2015
Abstract

Cited by 1 (0 self)
To improve the efficiency of Monte Carlo estimation, practitioners are turning to biased Markov chain Monte Carlo procedures that trade off asymptotic exactness for computational speed. The reasoning is sound: a reduction in variance due to more rapid sampling can outweigh the bias introduced. However, the inexactness creates new challenges for sampler and parameter selection, since standard measures of sample quality like effective sample size do not account for asymptotic bias. To address these challenges, we introduce a new computable quality measure based on Stein's method that bounds the discrepancy between sample and target expectations over a large class of test functions. We use our tool to compare exact, biased, and deterministic sample sequences and illustrate applications to hyperparameter selection, convergence rate assessment, and quantifying bias-variance tradeoffs in posterior inference.
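The paper's discrepancy is defined via an optimization over a class of test functions; as a small illustrative stand-in for the same Stein idea, the sketch below uses a kernelized Stein discrepancy against N(0, 1), whose "Stein kernel" is available in closed form (the RBF kernel and unit bandwidth are assumptions, not the paper's construction):

```python
import numpy as np

def ksd(x):
    """Kernelized Stein discrepancy (V-statistic) of samples x against N(0, 1),
    using an RBF kernel with unit bandwidth; the target's score is s(x) = -x."""
    d = x[:, None] - x[None, :]
    k = np.exp(-0.5 * d**2)
    # Stein kernel u(x,y) = s(x)s(y)k + s(x)dk/dy + s(y)dk/dx + d2k/dxdy,
    # which simplifies to k * (xy + 1 - 2(x-y)^2) for this kernel and target.
    u = k * (np.outer(x, x) + 1.0 - 2.0 * d**2)
    return np.sqrt(np.mean(u))

rng = np.random.default_rng(0)
good = rng.normal(0.0, 1.0, size=300)    # exact samples from the target
biased = good + 2.0                      # a crudely biased sample sequence
```

Unlike effective sample size, this quality measure penalizes the biased sequence even though its internal variance is identical to the exact one.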
Wolfe's universal algorithm
, 2013
Abstract
Wolfe's universal algorithm (www.di.ens.fr/~fbach/wolfe_anonymous.pdf)
Conditional gradients everywhere
• Conditional gradient and subgradient method
– Fenchel duality
– Generalized conditional gradient and mirror descent
• Conditional gradient and greedy algorithms
– Relationship with basis pursuit, matching pursuit
• Conditional gradient and herding
– Properties of conditional gradient iterates
– Relationships with sampling
Composite optimization problems: min_{x ∈ R^p} h(x) + f(Ax)
• Assumptions
– f: R^n → R Lipschitz-continuous ⇒ f* has compact support C