Results 1 - 10 of 14
Revisiting Frank-Wolfe: Projection-free sparse convex optimization
- In ICML, 2013
"... We provide stronger and more general primal-dual convergence results for Frank-Wolfe-type algorithms (a.k.a. conditional gradient) for constrained convex optimization, enabled by a simple framework of duality gap certificates. Our analysis also holds if the linear subproblems are only solved approxi ..."
Abstract
-
Cited by 76 (2 self)
- Add to MetaCart
(Show Context)
We provide stronger and more general primal-dual convergence results for Frank-Wolfe-type algorithms (a.k.a. conditional gradient) for constrained convex optimization, enabled by a simple framework of duality gap certificates. Our analysis also holds if the linear subproblems are only solved approximately (as well as if the gradients are inexact), and is proven to be worst-case optimal in the sparsity of the obtained solutions. On the application side, this allows us to unify a large variety of existing sparse greedy methods, in particular for optimization over convex hulls of an atomic set, even if those sets can only be approximated, including sparse (or structured sparse) vectors or matrices, low-rank matrices, permutation matrices, or max-norm bounded matrices. We present a new general framework for convex optimization over matrix factorizations, where every Frank-Wolfe iteration will consist of a low-rank update, and discuss the broad application areas of this approach.
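For readers skimming the listing, a minimal sketch of the Frank-Wolfe iteration with the duality gap certificate described above; the least-squares objective, the l1-ball constraint, and the 2/(t+2) step size are illustrative assumptions, not details taken from the paper.

```python
# Minimal Frank-Wolfe sketch (assumed setting: least squares over an l1 ball).
# The duality gap g = <grad f(x), x - s> certifies suboptimality, as in the abstract.
import numpy as np

def lmo_l1_ball(grad, radius=1.0):
    """Linear minimization oracle: argmin of <grad, s> over ||s||_1 <= radius."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -radius * np.sign(grad[i])
    return s

def frank_wolfe(grad_f, x0, lmo, max_iter=200, tol=1e-6):
    x = x0.copy()
    gap = np.inf
    for t in range(max_iter):
        g = grad_f(x)
        s = lmo(g)
        gap = g @ (x - s)                # duality gap; upper bound on f(x) - f(x*)
        if gap <= tol:
            break
        gamma = 2.0 / (t + 2.0)          # standard open-loop step size
        x = (1 - gamma) * x + gamma * s  # move toward a single polytope vertex
    return x, gap

# Toy usage: f(x) = 0.5 * ||Ax - b||^2 over the l1 ball.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
x_hat, gap = frank_wolfe(lambda x: A.T @ (A @ x - b), np.zeros(10), lmo_l1_ball)
print(f"duality gap {gap:.2e}, nonzeros {np.count_nonzero(x_hat)}")
```

With this oracle every vertex is 1-sparse, so after t iterations the iterate has at most t nonzero entries, which is the kind of sparsity guarantee the abstract refers to.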
Duality between subgradient and conditional gradient methods
- hal-00861118, version 1 - 12 Sep 2013
A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization
, 1301
"... Abstract. Linear optimization is many times algorithmically simpler than non-linear convex optimization. Linear optimization over matroid polytopes, matching polytopes and path polytopes are example of problems for which we have simple and efficient combinatorial algorithms, but whose non-linear con ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
(Show Context)
Abstract. Linear optimization is often algorithmically simpler than non-linear convex optimization. Linear optimization over matroid polytopes, matching polytopes and path polytopes are examples of problems for which we have simple and efficient combinatorial algorithms, but whose non-linear convex counterparts are harder and admit significantly less efficient algorithms. This motivates the computational model of convex optimization, including the offline, online and stochastic settings, using a linear optimization oracle. In this computational model we give several new results that improve over the previous state of the art. Our main result is a novel conditional gradient algorithm for smooth and strongly convex optimization over polyhedral sets that performs only a single linear optimization step over the domain on each iteration and enjoys a linear convergence rate. This gives an exponential improvement in convergence rate over previous results. Based on this new conditional gradient algorithm we give the first algorithms for online convex optimization over polyhedral sets that perform only a single linear optimization step over the domain while having optimal regret guarantees, answering an open question of Kalai and Vempala, and Hazan and Kale. Our online algorithms also imply conditional gradient algorithms for non-smooth and stochastic convex optimization with the same convergence rates as projected (sub)gradient methods.
Key words: Frank-Wolfe algorithm; conditional gradient methods; linear programming; first-order methods; online convex optimization; online learning; stochastic optimization
AMS subject classifications: 65K05; 90C05; 90C06; 90C25; 90C30; 90C27; 90C15
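As an illustration of the oracle model in this abstract (linear optimization over combinatorial polytopes solved by simple combinatorial routines), the following hedged sketch shows that a linear minimization oracle over the spanning-tree polytope of a graph is just a minimum spanning tree computation via Kruskal's algorithm; the toy graph and costs are invented for the example.

```python
# Sketch: linear minimization over the spanning-tree polytope of a graph reduces to
# a minimum spanning tree with respect to the linear objective (Kruskal's algorithm).
# This is the kind of oracle a conditional gradient method calls once per iteration.

def kruskal_lmo(n_vertices, edges, cost):
    """edges: list of (u, v); cost: per-edge linear coefficients.
    Returns the set of edge indices of the MST, i.e. the vertex of the
    spanning-tree polytope minimizing <cost, x>."""
    parent = list(range(n_vertices))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    tree = set()
    for idx in sorted(range(len(edges)), key=lambda i: cost[i]):
        u, v = edges[idx]
        ru, rv = find(u), find(v)
        if ru != rv:                        # adding this edge keeps the forest acyclic
            parent[ru] = rv
            tree.add(idx)
    return tree

# Toy usage: the "cost" plays the role of the gradient handed to the oracle
# inside a conditional gradient iteration.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
cost = [1.0, 4.0, 2.0, 3.0, 0.5]
print(kruskal_lmo(4, edges, cost))          # e.g. {0, 2, 4}
```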
Optimally-Weighted Herding is Bayesian Quadrature
"... Herding and kernel herding are deterministic methods of choosing samples which summarise a probability distribution. A related task is choosing samples for estimating integrals using Bayesian quadrature. We show that the criterion minimised when selecting samples in kernel herding is equivalent to t ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Herding and kernel herding are deterministic methods of choosing samples which summarise a probability distribution. A related task is choosing samples for estimating integrals using Bayesian quadrature. We show that the criterion minimised when selecting samples in kernel herding is equivalent to the posterior variance in Bayesian quadrature. We then show that sequential Bayesian quadrature can be viewed as a weighted version of kernel herding which achieves performance superior to any other weighted herding method. We demonstrate empirically a rate of convergence faster than O(1/N). Our results also imply an upper bound on the empirical error of the Bayesian quadrature estimate.
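A hedged numpy sketch of the connection the abstract describes: choose points by kernel herding, then re-weight them with Bayesian-quadrature weights w = K^-1 z, where z holds kernel-mean evaluations. The RBF kernel, the standard-normal target approximated by a large reference sample, and the candidate grid are all illustrative assumptions.

```python
# Kernel herding followed by Bayesian-quadrature re-weighting (illustrative setting).
import numpy as np

rng = np.random.default_rng(0)
ref = rng.standard_normal(5000)                 # large sample standing in for the target p
cand = np.linspace(-4.0, 4.0, 401)              # candidate locations for greedy selection

def k(a, b, h=1.0):
    """RBF kernel between every element of a and every element of b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / h ** 2)

mu_cand = k(cand, ref).mean(axis=1)             # kernel mean embedding mu(x) = E_p k(x, .)

# Kernel herding: greedily maximize mu(x) minus the average kernel to points chosen so far.
points = []
for t in range(10):
    penalty = 0.0 if not points else k(cand, np.array(points)).mean(axis=1)
    points.append(cand[np.argmax(mu_cand - penalty)])
points = np.array(points)

# Bayesian-quadrature weights for the same points.
K = k(points, points) + 1e-10 * np.eye(len(points))
z = k(points, ref).mean(axis=1)
w = np.linalg.solve(K, z)

# Both estimate E_p[f]; the BQ weights minimize the posterior variance for these points.
f = lambda x: np.sin(x) + x ** 2
print("herding (equal weights):", f(points).mean())
print("Bayesian quadrature    :", f(points) @ w)
print("ground truth (MC)      :", f(ref).mean())
```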
A Greedy Framework for First-Order Optimization
"... Introduction. Recent work has shown many connections between conditional gradient and other first-order optimization methods, such as herding [3] and subgradient descent [2]. By considering a type of proximal conditional method, which we call boosted mirror descent (BMD), we are able to unify all of ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Introduction. Recent work has shown many connections between conditional gradient and other first-order optimization methods, such as herding [3] and subgradient descent [2]. By considering a type of proximal conditional method, which we call boosted mirror descent (BMD), we are able to unify all of these algorithms into a single framework, which can be interpreted as taking successive arg-mins of a sequence of surrogate functions. Using a standard online learning analysis based on ...
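The note frames these algorithms as successive arg-mins of surrogate functions; as a concrete point of reference (and explicitly not the note's boosted mirror descent itself), here is a sketch of plain entropic mirror descent on the probability simplex, where each update minimizes a linearized objective plus a KL proximity term and has a multiplicative-weights closed form. The quadratic objective and step size are illustrative assumptions.

```python
# Entropic mirror descent on the simplex: each iterate is the arg-min of the surrogate
#   <grad f(x_t), x> + (1/eta) * KL(x || x_t),
# whose closed form is x_{t+1} proportional to x_t * exp(-eta * grad f(x_t)).
import numpy as np

def mirror_descent_simplex(grad_f, dim, eta=0.1, iters=500):
    x = np.full(dim, 1.0 / dim)               # uniform start on the simplex
    for _ in range(iters):
        x = x * np.exp(-eta * grad_f(x))      # multiplicative-weights update
        x /= x.sum()                          # renormalize back onto the simplex
    return x

# Toy usage: minimize 0.5 <x, Qx> + <c, x> over the probability simplex.
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
Q = M @ M.T                                    # positive semidefinite
c = rng.standard_normal(6)
x_hat = mirror_descent_simplex(lambda x: Q @ x + c, dim=6)
print(x_hat, x_hat.sum())
```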
• Assumptions – f: R^n → R Lipschitz-continuous ⇒ f* has compact support C
, 2013
"... Wolfe’s universal algorithm www.di.ens.fr/~fbach/wolfe_anonymous.pdf Conditional gradients everywhere • Conditional gradient and subgradient method – Fenchel duality – Generalized conditional gradient and mirror descent • Conditional gradient and greedy algorithms – Relationship with basis pursuit, ..."
Abstract
- Add to MetaCart
(Show Context)
Wolfe’s universal algorithm (www.di.ens.fr/~fbach/wolfe_anonymous.pdf)
Conditional gradients everywhere
• Conditional gradient and subgradient method
  – Fenchel duality
  – Generalized conditional gradient and mirror descent
• Conditional gradient and greedy algorithms
  – Relationship with basis pursuit, matching pursuit
• Conditional gradient and herding
  – Properties of conditional gradient iterates
  – Relationships with sampling
Composite optimization problems: min_{x ∈ R^p} h(x) + f(Ax)
Predicting the Future Behavior of a Time-Varying Probability Distribution
"... We study the problem of predicting the future, though only in the probabilistic sense of estimating a future state of a time-varying probability distribution. This is not only an interesting academic problem, but solving this extrapo-lation problem also has many practical application, e.g. for train ..."
Abstract
- Add to MetaCart
(Show Context)
We study the problem of predicting the future, though only in the probabilistic sense of estimating a future state of a time-varying probability distribution. This is not only an interesting academic problem, but solving this extrapolation problem also has many practical applications, e.g. for training classifiers that have to operate under time-varying conditions. Our main contribution is a method for predicting the next step of the time-varying distribution from a given sequence of sample sets from earlier time steps. For this we rely on two recent machine learning techniques: embedding probability distributions into a reproducing kernel Hilbert space, and learning operators by vector-valued regression. We illustrate the working principles and the practical usefulness of our method by experiments on synthetic and real data. We also highlight an exemplary application: training a classifier in a domain adaptation setting without having access to examples from the test time distribution at training time.
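A deliberately simplified stand-in for the approach sketched above, under stated assumptions: each sample set is represented by its empirical kernel mean evaluated at fixed landmark points (a finite-dimensional proxy for the RKHS embedding), and a linear operator mapping the representation at time t to the one at time t+1 is learned by ridge regression. The kernel, the landmarks, the synthetic drift, and the regularization constant are all invented for the example.

```python
# Simplified sketch: kernel-mean representations of sample sets plus a ridge-regressed
# linear operator that extrapolates the representation one time step ahead.
import numpy as np

rng = np.random.default_rng(0)
landmarks = np.linspace(-6.0, 6.0, 50)

def embed(samples, h=1.0):
    """Empirical kernel mean (1/m) sum_j k(landmark_i, x_j) with an RBF kernel."""
    return np.exp(-0.5 * (landmarks[:, None] - samples[None, :]) ** 2 / h ** 2).mean(axis=1)

# Synthetic time-varying distribution: a Gaussian whose mean drifts each step.
sample_sets = [rng.normal(loc=0.3 * t, scale=1.0, size=300) for t in range(12)]
E = np.stack([embed(s) for s in sample_sets])            # shape (T, n_landmarks)

# Ridge regression from the embedding at time t to the embedding at time t+1.
X, Y = E[:-1], E[1:]
lam = 1e-3
W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

pred_next = E[-1] @ W                                     # predicted embedding at step T+1
true_next = embed(rng.normal(loc=0.3 * 12, scale=1.0, size=300))
print("prediction error  :", np.linalg.norm(pred_next - true_next))
print("no-change baseline:", np.linalg.norm(E[-1] - true_next))
```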
Measuring Sample Quality with Stein’s Method
, 2015
"... To improve the efficiency of Monte Carlo estimation, practitioners are turning to biased Markov chain Monte Carlo procedures that trade off asymptotic exactness for computational speed. The reasoning is sound: a reduction in variance due to more rapid sampling can outweigh the bias introduced. Howev ..."
Abstract
- Add to MetaCart
(Show Context)
To improve the efficiency of Monte Carlo estimation, practitioners are turning to biased Markov chain Monte Carlo procedures that trade off asymptotic exactness for computational speed. The reasoning is sound: a reduction in variance due to more rapid sampling can outweigh the bias introduced. However, the inexactness creates new challenges for sampler and parameter selection, since standard measures of sample quality like effective sample size do not account for asymptotic bias. To address these challenges, we introduce a new computable quality measure based on Stein’s method that bounds the discrepancy between sample and target expectations over a large class of test functions. We use our tool to compare exact, biased, and deterministic sample sequences and illustrate applications to hyperparameter selection, convergence rate assessment, and quantifying bias-variance tradeoffs in posterior inference.
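To illustrate the general idea of a computable Stein-based quality measure (this is the closely related kernel Stein discrepancy, not the specific discrepancy constructed in the paper), here is a sketch for a standard normal target, whose score function is simply s(x) = -x; the RBF kernel and bandwidth are illustrative assumptions.

```python
# Kernel Stein discrepancy sketch for a standard-normal target with an RBF kernel.
# Larger values indicate a larger discrepancy between sample and target expectations.
import numpy as np

def ksd_gaussian_target(x, h=1.0):
    """V-statistic estimate of the squared kernel Stein discrepancy of sample x
    (shape (n, d)) against N(0, I)."""
    n, d = x.shape
    s = -x                                         # score of the standard normal
    diff = x[:, None, :] - x[None, :, :]           # pairwise differences x_i - x_j
    sqdist = (diff ** 2).sum(-1)
    k = np.exp(-0.5 * sqdist / h ** 2)             # RBF kernel matrix
    # Stein kernel k_p(x_i, x_j) for the RBF kernel, written out in closed form.
    term = (s[:, None, :] * s[None, :, :]).sum(-1)                   # s(x_i) . s(x_j)
    term += ((s[:, None, :] - s[None, :, :]) * diff).sum(-1) / h ** 2
    term += d / h ** 2 - sqdist / h ** 4
    return (k * term).sum() / n ** 2

rng = np.random.default_rng(0)
good = rng.standard_normal((500, 2))               # exact draws from the target
biased = good + 0.5                                # shifted, hence biased, sample
print("KSD^2 exact sample :", ksd_gaussian_target(good))
print("KSD^2 biased sample:", ksd_gaussian_target(biased))
```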
cal Methods
"... 5.2. SiGMa- Simple Greedy Matching: a tool for aligning large knowledge-bases 3 ..."
Abstract
- Add to MetaCart
(Show Context)
5.2. SiGMa (Simple Greedy Matching): a tool for aligning large knowledge-bases