Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
2010
Cited by 311 (3 self)
Stochastic subgradient methods are widely used, well analyzed, and constitute effective tools for optimization and online learning. Stochastic gradient methods' popularity and appeal are largely due to their simplicity, as they largely follow predetermined procedural schemes. However, most common subgradient approaches are oblivious to the characteristics of the data being observed. We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. The adaptation, in essence, allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. In a companion paper, we validate experimentally our theoretical analysis and show that the adaptive subgradient approach outperforms state-of-the-art, but non-adaptive, subgradient algorithms.
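The abstract above does not spell out an update rule. As a rough illustration only, the sketch below uses the diagonal form commonly associated with adaptive subgradient methods, in which each coordinate's step is scaled by the root of its accumulated squared gradients. The names `adagrad` and `toy_grad`, the step size `eta`, the constant `eps`, and the toy objective are all assumptions introduced here, not taken from the listing.

```python
import math

def adagrad(grad_fn, x0, eta=0.5, steps=200, eps=1e-8):
    """Diagonal adaptive subgradient sketch: one step size per coordinate."""
    x = list(x0)
    g_sq = [0.0] * len(x)  # running sum of squared gradients, per coordinate
    for _ in range(steps):
        g = grad_fn(x)
        for i, gi in enumerate(g):
            g_sq[i] += gi * gi
            # Coordinates with small accumulated gradients take larger steps.
            x[i] -= eta * gi / (math.sqrt(g_sq[i]) + eps)
    return x

# Hypothetical toy objective: f(x) = 0.5 * (100 * x[0]**2 + x[1]**2),
# chosen so the two coordinates have very different gradient scales.
def toy_grad(x):
    return [100.0 * x[0], 1.0 * x[1]]

x_final = adagrad(toy_grad, [1.0, 1.0])
print(x_final)
```

On this badly scaled toy problem both coordinates shrink toward zero at nearly the same pace, since the per-coordinate normalization cancels the factor of 100. This is one way to see the informal claim that the method adapts to the geometry of the observed data.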
Adaptive forward-backward greedy algorithm for learning sparse representations
IEEE Trans. Inform. Theory, 2011
Cited by 101 (9 self)
Consider linear prediction models where the target function is a sparse linear combination of a set of basis functions. We are interested in the problem of identifying those basis functions with nonzero coefficients and reconstructing the target function from noisy observations. Two heuristics that are widely used in practice are forward and backward greedy algorithms. First, we show that neither idea is adequate. Second, we propose a novel combination that is based on the forward greedy algorithm but takes backward steps adaptively whenever beneficial. We prove strong theoretical results showing that this procedure is effective in learning sparse representations. Experimental results support our theory.
Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling
IEEE Transactions on Automatic Control, 2010
Cited by 97 (12 self)
The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. It arises in various application domains, including distributed tracking and localization, multi-agent coordination, estimation in sensor networks, and large-scale machine learning. We develop and analyze distributed algorithms based on dual subgradient averaging, and we provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis allows us to clearly separate the convergence of the optimization algorithm itself and the effects of communication dependent on the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network and confirm this prediction’s sharpness both by theoretical lower bounds and simulations for various networks. Our approach includes the cases of deterministic optimization and communication as well as problems with stochastic optimization and/or communication.
Proximal Methods for Hierarchical Sparse Coding
2010
Cited by 83 (18 self)
Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and we propose in this paper efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1-norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally organize in a prespecified arborescent structure, leading to a better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models.
Information-theoretic lower bounds on the oracle complexity of convex optimization.
2010
Cited by 74 (11 self)
Despite a large literature on upper bounds on complexity of convex optimization, relatively less attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining an understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes. We also discuss implications of these results for understanding the inherent complexity of large-scale learning and estimation problems.
Optimal distributed online prediction using minibatches
2010
Cited by 73 (9 self)
Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed minibatch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speedup over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem.
A stochastic gradient method with an exponential convergence rate for finite training sets.
In NIPS, 2012
Cited by 73 (10 self)
We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. Numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms.
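The idea of keeping "a memory of previous gradient values" can be sketched as follows. This is one plausible reading of the abstract, not the paper's exact algorithm: the most recent gradient of each component function is stored, one entry is refreshed per iteration, and the step follows the average of the stored values. The function name `sag`, the step size, the iteration count, and the scalar toy problem are assumptions made for illustration.

```python
import random

def sag(grad_i, x0, n, step, iters, seed=0):
    """Averaged-gradient sketch with a memory of one gradient per component."""
    rng = random.Random(seed)
    x = x0
    memory = [0.0] * n  # last gradient evaluated for each component f_i
    total = 0.0         # running sum of the entries in `memory`
    for _ in range(iters):
        i = rng.randrange(n)       # sample one component uniformly
        g = grad_i(i, x)           # fresh gradient of f_i at the current x
        total += g - memory[i]     # swap the stale entry for the fresh one
        memory[i] = g
        x -= step * total / n      # step along the average of stored gradients
    return x

# Hypothetical toy problem: f_i(x) = 0.5 * (x - a_i)**2, so the minimizer
# of the sum is the mean of the targets a_i.
targets = [1.0, 2.0, 3.0, 4.0]
x_hat = sag(lambda i, x: x - targets[i], x0=0.0, n=len(targets),
            step=1.0 / 16.0, iters=5000)
print(x_hat)
```

Each iteration touches only one component gradient, as in plain stochastic gradient descent, but the update direction uses information from all components. That reuse is what the abstract credits for the faster rate.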
Composite Objective Mirror Descent
Cited by 67 (8 self)
We present a new method for regularized convex optimization and analyze it under both online and stochastic optimization settings. In addition to unifying previously known first-order algorithms, such as the projected gradient method, mirror descent, and forward-backward splitting, our method yields new analysis and algorithms. We also derive specific instantiations of our method for commonly used regularization functions, such as ℓ1, mixed norm, and trace-norm.
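To make the ℓ1 instantiation concrete, here is a minimal sketch under one specific assumption: the distance term is the squared Euclidean norm. In that case a composite step reduces to a gradient step followed by soft-thresholding, which is the forward-backward splitting update the abstract mentions. The function names, the step size `eta`, and the regularization weight `lam` are illustrative and do not come from the paper.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |v|: shrink v toward zero by tau."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def composite_l1_step(x, grad, eta, lam):
    """One composite step with a Euclidean distance term and l1 regularizer.

    Solves argmin_z  eta * <grad, z> + 0.5 * ||z - x||^2 + eta * lam * ||z||_1
    in closed form, coordinate by coordinate.
    """
    return [soft_threshold(xi - eta * gi, eta * lam) for xi, gi in zip(x, grad)]

# Illustrative call: with a zero gradient, the step only applies shrinkage,
# and coordinates smaller in magnitude than eta * lam are set exactly to zero.
x_next = composite_l1_step([1.0, 0.05, -0.5], [0.0, 0.0, 0.0], eta=0.1, lam=1.0)
print(x_next)
```

Handling the regularizer in closed form, rather than through its subgradient, is what lets the iterates become exactly sparse.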
Distributed delayed stochastic optimization
2011
Cited by 55 (6 self)
We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems, in spite of asynchronous delays, scales asymptotically as O(1/√(nT)), which is known to be optimal even in the absence of delays.