Results 1-10 of 74
A stochastic gradient method with an exponential convergence rate for finite training sets.
In NIPS, 2012
Cited by 73 (10 self)
We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. Numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms.
Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization
In Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of JMLR Workshop and Conference Proceedings, 2011
Cited by 58 (3 self)
We give a novel algorithm for stochastic strongly-convex optimization in the gradient oracle model which returns an O(1/T)-approximate solution after T gradient updates. This rate of convergence is optimal in the gradient oracle model. This improves upon the previously known best rate of O(log(T)/T), which was obtained by applying an online strongly-convex optimization algorithm with regret O(log(T)) to the batch setting. We complement this result by proving that any algorithm has expected regret of Ω(log(T)) in the online stochastic strongly-convex optimization setting. This lower bound holds even in the full-information setting, which reveals more information to the algorithm than just gradients. This shows that any online-to-batch conversion is inherently suboptimal for stochastic strongly-convex optimization. This is the first formal evidence that online convex optimization is strictly more difficult than batch stochastic convex optimization.
Distributed delayed stochastic optimization
2011
Cited by 55 (6 self)
We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems, in spite of asynchronous delays, scales asymptotically as O(1/√(nT)), which is known to be optimal even in the absence of delays.
Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
Cited by 47 (10 self)
We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.
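The two algorithms named in this abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function name `averaged_sgd`, the step-size schedule `step0 / k**decay`, and the one-dimensional toy objective are all assumptions made for illustration.

```python
import random

def averaged_sgd(grad_estimate, x0, n_iters, step0=0.5, decay=0.5):
    """SGD with a running (Polyak-Ruppert) average of the iterates.

    grad_estimate(x) must return an unbiased estimate of the gradient at x.
    The step at iteration k is step0 / k**decay: decay=1.0 is the 1/k
    schedule, decay<1.0 is a slower decay (assumed form, for illustration).
    """
    x = x0
    x_avg = x0
    for k in range(1, n_iters + 1):
        step = step0 / (k ** decay)
        x = x - step * grad_estimate(x)
        # Incremental mean of the iterates x_1, ..., x_k.
        x_avg = x_avg + (x - x_avg) / k
    return x, x_avg

# Hypothetical toy problem: minimize E[(x - z)^2] / 2 with z ~ N(3, 1),
# whose minimizer is x = 3. The gradient estimate is x - z.
rng = random.Random(0)

def noisy_grad(x):
    return x - rng.gauss(3.0, 1.0)

last_iterate, averaged_iterate = averaged_sgd(noisy_grad, x0=0.0, n_iters=20000)
```

Here `last_iterate` is the plain SGD output and `averaged_iterate` is the averaged one; with the slower decay the average is typically the less noisy of the two on this toy problem.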
Minimizing Finite Sums with the Stochastic Average Gradient
2013
Cited by 42 (2 self)
We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method’s iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from O(1/√k) to O(1/k) in general, and when the sum is strongly-convex the convergence rate is improved from the sublinear O(1/k) to a linear convergence rate of the form O(ρ^k) for ρ < 1. Further, in many cases the convergence rate of the new method is also faster than black-box deterministic gradient methods, in terms of the number of gradient evaluations. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies.
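The idea of keeping a memory of previous gradient values can be sketched as follows. This is an illustrative reading of the abstract, not the authors' reference implementation: the function name `sag`, the uniform sampling, the toy least-squares data, and the conservative step size 1/(16L) are assumptions.

```python
import random

def sag(grad_i, n, x0, step, n_iters, seed=0):
    """Sketch of a stochastic average gradient loop for (1/n) * sum_i f_i(x).

    grad_i(i, x) returns the gradient of f_i at x as a list of floats.
    """
    rng = random.Random(seed)
    d = len(x0)
    x = list(x0)
    memory = [[0.0] * d for _ in range(n)]  # last gradient seen for each f_i
    grad_sum = [0.0] * d                    # running sum of the stored gradients
    for _ in range(n_iters):
        i = rng.randrange(n)                # uniform sampling (assumption)
        g = grad_i(i, x)
        for j in range(d):
            # Swap the stale gradient of f_i for the fresh one, then step
            # along the average of all stored gradients.
            grad_sum[j] += g[j] - memory[i][j]
            x[j] -= step * grad_sum[j] / n
        memory[i] = g
    return x

# Hypothetical toy problem: f_i(x) = 0.5 * (a_i . x - b_i)^2, noise-free targets.
data_rng = random.Random(1)
x_true = [1.0, -2.0]
A = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(20)]
b = [a[0] * x_true[0] + a[1] * x_true[1] for a in A]

def grad_i(i, x):
    r = A[i][0] * x[0] + A[i][1] * x[1] - b[i]
    return [r * A[i][0], r * A[i][1]]

# Each f_i has gradient Lipschitz constant ||a_i||^2 <= 2, so L = 2 is a valid bound.
x_hat = sag(grad_i, n=len(A), x0=[0.0, 0.0], step=1.0 / (16 * 2.0), n_iters=20000)
```

Each iteration evaluates a single gradient, so the per-iteration cost does not grow with the number of terms, at the price of storing one gradient per term.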
Stochastic Gradient Descent for Nonsmooth Optimization: Convergence Results and Optimal Averaging Schemes
Cited by 36 (6 self)
Stochastic Gradient Descent (SGD) is one of the simplest and most popular stochastic optimization methods. While it has already been theoretically studied for decades, the classical analysis usually required nontrivial smoothness assumptions, which do not apply to many modern applications of SGD with nonsmooth objective functions such as support vector machines. In this paper, we investigate the performance of SGD without such smoothness assumptions, as well as a running average scheme to convert the SGD iterates to a solution with optimal optimization accuracy. In this framework, we prove that after T rounds, the suboptimality of the last SGD iterate scales as O(log(T)/√T) for nonsmooth convex objective functions, and O(log(T)/T) in the nonsmooth strongly convex case. To the best of our knowledge, these are the first bounds of this kind, and almost match the minimax-optimal rates obtainable by appropriate averaging schemes. We also propose a new and simple averaging scheme, which not only attains optimal rates, but can also be easily computed on-the-fly (in contrast, the suffix averaging scheme proposed in Rakhlin et al. (2011) is not as simple to implement). Finally, we provide some experimental illustrations.
Communication-Efficient Algorithms for Statistical Optimization
Cited by 27 (5 self)
We study two communication-efficient algorithms for distributed statistical optimization on large-scale data. The first algorithm is an averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of the bootstrap. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. We complement our theoretical results with experiments on large-scale problems from the internet search domain. In particular, we show that our methods efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which consists of N ≈ 2.4×10^8 samples and d ≥ 700,000 dimensions.
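The first algorithm (split evenly, minimize locally, average) can be sketched on a single process as below. The one-parameter regression model, the function names `local_slope` and `average_mixture`, and the simulated "machines" are assumptions for illustration; the bootstrap-based second algorithm is not shown.

```python
import random

def local_slope(samples):
    """Minimizer over w of sum (y - w*x)^2 for one subset of (x, y) pairs."""
    sxy = sum(x * y for x, y in samples)
    sxx = sum(x * x for x, y in samples)
    return sxy / sxx

def average_mixture(samples, m):
    """Split samples evenly across m simulated machines, then average the local minimizers."""
    shards = [samples[k::m] for k in range(m)]
    return sum(local_slope(shard) for shard in shards) / m

# Hypothetical data: y = 2x + Gaussian noise, N = 10000 samples, m = 10 machines.
rng = random.Random(0)
w_true = 2.0
samples = []
for _ in range(10000):
    x = rng.uniform(1.0, 2.0)
    samples.append((x, w_true * x + rng.gauss(0.0, 0.1)))

w_avg = average_mixture(samples, m=10)  # one round of communication: m numbers
w_central = local_slope(samples)        # centralized estimate, for comparison
```

In this toy setting m = 10 is well below √N = 100, the regime in which the abstract states the averaged estimate matches the centralized rate.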
On the Universality of Online Mirror Descent
Cited by 23 (6 self)
We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee.
RANDOMIZED SMOOTHING FOR STOCHASTIC OPTIMIZATION
2012
Cited by 20 (2 self)
We analyze convergence rates of stochastic optimization algorithms for nonsmooth convex optimization problems. By combining randomized smoothing techniques with accelerated gradient methods, we obtain convergence rates of stochastic optimization procedures, both in expectation and with high probability, that have optimal dependence on the variance of the gradient estimates. To the best of our knowledge, these are the first variance-based rates for nonsmooth optimization. We give several applications of our results to statistical estimation problems and provide experimental results that demonstrate the effectiveness of the proposed algorithms. We also describe how a combination of our algorithm with recent work on decentralized optimization yields a distributed stochastic optimization algorithm that is order-optimal.
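The smoothing step alone can be illustrated with a small sketch. It assumes Gaussian perturbations and the scalar function f(y) = |y|, neither of which is specified in the abstract, and it omits the accelerated gradient method the paper combines it with. The smoothed function is f_u(x) = E[f(x + uZ)], and its gradient is estimated by averaging subgradients at perturbed points.

```python
import math
import random

def smoothed_subgradient(subgrad, x, u, n_samples, rng):
    """Monte Carlo estimate of the gradient of f_u(x) = E[f(x + u*Z)], Z ~ N(0, 1)."""
    total = 0.0
    for _ in range(n_samples):
        total += subgrad(x + u * rng.gauss(0.0, 1.0))
    return total / n_samples

def abs_subgrad(y):
    # A subgradient of f(y) = |y|; any value in [-1, 1] is valid at y = 0.
    return 1.0 if y > 0 else (-1.0 if y < 0 else 0.0)

rng = random.Random(0)
x, u = 0.3, 0.5
estimate = smoothed_subgradient(abs_subgrad, x, u, n_samples=20000, rng=rng)
# Closed form for this particular f and Gaussian smoothing: erf(x / (u * sqrt(2))).
exact = math.erf(x / (u * math.sqrt(2.0)))
```

Averaging more perturbed subgradients lowers the variance of the estimate, which is the quantity the abstract's rates depend on.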
Query Complexity of Derivative-Free Optimization
Cited by 17 (0 self)
This paper provides lower bounds on the convergence rate of Derivative Free Optimization (DFO) with noisy function evaluations, exposing a fundamental and unavoidable gap between the performance of algorithms with access to gradients and those with access to only function evaluations. However, there are situations in which DFO is unavoidable, and for such situations we propose a new DFO algorithm that is proved to be near optimal for the class of strongly convex objective functions. A distinctive feature of the algorithm is that it uses only Boolean-valued function comparisons, rather than function evaluations. This makes the algorithm useful in an even wider range of applications, such as optimization based on paired comparisons from human subjects. We also show that regardless of whether DFO is based on noisy function evaluations or Boolean-valued function comparisons, the convergence rate is the same.