## A stochastic gradient method with an exponential convergence rate for finite training sets. (2012)

Venue: | In NIPS |

Citations: | 73 - 10 self |

### Citations

972 | Regularization and variable selection via the elastic net - Zou, Hastie - 2005 |

960 |
A Stochastic Approximation Method
- Robbins, Monro
- 1951
Citation Context ...ere there is a large amount of redundancy between examples. The most wildly successful class of algorithms for taking advantage of this type of problem structure are stochastic gradient (SG) methods (Robbins and Monro, 1951; Bottou and LeCun, 2003). Although the theory behind SG methods allows them to be applied more generally, in the context of machine learning SG methods are typically used to solve the problem of opti... |

663 | Rcv1: A new benchmark collection for text categorization research
- Lewis, Yang, et al.
- 2004
Citation Context ...L. Thus, with this small step size, there is not a large difference between these three methods. However, our next result shows that, if the number of training examples is slightly larger than L/µ (which will often be the case, as discussed in Section 6), then the SAG iterations can use a much larger step size and obtain a better convergence rate that depends on the number of training examples but not on µ or L.

Table 1: Binary data sets used in experiments.

| Data set | Training Examples | Variables | Reference |
|---|---|---|---|
| quantum | 50 000 | 78 | [Caruana et al., 2004] |
| protein | 145 751 | 74 | [Caruana et al., 2004] |
| sido | 12 678 | 4 932 | [Guyon, 2008] |
| rcv1 | 20 242 | 47 236 | [Lewis et al., 2004] |
| covertype | 581 012 | 54 | Blackard, Jock, and Dean [Frank and Asuncion, 2010] |

Proposition 2. If $\mu/L \geq 8/n$, with a step size of $\alpha_k = \frac{1}{2n\mu}$ the SAG iterations satisfy for $k \geq n$ that

$$\mathbb{E}\left[g(x^k) - g(x^*)\right] \leq C\left(1 - \frac{1}{8n}\right)^k, \quad \text{with } C = \left[\frac{16L}{3n}\|x^0 - x^*\|^2 + \frac{4\sigma^2}{3n\mu}\left(8\log\left(1 + \frac{\mu n}{4L}\right) + 1\right)\right].$$

In this result we assume that the first n iterations of the algorithm use stochastic gradient descent and that we initialize the subsequent SAG iterations with the average of the iterates, which is why we state the result for $k \geq n$. This leads to an O(1/k) rate with a... |

540 |
Introductory Lectures on Convex Optimization. A Basic Course, volume 87 of Applied Optimization
- Nesterov
- 2004
Citation Context ...o denote the unique minimizer of g, the FG method with a constant step size achieves a linear convergence rate, $g(x^k) - g(x^*) = O(\rho^k)$, for some $\rho < 1$ which depends on the condition number of g (Nesterov, 2004, Theorem 2.1.15). Linear convergence is also known as geometric or exponential convergence, because the error is cut by a fixed fraction on each iteration. Despite the fast convergence rate of the FG... |

491 |
Stochastic approximation algorithms and applications. New York: Springer-Verlag
- Kushner, Yin
- 1997
Citation Context ...totic efficiency as Newton-like second-order SG methods and also leads to increased robustness of the convergence rate to the exact sequence of step sizes (Polyak and Juditsky, 1992). Bather’s method (Kushner and Yin, 2003, §1.3.4) combines gradient averaging with online iterate averaging and also displays appealing asymptotic properties. However, the convergence rates of these averaging methods remain sublinear. Stoch... |

409 |
UCI machine learning repository
- Frank, Asuncion
- 2010
Citation Context ...antum 50 000 78 [Caruana et al., 2004]; protein 145 751 74 [Caruana et al., 2004]; sido 12 678 4 932 [Guyon, 2008]; rcv1 20 242 47 236 [Lewis et al., 2004]; covertype 581 012 54 Blackard, Jock, and Dean [Frank and Asuncion, 2010]. Table 1: Binary data sets used in experiments. Proposition 2. If $\mu/L \geq 8/n$, with a step size of $\alpha_k = \frac{1}{2n\mu}$ the SAG iterations satisfy for $k \geq n$ that $\mathbb{E}[g(x^k) - g(x^*)] \leq C(1 - \frac{1}{8n})^k$, with C = [ 16L... |

338 |
Problem complexity and method efficiency in optimization. Wiley-Interscience series in discrete mathematics.
- Nemirovskii, Yudin
- 1983
Citation Context ...e can be shown to be optimal for strongly-convex optimization in a model of computation where the algorithm only accesses the function through unbiased measurements of its objective and gradient [see Nemirovski and Yudin, 1983, Agarwal et al., 2010]. Thus, we cannot hope to obtain a better convergence rate if the algorithm only has access to unbiased measurements of the gradient. However, in the case of a finite training s... |

270 | The tradeoffs of large scale learning
- Bottou, Bousquet
- 2008
Citation Context ...ting objective often reaches its minimum quicker than existing SG methods, and we could expect to improve the constant in the O(1/k) convergence rate, as is the case with online second-order methods (Bottou and Bousquet, 2008). Algorithm extensions: Our analysis and experiments focused on using a particular gradient approximation within the simplest possible gradient method. However, there are a variety of alternative gra... |

267 | Robust stochastic approximation approach to stochastic programming. - Nemirovski, Juditsky, et al. - 2009 |

193 |
Acceleration of stochastic approximation by averaging
- Polyak, Juditsky
- 1992
Citation Context ...ice of step-sizes, this gives the same asymptotic efficiency as Newton-like second-order SG methods and also leads to increased robustness of the convergence rate to the exact sequence of step sizes (Polyak and Juditsky, 1992). Bather’s method (Kushner and Yin, 2003, §1.3.4) combines gradient averaging with online iterate averaging and also displays appealing asymptotic properties. However, the convergence rates of these a... |

161 | Efficiency of coordinate descent methods on huge-scale optimization problems.
- Nesterov
- 2012
Citation Context ...at selecting the coordinate to update has a cost of O(1). If we select coordinates uniformly at random, then the convergence rate for p iterations of coordinate descent with a step-size of $1/L_g^j$ is [Nesterov, 2010, Theorem 2]

$$\left(1 - \frac{\mu_g}{p L_g^j}\right)^p = \left(1 - \frac{\lambda + m_\sigma/n}{p(\lambda + M_j/n)}\right)^p = \left(1 - \frac{n\lambda + m_\sigma}{p(n\lambda + M_j)}\right)^p \leq \exp\left(-\frac{n\lambda + m_\sigma}{n\lambda + M_j}\right).$$

Here, we see that applying a coordinate-descent method can be much more effic... |

143 | Primal-dual subgradient methods for convex problems - Nesterov - 2005 |

139 | Piecewise linear regularized solution paths, The Annals of Statistics 35(3 - Rosset, Zhu - 2007 |

133 | Dual averaging method for regularized stochastic learning and online optimization.
- Xiao
- 2010
Citation Context ...to improved practical performance, it still requires the use of a decreasing sequence of step sizes and is not known to lead to a faster convergence rate.

Gradient Averaging: Closely related to momentum is using the sample average of all previous gradients,

$$x^{k+1} = x^k - \frac{\alpha_k}{k}\sum_{j=1}^{k} f'_{i_j}(x^j),$$

which is similar to the SAG iteration in the form (6) but where all previous gradients are used. This approach is used in the dual averaging method of Nesterov [2009], and while this averaging procedure leads to convergence for a constant step size and can improve the constants in the convergence rate [Xiao, 2010], it does not improve on the O(1/k) rate.

Iterate Averaging: Rather than averaging the gradients, some authors propose to use the basic SG iteration but use the average of the $x^k$ over all k as the final estimator. With a suitable choice of stepsizes, this gives the same asymptotic efficiency as Newton-like second-order SG methods and also leads to increased robustness of the convergence rate to the exact sequence of step sizes [Polyak and Juditsky, 1992]. Bather’s method [see Kushner and Yin, 2003, §1.3.4] combines gradient averaging with online iterate averaging, and also displays appealing a... |

109 | A method for unconstrained convex minimization problem with the rate of convergence O - Nesterov - 1983 |

78 | A scalable modular convex solver for regularized risk minimization. - Teo, Le, et al. - 2007 |

76 | Deep learning via Hessian-free optimization.
- Martens
- 2010
Citation Context ...ch as nonlinear conjugate gradient, quasi-Newton, and Hessian-free Newton methods. Several authors have presented stochastic variants of these algorithms [Sunehag et al., 2009, Ghadimi and Lan, 2010, Martens, 2010]. Under certain conditions these variants are convergent and improve on the constant in the O(1/k) rate [Sunehag et al., 2009]. Alternately, if we split the convergence rate into a deterministic and ... |

74 | Information-theoretic lower bounds on the oracle complexity of convex optimization.
- Agarwal, Bartlett, et al.
- 2010
Citation Context ... for strongly-convex optimization in a model of computation where the algorithm only accesses the function through unbiased measurements of its objective and gradient [see Nemirovski and Yudin, 1983, Agarwal et al., 2010]. Thus, we cannot hope to obtain a better convergence rate if the algorithm only has access to unbiased measurements of the gradient. However, in the case of a finite training set where we may choose... |

71 | Large scale online learning
- Bottou, LeCun
- 2004
Citation Context ...nt of redundancy between examples. The most wildly successful class of algorithms for taking advantage of this type of problem structure are stochastic gradient (SG) methods (Robbins and Monro, 1951; Bottou and LeCun, 2003). Although the theory behind SG methods allows them to be applied more generally, in the context of machine learning SG methods are typically used to solve the problem of optimizing a sample average ... |

71 | Méthode générale pour la résolution des systèmes d’équations simultanées - Cauchy |

71 | Local gain adaptation in stochastic gradient descent - Schraudolph - 1999 |

67 | A new class of incremental gradient methods for least squares problems.
- Bertsekas
- 1997
Citation Context ...onvergence rate. Bertsekas proposes to go through the data cyclically with a specialized weighting that allows the method to achieve a linear convergence rate for strongly-convex quadratic functions (Bertsekas, 1997). However, the weighting is numerically unstable and the linear convergence rate presented treats full cycles as iterations. A related strategy is to group the $f_i$ functions into ‘batches’ of increasi... |

65 | Convergence rate of incremental subgradient algorithms. Stochastic Optimization: Algorithms and Applications,
- Nedic, Bertsekas
- 2000
Citation Context ... the O(1/k) rate. Constant step size: If the SG iterations are used with a constant step size (rather than a decreasing sequence), then the convergence rate of the method can be split into two parts [Nedic and Bertsekas, 2000, Proposition 2.4], where the first part depends on k and converges linearly to 0 and the second part is independent of k but does not converge to 0. Thus, with a constant step size the SG iterations ... |

63 | A convergent incremental gradient method with a constant step size. - Blatt, Hero, et al. - 2007 |

61 |
Accelerated stochastic approximation.
- Kesten
- 1958
Citation Context ...ze. In particular, accelerated SG methods use a constant step size by default, and only decrease the step size on iterations where the inner-product between successive gradient estimates is negative (Kesten, 1958; Delyon and Juditsky, 1993). This leads to convergence of the method and allows it to potentially achieve periods of linear convergence where the step size stays constant. However, the overall conver... |

58 | Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. - Hazan, Kale - 2011 |

55 | Distributed delayed stochastic optimization. - Agarwal, Duchi - 2012 |

52 | Large-scale sparse logistic regression. - Liu, Chen, et al. - 2009 |

48 | Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework.
- Ghadimi, Lan
- 2012
Citation Context ...atic approximations such as non-linear conjugate gradient, quasi-Newton, and Hessian-free Newton methods. Several authors have presented stochastic variants of these algorithms (Sunehag et al., 2009; Ghadimi and Lan, 2010; Xiao, 2010). Under certain conditions these variants are convergent and improve on the constant in the O(1/k) rate (Sunehag et al., 2009). Alternately, if we split the convergence rate into a determ... |

47 | Non-asymptotic analysis of stochastic approximation algorithms for machine learning. - Bach, Moulines - 2011 |

38 | Fast Rates for Regularized Objectives - Sridharan, Shalev-Shwartz, et al. - 2008 |

37 | An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. - Tseng - 1998 |

37 |
A stochastic approximation method. The annals of mathematical statistics,
- Robbins, Monro
- 1951
Citation Context ...ere there is a large amount of redundancy between examples. The most wildly successful class of algorithms for taking advantage of this type of problem structure are stochastic gradient (SG) methods [Robbins and Monro, 1951, Bottou and LeCun, 2003]. Although the theory behind SG methods allows them to be applied more generally, in the context of machine learning SG methods are typically used to solve the problem of opti... |

33 | Incremental gradient algorithms with stepsizes bounded away from zero.
- Solodov
- 1998
Citation Context ...not make further progress. Indeed, convergence of the basic SG method with a constant step size has only been shown under extremely strong assumptions about the relationship between the functions $f_i$ (Solodov, 1998). This contrasts with the method we present in this work which converges to the optimal solution using a constant step size and does so with a linear rate (without additional assumptions). Accelerate... |

32 | Hybrid deterministic-stochastic methods for data fitting. - Friedlander, Schmidt - 2012 |

16 |
Accelerated stochastic approximation.
- Delyon, Juditsky
- 1993
Citation Context ...lar, accelerated SG methods use a constant step size by default, and only decrease the step size on iterations where the inner-product between successive gradient estimates is negative (Kesten, 1958; Delyon and Juditsky, 1993). This leads to convergence of the method and allows it to potentially achieve periods of linear convergence where the step size stays constant. However, the overall convergence rate of the method re... |

15 | KDD-Cup 2004: results and analysis.
- Caruana, Joachims, et al.
- 2004
Citation Context ... [2008], while n SAG iterations using this step size behave in a similar way to an FG method with a step size of 1/2L. Thus, with this small step size, there is not a large difference between these three methods. However, our next result shows that, if the number of training examples is slightly larger than L/µ (which will often be the case, as discussed in Section 6), then the SAG iterations can use a much larger step size and obtain a better convergence rate that depends on the number of training examples but not on µ or L.

Table 1: Binary data sets used in experiments.

| Data set | Training Examples | Variables | Reference |
|---|---|---|---|
| quantum | 50 000 | 78 | [Caruana et al., 2004] |
| protein | 145 751 | 74 | [Caruana et al., 2004] |
| sido | 12 678 | 4 932 | [Guyon, 2008] |
| rcv1 | 20 242 | 47 236 | [Lewis et al., 2004] |
| covertype | 581 012 | 54 | Blackard, Jock, and Dean [Frank and Asuncion, 2010] |

Proposition 2. If $\mu/L \geq 8/n$, with a step size of $\alpha_k = \frac{1}{2n\mu}$ the SAG iterations satisfy for $k \geq n$ that

$$\mathbb{E}\left[g(x^k) - g(x^*)\right] \leq C\left(1 - \frac{1}{8n}\right)^k, \quad \text{with } C = \left[\frac{16L}{3n}\|x^0 - x^*\|^2 + \frac{4\sigma^2}{3n\mu}\left(8\log\left(1 + \frac{\mu n}{4L}\right) + 1\right)\right].$$

In this result we assume that the first n iterations of the algorithm use stochastic gradient descent and that we initialize the subsequent SAG iterations... |

10 | Asymptotically optimal regularization in smooth parametric models. Neural Information Processing Systems,
- Liang, Bach, et al.
- 2009
Citation Context ...es the uniform bound R on the norm of each data point. Thus, the constraint $\mu/L \geq 8/n$ is satisfied when $\lambda \geq 8R/n$. In low-dimensional settings, the optimal regularization parameter is of the form C/n [Liang et al., 2009] where C is a scalar constant, and may thus violate the constraint. However, the improvement with respect to regularization parameters of the form $\lambda = C/\sqrt{n}$ is known to be asymptotically negligible,... |

9 | Variable metric stochastic approximation theory
- Sunehag, Trumpf, et al.
- 2009
Citation Context ...hniques based on quadratic approximations such as non-linear conjugate gradient, quasi-Newton, and Hessian-free Newton methods. Several authors have presented stochastic variants of these algorithms (Sunehag et al., 2009; Ghadimi and Lan, 2010; Xiao, 2010). Under certain conditions these variants are convergent and improve on the constant in the O(1/k) rate (Sunehag et al., 2009). Alternately, if we split the converg... |

3 |
Sido: A pharmacology dataset
- Guyon
- 2008
Citation Context ...hal-00674995, version 1 - 28 Feb 2012

| Data set | Training Examples | Variables | Reference |
|---|---|---|---|
| quantum | 50 000 | 78 | [Caruana et al., 2004] |
| protein | 145 751 | 74 | [Caruana et al., 2004] |
| sido | 12 678 | 4 932 | [Guyon, 2008] |
| rcv1 | 20 242 | 47 236 | [Lewis et al., 2004] |
| covertype | 581 012 | 54 | Blackard, Jock, and Dean [Frank and Asuncion, 2010] |

Table 1: Binary data sets used in experiments. Proposition 2. If $\mu/L \geq 8/n$, with a ste... |

1 | version 3 - 6 Jul 2012 - hal-00674995 - 2009 |

1 | version 4 - 11 Mar 2013 - hal-00674995 |

1 | Fast rates for regularized objectives. NIPS - Sridharan, Shalev-Shwartz, et al. - 2008 |

1 | Large scale online learning. Neural Information Processing Systems, 2003 - Bottou, LeCun |

1 | Fast rates for regularized objectives. Neural Information Processing Systems, - Sridharan, Shalev-Shwartz, et al. - 2008 |

1 |
Variable metric stochastic approximation
- Sunehag, Trumpf, et al.
- 2009
Citation Context ...ent averaging with online iterate averaging, and also displays appealing asymptotic properties. However, the convergence rates of these averaging methods remain sublinear.

Stochastic versions of FG methods: Various options are available to accelerate the convergence of the FG method for smooth functions, such as the accelerated full gradient (AFG) method of Nesterov [1983], as well as classical techniques based on quadratic approximations such as nonlinear conjugate gradient, quasi-Newton, and Hessian-free Newton methods. Several authors have presented stochastic variants of these algorithms [Sunehag et al., 2009, Ghadimi and Lan, 2010, Martens, 2010]. Under certain conditions these variants are convergent and improve on the constant in the O(1/k) rate [Sunehag et al., 2009]. Alternately, if we split the convergence rate into a deterministic and stochastic part, it improves the convergence rate of the deterministic part [Ghadimi and Lan, 2010]. However, as with all other methods we have discussed thus far in this section, we are not aware of any existing method of this flavor that improves on the O(1/k) rate.

Constant step size: If the SG iterations are used with a constant step size (rather than a de... |
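Several of the citation contexts above contrast the basic SG iteration with a constant step size, which stops making progress at a noise floor, with the SAG iteration, which keeps a gradient in memory for each training example and steps along the average of those stored gradients. The Python sketch below illustrates that contrast on a one-dimensional toy problem. It is a minimal illustration under stated assumptions, not the authors' implementation: the objective $f_i(x) = \frac{1}{2}(x - b_i)^2$, the data `b`, the step size `1/(16L)`, the iteration count, and every function name are hypothetical choices made for this example, and the warm start by SG described in Proposition 2 is omitted.

```python
import random


def sg_constant_step(grad_i, n, x0, step, iters, seed=0):
    """Basic SG with a constant step size: x <- x - step * f_i'(x)."""
    rng = random.Random(seed)
    x = x0
    history = []
    for _ in range(iters):
        i = rng.randrange(n)           # sample one training example uniformly
        x -= step * grad_i(i, x)
        history.append(x)
    return history


def sag(grad_i, n, x0, step, iters, seed=0):
    """SAG sketch: step along the average of the last gradient stored per example."""
    rng = random.Random(seed)
    x = x0
    memory = [0.0] * n                 # stored gradient of each f_i (zero until first visit)
    total = 0.0                        # running sum of `memory`, so each step costs O(1)
    history = []
    for _ in range(iters):
        i = rng.randrange(n)
        g = grad_i(i, x)
        total += g - memory[i]         # replace the old gradient of example i in the sum
        memory[i] = g
        x -= step * total / n
        history.append(x)
    return history


def tail_mse(history, x_star, tail):
    """Mean squared distance to x_star over the last `tail` iterates."""
    last = history[-tail:]
    return sum((x - x_star) ** 2 for x in last) / len(last)


# Hypothetical toy problem: f_i(x) = 0.5 * (x - b_i)^2, so g is minimized at mean(b).
n = 20
b = [float(i) for i in range(n)]
x_star = sum(b) / n


def grad(i, x):
    return x - b[i]


L = 1.0                                # Lipschitz constant of each f_i' in this toy problem
step = 1.0 / (16.0 * L)                # hand-picked constant step size (an assumption)

sg_err = tail_mse(sg_constant_step(grad, n, 0.0, step, 20000), x_star, n)
sag_err = tail_mse(sag(grad, n, 0.0, step, 20000), x_star, n)
print(f"SG  tail MSE: {sg_err:.3e}")
print(f"SAG tail MSE: {sag_err:.3e}")
```

Both loops use the same constant step size and the same sampling seed, so the only difference is the search direction. Keeping a running sum of the stored gradients is a design choice of this sketch that avoids recomputing the average over all n examples at every iteration.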