Results 1  10
of
51
Large Scale Distributed Deep Networks
"... Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a ..."
Abstract

Cited by 92 (11 self)
 Add to MetaCart
(Show Context)
Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for largescale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of LBFGS. Downpour SGD and Sandblaster LBFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and achieves stateoftheart performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradientbased machine learning algorithm. 1
A Stochastic Gradient Method with an Exponential Convergence Rate for StronglyConvex Optimization with Finite Training Sets. arXiv preprint arXiv:1202.6258
, 2012
"... We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in ..."
Abstract

Cited by 76 (11 self)
 Add to MetaCart
We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training objective and reducing the testing objective quickly. 1
A reliable effective terascale linear learning system
, 2011
"... We present a system and a set of techniques for learning linear predictors with convex losses on terascale data sets, with trillions of features,1 billions of training examples and millions of parameters in an hour using a cluster of 1000 machines. Individually none of the component techniques are n ..."
Abstract

Cited by 65 (6 self)
 Add to MetaCart
(Show Context)
We present a system and a set of techniques for learning linear predictors with convex losses on terascale data sets, with trillions of features,1 billions of training examples and millions of parameters in an hour using a cluster of 1000 machines. Individually none of the component techniques are new, but the careful synthesis required to obtain an efficient implementation is. The result is, up to our knowledge, the most scalable and efficient linear learning system reported in the literature.2 We describe and thoroughly evaluate the components of the system, showing the importance of the various design choices.
Minimizing Finite Sums with the Stochastic Average Gradient
, 2013
"... We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method’s iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradie ..."
Abstract

Cited by 34 (2 self)
 Add to MetaCart
We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method’s iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values the SAG method achieves a faster convergence rate than blackbox SG methods. The convergence rate is improved from O(1 / √ k) to O(1/k) in general, and when the sum is stronglyconvex the convergence rate is improved from the sublinear O(1/k) to a linear convergence rate of the form O(ρ k) for ρ < 1. Further, in many cases the convergence rate of the new method is also faster than blackbox deterministic gradient methods, in terms of the number of gradient evaluations. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of nonuniform sampling strategies. 1
Recent Advances of Largescale Linear Classification
"... Linear classification is a useful tool in machine learning and data mining. For some data in a rich dimensional space, the performance (i.e., testing accuracy) of linear classifiers has shown to be close to that of nonlinear classifiers such as kernel methods, but training and testing speed is much ..."
Abstract

Cited by 33 (6 self)
 Add to MetaCart
(Show Context)
Linear classification is a useful tool in machine learning and data mining. For some data in a rich dimensional space, the performance (i.e., testing accuracy) of linear classifiers has shown to be close to that of nonlinear classifiers such as kernel methods, but training and testing speed is much faster. Recently, many research works have developed efficient optimization methods to construct linear classifiers and applied them to some largescale applications. In this paper, we give a comprehensive survey on the recent development of this active research area.
Better MiniBatch Algorithms via Accelerated Gradient Methods
"... Minibatch algorithms have been proposed as a way to speedup stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a sig ..."
Abstract

Cited by 28 (6 self)
 Add to MetaCart
(Show Context)
Minibatch algorithms have been proposed as a way to speedup stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speedup and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. 1
Minibatch primal and dual methods for SVMs
 In 30th International Conference on Machine Learning
, 2013
"... We address the issue of using minibatches in stochastic optimization of SVMs. We show that the same quantity, the spectral norm of the data, controls the parallelization speedup obtained for both primal stochastic subgradient descent (SGD) and stochastic dual coordinate ascent (SCDA) methods and us ..."
Abstract

Cited by 28 (5 self)
 Add to MetaCart
(Show Context)
We address the issue of using minibatches in stochastic optimization of SVMs. We show that the same quantity, the spectral norm of the data, controls the parallelization speedup obtained for both primal stochastic subgradient descent (SGD) and stochastic dual coordinate ascent (SCDA) methods and use it to derive novel variants of minibatched SDCA. Our guarantees for both methods are expressed in terms of the original nonsmooth primal problem based on the hingeloss. 1.
Accelerated minibatch stochastic dual coordinate ascent. arxiv
"... Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the minibatch setting that is often used in practice. Our main contribution is to introduce an accelerated mini ..."
Abstract

Cited by 26 (1 self)
 Add to MetaCart
Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the minibatch setting that is often used in practice. Our main contribution is to introduce an accelerated minibatch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007]. 1
More effective distributed ML via a stale synchronous parallel parameter server
 In NIPS
, 2013
"... We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy ..."
Abstract

Cited by 26 (15 self)
 Add to MetaCart
(Show Context)
We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easytouse shared interface for read/write access to an ML model’s values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to get them from a central storage. This significantly increases the proportion of time workers spend computing, as opposed to waiting. Furthermore, the SSP model ensures ML algorithm correctness by limiting the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fullysynchronous and asynchronous schemes. 1
Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems
"... Abstract—Matrix factorization, when the matrix has missing values, has become one of the leading techniques for recommender systems. To handle webscale datasets with millions of users and billions of ratings, scalability becomes an important issue. Alternating Least Squares (ALS) and Stochastic Gra ..."
Abstract

Cited by 22 (1 self)
 Add to MetaCart
(Show Context)
Abstract—Matrix factorization, when the matrix has missing values, has become one of the leading techniques for recommender systems. To handle webscale datasets with millions of users and billions of ratings, scalability becomes an important issue. Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD) are two popular approaches to compute matrix factorization. There has been a recent flurry of activity to parallelize these algorithms. However, due to the cubic time complexity in the target rank, ALS is not scalable to largescale datasets. On the other hand, SGD conducts efficient updates but usually suffers from slow convergence that is sensitive to the parameters. Coordinate descent, a classical optimization approach, has been used for many other largescale problems, but its application to matrix factorization for recommender systems has not been explored thoroughly. In this paper, we show that coordinate descent based methods have a more efficient update rule compared to ALS, and are faster and have more stable convergence than SGD. We study different update sequences and propose the CCD++ algorithm, which updates rankone factors one by one. In addition, CCD++ can be easily parallelized on both multicore and distributed systems. We empirically show that CCD++ is much faster than ALS and SGD in both settings. As an example, on a synthetic dataset with 2 billion ratings, CCD++ is 4 times faster than both SGD and ALS using a distributed system with 20 machines. KeywordsRecommender systems, Matrix factorization, Low rank approximation, Parallelization.