Results 1–10 of 77
Scalable inference in latent variable models
 In WSDM
, 2012
"... Latent variable techniques are pivotal in tasks ranging from predicting user click patterns and targeting ads to organizing the news and managing user generated content. Latent variable techniques like topic modeling, clustering, and subspace estimation provide substantial insight into the latent st ..."
Abstract

Cited by 62 (16 self)
Latent variable techniques are pivotal in tasks ranging from predicting user click patterns and targeting ads to organizing the news and managing user-generated content. Latent variable techniques like topic modeling, clustering, and subspace estimation provide substantial insight into the latent structure of complex data with little or no external guidance, making them ideal for reasoning about large-scale, rapidly evolving datasets. Unfortunately, due to the data dependencies and global state introduced by latent variables and the iterative nature of latent variable inference, latent-variable techniques are often prohibitively expensive to apply to large-scale, streaming datasets. In this paper we present a scalable parallel framework for efficient inference in latent variable models over streaming web-scale data. Our framework addresses three key challenges: 1) synchronizing the global state, which includes global latent variables (e.g., cluster centers and dictionaries); 2) efficiently storing and retrieving the large local state, which includes the data points and their corresponding latent variables (e.g., cluster membership); and 3) sequentially incorporating streaming data (e.g., the news). We address these challenges by introducing: 1) a novel delta-based aggregation system with a bandwidth-efficient communication protocol; 2) schedule-aware out-of-core storage; and 3) approximate forward sampling to rapidly incorporate new data. We demonstrate state-of-the-art performance of our framework by easily tackling datasets two orders of magnitude larger than those addressed by the current state of the art. Furthermore, we provide an optimized and easily customizable open-source implementation of the framework.
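The delta-based aggregation idea can be illustrated with a toy sketch (hypothetical names, not the paper's implementation): each worker buffers local changes to the global state and periodically ships only those deltas, which the server merges additively.

```python
from collections import Counter

class DeltaAggregator:
    """Toy delta-based aggregation: workers send sparse deltas and
    the server merges them additively into the global state."""
    def __init__(self):
        self.global_state = Counter()

    def merge(self, delta):
        # Only the changed entries travel over the network.
        self.global_state.update(delta)

class Worker:
    def __init__(self, server):
        self.server = server
        self.local_delta = Counter()

    def observe(self, key, count=1):
        self.local_delta[key] += count   # buffer the change locally

    def sync(self):
        self.server.merge(self.local_delta)  # ship only the delta
        self.local_delta.clear()

server = DeltaAggregator()
w1, w2 = Worker(server), Worker(server)
w1.observe("topic:sports", 3)
w2.observe("topic:sports", 2)
w2.observe("topic:news", 1)
w1.sync(); w2.sync()
print(server.global_state["topic:sports"])  # 5
```

Because deltas commute under addition, workers can sync in any order and the merged global state is the same, which is what makes the bandwidth-efficient protocol possible.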
Recent Advances of Large-scale Linear Classification
"... Linear classification is a useful tool in machine learning and data mining. For some data in a rich dimensional space, the performance (i.e., testing accuracy) of linear classifiers has shown to be close to that of nonlinear classifiers such as kernel methods, but training and testing speed is much ..."
Abstract

Cited by 32 (6 self)
Linear classification is a useful tool in machine learning and data mining. For some data in a rich dimensional space, the performance (i.e., testing accuracy) of linear classifiers has been shown to be close to that of nonlinear classifiers such as kernel methods, but training and testing speed is much faster. Recently, many research works have developed efficient optimization methods to construct linear classifiers and applied them to some large-scale applications. In this paper, we give a comprehensive survey on the recent development of this active research area.
Accelerated, parallel and proximal coordinate descent
, 2014
"... We propose a new stochastic coordinate descent method for minimizing the sum of convex functions each of which depends on a small number of coordinates only. Our method (APPROX) is simultaneously Accelerated, Parallel and PROXimal; this is the first time such a method is proposed. In the special cas ..."
Abstract

Cited by 31 (6 self)
We propose a new stochastic coordinate descent method for minimizing the sum of convex functions, each of which depends on a small number of coordinates only. Our method (APPROX) is simultaneously Accelerated, Parallel and PROXimal; this is the first time such a method has been proposed. In the special case when the number of processors is equal to the number of coordinates, the method converges at the rate 2ω̄L̄R²/(k+1)², where k is the iteration counter, ω̄ is an average degree of separability of the loss function, L̄ is the average of the Lipschitz constants associated with the coordinates and individual functions in the sum, and R is the distance of the initial point from the minimizer. We show that the method can be implemented without the need to perform full-dimensional vector operations, which is the major bottleneck of accelerated coordinate descent. The fact that the method depends on the average degree of separability, and not on the maximum degree of separability, can be attributed to the use of new safe large step sizes, leading to an improved expected separable over-approximation (ESO). These are of independent interest and can be utilized in all existing parallel stochastic coordinate descent algorithms based on the concept of ESO.
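The per-coordinate update that APPROX accelerates and parallelizes can be sketched with plain (non-accelerated) randomized coordinate descent; this is only a baseline illustration, not the paper's method:

```python
import random

def randomized_cd(grad_i, lipschitz, x0, steps, seed=0):
    """Plain randomized coordinate descent: each step picks one
    coordinate i and moves it with step size 1/L_i, where L_i is
    the coordinate-wise Lipschitz constant of the gradient.
    (Baseline sketch only; APPROX additionally accelerates and
    parallelizes these updates.)"""
    rng = random.Random(seed)
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        i = rng.randrange(n)               # sample a coordinate
        x[i] -= grad_i(x, i) / lipschitz[i]  # 1/L_i gradient step
    return x

# Separable toy objective f(x) = sum_i 0.5*(x_i - a_i)^2, minimizer x = a.
a = [1.0, -2.0, 3.0]
grad = lambda x, i: x[i] - a[i]
x = randomized_cd(grad, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0], steps=200)
```

Note that each step touches a single coordinate only; avoiding the full-dimensional vector operations mentioned in the abstract is exactly what makes the accelerated variant practical.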
More effective distributed ML via a stale synchronous parallel parameter server
 In NIPS
, 2013
"... We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy ..."
Abstract

Cited by 30 (16 self)
We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy-to-use shared interface for read/write access to an ML model’s values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to get them from central storage. This significantly increases the proportion of time workers spend computing, as opposed to waiting. Furthermore, the SSP model ensures ML algorithm correctness by limiting the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fully synchronous and asynchronous schemes.
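The core SSP read rule can be sketched in a few lines (hypothetical names, illustrative only): a worker may serve a parameter from its local cache as long as the cached copy is within a fixed staleness bound of the worker's current clock; otherwise it must refresh from central storage.

```python
class SSPCache:
    """Toy Stale Synchronous Parallel read path: stale cached values
    are allowed up to `staleness` clock ticks, bounding the maximum
    age of any value a worker computes with. (Illustrative sketch,
    not the paper server's implementation.)"""
    def __init__(self, server_params, staleness):
        self.server = server_params   # stands in for central storage
        self.staleness = staleness
        self.cache = {}               # key -> (value, clock when fetched)

    def read(self, key, worker_clock):
        if key in self.cache:
            value, seen = self.cache[key]
            if worker_clock - seen <= self.staleness:
                return value, "cache"   # stale but within the bound
        value = self.server[key]        # forced refresh from storage
        self.cache[key] = (value, worker_clock)
        return value, "server"

server = {"w": 0.5}
c = SSPCache(server, staleness=2)
v, src = c.read("w", worker_clock=0)    # cold cache: fetch from server
server["w"] = 0.7                       # server state moves on
v2, src2 = c.read("w", worker_clock=2)  # within bound: stale 0.5 is fine
v3, src3 = c.read("w", worker_clock=5)  # bound exceeded: refresh to 0.7
```

Setting `staleness=0` recovers fully synchronous reads, while an unbounded staleness recovers the asynchronous scheme the abstract compares against.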
Mini-batch primal and dual methods for SVMs
 In 30th International Conference on Machine Learning
, 2013
"... We address the issue of using minibatches in stochastic optimization of SVMs. We show that the same quantity, the spectral norm of the data, controls the parallelization speedup obtained for both primal stochastic subgradient descent (SGD) and stochastic dual coordinate ascent (SCDA) methods and us ..."
Abstract

Cited by 26 (7 self)
We address the issue of using mini-batches in stochastic optimization of SVMs. We show that the same quantity, the spectral norm of the data, controls the parallelization speedup obtained for both primal stochastic subgradient descent (SGD) and stochastic dual coordinate ascent (SDCA) methods, and use it to derive novel variants of mini-batched SDCA. Our guarantees for both methods are expressed in terms of the original non-smooth primal problem based on the hinge loss.
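The primal method the abstract analyzes can be sketched as mini-batch subgradient descent on the regularized hinge-loss objective, with a Pegasos-style 1/(λt) step size. This is a minimal illustration under assumed toy data, not the paper's code:

```python
import random

def minibatch_sgd_svm(data, lam=0.1, batch=2, epochs=50, seed=0):
    """Mini-batch subgradient descent on the primal SVM objective
    lam/2*||w||^2 + mean(hinge(y * <w, x>)). Each step averages the
    hinge subgradient over a mini-batch. (Illustrative sketch.)"""
    rng = random.Random(seed)
    d = len(data[0][0])
    w = [0.0] * d
    t = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for s in range(0, len(data), batch):
            t += 1
            eta = 1.0 / (lam * t)            # Pegasos-style step size
            mb = data[s:s + batch]
            g = [lam * wi for wi in w]       # gradient of the regularizer
            for x, y in mb:
                if y * sum(wi * xi for wi, xi in zip(w, x)) < 1:
                    for j in range(d):       # hinge subgradient, averaged
                        g[j] -= y * x[j] / len(mb)
            w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Toy linearly separable data: label is the sign of the first feature.
data = [([1.0, 0.2], 1), ([0.8, -0.1], 1),
        ([-1.0, 0.3], -1), ([-0.9, -0.2], -1)]
w = minibatch_sgd_svm(list(data))
```

The mini-batch size `batch` is the knob the spectral-norm analysis governs: how large it can be made before the averaged subgradient steps stop yielding a near-linear parallelization speedup.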
Accelerated mini-batch stochastic dual coordinate ascent
"... Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the minibatch setting that is often used in practice. Our main contribution is to introduce an accelerated mini ..."
Abstract

Cited by 24 (3 self)
Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and the accelerated deterministic gradient descent method of Nesterov [2007].
On the non-asymptotic convergence of cyclic coordinate descent methods
 SIAM Journal on Optimization
"... Abstract. Cyclic coordinate descent is a classic optimization method that has witnessed a resurgence of interest in signal processing, statistics, and machine learning. Reasons for this renewed interest include the simplicity, speed, and stability of the method, as well as its competitive performan ..."
Abstract

Cited by 18 (0 self)
Cyclic coordinate descent is a classic optimization method that has witnessed a resurgence of interest in signal processing, statistics, and machine learning. Reasons for this renewed interest include the simplicity, speed, and stability of the method, as well as its competitive performance on ℓ1-regularized smooth optimization problems. Surprisingly, very little is known about its non-asymptotic convergence behavior on these problems. Most existing results either just prove convergence or provide asymptotic rates. We fill this gap in the literature by proving O(1/k) convergence rates (where k is the iteration count) for two variants of cyclic coordinate descent under an isotonicity assumption. Our analysis proceeds by comparing the objective values attained by the two variants with each other, as well as with the gradient descent algorithm. We show that the iterates generated by the cyclic coordinate descent methods remain better than those of gradient descent uniformly over time.
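Cyclic coordinate descent is easy to state on a smooth quadratic: sweep through the coordinates in a fixed order, exactly minimizing over one coordinate at a time (on a quadratic this is Gauss–Seidel). A minimal sketch, assuming a symmetric positive definite matrix; this illustrates the method generically, not the paper's analyzed variants:

```python
def cyclic_cd_quadratic(A, b, sweeps):
    """Cyclic coordinate descent for f(x) = 0.5*x'Ax - b'x with A
    symmetric positive definite. Each inner step sets x_i to the
    exact minimizer of f over coordinate i, holding the rest fixed
    (equivalently, a Gauss-Seidel sweep for Ax = b)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Solve df/dx_i = 0 for x_i with all other coordinates fixed.
            r = b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = r / A[i][i]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = cyclic_cd_quadratic(A, b, sweeps=50)
# The exact solution of Ax = b here is x = (1/11, 7/11).
```

Each sweep is one "iteration" in the sense of the abstract's O(1/k) rates; note that no full gradient is ever formed, only one coordinate's residual at a time.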
Communication-efficient distributed dual coordinate ascent
 In Advances in Neural Information Processing Systems (NIPS)
, 2014
"... ar ..."