Results 1–10 of 21
Parallel Coordinate Descent Methods for Big Data Optimization
, 2012
Abstract

Cited by 74 (4 self)
In this work we show that randomized (block) coordinate descent methods can be accelerated by parallelization when applied to the problem of minimizing the sum of a partially separable smooth convex function and a simple separable convex function. The theoretical speedup, as compared to the serial method, and referring to the number of iterations needed to approximately solve the problem with high probability, is a simple expression depending on the number of parallel processors and a natural and easily computable measure of separability of the smooth component of the objective function. In the worst case, when no degree of separability is present, there may be no speedup; in the best case, when the problem is separable, the speedup is equal to the number of processors. Our analysis also works in the mode when the number of blocks being updated at each iteration is random, which allows for modeling situations with busy or unreliable processors. We show that our algorithm is able to solve a LASSO problem involving a matrix with 20 billion nonzeros in 2 hours on a large memory node with 24 cores.
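The parallel coordinate descent scheme this abstract describes can be simulated in a few lines. The sketch below is illustrative only, not the paper's implementation: `tau` coordinates are sampled per iteration, all partial gradients are read at the same shared point (mimicking `tau` processors), and steps are damped by the worst-case factor `beta = tau`, the fully coupled regime in which the abstract predicts no speedup; separable problems tolerate `beta = 1`. All names and constants are my own.

```python
import numpy as np

def soft_threshold(z, t):
    """Prox operator of t*|.|: shrink z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def parallel_cd_lasso(A, b, lam, tau=4, iters=300, seed=0):
    """Minimize 0.5*||Ax-b||^2 + lam*||x||_1 by updating tau randomly
    chosen coordinates per iteration from the same shared point,
    simulating tau parallel processors."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    x = np.zeros(n)
    L = (A ** 2).sum(axis=0)          # coordinate-wise Lipschitz constants
    r = A @ x - b                     # residual, kept consistent with x
    beta = tau                        # worst-case (fully coupled) damping
    for _ in range(iters):
        S = rng.choice(n, size=tau, replace=False)
        g = A[:, S].T @ r             # all tau gradients read the same point
        for j, gj in zip(S, g):       # apply the tau prox steps
            x_new = soft_threshold(x[j] - gj / (beta * L[j]),
                                   lam / (beta * L[j]))
            r += (x_new - x[j]) * A[:, j]
            x[j] = x_new
    return x
```

With `beta = tau` each simultaneous batch of updates is guaranteed not to overshoot even when the sampled columns are identical, which is what makes the sketch safe regardless of separability.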
A Fast Parallel SGD for Matrix Factorization in Shared Memory Systems
Abstract

Cited by 10 (1 self)
Matrix factorization is known to be an effective method for recommender systems that are given only the ratings from users to items. Currently, stochastic gradient descent (SGD) is one of the most popular algorithms for matrix factorization. However, as a sequential approach, SGD is difficult to parallelize for handling web-scale problems. In this paper, we develop a fast parallel SGD method, FPSGD, for shared memory systems. By dramatically reducing the cache-miss rate and carefully addressing the load balance of threads, FPSGD is more efficient than state-of-the-art parallel algorithms for matrix factorization.
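The sequential core that FPSGD parallelizes is plain SGD over the observed ratings. A minimal sketch, with hyperparameters and the function name chosen by me rather than taken from the paper:

```python
import random
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=8, lr=0.05, reg=0.02,
           epochs=100, seed=0):
    """Plain serial SGD for matrix factorization: approximate the rating
    of user u for item i by P[u] @ Q[i], sweeping the (u, i, r) triples
    in random order each epoch."""
    rng = np.random.default_rng(seed)
    shuffler = random.Random(seed)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        shuffler.shuffle(ratings)
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            pu = P[u].copy()          # snapshot: both updates use the old P[u]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q
```

Because each update touches only one row of `P` and one row of `Q`, two updates conflict only when they share a user or an item; exploiting that is what makes the parallelization in the paper possible.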
High-performance distributed ML at scale through parameter server consistency models
 In AAAI
, 2015
Abstract

Cited by 6 (6 self)
As Machine Learning (ML) applications embrace greater data size and model complexity, practitioners turn to distributed clusters to satisfy the increased computational and memory demands. Effective use of clusters for ML programs requires considerable expertise in writing distributed code, but existing highly abstracted frameworks like Hadoop that pose low barriers to distributed programming have not, in practice, matched the performance seen in highly specialized and advanced ML implementations. The recent Parameter Server (PS) paradigm is a middle ground between these extremes, allowing easy conversion of single-machine parallel ML programs into distributed ones, while maintaining high throughput through relaxed “consistency models” that allow asynchronous (and, hence, inconsistent) parameter reads. However, due to insufficient theoretical study, it is not clear which of these consistency models can really ensure correct ML algorithm output; at the same time, there remain many theoretically motivated but undiscovered opportunities to maximize computational throughput. Inspired by this challenge, we study both the theoretical guarantees and empirical behavior of iterative-convergent ML algorithms in existing PS consistency models. We then use the gleaned insights to improve a consistency model using an “eager” PS communication mechanism, and implement it as a new PS system that enables ML programs to reach their solution more quickly.
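One widely studied relaxed consistency model for parameter servers is Stale Synchronous Parallel (SSP), where the fastest worker may run at most a bounded number of clocks ahead of the slowest. The toy gate below illustrates only that general SSP rule, not this paper's system or its "eager" mechanism; the class name and interface are mine.

```python
class StaleSyncClock:
    """Toy gate for the Stale Synchronous Parallel (SSP) consistency model:
    a worker may begin its next clock only if every worker has completed at
    least (its own clock - staleness) clocks, so any value it reads is at
    most `staleness` clocks out of date. With staleness=0 this degenerates
    to bulk-synchronous execution."""

    def __init__(self, n_workers, staleness):
        self.clock = [0] * n_workers   # clocks completed per worker
        self.s = staleness

    def can_advance(self, w):
        return min(self.clock) >= self.clock[w] - self.s

    def advance(self, w):
        if not self.can_advance(w):
            raise RuntimeError(f"worker {w} blocked at clock {self.clock[w]}")
        self.clock[w] += 1
```

For example, with `staleness=1` a worker can get two clocks ahead of an idle peer before it blocks, trading bounded inconsistency for throughput.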
On model parallelization and scheduling strategies for distributed machine learning
 In NIPS
, 2014
Abstract

Cited by 3 (3 self)
Distributed machine learning has typically been approached from a data-parallel perspective, where big data are partitioned to multiple workers and an algorithm is executed concurrently over different data subsets under various synchronization schemes to ensure speedup and/or correctness. A sibling problem that has received relatively less attention is how to ensure efficient and correct model-parallel execution of ML algorithms, where the parameters of an ML program are partitioned to different workers and undergo concurrent iterative updates. We argue that model and data parallelism impose rather different challenges for system design, algorithmic adjustment, and theoretical analysis. In this paper, we develop a system for model-parallelism, STRADS, that provides a programming abstraction for scheduling parameter updates by discovering and leveraging changing structural properties of ML programs. STRADS enables a flexible tradeoff between scheduling efficiency and fidelity to intrinsic dependencies within the models, and improves memory efficiency of distributed ML. We demonstrate the efficacy of model-parallel algorithms implemented on STRADS versus popular implementations for topic modeling, matrix factorization, and Lasso.
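The general idea of dependency-aware update scheduling can be illustrated on Lasso: coordinates whose feature columns are nearly orthogonal interfere little when updated simultaneously. The greedy sketch below is my own illustration of that principle, not the STRADS scheduler; the function name, `eps` threshold, and greedy strategy are all assumptions.

```python
import numpy as np

def schedule_independent(A, tau, eps=0.1, seed=0):
    """Greedy sketch of dependency-aware scheduling for model-parallel
    Lasso: select up to tau coordinates whose feature columns are mutually
    near-orthogonal (|cosine similarity| < eps), so their simultaneous
    updates barely interact. Fully correlated columns force a schedule of
    size 1, i.e. serial updates."""
    rng = np.random.default_rng(seed)
    Anorm = A / np.linalg.norm(A, axis=0)        # unit-norm columns
    chosen = []
    for j in rng.permutation(A.shape[1]):
        if all(abs(Anorm[:, j] @ Anorm[:, c]) < eps for c in chosen):
            chosen.append(int(j))
            if len(chosen) == tau:
                break
    return chosen
```

The tradeoff the abstract mentions is visible here: a smaller `eps` is more faithful to the model's dependency structure but yields smaller (less efficient) parallel batches.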
Primitives for dynamic big model parallelism
 In Advances in Neural Information Processing Systems (NIPS
, 2014
Abstract

Cited by 3 (2 self)
When training large machine learning models with many variables or parameters, a single machine is often inadequate since the model may be too large to fit in memory, while training can take a long time even with stochastic updates. A natural recourse is to turn to distributed cluster computing, in order to harness additional memory and processors. However, naive, unstructured parallelization of ML algorithms can make inefficient use of distributed memory, while failing to obtain proportional convergence speedups, or can even result in divergence. We develop a framework of primitives for dynamic model-parallelism, STRADS, in order to explore partitioning and update scheduling of model variables in distributed ML algorithms, thus improving their memory efficiency while presenting new opportunities to speed up convergence without compromising inference correctness. We demonstrate the efficacy of model-parallel algorithms implemented in STRADS versus popular implementations for Topic Modeling, Matrix Factorization and Lasso.
PASSCoDe: Parallel ASynchronous Stochastic dual Coordinate Descent
 Machine Learning
, 2011
Abstract

Cited by 2 (0 self)
Stochastic Dual Coordinate Descent (DCD) is one of the most efficient ways to solve the family of ℓ2-regularized empirical risk minimization problems, including linear SVM, logistic regression, and many others. The vanilla implementation of DCD is quite slow; however, by maintaining primal variables while updating dual variables, the time complexity of DCD can be significantly reduced. Such a strategy forms the core algorithm in the widely used LIBLINEAR package. In this paper, we parallelize the DCD algorithms in LIBLINEAR. In recent research, several synchronized parallel DCD algorithms have been proposed; however, they fail to achieve good speedup in the shared-memory multicore setting. In this paper, we propose a family of parallel asynchronous stochastic dual coordinate descent algorithms (PASSCoDe). Each thread repeatedly selects a random dual variable and conducts coordinate updates using the primal variables that are stored in the shared memory. We analyze the convergence properties of DCD when different locking/atomic mechanisms are applied. For the implementation with atomic operations, we show linear convergence under mild conditions. For the implementation without any atomic operations or locking, we present a novel error analysis for PASSCoDe under the multicore environment, showing that the converged solution is the exact solution of a primal problem with a perturbed regularizer. Experimental results show that our methods are much faster than previous parallel coordinate descent solvers.
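The primal-maintenance trick the abstract credits to LIBLINEAR is concrete and short: keep w = Σᵢ αᵢyᵢxᵢ up to date so each dual coordinate update costs only one sparse dot product. Below is a serial sketch of that core for the L1-loss (hinge) linear SVM; PASSCoDe's contribution is running this loop asynchronously across threads, which the sketch does not attempt.

```python
import numpy as np

def dcd_svm(X, y, C=1.0, epochs=50, seed=0):
    """Serial dual coordinate descent for L1-loss linear SVM:
    min_w 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * w @ x_i).
    The primal vector w = sum_i alpha_i * y_i * x_i is maintained
    incrementally, so each dual update is O(nnz(x_i))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    Qii = (X ** 2).sum(axis=1)            # diagonal of the dual Hessian
    for _ in range(epochs):
        for i in rng.permutation(n):
            G = y[i] * (w @ X[i]) - 1.0   # dual gradient at coordinate i
            a_new = np.clip(alpha[i] - G / Qii[i], 0.0, C)
            w += (a_new - alpha[i]) * y[i] * X[i]   # primal maintenance
            alpha[i] = a_new
    return w
```

In the asynchronous setting each thread runs this inner update on shared `w`, which is exactly where the locking/atomic variants analyzed in the paper come in.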
Petuum: A New Platform for Distributed Machine Learning on Big Data
 IEEE Transactions on Big Data
, 2015
Abstract

Cited by 1 (0 self)
How can one build a distributed framework that allows efficient deployment of a wide spectrum of modern advanced machine learning (ML) programs for industrial-scale problems using Big Models (100s of billions of parameters) on Big Data (terabytes or petabytes)? Contemporary parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized operators relying on graphical representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of different ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel ...
A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems
Abstract
Matrix factorization is known to be an effective method for recommender systems that are given only the ratings from users to items. Currently, the stochastic gradient (SG) method is one of the most popular algorithms for matrix factorization. However, as a sequential approach, SG is difficult to parallelize for handling web-scale problems. In this paper, we develop a fast parallel SG method, FPSG, for shared memory systems. By dramatically reducing the cache-miss rate and carefully addressing the load balance of threads, FPSG is more efficient than state-of-the-art parallel algorithms for matrix factorization.
Shared-Memory and Shared-Nothing Stochastic Gradient Descent Algorithms for Matrix Completion
 Under consideration for publication in Knowledge and Information Systems
, 2013
Abstract
We provide parallel algorithms for large-scale matrix completion on problems with millions of rows, millions of columns, and billions of revealed entries. We focus on in-memory algorithms that run either in a shared-memory environment on a powerful compute node or in a shared-nothing environment on a small cluster of commodity nodes; even very large problems can be handled effectively in these settings. Our ASGD, DSGD-MR, DSGD++, and CSGD algorithms are novel variants of the popular stochastic gradient descent (SGD) algorithm, with the latter three algorithms based on a new “stratified SGD” approach. All of the algorithms are cache-friendly and exploit thread-level parallelism, in-memory processing, and asynchronous communication. We investigate the performance of both new and existing algorithms via a theoretical complexity analysis and a set of large-scale experiments. The results show that CSGD is more scalable, and up to 60% faster, than the best-performing alternative method in the shared-memory setting. DSGD++ is superior in terms of overall runtime, memory consumption, and scalability in the shared-nothing setting. For example, DSGD++ can solve a difficult matrix completion problem on a high-variance matrix with 10M rows, 1M columns, and 10B revealed entries in around 40 minutes on 16 compute nodes. In general, algorithms based on stochastic gradient descent appear to perform better than algorithms based on alternating minimizations, such as the PALS and DALS alternating least-squares algorithms.
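The classic stratification behind "stratified SGD" is easy to state: tile the rating matrix into a p-by-p grid of blocks and, in each sub-epoch, process a "stratum" of p blocks that share no rows or columns. The sketch below shows only this generic blocking, not the exact scheduling of the DSGD-MR, DSGD++, or CSGD variants.

```python
def strata(p):
    """The p interchangeable strata of a p-by-p blocking of the rating
    matrix: in stratum k, worker i processes block (i, (i + k) % p).
    Blocks within one stratum share no rows and no columns, so the p
    workers' SGD updates touch disjoint factor rows and need no locking."""
    for k in range(p):
        yield [(i, (i + k) % p) for i in range(p)]
```

Cycling through all p strata covers every block of the matrix once, so one full pass over the strata is one SGD epoch executed with p-way conflict-free parallelism.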
Distributed Matrix Completion and Robust Factorization
Abstract
† These authors contributed equally. If learning methods are to scale to the massive sizes of modern data sets, it is essential for the field of machine learning to embrace parallel and distributed computing. Inspired by the recent development of matrix factorization methods with rich theory but poor computational complexity and by the relative ease of mapping matrices onto distributed architectures, we introduce a scalable divide-and-conquer framework for noisy matrix factorization. We present a thorough theoretical analysis of this framework in which we characterize the statistical errors introduced by the “divide” step and control their magnitude in the “conquer” step, so that the overall algorithm enjoys high-probability estimation guarantees comparable to those of its base algorithm. We also present experiments in collaborative filtering and video background modeling that demonstrate the near-linear to superlinear speedups attainable with this approach.
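The divide-and-conquer shape of such a framework can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the paper's algorithm: truncated SVD stands in for an arbitrary base factorization method, and the "conquer" step is reduced to re-projecting all blocks onto one shared column space.

```python
import numpy as np

def dfc_sketch(M, k, n_parts=2):
    """Divide-and-conquer low-rank approximation: factor each column
    block independently (the 'divide' step, embarrassingly parallel),
    then project all blocks onto a single k-dimensional column space
    recovered from the first block (a simplified 'conquer' step)."""
    blocks = np.array_split(M, n_parts, axis=1)
    factored = []
    for B in blocks:                               # parallelizable loop
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        factored.append(U[:, :k] @ np.diag(s[:k]) @ Vt[:k])
    U, _, _ = np.linalg.svd(factored[0], full_matrices=False)
    Uk = U[:, :k]                                  # shared column space
    return Uk @ (Uk.T @ np.hstack(factored))
```

Each sub-factorization sees only a slice of the columns, which is where the speedup comes from; the theoretical work in the paper is about bounding the statistical error this splitting introduces.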