Results 1–6 of 6
LightLDA: Big topic models on modest compute clusters
arXiv:1412.1576 [stat.ML], 2014
Abstract

Cited by 4 (2 self)
When building large-scale machine learning (ML) programs, such as massive topic models or deep neural networks with up to trillions of parameters and training examples, one usually assumes that such massive tasks can only be attempted with industrial-sized clusters of thousands of nodes, which are out of reach for most practitioners and academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters) on a document collection with 200 billion tokens — a scale not yet reported even with thousands of machines. Our major contributions include: 1) a new, highly efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges ...
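The O(1) sampling idea can be illustrated with a minimal, generic Metropolis-Hastings step for resampling one token's topic. This is a sketch, not LightLDA's actual sampler (which alternates alias-table doc- and word-proposals); `propose`, `target`, and `proposal_prob` are illustrative stand-ins.

```python
import random

def mh_topic_step(current, propose, target, proposal_prob):
    """One Metropolis-Hastings step: draw a candidate topic in O(1) and
    accept it with probability min(1, target(t') * q(t) / (target(t) * q(t'))).
    The tiny epsilon guards against a zero denominator."""
    candidate = propose()
    ratio = (target(candidate) * proposal_prob(current)) / \
            (target(current) * proposal_prob(candidate) + 1e-300)
    return candidate if random.random() < min(1.0, ratio) else current
```

Because drawing from the proposal and evaluating the acceptance ratio each take constant time, the per-token cost is independent of the number of topics, which is the sense in which such a sampler's cost can be agnostic of model size.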
On model parallelization and scheduling strategies for distributed machine learning
In NIPS, 2014
Abstract

Cited by 4 (4 self)
Distributed machine learning has typically been approached from a data-parallel perspective, where big data are partitioned across multiple workers and an algorithm is executed concurrently over different data subsets under various synchronization schemes to ensure speedup and/or correctness. A sibling problem that has received relatively less attention is how to ensure efficient and correct model-parallel execution of ML algorithms, where the parameters of an ML program are partitioned across different workers and undergo concurrent iterative updates. We argue that model and data parallelism impose rather different challenges for system design, algorithmic adjustment, and theoretical analysis. In this paper, we develop a system for model-parallelism, STRADS, that provides a programming abstraction for scheduling parameter updates by discovering and leveraging changing structural properties of ML programs. STRADS enables a flexible tradeoff between scheduling efficiency and fidelity to intrinsic dependencies within the models, and improves the memory efficiency of distributed ML. We demonstrate the efficacy of model-parallel algorithms implemented on STRADS versus popular implementations for topic modeling, matrix factorization, and Lasso.
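One concrete instance of scheduling around intrinsic dependencies is to pick, each round, a block of parameters that are mutually weakly dependent, so concurrent updates stay close to the sequential result. The sketch below assumes a pairwise `dependency` measure is available (e.g., feature correlation in Lasso); the function and names are hypothetical, not the STRADS API.

```python
def schedule_block(candidates, dependency, max_block, threshold=0.1):
    """Greedily build a block of mutually weakly-dependent parameters.
    candidates: iterable of parameter indices to consider this round.
    dependency(j, k): strength of coupling between parameters j and k.
    Parameters in the returned block can be updated concurrently."""
    block = []
    for j in candidates:
        if all(dependency(j, k) < threshold for k in block):
            block.append(j)
        if len(block) == max_block:
            break
    return block
```

A scheduler can then dispatch each block to workers for concurrent updates, trading scheduling cost (computing dependencies) against fidelity to the model's dependency structure.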
Petuum: A New Platform for Distributed Machine Learning on Big Data
In IEEE Transactions on Big Data, 2015
Abstract

Cited by 3 (0 self)
How can one build a distributed framework that allows efficient deployment of a wide spectrum of modern advanced machine learning (ML) programs for industrial-scale problems using Big Models (100s of billions of parameters) on Big Data (terabytes or petabytes)? Contemporary parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized operators relying on graphical representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of different ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel ...
Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics
Abstract

Cited by 1 (0 self)
At the core of Machine Learning (ML) analytics is often an expert-suggested model, whose parameters are refined by iteratively processing a training dataset until convergence. The completion time (i.e., convergence time) and quality of the learned model depend not only on the rate at which the refinements are generated but also on the quality of each refinement. While data-parallel ML applications often employ a loose consistency model when updating shared model parameters to maximize parallelism, the accumulated error may seriously impact the quality of refinements and thus delay completion time, a problem that usually gets worse with scale. Although more immediate propagation of updates reduces the accumulated error, this strategy is limited by physical network bandwidth. Additionally, the performance of ...
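The bandwidth-limited propagation tradeoff can be sketched as a sender that spends a fixed per-round budget on the largest accumulated deltas, buffering the rest. This prioritization policy is an assumption for illustration; the paper's actual communication management may differ.

```python
import heapq

def select_updates(pending, budget):
    """pending: dict mapping parameter id -> accumulated delta since the
    last send. Propagate the `budget` largest-magnitude deltas now (they
    contribute most accumulated error if left stale) and keep the rest
    buffered for a later round."""
    chosen = heapq.nlargest(budget, pending.items(), key=lambda kv: abs(kv[1]))
    for pid, _ in chosen:
        del pending[pid]
    return dict(chosen)
```

Under a fixed budget, small deltas are coalesced across rounds while large ones reach other workers quickly, reducing accumulated error without exceeding the available bandwidth.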
Large-scale Distributed Dependent Nonparametric Trees
Abstract
Practical applications of Bayesian nonparametric (BNP) models have been limited due to their high computational complexity and poor scaling on large data. In this paper, we consider dependent nonparametric trees (DNTs), a powerful infinite model that captures time-evolving hierarchies, and develop a large-scale distributed training system. Our major contributions include: (1) an effective memoized variational inference for DNTs, with a novel birth-merge strategy for exploring the unbounded tree space; (2) a model-parallel scheme for concurrent tree growing/pruning and efficient model alignment, through conflict-free model partitioning and lightweight synchronization; (3) a data-parallel scheme for variational parameter updates that allows distributed processing of massive data. Using 64 cores in 36 hours, our system learns a 10K-node DNT topic model on 8M documents that captures both high-frequency and long-tail topics. Our data and model scales are orders of magnitude larger than recent results on the hierarchical Dirichlet process, and the near-linear scalability indicates great potential for even bigger problem sizes.
Distributed Machine Learning via Sufficient Factor Broadcasting
Abstract
Matrix-parametrized models, including multiclass logistic regression and sparse coding, are used in machine learning (ML) applications ranging from computer vision to computational biology. When these models are applied to large-scale ML problems starting at millions of samples and tens of thousands of classes, their parameter matrix can grow at an unexpected rate, resulting in high parameter synchronization costs that greatly slow down distributed learning. To address this issue, we propose a Sufficient Factor Broadcasting (SFB) computation model for efficient distributed learning of a large family of matrix-parametrized models, which share the following property: the parameter update computed on each data sample is a rank-1 matrix, i.e., the outer product of two “sufficient factors” (SFs). By broadcasting the SFs among worker machines and reconstructing the update matrices locally at each worker, SFB improves communication efficiency — communication costs are linear in the parameter matrix’s dimensions, rather than quadratic — without affecting computational correctness. We present a theoretical convergence analysis of SFB and empirically corroborate its efficiency on four different matrix-parametrized ML models.
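The rank-1 property can be sketched for multiclass logistic regression, where the per-sample gradient on the weight matrix factors into an outer product of an error vector over classes and the feature vector. The function names below are illustrative, not the paper's API.

```python
import numpy as np

def sufficient_factors(x, y_true, W):
    """For multiclass logistic regression, the per-sample gradient of the
    cross-entropy loss w.r.t. W factors as outer(u, x): u is the
    softmax-error vector over classes, x the feature vector. Sending
    (u, x) costs J + D floats instead of the J x D full gradient."""
    logits = W @ x
    p = np.exp(logits - logits.max())   # stable softmax
    p /= p.sum()
    u = p.copy()
    u[y_true] -= 1.0                    # dL/dlogits
    return u, x

def apply_sfb(W, factors, lr=0.1):
    """Reconstruct each broadcast rank-1 update locally and apply it."""
    for u, v in factors:
        W -= lr * np.outer(u, v)
    return W
```

Broadcasting the two factors and reconstructing the update at each worker is what makes the communication cost linear in the matrix's dimensions (J + D per sample) rather than quadratic (J × D).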