Results 1–10 of 25
A Fast Parallel SGD for Matrix Factorization in Shared Memory Systems
Abstract

Cited by 11 (1 self)
Matrix factorization is known to be an effective method for recommender systems that are given only the ratings from users to items. Currently, stochastic gradient descent (SGD) is one of the most popular algorithms for matrix factorization. However, as a sequential approach, SGD is difficult to parallelize for handling web-scale problems. In this paper, we develop a fast parallel SGD method, FPSGD, for shared memory systems. By dramatically reducing the cache-miss rate and carefully addressing the load balance of threads, FPSGD is more efficient than state-of-the-art parallel algorithms for matrix factorization.
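The sequential baseline that FPSGD parallelizes can be sketched in a few lines. This is a generic SGD matrix-factorization loop, not the authors' implementation; the function name and hyperparameters are illustrative:

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=8, lr=0.05, reg=0.01, epochs=20, seed=0):
    """Plain sequential SGD for matrix factorization: r_ui ~ p_u . q_i.

    `ratings` is a list of (user, item, rating) triples. FPSGD parallelizes
    this loop by splitting the rating matrix into blocks so that concurrent
    threads never touch the same row of P or column of Q.
    """
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            pu = P[u].copy()                      # keep the old value for Q's update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q
```

The block partitioning is the interesting part of FPSGD: because two threads working on disjoint blocks update disjoint rows of P and columns of Q, no locking is needed, and block-local access patterns are what keep the cache-miss rate low.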
PASSCoDe: Parallel ASynchronous Stochastic dual Coordinate Descent
Machine Learning, 2011
Abstract

Cited by 6 (0 self)
Stochastic Dual Coordinate Descent (DCD) is one of the most efficient ways to solve the family of ℓ2-regularized empirical risk minimization problems, including linear SVM, logistic regression, and many others. The vanilla implementation of DCD is quite slow; however, by maintaining primal variables while updating dual variables, the time complexity of DCD can be significantly reduced. Such a strategy forms the core algorithm in the widely used LIBLINEAR package. In this paper, we parallelize the DCD algorithms in LIBLINEAR. In recent research, several synchronized parallel DCD algorithms have been proposed; however, they fail to achieve good speedup in the shared memory multi-core setting. In this paper, we propose a family of parallel asynchronous stochastic dual coordinate descent algorithms (PASSCoDe). Each thread repeatedly selects a random dual variable and conducts coordinate updates using the primal variables that are stored in the shared memory. We analyze the convergence properties of DCD when different locking/atomic mechanisms are applied. For implementation with atomic operations, we show linear convergence under mild conditions. For implementation without any atomic operations or locking, we present a novel error analysis for PASSCoDe under the multi-core environment, showing that the converged solution is the exact solution for a primal problem with a perturbed regularizer. Experimental results show that our methods are much faster than previous parallel coordinate descent solvers.
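The primal-maintenance trick the abstract refers to can be sketched with the sequential LIBLINEAR-style DCD for the L1-loss linear SVM (this is the baseline being parallelized, not PASSCoDe itself; parameter names are illustrative):

```python
import numpy as np

def dcd_linear_svm(X, y, C=1.0, epochs=50, seed=0):
    """Dual coordinate descent for the L1-loss linear SVM dual:
        min_a  0.5 a'Qa - e'a,  0 <= a_i <= C,  Q_ij = y_i y_j x_i'x_j.

    The key trick: maintain the primal vector w = sum_i a_i y_i x_i, so each
    coordinate update costs O(nnz(x_i)) instead of touching all samples.
    """
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    Qii = np.einsum('ij,ij->i', X, X)            # squared row norms
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(n):
            if Qii[i] == 0:
                continue
            grad = y[i] * (w @ X[i]) - 1.0       # dual gradient w.r.t. alpha_i
            new = min(max(alpha[i] - grad / Qii[i], 0.0), C)
            w += (new - alpha[i]) * y[i] * X[i]  # incremental primal update
            alpha[i] = new
    return w
```

PASSCoDe runs this inner loop from many threads at once with w in shared memory; the paper's analysis concerns what happens when the `w +=` step is performed with locks, with atomics, or with no synchronization at all.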
Highperformance distributed ml at scale through parameter server consistency models
In AAAI, 2015
Abstract

Cited by 6 (6 self)
As Machine Learning (ML) applications embrace greater data size and model complexity, practitioners turn to distributed clusters to satisfy the increased computational and memory demands. Effective use of clusters for ML programs requires considerable expertise in writing distributed code, but existing highly abstracted frameworks like Hadoop that pose low barriers to distributed programming have not, in practice, matched the performance seen in highly specialized and advanced ML implementations. The recent Parameter Server (PS) paradigm is a middle ground between these extremes, allowing easy conversion of single-machine parallel ML programs into distributed ones, while maintaining high throughput through relaxed “consistency models” that allow asynchronous (and, hence, inconsistent) parameter reads. However, due to insufficient theoretical study, it is not clear which of these consistency models can really ensure correct ML algorithm output; at the same time, there remain many theoretically motivated but undiscovered opportunities to maximize computational throughput. Inspired by this challenge, we study both the theoretical guarantees and empirical behavior of iterative-convergent ML algorithms in existing PS consistency models. We then use the gleaned insights to improve a consistency model using an “eager” PS communication mechanism, and implement it as a new PS system that enables ML programs to reach their solution more quickly.
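One widely studied relaxed consistency model in this line of work, Stale Synchronous Parallel (SSP), can be sketched as a bounded-staleness clock: a worker may advance to iteration t only if the slowest worker has reached at least t minus a staleness bound. This is an illustrative toy, not the authors' parameter server:

```python
import threading

class SSPClock:
    """Minimal SSP clock: tick() blocks a worker that runs more than
    `staleness` iterations ahead of the slowest worker."""
    def __init__(self, n_workers, staleness):
        self.clocks = [0] * n_workers
        self.staleness = staleness
        self.cv = threading.Condition()

    def tick(self, worker_id):
        with self.cv:
            self.clocks[worker_id] += 1
            self.cv.notify_all()
            # Block while we are too far ahead of the slowest worker.
            while self.clocks[worker_id] - min(self.clocks) > self.staleness:
                self.cv.wait()

def run(n_workers=4, iters=20, staleness=2):
    clock = SSPClock(n_workers, staleness)
    def worker(wid):
        for _ in range(iters):
            # ... read (possibly stale) parameters, compute and push update ...
            clock.tick(wid)
    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return clock.clocks
```

The slowest worker never blocks, so progress is always possible; the staleness bound is what makes convergence analysis of the asynchronous reads tractable.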
Role Discovery in Networks
, 2014
Abstract

Cited by 4 (2 self)
Roles represent node-level connectivity patterns such as star-center, star-edge nodes, near-cliques, or nodes that act as bridges to different regions of the graph. Intuitively, two nodes belong to the same role if they are structurally similar. Roles have been mainly of interest to sociologists, but more recently, roles have become increasingly useful in other domains. Traditionally, the notion of roles was defined based on graph equivalences such as structural, regular, and stochastic equivalences. We briefly revisit the notions and instead propose a more general formulation of roles based on the similarity of a feature representation (in contrast to the graph representation). This leads us to propose a taxonomy of two general classes of techniques for discovering roles which includes (i) graph-based roles and (ii) feature-based roles. This survey focuses primarily on feature-based roles. In particular, we also introduce a flexible framework for discovering roles using the notion of structural similarity on a feature-based representation. The framework consists of two fundamental components: (1) role feature construction and (2) role assignment using the learned feature representation. We discuss the relevant decisions for discovering feature-based roles and highlight the advantages and disadvantages of the many techniques that can be used for this purpose. Finally, we discuss potential applications and future directions and challenges.
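A minimal sketch of the two-component feature-based pipeline: construct structural features per node, then assign roles by grouping nodes with similar feature vectors. The particular features and the tiny deterministic k-means below are illustrative choices, not a specific method from the survey:

```python
import numpy as np

def structural_features(adj):
    """Per-node features: degree, mean neighbor degree, egonet edge count."""
    nodes = sorted(adj)
    feats = []
    for u in nodes:
        deg = len(adj[u])
        nbr_deg = np.mean([len(adj[v]) for v in adj[u]]) if deg else 0.0
        ego = set(adj[u]) | {u}
        ego_edges = sum(1 for a in ego for b in adj[a] if b in ego and a < b)
        feats.append([deg, nbr_deg, ego_edges])
    return nodes, np.array(feats, dtype=float)

def assign_roles(feats, k=2, iters=20):
    """Tiny deterministic k-means: nodes with similar structural features
    land in the same role, regardless of where they sit in the graph."""
    cents = [feats[0]]                       # farthest-point initialization
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(feats - c, axis=1) for c in cents], axis=0)
        cents.append(feats[int(np.argmax(d))])
    cents = np.array(cents)
    for _ in range(iters):
        lab = np.argmin([np.linalg.norm(feats - c, axis=1) for c in cents], axis=0)
        for j in range(k):
            if np.any(lab == j):
                cents[j] = feats[lab == j].mean(axis=0)
    return lab
```

On two disjoint star graphs, the two hubs get the same role and all leaves get the other role, even though the stars share no nodes; that is the sense in which roles differ from communities.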
On model parallelization and scheduling strategies for distributed machine learning
In NIPS, 2014
Abstract

Cited by 4 (4 self)
Distributed machine learning has typically been approached from a data-parallel perspective, where big data are partitioned to multiple workers and an algorithm is executed concurrently over different data subsets under various synchronization schemes to ensure speedup and/or correctness. A sibling problem that has received relatively less attention is how to ensure efficient and correct model-parallel execution of ML algorithms, where parameters of an ML program are partitioned to different workers and undergo concurrent iterative updates. We argue that model and data parallelism impose rather different challenges for system design, algorithmic adjustment, and theoretical analysis. In this paper, we develop a system for model-parallelism, STRADS, that provides a programming abstraction for scheduling parameter updates by discovering and leveraging changing structural properties of ML programs. STRADS enables a flexible trade-off between scheduling efficiency and fidelity to intrinsic dependencies within the models, and improves memory efficiency of distributed ML. We demonstrate the efficacy of model-parallel algorithms implemented on STRADS versus popular implementations for topic modeling, matrix factorization, and Lasso.
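The scheduling idea can be illustrated on Lasso: a scheduler picks blocks of weakly correlated coordinates, which are then nearly safe to update concurrently against a shared (stale) residual. This is a simplified single-process sketch under that assumption, not STRADS itself; the threshold and block size are made up for illustration:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator for the L1 penalty."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def scheduled_lasso(X, y, lam=0.1, rounds=50, max_block=4, tau=0.3, seed=0):
    """Dependency-aware 'model-parallel' coordinate descent sketch for Lasso:
    min_w 0.5||y - Xw||^2 + lam*n*||w||_1. Coordinates in a block all use the
    same residual, mimicking concurrent updates; this is safe-ish when their
    features are nearly uncorrelated (pairwise |corr| < tau)."""
    n, d = X.shape
    w = np.zeros(d)
    col_norm = (X ** 2).sum(axis=0)
    Xn = X / np.linalg.norm(X, axis=0)
    corr = np.abs(Xn.T @ Xn)
    rng = np.random.default_rng(seed)
    for _ in range(rounds):
        # Scheduler: greedily grow a block of weakly correlated coordinates.
        block = []
        for j in rng.permutation(d):
            if len(block) == max_block:
                break
            if all(corr[j, k] < tau for k in block):
                block.append(j)
        r = y - X @ w
        # "Parallel" update: every coordinate in the block reads the same r.
        for j in block:
            rho = X[:, j] @ r + col_norm[j] * w[j]
            w[j] = soft(rho, lam * n) / col_norm[j]
    return w
```

When two coordinates are strongly correlated, updating them from the same stale residual makes both overshoot; keeping such pairs out of the same block is exactly the dependency structure a model-parallel scheduler must respect.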
Petuum: A New Platform for Distributed Machine Learning on Big Data
IEEE Transactions on Big Data, 2015
Abstract

Cited by 3 (0 self)
How can one build a distributed framework that allows efficient deployment of a wide spectrum of modern advanced machine learning (ML) programs for industrial-scale problems using Big Models (100s of billions of parameters) on Big Data (terabytes or petabytes)? Contemporary parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized operators relying on graphical representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of different ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel
Primitives for dynamic big model parallelism
In Advances in Neural Information Processing Systems (NIPS), 2014
Abstract

Cited by 3 (2 self)
When training large machine learning models with many variables or parameters, a single machine is often inadequate since the model may be too large to fit in memory, while training can take a long time even with stochastic updates. A natural recourse is to turn to distributed cluster computing, in order to harness additional memory and processors. However, naive, unstructured parallelization of ML algorithms can make inefficient use of distributed memory, while failing to obtain proportional convergence speedups — or can even result in divergence. We develop a framework of primitives for dynamic model-parallelism, STRADS, in order to explore partitioning and update scheduling of model variables in distributed ML algorithms — thus improving their memory efficiency while presenting new opportunities to speed up convergence without compromising inference correctness. We demonstrate the efficacy of model-parallel algorithms implemented in STRADS versus popular implementations for Topic Modeling, Matrix Factorization and Lasso.
Polynomial networks and factorization machines: New insights and efficient training algorithms.
In Proceedings of International Conference on Machine Learning (ICML), 2016
Abstract

Cited by 1 (1 self)
Polynomial networks and factorization machines are two recently proposed models that can efficiently use feature interactions in classification and regression tasks. In this paper, we revisit both models from a unified perspective. Based on this new view, we study the properties of both models and propose new efficient training algorithms. Key to our approach is to cast parameter learning as a low-rank symmetric tensor estimation problem, which we solve by multi-convex optimization. We demonstrate our approach on regression and recommender system tasks.
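The factorization-machine side of this model family is easy to make concrete: the pairwise-interaction term looks quadratic in the number of features, but a standard algebraic identity collapses it to linear time in the nonzeros:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization machine prediction in O(d*k), using the identity
        sum_{i<j} <v_i, v_j> x_i x_j
          = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ].
    V has one k-dimensional factor row per feature.
    """
    s = V.T @ x                                  # (k,) per-factor weighted sums
    pair = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    return w0 + w @ x + pair
```

Because every pairwise weight is <v_i, v_j>, the interaction matrix is low-rank symmetric by construction, which is the tensor-estimation view the abstract generalizes.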
Scalable Exemplar Clustering and Facility Location via Augmented Block Coordinate Descent with Column Generation
Abstract

Cited by 1 (1 self)
In recent years exemplar clustering has become a popular tool for applications in document and video summarization, active learning, and clustering with general similarity, where cluster centroids are required to be a subset of the data samples rather than their linear combinations. The problem is also well known as facility location in the operations research literature. While the problem has a well-developed convex relaxation with approximation and recovery guarantees, its number of variables grows quadratically with the number of samples. Therefore, state-of-the-art methods can hardly handle more than 10^4 samples (i.e., 10^8 variables). In this work, we propose an Augmented-Lagrangian with Block Coordinate Descent (AL-BCD) algorithm that utilizes problem structure to obtain a closed-form solution for each block subproblem, and exploits low-rank representation of the dissimilarity matrix to search active columns without computing the entire matrix. Experiments show our approach to be orders of magnitude faster than existing approaches and can handle problems of up to 10^6 samples. We also demonstrate successful applications of the algorithm on world-scale facility location, document summarization and active learning.
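The underlying objective is simple to state: choose a set S of exemplars minimizing the total dissimilarity of each sample to its nearest exemplar plus an opening cost per exemplar. The sketch below is a plain greedy baseline for that facility-location objective, not the paper's convex AL-BCD algorithm; it shows why the variable count is quadratic (the full n-by-n dissimilarity matrix D):

```python
import numpy as np

def greedy_facility_location(D, open_cost, max_exemplars=None):
    """Greedy baseline for  min_S  sum_i min_{j in S} D[i, j] + open_cost*|S|.
    Repeatedly open the exemplar whose marginal cost reduction is largest;
    stop when no exemplar pays for its opening cost."""
    n = D.shape[0]
    chosen = []
    best = np.full(n, np.inf)        # current cost of serving each sample
    while max_exemplars is None or len(chosen) < max_exemplars:
        if not chosen:
            j = int(np.argmin(D.sum(axis=0)))   # first pick: 1-medoid
        else:
            gains = np.maximum(best[:, None] - D, 0.0).sum(axis=0) - open_cost
            j = int(np.argmax(gains))
            if gains[j] <= 0:
                break
            
        chosen.append(j)
        best = np.minimum(best, D[:, j])
    return chosen, best.sum() + open_cost * len(chosen)
```

Even this baseline touches all of D each step; the paper's contribution is avoiding that by exploiting a low-rank representation of the dissimilarity matrix and generating only the active columns.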