Results 1  10
of
126
Large Scale Distributed Deep Networks
"... Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a ..."
Abstract

Cited by 92 (11 self)
 Add to MetaCart
(Show Context)
Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for largescale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of LBFGS. Downpour SGD and Sandblaster LBFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and achieves stateoftheart performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradientbased machine learning algorithm. 1
Accelerated minibatch stochastic dual coordinate ascent. arxiv
"... Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the minibatch setting that is often used in practice. Our main contribution is to introduce an accelerated mini ..."
Abstract

Cited by 26 (1 self)
 Add to MetaCart
Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the minibatch setting that is often used in practice. Our main contribution is to introduce an accelerated minibatch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007]. 1
More effective distributed ML via a stale synchronous parallel parameter server
 In NIPS
, 2013
"... We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy ..."
Abstract

Cited by 26 (15 self)
 Add to MetaCart
(Show Context)
We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easytouse shared interface for read/write access to an ML model’s values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to get them from a central storage. This significantly increases the proportion of time workers spend computing, as opposed to waiting. Furthermore, the SSP model ensures ML algorithm correctness by limiting the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fullysynchronous and asynchronous schemes. 1
WTF: The who to follow service at twitter
"... Wtf (“Who to Follow”) is Twitter’s user recommendation service, which is responsible for creating millions of connections daily between users based on shared interests, common connections, and other related factors. This paper provides an architectural overview and shares lessons we learned in build ..."
Abstract

Cited by 24 (4 self)
 Add to MetaCart
(Show Context)
Wtf (“Who to Follow”) is Twitter’s user recommendation service, which is responsible for creating millions of connections daily between users based on shared interests, common connections, and other related factors. This paper provides an architectural overview and shares lessons we learned in building and running the service over the past few years. Particularly noteworthy was our design decision to process the entire Twitter graph in memory on a single server, which significantly reduced architectural complexity and allowed us to develop and deploy the service in only a few months. At the core of our architecture is Cassovary, an opensource inmemory graph processing engine we built from scratch for Wtf. Besides powering Twitter’s user recommendations, Cassovary is also used for search, discovery, promoted products, and other services as well. We describe and evaluate a few graph recommendation algorithms implemented in Cassovary, including a novel approach based on a combination of random walks and SALSA. Looking into the future, we revisit the design of our architecture and comment on its limitations, which are presently being addressed in a secondgeneration system under development.
From "Think Like a Vertex " to "Think Like a Graph"
"... To meet the challenge of processing rapidly growing graph and network data created by modern applications, a number of distributed graph processing systems have emerged, such as Pregel and GraphLab. All these systems divide input graphs into partitions, and employ a “think like a vertex ” programmin ..."
Abstract

Cited by 24 (0 self)
 Add to MetaCart
(Show Context)
To meet the challenge of processing rapidly growing graph and network data created by modern applications, a number of distributed graph processing systems have emerged, such as Pregel and GraphLab. All these systems divide input graphs into partitions, and employ a “think like a vertex ” programming model to support iterative graph computation. This vertexcentric model is easy to program and has been proved useful for many graph algorithms. However, this model hides the partitioning information from the users, thus prevents many algorithmspecific optimizations. This often results in longer execution time due to excessive network messages (e.g. in Pregel) or heavy scheduling overhead to ensure data consistency (e.g. in GraphLab). To address this limitation, we propose a new “think like a graph ” programming paradigm. Under this graphcentric model, the partition structure is opened up to the users, and can be utilized so that communication within a partition can bypass the heavy message passing or scheduling machinery. We implemented this model in a new system, called Giraph++, based on Apache Giraph, an open source implementation of Pregel. We explore the applicability of the graphcentric model to three categories of graph algorithms, and demonstrate its flexibility and superior performance, especially on wellpartitioned data. For example, on a web graph with 118 million vertices and 855 million edges, the graphcentric version of connected component detection algorithm runs 63X faster and uses 204X fewer network messages than its vertexcentric counterpart. 1.
Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems
"... Abstract—Matrix factorization, when the matrix has missing values, has become one of the leading techniques for recommender systems. To handle webscale datasets with millions of users and billions of ratings, scalability becomes an important issue. Alternating Least Squares (ALS) and Stochastic Gra ..."
Abstract

Cited by 22 (1 self)
 Add to MetaCart
(Show Context)
Abstract—Matrix factorization, when the matrix has missing values, has become one of the leading techniques for recommender systems. To handle webscale datasets with millions of users and billions of ratings, scalability becomes an important issue. Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD) are two popular approaches to compute matrix factorization. There has been a recent flurry of activity to parallelize these algorithms. However, due to the cubic time complexity in the target rank, ALS is not scalable to largescale datasets. On the other hand, SGD conducts efficient updates but usually suffers from slow convergence that is sensitive to the parameters. Coordinate descent, a classical optimization approach, has been used for many other largescale problems, but its application to matrix factorization for recommender systems has not been explored thoroughly. In this paper, we show that coordinate descent based methods have a more efficient update rule compared to ALS, and are faster and have more stable convergence than SGD. We study different update sequences and propose the CCD++ algorithm, which updates rankone factors one by one. In addition, CCD++ can be easily parallelized on both multicore and distributed systems. We empirically show that CCD++ is much faster than ALS and SGD in both settings. As an example, on a synthetic dataset with 2 billion ratings, CCD++ is 4 times faster than both SGD and ALS using a distributed system with 20 machines. KeywordsRecommender systems, Matrix factorization, Low rank approximation, Parallelization.
A Distributed Data Management Using MapReduce
"... MapReduce is a framework for processing and managing large scale data sets in a distributed cluster, which has been used for applications such as generating search indexes, document clustering, access log analysis, and various other forms of data analytics. MapReduce adopts a flexible computation mo ..."
Abstract

Cited by 17 (7 self)
 Add to MetaCart
MapReduce is a framework for processing and managing large scale data sets in a distributed cluster, which has been used for applications such as generating search indexes, document clustering, access log analysis, and various other forms of data analytics. MapReduce adopts a flexible computation model with a simple interface consisting of map and reduce functions whose implementations can be customized by application developers. Since its introduction, a substantial amount of research efforts have been directed towards making it more usable and efficient for supporting databasecentric operations. In this paper we aim to provide a comprehensive review of a wide range of proposals and systems that focusing fundamentally on the support of distributed data management and processing using the MapReduce framework.
Solving the straggler problem with bounded staleness
"... Abstract. Many important applications fall into the broad class of iterative convergent algorithms. Parallel implementations of these algorithms are naturally expressed using the Bulk Synchronous Parallel (BSP) model of computation. However, implementations using BSP are plagued by the straggler pro ..."
Abstract

Cited by 13 (8 self)
 Add to MetaCart
(Show Context)
Abstract. Many important applications fall into the broad class of iterative convergent algorithms. Parallel implementations of these algorithms are naturally expressed using the Bulk Synchronous Parallel (BSP) model of computation. However, implementations using BSP are plagued by the straggler problem, where every transient slowdown of any given thread can delay all other threads. This paper presents the Stale Synchronous Parallel (SSP) model as a generalization of BSP that preserves many of its advantages, while avoiding the straggler problem. Algorithms using SSP can execute efficiently, even with significant delays in some threads, addressing the oftfaced straggler problem. 1
Asynchronous LargeScale Graph Processing Made Easy
"... Scaling large iterative graph processing applications through parallel computing is a very important problem. Several graph processing frameworks have been proposed that insulate developers from lowlevel details of parallel programming. Most of these frameworks are based on the bulk synchronous par ..."
Abstract

Cited by 11 (1 self)
 Add to MetaCart
(Show Context)
Scaling large iterative graph processing applications through parallel computing is a very important problem. Several graph processing frameworks have been proposed that insulate developers from lowlevel details of parallel programming. Most of these frameworks are based on the bulk synchronous parallel (BSP) model in order to simplify application development. However, in the BSP model, vertices are processed in fixed rounds, which often leads to slow convergence. Asynchronous executions can significantly accelerate convergence by intelligently ordering vertex updates and incorporating the most recent updates. Unfortunately, asynchronous models do not provide the programming simplicity and scalability advantages of the BSP model. In this paper, we combine the easy programmability of the BSP model with the high performance of asynchronous execution. We have designed GRACE, a new graph programming platform that separates application logic from execution policies. GRACE provides a synchronous iterative graph programming model for users to easily implement, test, and debug their applications. It also contains a carefully designed and implemented parallel execution engine for both synchronous and userspecified builtin asynchronous execution policies. Our experiments show that asynchronous execution in GRACE can yield convergence rates comparable to fully asynchronous executions, while still achieving the nearlinear scalability of a synchronous BSP system. 1.
Blazes: Coordination Analysis for Distributed Programs
"... All rights reserved. ..."
(Show Context)