Results 1 - 10 of 86
Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing
, 2011
"... We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform inmemory computations on large clusters in a faulttolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algo ..."
Abstract

Cited by 239 (27 self)
 Add to MetaCart
(Show Context)
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
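As a concrete illustration of the coarse-grained transformation model, the sketch below mirrors the log-mining example from the paper in PySpark. It is a minimal sketch, not the paper's own code; the "hdfs://logs" path and the local master setting are hypothetical.

    # A minimal PySpark-style sketch of coarse-grained RDD transformations
    # (the log-mining pattern; "hdfs://logs" is a hypothetical path).
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-sketch")

    lines = sc.textFile("hdfs://logs")                      # base RDD from stable storage
    errors = lines.filter(lambda l: l.startswith("ERROR"))  # coarse-grained transformation
    errors.persist()                                        # keep the derived RDD in memory

    # Actions force evaluation; a lost partition is rebuilt from lineage,
    # not restored from a replica or a log of fine-grained updates.
    print(errors.count())
    print(errors.filter(lambda l: "HDFS" in l).count())

Because errors is defined by its lineage (textFile followed by filter), a lost partition can be recomputed from the source data, which is what makes the restricted, coarse-grained model cheap to make fault-tolerant.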
Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
, 2012
"... While highlevel data parallel frameworks, like MapReduce, simplify the design and implementation of largescale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill ..."
Abstract

Cited by 141 (2 self)
 Add to MetaCart
While high-level data-parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction, which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees. We develop graph-based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.
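The snapshot mechanism referenced above is the classic Chandy-Lamport algorithm; the toy Python sketch below shows its marker discipline on FIFO channels. All names (Channel, Node) are hypothetical, and this is a single-process illustration rather than GraphLab's implementation.

    # Toy Chandy-Lamport snapshot: markers delimit per-channel state.
    from collections import deque

    class Channel:
        """Hypothetical FIFO channel."""
        def __init__(self):
            self.q = deque()
        def send(self, msg):
            self.q.append(msg)

    class Node:
        def __init__(self, name):
            self.name, self.state = name, 0
            self.snapshot = None     # recorded local state, None until snapshot starts
            self.recording = {}      # incoming channel -> messages captured for the snapshot

        def start_snapshot(self, in_channels, out_channels):
            self.snapshot = self.state              # 1. record local state
            for ch in out_channels:                 # 2. send a marker on every outgoing channel
                ch.send(("MARKER", self.name))
            for ch in in_channels:                  # 3. start recording every incoming channel
                self.recording[ch] = []

        def on_message(self, ch, msg, in_channels, out_channels):
            if msg[0] == "MARKER":
                if self.snapshot is None:           # first marker seen: snapshot now
                    self.start_snapshot(in_channels, out_channels)
                self.recording.pop(ch, None)        # channel state for ch is complete
            else:
                self.state += msg[1]                # ordinary application message
                if ch in self.recording:            # in flight when the snapshot began
                    self.recording[ch].append(msg)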
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
"... Largescale graphstructured computation is central to tasks ranging from targeted advertising to natural language processing and has led to the development of several graphparallel abstractions including Pregel and GraphLab. However, the natural graphs commonly found in the realworld have highly ..."
Abstract

Cited by 128 (4 self)
 Add to MetaCart
(Show Context)
Large-scale graph-structured computation is central to tasks ranging from targeted advertising to natural language processing and has led to the development of several graph-parallel abstractions, including Pregel and GraphLab. However, the natural graphs commonly found in the real world have highly skewed power-law degree distributions, which challenge the assumptions made by these abstractions, limiting performance and scalability. In this paper, we characterize the challenges of computation on natural graphs in the context of existing graph-parallel abstractions. We then introduce the PowerGraph abstraction, which exploits the internal structure of graph programs to address these challenges. Leveraging the PowerGraph abstraction, we introduce a new approach to distributed graph placement and representation that exploits the structure of power-law graphs. We provide a detailed analysis and experimental evaluation comparing PowerGraph to two popular graph-parallel systems. Finally, we describe three different implementation strategies for PowerGraph and discuss their relative merits with empirical evaluations on large-scale real-world problems demonstrating order-of-magnitude gains.
GraphChi: Large-Scale Graph Computation on Just a PC
In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI '12)
, 2012
"... Current systems for graph computation require a distributed computing cluster to handle very large realworld problems, such as analysis on social networks or the web graph. While distributed computational resources have become more accessible, developing distributed graph algorithms still remains c ..."
Abstract

Cited by 115 (6 self)
 Add to MetaCart
(Show Context)
Current systems for graph computation require a distributed computing cluster to handle very large real-world problems, such as analysis of social networks or the web graph. While distributed computational resources have become more accessible, developing distributed graph algorithms still remains challenging, especially to non-experts. In this work, we present GraphChi, a disk-based system for computing efficiently on graphs with billions of edges. By using a well-known method to break large graphs into small parts, and a novel parallel sliding windows method, GraphChi is able to execute several advanced data mining, graph mining, and machine learning algorithms on very large graphs, using just a single consumer-level computer. We further extend GraphChi to support graphs that evolve over time, and demonstrate that, on a single computer, GraphChi can process over one hundred thousand graph updates per second, while simultaneously performing computation. We show, through experiments and theoretical analysis, that GraphChi performs well on both SSDs and rotational hard drives. By repeating experiments reported for existing distributed systems, we show that, with only a fraction of the resources, GraphChi can solve the same problems in very reasonable time. Our work makes large-scale graph computation available to anyone with a modern PC.
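For intuition about the parallel sliding windows method, the toy sketch below shards edges by destination interval and sorts each shard by source, which is the layout that lets the real system stream each interval's in-edges sequentially while reading its out-edges from a moving window of every other shard. This is a loose single-pass illustration under assumed data layout, not GraphChi's implementation; all names are hypothetical.

    # Toy one-pass sketch of the parallel-sliding-windows layout.
    from itertools import groupby

    def psw_pass(edges, num_intervals, update):
        """edges: list of (src, dst, value) with integer vertex ids;
        update: user-defined vertex update function."""
        interval = lambda v: v % num_intervals      # hypothetical interval assignment
        # Shard edges by destination interval, each shard sorted by source;
        # the source order is what makes out-edge reads a sliding window.
        shards = [sorted((e for e in edges if interval(e[1]) == p),
                         key=lambda e: e[0])
                  for p in range(num_intervals)]
        for p in range(num_intervals):
            # Interval p: its in-edges are exactly shard p, read sequentially.
            by_dst = sorted(shards[p], key=lambda e: e[1])
            for dst, es in groupby(by_dst, key=lambda e: e[1]):
                update(dst, list(es))               # apply the vertex update

    # Example: psw_pass([(0, 1, 1.0), (1, 2, 1.0)], 2,
    #                   lambda v, es: print(v, es))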
CIEL: a universal execution engine for distributed dataflow computing
In Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI). USENIX
"... This paper introduces CIEL, a universal execution engine for distributed dataflow programs. Like previous execution engines, CIEL masks the complexity of distributed programming. Unlike those systems, a CIEL job can make datadependent controlflow decisions, which enables it to compute iterative a ..."
Abstract

Cited by 78 (11 self)
 Add to MetaCart
(Show Context)
This paper introduces CIEL, a universal execution engine for distributed dataflow programs. Like previous execution engines, CIEL masks the complexity of distributed programming. Unlike those systems, a CIEL job can make data-dependent control-flow decisions, which enables it to compute iterative and recursive algorithms. We have also developed Skywriting, a Turing-complete scripting language that runs directly on CIEL. The execution engine provides transparent fault tolerance and distribution to Skywriting scripts and high-performance code written in other programming languages. We have deployed CIEL on a cloud computing platform, and demonstrate that it achieves scalable performance for both iterative and non-iterative algorithms.
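The sketch below illustrates what data-dependent control flow means in this setting: the number of tasks is decided by task results at runtime, something a statically fixed dataflow DAG cannot express. It is Skywriting-flavored Python, not actual CIEL code; spawn is a hypothetical stand-in for CIEL task creation and here simply runs locally.

    # Data-dependent control flow: iterate by spawning tasks until convergence.
    def spawn(f, *args):          # hypothetical stand-in for a CIEL task
        return f(*args)

    def step(x):                  # one iteration of a numeric fixed point
        return (x + 2.0 / x) / 2.0

    def iterate_until_converged(x, eps=1e-9):
        while True:               # the loop bound depends on task *results*
            nxt = spawn(step, x)
            if abs(nxt - x) < eps:
                return nxt
            x = nxt

    print(iterate_until_converged(1.0))   # converges to sqrt(2)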
PACMan: Coordinated Memory Caching for Parallel Jobs
"... Abstract – Dataintensive analytics on large clusters is important for modern Internet services. As machines in these clusters have large memories, inmemory caching of inputs is an effective way to speed up these analytics jobs. The key challenge, however, is that these jobs run multiple tasks in p ..."
Abstract

Cited by 54 (12 self)
 Add to MetaCart
(Show Context)
Data-intensive analytics on large clusters is important for modern Internet services. As machines in these clusters have large memories, in-memory caching of inputs is an effective way to speed up these analytics jobs. The key challenge, however, is that these jobs run multiple tasks in parallel and a job is sped up only when the inputs of all such parallel tasks are cached. Indeed, a single task whose input is not cached can slow down the entire job. To meet this "all-or-nothing" property, we have built PACMan, a caching service that coordinates access to the distributed caches. This coordination is essential to improve job completion times and cluster efficiency. To this end, we have implemented two cache replacement policies on top of PACMan's coordinated infrastructure: LIFE, which minimizes average completion time by evicting large incomplete inputs, and LFU-F, which maximizes cluster efficiency by evicting less frequently accessed inputs. Evaluations on production workloads from Facebook and Microsoft Bing show that PACMan reduces the average completion time of jobs by 53% and 51% (small interactive jobs improve by 77%), and improves cluster efficiency by 47% and 54%, respectively.
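For intuition, the toy sketch below implements the core of a LIFE-style eviction decision: under cache pressure, evict from the largest incomplete input, since a partially cached input cannot deliver the all-or-nothing speedup anyway. The data layout is a hypothetical simplification, not PACMan's actual interface.

    # Toy LIFE-style victim selection: prefer the largest incomplete input.
    def choose_victim(files):
        """files: dict name -> {'size': int, 'cached': int}  (in blocks)."""
        incomplete = {f: m for f, m in files.items() if m['cached'] < m['size']}
        pool = incomplete or files            # fall back to fully cached inputs
        return max(pool, key=lambda f: pool[f]['size'])

    files = {'A': {'size': 100, 'cached': 100},
             'B': {'size': 80,  'cached': 20},
             'C': {'size': 10,  'cached': 5}}
    print(choose_victim(files))   # 'B': large and incomplete, so evicting it
                                  # costs the least job-level speedup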
iMapReduce: A Distributed Computing Framework for Iterative Computation
, 2012
"... Iterative computation is pervasive in many applications such as data mining, web ranking, graph analysis, online social network analysis, and so on. These iterative applications typically involve massive data sets containing millions or billions of data records. This poses demand of distributed co ..."
Abstract

Cited by 34 (10 self)
 Add to MetaCart
Iterative computation is pervasive in many applications such as data mining, web ranking, graph analysis, online social network analysis, and so on. These iterative applications typically involve massive data sets containing millions or billions of data records. This creates demand for distributed computing frameworks that can process massive data sets on a cluster of machines. MapReduce is an example of such a framework. However,
PrIter: a distributed framework for prioritized iterative computations
In Proceedings of the 2nd ACM Symposium on Cloud Computing (SoCC '11)
, 2011
"... Iterative computations are pervasive among data analysis applications in the cloud, including Web search, online social network analysis, recommendation systems, and so on. These cloud applications typically involve data sets of massive scale. Fast convergence of the iterative computation on the mas ..."
Abstract

Cited by 33 (10 self)
 Add to MetaCart
(Show Context)
Iterative computations are pervasive among data analysis applications in the cloud, including Web search, online social network analysis, recommendation systems, and so on. These cloud applications typically involve data sets of massive scale, and fast convergence of the iterative computation on such massive data sets is essential for these applications. In this paper, we explore the opportunity for accelerating iterative computations and propose a distributed computing framework, PrIter, which enables fast iterative computation by providing support for prioritized iteration. Instead of performing computations on all data records without discrimination, PrIter prioritizes the computations that help convergence the most, so that the convergence speed of the iterative process is significantly improved. We evaluate PrIter on a local cluster of machines as well as on the Amazon EC2 cloud. The results show that PrIter achieves up to a 50x speedup over Hadoop for a series of iterative algorithms.
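The sketch below illustrates the prioritized-iteration idea on a PageRank-style delta propagation: a priority queue always applies the pending update with the largest effect first, instead of sweeping every record each round. It is a single-machine toy, not PrIter's MapReduce-based implementation; the names and the convergence test are assumptions.

    # Toy prioritized iteration: apply the largest pending delta first.
    import heapq

    def prioritized_deltas(out_nbrs, eps=1e-6, d=0.85):
        """out_nbrs: vertex -> list of out-neighbors (all vertices as keys)."""
        rank = {v: 0.0 for v in out_nbrs}
        delta = {v: 1.0 - d for v in out_nbrs}   # initial mass at every vertex
        heap = [(-delta[v], v) for v in out_nbrs]
        heapq.heapify(heap)
        while heap:
            neg, v = heapq.heappop(heap)
            if -neg != delta[v]:
                continue                          # stale heap entry
            if -neg < eps:
                break                             # largest pending delta is tiny: converged
            rank[v] += delta[v]                   # apply the highest-priority update
            share = d * delta[v] / max(len(out_nbrs[v]), 1)
            delta[v] = 0.0
            for u in out_nbrs[v]:                 # push residual mass downstream
                delta[u] += share
                heapq.heappush(heap, (-delta[u], u))
        return rank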
Kineograph: taking the pulse of a fast-changing and connected world
In Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys '12)
, 2012
"... Kineograph is a distributed system that takes a stream of incoming data to construct a continuously changing graph, which captures the relationships that exist in the data feed. As a computing platform, Kineograph further supports graphmining algorithms to extract timely insights from the fastchan ..."
Abstract

Cited by 30 (3 self)
 Add to MetaCart
(Show Context)
Kineograph is a distributed system that takes a stream of incoming data to construct a continuously changing graph, which captures the relationships that exist in the data feed. As a computing platform, Kineograph further supports graph-mining algorithms to extract timely insights from the fast-changing graph structure. To accommodate graph-mining algorithms that assume a static underlying graph, Kineograph creates a series of consistent snapshots, using a novel and efficient epoch commit protocol. To keep up with continuous updates on the graph, Kineograph includes an incremental graph-computation engine. We have developed three applications on top of Kineograph to analyze Twitter data: user ranking, approximate shortest paths, and controversial topic detection. For these applications, Kineograph takes a live Twitter data feed and maintains a graph of edges between all users and hashtags. Our evaluation shows that with 40 machines processing 100K tweets per second, Kineograph is able to continuously compute global properties, such as user ranks, with less than 2.5-minute timeliness guarantees. This rate of traffic is more than 10 times the reported peak rate of Twitter as of October 2011.
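The toy sketch below shows the shape of an epoch-commit scheme: incoming updates are buffered under an epoch tag and become visible only when that epoch commits, so mining algorithms always see a consistent static snapshot. It is a single-process simplification under assumed data structures, not Kineograph's actual protocol.

    # Toy epoch commit: buffer updates per epoch, expose them atomically.
    from collections import defaultdict

    class EpochGraph:
        def __init__(self):
            self.pending = defaultdict(list)   # epoch -> buffered edge updates
            self.edges = set()
            self.committed = -1

        def ingest(self, epoch, src, dst):
            self.pending[epoch].append((src, dst))   # not yet visible to miners

        def commit(self, epoch):
            """Atomically expose all updates tagged with epochs up to 'epoch'."""
            for e in range(self.committed + 1, epoch + 1):
                self.edges.update(self.pending.pop(e, []))
            self.committed = epoch
            return frozenset(self.edges)             # consistent snapshot for mining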
More effective distributed ML via a stale synchronous parallel parameter server
 In NIPS
, 2013
"... We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy ..."
Abstract

Cited by 30 (16 self)
 Add to MetaCart
(Show Context)
We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easytouse shared interface for read/write access to an ML model’s values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to get them from a central storage. This significantly increases the proportion of time workers spend computing, as opposed to waiting. Furthermore, the SSP model ensures ML algorithm correctness by limiting the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fullysynchronous and asynchronous schemes. 1
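The bounded-staleness rule is easy to state in code: a worker that has run c iterations may keep reading from its local cache only while it is at most s clocks ahead of the slowest worker. The sketch below is a minimal illustration of that rule with hypothetical class and method names, not the paper's server implementation.

    # Minimal sketch of the SSP read rule with staleness bound s.
    class SSPServer:
        def __init__(self, num_workers, staleness):
            self.clocks = [0] * num_workers
            self.s = staleness

        def clock(self, worker):
            self.clocks[worker] += 1      # worker finished one iteration

        def can_read_cached(self, worker):
            """True if this worker's cached values are fresh enough under SSP;
            otherwise it must block until stragglers catch up."""
            return self.clocks[worker] - min(self.clocks) <= self.s

    srv = SSPServer(num_workers=4, staleness=2)
    srv.clock(0); srv.clock(0); srv.clock(0)   # worker 0 races ahead to clock 3
    print(srv.can_read_cached(0))              # False: 3 clocks ahead exceeds s=2

Bounding the lead of the fastest worker is what turns the asynchronous cache reads into a model with provable convergence guarantees, per the abstract.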