Results 1–10 of 34
GPS: A Graph Processing System
Abstract

Cited by 68 (3 self)
GPS (for Graph Processing System) is a complete open-source system we developed for scalable, fault-tolerant, and easy-to-program execution of algorithms on extremely large graphs. GPS is similar to Google’s proprietary Pregel system [MAB+11], with some useful additional functionality described in the paper. In distributed graph processing systems like GPS and Pregel, graph partitioning is the problem of deciding which vertices of the graph are assigned to which compute nodes. In addition to presenting the GPS system itself, we describe how we have used GPS to study the effects of different graph partitioning schemes. We present our experiments on the performance of GPS under different static partitioning schemes—assigning vertices to workers “intelligently” before the computation starts—and with GPS’s dynamic repartitioning feature, which reassigns vertices to different compute nodes during the computation by observing their message sending patterns.
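The partitioning problem the abstract describes can be illustrated with a toy sketch (plain Python; all names here are invented for illustration, not GPS's actual API). It compares the edge cut of default hash partitioning against a hand-picked "intelligent" assignment:

```python
# A minimal sketch of static graph partitioning: assign vertices to workers,
# then measure the "edge cut" -- edges whose endpoints land on different workers.

def hash_partition(vertices, num_workers):
    """Assign each vertex id to a worker by hashing (the usual default scheme)."""
    return {v: v % num_workers for v in vertices}

def edge_cut(edges, assignment):
    """Count edges crossing worker boundaries; a lower cut means fewer messages."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

# A tiny 6-vertex graph: two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
vertices = range(6)

hashed = hash_partition(vertices, 2)                 # even vs. odd vertex ids
grouped = {v: 0 if v < 3 else 1 for v in vertices}   # one triangle per worker

print(edge_cut(edges, hashed))   # hash partitioning cuts many edges
print(edge_cut(edges, grouped))  # grouping by community cuts only the bridge
```

Because each cut edge implies cross-worker messages every superstep, a lower edge cut directly reduces the network traffic the abstract's "message sending patterns" refer to.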
PrIter: a distributed framework for prioritized iterative computations
In: Proceedings of the 2nd ACM Symposium on Cloud Computing (SoCC ’11), 2011
Abstract

Cited by 33 (10 self)
Iterative computations are pervasive among data analysis applications in the cloud, including Web search, online social network analysis, recommendation systems, and so on. These cloud applications typically involve data sets of massive scale, and fast convergence of the iterative computation on such data sets is essential. In this paper, we explore the opportunity for accelerating iterative computations and propose a distributed computing framework, PrIter, which enables fast iterative computation by supporting prioritized iteration. Instead of performing computations on all data records without discrimination, PrIter prioritizes the computations that help convergence the most, so that the convergence speed of the iterative process is significantly improved. We evaluate PrIter on a local cluster of machines as well as on Amazon EC2 Cloud. The results show that PrIter achieves up to 50x speedup over Hadoop for a series of iterative algorithms.
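The prioritized-iteration idea can be sketched in a few lines (a toy, single-process illustration; the function and its parameters are invented, not PrIter's API). For a PageRank-style computation, the vertex holding the largest pending "delta" is always processed first, so large contributions spread before negligible ones:

```python
import heapq

# Toy prioritized PageRank: a priority queue always yields the vertex with the
# largest pending residual mass, which is absorbed locally and spread to
# neighbours; tiny residuals are simply dropped.

def prioritized_pagerank(links, damping=0.85, tol=1e-6):
    n = len(links)
    rank = {v: 0.0 for v in links}
    delta = {v: (1 - damping) / n for v in links}   # residual mass per vertex
    heap = [(-d, v) for v, d in delta.items()]
    heapq.heapify(heap)
    updates = 0
    while heap:
        _, v = heapq.heappop(heap)                  # vertex with largest delta
        d, delta[v] = delta[v], 0.0
        if d < tol:                                 # stale or negligible entry
            continue
        rank[v] += d                                # absorb the delta locally
        for w in links[v]:                          # then spread it onwards
            delta[w] += damping * d / len(links[v])
            heapq.heappush(heap, (-delta[w], w))
        updates += 1
    return rank, updates

ranks, updates = prioritized_pagerank({0: [1], 1: [2], 2: [0]})
print(round(sum(ranks.values()), 3))   # total rank mass converges to ~1.0
```

Stale heap entries are handled lazily: a popped vertex whose delta was already consumed costs one comparison and is skipped, a standard pattern when `heapq` entries cannot be updated in place.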
Differential dataflow
Abstract

Cited by 24 (1 self)
Existing computational models for processing continuously changing input data are unable to efficiently support iterative queries except in limited special cases. This makes it difficult to perform complex tasks, such as social-graph analysis on changing data at interactive timescales, which would greatly benefit those analyzing the behavior of services like Twitter. In this paper we introduce a new model called differential computation, which extends traditional incremental computation to allow arbitrarily nested iteration, and explain—with reference to a publicly available prototype system called Naiad—how differential computation can be efficiently implemented in the context of a declarative data-parallel dataflow language. The resulting system makes it easy to program previously intractable algorithms such as incrementally updated strongly connected components, and integrate them with data transformation operations to obtain practically relevant insights from real data streams.
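Full differential computation is beyond a snippet, but the plain incremental computation it generalizes can be sketched (an invented toy, not the paper's model): a result is kept up to date from input *deltas*, (record, ±1) multiplicity pairs, instead of recomputing over the whole input on every change:

```python
from collections import Counter

# A much-simplified sketch of incremental (not fully differential) computation:
# a running word count is updated from (word, multiplicity) deltas, so each
# change costs work proportional to the delta, not to the whole input.

def apply_deltas(counts, deltas):
    """Fold a batch of (word, +1/-1) changes into the running result."""
    for word, mult in deltas:
        counts[word] += mult
        if counts[word] == 0:      # drop records whose multiplicity vanishes
            del counts[word]
    return counts

counts = Counter()
apply_deltas(counts, [("graph", +1), ("cloud", +1), ("graph", +1)])
apply_deltas(counts, [("cloud", -1)])   # a record was retracted
print(dict(counts))                     # {'graph': 2}
```

What differential computation adds on top of this, per the abstract, is letting such deltas flow through arbitrarily nested loops by ordering them with a partial rather than total order on time.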
The seven deadly sins of cloud computing research
Abstract

Cited by 12 (1 self)
Research into distributed parallelism on “the cloud” has surged lately. As the research agenda and methodology in this area are being established, we observe a tendency towards certain common simplifications and shortcuts employed by researchers, which we provocatively term “sins”. We believe that these sins, in some cases, are threats to the scientific integrity and practical applicability of the research presented. In this paper, we identify and discuss seven “deadly sins” (many of which we have ourselves committed!), present evidence illustrating that they pose real problems, and discuss ways for the community to avoid them in the future.
Composable Incremental and Iterative Data-Parallel Computation with Naiad
Abstract

Cited by 5 (0 self)
We report on the design and implementation of Naiad, a set of declarative data-parallel language extensions and an associated runtime supporting efficient and composable incremental and iterative computation. This combination is enabled by a new computational model we call differential dataflow, in which incremental computation can be performed using a partial, rather than total, order on time. Naiad extends standard batch data-parallel processing models like MapReduce, Hadoop, and Dryad/DryadLINQ to support efficient incremental updates to the inputs in the manner of a stream processing system, while at the same time enabling arbitrarily nested fixed-point iteration. In this paper, we evaluate a prototype of Naiad that uses shared memory on a single multicore computer. We apply Naiad to various computations, including several graph algorithms, and observe good scaling properties and efficient incremental recomputation.
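The fixed-point pattern this abstract mentions can be shown in miniature (plain sequential Python with invented names; Naiad itself distributes and incrementalizes this): apply a dataflow step repeatedly until the collection stops changing. Here, undirected connected components via minimum-label propagation:

```python
# A toy fixed-point iteration: propagate the minimum label across edges until
# no label changes, yielding connected components.

def connected_components(vertices, edges):
    label = {v: v for v in vertices}      # start: each vertex labels itself
    changed = True
    while changed:                        # fixed point reached when a full
        changed = False                   # pass moves no label
        for u, v in edges:
            low = min(label[u], label[v])
            if label[u] != low or label[v] != low:
                label[u] = label[v] = low
                changed = True
    return label

print(connected_components(range(5), [(0, 1), (1, 2), (3, 4)]))
# {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
```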
Accelerate Large-Scale Iterative Computation through Asynchronous Accumulative Updates
Abstract

Cited by 4 (2 self)
Myriad data mining algorithms in scientific computing require parsing data sets iteratively. These iterative algorithms have to be implemented in a distributed environment to scale to massive data sets. To accelerate iterative computations in a large-scale distributed environment, we identify a broad class of iterative computations that can accumulate iterative update results. Specifically, different from traditional iterative computations, which iteratively update the result based on the result from the previous iteration, accumulative iterative updates accumulate the intermediate iterative update results. We prove that an accumulative update will yield the same result as its corresponding traditional iterative update. Furthermore, accumulative iterative computation can be performed asynchronously and converges much faster. We present a general computation model to describe asynchronous accumulative iterative computation. Based on this computation model, we design and implement a distributed framework, Maiter. We evaluate Maiter on Amazon EC2 Cloud with 100 EC2 instances. Our results show that Maiter achieves as much as 60x speedup over Hadoop for implementing iterative algorithms.
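The equivalence claim in the abstract can be checked on the simplest possible instance (a scalar toy, not the paper's general model): for a linear iteration x ← a·x + b with |a| < 1, recomputing the state each round and accumulating per-round deltas produce the same value:

```python
# Traditional vs. accumulative forms of the linear iteration x <- a*x + b.
# Both compute b * (1 + a + a^2 + ...), so they agree round by round and
# converge to the fixed point b / (1 - a).

def traditional(a, b, rounds):
    x = 0.0
    for _ in range(rounds):
        x = a * x + b            # recompute the state from the previous state
    return x

def accumulative(a, b, rounds):
    x, delta = 0.0, b            # start by accumulating the constant term
    for _ in range(rounds):
        x += delta               # accumulate the change...
        delta = a * delta        # ...and shrink it for the next round
    return x

print(traditional(0.5, 1.0, 30), accumulative(0.5, 1.0, 30))  # both ~2.0
```

The accumulative form is what enables asynchrony: each round only touches the small `delta`, so updates can be applied in any order without re-reading the full previous state.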
High Performance Clustering of Social Images in a Map-Collective Programming Model
Abstract

Cited by 3 (3 self)
Large-scale iterative computations are common in many important data mining and machine learning algorithms needed in analytics and deep learning. In most of these applications, individual iterations can be specified as MapReduce computations, leading to the Iterative MapReduce programming model for efficient execution of data-intensive iterative computations, interoperably between HPC and cloud environments. Further, one needs communication patterns beyond those familiar from MapReduce, and we base our initial architecture on collectives that integrate capabilities developed by the MPI and MapReduce communities. This leads us to the Map-Collective programming model, which we develop here based on the requirements of a range of applications, by extending our existing Iterative MapReduce environment Twister. This paper studies the implications of large-scale social image clustering, where problems involve 10–100 million images represented as points in a high-dimensional (up to 2048) vector space that need to be divided into up to 1–10 million clusters. This K-means application needs 5 stages in each iteration: Broadcast, Map, Shuffle, Reduce, and Combine, and this paper focuses on the collective communication stages, where large data transfers demand performance optimization. By comparing and combining ideas from the MapReduce and MPI communities, we show that a topology-aware and pipeline-based broadcasting method gives better performance than other MPI and (Iterative) MapReduce systems.
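The five stages named in the abstract can be lined up in a single-process toy (invented code; real systems distribute every stage across workers): one K-means round on 1-D points, with each stage labeled:

```python
# One Iterative MapReduce K-means round on 1-D points, with the five stages
# from the paper marked: Broadcast, Map, Shuffle, Reduce, Combine.

def kmeans_round(points, centroids):
    # Broadcast: every "worker" receives the current centroids (implicit here).
    # Map: assign each point to its nearest centroid.
    assignments = [(min(range(len(centroids)),
                        key=lambda c: abs(p - centroids[c])), p)
                   for p in points]
    # Shuffle: group points by centroid id.
    groups = {}
    for cid, p in assignments:
        groups.setdefault(cid, []).append(p)
    # Reduce: per-centroid partial sums and counts.
    partials = {cid: (sum(ps), len(ps)) for cid, ps in groups.items()}
    # Combine: new centroid positions from the reduced sums.
    return [partials[c][0] / partials[c][1] if c in partials else centroids[c]
            for c in range(len(centroids))]

print(kmeans_round([1.0, 2.0, 9.0, 10.0], [0.0, 8.0]))  # [1.5, 9.5]
```

At the scale the paper targets, the Broadcast stage (centroids to all workers) and the Reduce/Combine stages (partial sums back) dominate, which is why it studies topology-aware and pipelined collective implementations.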
Mammoth Data in the Cloud: Clustering Social Images
Abstract

Cited by 3 (3 self)
Social image datasets have grown to dramatic size, with images classified in vector spaces of high dimension (512–2048) and with potentially billions of images and corresponding classification vectors. We study the challenging problem of clustering such sets into millions of clusters using Iterative MapReduce. We introduce a new K-means algorithm in the Map phase which can tackle the challenge of large cluster and dimension size. Further, we stress that the necessary parallelism of such data-intensive problems is dominated by particular collective (reduction) operations which are common to MPI and MapReduce, and we study different collective implementations, which enable cloud–HPC cluster interoperability. Extensive performance results are presented.
Iterating Skeletons: Structured Parallelism by Composition
Abstract

Cited by 2 (0 self)
Abstract. Algorithmic skeletons are higher-order functions which provide tools for parallel programming at a higher abstraction level, hiding the technical details of parallel execution inside the skeleton implementation. However, this encapsulation becomes an obstacle when the actual algorithm is one that involves iterative application of the same skeleton to successively improve or approximate the result. Striving for a general and portable solution, we propose a skeleton iteration framework in which arbitrary skeletons can be embedded with only minor modifications. The framework is flexible and allows for various parallel iteration control and parallel iteration body variants. We have implemented it in the parallel Haskell dialect Eden using dedicated stream communication types for the iteration. Two non-trivial case studies show the practicality of our approach. The performance of our compositional iteration framework is competitive with customised iteration skeletons.
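The paper's framework is written in Eden, a parallel Haskell dialect; the core idea translates to a sequential Python sketch (invented names, no parallelism): an iteration skeleton is a higher-order function that repeatedly applies an embedded "body" skeleton until a control predicate says to stop:

```python
# A sequential sketch of a skeleton iteration framework: the body skeleton and
# the stopping condition are both plugged in as functions.

def iterate_skeleton(body, done, state):
    """Apply `body` to `state` until `done(state)` holds; return the result."""
    while not done(state):
        state = body(state)
    return state

# Example body: one sweep of Newton's method for sqrt(2).
better = lambda x: (x + 2.0 / x) / 2.0
close = lambda x: abs(x * x - 2.0) < 1e-12

print(iterate_skeleton(better, close, 1.0))  # converges to sqrt(2)
```

In the paper's setting, `body` would itself be a parallel skeleton (e.g. a parallel map) and the framework streams intermediate states between iterations instead of returning them.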
Maiter: An Asynchronous Graph Processing Framework for Delta-based Accumulative Iterative Computation
Abstract

Cited by 2 (0 self)
Myriad graph-based algorithms in machine learning and data mining require parsing relational data iteratively. These algorithms are implemented in a large-scale distributed environment in order to scale to massive data sets. To accelerate these large-scale graph-based iterative computations, we propose delta-based accumulative iterative computation (DAIC). Different from traditional iterative computations, which iteratively update the result based on the result from the previous iteration, DAIC updates the result by accumulating the “changes” between iterations. With DAIC, we can process only the “changes” and skip the negligible updates. Furthermore, we can perform DAIC asynchronously to bypass the high-cost synchronous barriers in heterogeneous distributed environments. Based on the DAIC model, we design and implement an asynchronous graph processing framework, Maiter. We evaluate Maiter on a local cluster as well as on Amazon EC2 Cloud. The results show that Maiter achieves as much as 60x speedup over Hadoop and outperforms other state-of-the-art frameworks.