GPS: A Graph Processing System
"... GPS (for Graph Processing System) is a complete open-source system we developed for scalable, fault-tolerant, and easy-to-program execution of algorithms on extremely large graphs. GPS is similar to Google’s proprietary Pregel system [MAB+ 11], with some useful additional functionality described in ..."
Abstract
-
Cited by 63 (3 self)
- Add to MetaCart
(Show Context)
GPS (for Graph Processing System) is a complete open-source system we developed for scalable, fault-tolerant, and easy-to-program execution of algorithms on extremely large graphs. GPS is similar to Google’s proprietary Pregel system [MAB+11], with some useful additional functionality described in the paper. In distributed graph processing systems like GPS and Pregel, graph partitioning is the problem of deciding which vertices of the graph are assigned to which compute nodes. In addition to presenting the GPS system itself, we describe how we have used GPS to study the effects of different graph partitioning schemes. We present our experiments on the performance of GPS under different static partitioning schemes—assigning vertices to workers “intelligently” before the computation starts—and with GPS’s dynamic repartitioning feature, which reassigns vertices to different compute nodes during the computation by observing their message sending patterns.
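The vertex-centric model that GPS shares with Pregel is compact enough to sketch. The toy below is illustrative only (a hypothetical Vertex class and a synchronous superstep loop running a PageRank-style update), not GPS's actual API; in a real deployment, the partitioning schemes the paper studies decide which worker owns each vertex.

```python
# Minimal sketch of a Pregel/GPS-style vertex-centric superstep.
class Vertex:
    def __init__(self, vid, out_edges):
        self.id = vid
        self.out_edges = out_edges      # neighbor vertex ids
        self.value = 1.0                # e.g., initial PageRank mass
        self.inbox = []                 # messages received last superstep

def superstep(vertices):
    """One synchronous round: every vertex consumes its inbox,
    updates its value, and sends messages along out-edges."""
    outbox = {v.id: [] for v in vertices.values()}
    for v in vertices.values():
        if v.inbox:                     # PageRank-style update
            v.value = 0.15 + 0.85 * sum(v.inbox)
        share = v.value / max(len(v.out_edges), 1)
        for nbr in v.out_edges:
            outbox[nbr].append(share)
    for v in vertices.values():         # barrier: deliver all messages at once
        v.inbox = outbox[v.id]

# Tiny 3-vertex cycle; partitioning would decide which worker owns each vertex.
g = {0: Vertex(0, [1]), 1: Vertex(1, [2]), 2: Vertex(2, [0])}
for v in g.values():
    v.inbox = [v.value]                 # seed the first superstep
for _ in range(20):
    superstep(g)
print({vid: round(v.value, 3) for vid, v in g.items()})
```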
PrIter: a distributed framework for prioritized iterative computations
- In: Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC ’11), 2011
"... Iterative computations are pervasive among data analysis applications in the cloud, including Web search, online social network analysis, recommendation systems, and so on. These cloud applications typically involve data sets of massive scale. Fast convergence of the iterative computation on the mas ..."
Abstract
-
Cited by 32 (9 self)
- Add to MetaCart
(Show Context)
Iterative computations are pervasive among data analysis applications in the cloud, including Web search, online social network analysis, recommendation systems, and so on. These cloud applications typically involve data sets of massive scale. Fast convergence of the iterative computation on the massive data set is essential for these applications. In this paper, we explore the opportunity for accelerating iterative computations and propose a distributed computing framework, PrIter, which enables fast iterative computation by providing support for prioritized iteration. Instead of performing computations on all data records without discrimination, PrIter prioritizes the computations that help convergence the most, so that the convergence speed of the iterative process is significantly improved. We evaluate PrIter on a local cluster of machines as well as on Amazon EC2. The results show that PrIter achieves up to 50x speedup over Hadoop for a series of iterative algorithms.
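As a rough illustration of prioritized iteration, the sketch below runs a PageRank-style computation that, in each round, processes only the k records with the largest pending change rather than all records. The update rule, graph, and function names are assumptions for illustration, not PrIter's API.

```python
# Minimal sketch of prioritized iteration: process only the highest-priority
# records each round, where priority = magnitude of the pending change.
import heapq

def prioritized_pagerank(out_edges, alpha=0.85, k=2, eps=1e-9):
    rank  = {v: 0.0 for v in out_edges}
    delta = {v: 1 - alpha for v in out_edges}   # pending, un-propagated mass
    while max(delta.values()) > eps:
        # select the k highest-priority vertices for this round
        top = heapq.nlargest(k, delta, key=delta.get)
        for v in top:
            d, delta[v] = delta[v], 0.0
            rank[v] += d                         # absorb the pending mass
            for u in out_edges[v]:               # push the rest downstream
                delta[u] += alpha * d / max(len(out_edges[v]), 1)
    return rank

print(prioritized_pagerank({0: [1], 1: [2], 2: [0]}))
```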
Differential dataflow
"... Existing computational models for processing continuously changing input data are unable to efficiently support iterative queries except in limited special cases. This makes it difficult to perform complex tasks, such as social-graph analysis on changing data at interactive timescales, which would g ..."
Abstract
-
Cited by 24 (1 self)
- Add to MetaCart
Existing computational models for processing continuously changing input data are unable to efficiently support iterative queries except in limited special cases. This makes it difficult to perform complex tasks, such as social-graph analysis on changing data at interactive timescales, which would greatly benefit those analyzing the behavior of services like Twitter. In this paper we introduce a new model called differential computation, which extends traditional incremental computation to allow arbitrarily nested iteration, and explain—with reference to a publicly available prototype system called Naiad—how differential computation can be efficiently implemented in the context of a declarative data-parallel dataflow language. The resulting system makes it easy to program previously intractable algorithms such as incrementally updated strongly connected components, and to integrate them with data transformation operations to obtain practically relevant insights from real data streams.
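The core of differential computation can be rendered in a few lines: a collection is stored as deltas indexed by partially ordered (epoch, iteration) times, and its contents at a time t are the sum of all deltas at times less than or equal to t. The sketch below is a minimal rendering of that idea, with all names invented for illustration.

```python
# Minimal sketch: a collection as multiset deltas at partially ordered times.
from collections import Counter

def leq(s, t):                       # product partial order on (epoch, iter)
    return s[0] <= t[0] and s[1] <= t[1]

def value_at(deltas, t):
    """Reconstruct the multiset at time t from all deltas s <= t."""
    total = Counter()
    for s, d in deltas.items():
        if leq(s, t):
            total.update(d)
    return +total                    # drop records with count <= 0

# deltas[(epoch, iteration)] = multiset change attributed to that time
deltas = {
    (0, 0): Counter({"a": 1, "b": 1}),   # initial input, first loop pass
    (0, 1): Counter({"c": 1}),           # change due to iteration only
    (1, 0): Counter({"b": -1}),          # change due to new input only
}
print(value_at(deltas, (1, 1)))          # Counter({'a': 1, 'c': 1})
```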
Accelerate Large-Scale Iterative Computation through Asynchronous Accumulative Updates
"... Myriad of data mining algorithms in scientific computing require parsing data sets iteratively. These iterative algorithms have to be implemented in a distributed environment to scale to massive data sets. To accelerate iterative computations in a large-scale distributed environment, we identify a b ..."
Abstract
-
Cited by 4 (2 self)
- Add to MetaCart
(Show Context)
A myriad of data mining algorithms in scientific computing require parsing data sets iteratively. These iterative algorithms have to be implemented in a distributed environment to scale to massive data sets. To accelerate iterative computations in a large-scale distributed environment, we identify a broad class of iterative computations that can accumulate iterative update results. Specifically, unlike traditional iterative computations, which update the result based on the result from the previous iteration, accumulative iterative update accumulates the intermediate iterative update results. We prove that an accumulative update will yield the same result as its corresponding traditional iterative update. Furthermore, accumulative iterative computation can be performed asynchronously and converges much faster. We present a general computation model to describe asynchronous accumulative iterative computation. Based on the computation model, we design and implement a distributed framework, Maiter. We evaluate Maiter on Amazon EC2 with 100 instances. Our results show that Maiter achieves as much as 60x speedup over Hadoop for implementing iterative algorithms.
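The claimed equivalence is easy to check numerically for a linear fixpoint x = aAx + b: the traditional update recomputes x from the previous iterate, while the accumulative update propagates only the change and adds it to a running sum. The sketch below is a minimal demonstration under that assumption, not Maiter's implementation.

```python
# Numeric check: traditional vs. accumulative iteration on x = aAx + b.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 4)); A /= A.sum(0)       # column-stochastic, so aA contracts
b = rng.random(4); a = 0.5

x = b.copy()                                # traditional: x <- aAx + b
for _ in range(60):
    x = a * A @ x + b

acc, delta = b.copy(), b.copy()             # accumulative: keep a running sum
for _ in range(60):
    delta = a * A @ delta                   # propagate only the change
    acc += delta                            # accumulate it into the result

print(np.allclose(x, acc))                  # True
```

Both loops compute the same partial sum of the series x = b + (aA)b + (aA)^2 b + ..., which is why the accumulative form can also tolerate asynchronous, out-of-order delta propagation.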
Composable Incremental and Iterative Data-Parallel Computation with Naiad
"... We report on the design and implementation of Naiad, a set of declarative data-parallel language extensions and an associated runtime supporting efficient and composable incremental and iterative computation. This combination is enabled by a new computational model we call differential dataflow, in ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
(Show Context)
We report on the design and implementation of Naiad, a set of declarative data-parallel language extensions and an associated runtime supporting efficient and composable incremental and iterative computation. This combination is enabled by a new computational model we call differential dataflow, in which incremental computation can be performed using a partial, rather than total, order on time. Naiad extends standard batch data-parallel processing models like MapReduce, Hadoop, and Dryad/DryadLINQ, to support efficient incremental updates to the inputs in the manner of a stream processing system, while at the same time enabling arbitrarily nested fixed-point iteration. In this paper, we evaluate a prototype of Naiad that uses shared memory on a single multi-core computer. We apply Naiad to various computations, including several graph algorithms, and observe good scaling properties and efficient incremental recomputation.
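The shape of such programs, an outer stream of input epochs with a nested fixed-point loop inside each epoch, can be sketched as follows. Note that this sketch recomputes each fixpoint from scratch; differential dataflow's contribution is precisely to reuse deltas across both epochs and iterations instead. All names are illustrative.

```python
# Minimal sketch: incremental input epochs wrapping a nested fixpoint loop.
def fixpoint(step, state):
    """Inner loop: apply `step` until the state stops changing."""
    while True:
        nxt = step(state)
        if nxt == state:
            return state
        state = nxt

def reachable(edges, frontier):
    """One step of transitive reachability: add one-hop successors."""
    return frontier | {v for u in frontier for (s, v) in edges if s == u}

edges, sources = set(), {0}
for epoch, new_edge in enumerate([(0, 1), (1, 2), (2, 3)]):
    edges.add(new_edge)                      # incremental input update
    closure = fixpoint(lambda f: reachable(edges, f), sources)
    print(f"epoch {epoch}: reachable = {sorted(closure)}")
```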
Mammoth Data in the Cloud: Clustering Social Images
"... Abstract — Social image datasets have grown to dramatic size with images classified in vector spaces with high dimension (512-2048) and with potentially billions of images and corresponding classification vectors. We study the challenging problem of clustering such sets into millions of clusters usi ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
Social image datasets have grown to dramatic size, with images classified in vector spaces of high dimension (512-2048) and with potentially billions of images and corresponding classification vectors. We study the challenging problem of clustering such sets into millions of clusters using Iterative MapReduce. We introduce a new K-means algorithm in the Map phase which can tackle the challenge of large cluster counts and dimension sizes. Further, we stress that the necessary parallelism of such data-intensive problems is dominated by particular collective (reduction) operations which are common to MPI and MapReduce, and we study different collective implementations, which enable cloud-HPC cluster interoperability. Extensive performance results are presented.
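A minimal sketch of the pattern described, assuming a K-means step in which the Map phase assigns points to nearest centers and a collective reduction combines per-cluster sums (the quantities an MPI allreduce would exchange across workers):

```python
# Minimal K-means step: map (nearest-center assignment) + reduce (cluster sums).
import numpy as np

def kmeans_step(points, centers):
    # Map: nearest-center assignment for every point in this partition.
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    # Reduce: per-cluster sums and counts (what a collective would combine
    # across workers before computing new centers).
    k, dim = centers.shape
    sums = np.zeros((k, dim)); counts = np.zeros(k)
    np.add.at(sums, assign, points)
    np.add.at(counts, assign, 1)
    return sums / np.maximum(counts, 1)[:, None]

rng = np.random.default_rng(1)
pts = rng.random((1000, 8))                 # stand-in for image feature vectors
ctr = pts[:4].copy()                        # 4 initial centers
for _ in range(10):
    ctr = kmeans_step(pts, ctr)
print(ctr.shape)                            # (4, 8)
```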
LINVIEW: incremental view maintenance for complex analytical queries
- In SIGMOD, 2014
"... Many analytics tasks and machine learning problems can be naturally expressed by iterative linear algebra programs. In this paper, we study the incremental view maintenance problem for such complex analytical queries. We develop a framework, called Linview, for capturing deltas of linear al-gebra pr ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Many analytics tasks and machine learning problems can be naturally expressed by iterative linear algebra programs. In this paper, we study the incremental view maintenance problem for such complex analytical queries. We develop a framework, called Linview, for capturing deltas of linear algebra programs and understanding their computational cost. Linear algebra operations tend to cause an avalanche effect where even very local changes to the input matrices spread out and infect all of the intermediate results and the final view, causing incremental view maintenance to lose its performance benefit over re-evaluation. We develop techniques based on matrix factorizations to contain such epidemics of change. As a consequence, our techniques make incremental view maintenance of linear algebra practical and usually substantially cheaper than re-evaluation. We show, both analytically and experimentally, the usefulness of these techniques when applied to standard analytics tasks. Our evaluation demonstrates the efficiency of Linview in generating parallel incremental programs that outperform re-evaluation techniques by more than an order of magnitude.
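The factorization idea is easy to illustrate: for a materialized view V = A·B, a rank-1 change to A need not trigger full re-evaluation, because the view's delta factors into two cheap products. The sketch below is a numeric check of that identity, not Linview's code.

```python
# Delta propagation for a matrix-product view under a rank-1 input change.
import numpy as np

rng = np.random.default_rng(2)
A, B = rng.random((500, 500)), rng.random((500, 500))
u, v = rng.random((500, 1)), rng.random((500, 1))

V = A @ B                          # materialized view
A2 = A + u @ v.T                   # local change to the input matrix

V_reval = A2 @ B                   # re-evaluation: O(n^3)
V_ivm   = V + u @ (v.T @ B)        # factored delta: two O(n^2) products

print(np.allclose(V_reval, V_ivm))   # True
```

Keeping the delta in factored form (u and vᵀB) rather than materializing it is what stops the change from "infecting" every downstream intermediate at full cost.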
Iterating Skeletons: Structured Parallelism by Composition
"... Abstract. Algorithmic skeletons are higher-order functions which provide tools for parallel programming at a higher abstraction level, hiding the technical details of parallel execution inside the skeleton implementation. However, this encapsulation becomes an obstacle when the actual algorithm is o ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Algorithmic skeletons are higher-order functions which provide tools for parallel programming at a higher abstraction level, hiding the technical details of parallel execution inside the skeleton implementation. However, this encapsulation becomes an obstacle when the actual algorithm is one that involves iterative application of the same skeleton to successively improve or approximate the result. Striving for a general and portable solution, we propose a skeleton iteration framework in which arbitrary skeletons can be embedded with only minor modifications. The framework is flexible and allows for various parallel iteration control and parallel iteration body variants. We have implemented it in the parallel Haskell dialect Eden using dedicated stream communication types for the iteration. Two non-trivial case studies show the practicality of our approach. The performance of our compositional iteration framework is competitive with customised iteration skeletons.
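The compositional shape is sketchable even outside Eden (the paper's actual setting is Haskell): an iteration combinator that repeatedly applies an arbitrary inner skeleton, here a parallel map, until a caller-supplied convergence predicate holds. All names below are illustrative.

```python
# Minimal sketch of iteration as composition around an arbitrary skeleton.
from multiprocessing import Pool

def iterate_skel(skeleton, done, state):
    """Keep applying the embedded `skeleton` until `done(state)` holds."""
    while not done(state):
        state = skeleton(state)
    return state

def jacobi_step(x):                 # toy per-element body: x <- x/2 + 1
    return x / 2 + 1                # fixpoint at 2.0

if __name__ == "__main__":
    with Pool(4) as pool:
        par_map = lambda xs: pool.map(jacobi_step, xs)   # the inner skeleton
        done = lambda xs: all(abs(x - 2.0) < 1e-6 for x in xs)
        print(iterate_skel(par_map, done, [0.0] * 8))
```

The point of the combinator is that par_map could be swapped for any other skeleton (divide-and-conquer, farm, and so on) without touching the iteration control.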
Large-Scale Image Classification using High Performance Clustering
2013
"... Abstract — Computer vision is being revolutionized by the incredible volume of visual data available on the Internet. A key part of computer vision, data mining, or any Big Data problem is the analytics that transforms this raw data into understanding, and many of these data mining approaches requir ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
Computer vision is being revolutionized by the incredible volume of visual data available on the Internet. A key part of computer vision, data mining, or any Big Data problem is the analytics that transforms this raw data into understanding, and many of these data mining approaches require iterative computations at an unprecedented scale. Often an individual iteration can be specified as a MapReduce computation, leading to the iterative MapReduce programming model for efficient execution of data-intensive iterative computations. We propose the Map-Collective model as a generalization of our earlier Twister system that is interoperable between HPC and cloud environments. In this paper, we study the problem of large-scale clustering, applying it to cluster large collections of 10-100 million social images, each represented as a point in a high-dimensional (up to 2048) vector space, into 1-10 million clusters. This K-means application needs 5 stages in each iteration: Broadcast, Map, Shuffle, Reduce and Combine, and this paper presents new collective communication approaches optimized for large data transfers. Furthermore, one needs communication patterns beyond those familiar from MapReduce, and we develop collectives that integrate capabilities developed by the MPI and MapReduce communities. We demonstrate that a topology-aware and pipeline-based broadcasting method gives better performance than both MPI and other (Iterative) MapReduce systems. We present early results of an end-to-end computer vision application and evaluate the quality of the resulting image classifications, showing that increasing the number of feature clusters leads to improved classifier accuracy.
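The pipeline-broadcast idea can be sketched abstractly: the root splits a large message into chunks and relays them along a chain of workers, so every link carries data concurrently. The simulation below only counts relay steps and is a toy under stated assumptions, not the paper's topology-aware implementation.

```python
# Toy simulation of chunked pipeline broadcast along a chain of workers.
def pipeline_broadcast(data, n_workers, chunk_size):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    buffers = [[] for _ in range(n_workers)]     # worker-local receive buffers
    buffers[0] = chunks                          # worker 0 is the root
    steps = 0
    # In each step, every worker forwards its next unsent chunk downstream;
    # transfers on different links overlap, which is the pipelining win.
    while len(buffers[-1]) < len(chunks):
        for w in range(n_workers - 2, -1, -1):
            sent = len(buffers[w + 1])
            if sent < len(buffers[w]):
                buffers[w + 1].append(buffers[w][sent])
        steps += 1
    return steps

# With c chunks and p workers, the pipeline needs c + p - 2 relay steps,
# versus roughly c * (p - 1) sequential sends for a naive root-to-all loop.
print(pipeline_broadcast(list(range(64)), n_workers=5, chunk_size=4))  # 19
```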
Maiter: An Asynchronous Graph Processing Framework for Delta-based Accumulative Iterative Computation
"... Abstract—Myriad of graph-based algorithms in machine learning and data mining require parsing relational data iteratively. These algorithms are implemented in a large-scale distributed environment in order to scale to massive data sets. To accelerate these large-scale graph-based iterative computati ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
A myriad of graph-based algorithms in machine learning and data mining require parsing relational data iteratively. These algorithms are implemented in a large-scale distributed environment in order to scale to massive data sets. To accelerate these large-scale graph-based iterative computations, we propose delta-based accumulative iterative computation (DAIC). Unlike traditional iterative computations, which update the result based on the result from the previous iteration, DAIC updates the result by accumulating the “changes” between iterations. With DAIC, we can process only the “changes”, avoiding negligible updates. Furthermore, we can perform DAIC asynchronously to bypass the high-cost synchronous barriers in heterogeneous distributed environments. Based on the DAIC model, we design and implement an asynchronous graph processing framework, Maiter. We evaluate Maiter on a local cluster as well as on Amazon EC2. The results show that Maiter achieves as much as 60x speedup over Hadoop and outperforms other state-of-the-art frameworks.
Index Terms—delta-based accumulative iterative computation, asynchronous iteration, Maiter, distributed framework.
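A minimal sequential sketch of DAIC, assuming a PageRank-style update rule and largest-delta scheduling (both illustrative choices, not Maiter's code): each vertex keeps an accumulated value plus a pending delta, and vertices are processed one at a time with no synchronization barrier.

```python
# Minimal DAIC sketch: accumulate values, propagate only deltas, no barriers.
def daic_pagerank(out_edges, alpha=0.85, eps=1e-10):
    value = {v: 0.0 for v in out_edges}
    delta = {v: 1 - alpha for v in out_edges}
    while True:
        v = max(delta, key=delta.get)        # async: any order converges
        if delta[v] <= eps:
            return value
        d, delta[v] = delta[v], 0.0
        value[v] += d                        # accumulate the change locally
        for u in out_edges[v]:               # propagate only the delta
            delta[u] += alpha * d / max(len(out_edges[v]), 1)

print(daic_pagerank({0: [1, 2], 1: [2], 2: [0]}))
```

Because each step removes delta d and reinjects only alpha·d, the total pending change shrinks geometrically, so the loop terminates regardless of the processing order, which is what makes the barrier-free execution safe.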