Communication Optimal Parallel Multiplication of Sparse Random Matrices
, 2013
Cited by 10 (6 self)
Parallel algorithms for sparse matrix-matrix multiplication typically spend most of their time on inter-processor communication rather than on computation, and hardware trends predict the relative cost of communication will only increase. Thus, sparse matrix multiplication algorithms must minimize communication costs in order to scale to large processor counts. In this paper, we consider multiplying sparse matrices corresponding to Erdős-Rényi random graphs on distributed-memory parallel machines. We prove a new lower bound on the expected communication cost for a wide class of algorithms. Our analysis of existing algorithms shows that, while some are optimal for a limited range of matrix density and number of processors, none is optimal in general. We obtain two new parallel algorithms and prove that they match the expected communication cost lower bound, and hence they are optimal.
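As a rough illustration of the computation being parallelized, here is a minimal sequential sketch, not one of the paper's algorithms. It assumes an n×n Erdős-Rényi matrix in which each entry is nonzero independently with probability d/n, so that multiplying two such matrices takes d²·n scalar multiplications in expectation. The function names `erdos_renyi_matrix` and `spgemm` are hypothetical, and communication cost is not modeled at all.

```python
import random


def erdos_renyi_matrix(n, d, rng):
    """Assumed model: each entry is nonzero (value 1.0) with probability d/n.

    Rows are stored as dicts mapping column index -> value.
    """
    p = d / n
    return [{j: 1.0 for j in range(n) if rng.random() < p} for _ in range(n)]


def spgemm(a, b):
    """Row-by-row sparse product C = A * B.

    Returns (C, flops), where flops counts scalar multiplications.
    """
    c, flops = [], 0
    for row in a:
        acc = {}
        for k, a_ik in row.items():
            # Only rows of B selected by nonzeros of A's row are touched.
            for j, b_kj in b[k].items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
                flops += 1
        c.append(acc)
    return c, flops


if __name__ == "__main__":
    rng = random.Random(0)
    n, d = 2000, 4
    a = erdos_renyi_matrix(n, d, rng)
    b = erdos_renyi_matrix(n, d, rng)
    _, flops = spgemm(a, b)
    # The measured count should be in the vicinity of d * d * n.
    print("multiplications:", flops, "expected about:", d * d * n)
```

In a distributed setting, the rows of B fetched in the inner loop are what would have to move between processors, which is why the work is so small relative to the data movement.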
Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search
Cited by 6 (1 self)
Breadth-first search (BFS) is a fundamental graph primitive frequently used as a building block for many complex graph algorithms. In the worst case, the complexity of BFS is linear in the number of edges and vertices, and the conventional top-down approach always takes as much time as the worst case. A recently discovered bottom-up approach manages to cut down the complexity all the way to the number of vertices in the best case, which is typically at least an order of magnitude less than the number of edges. The bottom-up approach is not always advantageous, so it is combined with the top-down approach to make the direction-optimizing algorithm, which adaptively switches from top-down to bottom-up as the frontier expands. We present a scalable distributed-memory parallelization of this challenging algorithm and show speedups of up to an order of magnitude compared to an earlier purely top-down code. Our approach also uses a 2D decomposition of the graph that has previously been shown to be superior to a 1D decomposition. Using the default parameters of the Graph500 benchmark, our new algorithm achieves a performance rate of over 240 billion edges per second on 115 thousand cores of a Cray XE6, which makes it over 7× faster than a conventional top-down algorithm using the same set of optimizations and data distribution.
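The switching idea can be sketched on a single machine as follows. This is a minimal sequential illustration, not the paper's distributed-memory code: the switching rule (go bottom-up when the frontier's edge count times a tunable `alpha` exceeds the edge count of the unvisited vertices) and the default `alpha=4` are assumptions chosen for the example, and the graph is assumed to be undirected.

```python
def direction_optimizing_bfs(adj, source, alpha=4):
    """BFS levels from `source` on an undirected graph.

    adj: dict mapping each vertex to a list of its neighbors.
    alpha: hypothetical tuning knob for the top-down/bottom-up switch.
    Returns a dict vertex -> distance for all reachable vertices.
    """
    dist = {source: 0}
    frontier = {source}
    level = 0
    while frontier:
        level += 1
        unvisited = [v for v in adj if v not in dist]
        frontier_edges = sum(len(adj[v]) for v in frontier)
        unvisited_edges = sum(len(adj[v]) for v in unvisited)
        next_frontier = set()
        if frontier_edges * alpha > unvisited_edges:
            # Bottom-up: every unvisited vertex looks for any parent in the
            # frontier and stops scanning its neighbors at the first hit.
            for v in unvisited:
                for u in adj[v]:
                    if u in frontier:
                        dist[v] = level
                        next_frontier.add(v)
                        break
        else:
            # Top-down: every frontier vertex scans all of its neighbors.
            for u in frontier:
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = level
                        next_frontier.add(v)
        frontier = next_frontier
    return dist


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
    print(direction_optimizing_bfs(graph, 0))
```

The early `break` in the bottom-up branch is where the savings come from: once the frontier is large, most unvisited vertices find a parent after inspecting only a few edges.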
Portable Parallel Performance from Sequential, Productive, Embedded Domain-Specific Languages
- In: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP ’12
Cited by 6 (2 self)
Domain-expert productivity programmers desire scalable application performance, but usually must rely on efficiency programmers who are experts in explicit parallel programming to achieve it. Since such programmers are rare, to maximize reuse of their work we propose encapsulating their strategies in mini-compilers for domain-specific embedded languages (DSELs) glued together by a common high-level host language familiar to productivity programmers. The nontrivial applications that use these DSELs achieve up to 98% of peak attainable performance, comparable to or better than existing hand-coded implementations. Our approach is unique in that each mini-compiler not only performs conventional compiler transformations and optimizations, but also includes imperative procedural code that captures an efficiency expert’s strategy for mapping a narrow domain onto a specific type of hardware. The result is source- and performance-portability for productivity programmers and parallel performance that rivals that of hand-coded efficiency-language implementations of the same applications. We describe a framework that supports our methodology and five implemented DSELs supporting common computation kernels. Our results demonstrate that for several interesting classes of problems, efficiency-level parallel performance can be achieved by packaging efficiency programmers’ expertise in a reusable framework that is easy to use for both productivity programmers and efficiency programmers.
Recent advances in graph partitioning
, 2013
Cited by 6 (2 self)
We survey recent trends in practical algorithms for balanced graph partitioning together with applications and future research directions.
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
, 2014
Cited by 4 (0 self)
From social networks to language modeling, the growing scale and importance of graph data has driven the development of numerous new graph-parallel systems (e.g., Pregel, GraphLab). By restricting the computation that can be expressed and introducing new techniques to partition and distribute the graph, these systems can efficiently execute iterative graph algorithms orders of magnitude faster than more general data-parallel systems. However, the same restrictions that enable the performance gains also make it difficult to express many of the important stages in a typical graph-analytics pipeline: constructing the graph, modifying its structure, or expressing computation that spans multiple graphs. As a consequence, existing graph analytics pipelines compose graph-parallel and data-parallel systems using external storage systems, leading to extensive data movement and a complicated programming model. To address these challenges we introduce GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. GraphX provides a small, core set of graph-parallel operators expressive enough to implement the Pregel and PowerGraph abstractions, yet simple enough to be cast in relational algebra. GraphX uses a collection of query optimization techniques such as automatic join rewrites to efficiently implement these graph-parallel operators. We evaluate GraphX on real-world graphs and workloads and demonstrate that GraphX achieves performance comparable to specialized graph computation systems, while outperforming them in end-to-end graph pipelines. Moreover, GraphX achieves a balance between expressiveness, performance, and ease of use.
Introducing ScaleGraph: an X10 library for billion scale graph analytics
- In Proceedings of the 2012 ACM SIGPLAN X10 Workshop, X10 ’12
, 2012
High-Productivity and High-Performance Analysis of Filtered Semantic Graphs
Cited by 2 (0 self)
Abstract—High performance is a crucial consideration when executing a complex analytic query on a massive semantic graph. In a semantic graph, vertices and edges carry attributes of various types. Analytic queries on semantic graphs typically depend on the values of these attributes; thus, the computation must view the graph through a filter that passes only those individual vertices and edges of interest. Knowledge Discovery Toolbox (KDT), a Python library for parallel graph computations, is customizable in two ways. First, the user can write custom graph algorithms by specifying operations between edges and vertices. These programmer-specified operations are called semiring operations due to KDT’s underlying linear-algebraic abstractions. Second, the user can customize existing graph algorithms by writing filters that return true for those vertices and edges the user wants to retain during algorithm
Hardware/software vectorization for closeness centrality on multi-/many-core architectures
- In 28th International Parallel and Distributed Processing Symposium Workshops, Workshop on Multithreaded Architectures and Applications (MTAAP
, 2014
Cited by 2 (0 self)
Centrality metrics have been shown to be highly correlated with the importance and loads of the nodes in a network. Given the scale of today’s social networks, it is essential to use efficient algorithms and high performance computing techniques for their fast computation. In this work, we exploit hardware and software vectorization in combination with fine-grain parallelization to compute the closeness centrality values. The proposed vectorization approach enables us to do concurrent breadth-first search operations and significantly increases the performance. We provide a comparison of different vectorization schemes and experimentally evaluate our contributions with respect to the existing parallel CPU-based solutions on cutting-edge hardware. Our implementations are 11 times faster than the state-of-the-art implementation for a graph with 234 million edges. The proposed techniques are useful in showing how vectorization can be efficiently utilized to execute other graph kernels that require multiple traversals over a large-scale network on cutting-edge architectures. Keywords: Centrality, closeness centrality, vectorization, breadth-first search, Intel Xeon Phi.
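The idea of running many breadth-first searches concurrently can be sketched with bit vectors. The example below is a toy illustration under stated assumptions, not the paper's implementation: Python integers stand in for hardware vector registers, bit s of a vertex's word means "the BFS started at source s has reached this vertex", and closeness is taken as the reciprocal of the sum of distances to reachable vertices. The function names are hypothetical.

```python
def farness_bit_parallel(adj):
    """Run one BFS per vertex concurrently using bitwise operations.

    adj: list of neighbor lists for vertices 0..n-1 (undirected graph).
    Returns far, where far[s] is the sum of distances from s to every
    vertex reachable from s.
    """
    n = len(adj)
    visited = [1 << v for v in range(n)]  # source v has reached itself
    frontier = visited[:]
    far = [0] * n
    level = 0
    while any(frontier):
        level += 1
        nxt = [0] * n
        # One OR per edge advances all searches crossing that edge at once.
        for u in range(n):
            if frontier[u]:
                for w in adj[u]:
                    nxt[w] |= frontier[u]
        for v in range(n):
            nxt[v] &= ~visited[v]  # keep only searches new to v
            visited[v] |= nxt[v]
            bits, s = nxt[v], 0
            while bits:
                if bits & 1:
                    far[s] += level  # source s reaches v at this level
                bits >>= 1
                s += 1
        frontier = nxt
    return far


def closeness(adj):
    """Closeness as 1 / far (assumed definition); 0.0 for isolated vertices."""
    return [1.0 / f if f else 0.0 for f in farness_bit_parallel(adj)]


if __name__ == "__main__":
    path = [[1], [0, 2], [1]]  # the path 0 - 1 - 2
    print(farness_bit_parallel(path))
    print(closeness(path))
```

The point of the layout is that a single bitwise OR in the inner loop does the work of many independent traversals, which is what makes the approach a natural fit for wide vector units.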
PEGASUS: MINING BILLION-SCALE GRAPHS IN THE CLOUD
Cited by 1 (0 self)
We have entered an era of big data. Graphs are now measured in terabytes or even petabytes; analyzing them has become increasingly challenging. How do we find patterns and anomalies in these graphs that no longer fit in memory? How should we exploit parallel computation to boost our analysis capabilities? We present PEGASUS, the first open-source, peta-scale graph mining library, for the HADOOP platform (open-source implementation of MAPREDUCE). By observing that many graph mining operations can be described by repeated matrix-vector multiplications, we devised an important primitive called GIM-V for PEGASUS that applies to all such operations. GIM-V (Generalized Iterative Matrix-Vector multiplication) is highly optimized, achieving (1) good scale-up with the number of machines, (2) run time linear in the number of edges, and (3) performance more than 9 times faster than the non-optimized version. We ran experiments for PEGASUS on M45, one of the largest HADOOP clusters in the world. We report our findings on several real graphs with billions of nodes and edges. Selected findings include (a) the discovery of adult advertisers in the who-follows-whom graph on Twitter, and (b) the 7 degrees of separation in the Web graph.
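To make the "repeated matrix-vector multiplication" view concrete, here is a small in-memory sketch of a generalized matrix-vector step, used to label connected components. It is an illustration only, not the PEGASUS or HADOOP implementation: the three pluggable operations and their names (`combine2`, `combine_all`, `assign`) are assumptions introduced for this example and are not taken from the abstract above.

```python
def gim_v(edges, v, combine2, combine_all, assign):
    """One generalized matrix-vector step over a sparse matrix.

    edges: list of (i, j, m_ij) nonzeros; v: current vector as a list.
    combine2 pairs a matrix entry with a vector entry, combine_all reduces
    the partial results for a row, assign merges old and new values.
    """
    partial = [[] for _ in v]
    for i, j, m_ij in edges:
        partial[i].append(combine2(m_ij, v[j]))
    return [assign(v[i], combine_all(partial[i])) for i in range(len(v))]


def connected_components(n, undirected_edges):
    """Label each vertex with the smallest vertex id in its component."""
    edges = [(i, j, 1)
             for a, b in undirected_edges
             for i, j in ((a, b), (b, a))]
    labels = list(range(n))
    while True:
        new_labels = gim_v(
            edges,
            labels,
            combine2=lambda m_ij, v_j: v_j,                 # pass neighbor label
            combine_all=lambda xs: min(xs) if xs else None,  # smallest seen
            assign=lambda old, new: old if new is None else min(old, new),
        )
        if new_labels == labels:  # fixed point reached
            return labels
        labels = new_labels


if __name__ == "__main__":
    print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))
```

Swapping in multiplication for `combine2` and summation for `combine_all` turns the same loop into an ordinary matrix-vector product, which is the sense in which one primitive can cover several mining operations.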