Results 11–20 of 58
Communication Optimal Parallel Multiplication of Sparse Random Matrices
, 2013
"... Parallel algorithms for sparse matrixmatrix multiplication typically spend most of their time on interprocessor communication rather than on computation, and hardware trends predict the relative cost of communication will only increase. Thus, sparse matrix multiplication algorithms must minimize c ..."
Abstract

Cited by 10 (6 self)
Parallel algorithms for sparse matrix-matrix multiplication typically spend most of their time on interprocessor communication rather than on computation, and hardware trends predict the relative cost of communication will only increase. Thus, sparse matrix multiplication algorithms must minimize communication costs in order to scale to large processor counts. In this paper, we consider multiplying sparse matrices corresponding to Erdős–Rényi random graphs on distributed-memory parallel machines. We prove a new lower bound on the expected communication cost for a wide class of algorithms. Our analysis of existing algorithms shows that, while some are optimal for a limited range of matrix density and number of processors, none is optimal in general. We obtain two new parallel algorithms and prove that they match the expected communication cost lower bound, and hence they are optimal.
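The setting of this paper can be sketched in a few lines: generate an Erdős–Rényi-style sparse 0/1 matrix and multiply it row by row (Gustavson's algorithm), touching only stored nonzeros. This is a serial illustration of the problem, not the paper's communication-optimal distributed algorithms; the dict-of-dicts storage and function names are ours.

```python
import random
from collections import defaultdict

def erdos_renyi_sparse(n, p, seed=0):
    """n x n sparse 0/1 matrix: each entry is nonzero with probability p,
    stored as {row: {col: value}} so only nonzeros occupy memory."""
    rng = random.Random(seed)
    A = defaultdict(dict)
    for i in range(n):
        for j in range(n):
            if rng.random() < p:
                A[i][j] = 1
    return A

def spgemm(A, B):
    """Row-wise sparse matrix-matrix product (Gustavson's algorithm):
    row C[i] accumulates A[i][k] * B[k][j] over stored nonzeros only."""
    C = defaultdict(dict)
    for i, row in A.items():
        acc = defaultdict(int)
        for k, a in row.items():
            for j, b in B.get(k, {}).items():
                acc[j] += a * b
        if acc:
            C[i] = dict(acc)
    return C
```

On distributed machines, the cost that dominates is not these multiply-adds but shipping pieces of A, B, and C between processors, which is what the paper's lower bound quantifies.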
Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search
"... Abstract—Breadthfirst search (BFS) is a fundamental graph primitive frequently used as a building block for many complex graph algorithms. In the worst case, the complexity of BFS is linear in the number of edges and vertices, and the conventional topdown approach always takes as much time as the ..."
Abstract

Cited by 6 (1 self)
Breadth-first search (BFS) is a fundamental graph primitive frequently used as a building block for many complex graph algorithms. In the worst case, the complexity of BFS is linear in the number of edges and vertices, and the conventional top-down approach always takes as much time as the worst case. A recently discovered bottom-up approach manages to cut down the complexity all the way to the number of vertices in the best case, which is typically at least an order of magnitude less than the number of edges. The bottom-up approach is not always advantageous, so it is combined with the top-down approach to make the direction-optimizing algorithm, which adaptively switches from top-down to bottom-up as the frontier expands. We present a scalable distributed-memory parallelization of this challenging algorithm and show up to an order of magnitude speedups compared to an earlier purely top-down code. Our approach also uses a 2D decomposition of the graph that has previously been shown to be superior to a 1D decomposition. Using the default parameters of the Graph500 benchmark, our new algorithm achieves a performance rate of over 240 billion edges per second on 115 thousand cores of a Cray XE6, which makes it over 7× faster than a conventional top-down algorithm using the same set of optimizations and data distribution.
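The top-down/bottom-up switch described above can be sketched serially: top-down steps push from the frontier to unvisited neighbors, while bottom-up steps let each unvisited vertex probe its neighbors for a parent and stop at the first hit. The switching threshold `alpha` here is an illustrative stand-in for the paper's tuned heuristic, and the whole sketch ignores the distributed 2D decomposition.

```python
def bfs_direction_optimizing(adj, source, alpha=2.0):
    """Direction-optimizing BFS sketch (serial, undirected graph).
    adj: {vertex: set(neighbors)}. Returns a parent map (parent of
    the source is itself). `alpha` is a made-up switching knob."""
    parent = {source: source}
    frontier = {source}
    while frontier:
        frontier_edges = sum(len(adj[v]) for v in frontier)
        unvisited = set(adj) - set(parent)
        next_frontier = set()
        if frontier_edges > len(unvisited) * alpha:
            # bottom-up: each unvisited vertex looks for any parent
            # in the frontier and stops early -- this is the step
            # that can skip most edge inspections
            for v in unvisited:
                for u in adj[v]:
                    if u in frontier:
                        parent[v] = u
                        next_frontier.add(v)
                        break
        else:
            # top-down: frontier vertices push to unvisited neighbors
            for u in frontier:
                for v in adj[u]:
                    if v not in parent:
                        parent[v] = u
                        next_frontier.add(v)
        frontier = next_frontier
    return parent
```

In a real run the bottom-up branch wins on the middle BFS levels of low-diameter graphs, where the frontier covers a large fraction of the vertices.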
Portable Parallel Performance from Sequential, Productive, Embedded Domain-Specific Languages
 In: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP ’12
"... Domainexpert productivity programmers desire scalable application performance, but usually must rely on efficiency programmers who are experts in explicit parallel programming to achieve it. Since such programmers are rare, to maximize reuse of their work we propose encapsulating their strategies i ..."
Abstract

Cited by 6 (2 self)
Domain-expert productivity programmers desire scalable application performance, but usually must rely on efficiency programmers who are experts in explicit parallel programming to achieve it. Since such programmers are rare, to maximize reuse of their work we propose encapsulating their strategies in mini-compilers for domain-specific embedded languages (DSELs) glued together by a common high-level host language familiar to productivity programmers. The nontrivial applications that use these DSELs perform at up to 98% of peak attainable performance, comparable to or better than existing hand-coded implementations. Our approach is unique in that each mini-compiler not only performs conventional compiler transformations and optimizations, but includes imperative procedural code that captures an efficiency expert's strategy for mapping a narrow domain onto a specific type of hardware. The result is source- and performance-portability for productivity programmers, with parallel performance that rivals that of hand-coded efficiency-language implementations of the same applications. We describe a framework that supports our methodology and five implemented DSELs supporting common computation kernels. Our results demonstrate that for several interesting classes of problems, efficiency-level parallel performance can be achieved by packaging efficiency programmers' expertise in a reusable framework that is easy to use for both productivity programmers and efficiency programmers.
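The mini-compiler idea can be illustrated in miniature: the productivity programmer states only a declarative spec, and a small compiler function generates a specialized kernel from it. This toy generates Python from a 1-D stencil spec; the actual framework targets parallel hardware and embeds far more of the efficiency expert's mapping strategy, so everything below (the function name, the unrolling choice) is our illustrative assumption.

```python
def compile_stencil(weights):
    """Toy DSEL 'mini-compiler': given stencil weights, emit and exec
    a specialized function with the stencil loop body unrolled, so the
    generation step -- not the user -- decides the code shape."""
    half = len(weights) // 2
    offsets = range(-half, half + 1)
    terms = " + ".join(f"{w} * x[i + {o}]" for w, o in zip(weights, offsets))
    src = (
        "def stencil(x):\n"
        f"    return [{terms} for i in range({half}, len(x) - {half})]\n"
    )
    namespace = {}
    exec(src, namespace)  # materialize the generated kernel
    return namespace["stencil"]

# productivity-programmer view: declare the stencil, get a fast kernel
smooth = compile_stencil([1, 1, 1])
```

The point mirrored from the paper is that the generation step is ordinary imperative code, so an expert can encode arbitrary strategy (tiling, vectorization, target selection) behind the same declarative interface.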
Recent advances in graph partitioning
, 2013
"... We survey recent trends in practical algorithms for balanced graph partitioning together with applications and future research directions. ..."
Abstract

Cited by 6 (2 self)
We survey recent trends in practical algorithms for balanced graph partitioning together with applications and future research directions.
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
, 2014
"... From social networks to language modeling, the growing scale and importance of graph data has driven the development of numerous new graphparallel systems (e.g., Pregel, GraphLab). By restricting the computation that can be expressed and introducing new techniques to partition and distribute the ..."
Abstract

Cited by 4 (0 self)
From social networks to language modeling, the growing scale and importance of graph data has driven the development of numerous new graph-parallel systems (e.g., Pregel, GraphLab). By restricting the computation that can be expressed and introducing new techniques to partition and distribute the graph, these systems can efficiently execute iterative graph algorithms orders of magnitude faster than more general data-parallel systems. However, the same restrictions that enable the performance gains also make it difficult to express many of the important stages in a typical graph-analytics pipeline: constructing the graph, modifying its structure, or expressing computation that spans multiple graphs. As a consequence, existing graph analytics pipelines compose graph-parallel and data-parallel systems using external storage systems, leading to extensive data movement and a complicated programming model. To address these challenges we introduce GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. GraphX provides a small, core set of graph-parallel operators expressive enough to implement the Pregel and PowerGraph abstractions, yet simple enough to be cast in relational algebra. GraphX uses a collection of query optimization techniques such as automatic join rewrites to efficiently implement these graph-parallel operators. We evaluate GraphX on real-world graphs and workloads and demonstrate that GraphX achieves performance comparable to specialized graph computation systems, while outperforming them in end-to-end graph pipelines. Moreover, GraphX achieves a balance between expressiveness, performance, and ease of use.
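The "graph-parallel operators cast in relational algebra" claim can be made concrete with a serial sketch of an aggregateMessages-style operator: join vertex attributes onto edges to form triplets, map each triplet to a message, and reduce messages by destination. The operator name follows GraphX's terminology, but this is plain Python over dicts, not the Spark implementation.

```python
def aggregate_messages(vertices, edges, send_msg, merge_msg):
    """GraphX-style aggregateMessages as data-parallel steps.
    vertices: {vid: attr}; edges: list of (src, dst, edge_attr).
    send_msg(triplet) -> message or None; merge_msg reduces two
    messages bound for the same destination vertex."""
    msgs = {}
    for src, dst, eattr in edges:
        # the 'join': assemble the (src attr, dst attr, edge attr) triplet
        triplet = (src, vertices[src], dst, vertices[dst], eattr)
        m = send_msg(triplet)
        if m is not None:
            # the 'reduce by key': combine messages per destination
            msgs[dst] = merge_msg(msgs[dst], m) if dst in msgs else m
    return msgs

# example: in-degree as a trivial vertex program
verts = {1: "a", 2: "b", 3: "c"}
edges = [(1, 2, None), (3, 2, None), (2, 3, None)]
in_deg = aggregate_messages(verts, edges, lambda t: 1, lambda a, b: a + b)
```

In GraphX the triplet construction is literally a join between the vertex and edge tables, which is what lets a relational optimizer rewrite and incrementally maintain it.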
Introducing ScaleGraph: an X10 library for billion scale graph analytics
 In Proceedings of the 2012 ACM SIGPLAN X10 Workshop, X10 ’12
, 2012
"... ..."
High-Productivity and High-Performance Analysis of Filtered Semantic Graphs
"... Abstract—High performance is a crucial consideration when executing a complex analytic query on a massive semantic graph. In a semantic graph, vertices and edges carry attributes of various types. Analytic queries on semantic graphs typically depend on the values of these attributes; thus, the compu ..."
Abstract

Cited by 2 (0 self)
High performance is a crucial consideration when executing a complex analytic query on a massive semantic graph. In a semantic graph, vertices and edges carry attributes of various types. Analytic queries on semantic graphs typically depend on the values of these attributes; thus, the computation must view the graph through a filter that passes only those individual vertices and edges of interest. Knowledge Discovery Toolbox (KDT), a Python library for parallel graph computations, is customizable in two ways. First, the user can write custom graph algorithms by specifying operations between edges and vertices. These programmer-specified operations are called semiring operations due to KDT's underlying linear-algebraic abstractions. Second, the user can customize existing graph algorithms by writing filters that return true for those vertices and edges the user wants to retain during algorithm
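The two customization points the abstract describes, user-supplied semiring operations and edge filters, can be sketched together as one filtered sparse matrix-vector step. This is a toy serial illustration of the abstraction; KDT itself runs these operations in parallel on Combinatorial BLAS, and the parameter names here are ours.

```python
def filtered_semiring_spmv(adj, vec, mul, add, edge_filter):
    """One filtered semiring matrix-vector product.
    adj: {dst: [(src, edge_attr), ...]} (incoming edges per vertex);
    vec: {src: value} (e.g., a BFS frontier);
    mul/add: user-supplied semiring operations;
    edge_filter: predicate on edge attributes deciding participation."""
    out = {}
    for dst, in_edges in adj.items():
        acc = None
        for src, eattr in in_edges:
            if src in vec and edge_filter(eattr):
                term = mul(eattr, vec[src])        # semiring 'multiply'
                acc = term if acc is None else add(acc, term)  # semiring 'add'
        if acc is not None:
            out[dst] = acc
    return out
```

With boolean-OR as `add` and a frontier as `vec`, one call expands a BFS level while the filter silently drops edges whose attributes fail the query, which is exactly the "view the graph through a filter" behavior described above.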
Hardware/software vectorization for closeness centrality on multi/many-core architectures
 In 28th International Parallel and Distributed Processing Symposium Workshops, Workshop on Multithreaded Architectures and Applications (MTAAP
, 2014
"... Abstract—Centrality metrics have shown to be highly correlated with the importance and loads of the nodes in a network. Given the scale of today’s social networks, it is essential to use efficient algorithms and high performance computing techniques for their fast computation. In this work, we expl ..."
Abstract

Cited by 2 (0 self)
Centrality metrics have been shown to be highly correlated with the importance and loads of the nodes in a network. Given the scale of today's social networks, it is essential to use efficient algorithms and high-performance computing techniques for their fast computation. In this work, we exploit hardware and software vectorization in combination with fine-grain parallelization to compute closeness centrality values. The proposed vectorization approach enables concurrent breadth-first search operations and significantly increases performance. We compare different vectorization schemes and experimentally evaluate our contributions against existing parallel CPU-based solutions on cutting-edge hardware. Our implementations are up to 11 times faster than the state-of-the-art implementation for a graph with 234 million edges. The proposed techniques also show how vectorization can be efficiently utilized to execute other graph kernels that require multiple traversals over a large-scale network on cutting-edge architectures.
Keywords: centrality, closeness centrality, vectorization, breadth-first search, Intel Xeon Phi.
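The "concurrent breadth-first search" idea admits a compact software analogue: give each BFS source one bit of an integer mask per vertex, so a single bitwise OR advances many searches at once, much as one vector instruction would. This serial sketch uses one convention for closeness, (reached − 1) / (sum of distances), and is our illustration, not the paper's implementation.

```python
def closeness_multi_bfs(adj, sources):
    """Run len(sources) BFS traversals simultaneously via per-vertex
    bitmasks and accumulate each source's distance sums.
    adj: {vertex: [neighbors]}. Returns {source: closeness}."""
    bit = {s: 1 << i for i, s in enumerate(sources)}
    visited = {v: 0 for v in adj}
    for s in sources:
        visited[s] = bit[s]
    frontier = dict(bit)                 # vertex -> mask of searches there
    dist_sum = {s: 0 for s in sources}
    reached = {s: 1 for s in sources}
    level = 0
    while frontier:
        level += 1
        nxt = {}
        for v, mask in frontier.items():
            for u in adj[v]:
                new = mask & ~visited[u]  # searches reaching u for the first time
                if new:
                    visited[u] |= new
                    nxt[u] = nxt.get(u, 0) | new
        for v, mask in nxt.items():       # credit this level's discoveries
            for i, s in enumerate(sources):
                if mask >> i & 1:
                    dist_sum[s] += level
                    reached[s] += 1
        frontier = nxt
    return {s: (reached[s] - 1) / dist_sum[s] if dist_sum[s] else 0.0
            for s in sources}
```

The payoff is that the inner loop does one mask operation per edge regardless of how many of the concurrent searches traverse it, which is the effect the paper obtains with hardware vector units.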
PEGASUS: MINING BILLION-SCALE GRAPHS IN THE CLOUD
"... We have entered in an era of big data. Graphs are now measured in terabytes or even petabytes; analyzing them has become increasingly challenging. How do we find patterns and anomalies in these graphs that no longer fit in memory? How should we exploit parallel computation to boost our analysis capa ..."
Abstract

Cited by 1 (0 self)
We have entered an era of big data. Graphs are now measured in terabytes or even petabytes; analyzing them has become increasingly challenging. How do we find patterns and anomalies in these graphs that no longer fit in memory? How should we exploit parallel computation to boost our analysis capabilities? We present PEGASUS, the first open-source, peta-scale graph mining library for the HADOOP platform (the open-source implementation of MAPREDUCE). By observing that many graph mining operations can be described by repeated matrix-vector multiplications, we devised an important primitive for PEGASUS called GIM-V that applies to all such operations. GIM-V (Generalized Iterative Matrix-Vector multiplication) is highly optimized, achieving (1) good scale-up with the number of machines, (2) linear run time in the number of edges, and (3) more than 9 times faster performance than the non-optimized version. We ran experiments for PEGASUS on M45, one of the largest HADOOP clusters in the world. We report our findings on several real graphs with billions of nodes and edges. Selected findings include (a) the discovery of adult advertisers in the who-follows-whom graph of Twitter, and (b) the 7 degrees of separation in the Web graph.
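The GIM-V abstraction described above can be sketched serially: the user plugs in a combine2 (edge value × vector element), a combineAll (reduce the partial results arriving at a row), and an assign (merge old and new vector values), and the generalized product is iterated to a fixed point. This toy mirrors the abstraction only; the real primitive runs as MapReduce jobs on HADOOP, and the fixed-point loop and edge format here are our simplifications.

```python
def gim_v(edges, vec, combine2, combine_all, assign, max_iter=100):
    """Generalized iterative matrix-vector multiplication, GIM-V style.
    edges: list of (i, j) meaning M[i][j] = 1; vec: {j: value}."""
    for _ in range(max_iter):
        partial = {}
        for i, j in edges:                       # 'map': combine2 per edge
            partial.setdefault(i, []).append(combine2(1, vec[j]))
        new_vec = {}
        for i, old in vec.items():               # 'reduce': combineAll + assign
            new_vec[i] = assign(old, combine_all(partial.get(i, [old])))
        if new_vec == vec:                       # fixed point reached
            break
        vec = new_vec
    return vec

# example instantiation: connected components, as in the PEGASUS paper --
# propagate the minimum component id along edges until nothing changes
def components(edges, vertices):
    vec = {v: v for v in vertices}
    return gim_v(edges, vec,
                 combine2=lambda m, v: v,
                 combine_all=min,
                 assign=min)
```

Swapping the three operations yields the paper's other instantiations (e.g., PageRank-style iterations use weighted sums instead of minima), which is why a single optimized primitive covers so many graph mining operations.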