Results 1–10 of 33
Direction-Optimizing Breadth-First Search
Cited by 34 (4 self)
Abstract—Breadth-First Search is an important kernel used by many graph-processing applications. In many of these emerging applications of BFS, such as analyzing social networks, the input graphs are low-diameter and scale-free. We propose a hybrid approach that is advantageous for low-diameter graphs, which combines a conventional top-down algorithm with a novel bottom-up algorithm. The bottom-up algorithm can dramatically reduce the number of edges examined, which in turn accelerates the search as a whole. On a multi-socket server, our hybrid approach demonstrates speedups of 3.3–7.8 on a range of standard synthetic graphs and speedups of 2.4–4.6 on graphs from real social networks when compared to a strong baseline. We also typically double the performance of prior leading shared-memory (multicore and GPU) implementations.
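The hybrid search this abstract describes can be illustrated with a minimal serial sketch. The switching heuristic (`alpha`) and the dict-of-lists graph representation are illustrative assumptions, not the paper's tuned implementation:

```python
def direction_optimizing_bfs(adj, source, alpha=14):
    """Hybrid BFS over an undirected graph given as {vertex: [neighbors]}.

    Runs top-down while the frontier is small and switches to bottom-up
    once the frontier's outgoing edges are a large fraction of all edges
    (the threshold form and alpha=14 are illustrative assumptions)."""
    parent = {source: source}
    frontier = {source}
    total_edges = sum(len(nbrs) for nbrs in adj.values())
    while frontier:
        frontier_edges = sum(len(adj[v]) for v in frontier)
        next_frontier = set()
        if frontier_edges * alpha > total_edges:
            # Bottom-up: each unvisited vertex scans its own neighbors
            # looking for any parent in the frontier, stopping at the first.
            for v in adj:
                if v not in parent:
                    for u in adj[v]:
                        if u in frontier:
                            parent[v] = u
                            next_frontier.add(v)
                            break  # one parent suffices; skip remaining edges
        else:
            # Top-down: expand every edge leaving the frontier.
            for u in frontier:
                for v in adj[u]:
                    if v not in parent:
                        parent[v] = u
                        next_frontier.add(v)
        frontier = next_frontier
    return parent
```

The early `break` in the bottom-up phase is the source of the edge savings: a vertex with many neighbors in the frontier examines only one of them.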
A Flexible Open-Source Toolbox for Scalable Complex Graph Analysis
, 2011
Cited by 19 (3 self)
The Knowledge Discovery Toolbox (KDT) enables domain experts to perform complex analyses of huge datasets on supercomputers using a high-level language, without grappling with the difficulties of writing parallel code, calling parallel libraries, or becoming a graph expert. KDT provides a flexible Python interface to a small set of high-level graph operations; composing a few of these operations is often sufficient for a specific analysis. Scalability and performance are delivered by linking to a state-of-the-art backend compute engine that scales from laptops to large HPC clusters. KDT delivers very competitive performance from a general-purpose, reusable library for graphs on the order of 10 billion edges and greater. We demonstrate speedups of 1 and 2 orders of magnitude over PBGL and Pegasus, respectively, on some tasks. Examples from simple use cases and key graph-analytic benchmarks illustrate the productivity and performance realized by KDT users. Semantic graph abstractions provide both flexibility and high performance for real-world use cases. Graph-algorithm researchers benefit from the ability to develop algorithms quickly using KDT's graph and underlying matrix abstractions for distributed memory. KDT is available as open-source code to foster experimentation.
Highly Parallel Sparse Matrix-Matrix Multiplication
, 2010
Cited by 16 (4 self)
Generalized sparse matrix-matrix multiplication is a key primitive for many high-performance graph algorithms as well as some linear solvers such as multigrid. We present the first parallel algorithms that achieve increasing speedups for an unbounded number of processors. Our algorithms are based on two-dimensional block distribution of sparse matrices, where serial sections use a novel hypersparse kernel for scalability. We give a state-of-the-art MPI implementation of one of our algorithms. Our experiments show scaling up to thousands of processors on a variety of test scenarios.
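As a point of reference for what generalized sparse matrix-matrix multiplication computes, here is a minimal serial Gustavson-style sketch. The dict-of-dicts sparse format is an illustrative assumption; the paper's actual contribution is the parallel 2D-block algorithm with a hypersparse serial kernel, which this does not attempt to reproduce:

```python
def spgemm(A, B):
    """Serial row-by-row sparse product C = A * B.

    A and B map row index -> {column index: value}, storing only
    nonzeros. Each output row accumulates scaled rows of B, which is
    the classic Gustavson formulation of SpGEMM."""
    C = {}
    for i, row in A.items():
        acc = {}
        for k, a_ik in row.items():
            # Row i of C gathers row k of B scaled by A[i][k].
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:  # keep C sparse: omit all-zero rows
            C[i] = acc
    return C
```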
Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search
Cited by 7 (1 self)
Abstract—Breadth-first search (BFS) is a fundamental graph primitive frequently used as a building block for many complex graph algorithms. In the worst case, the complexity of BFS is linear in the number of edges and vertices, and the conventional top-down approach always takes as much time as the worst case. A recently discovered bottom-up approach manages to cut down the complexity all the way to the number of vertices in the best case, which is typically at least an order of magnitude less than the number of edges. The bottom-up approach is not always advantageous, so it is combined with the top-down approach to make the direction-optimizing algorithm, which adaptively switches from top-down to bottom-up as the frontier expands. We present a scalable distributed-memory parallelization of this challenging algorithm and show up to an order of magnitude speedup compared to an earlier purely top-down code. Our approach also uses a 2D decomposition of the graph that has previously been shown to be superior to a 1D decomposition. Using the default parameters of the Graph500 benchmark, our new algorithm achieves a performance rate of over 240 billion edges per second on 115 thousand cores of a Cray XE6, which makes it over 7× faster than a conventional top-down algorithm using the same set of optimizations and data distribution.
Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory
Cited by 6 (1 self)
Abstract—We present techniques to process large scale-free graphs in distributed memory. Our aim is to scale to trillions of edges, and our research is targeted at leadership-class supercomputers and clusters with local non-volatile memory, e.g., NAND flash. We apply an edge-list partitioning technique designed to accommodate high-degree vertices (hubs) that create scaling challenges when processing scale-free graphs. In addition to partitioning hubs, we use ghost vertices to represent the hubs to reduce communication hotspots. We present a scaling study with three important graph algorithms: Breadth-First Search (BFS), K-Core decomposition, and Triangle Counting. We also demonstrate scalability on BG/P Intrepid by comparing to the best known Graph500 results [1]. We show results on two clusters with local NVRAM storage that are capable of traversing trillion-edge scale-free graphs. By leveraging node-local NAND flash, our approach can process thirty-two times larger datasets with only a 39% performance degradation in Traversed Edges Per Second (TEPS).
Keywords—parallel algorithms; graph algorithms; big data; distributed computing.
Portable Parallel Performance from Sequential, Productive, Embedded Domain-Specific Languages
In: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’12)
Cited by 5 (2 self)
Domain-expert productivity programmers desire scalable application performance, but usually must rely on efficiency programmers who are experts in explicit parallel programming to achieve it. Since such programmers are rare, to maximize reuse of their work we propose encapsulating their strategies in mini-compilers for domain-specific embedded languages (DSELs) glued together by a common high-level host language familiar to productivity programmers. The non-trivial applications that use these DSELs achieve up to 98% of peak attainable performance, comparable to or better than existing hand-coded implementations. Our approach is unique in that each mini-compiler not only performs conventional compiler transformations and optimizations, but also includes imperative procedural code that captures an efficiency expert's strategy for mapping a narrow domain onto a specific type of hardware. The result is source and performance portability for productivity programmers, and parallel performance that rivals that of hand-coded efficiency-language implementations of the same applications. We describe a framework that supports our methodology and five implemented DSELs supporting common computation kernels. Our results demonstrate that, for several interesting classes of problems, efficiency-level parallel performance can be achieved by packaging efficiency programmers' expertise in a reusable framework that is easy to use for both productivity programmers and efficiency programmers.
Toward a Distance Oracle for Billion-Node Graphs
, 2013
Cited by 3 (0 self)
The emergence of real-life graphs with billions of nodes poses significant challenges for managing and querying these graphs. One of the fundamental queries submitted to graphs is the shortest-distance query. Online BFS (breadth-first search) and offline precomputation of pairwise shortest distances are prohibitive in time or space complexity for billion-node graphs. In this paper, we study the feasibility of building distance oracles for billion-node graphs. A distance oracle provides approximate answers to shortest-distance queries by using a precomputed data structure for the graph. Sketch-based distance oracles are good candidates because they assign each vertex a sketch of bounded size, which means they have linear space complexity. However, state-of-the-art sketch-based distance oracles lack efficiency or accuracy when dealing with big graphs. In this paper, we address the scalability and accuracy issues by focusing on optimizing the three key factors that affect the performance of distance oracles: landmark selection, distributed BFS, and answer generation. We conduct extensive experiments on both real networks and synthetic networks to show that we can build distance oracles of affordable cost and efficiently answer shortest-distance queries even for billion-node graphs.
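The sketch-based scheme this abstract outlines can be illustrated in miniature: precompute each vertex's BFS distance to a few landmarks, then answer a query with the triangle-inequality upper bound through the best landmark. The landmark choice, helper names, and dict-based graph format here are illustrative assumptions, not the paper's optimized design:

```python
from collections import deque

def bfs_distances(adj, src):
    """Unweighted single-source shortest distances from src."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def build_sketches(adj, landmarks):
    """One BFS per landmark; each vertex's sketch is its list of
    distances to the landmarks (None if unreachable)."""
    per_landmark = [bfs_distances(adj, l) for l in landmarks]
    return {v: [d.get(v) for d in per_landmark] for v in adj}

def query(sketches, u, v):
    """Approximate dist(u, v): min over landmarks l of
    dist(u, l) + dist(l, v), an upper bound on the true distance."""
    return min(du + dv
               for du, dv in zip(sketches[u], sketches[v])
               if du is not None and dv is not None)
```

Space is linear in the number of vertices times the number of landmarks, which is the scalability argument the abstract makes; accuracy then hinges on landmark selection.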
Task-Based Parallel Breadth-First Search in Heterogeneous Environments
Cited by 2 (0 self)
Abstract—Breadth-first search (BFS) is an essential graph traversal strategy widely used in many computing applications. Because of its irregular data access patterns, BFS has become a non-trivial problem that is hard to parallelize efficiently. In this paper, we introduce a parallelization strategy that allows load balancing of computation resources as well as the execution of graph traversals in hybrid environments composed of CPUs and GPUs. To achieve that goal, we use a fine-grained task-based parallelization scheme and the OmpSs programming model. We obtain processing rates of up to 2.8 billion traversed edges per second with a single GPU and a multicore processor. Our study shows that high processing rates are achievable in hybrid environments despite GPU communication latency and memory coherence overheads.
Parallelization of Reordering Algorithms for Bandwidth and Wavefront Reduction
Cited by 1 (0 self)
Abstract—Many sparse matrix computations can be sped up if the matrix is first reordered. Reordering was originally developed for direct methods, but it has recently become popular for improving the cache locality of parallel iterative solvers, since reordering the matrix to reduce bandwidth and wavefront can improve the locality of reference of sparse matrix-vector multiplication (SpMV), the key kernel in iterative solvers. In this paper, we present the first parallel implementations of two widely used reordering algorithms: Reverse Cuthill-McKee (RCM) and Sloan. On 16 cores of the Stampede supercomputer, our parallel RCM is 5.56 times faster on average than a state-of-the-art sequential implementation of RCM in the HSL library. Sloan is significantly more constrained than RCM, but our parallel implementation achieves a speedup of 2.88× on average over sequential HSL-Sloan. Reordering the matrix using our parallel RCM and then performing 100 SpMV iterations is twice as fast as using HSL-RCM and then performing the SpMV iterations; it is also 1.5 times faster than performing the SpMV iterations without reordering the matrix.
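For orientation, the textbook serial form of RCM that this paper parallelizes looks roughly as follows. This is only a sketch on an adjacency-dict graph; start-vertex selection and tie-breaking are simplified assumptions (production implementations use pseudo-peripheral starting vertices):

```python
from collections import deque

def rcm_order(adj):
    """Serial Reverse Cuthill-McKee on {vertex: [neighbors]}.

    BFS from a low-degree vertex, enqueueing each vertex's unvisited
    neighbors in order of increasing degree, then reverse the visit
    order. The result is a permutation that tends to cluster nonzeros
    near the diagonal, reducing bandwidth."""
    order = []
    visited = set()
    # Handle disconnected graphs: restart from the lowest-degree
    # unvisited vertex (a simplification of pseudo-peripheral search).
    for start in sorted(adj, key=lambda v: len(adj[v])):
        if start in visited:
            continue
        visited.add(start)
        q = deque([start])
        while q:
            u = q.popleft()
            order.append(u)
            for v in sorted(adj[u], key=lambda w: len(adj[w])):
                if v not in visited:
                    visited.add(v)
                    q.append(v)
    order.reverse()  # the "Reverse" in RCM
    return order
```

The parallelization challenge the abstract alludes to is visible even here: the strict degree-ordered BFS visit order serializes the traversal, which is what the paper's parallel formulation relaxes.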