Results 1-10 of 47
Scalable GPU graph traversal
In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '12), 2012
Cited by 62 (1 self)
Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(V+E) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
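The prefix-sum-based frontier management this abstract alludes to can be illustrated with a minimal sequential Python sketch (a CPU stand-in for the paper's GPU kernels; `bfs_prefix_sum` and the dict-of-lists graph format are illustrative assumptions, not the paper's API). An exclusive prefix sum over neighbor counts gives each frontier vertex its write offset into the gathered edge list, which is how parallel threads could write without atomics:

```python
from itertools import accumulate

def bfs_prefix_sum(adj, source):
    """Level-synchronous BFS; the next frontier is gathered via an
    exclusive prefix sum over per-vertex neighbor counts."""
    dist = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        degrees = [len(adj[v]) for v in frontier]
        # Exclusive prefix sum: offsets[i] is where vertex i's
        # neighbors start in the gathered edge list.
        offsets = [0] + list(accumulate(degrees))
        gathered = [None] * offsets[-1]
        for i, v in enumerate(frontier):
            for j, w in enumerate(adj[v]):
                gathered[offsets[i] + j] = w
        # Filter already-visited neighbors to form the new frontier.
        frontier = []
        for w in gathered:
            if w not in dist:
                dist[w] = level
                frontier.append(w)
    return dist
```

Each round does work proportional to the frontier's edge count, so the total over all rounds is the O(V+E) bound the abstract cites.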
Efficient parallel graph exploration for multi-core CPU and GPU
In IEEE PACT, 2011
Cited by 31 (1 self)
Graphs are a fundamental data representation that have been used extensively in various domains. In graph-based applications, a systematic exploration of the graph, such as a breadth-first search (BFS), often serves as a key component in the processing of their massive data sets. In this paper, we present a new method for implementing the parallel BFS algorithm on multi-core CPUs which exploits a fundamental property of randomly shaped real-world graph instances. By utilizing memory bandwidth more efficiently, our method shows improved performance over the current state-of-the-art implementation and increases its advantage as the size of the graph increases. We then propose a hybrid method which, for each level of the BFS algorithm, dynamically chooses the best implementation from: a sequential execution, two different methods of multi-core execution, and a GPU execution. Such a hybrid approach provides the best performance for each graph size while avoiding poor worst-case performance on high-diameter graphs. Finally, we study the effects of the underlying architecture on BFS performance by comparing multiple CPU and GPU systems; a high-end GPU system performed as well as a quad-socket high-end CPU system.
CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization
Cited by 21 (2 self)
As the computational power of GPUs continues to scale with Moore's Law, an increasing number of applications are becoming limited by memory bandwidth. We propose an approach for programming GPUs with tightly-coupled specialized DMA warps for performing memory transfers between on-chip and off-chip memories. Separate DMA warps improve memory bandwidth utilization by better exploiting available memory-level parallelism and by leveraging efficient inter-warp producer-consumer synchronization mechanisms. DMA warps also improve programmer productivity by decoupling the need for thread array shapes to match data layout. To illustrate the benefits of this approach, we present an extensible API, CudaDMA, that encapsulates synchronization and common sequential and strided data transfer patterns. Using CudaDMA, we demonstrate speedups of up to 1.37x on representative synthetic microbenchmarks, and 1.15x-3.2x on several kernels from scientific applications written in CUDA running on NVIDIA Fermi GPUs.
Speeding up Large-Scale Point-in-Polygon Test Based Spatial Join on GPUs. Technical report, online at http://geoteci.engr.ccny.cuny.edu/pub/pipsp_tr.pdf
Cited by 20 (15 self)
The Point-in-Polygon (PIP) test is fundamental to spatial databases and GIS. Motivated by the slow response times in joining large-scale point locations with polygons using traditional spatial databases and GIS, and by the massively data-parallel computing power of commodity GPU devices, we have designed and developed an end-to-end system completely on GPUs to associate points with the polygons that they fall within. The system includes an efficient module to generate point quadrants that have at most K points from large-scale unordered points, a simple grid-file based spatial filtering approach to associate point quadrants and polygons, and a PIP test module to assign polygons to points in a GPU computing block using both block- and thread-level parallelism. Experiments on joining 170 million points with more than 40 thousand polygons have resulted in a runtime of 11.165 seconds on an Nvidia Quadro 6000 GPU device. Compared with a baseline serial CPU implementation using state-of-the-art open source GIS packages, which requires 15.223 hours to complete, a speedup of 4,910X has been achieved. We further discuss several factors and parameters that may affect the system performance.
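The per-point primitive underlying such a system is the classic ray-casting (even-odd) PIP test; a minimal sequential Python sketch (the function name and polygon format are illustrative, not taken from the report):

```python
def point_in_polygon(px, py, poly):
    """Ray-casting (even-odd) test: cast a ray to the right of
    (px, py) and count edge crossings; an odd count means inside.
    poly is a list of (x, y) vertices in order."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does this edge straddle the horizontal line y = py?
        if (y1 > py) != (y2 > py):
            # x-coordinate where the edge crosses that line.
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside
```

The test is independent per point, which is what makes the join amenable to the block- and thread-level parallelism described above.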
A Memory Access Model for Highly-Threaded Many-core Architectures
Submitted and accepted by ICPADS'2012, 2012
Cited by 10 (1 self)
Many-core architectures are excellent at hiding memory-access latency through low-overhead context switching among a large number of threads. The speedup of algorithms carried out on these machines depends on how well the latency is hidden. If the number of threads were infinite, then theoretically these machines should provide the performance predicted by the PRAM analysis of the programs. However, the number of allowable threads per processor is not infinite. In this paper, we introduce the Threaded Many-core Memory (TMM) model, which is meant to capture the important characteristics of these highly-threaded, many-core machines. Since we model some important machine parameters of these machines, we expect analysis under this model to give more fine-grained performance prediction than the PRAM analysis. We analyze four algorithms for the classic all-pairs shortest paths problem under this model. We find that even when two algorithms have the same PRAM performance, our model predicts different performance for some settings of machine parameters. For example, for dense graphs, the Floyd-Warshall algorithm and Johnson's algorithm have the same performance in the PRAM model. However, our model predicts different performance for large enough memory-access latency and validates the intuition that the Floyd-Warshall algorithm performs better on these machines.
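For reference, the Floyd-Warshall algorithm the paper analyzes can be sketched in a few lines of Python (a plain sequential version; the paper's contribution is the TMM analysis of such algorithms, not this implementation):

```python
def floyd_warshall(w):
    """All-pairs shortest paths. w is an n x n matrix of edge
    weights: float('inf') where no edge, 0 on the diagonal.
    Each of the n rounds is a dense, regular pass over the whole
    matrix -- the kind of predictable access pattern that favors
    latency hiding on highly threaded machines."""
    n = len(w)
    d = [row[:] for row in w]  # copy; do not mutate the input
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Relax path i -> k -> j against the current best.
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```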
A yoke of oxen and a thousand chickens for heavy lifting graph processing
In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT '12)
Cited by 9 (2 self)
Large, real-world graphs are famously difficult to process efficiently. Not only do they have a large memory footprint, but most graph processing algorithms entail memory access patterns with poor locality, data-dependent parallelism, and a low compute-to-memory access ratio. Additionally, most real-world graphs have a low diameter and a highly heterogeneous node degree distribution. Partitioning these graphs while simultaneously achieving access locality and load balancing is difficult if not impossible. This paper demonstrates the feasibility of graph processing on heterogeneous (i.e., including both CPUs and GPUs) platforms as a cost-effective approach towards addressing the graph processing challenges above. To this end, this work (i) presents and evaluates a performance model that estimates the achievable performance on heterogeneous platforms; (ii) introduces TOTEM, a processing engine based on the Bulk Synchronous Parallel (BSP) model that offers a convenient environment to simplify the implementation of graph algorithms on heterogeneous platforms; and (iii) demonstrates TOTEM's efficiency by implementing and evaluating two graph algorithms (PageRank and breadth-first search). TOTEM achieves speedups close to the model's prediction, and applies a number of optimizations that enable linear speedups with respect to the share of the graph offloaded for processing to accelerators.
Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search
Cited by 7 (1 self)
Breadth-first search (BFS) is a fundamental graph primitive frequently used as a building block for many complex graph algorithms. In the worst case, the complexity of BFS is linear in the number of edges and vertices, and the conventional top-down approach always takes as much time as the worst case. A recently discovered bottom-up approach manages to cut down the complexity all the way to the number of vertices in the best case, which is typically at least an order of magnitude less than the number of edges. The bottom-up approach is not always advantageous, so it is combined with the top-down approach to make the direction-optimizing algorithm, which adaptively switches from top-down to bottom-up as the frontier expands. We present a scalable distributed-memory parallelization of this challenging algorithm and show up to an order of magnitude speedup compared to an earlier purely top-down code. Our approach also uses a 2D decomposition of the graph that has previously been shown to be superior to a 1D decomposition. Using the default parameters of the Graph500 benchmark, our new algorithm achieves a performance rate of over 240 billion edges per second on 115 thousand cores of a Cray XE6, which makes it over 7x faster than a conventional top-down algorithm using the same set of optimizations and data distribution.
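The direction-optimizing idea can be sketched as a small shared-memory Python program (not the paper's distributed, 2D-decomposed implementation; the function name and the simple frontier-size threshold `alpha` are illustrative assumptions, not the Graph500-tuned heuristic):

```python
def direction_optimizing_bfs(adj, source, alpha=0.25):
    """BFS over adj (list of neighbor lists, vertices 0..n-1).
    Runs top-down while the frontier is small, switches to
    bottom-up once it holds more than an alpha fraction of the
    vertices. Returns a parent map (parent[source] == source)."""
    n = len(adj)
    parent = {source: source}
    frontier = {source}
    while frontier:
        nxt = set()
        if len(frontier) <= alpha * n:
            # Top-down: expand every frontier vertex's edges.
            for v in frontier:
                for w in adj[v]:
                    if w not in parent:
                        parent[w] = v
                        nxt.add(w)
        else:
            # Bottom-up: each unvisited vertex scans its own
            # neighbors and stops at the first one in the frontier,
            # skipping the bulk of the edge list when it succeeds.
            for v in range(n):
                if v in parent:
                    continue
                for w in adj[v]:
                    if w in frontier:
                        parent[v] = w
                        nxt.add(v)
                        break
        frontier = nxt
    return parent
```

The early `break` in the bottom-up phase is the source of the best-case savings described above: a vertex inspects only as many edges as it takes to find a visited neighbor.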
Design and Evaluation of the GeMTC Framework for GPU-enabled Many-Task Computing
Cited by 7 (5 self)
We present the design and first performance and usability evaluation of GeMTC, a novel execution model and runtime system that enables accelerators to be programmed with many concurrent and independent tasks of potentially short or variable duration. With GeMTC, a broad class of such "many-task" applications can leverage the increasing number of accelerated and hybrid high-end computing systems. GeMTC overcomes the obstacles to using GPUs in a many-task manner by scheduling and launching independent tasks on hardware designed for SIMD-style vector processing. We demonstrate the use of a high-level MTC programming model (the Swift parallel dataflow language) to run tasks on many accelerators and thus provide a high-productivity programming model for the growing number of supercomputers that are accelerator-enabled. While still in an experimental stage, GeMTC can already support tasks of fine (sub-second) granularity and execute concurrent heterogeneous tasks on 86,000 independent GPU warps spanning 2.7M GPU threads on the Blue Waters supercomputer.
Singe: Leveraging warp specialization for high performance on GPUs
2014
Cited by 7 (2 self)
We present Singe, a Domain Specific Language (DSL) compiler for combustion chemistry that leverages warp specialization to produce high-performance code for GPUs. Instead of relying on traditional GPU programming models that emphasize data-parallel computations, warp specialization allows compilers like Singe to partition computations into sub-computations which are then assigned to different warps within a thread block. Fine-grain synchronization between warps is performed efficiently in hardware using producer-consumer named barriers. Partitioning computations using warp specialization allows Singe to deal efficiently with the irregularity in both data access patterns and computation. Furthermore, warp-specialized partitioning of computations allows Singe to fit extremely large working sets into on-chip memories. Finally, we describe the architecture and general compilation techniques necessary for constructing a warp-specializing compiler. We show that the warp-specialized code emitted by Singe is up to 3.75X faster than previously optimized data-parallel GPU kernels.
On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest
Cited by 5 (1 self)
Graph processing has gained renewed attention. The increasingly large scale and wealth of connected data, such as those accrued by social network applications, demand the design of new techniques and platforms to efficiently derive actionable information from large-scale graphs. Hybrid systems that host processing units optimized for both fast sequential processing and bulk processing (e.g., GPU-accelerated systems) have the potential to cope with the heterogeneous structure of real graphs and enable high-performance graph processing. Reaching this point, however, poses multiple challenges. The heterogeneity of the processing elements (e.g., GPUs implement a different parallel processing model than CPUs and have much less memory) and the inherent irregularity of graph workloads require careful graph partitioning and load assignment. In particular, the workload generated by a partitioning scheme should match the strength of the processing element the partition is allocated to. This work explores the feasibility and quantifies the performance gains of such low-cost partitioning schemes. We propose to partition the workload between the two types of processing elements based on vertex connectivity. We show that such partitioning schemes offer a simple, yet efficient way to boost the overall performance of the hybrid system. Our evaluation illustrates that processing a 4-billion-edge graph on a system with one CPU socket and one GPU, while offloading as little as 25% of the edges to the GPU, achieves a 2x performance improvement over state-of-the-art implementations running on a dual-socket symmetric system. Moreover, for the same graph, a hybrid system with dual sockets and dual GPUs is capable of 1.13 billion breadth-first search traversed edges per second, a performance rate that is competitive with the latest entries in the Graph500 list, yet at a much lower price point.
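A connectivity-based split of the kind this abstract proposes might be sketched roughly as follows (a deliberate simplification: `split_by_degree`, the dict-of-lists graph format, and the greedy low-degree-first policy are assumptions for illustration; the 25% default only echoes the offload share quoted in the abstract):

```python
def split_by_degree(adj, gpu_edge_share=0.25):
    """Assign the many low-degree vertices to the GPU partition
    until it holds roughly gpu_edge_share of all edge endpoints;
    the few high-degree vertices stay on the CPU, which copes
    better with their irregular, bulky adjacency lists."""
    total = sum(len(nbrs) for nbrs in adj.values())
    gpu, cpu, acc = set(), set(), 0
    # Visit vertices from lowest to highest degree.
    for v in sorted(adj, key=lambda v: len(adj[v])):
        if acc + len(adj[v]) <= gpu_edge_share * total:
            gpu.add(v)
            acc += len(adj[v])
        else:
            cpu.add(v)
    return gpu, cpu
```

The appeal of such a scheme is its cost: a single degree sort, rather than an expensive locality-aware graph partitioner.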