Results 1–10 of 149
Rodinia: A Benchmark Suite for Heterogeneous Computing
2009
Cited by 200 (17 self)
This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multicore CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insights, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Analyzing CUDA workloads using a detailed GPU simulator
In Proceedings of the International Symposium on Performance Analysis of Systems and Software, 2009
Cited by 168 (8 self)
Modern Graphics Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding trade-offs among memory, data, and thread-level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data-level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices including choice of interconnect topology, use of caches, design of memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth than to latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.
Scalable GPU graph traversal
In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '12), 2012
Cited by 64 (1 self)
Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum that achieves an asymptotically optimal O(V+E) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
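The prefix-sum-driven frontier expansion described above can be illustrated with a sequential Python sketch (an illustration of the general technique only, not the authors' CUDA code; all names are ours): degrees of the current frontier are turned into write offsets by an exclusive prefix sum, neighbors are scattered into a compact queue at those offsets, and unvisited neighbors form the next frontier.

```python
from itertools import accumulate

def bfs_prefix_sum(adj, src):
    """Level-synchronous BFS; neighbor gathering uses prefix-sum offsets."""
    depth = [-1] * len(adj)
    depth[src] = 0
    frontier, level = [src], 0
    while frontier:
        degs = [len(adj[v]) for v in frontier]
        offsets = [0] + list(accumulate(degs))   # exclusive prefix sum
        out = [0] * offsets[-1]                  # compact neighbor queue
        for i, v in enumerate(frontier):         # on a GPU, one task per vertex
            for j, w in enumerate(adj[v]):
                out[offsets[i] + j] = w          # contention-free scatter
        level += 1
        frontier = []
        for w in out:                            # status-array filtering
            if depth[w] == -1:
                depth[w] = level
                frontier.append(w)
    return depth
```

The scatter step is what keeps the work O(V+E): every edge is written exactly once per traversal, with no per-level rescan of the whole vertex set.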
CUDA cuts: Fast graph cuts on the GPU
In Computer Vision and Pattern Recognition Workshops, IEEE Computer Society
Cited by 63 (8 self)
Graph cuts have become a powerful and popular optimization tool for energies defined over an MRF and have found applications in image segmentation, stereo vision, image restoration, etc. The max-flow/min-cut algorithm used to compute graph cuts is computationally expensive. The best reported implementation of it takes over 140 milliseconds even on images of size 640×480 for two labels and cannot be used for real-time applications. The commodity Graphics Processing Unit (GPU) has recently emerged as an economical and fast parallel co-processor. In this paper, we present an implementation of the push-relabel algorithm for graph cuts on the GPU. We show our results on some benchmark datasets and some synthetic images. We can perform over 25 graph cuts per second on 640×480 benchmark images and over 35 graph cuts per second on 1K×1K synthetic images on an Nvidia GTX 280. The time for each complete graph cut is a few milliseconds when only a few edge weights change from the previous graph, as in dynamic graphs. The CUDA code with a well-defined interface can be downloaded from
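For reference, the push-relabel algorithm the paper parallelizes can be sketched as a minimal sequential Python version (generic variant, not the GPU implementation; the dense capacity-matrix representation and all names are ours):

```python
def push_relabel_max_flow(cap, s, t):
    """Generic push-relabel max-flow on a dense capacity matrix cap[u][v]."""
    n = len(cap)
    flow = [[0] * n for _ in range(n)]
    excess = [0] * n
    height = [0] * n
    height[s] = n                       # source starts lifted above all nodes
    for v in range(n):                  # saturate every source edge
        if cap[s][v] > 0:
            flow[s][v] = cap[s][v]
            flow[v][s] = -cap[s][v]
            excess[v] = cap[s][v]
    active = [v for v in range(n) if v not in (s, t) and excess[v] > 0]
    while active:
        u = active[0]
        pushed = False
        for v in range(n):
            residual = cap[u][v] - flow[u][v]
            if residual > 0 and height[u] == height[v] + 1:
                d = min(excess[u], residual)   # push d units along (u, v)
                flow[u][v] += d
                flow[v][u] -= d
                excess[u] -= d
                excess[v] += d
                if v not in (s, t) and v not in active:
                    active.append(v)
                pushed = True
                if excess[u] == 0:
                    break
        if not pushed:                  # relabel: lift u just above lowest residual neighbor
            height[u] = 1 + min(height[v] for v in range(n)
                                if cap[u][v] - flow[u][v] > 0)
        if excess[u] == 0:
            active.pop(0)
    return sum(flow[s][v] for v in range(n))
```

The GPU version exploits the fact that pushes and relabels are local operations on a vertex and its neighbors, which maps naturally onto one-thread-per-pixel grids.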
Accelerating CUDA graph algorithms at maximum warp
In PPoPP, 2011
Cited by 49 (3 self)
Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems, but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance. Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single-thread CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multithreaded CPU graph implementations is primarily due to the large difference in memory bandwidth.
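The virtual warp-centric idea can be mimicked with a serial Python simulation (a sketch of the work assignment only; it covers the same edges a real SIMD warp would but not in lockstep order, and all parameter names are ours): a physical warp is split into virtual warps, each virtual warp owns one frontier vertex, and its lanes stride over that vertex's adjacency list, trading ALU underutilization against work imbalance via the virtual-warp size.

```python
def virtual_warp_gather(adj, frontier, warp_size=32, vwarp_size=4):
    """Serial simulation of virtual-warp edge gathering.

    Each physical warp is split into warp_size // vwarp_size virtual warps;
    a virtual warp owns one frontier vertex, and its lanes stride over
    that vertex's adjacency list.
    """
    vw_per_warp = warp_size // vwarp_size
    gathered = []
    for base in range(0, len(frontier), vw_per_warp):   # one physical warp
        for vw in range(vw_per_warp):                    # its virtual warps
            idx = base + vw
            if idx >= len(frontier):
                break
            v = frontier[idx]
            for lane in range(vwarp_size):               # lanes of the virtual warp
                for e in range(lane, len(adj[v]), vwarp_size):
                    gathered.append(adj[v][e])
    return gathered
```

A small vwarp_size processes more vertices per warp (good for low-degree, imbalanced graphs); a large one keeps more lanes busy per edge list (good for high-degree vertices).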
Inter-Block GPU Communication via Fast Barrier Synchronization
Cited by 39 (2 self)
While GPGPU stands for general-purpose computation on graphics processing units, the lack of explicit support for inter-block communication on the GPU arguably hampers its broader adoption as a general-purpose computing device. Inter-block communication on the GPU occurs via global memory and then requires barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. Currently, such synchronization is only available via the CPU, which in turn can incur significant overhead. We propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. We then evaluate the efficacy of each approach via a microbenchmark as well as three well-known algorithms: Fast Fourier Transform (FFT), dynamic programming, and bitonic sort. For the microbenchmark, the experimental results show that our GPU lock-free synchronization performs 8.4 times faster than CPU explicit synchronization and 4.0 times faster than CPU implicit synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, our GPU lock-free synchronization further improves performance by 10%, 26%, and 40%, respectively, and ultimately delivers an overall speedup of 70x, 13x, and 24x, respectively.
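The lock-free flavor of such a barrier can be mimicked on the CPU with threads standing in for blocks (a sketch of the idea only, with assumptions of ours: per-block arrival and release flag arrays, a designated block 0 that releases the rest, and Python list-element writes being atomic under the GIL; a real GPU version uses volatile global-memory flags instead of a single shared counter to avoid atomic contention):

```python
import threading

def run_with_lockfree_barrier(num_blocks, num_phases, body):
    """Run body(bid, phase) on num_blocks threads with a flag-array barrier
    between phases; no lock or atomic counter is used."""
    arrive = [0] * num_blocks    # one arrival slot per block
    release = [0] * num_blocks   # one release slot per block

    def block(bid):
        for phase in range(1, num_phases + 1):
            body(bid, phase)
            arrive[bid] = phase                      # announce arrival
            if bid == 0:
                while any(a < phase for a in arrive):
                    pass                              # spin until all arrive
                for i in range(num_blocks):
                    release[i] = phase                # release everyone
            while release[bid] < phase:
                pass                                  # spin until released

    threads = [threading.Thread(target=block, args=(b,))
               for b in range(num_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Spreading arrival flags across slots, rather than incrementing one shared counter, is what removes the serialization point that a lock-based barrier suffers from.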
Parallel breadthfirst search on distributed memory systems
2011
Cited by 33 (9 self)
Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed-memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse-matrix-partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex-based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny-Cours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.
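The level-synchronous, vertex-partitioned strategy can be sketched as a single-process simulation (the block partition, owner function, and outboxes are illustrative constructs of ours; a real implementation expands only locally owned vertices and exchanges the outboxes with MPI collectives):

```python
def partitioned_bfs(adj, src, nparts):
    """Level-synchronous BFS with owner-filtered frontier exchange."""
    n = len(adj)
    owner = lambda v: v * nparts // n   # simple block partition of vertex IDs
    depth = [-1] * n
    depth[src] = 0
    frontier = {src}
    level = 0
    while frontier:
        # expansion step: route each discovered neighbor to its owner's outbox
        outbox = [set() for _ in range(nparts)]
        for v in frontier:
            for w in adj[v]:
                outbox[owner(w)].add(w)
        # communication + filtering step: each owner admits only unvisited
        # vertices it owns into the next frontier
        level += 1
        frontier = set()
        for p in range(nparts):
            for w in outbox[p]:
                if depth[w] == -1:
                    depth[w] = level
                    frontier.add(w)
    return depth
```

The per-owner filtering is the point of the partitioning: duplicate discoveries of the same vertex are deduplicated at its owner, so each vertex is visited once no matter how many parts discover it in the same level.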
Efficient parallel graph exploration on multicore CPU and GPU
In IEEE PACT, 2011
Cited by 32 (1 self)
Graphs are a fundamental data representation that have been used extensively in various domains. In graph-based applications, a systematic exploration of the graph, such as a breadth-first search (BFS), often serves as a key component in the processing of their massive data sets. In this paper, we present a new method for implementing the parallel BFS algorithm on multicore CPUs which exploits a fundamental property of randomly shaped real-world graph instances. By utilizing memory bandwidth more efficiently, our method shows improved performance over the current state-of-the-art implementation and increases its advantage as the size of the graph increases. We then propose a hybrid method which, for each level of the BFS algorithm, dynamically chooses the best implementation from: a sequential execution, two different methods of multicore execution, and a GPU execution. Such a hybrid approach provides the best performance for each graph size while avoiding poor worst-case performance on high-diameter graphs. Finally, we study the effects of the underlying architecture on BFS performance by comparing multiple CPU and GPU systems; a high-end GPU system performed as well as a quad-socket high-end CPU system.
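The per-level dispatch can be sketched as a simple frontier-size heuristic (the thresholds and return labels below are invented for illustration; the paper selects among its concrete implementations using its own criteria):

```python
def hybrid_bfs_step_choice(frontier_size, n, gpu_threshold=0.01, seq_threshold=64):
    """Pick a BFS-level implementation from the frontier size.

    Heuristic sketch: tiny frontiers (common on high-diameter graphs) are
    cheapest sequentially; large frontiers amortize GPU launch overhead;
    everything in between runs multicore.
    """
    if frontier_size < seq_threshold:
        return "sequential"
    if frontier_size > gpu_threshold * n:
        return "gpu"
    return "multicore"
```

Because the choice is re-evaluated at every level, a single traversal can start sequentially on a thin frontier, switch to the GPU at the frontier's peak, and fall back as it drains.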
Fast Combinatorial Vector Field Topology
IEEE Transactions on Visualization and Computer Graphics, 2010
Cited by 24 (0 self)
This paper introduces a novel approximation algorithm for the fundamental graph problem of combinatorial vector field topology (CVT). CVT is a combinatorial approach built on the sound theoretical basis given by Forman's work on a discrete Morse theory for dynamical systems. A computational framework for this mathematical model of vector field topology has been developed recently. The applicability of this framework is, however, severely limited by the quadratic complexity of its main computational kernel. In this work, we present an approximation algorithm for CVT with a significantly lower complexity. This new algorithm reduces the runtime by several orders of magnitude and maintains the main advantages of CVT over the continuous approach. Due to the simplicity of our algorithm, it can be easily parallelized to further improve the runtime.
Characterizing and Improving the Use of Demand-fetched Caches in GPUs
In ICS, 2012
Cited by 23 (1 self)
Initially introduced as special-purpose accelerators for games and graphics code, graphics processing units (GPUs) have emerged as widely used high-performance parallel computing platforms. GPUs traditionally provided only software-managed local memories (or scratchpads) instead of demand-fetched caches. Increasingly, however, GPUs are being used in broader application domains where memory access patterns are both harder to analyze and harder to manage in software-controlled caches. In response, GPU vendors have included sizable demand-fetched caches in recent chip designs. Nonetheless, several problems remain. First, since these hardware caches are quite new and highly configurable, it can be difficult to know when and how to use them; they sometimes degrade performance instead of improving it. Second, since GPU programming is quite distinct from general-purpose programming, application programmers do not yet have solid intuition about which memory reference patterns are amenable to demand-fetched caches. In response, this paper characterizes application performance on GPUs with caches and provides a taxonomy for reasoning about different types of access patterns and locality. Based on this taxonomy, we present an algorithm which can be automated and applied at compile time to identify an application's memory access patterns and to use that information to intelligently configure cache usage to improve application performance. Experiments on real GPU systems show that our algorithm reliably predicts when GPU caches will help or hurt performance. Compared to always passively turning caches on, our method can increase the average benefit of caches from 5.8% to 18.0% for applications that have significant performance sensitivity to caching.
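One way to approximate the "will caching help?" question is to replay a memory-access trace through a small LRU cache model and measure the hit rate (a toy model of our own, not the paper's compile-time algorithm; the word-addressed trace, line size, and capacity are illustrative): access patterns with short reuse distances score high and are candidates for enabling the cache, while streaming or scattered patterns score near zero.

```python
from collections import OrderedDict

def lru_hit_rate(trace, cache_lines, line_words=32):
    """Fraction of word accesses that hit a fully associative LRU cache
    of cache_lines lines, each line_words words wide."""
    cache = OrderedDict()    # ordered dict doubles as an LRU stack
    hits = 0
    for addr in trace:
        line = addr // line_words
        if line in cache:
            hits += 1
            cache.move_to_end(line)          # refresh recency
        else:
            if len(cache) >= cache_lines:
                cache.popitem(last=False)    # evict least recently used
            cache[line] = True
    return hits / len(trace)
```

A sequential sweep hits on 31 of every 32 word accesses (spatial locality within a line), while a trace striding two lines apart with a tiny cache never hits, which is exactly the kind of pattern where a demand-fetched cache only adds overhead.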