StarPU: a unified platform for task scheduling on heterogeneous multicore architectures
Concurrency and Computation: Practice and Experience, 2011
Cited by 172 (15 self)
In the field of HPC, the current hardware trend is to design multiprocessor architectures that feature heterogeneous technologies such as specialized coprocessors (e.g., Cell/BE SPUs) or data-parallel accelerators (e.g., GPGPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offload parts of the computations. However, designing an execution model that unifies all computing units and associated embedded memory remains a main challenge. We have thus designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run time, and we have demonstrated their efficiency by analyzing the impact of those scheduling policies on several classical linear algebra algorithms that take advantage of multiple cores and GPUs at the same time. In addition to substantial improvements regarding execution times, we obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine.
Implementing sparse matrix-vector multiplication on throughput-oriented processors
In SC ’09: Proceedings of the 2009 ACM/IEEE conference on Supercomputing, 2009
Cited by 142 (7 self)
Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In contrast to the uniform regularity of dense linear algebra, sparse operations encounter a broad spectrum of matrices ranging from the regular to the highly irregular. Harnessing the tremendous potential of throughput-oriented processors for sparse operations requires that we expose substantial fine-grained parallelism and impose sufficient regularity on execution paths and memory access patterns. We explore SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes. The techniques we propose are efficient, successfully utilizing large percentages of peak bandwidth. Furthermore, they deliver excellent total throughput, averaging 16 GFLOP/s and 10 GFLOP/s in double precision for structured grid and unstructured mesh matrices, respectively, on a GeForce GTX 285. This is roughly 2.8 times the throughput previously achieved on Cell BE and more than 10 times that of a quad-core Intel Clovertown system.
Efficient sparse matrix-vector multiplication on CUDA
2008
Cited by 113 (2 self)
The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its role in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In this paper we discuss data structures and algorithms for SpMV that are efficiently implemented on the CUDA platform for the fine-grained parallel architecture of the GPU. Given the memory-bound nature of SpMV, we emphasize memory bandwidth efficiency and compact storage formats. We consider a broad spectrum of sparse matrices, from those that are well-structured and regular to highly irregular matrices with large imbalances in the distribution of nonzeros per matrix row. We develop methods to exploit several common forms of matrix structure while offering alternatives which accommodate greater irregularity. On structured, grid-based matrices we achieve performance of 36 GFLOP/s in single precision and 16 GFLOP/s in double precision on a GeForce GTX 280 GPU. For unstructured finite-element matrices, we observe performance in excess of 15 GFLOP/s and 10 GFLOP/s in single and double precision respectively. These results compare favorably to prior state-of-the-art studies of SpMV methods on conventional multicore processors. Our double precision SpMV performance is generally two and a half times that of a Cell BE with 8 SPEs and more than ten times greater than that of a quad-core Intel Clovertown system.
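As a rough illustration of the compact storage formats this abstract refers to, the sketch below computes y = A x for a matrix held in compressed sparse row (CSR) form. It is a plain CPU reference in Python, not the paper's CUDA code; the function name spmv_csr and the 3x3 example matrix are assumptions made only for this example. Each outer-loop iteration handles one row, which is the unit of work a GPU implementation would typically hand to a thread or a group of threads.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        # The nonzeros of this row live in values[row_ptr[row]:row_ptr[row + 1]],
        # with their column indices at the same positions in col_idx.
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Hypothetical example matrix (illustration only):
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]

y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

CSR stores only the nonzeros plus one offset per row, which is why it suits a memory-bound kernel; rows of very different lengths, however, give each row a different amount of work, the imbalance the abstract mentions.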
Demystifying GPU microarchitecture through microbenchmarking
In ISPASS, 2010
Cited by 69 (0 self)
Graphics processors (GPU) offer the promise of more than an order of magnitude speedup over conventional processors for certain non-graphics computations. Because the GPU is often presented as a C-like abstraction (e.g., Nvidia’s CUDA), little is known about the characteristics of the GPU’s architecture beyond what the manufacturer has documented. This work develops a microbenchmark suite and measures the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU. Various undisclosed characteristics of the processing elements and the memory hierarchies are measured. This analysis exposes undocumented features that impact program performance and correctness. These measurements can be useful for improving performance optimization, analysis, and modeling on this architecture and offer additional insight on the decisions made in developing this GPU.
The Scalable HeterOgeneous Computing (SHOC) benchmark suite
In Proc. 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-3), 2010
Cited by 69 (0 self)
Scalable heterogeneous computing systems, which are composed of a mix of compute devices, such as commodity multicore processors, graphics processors, reconfigurable processors, and others, are gaining attention as one approach to continuing performance improvement while managing the new challenge of energy efficiency. As these systems become more common, it is important to be able to compare and contrast architectural designs and programming systems in a fair and open forum. To this end, we have designed the Scalable HeterOgeneous Computing benchmark suite (SHOC). SHOC’s initial focus is on systems containing graphics processing units (GPUs) and multicore processors, and on the new OpenCL programming standard. SHOC is a spectrum of programs that test the performance and stability of these scalable heterogeneous computing systems. At the lowest level, SHOC uses microbenchmarks to assess architectural features of the system. At higher levels, SHOC uses application kernels to determine systemwide performance including many system features such as intranode and internode communication among devices. SHOC includes benchmark implementations in both OpenCL and CUDA in order to provide a comparison of these programming models.
Towards dense linear algebra for hybrid GPU accelerated manycore systems
Parallel Computing
Cited by 67 (20 self)
We highlight the trends leading to the increased appeal of using hybrid multicore + GPU systems for high performance computing. We present a set of techniques that can be used to develop efficient dense linear algebra algorithms for these systems. We illustrate the main ideas with the development of a hybrid LU factorization algorithm where we split the computation over a multicore and a graphics processor, and use particular techniques to reduce the amount of pivoting and communication between the hybrid components. This results in an efficient algorithm with balanced use of a multicore processor and a graphics processor.
Model-driven autotuning of sparse matrix-vector multiply on GPUs
In PPoPP, 2010
Cited by 65 (4 self)
We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPU). Our study consists of two parts. First, we describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants on classical blocked compressed sparse row (BCSR) and blocked ELLPACK (BELLPACK) storage formats, match or exceed state-of-the-art implementations. For instance, our best BELLPACK implementation achieves up to 29.0 Gflop/s in single-precision and 15.7 Gflop/s in double-precision on the NVIDIA T10P multiprocessor (C1060), enhancing prior state-of-the-art unblocked implementations (Bell and Garland, 2009) by up to 1.8× and 1.5× for single- and double-precision respectively. However, achieving this level of performance requires input matrix-dependent parameter tuning. Thus, in the second part of this study, we develop a performance model that can guide tuning. Like prior autotuning models for CPUs (e.g., Im, Yelick, and Vuduc, 2004), this model requires offline measurements and runtime estimation, but more directly models the structure of multithreaded vector processors like GPUs. We show that our model can identify the implementations that achieve within 15% of those found through exhaustive search.
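To make the ELLPACK storage idea named in this abstract concrete, the sketch below converts a CSR matrix to plain (unblocked) ELLPACK and multiplies it by a vector. This is a simplification: the paper's BELLPACK variant adds blocking, which is not shown here. The helper names csr_to_ell and spmv_ell, the padding convention, and the example matrix are all assumptions for illustration, and the code is a CPU reference in Python rather than a GPU kernel.

```python
def csr_to_ell(row_ptr, col_idx, values):
    """Pad every row to the same width so the matrix becomes two dense arrays."""
    num_rows = len(row_ptr) - 1
    width = max((row_ptr[r + 1] - row_ptr[r] for r in range(num_rows)), default=0)
    # Padding slots use column 0 with value 0.0, so they contribute nothing.
    ell_cols = [[0] * width for _ in range(num_rows)]
    ell_vals = [[0.0] * width for _ in range(num_rows)]
    for r in range(num_rows):
        for slot, k in enumerate(range(row_ptr[r], row_ptr[r + 1])):
            ell_cols[r][slot] = col_idx[k]
            ell_vals[r][slot] = values[k]
    return ell_cols, ell_vals


def spmv_ell(ell_cols, ell_vals, x):
    """Compute y = A @ x; every row performs exactly `width` multiply-adds."""
    y = []
    for cols, vals in zip(ell_cols, ell_vals):
        acc = 0.0
        for c, v in zip(cols, vals):
            acc += v * x[c]
        y.append(acc)
    return y


# Hypothetical example matrix (illustration only):
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
ell_cols, ell_vals = csr_to_ell([0, 2, 3, 5], [0, 2, 1, 0, 2],
                                [4.0, 1.0, 3.0, 2.0, 5.0])
y_ell = spmv_ell(ell_cols, ell_vals, [1.0, 2.0, 3.0])
```

The padding trades extra storage for uniform row lengths, so the cost depends on how uneven the rows are. That dependence on the input matrix is one reason a format choice of this kind benefits from tuning.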
A Quantitative Performance Analysis Model for GPU Architectures
In HPCA, 2011
Cited by 57 (2 self)
We develop a microbenchmark-based performance model for NVIDIA GeForce 200-series GPUs. Our model identifies GPU program bottlenecks and quantitatively analyzes performance, and thus allows programmers and architects to predict the benefits of potential program optimizations and architectural improvements. In particular, we use a microbenchmark-based approach to develop a throughput model for three major components of GPU execution time: the instruction pipeline, shared memory access, and global memory access. Because our model is based on the GPU’s native instruction set, we can predict performance with a 5–15% error. To demonstrate the usefulness of the model, we analyze three representative real-world and already highly-optimized programs: dense matrix multiply, tridiagonal systems solver, and sparse matrix vector multiply. The model provides us detailed quantitative analysis on performance, allowing us to understand the configuration of the fastest dense matrix multiply implementation and to optimize the tridiagonal solver and sparse matrix vector multiply by 60% and 18% respectively. Furthermore, our model applied to analysis on these codes allows us to suggest architectural improvements on hardware resource allocation, avoiding bank conflicts, block scheduling, and memory transaction granularity.
A GPGPU Compiler for Memory Optimization and Parallelism Management
In Proceedings of PLDI, 2010
Cited by 53 (4 self)
This paper presents a novel optimizing compiler for general purpose computation on graphics processing units (GPGPU). It addresses two major challenges of developing high performance GPGPU programs: effective utilization of GPU memory hierarchy and judicious management of parallelism. The input to our compiler is a naïve GPU kernel function, which is functionally correct but without any consideration for performance optimization. The compiler analyzes the code, identifies its memory access patterns, and generates both the optimized kernel and the kernel invocation parameters. Our optimization process includes vectorization and memory coalescing for memory bandwidth enhancement, tiling and unrolling for data reuse and parallelism management, and thread block remapping or address-offset insertion for partition-camping elimination. The experiments on a set of scientific and media processing algorithms show that our optimized code achieves very high performance, either superior or very close to the highly fine-tuned library, NVIDIA CUBLAS 2.2, and up to 128 times speedups over the naïve versions. Another distinguishing feature of our compiler is the understandability of the optimized code, which is useful for performance analysis and algorithm refinement.
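Of the optimizations this abstract lists, tiling is the easiest to show in a few lines. The sketch below is a CPU loop nest in Python for a tiled matrix multiply, meant only to illustrate the loop restructuring; it is not output of the compiler described, and the function name matmul_tiled, the tile parameter, and the example matrices are assumptions for this example. On a GPU the tile would typically be staged in fast on-chip memory so that each loaded element is reused several times.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), visiting the operands tile by tile."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Work on one tile of c using one tile of a and one tile of b.
                # min() handles edge tiles when a size is not a multiple of tile.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


# Hypothetical inputs (illustration only).
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
c = matmul_tiled(a, b, tile=2)
```

The result is the same as the untiled triple loop; only the order in which elements are touched changes, which is what makes the data reuse possible.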
Inter-Block GPU Communication via Fast Barrier Synchronization
Cited by 39 (2 self)
While GPGPU stands for general-purpose computation on graphics processing units, the lack of explicit support for inter-block communication on the GPU arguably hampers its broader adoption as a general-purpose computing device. Inter-block communication on the GPU occurs via global memory and then requires barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. Currently, such synchronization is only available via the CPU, which in turn, can incur significant overhead. We propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. We then evaluate the efficacy of each approach via a microbenchmark as well as three well-known algorithms: Fast Fourier Transform (FFT), dynamic programming, and bitonic sort. For the microbenchmark, the experimental results show that our GPU lock-free synchronization performs 8.4 times faster than CPU explicit synchronization and 4.0 times faster than CPU implicit synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, our GPU lock-free synchronization further improves performance by 10%, 26%, and 40%, respectively, and ultimately delivers an overall speedup of 70x, 13x, and 24x, respectively.
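The general shape of a lock-based barrier can be sketched with CPU threads standing in for GPU thread blocks. The Python below is an analogy under stated assumptions, not the authors' GPU code: the class CounterBarrier, the function run_phases, the goal-value scheme, and the use of a threading.Lock in place of a GPU atomic increment are all choices made for this illustration. Each worker bumps a shared counter when it reaches the barrier and then spins until the counter shows that every worker has arrived.

```python
import threading
import time


class CounterBarrier:
    """Barrier built from one shared counter protected by a lock."""

    def __init__(self):
        self.count = 0
        self.lock = threading.Lock()

    def wait(self, goal):
        # Announce arrival (stands in for an atomic increment).
        with self.lock:
            self.count += 1
        # Spin until all workers of this round have arrived.
        while True:
            with self.lock:
                if self.count >= goal:
                    return
            time.sleep(0)  # yield so other threads can make progress


def run_phases(num_workers=4, num_phases=3):
    barrier = CounterBarrier()
    log = []
    log_lock = threading.Lock()

    def worker(worker_id):
        for phase in range(num_phases):
            with log_lock:
                log.append((phase, worker_id))
            # The goal grows by num_workers each round, so the counter
            # never needs to be reset between rounds.
            barrier.wait(goal=(phase + 1) * num_workers)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


log = run_phases()
```

Because no worker leaves round r before the counter reaches (r + 1) * num_workers, every entry logged for one phase precedes every entry of the next phase. The comparison uses >= rather than equality so that a slow worker still exits correctly after faster workers have already incremented the counter for the following round.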