Results 1–10 of 153
Implementing sparse matrix-vector multiplication on throughput-oriented processors
 In SC ’09: Proceedings of the 2009 ACM/IEEE conference on Supercomputing
, 2009
Abstract

Cited by 137 (6 self)
Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In contrast to the uniform regularity of dense linear algebra, sparse operations encounter a broad spectrum of matrices ranging from the regular to the highly irregular. Harnessing the tremendous potential of throughput-oriented processors for sparse operations requires that we expose substantial fine-grained parallelism and impose sufficient regularity on execution paths and memory access patterns. We explore SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes. The techniques we propose are efficient, successfully utilizing large percentages of peak bandwidth. Furthermore, they deliver excellent total throughput, averaging 16 GFLOP/s and 10 GFLOP/s in double precision for structured grid and unstructured mesh matrices, respectively, on a GeForce GTX 285. This is roughly 2.8 times the throughput previously achieved on Cell BE and more than 10 times that of a quad-core Intel Clovertown system.
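As a hedged illustration of the row-oriented SpMV this line of work parallelizes, here is a plain-Python sketch over the common CSR (compressed sparse row) layout. This is not the paper's CUDA kernel; the function and array names (`csr_spmv`, `data`/`indices`/`indptr`) follow widespread CSR convention and are illustrative only.

```python
# Sketch of sparse matrix-vector multiplication (SpMV) over the CSR
# format. A GPU implementation would assign rows (or warps of threads)
# to the outer loop below; here it runs sequentially in Python.
def csr_spmv(data, indices, indptr, x):
    """Compute y = A @ x for A stored as CSR arrays (data, indices, indptr)."""
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):                    # one thread/warp per row on a GPU
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]       # gather from x via column indices
        y[row] = acc
    return y

# Example matrix A = [[4, 0, 1],
#                     [0, 2, 0],
#                     [3, 0, 5]]
data    = [4.0, 1.0, 2.0, 3.0, 5.0]   # nonzero values, row by row
indices = [0, 2, 1, 0, 2]             # column index of each nonzero
indptr  = [0, 2, 3, 5]                # row start offsets into data/indices
```

The irregularity the abstract mentions shows up here as variable inner-loop trip counts per row, which is exactly what makes load balancing on throughput-oriented hardware nontrivial.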
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
 In (submitted to) Proc. SC2008: High performance computing, networking, and storage conference
, 2008
Abstract

Cited by 133 (16 self)
Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the ...
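As a minimal sketch of the kernel class described above (not the paper's code), here is a 1-D three-point nearest-neighbor stencil sweep with a single cache-blocking parameter of the kind such an auto-tuner would search over. The function name and the default block size are illustrative assumptions.

```python
# Minimal nearest-neighbor (3-point) stencil sweep, the memory access
# pattern this class of structured-grid codes shares. An auto-tuner
# would search over tuning knobs such as the block size below.
def stencil_sweep(u, block=64):
    n = len(u)
    out = u[:]                                   # boundary points copied unchanged
    for start in range(1, n - 1, block):         # blocking knob for cache locality
        for i in range(start, min(start + block, n - 1)):
            # Jacobi-style update: read old array u, write new array out
            out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out
```

In a real tuned code the search space also covers multi-dimensional blocking, loop unrolling, software prefetching, and thread/core affinity; this sketch only exposes the structure being tuned.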
Efficient sparse matrix-vector multiplication on CUDA
, 2008
Abstract

Cited by 109 (2 self)
The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its role in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In this paper we discuss data structures and algorithms for SpMV that are efficiently implemented on the CUDA platform for the fine-grained parallel architecture of the GPU. Given the memory-bound nature of SpMV, we emphasize memory bandwidth efficiency and compact storage formats. We consider a broad spectrum of sparse matrices, from those that are well-structured and regular to highly irregular matrices with large imbalances in the distribution of nonzeros per matrix row. We develop methods to exploit several common forms of matrix structure while offering alternatives which accommodate greater irregularity. On structured, grid-based matrices we achieve performance of 36 GFLOP/s in single precision and 16 GFLOP/s in double precision on a GeForce GTX 280 GPU. For unstructured finite-element matrices, we observe performance in excess of 15 GFLOP/s and 10 GFLOP/s in single and double precision respectively. These results compare favorably to prior state-of-the-art studies of SpMV methods on conventional multicore processors. Our double precision SpMV performance is generally two and a half times that of a Cell BE with 8 SPEs and more than ten times greater than that of a quad-core Intel Clovertown system.
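One compact storage format commonly paired with GPUs in this setting is ELLPACK/ELL, which pads every row to a common width so that memory accesses become regular. The Python sketch below illustrates the layout and its SpMV; the function names, the `-1` padding sentinel, and the row-list input format are assumptions for illustration, not the paper's implementation.

```python
# Sketch of the ELLPACK/ELL layout: every row is padded to the same
# number of entries, so a warp of GPU threads walking the columns of
# the padded arrays gets fully regular (coalesced) access.
def to_ell(rows, pad_val=0.0):
    """rows: list of per-row [(col, val), ...] lists.
    Returns (cols, vals): two n_rows x width tables, padded with
    column sentinel -1 and value pad_val."""
    width = max(len(r) for r in rows)            # longest row sets the padding
    cols, vals = [], []
    for r in rows:
        c = [col for col, _ in r] + [-1] * (width - len(r))
        v = [val for _, val in r] + [pad_val] * (width - len(r))
        cols.append(c)
        vals.append(v)
    return cols, vals

def ell_spmv(cols, vals, x):
    """y = A @ x over the padded ELL tables, skipping padding entries."""
    return [sum(v * x[c] for c, v in zip(c_row, v_row) if c != -1)
            for c_row, v_row in zip(cols, vals)]
```

The padding also shows ELL's weakness on irregular matrices: one long row inflates every row to its width, which is why hybrid formats fall back to another representation for the overflow entries.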
A View of the Parallel Computing Landscape
Abstract

Cited by 92 (0 self)
Contributed article, doi:10.1145/1562764.1562783. Writing programs that scale with increasing numbers of cores should be as easy as writing programs for sequential computers.
Understanding sources of inefficiency in general-purpose chips
 In ISCA ’10: Proceedings of the 37th Annual International Symposium on Computer Architecture, to appear (2010), IEEE Press
Abstract

Cited by 73 (2 self)
Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to particular computations, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (100s of fJ in 90nm) mean that over 90% of the energy used in these solutions is still “overhead”. Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing 100s of operations per instruction. This improves performance and energy by an additional 25x, and the final customized CMP matches an ASIC solution’s performance, coming within 3x of its energy and within comparable area.
A Quantitative Performance Analysis Model for GPU Architectures
 In HPCA
, 2011
Abstract

Cited by 53 (0 self)
We develop a microbenchmark-based performance model for NVIDIA GeForce 200-series GPUs. Our model identifies GPU program bottlenecks and quantitatively analyzes performance, and thus allows programmers and architects to predict the benefits of potential program optimizations and architectural improvements. In particular, we use a microbenchmark-based approach to develop a throughput model for three major components of GPU execution time: the instruction pipeline, shared memory access, and global memory access. Because our model is based on the GPU’s native instruction set, we can predict performance with a 5–15% error. To demonstrate the usefulness of the model, we analyze three representative real-world and already highly optimized programs: dense matrix multiply, tridiagonal systems solver, and sparse matrix-vector multiply. The model provides us with a detailed quantitative analysis of performance, allowing us to understand the configuration of the fastest dense matrix multiply implementation and to optimize the tridiagonal solver and sparse matrix-vector multiply by 60% and 18% respectively. Furthermore, applying our model to these codes allows us to suggest architectural improvements in hardware resource allocation, avoiding bank conflicts, block scheduling, and memory transaction granularity.
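The core of such a throughput model can be caricatured as a bottleneck calculation: assuming ideal overlap between components, predicted time is the maximum of the per-component times. The sketch below is a toy with made-up numbers, not the paper's calibrated model; the component names mirror the three components named above, but every rate and count is a placeholder.

```python
# Toy bottleneck model in the spirit of component-throughput GPU models:
# each component contributes work/throughput cycles, and under perfect
# overlap the slowest component bounds total kernel time.
def predicted_time(work):
    """work: dict mapping component name -> (amount, throughput per cycle).
    Returns the predicted cycle count: max over per-component times."""
    return max(amount / rate for amount, rate in work.values())

# Hypothetical kernel (all numbers made up): here the global-memory
# component dominates, so the kernel is predicted to be memory-bound.
work = {
    "instruction_pipeline": (12000, 2.0),   # instructions, IPC
    "shared_memory":        (4000, 1.0),    # shared-memory accesses per cycle
    "global_memory":        (4000, 0.5),    # memory transactions per cycle
}
```

The value of such a model is directional: it tells you which component to optimize, because shrinking a non-bottleneck term leaves the predicted time unchanged.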
Lattice Boltzmann simulation optimization on leading multicore platforms
 in International Parallel and Distributed Processing Symposium (IPDPS)
, 2008
Abstract

Cited by 45 (16 self)
We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Clovertown, AMD Opteron X2, Sun Niagara2, STI Cell, as well as the single-core Intel Itanium2. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 14× improvement compared with the original code. Additionally, we present detailed analysis of each optimization, which reveals surprising hardware bottlenecks and software challenges for future multicore systems and applications.
Minimizing Communication in Sparse Matrix Solvers
Abstract

Cited by 35 (10 self)
Data communication within the memory system of a single processor node and between multiple nodes in a system is the bottleneck in many iterative sparse matrix solvers like CG and GMRES. Here, k iterations of a conventional implementation perform k sparse matrix-vector multiplications and Ω(k) vector operations like dot products, resulting in communication that grows by a factor of Ω(k) in both the memory and the network. By reorganizing the sparse-matrix kernel to compute a set of matrix-vector products at once and reorganizing the rest of the algorithm accordingly, we can perform k iterations by sending O(log P) messages instead of O(k · log P) messages on a parallel machine, and reading the matrix A from DRAM to cache just once, instead of k times on a sequential machine. This reduces communication to the minimum possible. We combine these techniques to form a new variant of GMRES. Our shared-memory implementation on an 8-core Intel Clovertown gets speedups of up to 4.3× over standard GMRES, without sacrificing convergence rate or numerical stability.
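The reorganized kernel described above produces the Krylov basis [x, Ax, A²x, ..., Aᵏx] in one call. Here is a dense, sequential Python sketch of that output only; it is not the communication-avoiding kernel itself, which tiles a sparse A with ghost regions so A streams from slow memory a single time, whereas this naive loop re-reads A on every step.

```python
# Sketch of what a "matrix powers" kernel computes: the Krylov vectors
# [x, A@x, A^2@x, ..., A^k@x]. Computing them together is what lets a
# communication-avoiding solver read A from DRAM once per k steps; this
# dense, naive version only illustrates the produced basis.
def matrix_powers(A, x, k):
    """A: dense matrix as list of rows; x: vector; returns k+1 basis vectors."""
    basis = [x]
    for _ in range(k):
        prev = basis[-1]
        nxt = [sum(a_ij * prev[j] for j, a_ij in enumerate(row)) for row in A]
        basis.append(nxt)                        # A applied once more per step
    return basis
```

A solver variant like the one described then replaces k separate SpMV-plus-dot-product iterations with one such basis computation followed by block orthogonalization, trading extra flops for far fewer messages and DRAM reads.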