Results 1 - 10 of 135
A View of the Parallel Computing Landscape
Communications of the ACM, 2009 (doi:10.1145/1562764.1562783)
Cited by 98 (0 self)
Writing programs that scale with increasing numbers of cores should be as easy as writing programs for sequential computers.
3D finite difference computation on GPUs using CUDA
Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units, ACM International Conference Proceeding Series, 2009
Cited by 75 (0 self)
In this paper we describe a GPU parallelization of 3D finite difference computation using CUDA. Data access redundancy is used as the metric to determine the optimal implementation for both the stencil-only computation and the discretization of the wave equation, which is currently of great interest in seismic computing. For the larger stencils, the described approach achieves throughput between 2,400 and over 3,000 million output points per second on a single Tesla 10-series GPU. This is roughly an order of magnitude higher than a 4-core Harpertown CPU running similar code from the seismic industry. Multi-GPU parallelization is also described, achieving linear scaling with the number of GPUs by overlapping inter-GPU communication with computation.
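The data-access-redundancy metric the authors use can be sketched for a simple shared-memory tiling scheme: an n-by-n tile with a stencil of radius k must also load a halo of width k, so redundancy falls toward 1 as the tile grows. A minimal illustration (the square-tile shape and the values of n and k are assumptions for this sketch, not the paper's measured configurations):

```python
# Data-access redundancy: input cells loaded per output cell.  An
# n x n tile with a radius-k halo reads (n + 2k)^2 cells from global
# memory to produce n^2 outputs.  (Illustrative assumed parameters.)

def redundancy(n: int, k: int) -> float:
    """Cells read per output cell for an n x n tile and radius-k stencil."""
    return (n + 2 * k) ** 2 / n ** 2

# Larger tiles amortize the halo cost: redundancy tends to 1 as n grows.
for n in (8, 16, 32):
    print(f"tile {n:2d}x{n:2d}: redundancy {redundancy(n, 4):.2f}")
```

The same metric extends to 3D tiles by cubing instead of squaring; the paper uses it to choose among implementation variants.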
The Pochoir Stencil Compiler
Cited by 43 (2 self)
A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. Parallel cache-efficient stencil algorithms based on “trapezoidal decompositions” are known, but most programmers find them difficult to write. The Pochoir stencil compiler allows a programmer to write a simple specification of a stencil in a domain-specific stencil language embedded in C++, which the Pochoir compiler then translates into high-performing Cilk code that employs an efficient parallel cache-oblivious algorithm. Pochoir supports general d-dimensional stencils and handles both periodic and aperiodic boundary conditions in one unified algorithm. The Pochoir system provides a C++ template library that allows the user’s stencil specification to be executed directly in C++ without the Pochoir compiler (albeit more slowly), which simplifies user debugging and greatly simplified the implementation of the Pochoir compiler itself. A host of stencil benchmarks run on a modern multicore machine demonstrates that Pochoir outperforms standard parallel-loop implementations, typically running 2–10 times faster. The algorithm behind Pochoir improves on prior cache-efficient algorithms on multidimensional grids by making “hyperspace” cuts, which yield asymptotically more parallelism for the same cache efficiency.
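The trapezoidal decomposition underlying Pochoir can be sketched in one dimension: recursively cut a space-time region either in space (along a sloped line) or in time, visiting grid points in a dependency-respecting, cache-friendly order. The serial sketch below follows the classic Frigo–Strumpen space/time-cut recursion with an assumed 3-point averaging kernel and fixed boundary values; Pochoir's hyperspace cuts and Cilk parallelism are not shown:

```python
def walk(t0, t1, x0, dx0, x1, dx1, kernel):
    """Visit every point (t, x) of a 1D space-time trapezoid in a
    dependency-respecting order via recursive space and time cuts."""
    dt = t1 - t0
    if dt == 1:
        for x in range(x0, x1):
            kernel(t0, x)
    elif dt > 1:
        if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt:
            # Space cut: split along a line of slope -1 through the center.
            xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) // 4
            walk(t0, t1, x0, dx0, xm, -1, kernel)
            walk(t0, t1, xm, -1, x1, dx1, kernel)
        else:
            # Time cut: finish the lower half before the upper half.
            s = dt // 2
            walk(t0, t0 + s, x0, dx0, x1, dx1, kernel)
            walk(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1, kernel)

N, T = 64, 16
init = [float((3 * i) % 11) for i in range(N)]
u = [init[:], init[:]]            # double buffer; boundary cells stay fixed

def kernel(t, x):
    src, dst = u[t % 2], u[(t + 1) % 2]
    dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0

walk(0, T, 1, 0, N - 1, 0, kernel)    # whole interior, vertical side edges
trapezoid_result = u[T % 2]

# Reference: plain double-buffered timestepping gives identical values.
ref = init[:]
for _ in range(T):
    new = ref[:]
    for x in range(1, N - 1):
        new[x] = (ref[x - 1] + ref[x] + ref[x + 1]) / 3.0
    ref = new
assert trapezoid_result == ref
```

Because each output cell is computed from exactly the same inputs in the same order, the recursive traversal matches naive timestepping bit for bit; the payoff is that each recursive leaf touches a working set that fits in cache.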
Performance Modeling and Automatic Ghost Zone Optimization for Iterative Stencil Loops on GPUs
ICS ’09: Proceedings of the 23rd International Conference on Supercomputing, 2009
Cited by 42 (9 self)
Iterative stencil loops (ISLs) are used in many applications, and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in a grid environment. To automate this process on shared memory systems, we establish a performance model using NVIDIA’s Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and generate appropriate code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations are able to achieve a speedup no less than 98% of the optimal speedup.
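The ghost-zone trade-off described above can be sketched in one dimension: give each tile a halo of width g and it can advance g time steps entirely from local data, deferring all halo exchange and synchronization, at the cost of recomputing cells that neighboring tiles also compute. A minimal sketch (the 3-point kernel and the tile bounds are assumptions for illustration, not the paper's performance model):

```python
def sweep(u):
    # One 3-point stencil step; the two endpoint cells are dropped, so
    # the valid region shrinks by one cell on each side.
    return [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]

def tile_steps(u, lo, hi, g):
    """Advance cells [lo, hi) by g steps using only the ghost-padded
    slice u[lo-g : hi+g] -- no halo exchange is needed until step g.
    Redundant work per tile is sum over steps k of 2*(g-k) = g*(g-1)
    cells, which is the price paid for the deferred communication."""
    local = u[lo - g: hi + g]
    for _ in range(g):
        local = sweep(local)
    return local

# Check the tile result against a globally computed reference.
N, g, lo, hi = 40, 4, 16, 24
u = [float((7 * i) % 13) for i in range(N)]
ref = u
for _ in range(g):
    ref = sweep(ref)            # ref[j] holds the value at position j + g
assert tile_steps(u, lo, hi, g) == ref[lo - g: hi - g]
```

The paper's contribution is a performance model that picks g automatically: larger g defers more synchronization but the redundant work grows quadratically, so an optimum exists per architecture and application.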
3.5D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs
In Proc. of the 2010 ACM/IEEE Int’l Conf. for High Performance Computing, Networking, Storage and Analysis, 2010
Cited by 36 (1 self)
Stencil computation sweeps over a spatial grid over multiple time steps to perform nearest-neighbor computations. The bandwidth-to-compute requirement for a large class of stencil kernels is very high, and their performance is bound by the available memory bandwidth. Since memory bandwidth grows more slowly than compute, the performance of stencil kernels will not scale with increasing compute density. We present a novel 3.5D-blocking algorithm that performs 2.5D spatial and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs. The resulting algorithm is amenable to both thread-level and data-level parallelism, and scales near-linearly with SIMD width and multiple cores. Our performance numbers are faster than or comparable to state-of-the-art stencil implementations on CPUs and GPUs. Our implementation of the 7-point stencil is 1.5X faster on CPUs and 1.8X faster on GPUs for single-precision floating-point inputs than previously reported numbers. For Lattice Boltzmann methods, the corresponding speedup on CPUs is 2.1X.
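The 2.5D spatial part of the scheme, streaming a small working set through the remaining dimension so only a few planes live in fast memory, can be sketched in two dimensions with a sliding three-row window (an illustrative sketch; the row-wise streaming direction and the 5-point averaging kernel are assumptions, not the paper's kernels):

```python
def row_update(prev, cur, nxt):
    """5-point stencil for one output row given its three input rows."""
    out = cur[:]                       # endpoint columns copied through
    for j in range(1, len(cur) - 1):
        out[j] = 0.2 * (cur[j] + cur[j - 1] + cur[j + 1] + prev[j] + nxt[j])
    return out

def streamed_sweep(grid):
    """One stencil sweep holding only a 3-row window 'on chip': each
    output row needs just rows i-1, i, i+1 of the input, so the fast
    memory footprint is three rows regardless of grid height."""
    out = [grid[0][:]]                           # boundary row unchanged
    for i in range(1, len(grid) - 1):
        window = (grid[i - 1], grid[i], grid[i + 1])   # the resident rows
        out.append(row_update(*window))
    out.append(grid[-1][:])
    return out

grid = [[float((i * 5 + j) % 9) for j in range(12)] for i in range(10)]
result = streamed_sweep(grid)
assert result[0] == grid[0] and len(result) == len(grid)
```

The full 3.5D scheme additionally applies temporal blocking, keeping windows for several time steps in flight at once so each grid point is read from off-chip memory only once per time tile; that part is omitted from this sketch.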
Autotuning Performance on Multicore Computers
2008
Cited by 30 (8 self)
High-performance code generation for stencil computations on GPU architectures
In ICS, 2012
Cited by 25 (2 self)
Stencil computations arise in many scientific computing domains, and often represent time-critical portions of applications. There is significant interest in offloading these computations to high-performance devices such as GPU accelerators, but these architectures offer challenges for developers and compilers alike. Stencil computations in particular require careful attention to off-chip memory access and the balancing of work among compute units in GPU devices. In this paper, we present a code generation scheme for stencil computations on GPU accelerators, which optimizes the code by trading an increase in the computational workload for a decrease in the required global memory bandwidth. We develop compiler algorithms for automatic generation of efficient, time-tiled stencil code for GPU accelerators from a high-level description of the stencil operation. We show that the code generation scheme can achieve high performance on a range of GPU architectures, including both NVIDIA and AMD devices.
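The compute-for-bandwidth trade described above can be illustrated with a toy roofline model: tiling T time steps together cuts global-memory traffic roughly T-fold while inflating the flop count by a redundancy factor r, so the trade pays off exactly when the kernel is bandwidth-bound. All device and kernel numbers below are assumed for illustration, not measurements from the paper:

```python
def kernel_time(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline execution-time estimate: bound by compute or by memory."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

# Assumed device and kernel parameters (illustrative only).
PEAK_FLOPS = 1.0e12     # 1 Tflop/s
PEAK_BW    = 1.0e11     # 100 GB/s
n          = 1.0e6      # grid points
f, b       = 8.0, 8.0   # flops and bytes per point per time step
T, r       = 4, 1.3     # time-tile depth and compute-redundancy factor

naive = T * kernel_time(n * f, n * b, PEAK_FLOPS, PEAK_BW)
tiled = kernel_time(T * n * f * r,   # T steps of work, inflated by r
                    n * b,           # one global read/write covers all T steps
                    PEAK_FLOPS, PEAK_BW)
assert tiled < naive   # bandwidth-bound, so trading flops for bytes pays
```

With these assumed numbers the naive version is memory-bound at every step, so time tiling approaches a T-fold speedup despite doing 30% more arithmetic; on a compute-bound kernel the same transformation would lose.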
Mint: Realizing CUDA Performance in 3D Stencil Methods with Annotated C
In Proceedings of the 25th International Conference on Supercomputing (ICS ’11), 2011
Cited by 23 (1 self)
We present Mint, a programming model that enables the non-expert to enjoy the performance benefits of hand-coded CUDA without becoming entangled in the details. Mint targets stencil methods, which are an important class of scientific applications. We have implemented the Mint programming model with a source-to-source translator that generates optimized CUDA C from traditional C source. The translator relies on annotations to guide translation at a high level. The set of pragmas is small, and the model is compact and simple. Yet Mint is able to deliver performance competitive with painstakingly hand-optimized CUDA. We show that, for a set of widely used stencil kernels, Mint realized 80% of the performance obtained from aggressively optimized CUDA on 200-series NVIDIA GPUs. Our optimizations target three-dimensional kernels, which present a daunting array of optimizations.