Automatic Data Placement into GPU On-Chip Memory Resources (2015 IEEE/ACM International Symposium on Code Generation and Optimization)
"... Although graphics processing units (GPUs) rely on thread-level parallelism to hide long off-chip memory access latency, judicious utilization of on-chip memory resources, including register files, shared memory, and data caches, is critical to application performance. However, explicitly managing GP ..."
Abstract
- Add to MetaCart
(Show Context)
Although graphics processing units (GPUs) rely on thread-level parallelism to hide long off-chip memory access latency, judicious utilization of on-chip memory resources, including register files, shared memory, and data caches, is critical to application performance. However, explicitly managing GPU on-chip memory resources is a non-trivial task for application developers. More importantly, as on-chip memory resources vary among different GPU generations, performance portability has become a daunting challenge. In this paper, we tackle this problem with compiler-driven automatic data placement. We focus on programs that have already been reasonably optimized either manually by programmers or automatically by compiler tools. Our proposed compiler algorithms refine these programs by revising data placement across different types of GPU on-chip resources to achieve both performance enhancement and performance portability. Across the 12 benchmarks in our study, our proposed compiler algorithm improves performance by 1.76x on average on the Nvidia GTX480 and by 1.61x on average on the GTX680.
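To make the placement trade-off concrete, here is a minimal CUDA sketch (an illustration written for this summary, not the paper's compiler output): the same three-point stencil with two on-chip data placements, one staging a tile in shared memory and one keeping operands in registers and relying on the data cache. The kernel names, tile size, and coefficients are assumptions; choosing between such variants per GPU generation is the kind of decision the proposed compiler pass is meant to automate.

#include <cuda_runtime.h>

#define TILE 256   // illustrative block size; launch with blockDim.x == TILE

// Placement A: stage the tile plus a one-element halo in shared memory.
__global__ void stencil_shared(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 2];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;
    if (gid < n) tile[lid] = in[gid];
    if (threadIdx.x == 0)        tile[0]        = (gid > 0)     ? in[gid - 1] : 0.0f;
    if (threadIdx.x == TILE - 1) tile[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();
    if (gid > 0 && gid < n - 1)
        out[gid] = 0.25f * tile[lid - 1] + 0.5f * tile[lid] + 0.25f * tile[lid + 1];
}

// Placement B: keep the three operands in registers and let the data cache
// serve the reuse; often preferable when shared memory usage would cap occupancy.
__global__ void stencil_cached(const float * __restrict__ in,
                               float * __restrict__ out, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid <= 0 || gid >= n - 1) return;
    float l = in[gid - 1], c = in[gid], r = in[gid + 1];
    out[gid] = 0.25f * l + 0.5f * c + 0.25f * r;
}

Which variant wins depends on the shared memory and cache capacities of the target generation (e.g., GTX480 vs. GTX680), which is exactly the portability gap the paper's automatic placement targets.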
Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations
"... Abstract—The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational ..."
Abstract
- Add to MetaCart
(Show Context)
The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In particular, we focus on recursive algorithms operating on trees and graphs. We propose different parallelization templates aimed at increasing the GPU utilization of these codes. Specifically, we investigate mechanisms to effectively distribute irregular work to streaming multiprocessors and GPU cores. Some of our parallelization templates rely on dynamic parallelism, a feature recently introduced by Nvidia in their Kepler GPUs and announced as part of the OpenCL 2.0 standard. We propose mechanisms to maximize the work performed by nested kernels and minimize the overhead due to their invocation. Our results show that the use of our parallelization templates on applications with irregular nested loops can lead to a 2-6x speedup over baseline GPU codes that do not include load balancing mechanisms. The use of nested parallelism-based parallelization templates on recursive tree traversal algorithms can lead to substantial speedups (up to 15-24x) over optimized CPU implementations. However, the benefits of nested parallelism are still unclear in the presence of recursive applications operating on graphs, especially when recursive code variants require expensive synchronization. In these cases, a flat parallelization of iterative versions of the considered algorithms may be preferable.
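As a point of reference for the mechanism the templates build on, below is a minimal CUDA dynamic-parallelism sketch (an illustration for this summary, not one of the paper's templates): a parent kernel handling an irregular nested loop spawns a nested grid only when the inner trip count is large, and falls back to a serial inner loop otherwise. It assumes device-side kernel launch (compute capability 3.5+, compiled with -rdc=true); the 128-item threshold and all names are assumptions.

#include <cuda_runtime.h>

// Child kernel: one thread per inner-loop iteration of a large row.
__global__ void child_process(const int *items, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        // ... process items[i] ...
    }
}

// Parent kernel: one thread per outer iteration (e.g., per CSR row).
__global__ void parent_irregular(const int *row_ptr, const int *items, int num_rows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    int begin = row_ptr[row];
    int count = row_ptr[row + 1] - begin;

    if (count > 128) {
        // Large inner loop: launch a nested grid so the work is spread over
        // the GPU instead of being serialized in a single parent thread.
        child_process<<<(count + 255) / 256, 256>>>(items + begin, count);
    } else {
        // Small inner loop: the launch overhead would dominate; iterate here.
        for (int i = 0; i < count; ++i) {
            // ... process items[begin + i] ...
        }
    }
}

The paper's templates address exactly the two costs visible here: the overhead of each nested launch and the underutilization caused by small nested grids.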
Free Launch: Optimizing GPU Dynamic Kernel Launches through Thread Reuse
"... Supporting dynamic parallelism is important for GPU to benefit a broad range of applications. There are cur-rently two fundamental ways for programs to exploit dy-namic parallelism on GPU: a software-based approach with software-managed worklists, and a hardware-based approach through dynamic subker ..."
Abstract
- Add to MetaCart
(Show Context)
Supporting dynamic parallelism is important for GPUs to benefit a broad range of applications. There are currently two fundamental ways for programs to exploit dynamic parallelism on GPUs: a software-based approach with software-managed worklists, and a hardware-based approach through dynamic subkernel launches. Neither is satisfactory. The former is complicated to program and is often subject to some load imbalance; the latter suffers large runtime overhead. In this work, we propose free launch, a new software approach to overcoming the shortcomings of both methods. It allows programmers to use subkernel launches to express dynamic parallelism. It employs a novel compiler-based code transformation named subkernel launch removal to replace the subkernel launches with the reuse of parent threads. Coupled with an adaptive task assignment mechanism, the transformation reassigns the tasks in the subkernels to the parent threads with a good load balance. The technique requires no hardware extensions and is immediately deployable on existing GPUs. It keeps the programming convenience of the subkernel launch-based approach while avoiding its large runtime overhead. Meanwhile, its superior load balancing makes it outperform manual worklist-based techniques by 3x on average.
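A hand-written, block-local sketch of the thread-reuse idea (the actual technique is an automatic compiler transformation with an adaptive task assignment scheme; the queue size, task encoding, and names below are assumptions made for illustration): child tasks that would have been expressed as subkernel launches are instead collected in shared memory and drained by the same parent threads.

#include <cuda_runtime.h>

#define QUEUE_CAP 1024   // illustrative per-block queue capacity

__device__ void process_task(int task) {
    // ... body that would have run inside the subkernel ...
}

__global__ void parent_reuse(const int *work, const int *child_count, int n) {
    __shared__ int queue[QUEUE_CAP];
    __shared__ int qsize;
    if (threadIdx.x == 0) qsize = 0;
    __syncthreads();

    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: instead of child_kernel<<<child_count[gid], ...>>>(work[gid]),
    // enqueue one task per would-be child thread.
    if (gid < n) {
        for (int c = 0; c < child_count[gid]; ++c) {
            int slot = atomicAdd(&qsize, 1);
            if (slot < QUEUE_CAP) queue[slot] = work[gid] + c;   // encode the task
            // A real scheme would spill or iterate when the queue overflows;
            // omitted here for brevity.
        }
    }
    __syncthreads();

    // Phase 2: the parent threads are reused to execute the child tasks,
    // strided across the block so the load is balanced.
    int total = min(qsize, QUEUE_CAP);
    for (int t = threadIdx.x; t < total; t += blockDim.x) {
        process_task(queue[t]);
    }
}

Compared with a device-side subkernel launch, this removes the launch overhead entirely; the paper's transformation additionally balances tasks across blocks, which this block-local sketch does not attempt.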
NUPAR: A Benchmark Suite for Modern GPU Architectures
"... Heterogeneous systems consisting of multi-core CPUs, Graphics Processing Units (GPUs) and many-core accelera-tors have gained widespread use by application developers and data-center platform developers. Modern day heteroge-neous systems have evolved to include advanced hardware and software feature ..."
Abstract
- Add to MetaCart
(Show Context)
Heterogeneous systems consisting of multi-core CPUs, Graphics Processing Units (GPUs) and many-core accelerators have gained widespread use by application developers and data-center platform developers. Modern-day heterogeneous systems have evolved to include advanced hardware and software features to support a spectrum of application patterns. Heterogeneous programming frameworks such as CUDA, OpenCL, and OpenACC have all introduced new interfaces to enable developers to utilize new features on these platforms. In emerging applications, performance optimization is not only limited to effectively exploiting data-level parallelism, but includes leveraging new degrees of concurrency and parallelism to accelerate the entire application. To aid hardware architects and application developers in ...
Efficient Warp Execution in Presence of Divergence with Collaborative Context Collection
"... GPU’s SIMD architecture is a double-edged sword con-fronting parallel tasks with control flow divergence. On the one hand, it provides a high performance yet power-efficient platform to accelerate applications via massive parallelism; however, on the other hand, irregularities induce inefficiencies ..."
Abstract
- Add to MetaCart
(Show Context)
The GPU’s SIMD architecture is a double-edged sword when confronting parallel tasks with control flow divergence. On the one hand, it provides a high-performance yet power-efficient platform to accelerate applications via massive parallelism; on the other hand, irregularities induce inefficiencies due to the warp’s lockstep traversal of all diverging execution paths. In this work, we present a software (compiler) technique named Collaborative Context Collection (CCC) that increases warp execution efficiency when faced with thread divergence incurred either by different intra-warp task assignment or by intra-warp load imbalance. CCC collects the relevant registers of divergent threads in a warp-specific stack allocated in the fast shared memory, and restores them only when perfect utilization of warp lanes becomes feasible. We propose code transformations to enable applicability of CCC to a variety of program segments with thread divergence. We also introduce optimizations to reduce the cost of CCC and to avoid device occupancy limitation or memory divergence. We have developed a framework that automates application of CCC to the intermediate PTX code generated from CUDA. We evaluated CCC on real-world applications and multiple scenarios using synthetic programs. CCC improves the warp execution efficiency of real-world benchmarks by up to 56% and achieves an average speedup of 1.69x (maximum 3.08x).
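A heavily simplified, hand-written sketch of the stack idea (CCC itself is applied automatically at the PTX level and covers far more cases; the divergent condition, stack size, and names here are assumptions, and CUDA 9+ warp intrinsics are used): lanes whose divergent condition holds stash their context on a per-warp shared-memory stack, and the heavy task body runs only when a full warp's worth of contexts has been collected.

#include <cuda_runtime.h>

#define WARPS_PER_BLOCK 8   // assumes blockDim.x == 256

__device__ void heavy_task(int ctx) {
    // ... expensive divergent work, now executed by all 32 lanes at once ...
}

__global__ void ccc_sketch(const int *input, int n) {
    __shared__ int stack[WARPS_PER_BLOCK][32];   // per-warp context stack
    __shared__ int top[WARPS_PER_BLOCK];

    int warp = threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (lane == 0) top[warp] = 0;
    __syncwarp();

    int gid   = blockIdx.x * blockDim.x + threadIdx.x;
    int ctx   = (gid < n) ? input[gid] : 0;      // the "context" is one register here
    bool need = (gid < n) && (ctx & 1);          // illustrative divergent condition

    unsigned mask = __ballot_sync(0xffffffffu, need);

    // Push the contexts of lanes that have work instead of executing divergently.
    if (need) {
        int offset = __popc(mask & ((1u << lane) - 1));   // rank among active lanes
        stack[warp][top[warp] + offset] = ctx;
    }
    __syncwarp();
    if (lane == 0) top[warp] += __popc(mask);
    __syncwarp();

    // Run the task body only when every lane can be handed a context,
    // i.e., with perfect utilization of the warp.
    while (top[warp] >= 32) {
        int my_ctx = stack[warp][top[warp] - 32 + lane];
        heavy_task(my_ctx);
        __syncwarp();
        if (lane == 0) top[warp] -= 32;
        __syncwarp();
    }
    // Fewer than 32 leftover contexts would be drained in an epilogue
    // before the kernel exits (omitted here).
}

In this one-shot sketch each thread contributes at most one context, so the drain loop fires at most once; in a kernel that loops over many elements, contexts would keep accumulating across iterations and the per-warp stack pays off repeatedly.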