CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

by Yi Yang, Huiyang Zhou
Citing documents (results 1 - 5 of 5):

Automatic Data Placement into GPU On-Chip Memory Resources (2015 IEEE/ACM International Symposium on Code Generation and Optimization)

by Chao Li, Yi Yang, Zhen Lin, Huiyang Zhou
"... Although graphics processing units (GPUs) rely on thread-level parallelism to hide long off-chip memory access latency, judicious utilization of on-chip memory resources, including register files, shared memory, and data caches, is critical to application performance. However, explicitly managing GP ..."
Abstract - Add to MetaCart
Although graphics processing units (GPUs) rely on thread-level parallelism to hide long off-chip memory access latency, judicious utilization of on-chip memory resources, including register files, shared memory, and data caches, is critical to application performance. However, explicitly managing GPU on-chip memory resources is a non-trivial task for application developers. More importantly, as on-chip memory resources vary among different GPU generations, performance portability has become a daunting challenge. In this paper, we tackle this problem with compiler-driven automatic data placement. We focus on programs that have already been reasonably optimized either manually by programmers or automatically by compiler tools. Our proposed compiler algorithms refine these programs by revising data placement across different types of GPU on-chip resources to achieve both performance enhancement and performance portability. Among 12 benchmarks in our study, our proposed compiler algorithm improves the performance by 1.76x on average on Nvidia GTX480, and by 1.61x on average on GTX680.
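As a hypothetical illustration of the placement choices such a compiler weighs (this is not the paper's compiler output; the kernels, the coefficient table, and N are placeholders), the same reused data can be kept in shared memory or in registers:

    #define N 16

    // Placement A: the coefficient table is staged into shared memory once
    // per thread block and reused by every thread (assumes blockDim.x >= N).
    __global__ void scale_shared(const float* coef, const float* in,
                                 float* out, int n) {
        __shared__ float c[N];
        if (threadIdx.x < N) c[threadIdx.x] = coef[threadIdx.x];
        __syncthreads();
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)
            out[i] = in[i] * c[threadIdx.x % N];
    }

    // Placement B: each thread keeps its one coefficient in a register,
    // trading register pressure for zero shared-memory traffic.
    __global__ void scale_regs(const float* coef, const float* in,
                               float* out, int n) {
        float c = coef[threadIdx.x % N];
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)
            out[i] = in[i] * c;
    }

Which variant wins depends on the register-file and shared-memory budget of the target GPU generation, which is exactly the portability problem the paper's compiler automates away.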

Citation Context

...ed memory and the impacts of varying on-chip resources across different GPU generations. To relieve the burden of optimizing GPU programs from the programmers, many auto-tuning frameworks [15][16][22][25][26] have been developed to automatically optimize the GPU programs to achieve high performance. For example, a polyhedral model is used in [16] for optimizing global memory accesses. In [27], the sha...

Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations

by Da Li, Hancheng Wu, Michela Becchi
"... Abstract—The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational ..."
Abstract - Add to MetaCart
Abstract—The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In particular, we focus on recursive algorithms operating on trees and graphs. We propose different parallelization templates aimed at increasing the GPU utilization of these codes. Specifically, we investigate mechanisms to effectively distribute irregular work to streaming multiprocessors and GPU cores. Some of our parallelization templates rely on dynamic parallelism, a feature recently introduced by Nvidia in their Kepler GPUs and announced as part of the OpenCL 2.0 standard. We propose mechanisms to maximize the work performed by nested kernels and minimize the overhead due to their invocation. Our results show that the use of our parallelization templates on applications with irregular nested loops can lead to a 2-6x speedup over baseline GPU codes that do not include load balancing mechanisms. The use of nested parallelism-based parallelization templates on recursive tree traversal algorithms can lead to substantial speedups (up to 15-24x) over optimized CPU implementations. However, the benefits of nested parallelism are still unclear in the presence of recursive applications operating on graphs, especially when recursive code variants require expensive synchronization. In these cases, a flat parallelization of iterative versions of the considered algorithms may be preferable.
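For readers unfamiliar with the dynamic-parallelism feature these templates build on, a minimal device-side launch looks roughly as follows. This is a generic sketch, not one of the paper's templates: rowPtr/col describe an assumed CSR graph and threshold is an assumed tuning knob. Compile with nvcc -rdc=true for a compute capability 3.5+ GPU.

    // Child grid: processes one row's nested work in parallel.
    __global__ void child(const int* col, int begin, int end, int row) {
        int j = begin + blockIdx.x * blockDim.x + threadIdx.x;
        if (j < end) { /* process edge (row, col[j]) */ }
    }

    // Parent grid: one thread per row; spawns a child grid only when the
    // row has enough work to amortize the device-side launch overhead.
    __global__ void parent(const int* rowPtr, const int* col, int nRows,
                           int threshold) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= nRows) return;
        int begin = rowPtr[row], end = rowPtr[row + 1], len = end - begin;
        if (len > threshold) {
            child<<<(len + 127) / 128, 128>>>(col, begin, end, row);
        } else {
            // Too little nested work: process serially in the parent thread.
            for (int j = begin; j < end; ++j) { /* process edge (row, col[j]) */ }
        }
    }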

Citation Context

...ources. Unfortunately, the effective use of this feature has yet to be understood: the invocation of nested kernels can incur significant overheads (due to parameter parsing, queueing, and scheduling) [1, 2] and may be beneficial only if the amount of work spawned is substantial. In this work, we consider two categories of computational patterns involving nested parallelism: applications containing paral...

Free Launch: Optimizing GPU Dynamic Kernel Launches through Thread Reuse

by Guoyang Chen, Xipeng Shen
"... Supporting dynamic parallelism is important for GPU to benefit a broad range of applications. There are cur-rently two fundamental ways for programs to exploit dy-namic parallelism on GPU: a software-based approach with software-managed worklists, and a hardware-based approach through dynamic subker ..."
Abstract - Add to MetaCart
Supporting dynamic parallelism is important for GPUs to benefit a broad range of applications. There are currently two fundamental ways for programs to exploit dynamic parallelism on GPUs: a software-based approach with software-managed worklists, and a hardware-based approach through dynamic subkernel launches. Neither is satisfactory. The former is complicated to program and is often subject to some load imbalance; the latter suffers large runtime overhead. In this work, we propose free launch, a new software approach to overcoming the shortcomings of both methods. It allows programmers to use subkernel launches to express dynamic parallelism. It employs a novel compiler-based code transformation named subkernel launch removal to replace the subkernel launches with the reuse of parent threads. Coupled with an adaptive task assignment mechanism, the transformation reassigns the tasks in the subkernels to the parent threads with a good load balance. The technique requires no hardware extensions and is immediately deployable on existing GPUs. It keeps the programming convenience of the subkernel launch-based approach while avoiding its large runtime overhead. Meanwhile, its superior load balancing makes it outperform manual worklist-based techniques by 3X on average.
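A hand-written, two-pass approximation of that reuse idea looks roughly like this (free launch performs the transformation automatically and within a single kernel; the worklist, the CSR inputs, and all names here are illustrative):

    // Global task counter; must be zeroed from the host (e.g., via
    // cudaMemcpyToSymbol) before launching collectTasks.
    __device__ int taskCount;

    // Pass 1: where the original parent issued a subkernel launch per
    // discovered task, it now just enqueues the task instead.
    __global__ void collectTasks(const int* rowPtr, int nRows,
                                 int* taskQueue, int capacity) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int row = tid; row < nRows; row += stride) {
            if (rowPtr[row + 1] > rowPtr[row]) {      // row has nested work
                int slot = atomicAdd(&taskCount, 1);  // claim a queue slot
                if (slot < capacity) taskQueue[slot] = row;
            }
        }
    }

    // Pass 2: the queued tasks are striped evenly over long-lived threads,
    // so no device-side launch is ever issued.
    __global__ void drainTasks(const int* rowPtr, const int* col,
                               const int* taskQueue, int nTasks) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int t = tid; t < nTasks; t += stride) {
            int row = taskQueue[t];
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j) {
                /* process edge (row, col[j]) */
            }
        }
    }

Because load balance comes from the even assignment of queued tasks rather than from per-task launches, the launch overhead disappears while the launch-style programming model is preserved.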

Citation Context

...parent threads. For the nature of massive parallelism of GPU, each of the three components could take some substantial amount of time, causing overhead sometimes even greater than the workload itself [4, 5, 6]. So despite being intuitive to use, this method has received only little practical usage. Some recent hardware extensions have been proposed to help alleviate the problem [5, 7]. They are yet to be a...

NUPAR: A Benchmark Suite for Modern GPU Architectures

by unknown authors
"... Heterogeneous systems consisting of multi-core CPUs, Graphics Processing Units (GPUs) and many-core accelera-tors have gained widespread use by application developers and data-center platform developers. Modern day heteroge-neous systems have evolved to include advanced hardware and software feature ..."
Abstract - Add to MetaCart
Heterogeneous systems consisting of multi-core CPUs, Graphics Processing Units (GPUs) and many-core accelerators have gained widespread use by application developers and data-center platform developers. Modern day heterogeneous systems have evolved to include advanced hardware and software features to support a spectrum of application patterns. Heterogeneous programming frameworks such as CUDA, OpenCL, and OpenACC have all introduced new interfaces to enable developers to utilize new features on these platforms. In emerging applications, performance optimization is not only limited to effectively exploiting data-level parallelism, but includes leveraging new degrees of concurrency and parallelism to accelerate the entire application. To aid hardware architects and application developers in ...
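One such feature is stream-level kernel concurrency (scheduled through Hyper-Q on Kepler-class GPUs, which the citation context below mentions). A minimal sketch of the pattern in CUDA follows; the kernel and sizes are placeholders, not NUPAR code:

    #include <cuda_runtime.h>

    __global__ void work(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * 2.0f + 1.0f;
    }

    int main() {
        const int nStreams = 4, n = 1 << 20;
        cudaStream_t streams[nStreams];
        float* buf[nStreams];
        for (int s = 0; s < nStreams; ++s) {
            cudaStreamCreate(&streams[s]);
            cudaMalloc(&buf[s], n * sizeof(float));
            // Independent launches in distinct streams may run concurrently.
            work<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
        }
        for (int s = 0; s < nStreams; ++s) {
            cudaStreamSynchronize(streams[s]);
            cudaFree(buf[s]);
            cudaStreamDestroy(streams[s]);
        }
        return 0;
    }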

Citation Context

... applications using CUDA on NVIDIA GPUs [34]. NUPAR, however, covers a larger spectrum of applications. CUDA-NP proposes a compiler-level solution to leverage nested parallelism for GPGPU applications [35]. Y. Liang et al. demonstrate a performance improvement over Hyper-Q using their technique which allows spatial and temporal multitasking on GPUs [8]. But they do not provide researchers with a set of...

Efficient Warp Execution in Presence of Divergence with Collaborative Context Collection

by Farzad Khorasani, Rajiv Gupta, Laxmi N. Bhuyan
"... GPU’s SIMD architecture is a double-edged sword con-fronting parallel tasks with control flow divergence. On the one hand, it provides a high performance yet power-efficient platform to accelerate applications via massive parallelism; however, on the other hand, irregularities induce inefficiencies ..."
Abstract - Add to MetaCart
GPU’s SIMD architecture is a double-edged sword confronting parallel tasks with control flow divergence. On the one hand, it provides a high-performance yet power-efficient platform to accelerate applications via massive parallelism; on the other hand, irregularities induce inefficiencies due to the warp’s lockstep traversal of all diverging execution paths. In this work, we present a software (compiler) technique named Collaborative Context Collection (CCC) that increases the warp execution efficiency when faced with thread divergence incurred either by different intra-warp task assignment or by intra-warp load imbalance. CCC collects the relevant registers of divergent threads in a warp-specific stack allocated in the fast shared memory, and restores them only when the perfect utilization of warp lanes becomes feasible. We propose code transformations to enable applicability of CCC to a variety of program segments with thread divergence. We also introduce optimizations to reduce the cost of CCC and to avoid device occupancy limitation or memory divergence. We have developed a framework that automates application of CCC to CUDA-generated intermediate PTX code. We evaluated CCC on real-world applications and multiple scenarios using synthetic programs. CCC improves the warp execution efficiency of real-world benchmarks by up to 56% and achieves an average speedup of 1.69x (maximum 3.08x).
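A heavily simplified sketch of the collection idea, assuming a single integer of "context" per task (CCC itself collects the live registers of divergent threads and is applied automatically at the PTX level; the task arrays, the per-round structure, and all names here are ours). Tasks arriving on a partially full warp are parked in a per-warp shared-memory stack and executed only once a full warp's worth is available:

    #define FULL_MASK 0xffffffffu

    __device__ void doTask(int task) { /* the divergent work */ }

    // Assumes blockDim.x <= 256 (8 warps) and tasks/valid arrays of size
    // gridDim.x * blockDim.x * rounds.
    __global__ void cccSketch(const int* tasks, const int* valid, int rounds) {
        __shared__ int stack[8 * 32];    // per-warp stacks of parked contexts
        __shared__ int top[8];           // per-warp stack depth
        int lane = threadIdx.x & 31, warp = threadIdx.x >> 5;
        if (lane == 0) top[warp] = 0;
        __syncwarp();

        for (int r = 0; r < rounds; ++r) {
            int idx = (blockIdx.x * blockDim.x + threadIdx.x) * rounds + r;
            int task = tasks[idx];
            bool has = valid[idx] != 0;
            unsigned act = __ballot_sync(FULL_MASK, has);
            int nAct = __popc(act);
            unsigned below = (1u << lane) - 1;

            if (top[warp] + nAct >= 32) {       // warp-uniform condition
                // Enough parked + fresh tasks for a full warp: idle lanes pop.
                if (!has) {
                    int idleRank = __popc(~act & below);
                    task = stack[warp * 32 + top[warp] - 1 - idleRank];
                }
                __syncwarp();
                if (lane == 0) top[warp] -= 32 - nAct;
                doTask(task);                   // all 32 lanes run together
            } else {
                // Not enough work for a full warp: park the fresh contexts.
                if (has) {
                    int rank = __popc(act & below);
                    stack[warp * 32 + top[warp] + rank] = task;
                }
                __syncwarp();
                if (lane == 0) top[warp] += nAct;
            }
            __syncwarp();                       // make the new top visible
        }
        // Leftover parked contexts would be drained once here (omitted).
    }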

Citation Context

... the code blocks inside a loop with varying trip-count and improve the performance; however, unlike CCC, the solution does not guarantee full warp execution efficiency. Yang and Zhou created CUDA-NP [37], a source-to-source compiler that transforms GPU codes with parallel sections using the idea of master and slave threads. However, fixed number of slave threads for a master thread can hurt the perfo...
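The master/slave structure this context describes can be illustrated by hand as follows (CUDA-NP generates such code automatically from pragma-annotated kernels; SLAVES, the CSR inputs, and the loop body here are our placeholders):

    #define SLAVES 8   // fixed slave count per master, as the context notes

    __global__ void npSketch(const int* rowPtr, const int* col,
                             float* out, int nRows) {
        int gtid = blockIdx.x * blockDim.x + threadIdx.x;
        int master = gtid / SLAVES;   // index of the logical master thread
        int lane = gtid % SLAVES;     // this thread's slot in the group
        if (master >= nRows) return;

        // The nested parallel loop: its iterations are split across the
        // master's group instead of running sequentially in one thread.
        float partial = 0.0f;
        for (int j = rowPtr[master] + lane; j < rowPtr[master + 1]; j += SLAVES)
            partial += (float)col[j]; // stand-in for the real loop body
        atomicAdd(&out[master], partial); // out zero-initialized on the host
    }

The fixed group size is exactly the limitation the context points out: rows with far fewer (or far more) than SLAVES iterations under- or over-subscribe their group.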
