Results 1 - 10 of 554
Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA
- PPOPP'08, 2008
"... GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of tradition ..."
Cited by 215 (11 self)
GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, the number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations, and we apply classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve between 10.5X and 457X speedup in kernel codes and between 1.16X and 431X total application speedup.
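The access-reordering strategy described above is easy to see in code. The following minimal CUDA sketch (kernel and variable names are illustrative, not from the paper) contrasts a strided access pattern with a coalesced one; on G80-class hardware the coalesced kernel is much faster because adjacent threads in a half-warp touch adjacent addresses, letting the hardware combine them into a single memory transaction.

    #include <cuda_runtime.h>

    // Strided access: thread t touches a[t * stride], so a half-warp
    // spans many memory segments and requests cannot be combined.
    __global__ void scale_strided(float *a, int n, int stride) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t * stride < n)
            a[t * stride] *= 2.0f;
    }

    // Coalesced access: thread t touches a[t], so a half-warp reads one
    // contiguous, aligned segment serviced as a single request.
    __global__ void scale_coalesced(float *a, int n) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < n)
            a[t] *= 2.0f;
    }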
StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures
- Concurrency and Computation: Practice and Experience, 2011
"... Abstract. In the field of HPC, the current hardware trend is to design multiprocessor architectures that feature heterogeneous technologies such as specialized coprocessors (e.g., Cell/BE SPUs) or data-parallel accelerators (e.g., GPGPUs). Approaching the theoretical performance of these architectu ..."
Cited by 172 (15 self)
In the field of HPC, the current hardware trend is to design multiprocessor architectures that feature heterogeneous technologies such as specialized coprocessors (e.g., Cell/BE SPUs) or data-parallel accelerators (e.g., GPGPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offloading parts of the computation. However, designing an execution model that unifies all computing units and their associated embedded memory remains a major challenge. We have thus designed STARPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of STARPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and to easily develop and tune powerful scheduling algorithms on the other. We have developed several strategies that can be selected seamlessly at run time, and we have demonstrated their efficiency by analyzing the impact of those scheduling policies on several classical linear algebra algorithms that take advantage of multiple cores and GPUs at the same time. In addition to substantial improvements in execution times, we obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine.
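StarPU's central abstraction, the codelet, bundles per-device implementations of one task so the runtime can pick a device at run time. The sketch below is a simplified example based on StarPU's documented C API; exact field names and helpers vary somewhat across StarPU versions, so treat it as a sketch rather than a reference.

    #include <starpu.h>

    /* CPU implementation of the task: scale a registered vector in place. */
    void scal_cpu(void *buffers[], void *cl_arg)
    {
        float *x = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        for (unsigned i = 0; i < n; i++)
            x[i] *= 2.0f;
    }

    /* A CUDA version (a function launching a kernel on
       starpu_cuda_get_local_stream()) would be listed in .cuda_funcs,
       and the scheduling policy would choose between the two. */
    struct starpu_codelet scal_cl = {
        .cpu_funcs = { scal_cpu },
        .nbuffers  = 1,
        .modes     = { STARPU_RW },
    };

    int main(void)
    {
        float x[1024] = { 0 };
        starpu_data_handle_t h;

        starpu_init(NULL);
        /* Register the vector; StarPU now manages its coherency across devices. */
        starpu_vector_data_register(&h, STARPU_MAIN_RAM, (uintptr_t)x,
                                    1024, sizeof(float));
        /* The selected scheduling policy decides where this task runs. */
        starpu_task_insert(&scal_cl, STARPU_RW, h, 0);
        starpu_data_unregister(h);   /* waits for the task, syncs data back */
        starpu_shutdown();
        return 0;
    }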
PacketShader: A GPU-Accelerated Software Router
"... We present PacketShader, a high-performance software router framework for general packet processing with Graphics Processing Unit (GPU) acceleration. PacketShader exploits the massively-parallel processing power of GPU to address the CPU bottleneck in current software routers. Combined with our high ..."
Cited by 165 (14 self)
We present PacketShader, a high-performance software router framework for general packet processing with Graphics Processing Unit (GPU) acceleration. PacketShader exploits the massively parallel processing power of GPUs to address the CPU bottleneck in current software routers. Combined with our high-performance packet I/O engine, PacketShader outperforms existing software routers by more than a factor of four, forwarding 64B IPv4 packets at 39 Gbps on a single commodity PC. We have implemented IPv4 and IPv6 forwarding, OpenFlow switching, and IPsec tunneling to demonstrate the flexibility and performance advantage of PacketShader. The evaluation results show that the GPU brings significantly higher throughput than the CPU-only implementation, confirming the effectiveness of GPUs for computation- and memory-intensive operations in packet processing.
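The pattern underlying this design is to batch packets and assign one GPU thread per packet, keeping thousands of packets in flight at once. Below is a minimal CUDA sketch of that pattern, not PacketShader's actual code and with the header layout reduced to the fields used: each thread decrements the TTL and recomputes the IPv4 header checksum for one packet in the batch.

    #include <stdint.h>

    // One 20-byte IPv4 header per packet, stored contiguously for the batch.
    __global__ void ipv4_forward(uint8_t *headers, int num_packets) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= num_packets) return;

        uint8_t *h = headers + p * 20;
        h[8]--;                          // decrement TTL (byte 8)

        h[10] = h[11] = 0;               // zero the checksum field (bytes 10-11)
        uint32_t sum = 0;
        for (int i = 0; i < 20; i += 2)  // sum the header as 16-bit big-endian words
            sum += ((uint32_t)h[i] << 8) | h[i + 1];
        while (sum >> 16)                // fold carries into the low 16 bits
            sum = (sum & 0xffff) + (sum >> 16);
        uint16_t ck = (uint16_t)~sum;
        h[10] = ck >> 8;                 // write checksum back big-endian
        h[11] = ck & 0xff;
    }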
GPUTeraSort: High Performance Graphics Coprocessor Sorting for Large Database Management
- SIGMOD, 2006
"... We present a new algorithm, GPUTeraSort, to sort billio-nrecord wide-key databases using a graphics processing unit (GPU) Our algorithm uses the data and task parallelism on the GPU to perform memory-intensive and compute-intensive tasks while the CPU is used to perform I/O and resource management. ..."
Cited by 153 (8 self)
We present a new algorithm, GPUTeraSort, to sort billion-record, wide-key databases using a graphics processing unit (GPU). Our algorithm uses the data and task parallelism of the GPU to perform memory-intensive and compute-intensive tasks while the CPU performs I/O and resource management. We therefore exploit both the high-bandwidth GPU memory interface and the lower-bandwidth CPU main memory interface, achieving higher memory bandwidth than purely CPU-based algorithms. GPUTeraSort is a two-phase task pipeline: (1) read disk, build keys, sort using the GPU, generate runs, write disk; and (2) read, merge, write. It also pipelines disk transfers and achieves near-peak I/O performance. We have tested the performance of GPUTeraSort on billion-record files using the standard Sort benchmark. In practice, a 3 GHz Pentium IV PC with a $265 NVIDIA 7800 GT GPU is significantly faster than optimized CPU-based algorithms on much faster processors, sorting 60GB for a penny, the best reported PennySort price-performance. These results suggest that a GPU coprocessor can significantly improve performance on large data processing tasks.
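GPUTeraSort's GPU phase ran through graphics APIs on pre-CUDA hardware, but the data-parallel sorting network it relies on is easiest to see in CUDA terms. Below is a sketch of one compare-and-swap pass of a bitonic sorting network (names are illustrative, not the paper's code); the full sort launches this kernel for k = 2, 4, ..., n and, for each k, for j = k/2 down to 1.

    // One pass of a bitonic sorting network over n = power-of-two keys.
    // Host loop: for (k = 2; k <= n; k <<= 1)
    //              for (j = k >> 1; j > 0; j >>= 1) launch bitonic_pass.
    __global__ void bitonic_pass(float *keys, int j, int k) {
        unsigned i   = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned ixj = i ^ j;                 // partner index for this pass
        if (ixj > i) {
            bool ascending = ((i & k) == 0);  // direction of this subsequence
            if ((keys[i] > keys[ixj]) == ascending) {
                float t = keys[i]; keys[i] = keys[ixj]; keys[ixj] = t;
            }
        }
    }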
Mars: A MapReduce Framework on Graphics Processors
"... We design and implement Mars, a MapReduce framework, on graphics processors (GPUs). MapReduce is a distributed programming framework originally proposed by Google for the ease of development of web search applications on a large number of CPUs. Compared with commodity CPUs, GPUs have an order of mag ..."
Cited by 134 (11 self)
We design and implement Mars, a MapReduce framework, on graphics processors (GPUs). MapReduce is a distributed programming framework originally proposed by Google to ease the development of web search applications on large numbers of CPUs. Compared with commodity CPUs, GPUs have an order of magnitude higher computational power and memory bandwidth, but are harder to program since their architectures are designed as special-purpose coprocessors and their programming interfaces are typically intended for graphics applications. As the first attempt to harness the GPU's power for MapReduce, we developed Mars on an NVIDIA G80 GPU, which contains hundreds of processors, and evaluated it against Phoenix, the state-of-the-art MapReduce framework on multi-core processors. Mars hides the programming complexity of the GPU behind the simple and familiar MapReduce interface. It is up to 16 times faster than its CPU-based counterpart for six common web applications on a quad-core machine. Additionally, we integrated Mars with Phoenix to perform co-processing between the GPU and the CPU for further performance improvement.
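The reason MapReduce maps well to a GPU is that the map stage is itself data-parallel: each thread can own one input record. The CUDA sketch below shows a map stage for a word-count-style job; Mars's real interface also handles variable-sized output via a counting pass plus a prefix sum, which this fixed-size sketch omits, and all names are illustrative.

    // Each thread applies the user's map function to one fixed-size record.
    // Mars handles variable-sized keys/values with a counting pass and a
    // prefix sum to compute output offsets; this sketch assumes fixed sizes.
    struct Record { int key; int value; };

    __device__ void map_func(const Record &in, Record &out) {
        out.key   = in.key;   // user-defined map logic goes here
        out.value = 1;        // e.g., emit (key, 1) as in word count
    }

    __global__ void map_stage(const Record *in, Record *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            map_func(in[i], out[i]);
    }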
Conservation Cores: Reducing the Energy of Mature Computations
"... Growing transistor counts, limited power budgets, and the breakdown of voltage scaling are currently conspiring to create a utilization wall that limits the fraction of a chip that can run at full speed at one time. In this regime, specialized, energy-efficient processors can increase parallelism by ..."
Cited by 89 (9 self)
Growing transistor counts, limited power budgets, and the breakdown of voltage scaling are currently conspiring to create a utilization wall that limits the fraction of a chip that can run at full speed at one time. In this regime, specialized, energy-efficient processors can increase parallelism by reducing the per-computation power requirements and allowing more computations to execute under the same power budget. To pursue this goal, this paper introduces conservation cores. Conservation cores, or c-cores, are specialized processors that focus on reducing energy and energy-delay instead of increasing performance. This focus on energy makes c-cores an excellent match for many applications that would be poor candidates for hardware acceleration (e.g., irregular integer codes). We present a toolchain for automatically synthesizing c-cores from application source code and demonstrate that they can significantly reduce energy and energy-delay for a wide range of applications. The c-cores support patching, a form of targeted reconfigurability, that allows them to adapt to new versions of the software they target. Our results show that conservation cores can reduce energy consumption by up to 16.0× for functions and by up to 2.1× for whole applications, while patching can extend the useful lifetime of individual c-cores to match that of conventional processors.
OpenVIDIA: Parallel GPU Computer Vision
- In ACM Multimedia, 2005
"... Graphics and vision are approximate inverses of each other: ordinarily Graphics Processing Units (GPUs) are used to convert \numbers into pictures " (i.e. computer graphics). In this paper, we propose using GPUs in approximately the reverse way: to assist in \converting pictures into numbers&qu ..."
Cited by 76 (1 self)
Graphics and vision are approximate inverses of each other: ordinarily, Graphics Processing Units (GPUs) are used to convert "numbers into pictures" (i.e., computer graphics). In this paper, we propose using GPUs in approximately the reverse way: to assist in "converting pictures into numbers" (i.e., computer vision). The OpenVIDIA project uses single or multiple graphics cards to accelerate image analysis and computer vision. It is a library and API aimed at providing a graphics hardware accelerated processing framework for image processing and computer vision. OpenVIDIA explores the creation of a parallel computer architecture consisting of multiple Graphics Processing Units (GPUs) built entirely from commodity hardware, and it uses multiple GPUs in parallel to operate as a general-purpose parallel computer architecture. It provides a simple API which implements some common computer vision algorithms. Many components can be used immediately, and because the project is open source, the code is intended to serve as templates and examples of how similar algorithms are mapped onto graphics hardware. Implemented are image processing techniques (Canny edge detection, filtering), image feature handling (identifying and matching features), and image registration, to name a few.
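OpenVIDIA implemented such operators as fragment programs on pre-CUDA hardware, but the parallelism it exploits is simply one thread (or fragment) per pixel. As a CUDA-style illustration of the filtering mentioned above, here is a 3x3 box filter; this is an analogy to the approach, not OpenVIDIA code.

    // 3x3 box filter: each thread averages the neighborhood of one pixel.
    // Border pixels are left untouched to keep the sketch short.
    __global__ void box3x3(const unsigned char *in, unsigned char *out,
                           int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1) return;

        int sum = 0;
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++)
                sum += in[(y + dy) * width + (x + dx)];
        out[y * width + x] = (unsigned char)(sum / 9);
    }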
A Memory Model for Scientific Algorithms on Graphics Processors
- In Proc. of the ACM/IEEE Conference on Supercomputing (SC'06), 2006
"... We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. We incorporate many characteristics ..."
Cited by 74 (4 self)
We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. We incorporate many characteristics of GPU architectures, including smaller cache sizes and 2D block representations, and use the 3C's model to analyze cache misses. Moreover, we present techniques to improve the performance of nested loops on GPUs. To demonstrate the effectiveness of our model, we highlight its performance on three memory-intensive scientific applications: sorting, fast Fourier transform, and dense matrix multiplication. In practice, our cache-efficient algorithms for these applications achieve memory throughput of 30–50 GB/s on an NVIDIA 7900 GTX GPU. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we achieve a 2–5× performance improvement.
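The 2D blocking this model advocates is the same idea behind tiled matrix multiplication: restructure the nested loops so each thread block works on a cache-sized 2D tile and reuses it before moving on. The paper targets pre-CUDA texturing hardware, so the standard CUDA sketch below (which assumes n is a multiple of the tile size) is an analogy rather than the paper's implementation.

    #define TILE 16

    // C = A * B for n x n row-major matrices, with A and B staged through
    // TILE x TILE shared-memory blocks so each element loaded from global
    // memory is reused TILE times. Assumes n is a multiple of TILE.
    __global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
        __shared__ float As[TILE][TILE], Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < n / TILE; t++) {
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();
            for (int k = 0; k < TILE; k++)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * n + col] = acc;
    }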
Relational Joins on Graphics Processors
- 2007
"... We present our novel design and implementation of relational join algorithms for new-generation graphics processing units (GPUs). The new features of such GPUs include support for writes to random memory locations, efficient inter-processor communication through fast shared memory, and a programming ..."
Cited by 74 (12 self)
We present our novel design and implementation of relational join algorithms for new-generation graphics processing units (GPUs). The new features of such GPUs include support for writes to random memory locations, efficient inter-processor communication through fast shared memory, and a programming model for general-purpose computing. Taking advantage of these new features, we design a set of data-parallel primitives such as scan, scatter and split, and use these primitives to implement indexed or non-indexed nested-loop, sort-merge and hash joins. Our algorithms utilize the high parallelism as well as the high memory bandwidth of the GPU and use parallel computation to effectively hide the memory latency. We have implemented our algorithms on a PC with an NVIDIA G80 GPU and an Intel P4 dual-core CPU. Our GPU-based algorithms are able to achieve 2–20 times higher performance than their CPU-based counterparts.
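Scan is the workhorse among these primitives: an exclusive scan over per-partition match counts yields the write offsets that scatter then uses, which is how split and the join phases place output without write conflicts. Below is a minimal CUDA sketch of a single-block inclusive scan (a full multi-block scan adds a pass over per-block totals, omitted here; names are illustrative, not the paper's code).

    // Hillis-Steele inclusive scan of one block's worth of values.
    // Launch as block_scan<<<blocks, threads, threads * sizeof(int)>>>(...).
    __global__ void block_scan(const int *in, int *out, int n) {
        extern __shared__ int tmp[];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        tmp[tid] = (i < n) ? in[i] : 0;
        __syncthreads();

        for (int offset = 1; offset < blockDim.x; offset <<= 1) {
            int v = (tid >= offset) ? tmp[tid - offset] : 0;
            __syncthreads();
            tmp[tid] += v;   // add the value from `offset` positions back
            __syncthreads();
        }
        if (i < n) out[i] = tmp[tid];
    }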
Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems
- Parallel Computing
"... a b s t r a c t We highlight the trends leading to the increased appeal of using hybrid multicore + GPU systems for high performance computing. We present a set of techniques that can be used to develop efficient dense linear algebra algorithms for these systems. We illustrate the main ideas with t ..."
Cited by 67 (20 self)
We highlight the trends leading to the increased appeal of using hybrid multicore + GPU systems for high performance computing. We present a set of techniques that can be used to develop efficient dense linear algebra algorithms for these systems. We illustrate the main ideas with the development of a hybrid LU factorization algorithm in which we split the computation over a multicore processor and a graphics processor, and use particular techniques to reduce the amount of pivoting and communication between the hybrid components. This results in an efficient algorithm with balanced use of a multicore processor and a graphics processor.
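The hybrid split alternates a small, latency-bound panel factorization on the CPU with large, bandwidth-bound trailing-matrix updates on the GPU. The heavily simplified host-side sketch below shows one iteration of such a loop; the cuBLAS calls are real API, but the surrounding structure is an illustration of the general technique, not the paper's code, and pivoting is omitted for brevity.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    // One right-looking step of blocked LU on an n x n column-major matrix
    // dA in GPU memory: the nb-wide panel at column j is copied to the host,
    // factored there (cpu_getrf stands in for LAPACK's dgetrf; pivoting is
    // omitted), copied back, and the trailing submatrix is updated on the
    // GPU with cuBLAS TRSM + GEMM, the GPU-friendly bulk of the work.
    void lu_step(cublasHandle_t h, double *dA, int n, int j, int nb,
                 void (*cpu_getrf)(double *a, int m, int k, int lda)) {
        int m = n - j;          // rows in the panel
        int t = n - j - nb;     // trailing-matrix width
        double *dPanel = dA + (size_t)j * n + j;  // A(j:n, j), column-major

        // 1) Panel to host, CPU factorization, panel back to device.
        double *hPanel = (double *)malloc((size_t)m * nb * sizeof(double));
        cublasGetMatrix(m, nb, sizeof(double), dPanel, n, hPanel, m);
        cpu_getrf(hPanel, m, nb, m);
        cublasSetMatrix(m, nb, sizeof(double), hPanel, m, dPanel, n);
        free(hPanel);

        if (t > 0) {
            const double one = 1.0, minus_one = -1.0;
            double *dU = dA + (size_t)(j + nb) * n + j;       // A(j, j+nb:n)
            double *dL = dPanel + nb;                         // A(j+nb:n, j)
            double *dT = dA + (size_t)(j + nb) * n + j + nb;  // A22
            // 2) Row block of U: solve L11 * U12 = A12 on the GPU.
            cublasDtrsm(h, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                        CUBLAS_OP_N, CUBLAS_DIAG_UNIT,
                        nb, t, &one, dPanel, n, dU, n);
            // 3) Trailing update: A22 -= L21 * U12.
            cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m - nb, t, nb,
                        &minus_one, dL, n, dU, n, &one, dT, n);
        }
    }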