Results 1 - 10
of
224
A survey of general-purpose computation on graphics hardware
, 2007
"... The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, have made graphics hardware acompelling platform for computationally demanding tasks in awide variety of application domains. In this report, we describe, summarize, and analyze the l ..."
Abstract
-
Cited by 554 (15 self)
- Add to MetaCart
The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, have made graphics hardware acompelling platform for computationally demanding tasks in awide variety of application domains. In this report, we describe, summarize, and analyze the latest research in mapping general-purpose computation to graphics hardware. We begin with the technical motivations that underlie general-purpose computation on graphics processors (GPGPU) and describe the hardware and software developments that have led to the recent interest in this field. We then aim the main body of this report at two separate audiences. First, we describe the techniques used in mapping general-purpose computation to graphics hardware. We believe these techniques will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques. Second, we survey and categorize the latest developments in general-purpose application development on graphics hardware.
Larrabee: a many-core x86 architecture for visual computing
- In SIGGRAPH ’08: ACM SIGGRAPH 2008 papers
, 2008
"... Abstract 123 This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a manycore programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector proces ..."
Abstract
-
Cited by 279 (12 self)
- Add to MetaCart
Abstract 123 This paper presents a many-core visual computing architecture code named Larrabee, a new software rendering pipeline, a manycore programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die 2 nd level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling is performed entirely with software in Larrabee, rather than in fixed function logic. The customizable software graphics rendering pipeline for this
Interpreting the data: Parallel analysis with Sawzall.
- Scientific Programming Journal,
, 2005
"... Abstract Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because th ..."
Abstract
-
Cited by 273 (0 self)
- Add to MetaCart
(Show Context)
Abstract Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and so on. We present a system for automating such analyses. A filtering phase, in which a query is expressed using a new procedural programming language, emits data to an aggregation phase. Both phases are distributed over hundreds or even thousands of computers. The results are then collated and saved to a file. The design -including the separation into two phases, the form of the programming language, and the properties of the aggregators -exploits the parallelism inherent in having data and computation distributed across many machines.
Sequoia: Programming the Memory Hierarchy
, 2006
"... We present Sequoia, a programming language designed to facilitate the development of memory hierarchy aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and p ..."
Abstract
-
Cited by 156 (9 self)
- Add to MetaCart
We present Sequoia, a programming language designed to facilitate the development of memory hierarchy aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrate efficient performance running Sequoia programs on both of these platforms.
Exploiting coarse-grained task, data, and pipeline parallelism in stream programs,
- Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS-XII,
, 2006
"... Abstract As multicore architectures enter the mainstream, there is a pressing demand for high-level programming models that can effectively map to them. Stream programming offers an attractive way to expose coarse-grained parallelism, as streaming applications (image, video, DSP, etc.) are naturall ..."
Abstract
-
Cited by 133 (6 self)
- Add to MetaCart
(Show Context)
Abstract As multicore architectures enter the mainstream, there is a pressing demand for high-level programming models that can effectively map to them. Stream programming offers an attractive way to expose coarse-grained parallelism, as streaming applications (image, video, DSP, etc.) are naturally represented by independent filters that communicate over explicit data channels. In this paper, we demonstrate an end-to-end stream compiler that attains robust multicore performance in the face of varying application characteristics. As benchmarks exhibit different amounts of task, data, and pipeline parallelism, we exploit all types of parallelism in a unified manner in order to achieve this generality. Our compiler, which maps from the StreamIt language to the 16-core Raw architecture, attains a 11.2x mean speedup over a single-core baseline, and a 1.84x speedup over our previous work.
The Direct3D 10 system
- ACM Trans. Graph
"... We present a system architecture for the 4 th generation of PCclass programmable graphics processing units (GPUs). The new pipeline features significant additions and changes to the prior generation pipeline including a new programmable stage capable of generating additional primitives and streaming ..."
Abstract
-
Cited by 119 (1 self)
- Add to MetaCart
We present a system architecture for the 4 th generation of PCclass programmable graphics processing units (GPUs). The new pipeline features significant additions and changes to the prior generation pipeline including a new programmable stage capable of generating additional primitives and streaming primitive data to memory, an expanded, common feature set for all of the programmable stages, generalizations to vertex and image memory resources, and new storage formats. We also describe structural modifications to the API, runtime, and shading language to complement the new pipeline. We motivate the design with descriptions of frequently encountered obstacles in current systems. Throughout the paper we present rationale behind prominent design choices and alternatives that were ultimately rejected, drawing on insights collected during a multi-year collaboration with application developers and hardware designers.
Accelerator: using data parallelism to program GPUs for general-purpose uses
- in Proceedings of the 12th international conference on Architectural
, 2006
"... GPUs are difficult to program for general-purpose uses. Programmers can either learn graphics APIs and convert their applications to use graphics pipeline operations or they can use stream programming abstractions of GPUs. We describe Accelerator, a system that uses data parallelism to program GPUs ..."
Abstract
-
Cited by 117 (0 self)
- Add to MetaCart
(Show Context)
GPUs are difficult to program for general-purpose uses. Programmers can either learn graphics APIs and convert their applications to use graphics pipeline operations or they can use stream programming abstractions of GPUs. We describe Accelerator, a system that uses data parallelism to program GPUs for general-purpose uses instead. Programmers use a conventional imperative programming language and a library that provides only high-level data-parallel operations. No aspects of GPUs are exposed to programmers. The library implementation compiles the data-parallel operations on the fly to optimized GPU pixel shader code and API calls. We describe the compilation techniques used to do this. We evaluate the effectiveness of using data parallelism to program GPUs by providing results for a set of compute-intensive benchmarks. We compare the performance of Accelerator versions of the benchmarks against hand-written pixel shaders. The speeds of the Accelerator versions are typically within 50 % of the speeds of hand-written pixel shader code. Some benchmarks significantly outperform C versions on a CPU: they are up to 18 times faster than C code running on a CPU.
Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication
, 2004
"... Utilizing graphics hardware for general purpose numerical computations has become a topic of considerable interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse of input data, has been widely explored on GPUs. We relax the streaming model&a ..."
Abstract
-
Cited by 95 (1 self)
- Add to MetaCart
Utilizing graphics hardware for general purpose numerical computations has become a topic of considerable interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse of input data, has been widely explored on GPUs. We relax the streaming model's constraint on input reuse and perform an in-depth analysis of dense matrix-matrix multiplication, which reuses each element of input matrices O(n) times. Its regular data access pattern and highly parallel computational requirements suggest matrix-matrix multiplication as an obvious candidate for efficient evaluation on GPUs but, surprisingly we find even nearoptimal GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches. We find the key cause of this inefficiency is that the GPU can fetch less data and yet execute more arithmetic operations per clock than the CPU when both are operating out of their closest caches. The lack of high bandwidth access to cached data will impair the performance of GPU implementations of any computation featuring significant input reuse.
RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing
- ACM Trans. Graph
, 2005
"... with shadows and refractions), a Conference room (5.5 fps, without shadows), reflective and refractive Spheres-RT in an office (4.5 fps), and UT2003 a scene from a current computer game (7.5 fps, precomputed illumination). Recursive ray tracing is a simple yet powerful and general approach for accur ..."
Abstract
-
Cited by 93 (4 self)
- Add to MetaCart
with shadows and refractions), a Conference room (5.5 fps, without shadows), reflective and refractive Spheres-RT in an office (4.5 fps), and UT2003 a scene from a current computer game (7.5 fps, precomputed illumination). Recursive ray tracing is a simple yet powerful and general approach for accurately computing global light transport and rendering high quality images. While recent algorithmic improvements and optimized parallel software implementations have increased ray tracing performance to realtime levels, no compact and programmable hardware solution has been available yet. This paper describes the architecture and a prototype implementation of a single chip, fully programmable Ray Processing Unit (RPU). It combines the flexibility of general purpose CPUs with the efficiency of current GPUs for data parallel computations. This design allows for realtime ray tracing of dynamic scenes with programmable material, geometry, and illumination shaders. Although, running at only 66 MHz the prototype FPGA implementation already renders images at up to 20 frames per second, which in many cases beats the performance of highly optimized software running on multi-GHz desktop CPUs. The performance and efficiency of the proposed architecture is analyzed using a variety of benchmark scenes.
A Performance Study of General-Purpose Applications on Graphics Processors Using CUDA
"... Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of generalpurpose applications compared to contempora ..."
Abstract
-
Cited by 86 (7 self)
- Add to MetaCart
(Show Context)
Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of generalpurpose applications compared to contemporary general-purpose processors (CPUs). This paper uses NVIDIA’s C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU. GPU performance is compared to both single-core and multicore CPU performance, with multicore CPU implementations written using OpenMP. The paper also discusses advantages and inefficiencies of the CUDA programming model and some desirable features that might allow for greater ease of use and also more readily support a larger body of applications.