Results 1 - 10 of 41
The Direct3D 10 system
- ACM Trans. Graph
Cited by 74 (1 self)
We present a system architecture for the 4th generation of PC-class programmable graphics processing units (GPUs). The new pipeline features significant additions and changes to the prior generation pipeline including a new programmable stage capable of generating additional primitives and streaming primitive data to memory, an expanded, common feature set for all of the programmable stages, generalizations to vertex and image memory resources, and new storage formats. We also describe structural modifications to the API, runtime, and shading language to complement the new pipeline. We motivate the design with descriptions of frequently encountered obstacles in current systems. Throughout the paper we present rationale behind prominent design choices and alternatives that were ultimately rejected, drawing on insights collected during a multi-year collaboration with application developers and hardware designers.
Optimization of Mesh Locality for Transparent Vertex Caching
Cited by 65 (0 self)
Bus traffic between the graphics subsystem and memory can become a bottleneck when rendering geometrically complex meshes. In this paper, we investigate the use of vertex caching to transparently reduce geometry bandwidth. Use of an indexed triangle strip representation permits application programs to animate the meshes at video rates, and provides backward compatibility on legacy hardware. The efficiency of vertex caching is maximized by reordering the faces in the mesh during a preprocess. We present two reordering techniques, a fast greedy strip-growing algorithm and a local optimization algorithm. The strip-growing algorithm performs lookahead simulations of the cache to adapt strip lengths to the cache capacity. The local optimization algorithm improves this initial result by exploring a set of perturbations to the face ordering. The resulting cache miss rates are comparable to the efficiency of the earlier mesh buffer scheme described by Deering and Chow, even though the vertex cache is not actively managed.
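The cache behaviour this abstract describes can be illustrated with a small simulation. The sketch below is not the paper's reordering algorithm; it only counts vertex-cache misses for a given face order, assuming a FIFO replacement policy and a hypothetical default capacity of 16 entries. The names `cache_misses` and `strip_faces` are illustrative.

```python
from collections import deque


def cache_misses(triangles, cache_size=16):
    """Count vertex-cache misses for an ordered list of (i, j, k) index triples.

    Assumes a FIFO cache holding `cache_size` vertex indices (an assumption
    made for this sketch, not a detail taken from the abstract).
    """
    cache = deque(maxlen=cache_size)  # the oldest entry is evicted automatically
    misses = 0
    for tri in triangles:
        for v in tri:
            if v not in cache:
                misses += 1
                cache.append(v)
    return misses


def strip_faces(n):
    """Faces of a single triangle strip: (0, 1, 2), (1, 2, 3), ..."""
    return [(i, i + 1, i + 2) for i in range(n)]
```

Comparing `cache_misses(faces)` for a strip-ordered mesh against the same faces in shuffled order gives a rough feel for how much the face ordering matters; the lower bound is one miss per distinct vertex.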
LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware
- in SC ’05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing
, 2005
Cited by 44 (4 self)
We present a novel algorithm to solve dense linear systems using graphics processors (GPUs). We reduce matrix decomposition and row operations to a series of rasterization problems on the GPU. These include new techniques for streaming index pairs, swapping rows and columns, and parallelizing the computation to utilize multiple vertex and fragment processors. We also use appropriate data representations to match the rasterization order and cache technology of graphics processors. We have implemented our algorithm on different GPUs and compared the performance with optimized CPU implementations. In particular, our implementation on an NVIDIA GeForce 7800 GPU outperforms a CPU-based ATLAS implementation. Moreover, our results show that our algorithm is cache and bandwidth efficient and scales well with the number of fragment processors within the GPU and the core GPU clock rate. We use our algorithm for fluid flow simulation and demonstrate that the commodity GPU is a useful co-processor for many scientific applications.
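For readers unfamiliar with the row operations mentioned in this abstract, a plain CPU reference may help. The sketch below is ordinary Gaussian elimination with partial pivoting followed by back substitution, written in pure Python; it does not reproduce the paper's mapping to rasterization, and the function name `solve_dense` is hypothetical.

```python
def solve_dense(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (CPU sketch)."""
    n = len(A)
    # Work on an augmented copy so the caller's data is left untouched.
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        # Pivot search: choose the row with the largest magnitude in column k.
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        if M[p][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]  # row swap
        # Row operations: eliminate column k below the pivot.
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x
```

The three inner steps (pivot search, row swap, elimination) are the kinds of operations the abstract says are recast as rasterization problems; here they are simply nested loops.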
The Design of a Parallel Graphics Interface
, 1998
Cited by 40 (6 self)
It has become increasingly difficult to drive a modern high-performance graphics accelerator at full speed with a serial immediate-mode graphics interface. To resolve this problem, retained-mode constructs have been integrated into graphics interfaces. While retained-mode constructs provide a good solution in many cases, at times they provide an undesirable interface model for the application programmer, and in some cases they do not solve the performance problem. In order to resolve some of these cases, we present a parallel graphics interface that may be used in conjunction with the existing API as a new paradigm for high-performance graphics applications.
Shadow silhouette maps
- ACM SIGGRAPH
, 2003
Cited by 36 (2 self)
A memory model for scientific algorithms on graphics processors
- in Proc. of the ACM/IEEE Conference on Supercomputing (SC’06)
, 2006
Cited by 35 (3 self)
We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. We incorporate many characteristics of GPU architectures, including smaller cache sizes and 2D block representations, and use the 3C’s model to analyze the cache misses. Moreover, we present techniques to improve the performance of nested loops on GPUs. In order to demonstrate the effectiveness of our model, we highlight its performance on three memory-intensive scientific applications – sorting, fast Fourier transform and dense matrix-multiplication. In practice, our cache-efficient algorithms for these applications are able to achieve memory throughput of 30–50 GB/s on an NVIDIA 7900 GTX GPU. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we are able to achieve a 2–5× performance improvement.
Order-independent texture synthesis
, 2002
Cited by 35 (3 self)
Search-based texture synthesis algorithms are sensitive to the order in which texture samples are generated; different synthesis orders yield different textures. Unfortunately, most polygon rasterizers and ray tracers do not guarantee the order with which surfaces are sampled. To circumvent this problem, textures are synthesized beforehand at some maximum resolution and rendered using texture mapping. We describe a search-based texture synthesis algorithm in which samples can be generated in arbitrary order, yet the resulting texture remains identical. The key to our algorithm is a pyramidal representation in which each texture sample depends only on a fixed number of neighboring samples at each level of the pyramid. The bottom (coarsest) level of the pyramid consists of a noise image, which is small and predetermined. When a sample is requested by the renderer, all samples on which it depends are generated at once. Using this approach, samples can be generated in any order. To make the algorithm efficient, we propose storing texture samples and their dependents in a pyramidal cache. Although the first few samples are expensive to generate, there is substantial reuse, so subsequent samples cost less. Fortunately, most rendering algorithms exhibit good coherence, so cache reuse is high.
GPU-ABiSort: Optimal parallel sorting on stream architectures
- In Proceedings of the 20th IEEE International Parallel and Distributed Processing Symposium (IPDPS ’06) (Apr
, 2006
Cited by 32 (0 self)
In this paper, we present a novel approach for parallel sorting on stream processing architectures. It is based on adaptive bitonic sorting. For sorting n values utilizing p stream processor units, this approach achieves the optimal time complexity O((n log n)/p). This makes our approach competitive with common sequential sorting algorithms from a theoretical viewpoint, and it is also very fast in practice. This is achieved by using efficient linear stream memory accesses and by combining the optimal time approach with algorithms optimized for small input sequences. We present an implementation on modern programmable graphics hardware (GPUs). On recent GPUs, our optimal parallel sorting approach has been shown to be remarkably faster than sequential sorting on the CPU, and it is also faster than previous non-optimal sorting approaches on the GPU for sufficiently large input sequences. Because of the excellent scalability of our algorithm with the number of stream processor units p (up to n / log² n or even n / log n units, depending on the stream architecture), our approach profits heavily from the trend toward increasing numbers of fragment processor units on GPUs, so that we can expect further speed improvement with upcoming GPU generations.
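As background for this abstract, the sketch below shows a plain (non-adaptive) bitonic sorting network in sequential Python. It is not the adaptive variant the paper builds on: the plain network performs O(n log² n) comparisons, whereas the abstract claims O(n log n) total work for the adaptive approach. The function name `bitonic_sort` and the power-of-two length restriction are choices made for this illustration.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a plain bitonic network."""
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2  # size of the bitonic sequences being merged in this stage
    while k <= n:
        j = k // 2  # distance between compared elements in this pass
        while j >= 1:
            # Every iteration of this loop is independent of the others,
            # which is what makes the network attractive for parallel hardware.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Each pass over `i` touches disjoint pairs, so on a stream architecture a pass could in principle be issued as one data-parallel step; here it is just a loop.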
Prefetching in a texture cache architecture
- SIGGRAPH / Eurographics Workshop on Graphics Hardware
, 1998
Cited by 32 (3 self)
Texture mapping has become so ubiquitous in real-time graphics hardware that many systems are able to perform filtered texturing without any penalty in fill rate. The computation rates available in hardware have been outpacing the memory access rates, and texture systems are becoming constrained by memory bandwidth and latency. Caching in conjunction with prefetching can be used to alleviate this problem. In this paper, we introduce a prefetching texture cache architecture designed to take advantage of the access characteristics of texture mapping. The structures needed are relatively simple and are amenable to high clock rates. To quantify the robustness of our architecture, we identify a set of six scenes whose texture locality varies over nearly two orders of magnitude and a set of four memory systems with varying bandwidths and latencies. Through the use of a cycle-accurate simulation, we demonstrate that even in the presence of a high-latency memory system, our architecture can attain at least 97% of the performance of a zero-latency memory system.
Graphics for the Masses: A Hardware Rasterization Architecture for Mobile Phones
- ACM Transactions on Graphics
, 2003
Cited by 29 (5 self)
The mobile phone is one of the most widespread devices with rendering capabilities. Those capabilities have been very limited because the resources on such devices are extremely scarce: small amounts of memory, little bandwidth, little chip area dedicated for special purposes, and limited power consumption. The small display resolutions present a further challenge: the angle subtended by a pixel is relatively large, and therefore reasonably high-quality rendering is needed to generate high-fidelity images.

