Download:
by Lixin Zhang, John B. Carter, Wilson C. Hsieh, Sally A. McKee
In Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
http://www.cs.utah.edu/techreports/1999/pdf/UUCS-99-002.pdf
Abstract:
Processor speeds are increasing rapidly, but memory speeds are not keeping pace. Image processing is an important application domain that is particularly impacted by this growing performance gap. Image processing algorithms tend to have poor memory locality because they access their data in a non-sequential fashion and reuse that data infrequently. As a result, they often exhibit poor cache and TLB hit rates on conventional memory systems, which limits overall performance. Most current approaches to addressing the memory bottleneck focus on modifying cache organizations or introducing processor-based prefetching. The Impulse memory system takes a different approach: allowing application software to control how, when, and where data are loaded into a conventional processor cache. Impulse does this by letting software configure how the memory controller interprets the physical addresses exported by a processor. Introducing an extra level of address translation in the memory controller enables an application to dynamically change how its data are fetched from memory. Data that are sparse in memory can be accessed densely, which improves both cache and TLB utilization, and Impulse hides memory latency by prefetching data within the memory controller. We describe how Impulse improves the performance of three image processing algorithms: volume rendering, image warping, and image filtering. We find that for these codes, an Impulse memory system yields speedups of 40% to 226% over an otherwise identical machine with a conventional memory system.
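The idea of accessing sparse data densely through an extra level of address translation can be illustrated with a small software model. This is only a sketch under stated assumptions: the abstract does not describe Impulse's actual interface, so the names `make_strided_remap` and `gather_line`, and the choice of a strided (column-walk) remapping, are hypothetical and chosen purely for illustration.

```python
# Hypothetical model of controller-side remapping; NOT the Impulse API.
# A dense "shadow" offset is translated to a sparse (strided) physical address.

def make_strided_remap(base, stride, elem_size):
    """Return a function mapping a dense shadow byte offset to a physical address."""
    def translate(shadow_offset):
        # Which element is being touched, and which byte inside it.
        index, within = divmod(shadow_offset, elem_size)
        return base + index * stride + within
    return translate

def gather_line(memory, translate, shadow_line_start, line_size):
    """Assemble one dense line by reading each remapped physical address."""
    return bytes(memory[translate(shadow_line_start + i)] for i in range(line_size))

# Example (assumed layout): a row-major 8x8 image of one-byte pixels.
WIDTH, HEIGHT = 8, 8
memory = bytes(range(WIDTH * HEIGHT))

# Walking down column 3 touches one byte per row in physical memory,
# but the remapped view presents those bytes contiguously.
column3 = make_strided_remap(base=3, stride=WIDTH, elem_size=1)
line = gather_line(memory, column3, shadow_line_start=0, line_size=HEIGHT)
print(list(line))  # [3, 11, 19, 27, 35, 43, 51, 59]
```

In this model a single dense line holds eight useful pixels, whereas a conventional column walk would pull in one useful byte per fetched line; that is the cache-utilization effect the abstract refers to, shown here only in software.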
Citations
Cited by | Title | Authors | Year
680 | Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and | Jouppi | 1990
371 | Fast volume rendering using a shear–warp factorization of the viewing transformation | Lacroute, Levoy | 1994
359 | The Tera Computer System | Alverson, Callahan, et al. | 1990
322 | Digital Image Warping | Wolberg | 1990
165 | Evaluating Stream Buffers as a Secondary Cache Replacement | Palacharla, Kessler | 1994
159 | Effective Hardware-based Data Prefetching for High-performance Processors | Chen, Baer | 1995
158 | Memory bandwidth limitations of future microprocessors | Burger, Goodman, et al. | 1996
115 | Synchronization and communication in the T3E multiprocessor | Scott | 1996
108 | Interactive ray tracing for isosurface rendering | Parker, Shirley, et al. | 1998
97 | Simultaneous multithreading: a platform for next-generation processors | Eggers, Emer, et al. | 1997
88 | A bandwidth-efficient architecture for media processing | Rixner, Dally, et al. | 1998
75 | A case for two-way skewed-associative caches | Seznec | 1993
67 | Memory System Design Considerations for Dynamically-Scheduled Processors | Farkas, Chow, et al. | 1997
63 | Impulse: Building a smarter memory controller | Carter, Hsieh, et al. | 1999
53 | Sequential Hardware Prefetching in Shared-Memory Multiprocessors | Dahlgren, Dubois, et al. | 1995
41 | The impact of instruction-level parallelism on multiprocessor performance and simulation methodology | Pai, Ranganathan, et al. | 1997
39 | 3-D Transformations of Images in Scanline Order | Catmull, Smith | 1980
39 | Increasing TLB Reach Using Superpages Backed by Shadow Memory | Swanson, Stoller, et al. | 1998
38 | A case for intelligent RAM: IRAM | Patterson, Anderson, et al. | 1997
32 | Data relocation and prefetching for programs with large data sets | Yamada, Gyllenhaal, et al. | 1994
28 | Command Vector Memory Systems: High Performance at Low Cost | Corbal, Espasa, et al. | 1998
25 | Active pages: a model of computation for intelligent memory | Oskin, Chong, et al. | 1998
19 | Image Processing for Computer Graphics | Gomes, Velho | 1997
17 | Paint: PA instruction set interpreter | Stoller, Kuramkote, et al. | 1996
17 | Design and evaluation of dynamic access ordering hardware | McKee | 1996
15 | Architectural Adaptation for Application-Specific Locality Optimizations | Zhang, Dasdan, et al. | 1997
14 | Design and evaluation of dynamic access ordering hardware | McKee, et al. | 1996
13 | Performance Study of a Concurrent Multithreaded Processor | Tsai, Jiang, et al. | 1998
12 | A new memory system design for commercial and technical computing products | Hotchkiss, Marschke, et al. | 1996
10 | Simultaneous Multithreading: A Platform for Next-Generation Processors | Eggers, et al. | 1997