by Stephen P. Crago, Alvin M. Despain
http://www.east.isi.edu/~crago/tr-96-09.ps
Abstract:
Large cache block sizes are used to take advantage of spatial locality and to amortize long memory latency over more words. However, the cost of large cache block sizes is increased memory traffic, especially for applications that show poor spatial locality. Software prefetching is usually presumed to increase memory traffic. We present an architecture that uses a separate processor devoted to prefetching, which improves execution time and at the same time allows the cache block size to be reduced, thereby reducing memory traffic. Simulation results show that our architecture reduces traffic at the microprocessor chip boundary by between 15% and 67% while reducing execution time by up to 68% for eight scientific and signal processing benchmarks.
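The block-size trade-off described in the abstract can be illustrated with a minimal sketch. This is not the authors' simulator and it does not model the prefetch processor. The direct-mapped cache model, the 4 KB capacity, the 8-byte word size, and the two synthetic address traces are all assumptions chosen for illustration.

```python
def memory_traffic(addresses, block_size, capacity_bytes=4096):
    """Bytes fetched from memory by a hypothetical direct-mapped cache.

    Every miss transfers one whole block, so traffic = misses * block_size.
    """
    num_blocks = capacity_bytes // block_size
    tags = [None] * num_blocks          # block number held by each cache line
    misses = 0
    for addr in addresses:
        block = addr // block_size      # which memory block the byte lives in
        index = block % num_blocks      # direct-mapped placement
        if tags[index] != block:        # miss: fetch the entire block
            tags[index] = block
            misses += 1
    return misses * block_size


WORD = 8    # assumed word size in bytes
N = 1024    # number of word accesses in each synthetic trace

# Good spatial locality: consecutive words.
sequential = [i * WORD for i in range(N)]
# Poor spatial locality: one word out of every 16 (a 128-byte stride).
strided = [i * 16 * WORD for i in range(N)]

for block_size in (8, 32, 128):
    print(block_size,
          memory_traffic(sequential, block_size),
          memory_traffic(strided, block_size))
```

Under these assumptions, the sequential trace moves 8192 bytes at every block size, because larger blocks only trade more bytes per miss for fewer misses. The strided trace uses the same 8192 bytes of data, but its traffic grows from 8192 bytes with 8-byte blocks to 131072 bytes with 128-byte blocks. Reducing the block size avoids this kind of wasted traffic.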