  Reducing the Traffic of Loop-Based Programs Using a Prefetch Processor

by Stephen P. Crago, Alvin M. Despain
http://www.east.isi.edu/~crago/tr-96-09.ps

Abstract:

Large cache block sizes are used to take advantage of spatial locality and to amortize long memory latency over more words. However, the cost of large cache block sizes is increased memory traffic, especially for applications that show poor spatial locality. Software prefetching is usually presumed to increase memory traffic. We present an architecture that uses a separate processor devoted to prefetching, which improves execution time and at the same time allows the cache block size to be reduced, thereby reducing memory traffic. Simulation results show that our architecture reduces traffic at the microprocessor chip boundary by between 15% and 67% while reducing execution time by up to 68% for eight scientific and signal processing benchmarks.
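
The block-size trade-off described in the abstract can be illustrated with a rough sketch. This is not the authors' simulator or their prefetch-processor architecture. It is an idealized model that assumes an infinite cache with cold misses only, 8-byte words, and two hypothetical access patterns (unit stride and a 16-word stride). The function name `traffic_bytes` and all constants are assumptions made for illustration.

```python
def traffic_bytes(addresses, block_size):
    """Bytes fetched from memory by an idealized infinite cache.

    Assumption: every distinct block touched is fetched exactly once
    (cold misses only, no prefetching, no evictions).
    """
    blocks = {addr // block_size for addr in addresses}
    return len(blocks) * block_size


WORD = 8   # bytes per element (assumed)
N = 1024   # number of elements accessed (assumed)

# Unit-stride walk: good spatial locality.
sequential = [i * WORD for i in range(N)]
# Stride of 16 words (128 bytes): poor spatial locality.
strided = [i * 16 * WORD for i in range(N)]

# Both patterns use the same N * WORD = 8192 useful bytes.
for block in (16, 32, 64, 128):
    print(block, traffic_bytes(sequential, block), traffic_bytes(strided, block))
```

Under these assumptions the unit-stride walk fetches 8192 bytes at every block size, while the strided walk's traffic grows in proportion to the block size (16384 bytes at 16-byte blocks, 131072 bytes at 128-byte blocks). That is the sense in which smaller blocks reduce traffic for code with poor spatial locality, provided latency is hidden by some other means such as prefetching.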
