Despite rapid increases in CPU performance, control and data hazards remain the primary obstacles to higher performance in contemporary processor organizations. Primary data cache misses are responsible for the majority of data hazards. Because primary cache sizes are limited by clock cycle time constraints, the performance of future CPUs will effectively be limited by the number of primary data cache misses whose penalty cannot be masked. To address this problem, this dissertation takes a detailed look at memory access patterns in complex, real-world programs. It introduces a simple memory reference pattern classification that applies to a broad range of computations, including pointer-intensive and numeric codes. To exploit the new classification, a data prefetch device called the Indirect Reference Buffer (IRB) is proposed. The IRB extends data prefetching to indirect memory address sequences while also handling dense scientific codes. It is distinguished from previous designs by its seamless integration of linear and indirect address prefetching. The behavior of the IRB on a suite of programs drawn from