As processor demands quickly outpace memory, the performance of load instructions becomes an increasingly critical component to good system performance. This thesis contributes four novel load latency reduction techniques, each targeting a different component of load latency: address calculation, data cache access, address translation, and data cache misses. The contributed techniques are as follows:

• Fast Address Calculation employs a stateless set index predictor to allow address calculation to overlap with data cache access. The design eliminates the latency of address calculation for many loads.

• Zero-Cycle Loads combine fast address calculation with an early-issue mechanism to produce pipeline designs capable of hiding the latency of many loads that hit in the data cache.

• High-Bandwidth Address Translation develops address translation mechanisms with better latency and area characteristics than a multi-ported TLB. The new designs provide multiple-issue processors with effective alternatives for keeping address translation off the critical path of data cache access.

• Cache-Conscious Data Placement is a profile-guided data placement optimization for reducing the frequency of data cache misses. The approach employs heuristic algorithms to find variable placement solutions that decrease inter-variable conflict, and increase cache line utilization and block prefetch.

Detailed design descriptions and experimental evaluations are provided for each approach, confirming the designs as cost-effective and practical solutions for reducing load latency.
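To make the first technique more concrete, the following is a minimal sketch of what a stateless set index predictor could look like. The abstract does not describe the predictor's internals, so everything here is an assumption for illustration: the carry-free OR of base and offset, the cache geometry constants, and the function names are all hypothetical. The idea shown is that a cheap guess at the set index lets the cache access start early, while a full addition verifies the guess.

```python
# Hypothetical cache geometry, chosen only for illustration.
BLOCK_BITS = 5            # assumed 32-byte cache blocks
INDEX_BITS = 9            # assumed 512 sets
ADDR_MASK = 0xFFFFFFFF    # assumed 32-bit addresses


def set_index(addr):
    """Extract the set index field from a full address."""
    return (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)


def predict_set_index(base, offset):
    """Stateless guess: OR the operands instead of adding them.

    OR needs no carry propagation, so in hardware it could finish
    early enough to start the cache access (assumption, not the
    thesis's stated mechanism).
    """
    return set_index((base | offset) & ADDR_MASK)


def load_set_index(base, offset):
    """Return (actual_index, prediction_was_correct).

    The full addition is what the normal pipeline would compute;
    a mismatch means the speculative cache access must be redone.
    """
    predicted = predict_set_index(base, offset)
    actual = set_index((base + offset) & ADDR_MASK)
    return actual, predicted == actual
```

In this sketch the guess is right whenever adding the offset produces no carry into or within the index bits, for example a small offset applied to an aligned base; a carry out of the block offset field produces a misprediction that the verifying addition catches.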