  Hardware and Software Mechanisms for Reducing Load Latency (1996)

by Todd Michael Austin
ftp://ftp.cs.wisc.edu/sohi/theses/austin.ps.gz

Abstract:

As processor demands quickly outpace memory, the performance of load instructions becomes an increasingly critical component to good system performance. This thesis contributes four novel load latency reduction techniques, each targeting a different component of load latency: address calculation, data cache access, address translation, and data cache misses. The contributed techniques are as follows:

- Fast Address Calculation employs a stateless set index predictor to allow address calculation to overlap with data cache access. The design eliminates the latency of address calculation for many loads.
- Zero-Cycle Loads combine fast address calculation with an early-issue mechanism to produce pipeline designs capable of hiding the latency of many loads that hit in the data cache.
- High-Bandwidth Address Translation develops address translation mechanisms with better latency and area characteristics than a multi-ported TLB. The new designs provide multiple-issue processors with effective alternatives for keeping address translation off the critical path of data cache access.
- Cache-conscious Data Placement is a profile-guided data placement optimization for reducing the frequency of data cache misses. The approach employs heuristic algorithms to find variable placement solutions that decrease inter-variable conflict, and increase cache line utilization and block prefetch.

Detailed design descriptions and experimental evaluations are provided for each approach, confirming the designs as cost-effective and practical solutions for reducing load latency.
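
The abstract does not spell out how the stateless set index predictor works. As a rough illustration only, the sketch below assumes one plausible scheme: predict the set index by OR-ing the set index fields of the base and offset (a carry-free stand-in for addition), start the cache access with that guess, and verify it against the full addition. The cache geometry (BLOCK_BITS, SET_BITS), the 32-bit address width, and all function names are hypothetical and are not taken from the thesis.

```python
# Illustrative sketch only: the geometry and the OR-based predictor are assumptions.
BLOCK_BITS = 5            # assumed 32-byte cache blocks
SET_BITS = 9              # assumed 512 sets
ADDR_MASK = 0xFFFFFFFF    # assumed 32-bit addresses


def set_index(addr):
    """Extract the set index field from an address."""
    return (addr >> BLOCK_BITS) & ((1 << SET_BITS) - 1)


def predict_set_index(base, offset):
    """Carry-free guess: OR the set index fields, ignoring any carries."""
    return set_index(base) | set_index(offset)


def actual_set_index(base, offset):
    """Set index of the fully added effective address."""
    return set_index((base + offset) & ADDR_MASK)


def fast_access(base, offset):
    """Return (predicted index, actual index, whether the prediction held)."""
    predicted = predict_set_index(base, offset)
    actual = actual_set_index(base, offset)
    return predicted, actual, predicted == actual


if __name__ == "__main__":
    # Small offset, no carry into the set index: the guess is correct.
    print(fast_access(0x1000, 8))
    # The block offset overflows and carries into the set index: the guess fails.
    print(fast_access(0x1018, 16))
```

In this sketch the predictor keeps no history, which is one reading of "stateless". A load whose prediction holds could overlap address calculation with cache access; a load whose prediction fails would have to repeat the access with the fully computed address.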
