Results 1 - 10 of 164
Prefetching using Markov predictors
- In ISCA, 1997
"... Prefetching is one approach to reducing the latency of memory op-erations in modem computer systems. In this paper, we describe the Markov prefetcher. This prefetcher acts as an interface between the on-chip and off-chip cache, and can be added to existing com-puter designs. The Markov prefetcher is ..."
Abstract - Cited by 308 (1 self)
Prefetching is one approach to reducing the latency of memory operations in modern computer systems. In this paper, we describe the Markov prefetcher. This prefetcher acts as an interface between the on-chip and off-chip cache, and can be added to existing computer designs. The Markov prefetcher is distinguished by prefetching multiple reference predictions from the memory subsystem, and then prioritizing the delivery of those references to the processor. This design results in a prefetching system that provides good coverage, is accurate, and produces timely results that can be effectively used by the processor. In our cycle-level simulations, the Markov prefetcher reduces the overall execution stalls due to instruction and data memory operations by an average of 54% for various commercial benchmarks while only using two thirds the memory of a demand-fetch cache organization.
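To make the mechanism concrete, here is a minimal Python sketch of the core idea as the abstract describes it: remember which miss addresses have historically followed each miss address, and on a new miss issue prefetches for the most likely successors. The table organization, the fanout of two, and the on_miss interface are illustrative assumptions, not the paper's hardware design.

# A minimal sketch (not the paper's implementation) of a Markov prefetcher:
# learn miss-address transitions and issue prioritized prefetch candidates.
from collections import defaultdict, Counter

class MarkovPrefetcher:
    def __init__(self, fanout=2):
        self.fanout = fanout                    # predictions issued per miss
        self.successors = defaultdict(Counter)  # miss addr -> counts of following miss addrs
        self.last_miss = None

    def on_miss(self, addr):
        """Record the observed transition and return prioritized prefetch candidates."""
        if self.last_miss is not None:
            self.successors[self.last_miss][addr] += 1
        self.last_miss = addr
        ranked = self.successors[addr].most_common(self.fanout)
        return [a for a, _ in ranked]           # most frequent successors first

# Toy usage: a repeating miss pattern lets the table learn A -> B, B -> C, ...
pf = MarkovPrefetcher()
for addr in [0x100, 0x200, 0x300, 0x100, 0x200, 0x300, 0x100]:
    print(hex(addr), "prefetch:", [hex(a) for a in pf.on_miss(addr)])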
Optimal Prefetching via Data Compression, 1995
"... Caching and prefetching are important mechanisms for speeding up access time to data on secondary storage. Recent work in competitive online algorithms has uncovered several promising new algorithms for caching. In this paper we apply a form of the competitive philosophy for the first time to the pr ..."
Abstract - Cited by 258 (7 self)
Caching and prefetching are important mechanisms for speeding up access time to data on secondary storage. Recent work in competitive online algorithms has uncovered several promising new algorithms for caching. In this paper we apply a form of the competitive philosophy for the first time to the problem of prefetching to develop an optimal universal prefetcher in terms of fault ratio, with particular applications to large-scale databases and hypertext systems. Our prediction algorithms for prefetching are novel in that they are based on data compression techniques that are both theoretically optimal and good in practice. Intuitively, in order to compress data effectively, you have to be able to predict future data well, and thus good data compressors should be able to predict well for purposes of prefetching. We show for powerful models such as Markov sources and nth order Markov sources that the page fault rates incurred by our prefetching algorithms are optimal in the limit for almost all sequences of page requests.
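The compression-to-prediction intuition can be illustrated with a simple order-k context model: the counter table and single-page prediction below are assumptions for illustration only, since the paper builds its universal prefetcher on theoretically optimal compression schemes rather than this ad hoc table.

# A hedged sketch of "good compressors predict well": predict the next page
# from the most frequent continuation of the current k-page context.
from collections import defaultdict, Counter, deque

class ContextPrefetcher:
    def __init__(self, order=2):
        self.order = order
        self.table = defaultdict(Counter)   # context tuple -> counts of next pages
        self.context = deque(maxlen=order)

    def access(self, page):
        """Update the model with the requested page, then predict the next page to prefetch."""
        if len(self.context) == self.order:
            self.table[tuple(self.context)][page] += 1
        self.context.append(page)
        if len(self.context) == self.order:
            counts = self.table[tuple(self.context)]
            if counts:
                return counts.most_common(1)[0][0]
        return None

pf = ContextPrefetcher(order=2)
for p in [1, 2, 3, 1, 2, 3, 1, 2]:
    print("request", p, "-> prefetch", pf.access(p))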
Effective Hardware-Based Data Prefetching for High-Performance Processors
- IEEE Transactions on Computers, 1995
"... ..."
A Study of Integrated Prefetching and Caching Strategies
- In Proceedings of the ACM SIGMETRICS, 1995
"... Prefetching and caching are effective techniques for improving the performance of file systems, but they have not been studied in an integrated fashion. This paper proposes four properties that optimal integrated strategies for prefetching and caching must satisfy, and then presents and studies two ..."
Abstract - Cited by 210 (9 self)
Prefetching and caching are effective techniques for improving the performance of file systems, but they have not been studied in an integrated fashion. This paper proposes four properties that optimal integrated strategies for prefetching and caching must satisfy, and then presents and studies two such integrated strategies, called aggressive and conservative. We prove that the performance of the conservative approach is within a factor of two of optimal and that the performance of the aggressive strategy is a factor significantly less than twice that of the optimal case. We have evaluated these two approaches by trace-driven simulation with a collection of file access traces. Our results show that the two integrated prefetching and caching strategies are indeed close to optimal and that these strategies can reduce the running time of applications by up to 50%.
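As a rough illustration of why prefetching and replacement must be considered together, the toy trace-driven harness below (my construction, not the paper's simulator or its aggressive/conservative strategies) replays a block reference trace against a small LRU cache and counts the fetches that stall the request stream, with and without a naive one-reference-lookahead prefetch.

# A hedged toy harness: prefetching hides fetches but forces early replacements,
# which is exactly the interaction the integrated strategies must manage.
from collections import OrderedDict

def run(trace, cache_size, prefetch):
    cache = OrderedDict()          # LRU order: oldest entry first
    stalls = 0
    for i, block in enumerate(trace):
        if block in cache:
            cache.move_to_end(block)
        else:
            stalls += 1            # the request waits for this fetch
            cache[block] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)
        if prefetch and i + 1 < len(trace) and trace[i + 1] not in cache:
            cache[trace[i + 1]] = True          # fetch overlapped with the current access
            if len(cache) > cache_size:
                cache.popitem(last=False)
    return stalls

trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3]
print("demand stalls:  ", run(trace, cache_size=3, prefetch=False))
print("prefetch stalls:", run(trace, cache_size=3, prefetch=True))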
Managing Wire Delay in Large Chip-Multiprocessor Caches
- In IEEE/ACM International Symposium on Microarchitecture, 2004
"... In response to increasing (relative) wire delay, architects have proposed various technologies to manage the impact of slow wires on large uniprocessor L2 caches. Block migration (e.g., D-NUCA and NuRapid) reduces average hit latency by migrating frequently used blocks towards the lower-latency bank ..."
Abstract - Cited by 157 (4 self)
In response to increasing (relative) wire delay, architects have proposed various technologies to manage the impact of slow wires on large uniprocessor L2 caches. Block migration (e.g., D-NUCA and NuRapid) reduces average hit latency by migrating frequently used blocks towards the lower-latency banks. Transmission Line Caches (TLC) use on-chip transmission lines to provide low latency to all banks. Traditional stride-based hardware prefetching strives to tolerate, rather than reduce, latency. Chip multiprocessors (CMPs) present additional challenges. First, CMPs often share the on-chip L2 cache, requiring multiple ports to provide sufficient bandwidth. Second, multiple threads mean multiple working sets, which compete for limited on-chip storage. Third, sharing code and data interferes with block migration, since one processor's low-latency bank is another processor's high-latency bank. In this paper, we develop L2 cache designs for CMPs that incorporate these three latency management techniques. We use detailed full-system simulation to analyze the performance trade-offs for both commercial and scientific workloads. First, we demonstrate that block migration is less effective for CMPs because 40-60% of L2 cache hits in commercial workloads are satisfied in the central banks, which are equally far from all processors. Second, we observe that although transmission lines provide low latency, contention for their restricted bandwidth limits their performance. Third, we show that stride-based prefetching between L1 and L2 caches alone improves performance by at least as much as the other two techniques. Finally, we present a hybrid design, combining all three techniques, that improves performance by an additional 2% to 19% over prefetching alone.
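For reference, stride-based hardware prefetching of the kind mentioned above is commonly implemented with a table indexed by the load's program counter. The sketch below shows that standard scheme; it is not this paper's specific L1-to-L2 prefetch engine, and the table layout is an illustrative assumption.

# A minimal stride prefetcher sketch: when a load repeats the same address
# stride twice, prefetch one stride ahead of the current access.
class StridePrefetcher:
    def __init__(self):
        self.table = {}   # load PC -> (last address, last observed stride)

    def access(self, pc, addr):
        """Record this load's address and return an address to prefetch, or None."""
        last_addr, last_stride = self.table.get(pc, (None, 0))
        prefetch = None
        stride = 0
        if last_addr is not None:
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride      # the stride repeated: run ahead by one block
        self.table[pc] = (addr, stride)
        return prefetch

pf = StridePrefetcher()
for a in range(0x1000, 0x1000 + 6 * 64, 64):   # one load streaming through memory, stride 64
    print(hex(a), "->", pf.access(pc=0x400123, addr=a))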
Dynamic hot data stream prefetching for general-purpose programs
- In ACM SIGPLAN Conference on Programming Language Design and Implementation, 2002
"... Prefetching data ahead of use has the potential to tolerate the growing processor-memory performance gap by overlapping long latency memory accesses with useful computation. While sophisticated prefetching techniques have been automated for limited domains, such as scientific codes that access dense ..."
Abstract - Cited by 116 (2 self)
Prefetching data ahead of use has the potential to tolerate the growing processor-memory performance gap by overlapping long latency memory accesses with useful computation. While sophisticated prefetching techniques have been automated for limited domains, such as scientific codes that access dense arrays in loop nests, a similar level of success has eluded general-purpose programs, especially pointer-chasing codes written in languages such as C and C++. We address this problem by describing, implementing and evaluating a dynamic prefetching scheme. Our technique runs on stock hardware, is completely automatic, and works for general-purpose programs, including pointer-chasing codes written in weakly-typed languages, such as C and C++. It operates in three phases. First, the profiling phase gathers a temporal data reference profile from a running program with low overhead. Next, profiling is turned off and a fast analysis algorithm extracts hot data streams, which are data reference sequences that frequently repeat in the same order, from the temporal profile. Then, the system dynamically injects code at appropriate program points to detect and prefetch these hot data streams. Finally, the process enters the hibernation phase, where no profiling or analysis is performed and the program continues to execute with the added prefetch instructions. At the end of the hibernation phase, the program is deoptimized to remove the inserted checks and prefetch instructions, and control returns to the profiling phase. For long-running programs, this profile, analyze, optimize, and hibernate cycle repeats multiple times. Our initial results from applying dynamic prefetching are promising, indicating overall execution time improvements of 5-19% for several memory-performance-limited SPECint2000 benchmarks running their largest (ref) inputs.
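The sketch below walks through the profile / extract / prefetch flow at a toy scale. The sliding-window frequency count used here to find hot data streams is an illustrative stand-in (the paper's analysis algorithm is more sophisticated), and the real system injects detection and prefetch code into the running program rather than post-processing a trace.

# A hedged sketch: find data reference sequences that repeat frequently in the
# same order, then prefetch a stream's tail whenever its head is observed.
from collections import Counter

def extract_hot_streams(profile, length=3, min_count=2):
    """Return fixed-length reference sequences that repeat at least min_count times."""
    windows = Counter(tuple(profile[i:i + length])
                      for i in range(len(profile) - length + 1))
    return {s for s, c in windows.items() if c >= min_count}

def run_with_prefetch(trace, hot_streams, head_len=1):
    """Report where a hot stream's head is detected and which tail would be prefetched."""
    heads = {s[:head_len]: s[head_len:] for s in hot_streams}   # simplified: one tail per head
    prefetched = []
    for i in range(len(trace)):
        head = tuple(trace[i:i + head_len])
        if head in heads:
            prefetched.append((i, heads[head]))
    return prefetched

profile = ['a', 'b', 'c', 'x', 'a', 'b', 'c', 'y', 'a', 'b', 'c']
hot = extract_hot_streams(profile)
print("hot data streams:", hot)
print("prefetches:", run_with_prefetch(profile, hot))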
Implementation and Performance of Integrated Application-Controlled Caching, Prefetching and Disk Scheduling, 1996
"... Although file caching and prefetching are known techniques to improve the performance of file systems, little work has been done on intergrating caching and prefetching. Optimal prefetching is nontrivial because prefetching may require early cache block replacements. Moreover, the tradeoff between t ..."
Abstract - Cited by 114 (8 self)
Although file caching and prefetching are known techniques to improve the performance of file systems, little work has been done on integrating caching and prefetching. Optimal prefetching is nontrivial because prefetching may require early cache block replacements. Moreover, the tradeoff between the latency-hiding benefits of prefetching and the increase in the number of fetches required must be considered. This paper presents the design and implementation of a file system that integrates application-controlled caching, prefetching and disk scheduling. We use a two-level cache management strategy. The kernel uses the LRU-SP policy [CFL94a] to allocate blocks to processes, and each process uses the controlled-aggressive policy, an algorithm previously shown in a theoretical sense to be near-optimal, for managing its cache. Each process then improves its disk access latency by submitting its prefetches in batches and scheduling the requests in each batch to optimize disk access performance...
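The batching-and-scheduling step can be pictured with the small sketch below: it reorders one batch of prefetch requests by disk position before issuing it. This is only a stand-in for the paper's file-system-level scheduler; the block-to-position mapping and the seek cost model are assumptions.

# A hedged sketch of scheduling a batch of prefetch requests (SCAN-like ordering).
def schedule_batch(batch, head_pos):
    """Order a batch of prefetch requests to shorten total seek distance."""
    ahead  = sorted(b for b in batch if b >= head_pos)
    behind = sorted((b for b in batch if b < head_pos), reverse=True)
    return ahead + behind

def total_seek(requests, head_pos):
    """Sum the seek distances incurred by serving requests in the given order."""
    dist, pos = 0, head_pos
    for r in requests:
        dist += abs(r - pos)
        pos = r
    return dist

batch = [90, 12, 55, 70, 8, 130]        # assumed disk positions of prefetched blocks
print("unscheduled seek:", total_seek(batch, head_pos=50))
print("scheduled seek:  ", total_seek(schedule_batch(batch, 50), head_pos=50))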
Run-time adaptive cache hierarchy management via reference analysis
- In Proceedings of the 24th International Symposium on Computer Architecture, 1997
"... Improvements in main memory speeds have not kept pace with increasing processor clock frequency and improved exploitation of instruction-level parallelism. Consequently, the gap between processor and main memory performance is expected to grow, increasing the number of execution cycles spent waiting ..."
Abstract - Cited by 114 (3 self)
Improvements in main memory speeds have not kept pace with increasing processor clock frequency and improved exploitation of instruction-level parallelism. Consequently, the gap between processor and main memory performance is expected to grow, increasing the number of execution cycles spent waiting for memory accesses to complete. One solution to this growing problem is to reduce the number of cache misses by increasing the effectiveness of the cache hierarchy. In this paper we present a technique for dynamic analysis of program data access behavior, which is then used to proactively guide the placement of data within the cache hierarchy in a location-sensitive manner. We introduce the concept of a macroblock, which allows us to feasibly characterize the memory locations accessed by a program, and a Memory Address Table, which performs the dynamic reference analysis. Our technique is fully compatible with existing Instruction Set Architectures. Results from detailed simulations of several integer programs show significant speedups.
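A minimal software model of the macroblock plus Memory Address Table idea is sketched below; the 2 KB macroblock size, the hot/cold threshold, and the placement hints are illustrative assumptions rather than the paper's tuned design.

# A hedged sketch: group addresses into coarse macroblocks, count accesses at
# run time, and use the count to produce a placement hint for the cache hierarchy.
MACROBLOCK_SHIFT = 11            # assumed 2 KB macroblocks
HOT_THRESHOLD = 4                # assumed hot/cold boundary

class MemoryAddressTable:
    def __init__(self):
        self.counts = {}         # macroblock id -> observed accesses

    def placement(self, addr):
        """Return a placement hint for the block containing addr."""
        mb = addr >> MACROBLOCK_SHIFT
        self.counts[mb] = self.counts.get(mb, 0) + 1
        return "cache-normally" if self.counts[mb] >= HOT_THRESHOLD else "bypass/low-priority"

mat = MemoryAddressTable()
for addr in [0x1000, 0x1040, 0x1080, 0x10C0, 0x1100, 0x8000]:
    print(hex(addr), "->", mat.placement(addr))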
Dead-block prediction and dead-block correlating prefetchers
- In Proceedings of the 28th International Symposium on Computer Architecture, 2001
"... laia @ ecn.purdue, edu Effective data prefetching requires accurate mechanisms to predict both "which " cache blocks to prefetch and "when " to prefetch them. This paper proposes the Dead-Block Predictors (DBPs), trace-based predictors that accu-rately identify &a ..."
Abstract - Cited by 102 (9 self)
Effective data prefetching requires accurate mechanisms to predict both "which" cache blocks to prefetch and "when" to prefetch them. This paper proposes Dead-Block Predictors (DBPs), trace-based predictors that accurately identify "when" an L1 data cache block becomes evictable, or "dead". Predicting a dead block significantly enhances prefetching lookahead and opportunity, and enables placing data directly into L1, obviating the need for auxiliary prefetch buffers. This paper also proposes Dead-Block Correlating Prefetchers (DBCPs), which use address correlation to predict "which" subsequent block to prefetch when a block becomes evictable. A DBCP enables effective data prefetching in a wide spectrum of pointer-intensive, integer, and floating-point applications. We use cycle-accurate simulation of an out-of-order superscalar processor and memory-intensive benchmarks to show that: (1) dead-block prediction enhances prefetching lookahead at least by an order of magnitude compared to previous techniques, (2) a DBP can predict dead blocks on average with a coverage of 90%, mispredicting only 4% of the time, (3) a DBCP offers an address prediction coverage of 86%, mispredicting only 3% of the time, and (4) DBCPs improve performance by 62% on average and 282% at best in the benchmarks we studied.
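The sketch below models the two mechanisms at a toy scale: a per-block trace of the instructions (PCs) that touch the block predicts when the block goes dead, and a correlation entry recorded at eviction supplies the block to prefetch at that point. The table organization and signatures are simplified assumptions, not the paper's structures.

# A hedged, much-simplified sketch of dead-block prediction plus correlation.
class DeadBlockPredictor:
    def __init__(self):
        self.live_trace = {}      # block -> tuple of PCs seen since the block was filled
        self.dead_traces = {}     # (block, trace) -> True if that trace ended in eviction
        self.correlate = {}       # block -> block that was demand-missed next

    def access(self, pc, block):
        """Record the access; if the block is now predicted dead, suggest a prefetch."""
        trace = self.live_trace.get(block, ()) + (pc,)
        self.live_trace[block] = trace
        if self.dead_traces.get((block, trace)):
            return self.correlate.get(block)
        return None

    def evict(self, block, next_missed_block):
        """Learn that this trace ended the block's lifetime, and what followed it."""
        trace = self.live_trace.pop(block, ())
        self.dead_traces[(block, trace)] = True
        self.correlate[block] = next_missed_block

dbp = DeadBlockPredictor()
# First tour of block A: touched by PCs 1 and 2, then evicted; block B missed next.
dbp.access(pc=1, block='A'); dbp.access(pc=2, block='A')
dbp.evict('A', next_missed_block='B')
# Second tour with the same PC trace: after PC 2 the block is predicted dead
# and B is suggested for prefetch.
print(dbp.access(pc=1, block='A'))    # None (trace not yet complete)
print(dbp.access(pc=2, block='A'))    # 'B'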
Impulse: Building a Smarter Memory Controller
- In Proceedings of the Fifth Annual Symposium on High Performance Computer Architecture, 1999
"... Impulse is a new memory system architecture that adds two important features to a traditional memory controller. First, Impulse supports application-specific optimizations through configurable physical address remapping. By remapping physical addresses, applications control how their data is accesse ..."
Abstract - Cited by 99 (21 self)
Impulse is a new memory system architecture that adds two important features to a traditional memory controller. First, Impulse supports application-specific optimizations through configurable physical address remapping. By remapping physical addresses, applications control how their data is accessed and cached, improving their cache and bus utilization. Second, Impulse supports prefetching at the memory controller, which can hide much of the latency of DRAM accesses. In this paper...
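The remapping feature can be pictured with the small software model below: a dense shadow range is backed by a gather function, so a strided structure (here, one column of a row-major matrix) appears contiguous to the cache. The function names and parameters are illustrative assumptions; Impulse performs this translation in the memory controller hardware, not in software.

# A hedged software model of gather-style physical address remapping.
def make_column_gather(base_addr, row_bytes, elem_bytes, column):
    """Map a dense shadow offset i to the physical address of matrix[i][column]."""
    def translate(shadow_offset):
        row = shadow_offset // elem_bytes
        return base_addr + row * row_bytes + column * elem_bytes
    return translate

# Shadow accesses 0, 8, 16, ... look like a dense array to the application and cache,
# while the controller fetches only the strided column elements.
translate = make_column_gather(base_addr=0x10000, row_bytes=1024, elem_bytes=8, column=3)
for i in range(4):
    print(f"shadow {i*8:#06x} -> physical {translate(i*8):#08x}")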