Results 1 - 10 of 67
Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers
, 1990
Cited by 747 (4 self)
Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. St...
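The victim-caching idea in the abstract above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's design: block addresses are plain integers, the victim cache uses LRU replacement, and the class and method names (`DirectMappedWithVictim`, `access`) are hypothetical.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Direct-mapped cache backed by a small fully-associative victim cache.

    Assumptions (not from the source): integer block addresses, LRU
    replacement inside the victim cache, one block per cache line.
    """

    def __init__(self, num_lines, victim_entries):
        self.num_lines = num_lines
        self.lines = [None] * num_lines      # resident block per line
        self.victim = OrderedDict()          # evicted blocks, oldest first
        self.victim_entries = victim_entries
        self.hits = 0
        self.victim_hits = 0
        self.misses = 0

    def _install_victim(self, block):
        # Place the line displaced from the main cache into the victim cache.
        if block is None or self.victim_entries == 0:
            return
        self.victim[block] = True
        self.victim.move_to_end(block)
        while len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)  # drop the least recently added

    def access(self, block):
        index = block % self.num_lines
        resident = self.lines[index]
        if resident == block:
            self.hits += 1
            return "hit"
        if block in self.victim:
            # Swap: the requested block returns to the main cache and the
            # displaced line takes its place in the victim cache.
            del self.victim[block]
            self._install_victim(resident)
            self.lines[index] = block
            self.victim_hits += 1
            return "victim_hit"
        # Full miss: the victim of the miss (not the requested line)
        # is loaded into the victim cache.
        self._install_victim(resident)
        self.lines[index] = block
        self.misses += 1
        return "miss"


cache = DirectMappedWithVictim(num_lines=4, victim_entries=1)
outcomes = [cache.access(b) for b in (0, 4, 0, 4)]
```

Blocks 0 and 4 map to the same line here, so a plain direct-mapped cache would miss on every access in this trace; with a one-entry victim cache the third and fourth accesses are served from the victim cache instead.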
A Study of Integrated Prefetching and Caching Strategies
- In Proceedings of the ACM SIGMETRICS
, 1995
Cited by 168 (9 self)
Prefetching and caching are effective techniques for improving the performance of file systems, but they have not been studied in an integrated fashion. This paper proposes four properties that optimal integrated strategies for prefetching and caching must satisfy, and then presents and studies two such integrated strategies, called aggressive and conservative. We prove that the performance of the conservative approach is within a factor of two of optimal and that the performance of the aggressive strategy is a factor significantly less than twice that of the optimal case. We have evaluated these two approaches by trace-driven simulation with a collection of file access traces. Our results show that the two integrated prefetching and caching strategies are indeed close to optimal and that these strategies can reduce the running time of applications by up to 50%.
Implementation and Performance of Integrated Application-Controlled Caching, Prefetching and Disk Scheduling
, 1996
Cited by 100 (8 self)
Although file caching and prefetching are known techniques to improve the performance of file systems, little work has been done on integrating caching and prefetching. Optimal prefetching is nontrivial because prefetching may require early cache block replacements. Moreover, the tradeoff between the latency-hiding benefits of prefetching and the increase in the number of fetches required must be considered. This paper presents the design and implementation of a file system that integrates application-controlled caching, prefetching and disk scheduling. We use a two-level cache management strategy. The kernel uses the LRU-SP policy [CFL94a] to allocate blocks to processes, and each process uses the controlled-aggressive policy, an algorithm previously shown in a theoretical sense to be near-optimal, for managing its cache. Each process then improves its disk access latency by submitting its prefetches in batches and schedules the requests in each batch to optimize disk access performa...
Data Prefetch Mechanisms
, 2000
Cited by 79 (4 self)
The expanding gap between microprocessor and DRAM performance has necessitated the use of increasingly aggressive techniques designed to reduce or hide the latency of main memory access. Although large cache hierarchies have proven to be effective in reducing this latency for the most frequently used data, it is still not uncommon for many programs to spend more than half their run times stalled on memory requests. Data prefetching has been proposed as a technique for hiding the access latency of data referencing patterns that defeat caching strategies. Rather than waiting for a cache miss to initiate a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. To be effective, prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead. Secondary effects such as cache pollution and increased memory bandwidth requirements must also be taken into consideration. Despite these obstacles, prefetching has the potential to significantly improve overall program execution time by overlapping computation with memory accesses. Prefetching
Adaptive Page Replacement Based on Memory Reference Behavior
, 1997
Cited by 66 (1 self)
As disk performance continues to lag behind that of memory systems and processors, virtual memory management becomes increasingly important for overall system performance. In this paper we study the page reference behavior of a collection of memory-intensive applications, and propose a new virtual memory page replacement algorithm, SEQ. SEQ detects long sequences of page faults and applies most-recently-used replacement to those sequences. Simulations show that for a large class of applications, SEQ performs close to the optimal replacement algorithm, and significantly better than Least-Recently-Used (LRU). In addition, SEQ performs similarly to LRU for applications that do not exhibit sequential faulting.
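The motivation for applying most-recently-used replacement to long sequential fault runs can be seen in a toy comparison. The sketch below is not the SEQ algorithm (it performs no sequence detection); it only contrasts pure LRU with pure MRU on a cyclic sequential scan that is slightly larger than memory. The function name `count_faults` and the trace are illustrative assumptions.

```python
from collections import OrderedDict


def count_faults(trace, frames, policy):
    """Count page faults for a reference trace under 'lru' or 'mru' eviction.

    Illustrative only: SEQ as described in the abstract chooses between
    behaviors by detecting fault sequences, which this sketch does not do.
    """
    resident = OrderedDict()  # pages ordered from least to most recently used
    faults = 0
    for page in trace:
        if page in resident:
            resident.move_to_end(page)   # refresh recency on a hit
            continue
        faults += 1
        if len(resident) >= frames:
            # last=True evicts the most recently used page (MRU),
            # last=False evicts the least recently used page (LRU).
            resident.popitem(last=(policy == "mru"))
        resident[page] = None
    return faults


# Three sequential passes over 5 pages with only 4 frames available.
scan = list(range(5)) * 3
lru_faults = count_faults(scan, frames=4, policy="lru")
mru_faults = count_faults(scan, frames=4, policy="mru")
```

On this trace LRU faults on every reference, because each page is evicted just before it is needed again, while MRU keeps most of the scan resident and faults far less often.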
Sequential Hardware Prefetching in Shared-Memory Multiprocessors
- IEEE Transactions on Parallel and Distributed Systems
, 1995
Cited by 63 (6 self)
To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions of the number of read misses, the read penalty, and of the execution time by up to 78%, 58%, and 25%, respectively. Index Terms: hardware-controlled prefetching, latency tolerance, memory consistency models, performance evaluation, sequential prefetching, shared-memory multiprocessors.
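A rough sketch of adapting the prefetch degree to measured usefulness is given below. It is not the mechanism from the paper: the window size, the thresholds `low` and `high`, the minimum degree of 1, the unbounded cache, and the function name `simulate_adaptive_prefetch` are all assumptions chosen to keep the example short.

```python
def simulate_adaptive_prefetch(trace, window=16, low=0.25, high=0.75,
                               max_degree=8):
    """Sequential prefetch-on-miss with a degree that adapts over time.

    Assumed policy (hypothetical, for illustration): after every `window`
    issued prefetches, raise the degree if more than `high` of them were
    referenced, lower it if fewer than `low` were. The cache never evicts.
    """
    cached = set()
    tagged = set()        # prefetched blocks not yet referenced
    degree = 1            # number of consecutive blocks fetched per miss
    issued = useful = 0
    misses = 0
    for block in trace:
        if block in tagged:
            tagged.discard(block)
            useful += 1   # first reference to a prefetched block
        if block not in cached:
            misses += 1
            cached.add(block)
            for nxt in range(block + 1, block + 1 + degree):
                if nxt not in cached:
                    cached.add(nxt)
                    tagged.add(nxt)
                    issued += 1
        if issued >= window:
            ratio = useful / issued
            if ratio > high:
                degree = min(max_degree, degree + 1)
            elif ratio < low:
                degree = max(1, degree - 1)
            issued = useful = 0
    return misses, degree


seq_misses, seq_degree = simulate_adaptive_prefetch(range(200))
far_misses, far_degree = simulate_adaptive_prefetch(range(0, 20000, 100))
```

On the purely sequential trace nearly every prefetch is used, so the degree grows and misses fall well below the fixed degree-1 case; on the widely strided trace no prefetch is ever used, so the degree stays at its minimum.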
Instruction Fetching: Coping with Code Bloat
- In Proceedings of the 22nd Annual International Symposium on Computer Architecture
, 1995
Cited by 62 (9 self)
Previous research has shown that the SPEC benchmarks achieve low miss ratios in relatively small instruction caches. This paper presents evidence that current software-development practices produce applications that exhibit substantially higher instruction-cache miss ratios than do the SPEC benchmarks. To represent these trends, we have assembled a collection of applications, called the Instruction Benchmark Suite (IBS), that provides a better test of instruction-cache performance. We discuss the rationale behind the design of IBS and characterize its behavior relative to the SPEC benchmark suite. Our analysis is based on trace-driven and trap-driven simulations and takes into full account both the application and operating-system components of the workloads. This paper then reexamines a collection of previously-proposed hardware mechanisms for improving instruction-fetch performance
Disk cache-miss ratio analysis and design considerations
- ACM Transactions on Computer Systems
, 1985
Cited by 57 (5 self)
power and toward disk drives of rapidly increasing density, but with disk performance increasing very slowly if at all. The implication of these trends is that at some point the processing power of computer systems will be limited by the throughput of the input/output (I/O) system. A solution to this problem, which is described and evaluated in this paper, is disk cache. The idea is to buffer recently used portions of the disk address space in electronic storage. Empirically, it is shown that a large (e.g., 80-90 percent) fraction of all I/O requests are captured by a cache of an 8-Mbyte order-of-magnitude size for our workload sample. This paper considers a number of design parameters for such a cache (called cache disk or disk cache), including those that can be examined experimentally (cache location, cache size, migration algorithms, block sizes, etc.) and others (access time, bandwidth, multipathing, technology, consistency, error recovery, etc.) for which we have no relevant data or experiments. Consideration is given to both caches located in the I/O system, as with the storage controller, and those located in the CPU main memory. Experimental results are based on extensive trace-driven simulations using traces taken from three large IBM or IBM-compatible mainframe data processing installations. We find that disk cache is a powerful means of extending the performance limits of high-end computer systems. Categories and Subject Descriptors: B.3 [Hardware]: Memory Structures-design styles; performance analysis and design aids; B.4 [Hardware]: Input/Output and Data Communications-input
Practical Prefetching Techniques for Parallel File Systems
- In Proceedings of the First International Conference on Parallel and Distributed Information Systems
, 1991
Cited by 52 (2 self)
Improvements in the processing speed of multiprocessors are outpacing improvements in the speed of disk hardware. Parallel disk I/O subsystems have been proposed as one way to close the gap between processor and disk speeds. In a previous paper we showed that prefetching and caching have the potential to deliver the performance benefits of parallel file systems to parallel applications. In this paper we describe experiments with practical prefetching policies, and show that prefetching can be implemented efficiently even for the more complex parallel file access patterns. We also test the ability of these policies across a range of architectural parameters.
Evaluation of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors
- IEEE Transactions on Parallel and Distributed Systems
Cited by 48 (2 self)
We study the efficiency of previously proposed stride and sequential prefetching, two promising hardware-based prefetching schemes to reduce read-miss penalties in shared-memory multiprocessors. Although stride accesses dominate in four out of six of the applications we study, we find that sequential prefetching does as well as and in some cases even better than stride prefetching for five applications. This is because (i) most strides are shorter than the block size (we assume 32 byte blocks), which means that sequential prefetching is as effective for these stride accesses, and (ii) sequential prefetching also exploits the locality of read misses with non-stride accesses. However, since stride prefetching in general results in fewer useless prefetches, it offers the extra advantage of consuming less memory-system bandwidth. Corresponding author: Fredrik Dahlgren Keywords: Hardware-Controlled Prefetching, Latency Tolerance, Performance Evaluation, Relaxed Memory Consiste...
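Point (i) in the abstract above can be checked with simple arithmetic. The sketch below maps a strided byte-address stream onto 32-byte blocks (the block size stated in the abstract); the function name `blocks_touched` and the example strides of 8 and 48 bytes are illustrative assumptions.

```python
def blocks_touched(start, stride, count, block_size=32):
    """Return the block number of each access in a strided address stream."""
    return [(start + i * stride) // block_size for i in range(count)]


# Stride shorter than the block: accesses stay in a block or move to the next.
short_stride = blocks_touched(start=0, stride=8, count=8)

# Stride longer than the block: some blocks are skipped entirely.
long_stride = blocks_touched(start=0, stride=48, count=5)
```

With an 8-byte stride the stream visits blocks 0, 0, 0, 0, 1, 1, 1, 1, so fetching the next consecutive block always covers the next access. With a 48-byte stride the stream visits blocks 0, 1, 3, 4, 6, so a one-block sequential prefetch sometimes fetches a block that is never referenced.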

