Results 1 -
8 of
8
Data Prefetch Mechanisms
, 2000
"... The expanding gap between microprocessor and DRAM performance has necessitated the use of increasingly aggressive techniques designed to reduce or hide the latency of main memory access. Although large cache hierarchies have proven to be effective in reducing this latency for the most frequently use ..."
Abstract
-
Cited by 79 (4 self)
- Add to MetaCart
The expanding gap between microprocessor and DRAM performance has necessitated the use of increasingly aggressive techniques designed to reduce or hide the latency of main memory access. Although large cache hierarchies have proven to be effective in reducing this latency for the most frequently used data, it is still not uncommon for many programs to spend more than half their run times stalled on memory requests. Data prefetching has been proposed as a technique for hiding the access latency of data referencing patterns that defeat caching strategies. Rather than waiting for a cache miss to initiate a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. To be effective, prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead. Secondary effects such as cache pollution and increased memory bandwidth requirements must also be taken into consideration. Despite these obstacles, prefetching has the potential to significantly improve overall program execution time by overlapping computation with memory accesses. Prefetching
Adapting Cache Line Size to Application Behavior
- In ICS
, 1999
"... A cache line size has a significant effect on miss rate and memory traffic. Today's computers use a fixed line size, typically 32B, which may not be optimal for a given application. Optimal size may also change during application execution. This paper describes a cache in which the line (fetch) size ..."
Abstract
-
Cited by 42 (11 self)
- Add to MetaCart
A cache line size has a significant effect on miss rate and memory traffic. Today's computers use a fixed line size, typically 32B, which may not be optimal for a given application. Optimal size may also change during application execution. This paper describes a cache in which the line (fetch) size is continuously adjusted by hardware based on observed application accesses to the line. The approach can improve the miss rate, even over the optimal for the fixed line size, as well as significantly reduce the memory traffic. 1 Introduction A design of a computer system is an optimization problem involving of a number of dependent variables in a very large design space. The variables include system performance, hardware constraints, cost, and architectural parameters. For general-purpose systems, performance is evaluated with respect to a particular benchmark program or program suite for a given set of design parameters. To simplify the problem, dependences between parameters are freque...
CPU Cache Prefetching: Timing Evaluation of Hardware Implementations
- IEEE Transactions on Computers
, 1998
"... Abstract—Prefetching into CPU caches has long been known to be effective in reducing the cache miss ratio, but known implementations of prefetching have been unsuccessful in improving CPU performance. The reasons for this are that prefetches interfere with normal cache operations by making cache add ..."
Abstract
-
Cited by 17 (2 self)
- Add to MetaCart
Abstract—Prefetching into CPU caches has long been known to be effective in reducing the cache miss ratio, but known implementations of prefetching have been unsuccessful in improving CPU performance. The reasons for this are that prefetches interfere with normal cache operations by making cache address and data ports busy, the memory bus busy, the memory banks busy, and by not necessarily being complete by the time that the prefetched data is actually referenced. In this paper, we present extensive quantitative results of a detailed cycle-by-cycle trace-driven simulation of a uniprocessor memory system in which we vary most of the relevant parameters in order to determine when and if hardware prefetching is useful. We find that, in order for prefetching to actually improve performance, the address array needs to be double ported and the data array needs to either be double ported or fully buffered. It is also very helpful for the bus to be very wide (e.g., 16 bytes) for bus transactions to be split and for main memory to be interleaved. Under the best circumstances, i.e., with a significant investment in extra hardware, prefetching can significantly improve performance. For implementations without adequate hardware, prefetching often decreases performance.
Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions
- ACM Euro-Par Conference
, 2002
"... Abstract. As the degree of instruction-level parallelism in superscalar architectures increases, the gap between processor and memory performance continues to grow requiring more aggressive techniques to increase the performance of the memory system. We propose a new technique, which is based on the ..."
Abstract
-
Cited by 11 (3 self)
- Add to MetaCart
Abstract. As the degree of instruction-level parallelism in superscalar architectures increases, the gap between processor and memory performance continues to grow requiring more aggressive techniques to increase the performance of the memory system. We propose a new technique, which is based on the wrong-path execution of loads far beyond instruction fetch-limiting conditional branches, to exploit more instruction-level parallelism by reducing the impact of memory delays. We examine the effects of the execution of loads down the wrong branch path on the performance of an aggressive issue processor. We find that, by continuing to execute the loads issued in the mispredicted path, even after the branch is resolved, we can actually reduce the cache misses observed on the correctly executed path. This wrong-path execution of loads can result in a speedup of up to 5 % due to an indirect prefetching effect that brings data or instruction blocks into the cache for instructions subsequently issued on the correctly predicted path. However, it also can increase the amount of memory traffic and can pollute the cache. We propose the Wrong Path Cache (WPC) to eliminate the cache pollution caused by the execution of loads down mispredicted branch paths. For the configurations tested, fetching the results of wrong path loads into a fully associative 8-entry WPC can result in a 12 % to 39 % reduction in L1 data cache misses and in a speedup of up to 37%, with an average speedup of 9%, over the baseline processor. 1
The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture
- IEEE Trans. Parallel Distributed Systems
, 2005
"... Abstract—The speculated execution of threads in a multithreaded architecture, plus the branch prediction used in each thread execution unit, allows many instructions to be executed speculatively, that is, before it is known whether they actually will be needed by the program. In this study, we exami ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
Abstract—The speculated execution of threads in a multithreaded architecture, plus the branch prediction used in each thread execution unit, allows many instructions to be executed speculatively, that is, before it is known whether they actually will be needed by the program. In this study, we examine how the load instructions executed on what turn out to be incorrectly executed program paths impact the memory system performance. We find that incorrect speculation (wrong execution) on the instruction and thread-level provides an indirect prefetching effect for the later correct execution paths and threads. By continuing to execute the mispredicted load instructions even after the instruction or thread-level control speculation is known to be incorrect, the cache misses observed on the correctly executed paths can be reduced by 16 to 73 percent, with an average reduction of 45 percent. However, we also find that these extra loads can increase the amount of memory traffic and can pollute the cache. We introduce the small, fully associative Wrong Execution Cache (WEC) to eliminate the potential pollution that can be caused by the execution of the mispredicted load instructions. Our simulation results show that the WEC can improve the performance of a concurrent multithreaded architecture up to 18.5 percent on the benchmark programs tested, with an average improvement of 9.7 percent, due to the reductions in the number of cache misses. Index Terms—Speculation, multithreaded architecture, mispredicted loads, wrong execution, prefetching, wrong execution cache. 1
Adaptive Line Size Cache
, 1999
"... Current cache design supports only fixed line size. In this report, we propose an adaptive line size cache where line size can dynamically change based on a locality prediction mechanism. Overall better utilization of spatial and temporal localities is achieved by this novel cache design. Experiment ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
Current cache design supports only fixed line size. In this report, we propose an adaptive line size cache where line size can dynamically change based on a locality prediction mechanism. Overall better utilization of spatial and temporal localities is achieved by this novel cache design. Experiments on SPEC95 benchmarks show that this cache design can significantly reduce cache miss rates. 2-level adaptive line size cache can achieve average 1.7 speedup over tradition 2-level cache with L1 line size 32B and L2 line size 64B. And 128k direct-mapped L2 cache outperforms 512K 2-way set associative cache.
Inter-Core Cooperative TLB Prefetchers for Chip Multiprocessors
"... Translation Lookaside Buffers (TLBs) are commonly employed in modern processor designs and have considerable impact on overall system performance. A number of past works have studied TLB designs to lower access times and miss rates, specifically for uniprocessors. With the growing dominance of chip ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
Translation Lookaside Buffers (TLBs) are commonly employed in modern processor designs and have considerable impact on overall system performance. A number of past works have studied TLB designs to lower access times and miss rates, specifically for uniprocessors. With the growing dominance of chip multiprocessors (CMPs), it is necessary to examine TLB performance in the context of parallel workloads. This work is the first to present TLB prefetchers that exploit commonality in TLB miss patterns across cores in CMPs. We propose and evaluate two Inter-Core Cooperative (ICC) TLB prefetching mechanisms, assessing their effectiveness at eliminating TLB misses both individually and together. Our results show these approaches require at most modest hardware and can collectively eliminate 19 % to 90 % of data TLB (D-TLB) misses across the surveyed parallel workloads. We also compare performance improvements across a range of hardware and software implementation possibilities. We find that while a fully-hardware implementation results in average performance improvements of 8-46 % for a range of TLB sizes, a hardware/software approach yields improvements of 4-32%. Overall, our work shows that TLB prefetchers exploiting inter-core correlations can effectively eliminate TLB misses.
Improving Parallel I/O Performance with Data Layout Awareness ♯
"... Parallel applications can benefit greatly from massive computational capability, but their performance suffers from large latency of I/O accesses. The poor I/O performance has been attributed as a critical cause of the low sustained performance of parallel computing systems. In this study, we propos ..."
Abstract
- Add to MetaCart
Parallel applications can benefit greatly from massive computational capability, but their performance suffers from large latency of I/O accesses. The poor I/O performance has been attributed as a critical cause of the low sustained performance of parallel computing systems. In this study, we propose a data layout-aware optimization strategy to promote a better integration of the parallel I/O middleware and parallel file systems, two major components of the current parallel I/O systems, and to improve the data access performance. We explore the layout-aware optimization in both independent I/O and collective I/O, two primary forms of I/O in parallel applications. We illustrate that the layout-aware I/O optimization could improve the performance of current parallel I/O strategy effectively. The experimental results verify that the proposed strategy could improve parallel I/O performance by nearly 40 % on average. The proposed layout-aware parallel I/O has a promising potential in improving the I/O performance of parallel systems. Keywords: parallel I/O, parallel file systems, parallel I/O middleware, collective I/O, independent I/O, data layout, I/O performance, data access optimization

