Results 1 - 10 of 109
Dynamic hot data stream prefetching for general-purpose programs
- In ACM SIGPLAN Conference on Programming Language Design and Implementation, 2002
"... Prefetching data ahead of use has the potential to tolerate the growing processor-memory performance gap by overlapping long latency memory accesses with useful computation. While sophisticated prefetching techniques have been automated for limited domains, such as scientific codes that access dense ..."
Cited by 116 (2 self)
Prefetching data ahead of use has the potential to tolerate the growing processor-memory performance gap by overlapping long-latency memory accesses with useful computation. While sophisticated prefetching techniques have been automated for limited domains, such as scientific codes that access dense arrays in loop nests, a similar level of success has eluded general-purpose programs, especially pointer-chasing codes written in languages such as C and C++. We address this problem by describing, implementing, and evaluating a dynamic prefetching scheme. Our technique runs on stock hardware, is completely automatic, and works for general-purpose programs, including pointer-chasing codes written in weakly-typed languages such as C and C++. It operates in three phases. First, the profiling phase gathers a temporal data reference profile from a running program with low overhead. Next, profiling is turned off and a fast analysis algorithm extracts hot data streams, which are data reference sequences that frequently repeat in the same order, from the temporal profile. Then, the system dynamically injects code at appropriate program points to detect and prefetch these hot data streams. Finally, the process enters the hibernation phase, where no profiling or analysis is performed and the program continues to execute with the added prefetch instructions. At the end of the hibernation phase, the program is deoptimized to remove the inserted checks and prefetch instructions, and control returns to the profiling phase. For long-running programs, this profile, analyze-and-optimize, hibernate cycle repeats multiple times. Our initial results from applying dynamic prefetching are promising, indicating overall execution time improvements of 5–19% for several memory-performance-limited SPECint2000 benchmarks running their largest (ref) inputs.
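As a rough illustration of the injected detection-and-prefetch step (a minimal sketch, not the paper's implementation; the HotStream descriptor and on_data_access hook are hypothetical names), once profiling has identified a hot data stream A, B, C, D, E, the inserted checks match the prefix A, B and issue prefetches for the remaining addresses:

```cpp
// Minimal sketch: once profiling has identified a hot data stream
// A,B,C,D,E, the injected detection code matches the prefix A,B and
// prefetches the suffix C,D,E so the misses overlap with computation.
#include <cstddef>

// Hypothetical descriptor for one hot data stream.
struct HotStream {
    const void* prefix[2];   // addresses that trigger the prefetch
    const void* suffix[3];   // addresses to prefetch once the prefix matches
    int         matched = 0; // how much of the prefix has been seen so far
};

// Called from code injected at the program points that reference the stream.
inline void on_data_access(HotStream& s, const void* addr) {
    if (addr == s.prefix[s.matched]) {
        if (++s.matched == 2) {            // full prefix seen in order
            for (const void* p : s.suffix)
                __builtin_prefetch(p, /*rw=*/0, /*locality=*/3);
            s.matched = 0;
        }
    } else {
        s.matched = (addr == s.prefix[0]) ? 1 : 0; // restart the match
    }
}
```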
An overview of cache optimization techniques and cache-aware numerical algorithms
- In Proceedings of the GI-Dagstuhl Forschungseminar: Algorithms for Memory Hierarchies, volume 2625 of LNCS, 2003
"... In order to mitigate the impact of the growing gap between CPU speed and main memory performance, today's computer architectures implement hierar-chical memory structures. The idea behind this approach is to hide both the low main memory bandwidth and the latency of main memory accesses which i ..."
Cited by 36 (4 self)
In order to mitigate the impact of the growing gap between CPU speed and main memory performance, today's computer architectures implement hierarchical memory structures. The idea behind this approach is to hide both the low main memory bandwidth and the latency of main memory accesses, which is …
Generating Cache Hints for Improved Program Efficiency
- Journal of Systems Architecture, 2004
"... One of the new extensions in EPIC architectures are cache hints. On each memory instruction, two kinds of hints can be attached: a source cache hint and a target cache hint. The source hint indicates the true latency of the instruction, which is used by the compiler to improve the instruction schedu ..."
Cited by 34 (4 self)
One of the new extensions in EPIC architectures is cache hints. On each memory instruction, two kinds of hints can be attached: a source cache hint and a target cache hint. The source hint indicates the true latency of the instruction, which is used by the compiler to improve the instruction schedule. The target hint indicates at which cache levels it is profitable to retain data, allowing cache replacement decisions to be improved at run time. A compile-time method is presented which calculates appropriate cache hints. Both kinds of hints are based on the locality of the instruction, measured by the reuse distance metric. …
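For context, reuse distance is the number of distinct cache lines touched between two accesses to the same line. The sketch below (an illustration of the metric, not the paper's compile-time analysis; the reuse_distances helper and the 64-byte line size are assumptions) computes it naively from an address trace:

```cpp
// Naive O(n^2) illustration of the reuse-distance metric: for each access,
// count the *distinct* cache lines touched since the previous access to the
// same line (-1 = first access, i.e. infinite distance). Real tools use an
// LRU stack or tree; this version only illustrates the definition.
#include <cstdint>
#include <unordered_set>
#include <vector>

constexpr uint64_t LINE = 64; // assumed cache-line size in bytes

std::vector<long> reuse_distances(const std::vector<uint64_t>& addrs) {
    std::vector<uint64_t> trace; // lines in access order
    std::vector<long> out;
    for (uint64_t a : addrs) {
        uint64_t line = a / LINE;
        long dist = -1;
        std::unordered_set<uint64_t> seen;
        for (auto it = trace.rbegin(); it != trace.rend(); ++it) {
            if (*it == line) { dist = static_cast<long>(seen.size()); break; }
            seen.insert(*it);
        }
        out.push_back(dist);
        trace.push_back(line);
    }
    return out;
}
```

Accesses whose reuse distance exceeds a cache level's capacity in lines are the ones a target hint would mark as not worth retaining at that level.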
Conjunctive selection conditions in main memory
- In Proc. PODS, 2002
"... We consider the fundamental operation of applying a compound filtering condition to a set of records. With large main memories available cheaply, systems may choose to keep the data entirely in main memory, in order to improve query and/or update performance. The design of a data-intensive algorithm ..."
Cited by 29 (2 self)
We consider the fundamental operation of applying a compound filtering condition to a set of records. With large main memories available cheaply, systems may choose to keep the data entirely in main memory in order to improve query and/or update performance. The design of a data-intensive algorithm in main memory needs to take into account the architectural characteristics of modern processors, just as a disk-based method needs to consider the physical characteristics of disk devices. An important architectural feature that influences the performance of main memory algorithms is the branch misprediction penalty. We demonstrate that branch misprediction has a substantial impact on the performance of an algorithm for applying selection conditions. We describe a space of “query plans” that are logically equivalent but differ in performance due to variations in their branch prediction behavior. We propose a cost model that takes branch prediction into account, and develop a query optimization algorithm that chooses a plan with optimal estimated cost for conjunctive conditions. We also develop an efficient heuristic optimization algorithm, and show how records can be ordered to further reduce branch misprediction effects.
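The two extremes of that plan space can be sketched as follows (a hedged illustration with made-up predicates, not the paper's plans): the short-circuit form introduces a data-dependent branch per record, while the bitwise form always evaluates both predicates but removes the branch.

```cpp
// Sketch of the two extremes of the plan space for "p1(x) AND p2(x)":
// the branching plan skips p2 when p1 fails but pays a misprediction
// penalty when p1's outcome is hard to predict; the no-branch plan
// always evaluates both predicates but never mispredicts on the AND.
#include <cstddef>

size_t select_branching(const int* a, const int* b, size_t n, size_t* out) {
    size_t k = 0;
    for (size_t i = 0; i < n; ++i)
        if (a[i] < 50 && b[i] < 50)       // short-circuit: data-dependent branch
            out[k++] = i;
    return k;
}

size_t select_no_branch(const int* a, const int* b, size_t n, size_t* out) {
    size_t k = 0;
    for (size_t i = 0; i < n; ++i) {
        out[k] = i;
        k += (a[i] < 50) & (b[i] < 50);   // both evaluated; no conditional branch
    }
    return k;
}
```

A cost model that knows each predicate's selectivity and the misprediction penalty can pick between these (and intermediate) forms per conjunction, which is the role of the optimization algorithm described above.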
SARC: Sequential prefetching in adaptive replacement cache
- In Proc. of USENIX 2005 Annual Technical Conference, 2005
"... Abstract — Sequentiality of reference is an ubiquitous access pattern dating back at least to Multics. Sequential workloads lend themselves to highly accurate prediction and prefetching. In spite of the simplicity of the workload, design and analysis of a good sequential prefetching algorithm and as ..."
Cited by 27 (2 self)
Sequentiality of reference is a ubiquitous access pattern dating back at least to Multics. Sequential workloads lend themselves to highly accurate prediction and prefetching. In spite of the simplicity of the workload, the design and analysis of a good sequential prefetching algorithm and associated cache replacement policy turn out to be surprisingly intricate. As a first contribution, we uncover and remedy an anomaly (akin to the famous Belady's anomaly) that plagues sequential prefetching when it is integrated with caching. Typical workloads contain a mix of sequential and random streams. As a second contribution, we design SARC, a self-tuning, low-overhead, simple-to-implement, locally adaptive, novel cache management policy that dynamically and adaptively partitions the cache space between sequential and random streams so as to reduce read misses. As a third contribution, we implemented SARC along with two popular state-of-the-art LRU variants on hardware for IBM's flagship storage controller Shark. On Shark hardware with an 8 GB cache and 16 RAID-5 arrays serving a workload akin to Storage Performance Council's widely adopted SPC-1 benchmark, SARC consistently and dramatically outperforms the two LRU variants, shifting the throughput-response time curve to the right and thus fundamentally increasing the capacity of the system. As anecdotal evidence, at peak throughput SARC has an average response time of 5.18 ms, compared to 33.35 ms and 8.92 ms for the two LRU variants.
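The adaptive-partitioning idea can be sketched roughly as follows (an illustrative simplification, not SARC's actual adaptation rule; the step size and the bottom-hit trigger are assumptions): the cache is split into a SEQ list for sequential/prefetched data and a RANDOM list, and the SEQ list's target size is nudged toward whichever side is currently getting hits near its LRU end.

```cpp
// Illustrative sketch (not SARC's exact rule): a target size for the SEQ
// list is grown or shrunk depending on which list is yielding hits near
// its LRU end, i.e., which side would benefit from more cache space.
#include <algorithm>

struct AdaptivePartition {
    double seq_target; // desired number of cache pages for SEQ
    double total;      // total cache pages
    double step = 1.0; // hypothetical adaptation step

    // Called on a hit close to the LRU end of one of the two lists.
    void on_bottom_hit(bool in_seq_list) {
        if (in_seq_list)
            seq_target = std::min(total, seq_target + step); // grow SEQ
        else
            seq_target = std::max(0.0, seq_target - step);   // grow RANDOM
    }

    // Replacement evicts from whichever list exceeds its target size.
    bool evict_from_seq(double seq_actual) const {
        return seq_actual > seq_target;
    }
};
```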
Iterative Compilation and Performance Prediction for Numerical Applications
2004
"... As the current rate of improvement in processor performance far exceeds the rate of memory performance, memory latency is the dominant overhead in many performance critical applications. In many cases, automatic compiler-based approaches to improving memory performance are limited and programmers fr ..."
Cited by 20 (13 self)
As the current rate of improvement in processor performance far exceeds the rate of improvement in memory performance, memory latency is the dominant overhead in many performance-critical applications. In many cases, automatic compiler-based approaches to improving memory performance are limited, and programmers frequently resort to manual optimisation techniques. However, this process is tedious and time-consuming. Furthermore, a diverse range of rapidly evolving hardware makes the optimisation process even more complex. It is often hard to predict the potential benefits of different optimisations, and there are no simple criteria for deciding when to stop optimising, i.e., when optimal memory performance has been achieved or sufficiently closely approached. This thesis presents a platform-independent optimisation approach for numerical applications based on iterative feedback-directed program restructuring, using a new, reasonably fast and accurate performance prediction technique for guiding optimisations. New strategies for searching the optimisation space, by means of …
Fine-grain Priority Scheduling on Multi-channel Memory Systems
- In 8th International Symposium on High Performance Computer Architecture, 2002
"... Configurations of contemporary DRAM memory systems become increasingly complex. A recent study [5] shows that application performance is highly sensitive to choices of configurations, and suggests that tuning burst sizes and channel configurations be an effective way to optimize the DRAM performance ..."
Cited by 18 (3 self)
Configurations of contemporary DRAM memory systems are becoming increasingly complex. A recent study [5] shows that application performance is highly sensitive to the choice of configuration, and suggests that tuning burst sizes and channel configurations is an effective way to optimize DRAM performance for a given memory-intensive workload. However, this approach is workload dependent. In this study we show that, by utilizing fine-grain priority access scheduling, we are able to find a workload-independent configuration that achieves optimal performance on a multi-channel memory system. Our approach makes good use of the high concurrency and high bandwidth available on such memory systems, and effectively reduces the memory stall time of memory-intensive applications. Using execution-driven simulation of a 4-way-issue, 2 GHz processor, we show that the average performance improvement for fifteen memory-intensive SPEC2000 programs using an optimized fine-grain priority scheduling is about 13% and 8% for 2-channel and 4-channel Direct Rambus DRAM memory systems, respectively, compared with gang scheduling. Compared with burst scheduling, the average performance improvement is 16% and 14% for the 2-channel and 4-channel memory systems, respectively.
Static identification of delinquent loads
- In CGO ’04: Proceedings of the International Symposium on Code Generation and Optimization, 2004
"... The effective use of processor caches is crucial to the performance of applications. It has been shown that cache misses are not evenly distributed throughout a program. In applications running on RISC-style processors, a small number of delinquent load instructions are responsible for most of the c ..."
Cited by 16 (3 self)
The effective use of processor caches is crucial to the performance of applications. It has been shown that cache misses are not evenly distributed throughout a program. In applications running on RISC-style processors, a small number of delinquent load instructions are responsible for most of the cache misses. Identification of delinquent loads is the key to the success of many cache optimization and prefetching techniques. In this paper, we propose a method for identifying delinquent loads that can be implemented at compile time. Our experiments over eighteen benchmarks from the SPEC suite show that our proposed scheme is stable across benchmarks, inputs, and cache structures, identifying on average the 10% of loads that account for over 90% of all data cache misses in the benchmarks we tested. As far as we know, this is the first time a technique for static delinquent load identification with such a level of precision and coverage has been reported. While comparable techniques can also identify load instructions that cover 90% of all data cache misses, they do so by selecting over 50% of all load instructions in the code, resulting in a high number of false positives. If basic block profiling is used in conjunction with our heuristic, our results show that it is possible to pin down just 1.3% of the load instructions that account for 82% of all data cache misses.
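For reference, the profile-based notion of the delinquent set that the static heuristic tries to reproduce can be sketched as a greedy coverage computation (the delinquent_loads helper and the per-load miss profile are assumptions for illustration, not the paper's compile-time method):

```cpp
// Sketch of the coverage metric behind "delinquent loads": given a
// per-load cache-miss profile, greedily pick the loads with the most
// misses until they cover a target fraction (e.g. 90%) of all misses.
// The paper's contribution is approximating this set statically,
// without the profile; this only illustrates the target set.
#include <algorithm>
#include <cstdint>
#include <vector>

struct LoadStat { uint64_t pc; uint64_t misses; };

std::vector<uint64_t> delinquent_loads(std::vector<LoadStat> loads,
                                       double coverage = 0.90) {
    uint64_t total = 0;
    for (const auto& l : loads) total += l.misses;
    std::sort(loads.begin(), loads.end(),
              [](const LoadStat& a, const LoadStat& b) { return a.misses > b.misses; });
    std::vector<uint64_t> picked;
    uint64_t covered = 0;
    for (const auto& l : loads) {
        if (total == 0 || covered >= coverage * total) break;
        picked.push_back(l.pc);   // this load is delinquent
        covered += l.misses;
    }
    return picked;
}
```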
Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions
- ACM Euro-Par Conference, 2002
"... Abstract. As the degree of instruction-level parallelism in superscalar architectures increases, the gap between processor and memory performance continues to grow requiring more aggressive techniques to increase the performance of the memory system. We propose a new technique, which is based on the ..."
Cited by 14 (4 self)
As the degree of instruction-level parallelism in superscalar architectures increases, the gap between processor and memory performance continues to grow, requiring more aggressive techniques to increase the performance of the memory system. We propose a new technique, based on the wrong-path execution of loads far beyond instruction-fetch-limiting conditional branches, to exploit more instruction-level parallelism by reducing the impact of memory delays. We examine the effects of executing loads down the wrong branch path on the performance of an aggressive-issue processor. We find that, by continuing to execute the loads issued on the mispredicted path, even after the branch is resolved, we can actually reduce the cache misses observed on the correctly executed path. This wrong-path execution of loads can result in a speedup of up to 5% due to an indirect prefetching effect that brings data or instruction blocks into the cache for instructions subsequently issued on the correctly predicted path. However, it can also increase the amount of memory traffic and can pollute the cache. We propose the Wrong Path Cache (WPC) to eliminate the cache pollution caused by executing loads down mispredicted branch paths. For the configurations tested, fetching the results of wrong-path loads into a fully associative 8-entry WPC can result in a 12% to 39% reduction in L1 data cache misses and a speedup of up to 37%, with an average speedup of 9%, over the baseline processor.
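A toy model of the WPC structure described above might look like the following (an illustrative sketch; the paper's design details, such as its replacement policy and its interaction with the L1 fill path, are not reproduced here, and FIFO replacement is an assumption):

```cpp
// Illustrative model of the Wrong Path Cache idea: blocks fetched by
// loads on a mispredicted path go into a small fully associative side
// buffer instead of the L1, so correct-path code still gets the
// prefetching benefit without wrong-path fills polluting the L1.
#include <array>
#include <cstdint>

class WrongPathCache {
    static constexpr int kEntries = 8;        // 8-entry buffer as in the paper
    std::array<uint64_t, kEntries> tag{};     // block addresses
    std::array<bool, kEntries> valid{};
    int next = 0;                             // FIFO replacement (simplification)
public:
    // Fill on a wrong-path load miss (instead of allocating in the L1).
    void fill_wrong_path(uint64_t block_addr) {
        tag[next] = block_addr;
        valid[next] = true;
        next = (next + 1) % kEntries;
    }
    // Probed in parallel with the L1 on correct-path accesses.
    bool hit(uint64_t block_addr) const {
        for (int i = 0; i < kEntries; ++i)
            if (valid[i] && tag[i] == block_addr) return true;
        return false;
    }
};
```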
C-Pack: A High-Performance Microprocessor Cache Compression Algorithm
"... Abstract—Microprocessor designers have been torn between tight constraints on the amount of on-chip cache memory and the high latency of off-chip memory, such as dynamic random access memory. Accessing off-chip memory generally takes an order of magnitude more time than accessing on-chip cache, and ..."
Cited by 13 (0 self)
Microprocessor designers have been torn between tight constraints on the amount of on-chip cache memory and the high latency of off-chip memory, such as dynamic random access memory. Accessing off-chip memory generally takes an order of magnitude more time than accessing on-chip cache, and two orders of magnitude more time than executing an instruction. Computer systems and microarchitecture researchers have proposed using hardware data compression units within the memory hierarchies of microprocessors in order to improve performance, energy efficiency, and functionality. However, most past work, and all work on cache compression, has made unsubstantiated assumptions about the performance, power consumption, and area overheads of the proposed compression algorithms and hardware. It is not possible to determine whether compression at the levels of the memory hierarchy closest to the processor is beneficial without understanding its costs. Furthermore, as we show in this paper, raw compression ratio is not always the most important metric. In this work, we present a lossless compression algorithm that has been designed for fast on-line data compression, and cache compression in particular. The algorithm has a number of novel features tailored for this application, including combining pairs of compressed lines into one cache line and allowing parallel compression of multiple words while using a single dictionary and without degradation in compression ratio. We reduced the proposed algorithm to a register-transfer-level hardware implementation, permitting performance, power consumption, and area estimation. Experiments comparing our work to previous work are described. Index Terms: cache compression, effective system-wide compression ratio, pair matching, parallel compression, hardware implementation.
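In the spirit of the dictionary-based word compression described above, a much-simplified software sketch might classify each 32-bit word of a cache line as zero, a full dictionary match, a partial (upper-bytes) match, or a literal (the Token encoding, dictionary size, and update rule here are assumptions for illustration, not C-Pack's actual pattern codes or pair-matching logic):

```cpp
// Simplified, in-the-spirit sketch of dictionary-based cache-line
// compression: each 32-bit word is encoded as "zero", "full dictionary
// match", "partial (upper-bytes) match + low bytes", or a literal, and
// unmatched words are pushed into a small FIFO dictionary.
#include <cstdint>
#include <vector>

struct Token { uint8_t kind; uint32_t index; uint32_t payload; };
enum : uint8_t { ZERO, FULL_MATCH, PARTIAL_MATCH, LITERAL };

std::vector<Token> compress_line(const uint32_t* words, int n) {
    constexpr size_t kDictSize = 16;             // assumed dictionary size
    std::vector<uint32_t> dict;                  // small FIFO dictionary
    std::vector<Token> out;
    for (int i = 0; i < n; ++i) {
        uint32_t w = words[i];
        Token t{LITERAL, 0, w};
        if (w == 0) {
            t = {ZERO, 0, 0};
        } else {
            for (uint32_t d = 0; d < dict.size(); ++d) {
                if (dict[d] == w) { t = {FULL_MATCH, d, 0}; break; }
                if ((dict[d] >> 16) == (w >> 16))          // upper 2 bytes match
                    t = {PARTIAL_MATCH, d, w & 0xFFFFu};
            }
        }
        if (t.kind == LITERAL || t.kind == PARTIAL_MATCH) { // new content seen
            if (dict.size() == kDictSize) dict.erase(dict.begin());
            dict.push_back(w);                   // decompressor mirrors this update
        }
        out.push_back(t);
    }
    return out;
}
```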