Results 11 - 20 of 173
Data Access Microarchitectures for Superscalar Processors with Compiler-Assisted Data Prefetching
"... The performance of superscalar processors is more sensitive to the memory system delay than their single-issue predecessors. This paper examines alternative data access microarchitectures that effectively support compilerassisted data prefetching in superscalar processors. In particular, a prefetch ..."
Abstract
-
Cited by 70 (6 self)
- Add to MetaCart
The performance of superscalar processors is more sensitive to the memory system delay than that of their single-issue predecessors. This paper examines alternative data access microarchitectures that effectively support compiler-assisted data prefetching in superscalar processors. In particular, a prefetch buffer is shown to be more effective than increasing the cache size in solving the cache pollution problem. Overall, we show that a small data cache with compiler-assisted data prefetching can achieve a performance level close to that of an ideal cache.
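To make the technique concrete: below is a minimal sketch of the kind of compiler-inserted data prefetch the abstract describes, written with the GCC/Clang __builtin_prefetch intrinsic. The prefetch distance and the intrinsic itself are illustrative assumptions, not the paper's microarchitecture (which targets a dedicated prefetch buffer).

```c
#include <stddef.h>

#define PF_DIST 16  /* assumed prefetch distance, in elements */

/* Compiler-style software prefetch: request a[i + PF_DIST] while
 * summing a[i], so the line arrives before the demand load needs it. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 1 /* low temporal reuse */);
        sum += a[i];
    }
    return sum;
}
```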
Software Support for Speculative Loads
- Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems
, 1992
"... This paper describes a simple hardware mechanism and related compiler support for software-controlled speculative loads. The compiler issues speculative load instructions based on anticipated data references and the ability of the memory system to hide memory latency in high-performance processors. ..."
Abstract
-
Cited by 67 (3 self)
- Add to MetaCart
(Show Context)
This paper describes a simple hardware mechanism and related compiler support for software-controlled speculative loads. The compiler issues speculative load instructions based on anticipated data references and the ability of the memory system to hide memory latency in high-performance processors. The architectural support for such a mechanism is simple and minimal, yet handles faults gracefully. We have simulated the speculative load mechanism based on a MIPS processor and a detailed memory system. The results on scientific kernel loops indicate that our speculative load technique is an effective approach to hiding memory latency. The performance gap between processors and memory has widened in the last few years: in the last decade, microprocessor speeds have increased at a rate of 50% to 100% each year, whereas DRAM speeds have increased at a rate of 10% or less each year [13]. As the performance gap becomes wider, high-performance processors become more sensitive...
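As a rough illustration of the compiler transformation the abstract targets, the sketch below hoists a guarded load above its branch. Plain C cannot express fault deferral, so the hoisted version is safe only when the pointer is always dereferenceable; the paper's hardware support exists precisely to handle the faulting case.

```c
/* Before: the load cannot start until the branch resolves. */
int guarded(const int *p, int use_p)
{
    if (use_p)
        return *p;      /* demand load, exposed to full memory latency */
    return 0;
}

/* After: the compiler issues the load speculatively, overlapping its
 * latency with the branch. With a speculative-load instruction, a
 * fault on an invalid p would be deferred and raised only if the
 * value were actually consumed. */
int hoisted(const int *p, int use_p)
{
    int v = *p;         /* speculative load, issued early */
    return use_p ? v : 0;
}
```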
Guided Region Prefetching: A Cooperative Hardware/Software Approach
- In Proceedings of the 30th International Symposium on Computer Architecture
, 2003
"... Despite large caches, main-memory access latencies still cause significant performance losses in many applications. Numerous hardware and software prefetching schemes have been proposed to tolerate these latencies. Software prefetching typically provides better prefetch accuracy than hardware, but i ..."
Abstract
-
Cited by 65 (9 self)
- Add to MetaCart
(Show Context)
Despite large caches, main-memory access latencies still cause significant performance losses in many applications. Numerous hardware and software prefetching schemes have been proposed to tolerate these latencies. Software prefetching typically provides better prefetch accuracy than hardware, but is limited by prefetch instruction overheads and the compiler's limited ability to schedule prefetches sufficiently far in advance to cover level-two cache miss latencies. Hardware prefetching can be effective at hiding these large latencies, but generates many useless prefetches and consumes considerable memory bandwidth. In this paper, we propose a cooperative hardware-software prefetching scheme called Guided Region Prefetching (GRP), which uses compiler-generated hints encoded in load instructions to regulate an aggressive hardware prefetching engine. We compare GRP against a sophisticated pure hardware stride prefetcher and a scheduled region prefetching (SRP) engine. SRP and GRP show the best performance, with respective gains of 22% and 21% over no prefetching, but SRP incurs 180% extra memory traffic---nearly tripling bandwidth requirements. GRP achieves performance close to SRP with only an eighth of the extra prefetching traffic, a 23% increase over no prefetching. The GRP hardware-software collaboration thus combines the accuracy of compiler-based program analysis with the performance potential of aggressive hardware prefetching, bringing the performance gap versus a perfect L2 cache to under 20%.
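A toy model of the hint-driven engine the abstract describes is sketched below; the hint classes, region size, and miss-handler interface are assumptions for illustration, not GRP's actual encoding.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed hint classes a compiler might attach to load instructions. */
enum grp_hint { HINT_NONE, HINT_SPATIAL, HINT_POINTER };

/* On a load miss, the hint tells the hardware engine whether (and how
 * aggressively) to prefetch the region around the missing address. */
void on_load_miss(uint64_t miss_addr, enum grp_hint hint)
{
    if (hint == HINT_SPATIAL) {
        uint64_t base = miss_addr & ~0xFFFull;   /* assumed 4 KB region */
        for (uint64_t line = base; line < base + 4096; line += 64)
            printf("prefetch line %#llx\n", (unsigned long long)line);
    }
    /* HINT_NONE suppresses prefetching entirely, which is how hints
     * cut the useless traffic a pure hardware engine would generate. */
}
```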
SPAID: Software Prefetching in Pointer and Call Intensive Environments
- Proc. 28th International Symposium on Microarchitecture
, 1995
"... ..."
Caching considerations for generational garbage collection: a case for large and set-associative caches
, 1990
"... ..."
(Show Context)
The Declining Effectiveness of Dynamic Caching for General-Purpose Microprocessors
, 1995
"... The computational power of commodity general-purpose microprocessors is racing to truly amazing levels. As peak levels of performance rise, the building of memory systems that can keep pace becomes increasingly problematic. We claim that in addition to the latency associated with waiting for operand ..."
Abstract
-
Cited by 50 (9 self)
- Add to MetaCart
The computational power of commodity general-purpose microprocessors is racing to truly amazing levels. As peak levels of performance rise, building memory systems that can keep pace becomes increasingly problematic. We claim that in addition to the latency associated with waiting for operands, the bandwidth of the memory system, especially that across the chip boundary, will become a progressively greater limit to high performance. After describing the current state of microarchitectural solutions aimed at alleviating the memory bottleneck, this paper postulates that dynamic caches themselves use memory inefficiently and will impede attempts to solve the memory problem. We present an analysis of several important algorithms, which shows that increasing levels of integration will not result in computational requirements outstripping off-chip bandwidth needs, thereby preserving the memory bottleneck. We then present results from two sets of simulations, which measured both the efficiency with which current caching techniques use memory (generally less than 20%) and how well (or poorly) caches reduce traffic to main memory (up to 2000 times worse than optimal at some cache sizes). We then discuss how two classes of techniques, (i) decoupling memory operations from computation and (ii) explicit compiler management of the memory hierarchy, provide better long-term solutions for lowering a program's memory latencies and bandwidth requirements. Finally, we describe Galileo, a new project that will attempt to provide a long-term solution to the pernicious memory bottleneck.
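The efficiency metric the paper reports can be illustrated with simple arithmetic; the numbers below are hypothetical, chosen only to match the "generally less than 20%" figure quoted above.

```c
/* Cache efficiency: of all bytes fetched into the cache, the fraction
 * actually touched before eviction. */
double cache_efficiency(double bytes_touched, double bytes_fetched)
{
    return bytes_fetched > 0 ? bytes_touched / bytes_fetched : 0.0;
}

/* Hypothetical example: a cache fetches 64-byte lines, but a program
 * touches only 12 bytes of a typical line before eviction, giving
 * 12/64 = 18.75% efficiency, i.e. under 20%. */
```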
Toward kilo-instruction processors
- ACM Transactions on Architecture and Code Optimization
, 2004
"... The continuously increasing gap between processor and memory speeds is a serious limitation to the performance achievable by future microprocessors. Currently, processors tolerate long-latency memory operations largely by maintaining a high number of in-flight instructions. In the future, this may r ..."
Abstract
-
Cited by 48 (4 self)
- Add to MetaCart
The continuously increasing gap between processor and memory speeds is a serious limitation to the performance achievable by future microprocessors. Currently, processors tolerate long-latency memory operations largely by maintaining a high number of in-flight instructions. In the future, this may require supporting many hundreds, or even thousands, of in-flight instructions. Unfortunately, the traditional approach of scaling up critical processor structures to provide such support is impractical at these levels, due to area, power, and cycle time constraints. In this paper we show that, in order to overcome this resource-scalability problem, the way in which critical processor resources are managed must be changed. Instead of simply upsizing the processor structures, we propose a smarter use of the available resources, supported by a selective checkpointing mechanism. This mechanism allows instructions to commit out of order, and makes a reorder buffer unnecessary. We present a set of techniques such as multilevel instruction queues, late allocation and early release of registers, and early release of load/store queue entries. Altogether, these techniques constitute what we call a kilo-instruction processor, an architecture that can support thousands of in-flight instructions, and thus may achieve high performance even in the presence of large memory access latencies.
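The sketch below shows one plausible data layout for the selective checkpointing the abstract describes: a handful of architectural snapshots replaces the per-instruction reorder buffer, so instructions between checkpoints can commit out of order. All structure names and sizes are assumptions, not the paper's design.

```c
#include <stdint.h>

#define NUM_ARCH_REGS 32
#define MAX_CKPTS      8  /* assumed: a few in-flight checkpoints, not one ROB entry per instruction */

/* A checkpoint snapshots just enough state to roll back on a
 * misspeculation; anything younger is squashed, anything older has
 * effectively committed out of order. */
struct checkpoint {
    uint64_t seq_no;                    /* instruction that opened this checkpoint */
    uint64_t arch_regs[NUM_ARCH_REGS];  /* architectural register snapshot */
    uint32_t store_queue_tail;          /* where to truncate buffered stores on rollback */
};

struct ckpt_stack {
    struct checkpoint ckpts[MAX_CKPTS];
    int depth;                          /* checkpoints currently live */
};
```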
Access Order and Memory-Conscious Cache Utilization
- In Proceedings of the First Annual Symposium on High Performance Computer Architecture
, 1995
"... As processor speeds increase relative to memory speeds, memory bandwidth is rapidly becoming the limiting performance factor for many applications. Several approaches to bridging this performance gap have been suggested. This paper examines one approach, access ordering, and pushes its limits to det ..."
Abstract
-
Cited by 48 (12 self)
- Add to MetaCart
(Show Context)
As processor speeds increase relative to memory speeds, memory bandwidth is rapidly becoming the limiting performance factor for many applications. Several approaches to bridging this performance gap have been suggested. This paper examines one approach, access ordering, and pushes its limits to determine bounds on memory performance. We present several access-ordering schemes and compare their performance, developing analytic models and partially validating these with benchmark timings on the Intel i860XR. Processor speeds are increasing much faster than memory speeds; thus memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly scientific computations. Proposed solutions range from software prefetching [4, 16, 27] and iteration space tiling [5, 8, 9, 18, 32, 38] to address transformations [12, 13], unusual memory systems [3, 10, 33, 36], and prefetching or non-blocking caches [1, 6, 34]. Here we take one technique, ...
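As a rough illustration of access ordering, the sketch below reads each stream in short bursts rather than interleaving the streams element by element, so consecutive requests stay within one DRAM page. The block size and the daxpy kernel are assumptions, not the paper's schemes.

```c
#include <stddef.h>

#define BLOCK 8  /* assumed burst length, tuned to the DRAM page/bank layout */

/* y[i] += a * x[i], with accesses grouped per stream so the memory
 * system sees runs of same-page references instead of an x/y ping-pong. */
void daxpy_ordered(double *y, const double *x, double a, size_t n)
{
    for (size_t i = 0; i < n; i += BLOCK) {
        size_t m = (n - i < BLOCK) ? n - i : BLOCK;
        double xb[BLOCK], yb[BLOCK];
        for (size_t k = 0; k < m; k++) xb[k] = x[i + k];   /* burst-read x */
        for (size_t k = 0; k < m; k++) yb[k] = y[i + k];   /* burst-read y */
        for (size_t k = 0; k < m; k++) yb[k] += a * xb[k];
        for (size_t k = 0; k < m; k++) y[i + k] = yb[k];   /* burst-write y */
    }
}
```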
Examination of a Memory Access Classification Scheme for Pointer-Intensive and Numeric Programs
, 1996
"... In recent work, we described a data prefetch mechanism for pointer-intensive and numeric computations, and presented some aggregate measurements on a suite of benchmarks to quantify its performance potential [MH95]. The basis for this device is a simple classification of memory access patterns in pr ..."
Abstract
-
Cited by 46 (0 self)
- Add to MetaCart
In recent work, we described a data prefetch mechanism for pointer-intensive and numeric computations, and presented some aggregate measurements on a suite of benchmarks to quantify its performance potential [MH95]. The basis for this device is a simple classification of memory access patterns in programs that we introduced earlier [HM94]. In this paper we take a close look at two codes from our suite, an English parser called Link-Gram, and the circuit simulation program spice2g6, and present a detailed analysis of them in the context of our model. Focusing on just two programs allows us to display a wider range of data, and discuss relevant code fragments extracted from their source distributions. Results from this study provide a deeper understanding of our memory access classification scheme, and suggest additional optimizations for future data prefetch mechanisms. Keywords: CPU architecture, data cache, memory access pattern classification, instruction profiling, memory latency t...
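A toy classifier in the spirit of such a scheme is sketched below; the three categories and the delta test are illustrative simplifications, not the authors' taxonomy from [HM94].

```c
#include <stdint.h>

enum access_class { CONSTANT, STRIDED, IRREGULAR };

/* Classify one load's reference stream from its address deltas: a
 * repeated delta of zero is CONSTANT, any other repeated delta is
 * STRIDED, and varying deltas (e.g. pointer chasing) are IRREGULAR. */
enum access_class classify(const uint64_t *addrs, int n)
{
    if (n < 3)
        return CONSTANT;  /* too few samples to judge */
    int64_t d0 = (int64_t)(addrs[1] - addrs[0]);
    for (int i = 2; i < n; i++)
        if ((int64_t)(addrs[i] - addrs[i - 1]) != d0)
            return IRREGULAR;
    return d0 == 0 ? CONSTANT : STRIDED;
}
```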
Toward Scalable Cache Only Memory Architectures
- Swedish Institute of Computer Science
, 1993
"... HIGH PERFORMANCE at a low cost is the common goal of most new computers. Even if the speed of microprocessors seems to double every year, there are, and will always be, important applications demanding even better performance. There are twoways of meeting this demand: a single processor designed wit ..."
Abstract
-
Cited by 46 (1 self)
- Add to MetaCart
HIGH PERFORMANCE at a low cost is the common goal of most new computers. Even if the speed of microprocessors seems to double every year, there are, and will always be, important applications demanding even better performance. There are two ways of meeting this demand: a single processor designed with exclusive technology, or several processors in cooperation. The shared-memory model allows the processors to share inputs and results, and is regarded as the most general model for cooperation. Our approach has been to take today's microprocessors as a starting point and to add the means for successful cooperation via shared memory. There exists, however, no physical shared memory in our proposal. Instead, all memory is divided among the processors and organized in such a way that data can be duplicated, moved freely, and allowed to reside in any memory. This behavior of data is not visible to the programmer, who sees the popular shared-memory abstraction. It is, however, beneficial to performance and adapts well to different application behaviors. We have introduced a new class of architectures based on the above model, comprising caches and processors connected by a network---Cache-Only Memory Architectures (COMA)---and an implementation proposal thereof---the Data Diffusion Machine (DDM). The large caches are managed by a cache-coherence protocol which makes sure the many copies of a datum have the same value. The protocol will also find a datum that is not in the caches of the requesting processor. COMAs were shown to have superior performance over alternative shared-memory architectures in a quantitative analytical performance study. The implementation proposal for the DDM has been simulated with good performance results. Two optimization techniques have also been proposed; ha...
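A highly simplified sketch of the attraction-memory idea follows: data has no fixed home, so a read miss locates a copy elsewhere and pulls it into local memory. The two-node interface and three states are a toy subset for illustration, not the DDM protocol.

```c
/* Toy COMA-style item: each node's memory behaves as a huge cache. */
enum item_state { INVALID, SHARED, EXCLUSIVE };

struct mem_item {
    enum item_state state;
    long value;
};

/* On a local read miss, copy the datum from whichever node holds it;
 * both copies end up SHARED, preserving single-value coherence. */
void read_miss(struct mem_item *local, struct mem_item *remote)
{
    if (remote->state != INVALID) {
        local->value = remote->value;   /* datum diffuses to the requester */
        local->state = SHARED;
        if (remote->state == EXCLUSIVE)
            remote->state = SHARED;     /* holder loses exclusivity */
    }
}
```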