Tolerating Latency Through Software-Controlled Data Prefetching (1993)

by Todd C. Mowry

Results 1 - 10 of 302

Design and Evaluation of a Compiler Algorithm for Prefetching

by Todd C. Mowry, Monica S. Lam, Anoop Gupta - in Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 1992
"... Software-controlled data prefetching is a promising technique for improving the performance of the memory subsystem to match today's high-performance processors. While prefetching is useful in hiding the latency, issuing prefetches incurs an instruction overhead and can increase the load on the ..."
Abstract - Cited by 501 (20 self) - Add to MetaCart
Software-controlled data prefetching is a promising technique for improving the performance of the memory subsystem to match today's high-performance processors. While prefetching is useful in hiding the latency, issuing prefetches incurs an instruction overhead and can increase the load on the memory subsystem. As a result, care must be taken to ensure that such overheads do not exceed the benefits.

Citation Context

...e and software approaches to improve the memory performance have been proposed recently [15]. A promising technique to mitigate the impact of long cache miss penalties is software-controlled prefetching [5, 13, 16, 22, 23]. Software-controlled prefetching requires support from both hardware and software. The processor must provide a special "prefetch" instruction. The software uses this instruction to inform the hardwa...
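
To make the mechanism concrete, here is a minimal sketch of what such a compiler inserts for a streaming loop. The GCC/Clang intrinsic __builtin_prefetch stands in for the special "prefetch" instruction, and the prefetch distance PF_DIST is an assumed tuning parameter, not a value from the paper:

    /* Illustrative sketch of compiler-inserted prefetching for a dense loop.
     * __builtin_prefetch is a GCC/Clang intrinsic standing in for a real
     * prefetch instruction; PF_DIST is an assumed parameter, roughly
     * ceil(miss latency / cycles per iteration). Prefetches are non-binding
     * hints: prefetching past the end of the array never faults. */
    #define PF_DIST 16

    double sum_array(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            __builtin_prefetch(&a[i + PF_DIST], /*rw=*/0, /*locality=*/0);
            s += a[i];
        }
        return s;
    }

The full algorithm additionally uses locality analysis to prefetch only references expected to miss, and unrolls or splits loops so each cache line is prefetched once rather than once per element.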

The Wisconsin Wind Tunnel: Virtual Prototyping of Parallel Computers

by Steven K. Reinhardt, Mark D. Hill, James R. Larus, Alvin R. Lebeck, James C. Lewis, David A. Wood - In Proceedings of the 1993 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, 1993
"... We have developed a new technique for evaluating cache coherent, shared-memory computers. The Wisconsin Wind Tunnel (WWT) runs a parallel sharedmemory program on a parallel computer (CM-5) and uses execution-driven, distributed, discrete-event simulation to accurately calculate program execution tim ..."
Abstract - Cited by 226 (29 self) - Add to MetaCart
We have developed a new technique for evaluating cache coherent, shared-memory computers. The Wisconsin Wind Tunnel (WWT) runs a parallel shared-memory program on a parallel computer (CM-5) and uses execution-driven, distributed, discrete-event simulation to accurately calculate program execution time. WWT is a virtual prototype that exploits similarities between the system under design (the target) and an existing evaluation platform (the host). The host directly executes all target program instructions and memory references that hit in the target cache. WWT's shared memory uses the CM-5 memory's error-correcting code (ECC) as valid bits for a fine-grained extension of shared virtual memory. Only memory references that miss in the target cache trap to WWT, which simulates a cache-coherence protocol. WWT correctly interleaves target machine events and calculates target program execution time. WWT runs on parallel computers with greater speed and memory capacity than uniprocessors. WWT...

Citation Context

...use we did not have MIPS math library sources. not demonstrate that either system correctly modeled a real computer. We compared WWT against Dixie, a Tango-based simulation of the DASH multiprocessor [26]. Tango is a direct-execution simulation system that runs on a uniprocessor MIPS-based workstation. Tango instruments a target program's assembly code with additional code that computes instruction ex...
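
WWT's core trick, direct execution that traps only on target cache misses, can be sketched as follows. The real system marks non-resident blocks by forcing bad ECC so the host hardware faults on them; that is CM-5-specific, so this portable sketch substitutes an explicit software check on a per-block state table. All names here are illustrative, not WWT's actual code:

    #include <stdint.h>

    #define BLOCK_SHIFT 5        /* assumed 32-byte target cache blocks */
    enum { INVALID, READONLY, WRITABLE };

    extern uint8_t block_state[];                /* one entry per target block */
    extern void coherence_miss(uintptr_t addr);  /* runs the simulated protocol
                                                    and charges target cycles  */

    /* Illustrative stand-in: a hit costs only this check and executes
     * directly on the host; a miss "traps" into the simulator. In WWT proper
     * the check is free, because a load from a block with forced-bad ECC
     * faults in hardware and the fault handler plays this role. */
    static inline uint32_t target_load(const uint32_t *p) {
        if (block_state[(uintptr_t)p >> BLOCK_SHIFT] == INVALID)
            coherence_miss((uintptr_t)p);
        return *p;    /* target cache hit: direct execution on the host */
    }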

Effective Hardware-Based Data Prefetching for High-Performance Processors

by T.-F. Chen, J.-L. Baer - IEEE Transactions on Computers, 1995
"... ..."
Abstract - Cited by 220 (2 self) - Add to MetaCart
Abstract not found

Compiler-Based Prefetching for Recursive Data Structures

by Chi-Keung Luk, Todd C. Mowry - In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, 1996
"... Software-controlled data prefetching offers the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. While prefetching has enjoyed considerable success in array-based numeric codes, its potential in pointer-based applications ..."
Abstract - Cited by 203 (14 self) - Add to MetaCart
Software-controlled data prefetching offers the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. While prefetching has enjoyed considerable success in array-based numeric codes, its potential in pointer-based applications has remained largely unexplored. This paper investigates compiler-based prefetching for pointer-based applications---in particular, those containing recursive data structures. We identify the fundamental problem in prefetching pointer-based data structures and propose a guideline for devising successful prefetching schemes. Based on this guideline, we design three prefetching schemes, we automate the most widely applicable scheme (greedy prefetching) in an optimizing research compiler, and we evaluate the performance of all three schemes on a modern superscalar processor similar to the MIPS R10000. Our results demonstrate that compiler-inserted prefetching can significantly improve the execution...
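
Greedy prefetching exploits the one piece of address information that is available early in a recursive traversal: a node's child pointers. A minimal sketch of what the compiler inserts, using the GCC/Clang intrinsic __builtin_prefetch as an assumed stand-in for the prefetch instruction:

    /* Greedy prefetching on a binary tree (illustrative sketch): on arrival
     * at a node, both child pointers are already available, so prefetch them
     * before doing this node's work; the children's cache misses then
     * overlap with it. __builtin_prefetch never faults, so a NULL child is
     * harmless. */
    typedef struct tree { struct tree *left, *right; int key; } tree_t;

    long sum_keys(const tree_t *t) {
        if (t == NULL) return 0;
        __builtin_prefetch(t->left, 0, 3);
        __builtin_prefetch(t->right, 0, 3);
        return t->key + sum_keys(t->left) + sum_keys(t->right);
    }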

Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors

by Chi-Keung Luk - In Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001
"... Hardly predictable data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures become increasingly popular, one attractive appr ..."
Abstract - Cited by 174 (0 self) - Add to MetaCart
Hard-to-predict data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures become increasingly popular, one attractive approach is to use idle threads on these machines to perform pre-execution---essentially a combined act of speculative address generation and prefetching---to accelerate the main thread. In this paper, we propose such a pre-execution technique for simultaneous multithreading (SMT) processors. By using software to control pre-execution, we are able to handle some of the most important access patterns that are typically difficult to prefetch. Compared with existing work on pre-execution, our technique is significantly simpler to implement (e.g., no integration of pre-execution results, no need to shorten programs for pre-execution, and no need for special hardware to copy register values upon thread spawns). Consequently, only minimal extensions to SMT machines are required to support our technique. Despite its simplicity, our technique offers an average speedup of 24% on a set of irregular applications, which is a 19% speedup over state-of-the-art software-controlled prefetching.

Citation Context

...fers an average speedup of 24% in a set of irregular applications, which is a 19% speedup over state-of-the-art software-controlled prefetching. 1. Introduction Multithreading [1, 32] and prefetching [8, 20, 22] are two major techniques for tolerating ever-increasing memory latency. Multithreading tolerates latency by executing instructions from another concurrent thread when the running thread encounters a ...
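
The idea can be sketched with an ordinary thread standing in for the spare SMT context. The helper executes only the address-generating slice of the traversal (the pointer chase); because its loop body is far shorter than the main thread's, it runs ahead and warms the cache, and no result integration is needed. A hypothetical illustration, not the paper's infrastructure (build with -pthread):

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct node { struct node *next; long payload[15]; } node_t;

    static node_t *head;

    /* Pre-execution thread: the traversal itself acts as the prefetch, since
     * each dereference pulls the node's cache line in ahead of the main
     * thread. volatile keeps the compiler from deleting the "empty" loop. */
    static void *pre_execute(void *arg) {
        (void)arg;
        for (volatile node_t *n = head; n != NULL; n = n->next)
            ;
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < 100000; i++) {       /* build a toy list */
            node_t *n = calloc(1, sizeof *n);
            n->payload[0] = i; n->next = head; head = n;
        }
        pthread_t helper;
        pthread_create(&helper, NULL, pre_execute, NULL);
        long sum = 0;
        for (node_t *n = head; n != NULL; n = n->next)
            sum += 3 * n->payload[0] + 1;        /* the full computation */
        pthread_join(helper, NULL);
        return (int)(sum & 0xff);
    }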

Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors

by Kourosh Gharachorloo, et al.
"... The memory consistency model supported by a multiprocessor architecture determines the amount of buffering and pipelining that may be used to hide or reduce the latency of memory accesses. Several different consistency models have been proposed. These range from sequential consistency on one end, al ..."
Abstract - Cited by 171 (10 self) - Add to MetaCart
The memory consistency model supported by a multiprocessor architecture determines the amount of buffering and pipelining that may be used to hide or reduce the latency of memory accesses. Several different consistency models have been proposed. These range from sequential consistency on one end, allowing very limited buffering, to release consistency on the other end, allowing extensive buffering and pipelining. The processor consistency and weak consistency models fall in between. The advantage of the less strict models is increased performance potential. The disadvantage is increased hardware complexity and a more complex programming model. Making an informed decision on the above tradeoff requires performance data for the various models. This paper addresses the issue of performance benefits from the above four consistency models. Our results are based on simulation studies done for three applications. The results show that in an environment where processor reads are blocking and writes are buffered, a significant performance increase is achieved from allowing reads to bypass previous writes. Pipelining of writes, which determines the rate at which writes are retired from the write buffer, is of secondary importance. As a result, we show that the sequential consistency model performs poorly relative to all other models, while the processor consistency model provides most of the benefits of the weak and release consistency models.

Citation Context

...sistency models. Indeed, the prefetch mechanism provided in DASH is non-binding, and we will assume such a prefetch mechanism in this section. For a detailed evaluation of non-binding prefetching, see [14]. The results in this section show that the pipelining of writes allowed by RC (and WC) becomes significantly more important in achieving high performance once the latency of reads is reduced through ...
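
The key reordering being measured, a read bypassing a buffered write, is the classic "store buffer" litmus test. Under sequential consistency the outcome r0 = r1 = 0 is forbidden; under processor consistency and weaker models it is allowed, which is where the performance comes from. An illustrative C11 harness (relaxed atomics make the permitted reordering explicit; build with -pthread):

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int x, y;
    int r0, r1;

    void *t0(void *arg) {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);  /* buffered write */
        r0 = atomic_load_explicit(&y, memory_order_relaxed); /* read may bypass it */
        return NULL;
    }
    void *t1(void *arg) {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        pthread_join(a, NULL); pthread_join(b, NULL);
        /* r0 == 0 && r1 == 0 is possible here; with memory_order_seq_cst
         * (sequential consistency) at least one read would return 1. */
        printf("r0=%d r1=%d\n", r0, r1);
        return 0;
    }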

Two Techniques to Enhance the Performance of Memory Consistency Models

by Kourosh Gharachorloo, Anoop Gupta, John Hennessy - In Proceedings of the 1991 International Conference on Parallel Processing, 1991
"... The memory consistency model supported by a multiprocessor directly affects its performance. Thus, several attempts have been made to relax the consistency models to allow for more buffering and pipelining of memory accesses. Unfortunately, the potential increase in performance afforded by relaxing ..."
Abstract - Cited by 167 (6 self) - Add to MetaCart
The memory consistency model supported by a multiprocessor directly affects its performance. Thus, several attempts have been made to relax the consistency models to allow for more buffering and pipelining of memory accesses. Unfortunately, the potential increase in performance afforded by relaxing the consistency model is accompanied by a more complex programming model. This paper introduces two general implementation techniques that provide higher performance for all the models. The first technique involves prefetching values for accesses that are delayed due to consistency model constraints. The second technique employs speculative execution to allow the processor to proceed even though the consistency model requires the memory accesses to be delayed. When combined, the above techniques alleviate the limitations imposed by a consistency model on buffering and pipelining of memory accesses, thus significantly reducing the impact of the memory consistency model on performance.

Citation Context

... is allowed to be issued. Non-binding prefetching is possible if hardware cache-coherence is provided. Software-controlled non-binding prefetching has been studied by Porterfield [20], Mowry and Gupta [19], and Gharachorloo et al. [7]. Porterfield studied the technique in the context of uniprocessors while the work by Mowry and Gharachorloo studies the effect of prefetching for multiprocessors. In [7],...
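
The first technique can be sketched in a few lines: the consistency model delays the actual load until the synchronization completes, but a non-binding prefetch carries no ordering obligations and may be issued early, so the delayed access hits in the cache once it is finally permitted. In the paper this is done by the hardware for accesses waiting in the instruction window; the names and the use of __builtin_prefetch below are purely illustrative:

    #include <stdatomic.h>

    extern int shared_data;        /* hypothetical shared locations */
    extern atomic_int ready;

    int consume(void) {
        /* Non-binding prefetch: issued early, does not violate ordering,
         * only moves the line toward the cache. */
        __builtin_prefetch(&shared_data, 0, 3);
        /* The consistency constraint: the load below may not be performed
         * until this acquire completes. */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;
        return shared_data;        /* likely a cache hit by now */
    }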

Reducing Memory Latency via Non-blocking and Prefetching Caches

by Tien-Fu Chen, Jean-Loup Baer, 1992
"... Non-blocking caches and prefetching caches are two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploit ..."
Abstract - Cited by 164 (2 self) - Add to MetaCart
Non-blocking caches and prefetching caches are two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploiting post-miss operations. A prefetching cache generates prefetch requests to bring data in the cache before it is actually needed, thus allowing overlap with pre-miss computations. In this paper, we evaluate the effectiveness of these two hardware-based schemes. We propose a hybrid design based on the combination of these approaches. We also consider compiler-based optimizations to enhance the effectiveness of non-blocking caches. Results from instruction level simulations on the SPEC benchmarks show that the hardware prefetching caches generally outperform non-blocking caches. Also, the relative effectiveness of non-blocking caches is more adversely affected by an increase in memory latency...

Citation Context

...h one (or more) cache misses until an instruction that actually needs a value to be returned is reached. Such caches exploit the overlap of memory access time with post-miss computations. Prefetching [1, 8, 13, 17, 18] hardware and/or software techniques can eliminate the cache miss penalty by generating prefetch requests to the memory system to bring the data into the cache before the data value is actually used. ...
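
The hardware prefetching scheme evaluated here is, in essence, a reference prediction table (RPT): a small PC-indexed table that learns per-load strides and prefetches ahead once a stride repeats. A simplified software model of that structure (field names, table size, and the confidence threshold are assumptions, not the paper's exact design):

    #include <stdint.h>

    #define RPT_SIZE 256

    typedef struct {
        uint64_t pc, last_addr;
        int64_t  stride;
        int      confidence;      /* saturating 0..3 */
    } rpt_entry;

    static rpt_entry rpt[RPT_SIZE];

    /* Called on every load: returns an address worth prefetching, or 0. */
    uint64_t rpt_access(uint64_t pc, uint64_t addr) {
        rpt_entry *e = &rpt[(pc >> 2) % RPT_SIZE];
        if (e->pc != pc) {                     /* first sight: allocate */
            *e = (rpt_entry){ .pc = pc, .last_addr = addr };
            return 0;
        }
        int64_t stride = (int64_t)(addr - e->last_addr);
        if (stride == e->stride) {
            if (e->confidence < 3) e->confidence++;
        } else {                               /* stride broke: retrain */
            e->stride = stride;
            e->confidence = 0;
        }
        e->last_addr = addr;
        return e->confidence >= 2 ? addr + (uint64_t)stride : 0;
    }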

Automatic Compiler-Inserted I/O Prefetching for Out-of-Core Applications

by Todd C. Mowry, Angela K. Demke, Orran Krieger, 1996
"... Current operating systems offer poor performance when a numeric application's working set does not fit in main memory. As a result, programmers who wish to solve "out-of-core" problems efficiently are typically faced with the onerous task of rewriting an application to use explicit I/ ..."
Abstract - Cited by 162 (6 self) - Add to MetaCart
Current operating systems offer poor performance when a numeric application's working set does not fit in main memory. As a result, programmers who wish to solve "out-of-core" problems efficiently are typically faced with the onerous task of rewriting an application to use explicit I/O operations (e.g., read/write). In this paper, we propose and evaluate a fully-automatic technique which liberates the programmer from this task, provides high performance, and requires only minimal changes to current operating systems. In our scheme, the compiler provides the crucial information on future access patterns without burdening the programmer, the operating system supports non-binding prefetch and release hints for managing I/O, and the operating system cooperates with a run-time layer to accelerate performance by adapting to dynamic behavior and minimizing prefetch overhead. This approach maintains the abstraction of unlimited virtual memory for the programmer, gives the compiler the flexibility to aggressively move prefetches back ahead of references, and gives the operating system the flexibility to arbitrate between the competing resource demands of multiple applications. We have implemented our scheme using the SUIF compiler and the Hurricane operating system. Our experimental results demonstrate that our fully-automatic scheme effectively hides the I/O latency in out-of-core versions of the entire NAS Parallel benchmark suite, thus resulting in speedups of roughly twofold for five of the eight applications, with one application speeding up by over threefold.

Citation Context

... interfaces are unacceptable for our purposes. First, for the compiler to successfully move prefetches back far enough to hide the large latency of I/O, it is essential that prefetches be non-binding [19]. The non-binding property means that when a given reference is prefetched, the data value seen by that reference is bound at reference time; in contrast, with a binding prefetch, the value is bound a...
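
The paper's prefetch and release hints are operating-system extensions; a rough present-day analogue on POSIX systems is posix_madvise on a memory-mapped file, used below purely for illustration. The sketch streams over an out-of-core array, hinting one window ahead (prefetch) and releasing the window behind; the window size is an assumed parameter and error handling is omitted:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stddef.h>

    #define WINDOW ((size_t)4 << 20)   /* assumed 4 MB hint window */

    double sum_file(const char *path) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        size_t len = (size_t)st.st_size;
        char *base = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        const double *a = (const double *)base;
        double s = 0.0;
        for (size_t byte = 0; byte < len; byte += sizeof(double)) {
            if (byte % WINDOW == 0) {
                size_t ahead = byte + WINDOW;
                if (ahead < len) {     /* non-binding prefetch hint */
                    size_t span = (ahead + WINDOW <= len) ? WINDOW : len - ahead;
                    posix_madvise(base + ahead, span, POSIX_MADV_WILLNEED);
                }
                if (byte >= WINDOW)    /* release hint for the window behind */
                    posix_madvise(base + byte - WINDOW, WINDOW, POSIX_MADV_DONTNEED);
            }
            s += a[byte / sizeof(double)];
        }
        munmap(base, len);
        close(fd);
        return s;
    }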

Database architecture optimized for the new bottleneck: Memory access

by Peter Boncz, Stefan Manegold, Martin L. Kersten - In Proceedings of VLDB Conference, 1999
"... In the past decade, advances in speed of commodity CPUs have far out-paced advances in memory latency. Main-memory access is therefore increasingly a performance bottleneck for many computer applications, including database systems. In this article, we use a simple scan test to show the severe impac ..."
Abstract - Cited by 161 (15 self) - Add to MetaCart
In the past decade, advances in speed of commodity CPUs have far out-paced advances in memory latency. Main-memory access is therefore increasingly a performance bottleneck for many computer applications, including database systems. In this article, we use a simple scan test to show the severe impact of this bottleneck. The insights gained are translated into guidelines for database architecture, in terms of both data structures and algorithms. We discuss how vertically fragmented data structures optimize cache performance on sequential data access. We then focus on equi-join, typically a random-access operation, and introduce radix algorithms for partitioned hash-join. The performance of these algorithms is quantified using a detailed analytical model that incorporates memory access cost. Experiments that validate this model were performed on the Monet database system. We obtained exact statistics on events like TLB misses, and L1 and L2 cache misses, by using hardware performance counters found in modern CPUs. Using our cost model, we show how the carefully tuned memory access pattern of our radix algorithms makes them perform well, which is confirmed by experimental results.

Citation Context

...ty. Figure 1 shows that the speed of commercial microprocessors has been increasing roughly 70% every year, while the speed of commodity DRAM has improved by little more than 50% over the past decade [Mow94]. Part of the reason for this is that there is a direct tradeoff between capacity and speed in DRAM chips, and the highest priority has been for increasing capacity. The result is that from the perspec...
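
The radix algorithms split tuples on a few bits of the key (or its hash) per pass, so the number of active output locations stays within TLB and cache reach. A minimal one-pass sketch (the join itself, multi-pass recursion, and the hash function are omitted; names are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    #define RADIX_BITS 6                    /* 2^6 = 64 partitions per pass */
    #define FANOUT ((size_t)1 << RADIX_BITS)

    /* Scatter keys into FANOUT partitions on their low RADIX_BITS bits.
     * out must hold n keys; part_begin[p] receives partition p's offset.
     * Two passes, count then scatter: the standard counting-sort layout. */
    void radix_partition(const uint32_t *keys, size_t n,
                         uint32_t *out, size_t part_begin[FANOUT]) {
        size_t count[FANOUT] = {0}, cursor[FANOUT];
        for (size_t i = 0; i < n; i++)
            count[keys[i] & (FANOUT - 1)]++;
        for (size_t p = 0, pos = 0; p < FANOUT; pos += count[p], p++)
            part_begin[p] = cursor[p] = pos;
        for (size_t i = 0; i < n; i++)
            out[cursor[keys[i] & (FANOUT - 1)]++] = keys[i];
    }

Each recursive pass keeps the fan-out small enough that the partition cursors stay TLB-resident, which is exactly the memory-access behavior the paper's cost model quantifies.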
