Data Prefetch Mechanisms
, 2000
Cited by 79 (4 self)
The expanding gap between microprocessor and DRAM performance has necessitated the use of increasingly aggressive techniques designed to reduce or hide the latency of main memory access. Although large cache hierarchies have proven to be effective in reducing this latency for the most frequently used data, it is still not uncommon for many programs to spend more than half their run times stalled on memory requests. Data prefetching has been proposed as a technique for hiding the access latency of data referencing patterns that defeat caching strategies. Rather than waiting for a cache miss to initiate a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. To be effective, prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead. Secondary effects such as cache pollution and increased memory bandwidth requirements must also be taken into consideration. Despite these obstacles, prefetching has the potential to significantly improve overall program execution time by overlapping computation with memory accesses. Prefetching
Push vs. Pull: Data Movement for Linked Data Structures
- In International Conference on Supercomputing
, 2000
Cited by 37 (2 self)
As the performance gap between the CPU and main memory continues to grow, techniques to hide memory latency are essential to deliver a high-performance computer system. Prefetching can often overlap memory latency with computation for array-based numeric applications. However, prefetching for pointer-intensive applications still remains a challenging problem. Prefetching linked data structures (LDS) is difficult because the address sequence of LDS traversal does not present the same arithmetic regularity as array-based applications and the data dependence of pointer dereferences can serialize the address generation process. In this paper, we propose a cooperative hardware/software mechanism to reduce memory access latencies for linked data structures. Instead of relying on the past address history to predict future accesses, we identify the load instructions that traverse the LDS, and execute them ahead of the actual computation. To overcome the serial nature of the LDS address genera...
Automatic Compiler-Inserted Prefetching for Pointer-Based Applications
- IEEE Transactions on Computers
, 1999
Cited by 17 (3 self)
As the disparity between processor and memory speeds continues to grow, memory latency is becoming an increasingly important performance bottleneck. While software-controlled prefetching is an attractive technique for tolerating this latency, its success has been limited thus far to array-based numeric codes. In this paper, we expand the scope of automatic compiler-inserted prefetching to also include the recursive data structures commonly found in pointer-based applications. We propose three compiler-based prefetching schemes, and automate the most widely applicable scheme (greedy prefetching) in an optimizing research compiler. Our experimental results demonstrate that compiler-inserted prefetching can offer significant performance gains on both uniprocessors and large-scale shared-memory multiprocessors. Keywords: Caches, prefetching, pointer-based applications, recursive data structures, compiler optimization, shared-memory multiprocessors, performance evaluation. I. Introduction...
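As a rough illustration of prefetching for a recursive data structure, the sketch below issues prefetch hints for a node's children as soon as the node is visited, which is one plausible reading of the greedy scheme named in the abstract above. It is hand-written rather than compiler-generated, and the names `struct node`, `prefetch_hint`, and `tree_sum` are hypothetical; `__builtin_prefetch` is a GCC/Clang builtin, and the hint is simply skipped on other compilers.

```c
#include <stddef.h>

/* Hypothetical binary tree node used only for this illustration. */
struct node {
    int value;
    struct node *left;
    struct node *right;
};

/* Issue a non-binding read prefetch hint for p, if the compiler supports it. */
static void prefetch_hint(const void *p)
{
#if defined(__GNUC__)
    if (p != NULL)
        __builtin_prefetch(p, 0, 3); /* 0 = read access, 3 = keep in all cache levels */
#else
    (void)p; /* no portable equivalent: the hint is dropped */
#endif
}

/* Sum all values in the tree, requesting both children before descending. */
long tree_sum(const struct node *t)
{
    if (t == NULL)
        return 0;
    /* The child pointers are known as soon as t is loaded, so both hints
     * can be issued before the left subtree is traversed. */
    prefetch_hint(t->left);
    prefetch_hint(t->right);
    return t->value + tree_sum(t->left) + tree_sum(t->right);
}
```

The hints never change the result of `tree_sum`; they only ask the memory system to start loading the children early. Whether this helps depends on how much work separates the hint from the eventual dereference.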
Designing a Modern Memory Hierarchy with Hardware Prefetching
- IEEE Transactions on Computers
, 2001
Cited by 15 (0 self)
Abstract: In this paper, we address the severe performance gap caused by high processor clock rates and slow DRAM accesses. We show that, even with an aggressive, next-generation memory system using four Direct Rambus channels and an integrated one-megabyte level-two cache, a processor still spends over half its time stalling for L2 misses. Our experimental analysis begins with an effort to tune our baseline memory system aggressively: incorporating optimizations to reduce DRAM row buffer misses, reordering miss accesses to reduce queuing delay, and adjusting the L2 block size to match each channel organization. We show that there is a large gap between the block sizes at which performance is best and at which miss rate is minimized. Using those results, we evaluate a hardware prefetch unit integrated with the L2 cache and memory controllers. By issuing prefetches only when the Rambus channels are idle, prioritizing them to maximize DRAM row buffer hits, and giving them low replacement priority, we achieve a 65 percent speedup across 10 of the 26 SPEC2000 benchmarks, without degrading the performance of the others. With eight Rambus channels, these 10 benchmarks improve to within 10 percent of the performance of a perfect L2 cache. Index Terms: Prefetching, caches, memory bandwidth, spatial locality, memory system design, Rambus DRAM.
Architectural Adaptation for Application-Specific Locality Optimizations
- In Proceedings of the 1997 IEEE International Conference on Computer Design
, 1997
Cited by 15 (4 self)
We propose a machine architecture that integrates programmable logic into key components of the system with the goal of customizing architectural mechanisms and policies to match an application. This approach presents an improvement over the traditional approach of exploiting programmable logic as a separate co-processor by preserving machine usability through software, and over traditional computer architecture by providing application-specific hardware assists. We present two case studies of architectural customization to enhance latency tolerance and efficiently utilize network bisection on multiprocessors for sparse matrix computations. We demonstrate that using application-specific hardware assists and policies can provide substantial improvements in performance on a per-application basis. Based on these preliminary results, we propose that an application-driven machine customization provides a promising approach to achieve high performance and combat performance fragility. I. Introd...
Efficient Communication Using Message Prediction for Cluster of Multiprocessors
- Proceedings of the CANPC’00, Fourth Workshop on Communication, Architecture, and Applications for Network-based Parallel Computing, held in conjunction with HPCA-6
, 1999
Cited by 5 (1 self)
With the increasing uniprocessor and SMP computation power available today, interprocessor communication has become an important factor that limits the performance of clusters of workstations. Many factors, including communication hardware overhead, communication software overhead, and the user environment overhead (multithreading, multiuser), affect the performance of the communication subsystems in such systems. A significant portion of the software communication overhead is attributable to message copying. Ideally, it is desirable to have a true zero-copy protocol where the message is moved directly from the send buffer in its user space to the receive buffer in the destination without any intermediate buffering. However, because message-passing applications at the send side do not know the final receive buffer addresses, early-arriving messages have to be buffered in a temporary area. In this paper, we show that there is a message reception communication locality in...
A Survey of Data Prefetching Techniques
, 1996
Cited by 1 (0 self)
The expanding gap between microprocessor and DRAM performance has necessitated the use of increasingly aggressive techniques designed to reduce or hide the latency of main memory accesses. Although large cache hierarchies have proven to be effective in reducing this latency for the most frequently used data, it is still not uncommon for scientific programs to spend more than half their run times stalled on memory requests. Data prefetching has been proposed as a technique for hiding the access latency of data referencing patterns that defeat caching strategies. Rather than waiting for a cache miss to perform a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. To be effective, prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead. Secondary effects in the memory system must also be taken into consideration. Despite these obstacles, prefetching has the potential to significantly improve overall program execution time by overlapping computation with memory accesses. Prefetching strategies are diverse and no single strategy has yet been proposed which provides optimal performance. The following survey examines several alternative approaches and discusses the design tradeoffs involved when implementing a data prefetch strategy.
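The idea of issuing a fetch ahead of the actual memory reference can be sketched in software for a simple array traversal. This is a minimal example under stated assumptions, not a method taken from the survey: `sum_with_prefetch` and `PREFETCH_DISTANCE` are hypothetical names, the distance of 16 elements is an arbitrary placeholder that would need tuning per machine, and `__builtin_prefetch` is a GCC/Clang builtin that is compiled out on other compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead of the current
 * reference the prefetch is issued. Too small and the data arrives late;
 * too large and it may be evicted before use. */
#define PREFETCH_DISTANCE 16

/* Sum an array while hinting at data PREFETCH_DISTANCE elements ahead. */
double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* Non-binding hint: read access (0), high temporal locality (3). */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
#endif
        total += a[i]; /* the actual memory reference */
    }
    return total;
}
```

Issuing one hint per element is deliberately naive, since neighboring elements share a cache block. Unrolling the loop so that one hint covers a whole block would reduce the instruction overhead the abstract warns about.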

