Data Forwarding in Scalable Shared-Memory Multiprocessors
- In Proceedings of the 1995 International Conference on Supercomputing
, 1995
Cited by 23 (4 self)
Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its cache, it sends a copy of the datum to the caches of the processors that the compiler identified as consumers of it. As a result, when the consumer processors access the datum, they find it in their caches. This paper addresses two main issues. First, it presents a framework for a compiler algorithm for forwarding. Second, using address traces, it evaluates the performance impact of different levels of support for forwarding. Our simulations of a 32-processor machine show that a slightly optimistic support for forwarding speeds up five applications by, on average, 50% for large caches and 30% for small caches. For large caches, many sharing read misses can be eliminated, while for smaller caches, forwarding ...
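The forwarding idea in this abstract can be illustrated with a toy model. The sketch below is not the paper's simulator: the `Cache` class, the `run` function, the infinite cache size, and the absence of any coherence protocol are all simplifying assumptions made for illustration. It only shows that when the producer pushes a copy of each datum into the consumers' caches, the consumers' later reads hit instead of miss.

```python
from dataclasses import dataclass, field


@dataclass
class Cache:
    """Hypothetical infinite cache that only counts hits and misses."""
    lines: dict = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def read(self, addr, memory):
        if addr in self.lines:
            self.hits += 1
        else:
            # Miss: fetch the datum from (simulated) memory.
            self.misses += 1
            self.lines[addr] = memory[addr]
        return self.lines[addr]


def run(forwarding, n_items=8, consumers=(1, 2)):
    """Return the total number of consumer read misses.

    Processor 0 is the producer. The consumer set is given up front,
    standing in for the consumers a compiler would identify.
    """
    memory = {}
    caches = {p: Cache() for p in (0, *consumers)}

    # Producer phase: write each datum, optionally forwarding a copy.
    for addr in range(n_items):
        value = addr * addr
        memory[addr] = value
        caches[0].lines[addr] = value
        if forwarding:
            for c in consumers:
                caches[c].lines[addr] = value

    # Consumer phase: every consumer reads every datum once.
    for c in consumers:
        for addr in range(n_items):
            assert caches[c].read(addr, memory) == addr * addr

    return sum(caches[c].misses for c in consumers)


if __name__ == "__main__":
    print("misses without forwarding:", run(forwarding=False))
    print("misses with forwarding:   ", run(forwarding=True))
```

With finite caches, forwarded lines could be evicted before use or displace useful data, which this sketch deliberately ignores.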
Adaptive And Integrated Data Cache Prefetching For Shared-Memory Multiprocessors
, 1995
Cited by 18 (0 self)
... yield a better overall scheme. We give a detailed description of the compiler analysis necessary for integrated prefetching. The performance of integrated prefetching is compared to software and hardware prefetching, and we show the effect of adapting the scheduling of prefetches at compile time. Finally, we discuss approaches that combine integrated prefetching with the adaptive hardware prefetching technique.
CPU Cache Prefetching: Timing Evaluation of Hardware Implementations
- IEEE Transactions on Computers
, 1998
Cited by 17 (2 self)
Prefetching into CPU caches has long been known to be effective in reducing the cache miss ratio, but known implementations of prefetching have been unsuccessful in improving CPU performance. The reasons for this are that prefetches interfere with normal cache operations by making cache address and data ports busy, the memory bus busy, the memory banks busy, and by not necessarily being complete by the time that the prefetched data is actually referenced. In this paper, we present extensive quantitative results of a detailed cycle-by-cycle trace-driven simulation of a uniprocessor memory system in which we vary most of the relevant parameters in order to determine when and if hardware prefetching is useful. We find that, in order for prefetching to actually improve performance, the address array needs to be double ported and the data array needs to either be double ported or fully buffered. It is also very helpful for the bus to be very wide (e.g., 16 bytes), for bus transactions to be split, and for main memory to be interleaved. Under the best circumstances, i.e., with a significant investment in extra hardware, prefetching can significantly improve performance. For implementations without adequate hardware, prefetching often decreases performance.
The Illinois Aggressive COMA Multiprocessor Project (I-ACOMA)
- In Proc. of the 6th Symposium on the Frontiers of Massively Parallel Computing
, 1996
Cited by 12 (0 self)
While scalable shared-memory multiprocessors with hardware-assisted cache coherence are relatively easy to program, if truly high performance is desired, they still require substantial programmer effort. For example, data must be allocated close to the processors that will use them and the application must be tuned so that the working set fits in the caches. This is unfortunate because the most important obstacle to widespread use of parallel computing is the hardship of programming parallel machines. The goal of the I-ACOMA project is to explore how to design a highly programmable high-performance multiprocessor. We focus on a flat-COMA scalable multiprocessor supported by a parallelizing compiler. The main issues that we are studying are advanced processor organizations, techniques to handle long memory access latencies, and support for important classes of workloads like databases and scientific applications with loops that cannot be compiler-analyzed. The project also involves build...
Inferential Queueing and Speculative Push
, 2003
Cited by 3 (0 self)
Communication latencies within critical sections constitute a major bottleneck in some classes of emerging parallel workloads. In this paper, we argue for the use of two mechanisms to reduce these communication latencies: Inferentially Queued locks (IQLs) and Speculative Push (SP). With IQLs, the processor infers the existence, and limits, of a critical section from the use of synchronization instructions and joins a queue of lock requestors, reducing synchronization delay. The SP mechanism extracts information about program structure by observing IQLs. SP allows the cache controller, responding to a request for a cache line that likely includes a lock variable, to predict the data sets the requestor will modify within the associated critical section. The controller then pushes these lines from its own cache to the target cache, as well as writing them to memory. Overlapping the protected data transfer with that of the lock can substantially reduce the communication latencies within critical sections. By pushing data in exclusive state, the mechanism can collapse a read-modify-write sequence within a critical section into a single local cache access. The write-back to memory allows the receiving cache to ignore the push. Neither mechanism requires any programmer or compiler support nor any instruction set changes. Our experiments demonstrate that IQLs and SP can improve performance of applications employing frequent synchronization.
Compiler support for data forwarding in scalable shared-memory multiprocessors
- In International Conference on Parallel Processing
, 1999
Cited by 1 (0 self)
As the difference in speed between processor and memory system continues to increase, it is becoming crucial to develop and refine techniques that enhance the effectiveness of cache hierarchies. One promising technique in the context of scalable shared-memory multiprocessors is data forwarding. Forwarding hides the latency of communication-induced misses by having producer processors send data to the caches of potential consumer processors in advance. Forwarding can hide the latency effectively, has low instruction overhead, and uses few machine resources. This paper presents a complete implementation of a data forwarding pass in an industrial-strength parallelizing compiler. Complete Fortran applications are analyzed for dependences and, based on the analysis, automatically annotated with forwarding directives. We propose a forwarding framework that includes 4 new instructions: write-forward, write-broadcast, write-update, and write-through. New microarchitectural support is proposed. In our analysis, we assume that the assignment of loop iterations to processors is known. We perform simulations of multiprocessors with different cache, memory, machine sharing, and process migration parameters. We conclude that data forwarding delivers large speedups (six 32-processor applications ran an average of 40% faster), gets close to the upper bound in performance, and needs compiler support of only medium complexity.
1 Introduction
As increases in processor speed continue to outstrip increases in memory speed, it becomes crucial to develop and refine techniques that enhance the effectiveness of memory hierarchies. An important reason why memory hierarchies are sometimes not very effective in multiprocessors is the intrinsic interprocessor communication required by parallel applications. To cope with this communication, popularly known as data sharing, researchers have proposed techniques that overlap the communication latency with processor computation. Perhaps the most well-studied of these techniques is data prefetching. Work by many researchers has shown that prefetching is highly effective [5, 15, 19].
Integrating Fine-Grained Message Passing In Cache Coherent Shared Memory Multiprocessors
, 1996
Cited by 1 (0 self)
This paper considers the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency caused by interprocessor communication in cache coherent, shared memory multiprocessors. Data prefetching is accomplished by using a multiprocessor software pipelined algorithm. Data forwarding is used to target interprocessor data communication, rather than synchronization, and is applied to communication-related accesses between successive parallel loops. Prefetching and forwarding are each shown to be more effective for certain types of architectural and application characteristics. Given this result, a new hybrid prefetching and forwarding approach is proposed and evaluated that allows the relative amounts of prefetching and forwarding used to be adapted to these characteristics. When compared to prefetching or forwarding alone, the new hybrid scheme is shown to increase performance stability over varying application characteristics, to reduce processor ins...
Comparing Data Forwarding and Prefetching for Communication-Induced Misses in Shared-Memory MPs
- In Proceedings of the ICS
, 1998
As the difference in speed between processor and memory system continues to increase, it is becoming crucial to develop and refine techniques that enhance the effectiveness of cache hierarchies. Two such techniques are data prefetching and data forwarding. With prefetching, a processor hides the latency of cache misses by requesting the data before it actually needs it. With forwarding, a producer processor hides the latency of communication-induced cache misses in the consumer processors by sending the data to the caches of the latter. These two techniques are complementary approaches to hiding the latency of communication-induced misses. This paper compares the effectiveness of data forwarding and data prefetching to hide communication-induced misses. Although both techniques require comparable hardware support, forwarding usually has a lower instruction overhead. We evaluate prefetching and forwarding algorithms in a parallelizing compiler using execution-driven simulations of a sh...
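The contrast between the two techniques, consumer-initiated prefetching versus producer-initiated forwarding, can be made concrete with a small timing model. This is a rough sketch under stated assumptions, not the paper's evaluation method: the function `consumer_stall`, the single fixed `latency`, and the rule that a prefetch cannot usefully start before the producer's write are all hypothetical simplifications.

```python
def consumer_stall(t_write, t_use, latency, scheme, t_prefetch=None):
    """Cycles the consumer stalls when it uses one datum at time t_use.

    t_write    -- time the producer writes the datum (t_write <= t_use assumed)
    latency    -- assumed fixed transfer latency between the two caches
    scheme     -- "none", "prefetch", or "forward"
    t_prefetch -- time the consumer issues its prefetch ("prefetch" only)
    """
    if scheme == "none":
        # Demand miss at t_use: the full latency is exposed.
        return latency
    if scheme == "forward":
        # The producer sends the datum as soon as it is written.
        arrival = t_write + latency
    elif scheme == "prefetch":
        if t_prefetch is None:
            raise ValueError("prefetch scheme needs t_prefetch")
        # Assumption: a prefetch issued before the write would fetch stale
        # data, so the effective issue time is no earlier than t_write.
        arrival = max(t_prefetch, t_write) + latency
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return max(0, arrival - t_use)


if __name__ == "__main__":
    for scheme, kwargs in [("none", {}),
                           ("prefetch", {"t_prefetch": 80}),
                           ("forward", {})]:
        stall = consumer_stall(t_write=0, t_use=150, latency=100,
                               scheme=scheme, **kwargs)
        print(f"{scheme:8s} stall = {stall} cycles")
```

In this toy model both techniques hide latency only when the transfer starts early enough. The instruction overhead and hardware costs that the abstract mentions are not modeled.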
Permission to Make Digital Or Hard Copies of All Or Part of This Work for
- In Proceedings of UIST 2000
, 2000
A number of recent systems have provided rich facilities for manipulating the timelines of applications. Such timelines represent the history of an application's use in some session, and capture the effects of the user's interactions with that application. Applications can use timeline manipulation techniques prosaically as a way to provide undo and redo within an application context; more interestingly, they can use these same techniques to make an application's history directly manipulable in richer ways by users. This paper presents a number of extensions to current techniques for representing and managing application timelines. The first extension captures causal relationships in timelines via a nested transaction mechanism. This extension addresses a common problem in history-based applications, namely, how to represent application state as a set of atomic, incremental operations. The second extension presents a model for "multi-level" time, in which the histories of a set of inter-related artifacts can be represented by both "local" and "global" timelines. This extension allows the histories of related objects in an application to be manipulated independently from one another.
Memory Latency Reduction via Data Prefetching and Data Forwarding in Shared Memory Multiprocessors
, 1994
This dissertation considers the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency due to interprocessor communication in cache coherent, shared memory multiprocessors. The benefits of prefetching and forwarding are considered for large, numerical application codes with loop-level and vector parallelism. Data prefetching is applied to these applications using two different multiprocessor prefetching algorithms implemented within a parallelizing compiler. Data forwarding considers array references involved in communication-related accesses between successive parallel loops, rather than within a single loop nest. A hybrid prefetching and forwarding scheme and a compiler algorithm for data forwarding are also presented.

