Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in . . .
- In Proceedings of the 30th Annual International Symposium on Computer Architecture
, 2003
Cited by 48 (7 self)
Destination-set prediction can improve the latency/bandwidth tradeoff in shared-memory multiprocessors. The destination set is the collection of processors that receive a particular coherence request. Snooping protocols send requests to the maximal destination set (i.e., all processors), reducing latency for cache-to-cache misses at the expense of increased traffic. Directory protocols send requests to the minimal destination set, reducing bandwidth at the expense of an indirection through the directory for cache-to-cache misses. Recently proposed hybrid protocols trade off latency and bandwidth by directly sending requests to a predicted destination set.
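The three kinds of destination set described in this abstract can be illustrated with a minimal sketch. It is not the paper's predictor: the function names, the rule used for the minimal set (the owner for a read, the owner plus all sharers for a write), and the sufficiency check are assumptions made only for illustration.

```python
from typing import Set


def snooping_destinations(all_procs: Set[int]) -> Set[int]:
    """Snooping: the maximal destination set, i.e. every processor."""
    return set(all_procs)


def minimal_destinations(owner: int, sharers: Set[int], is_write: bool) -> Set[int]:
    """Assumed minimal set: the owner for a read, owner plus sharers for a write."""
    dests = {owner}
    if is_write:
        dests |= sharers
    return dests


def prediction_suffices(predicted: Set[int], needed: Set[int]) -> bool:
    """A predicted set avoids the directory indirection only if it covers
    every processor that must see the request (assumed criterion)."""
    return needed <= predicted


procs = {0, 1, 2, 3}
needed = minimal_destinations(owner=2, sharers={1, 3}, is_write=False)
# Request message counts: broadcast, a predicted set {2, 3}, and the minimal set.
costs = (len(snooping_destinations(procs)), len({2, 3}), len(needed))
```

In this toy setting the predicted set {2, 3} sends fewer requests than a broadcast and still reaches the owner directly, which is the tradeoff the abstract describes.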
Coherence Decoupling: Making Use of Incoherence
, 2004
Cited by 37 (2 self)
This paper explores a new technique called coherence decoupling, which breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce, if not eliminate, the effects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. SCL protocols reduce those latencies by speculatively writing updates into invalid lines, thereby increasing the accuracy of speculation, without complicating the simple, underlying coherence protocol that guarantees correctness.
Data Prefetching And Data Forwarding In Shared Memory Multiprocessors
- In Proceedings of the 1994 International Conference on Parallel Processing, volume II
, 1994
Cited by 15 (3 self)
This paper studies and compares the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency due to interprocessor communication in cache coherent, shared memory multiprocessors. Two multiprocessor prefetching algorithms are presented and compared. A simple blocked vector prefetching algorithm, considerably less complex than existing software pipelined prefetching algorithms, is shown to be effective in reducing memory latency and increasing performance. A Forwarding Write operation is used to evaluate the effectiveness of forwarding. The use of data forwarding results in significant performance improvements over data prefetching for codes exhibiting less spatial locality. Algorithms for data prefetching and data forwarding are implemented in a parallelizing compiler. Evaluation of the proposed schemes and algorithms is accomplished via execution-driven simulation of large, optimized, parallel numerical application codes with loop-level and vector parallelism. More data, discussion, and experiment details can be found in [1].
Hardware And Compiler Support For Cache Coherence In Large-Scale Shared-Memory Multiprocessors
, 1996
Cited by 8 (4 self)
... compiler can detect potentially stale references and what kind of performance can be obtained using a real compiler. Also, most of the compiler-directed coherence schemes proposed to date have not addressed the real cost of the required hardware support. For example, many of the schemes require expensive hardware support and assume a cache organization with single-word cache lines and a word-addressable architecture. Also, the issues of synchronization, such as lock variables and critical sections, have rarely been addressed. This dissertation addresses these hardware and compiler implementation issues and investigates the feasibility and performance of the compiler-directed cache coherence approach. We propose a new compiler-directed scheme that can be implemented on a large-scale multiprocessor using off-the-shelf microprocessors. The scheme can be adapted to various cache organizations, including multi-word cache lines and byte-addressable architectures. Several system related is
User-level VSM Optimisation and its Application
- Proc. 2nd Int. Workshop on Applied Parallel Computing (Lecture Notes in Computer Science 1041)
, 1995
Cited by 3 (2 self)
This paper describes user-level optimisations for virtual shared memory (VSM) systems and demonstrates performance improvements for three scientific kernel codes written in Fortran-S and running on a 30 node prototype distributed memory architecture. These optimisations can be applied to all consistency models and directory schemes, whether in hardware or software, which employ an invalidation based protocol. The semantics of these optimisations are carefully stated. Currently these optimisations are performed by the programmer, but there is much scope for automating this process within a compiler.
1 Introduction
Virtual shared memory (VSM) systems provide the illusion of a shared address space on distributed memory architectures. A shared memory programming model is attractive because it is simple to program, thus speeding the implementation and porting of parallel programs, and enabling the parallelisation of complex adaptive programs which may be difficult to implement in message...
Compiler analysis of cache coherence: Interprocedural array data-flow analysis and its impact on cache performance
- In IEEE Transactions on Parallel and Distributed Systems
, 2000
Cited by 2 (1 self)
Abstract: In this paper, we present compiler algorithms for detecting references to stale data in shared-memory multiprocessors. The algorithm consists of two key analysis techniques, stale reference detection and locality preserving analysis. While the stale reference detection finds the memory reference patterns that may violate cache coherence, the locality preserving analysis minimizes the number of such stale references by analyzing both temporal and spatial reuses. By computing the regions referenced by arrays inside loops, we extend the previous scalar algorithms [9] for more precise analysis. We develop a full interprocedural array data-flow algorithm, which performs both bottom-up side-effect analysis and top-down context analysis on the procedure call graph to further exploit locality across procedure boundaries. The interprocedural algorithm eliminates cache invalidations at procedure boundaries, which were assumed in the previous compiler algorithms [9]. We have fully implemented the algorithm in the Polaris parallelizing compiler [28]. Using execution-driven simulations on Perfect Club benchmarks, we demonstrate how unnecessary cache misses can be eliminated by the automatic stale reference detection. The algorithm can be used to implement cache coherence in the shared-memory multiprocessors that do not have hardware directories, such as Cray T3D [21]. Index Terms: Compiler, interprocedural analysis, data-flow analysis, cache coherence, shared-memory multiprocessors.
Integrating Fine-Grained Message Passing In Cache Coherent Shared Memory Multiprocessors
, 1996
Cited by 1 (0 self)
This paper considers the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency caused by interprocessor communication in cache coherent, shared memory multiprocessors. Data prefetching is accomplished by using a multiprocessor software pipelined algorithm. Data forwarding is used to target interprocessor data communication, rather than synchronization, and is applied to communication-related accesses between successive parallel loops. Prefetching and forwarding are each shown to be more effective for certain types of architectural and application characteristics. Given this result, a new hybrid prefetching and forwarding approach is proposed and evaluated that allows the relative amounts of prefetching and forwarding used to be adapted to these characteristics. When compared to prefetching or forwarding alone, the new hybrid scheme is shown to increase performance stability over varying application characteristics, to reduce processor ins...
Spinning-on-Coherency: A New VSM Optimisation for Write-invalidate
- In Proceedings of High-Performance Computing and Networking Europe
, 1996
Cited by 1 (1 self)
This paper introduces spinning-on-coherency (SOC), a technique for virtual shared memory (VSM) which enables latency-hiding of remote reads and the removal of related synchronisation points. Coherence-bits are hardware tags associated with addresses which record local access permissions (such as read, write, invalid). In SOC, a user-thread spins on the particular coherence-bits associated with an address until the new data value is asynchronously propagated and the address becomes valid. Data-propagation occurs when another node issues an update after having written the new value. Performance improvements are demonstrated for two codes, representing the core communication found in Shallow (a well known numerical weather prediction benchmark), and CG (from the NAS Parallel Benchmarks). These are run on a 30 node prototype distributed memory architecture (EDS), with invalidation based sequentially consistent VSM. SOC is also applicable to other consistency models and directory schemes, ...
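The spinning behaviour described in this abstract can be mimicked in software with a rough sketch. Everything here is an assumption for illustration: the class `CoherenceTaggedWord`, the two permission states, and the use of Python threads in place of nodes and hardware coherence-bits. It does not reproduce the paper's VSM system.

```python
import threading
import time

# Hypothetical permission states standing in for hardware coherence-bits.
INVALID, READ = "invalid", "read"


class CoherenceTaggedWord:
    """One shared word plus the local permission tag recorded for it."""

    def __init__(self):
        self.value = None
        self.state = INVALID


def spin_read(word):
    """Consumer: spin on the permission tag until the address becomes valid."""
    while word.state == INVALID:
        time.sleep(0)  # yield so the producer thread can run
    return word.value


def propagate_update(word, new_value):
    """Producer: write the new value first, then make the address valid."""
    word.value = new_value
    word.state = READ


def demo():
    word = CoherenceTaggedWord()
    producer = threading.Thread(target=propagate_update, args=(word, 42))
    producer.start()
    result = spin_read(word)  # no separate synchronisation point is used
    producer.join()
    return result
```

The consumer needs no lock or barrier: the permission tag doubles as the signal that the data has arrived, which is the synchronisation removal the abstract refers to.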
Models for Performance Prediction of Cache Coherence Protocols
Key words: Cache coherence, distributed shared memory, memory access behavior, analytical performance prediction, performance evaluation, dynamic hybrid protocols. In a modern shared memory multiprocessor, it is possible to support more than one protocol for maintaining cache coherence. Possible candidates might be based on the Write-Back/Invalidate, Write-Through/Invalidate, and Write-Update protocols. Hybrid protocols allow the use of different protocols for different data blocks, and dynamic hybrid protocols additionally allow for changes in the choice of protocol during the execution of an application. In this paper, we introduce a set of analytical models for predicting the performance of parallel applications under various cache coherence protocol assumptions. These models can be used at compile time to determine which protocols are to be used for which data blocks, and also to determine when to change protocols in the case of dynamic protocols. Although we focus on tightly-coupl...
Assessment of Cache Coherence Protocols in Shared-memory Multiprocessors
, 2003
Alexander Grbic, Doctor of Philosophy, Graduate Department of Electrical and Computer Engineering, University of Toronto, 2003.
The cache coherence protocol plays an important role in the performance of a distributed shared-memory (DSM) multiprocessor. A variety of cache coherence protocols exist and differ mainly in the scope of the sites that are updated by a write operation. These protocols can be complex and their impact on the performance of a multiprocessor system is often difficult to assess. To obtain good performance, both architects and users must understand processor communication, data locality, the properties of the interconnection network, and the nature of the coherence protocols. Analyzing the processor data sharing behavior and determining its effect on cache coherence communication traffic is the first step to a better understanding of overall performance. Toward this goal, this dissertation provides a framework for evaluating the coherence communication traffic of different protocols and considers using more than one protocol in a DSM multiprocessor.

