Results 1 - 10 of 40
Selective, accurate, and timely self-invalidation using last-touch prediction
- In Proceedings of the 27th Annual International Symposium on Computer Architecture
, 2000
"... Communication in cache-coherent distributed shared memory (DSM) often requires invalidating (or writing back) cached copies of a memory block, incurring high overheads. This paper proposes Last-Touch Predictors (LTPs) that learn and predict the “last touch ” to a memory block by one processor before ..."
Abstract
-
Cited by 34 (5 self)
- Add to MetaCart
(Show Context)
Communication in cache-coherent distributed shared memory (DSM) often requires invalidating (or writing back) cached copies of a memory block, incurring high overheads. This paper proposes Last-Touch Predictors (LTPs) that learn and predict the “last touch” to a memory block by one processor before the block is accessed and subsequently invalidated by another. By predicting a last touch and (self-)invalidating the block in advance, an LTP hides the invalidation time, significantly reducing the coherence overhead. The key behind accurate last-touch prediction is trace-based correlation, associating a last touch with the sequence of instructions (i.e., a trace) touching the block from a coherence miss until the block is invalidated. Correlating instructions enables an LTP to identify a last touch to a memory block uniquely throughout an application’s execution. In this paper, we use results from running shared-memory applications on a simulated DSM to evaluate LTPs. The results indicate that: (1) our base case LTP design, maintaining trace signatures on a per-block basis, substantially improves prediction accuracy over previous self-invalidation schemes to an average of 79%; (2) our alternative LTP design, maintaining a global trace signature table, reduces storage overhead but only achieves an average accuracy of 58%; (3) last-touch prediction based on a single instruction only achieves an average accuracy of 41% due to instruction reuse within and across computation; and (4) LTP enables selective, accurate, and timely self-invalidation in DSM, speeding up program execution on average by 11%.
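The per-block trace-signature idea lends itself to a small sketch. The following Python fragment is a minimal, hypothetical rendering of the mechanism only; the class name, the hash function, and the table organization are illustrative choices, not the paper's hardware encoding.

class LastTouchPredictor:
    def __init__(self, sig_bits=16):
        self.sig_bits = sig_bits
        self.current_sig = {}       # block -> running trace signature since the last coherence miss
        self.last_touch_sigs = {}   # block -> signatures observed at the moment of invalidation

    def _update_sig(self, sig, pc):
        # Fold the instruction address (PC) into the signature (simple hash; real encodings differ).
        return ((sig * 31) ^ pc) & ((1 << self.sig_bits) - 1)

    def access(self, block, pc):
        """Called on every load/store; returns True if this access is predicted
        to be the processor's last touch before the block is invalidated."""
        sig = self._update_sig(self.current_sig.get(block, 0), pc)
        self.current_sig[block] = sig
        return sig in self.last_touch_sigs.get(block, set())

    def invalidate(self, block):
        """Called when another processor's write invalidates the block: the
        signature accumulated so far identifies the actual last touch."""
        sig = self.current_sig.pop(block, None)
        if sig is not None:
            self.last_touch_sigs.setdefault(block, set()).add(sig)

# Toy usage: a repeated producer/consumer pattern trains the predictor.
ltp = LastTouchPredictor()
for _ in range(3):
    predicted = [ltp.access(block=0x40, pc=pc) for pc in (0x100, 0x104, 0x108)]
    ltp.invalidate(block=0x40)
    print(predicted)   # after one training pass, the access at PC 0x108 is predicted as the last touch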
Fine-Grain Distributed Shared Memory on Clusters of Workstations
, 1997
"... Shared memory, one of the most popular models for programming parallel platforms, is becoming ubiquitous both in low-end workstations and high-end servers. With the advent of low-latency networking hardware, clusters of workstations strive to offer the same processing power as high-end servers for a ..."
Abstract
-
Cited by 30 (10 self)
- Add to MetaCart
Shared memory, one of the most popular models for programming parallel platforms, is becoming ubiquitous both in low-end workstations and high-end servers. With the advent of low-latency networking hardware, clusters of workstations strive to offer the same processing power as high-end servers for a fraction of the cost. In such environments, shared memory has been limited to page-based systems that control access to shared memory using the memory's page protection to implement shared memory coherence protocols. Unfortunately, false sharing and fragmentation problems force such systems to resort to weak consistency shared memory models that complicate the shared memory programming model.
Cache-coherent distributed shared memory: perspectives on its development and future challenges
, 1998
"... ..."
HIPIQS: A High-Performance Switch Architecture using Input Queuing
- In Proceedings of the 12th International Parallel Processing Symposium
, 1998
"... Switch-based interconnects are used in a number of application domains including parallel system interconnects, local area networks, and wide area networks. However, very few switches have been designed that are suitable for more than one of these application domains. Such a switch must offer both e ..."
Abstract
-
Cited by 20 (3 self)
- Add to MetaCart
Switch-based interconnects are used in a number of application domains including parallel system interconnects, local area networks, and wide area networks. However, very few switches have been designed that are suitable for more than one of these application domains. Such a switch must offer both extremely low latency and very high throughput for a variety of different message sizes. While some architectures with output queuing have been shown to perform extremely well in terms of throughput, their performance can suffer when used in systems where a significant portion of the packets are extremely small. On the other hand, architectures with input queuing offer limited throughput, or require fairly complex and centralized arbitration that increases latency. In this paper we present a new input queue-based switch architecture called HIPIQS (HIgh-Performance Input-Queued Switch). It offers low latency for a range of message sizes, and provides throughput comparable to that of output queuing.
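The limited throughput of naive input queuing mentioned above comes from head-of-line blocking: a cell stuck behind one destined for a busy output blocks cells that could otherwise be delivered. The toy simulation below (not the HIPIQS architecture; ports, queue depths, and the random traffic pattern are arbitrary) makes the effect concrete under assumed uniform random destinations.

import random
from collections import deque

def simulate(ports=8, cells_per_port=2000, seed=0):
    rng = random.Random(seed)
    # Each input port holds a single FIFO of cells; a cell is represented by its output port.
    queues = [deque(rng.randrange(ports) for _ in range(cells_per_port))
              for _ in range(ports)]
    delivered, cycles, total = 0, 0, ports * cells_per_port
    while delivered < total:
        cycles += 1
        claimed = set()
        for q in queues:
            if q and q[0] not in claimed:   # only the head-of-line cell competes for an output
                claimed.add(q.popleft())
                delivered += 1
    return delivered / (cycles * ports)     # fraction of output slots actually used

print(f"single-FIFO input queuing utilization: {simulate():.2f}")   # roughly 0.6, well below 1.0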
The Effectiveness of SRAM Network Caches in Clustered DSMs
, 1998
"... The frequency of accesses to remote data is a key factor affecting the performance of all Distributed Shared Memory (DSM) systems. Remote data caching is one of the most effective and general techniques to fight processor stalls due to remote capacity misses in the processor caches. The design space ..."
Abstract
-
Cited by 19 (0 self)
- Add to MetaCart
(Show Context)
The frequency of accesses to remote data is a key factor affecting the performance of all Distributed Shared Memory (DSM) systems. Remote data caching is one of the most effective and general techniques to fight processor stalls due to remote capacity misses in the processor caches. The design space of remote data caches (RDC) has many dimensions and one essential performance trade-off: hit ratio versus speed. Some recent commercial systems have opted for large and slow (S)DRAM network caches (NC), but others completely avoid them because of their damaging effects on the remote/local latency ratio. In this paper we explore small and fast SRAM network caches as a means to reduce the remote stalls and capacity traffic of multiprocessor clusters. The major appeal of SRAM NCs is that they add less penalty to the latency of NC hits and remote accesses. Their small capacity can handle conflict misses and a limited number of capacity misses. However, they can be coupled with main memory...
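The hit-ratio-versus-speed trade-off described above can be written as a simple expected-latency formula. The sketch below uses invented latencies and hit ratios purely for illustration; they are not measurements from the paper, and which design wins depends entirely on these parameters.

def avg_remote_latency(hit_ratio, nc_hit_latency, remote_latency):
    # A network-cache hit costs nc_hit_latency; a miss still pays the full remote
    # access latency on top of the wasted NC lookup (folded in here).
    return hit_ratio * nc_hit_latency + (1 - hit_ratio) * (nc_hit_latency + remote_latency)

# Large, slow DRAM NC: higher hit ratio but a slow lookup on every access (made-up numbers).
print(avg_remote_latency(hit_ratio=0.75, nc_hit_latency=120, remote_latency=600))   # 270.0
# Small, fast SRAM NC: lower hit ratio but a cheap lookup (made-up numbers).
print(avg_remote_latency(hit_ratio=0.50, nc_hit_latency=30, remote_latency=600))    # 330.0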
The Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols
, 1998
"... that I have read this dissertation and that in my opinion it is ..."
Abstract
-
Cited by 17 (6 self)
- Add to MetaCart
(Show Context)
that I have read this dissertation and that in my opinion it is
Design and Performance of Directory Caches for Scalable Shared Memory Multiprocessors
- Proc. Fifth Int’l Symp. High Performance Computer Architecture
, 1999
"... Recent research shows that the occupancy of the coherence controllers is a major performance bottleneck for distributed cache coherent shared memory multiprocessors. A significant part of the occupancy is due to the latency of accessing the directory, which is usually kept in DRAM memory. Most coher ..."
Abstract
-
Cited by 16 (4 self)
- Add to MetaCart
(Show Context)
Recent research shows that the occupancy of the coherence controllers is a major performance bottleneck for distributed cache coherent shared memory multiprocessors. A significant part of the occupancy is due to the latency of accessing the directory, which is usually kept in DRAM memory. Most coherence controller designs that use protocol processors for executing the coherence protocol handlers use the data cache of the protocol processor for caching directory entries along with protocol handler data. Analogously, a fast Directory Cache (DC) can also be used by the hardwired coherence controller designs in order to minimize directory access time. However, the existing hardwired controllers do not use a directory cache. Moreover, the performance impact of caching directory entries has not been studied in the literature before. This paper studies the performance of directory caches using parallel applications from the SPLASH-2 suite. We demonstrate that using a directory cache can result in 40% or more improvement in the execution time of applications that are communication intensive. We also investigate in detail the various directory cache design parameters: cache size, cache line size, and associativity. Our experimental results show that the directory cache size requirements grow sub-linearly with the increase in the application’s data set size. The results also show the performance advantage of multi-entry directory cache lines, as a result of spatial locality and the absence of sharing of directories. The impact of the associativity of the directory caches on performance is less than that of the size and the line size. Also, we find a clear linear relation between the directory cache miss ratio and the coherence controller occupancy, and between both measures and the execution time of the applications, which can help system architects evaluate the impact of directory cache (or coherence controller) designs on overall system performance.
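A minimal sketch can illustrate why multi-entry directory cache lines pay off: entries for neighbouring memory blocks land in the same cached line, so spatial locality turns one DRAM directory fetch into several subsequent hits. The direct-mapped organization, names, and parameters below are hypothetical, not the paper's controller design.

class DirectoryCache:
    def __init__(self, num_lines=256, entries_per_line=4):
        self.num_lines = num_lines
        self.entries_per_line = entries_per_line
        self.tags = [None] * num_lines   # tag (directory line address) per cache line
        self.data = [None] * num_lines   # the cached directory entries themselves
        self.hits = self.misses = 0

    def lookup(self, block_addr, backing_directory):
        line_addr = block_addr // self.entries_per_line
        index = line_addr % self.num_lines
        if self.tags[index] == line_addr:
            self.hits += 1
        else:
            self.misses += 1             # fetch the whole multi-entry line from DRAM
            base = line_addr * self.entries_per_line
            self.data[index] = [backing_directory.get(base + i, set())
                                for i in range(self.entries_per_line)]
            self.tags[index] = line_addr
        return self.data[index][block_addr % self.entries_per_line]

# Toy usage: a sweep over 16 consecutive blocks touches only 4 directory lines.
directory = {b: {"node0"} for b in range(16)}   # block -> set of sharers
dc = DirectoryCache(num_lines=8, entries_per_line=4)
for b in range(16):
    dc.lookup(b, directory)
print(dc.hits, dc.misses)   # 12 hits, 4 misses: one miss per 4-entry line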
Formal Verification and its Impact on the Snooping versus Directory Protocol Debate
, 2005
"... This invited paper argues that to facilitate formal verification, multiprocessor systems should (1) decouple enforcing coherence from enforcing a memory consistency model and (2) decouple the interconnection network from the cache coherence protocol (by not relying on any specific interconnect order ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
This invited paper argues that to facilitate formal verification, multiprocessor systems should (1) decouple enforcing coherence from enforcing a memory consistency model and (2) decouple the interconnection network from the cache coherence protocol (by not relying on any specific interconnect ordering or synchronicity properties). Of the two dominant classes of cache coherence protocols—directory protocols and snooping protocols—these two desirable properties favor use of directory protocols over snooping protocols. Although the conceptual simplicity of snooping protocols is seductive, aggressive implementations of snooping protocols lack these decoupling properties, making them perhaps more difficult in practice to reason about, verify, and implement correctly. Conversely, directory protocols may seem more complicated, but they are more amenable to these decoupling properties, which simplify protocol design and verification. Finally, this paper describes the recently-proposed token coherence protocol’s adherence to these properties and discusses some of its implications for future multiprocessor systems.
Symbolic Simulation with Approximate Values
- In Third International Conference on Formal Methods in Computer-Aided Design
, 2000
"... . Symbolic methods such as model checking using binary decision diagrams (BDDs) have had limited success in verifying large designs because BDD sizes regularly exceed memory capacity. Symbolic simulation is a method that controls BDD size by allowing the user to specify the number of symbolic va ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
(Show Context)
Symbolic methods such as model checking using binary decision diagrams (BDDs) have had limited success in verifying large designs because BDD sizes regularly exceed memory capacity. Symbolic simulation is a method that controls BDD size by allowing the user to specify the number of symbolic variables in a test. However, BDDs still may blow up when using symbolic simulation in large designs with a large number of symbolic variables. This paper describes techniques for limiting the size of the internal representation of values in symbolic simulation no matter how many symbolic variables are present. The basic idea is to use approximate values on internal nodes; an approximate value is one that consists of combinations of the values 0, 1, and X. If an internal node is known not to affect the functionality being tested, then the simulator can output a value of X for this node, reducing the amount of time and memory required to represent the value of this node. Our algorithm ...
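The three-valued (0, 1, X) evaluation underlying this idea is easy to sketch: a controlling input decides a gate's output even when the other input is unknown, and a node that stays X never needs a large symbolic representation. The fragment below is an illustration of that evaluation rule only, not the paper's simulator.

X = "X"   # unknown / don't-care value

def and3(a, b):
    if a == 0 or b == 0:
        return 0           # a controlling 0 decides the output even if the other input is X
    if a == 1 and b == 1:
        return 1
    return X

def or3(a, b):
    if a == 1 or b == 1:
        return 1           # a controlling 1 decides the output
    if a == 0 and b == 0:
        return 0
    return X

print(and3(0, X))   # 0 -- output known despite the unknown input
print(or3(X, X))    # X -- approximate, but cheap to represent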
Reliable Verification Using Symbolic Simulation with Scalar Values
- In Proc. DAC
, 2000
"... This paper presents an algorithm for hardware verification that uses simulation and satisfiability checking techniques to determine the correctness of a symbolic test case on a circuit. The goal is to have coverage greater than that of random testing, but with the ease of use and predictability of d ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
This paper presents an algorithm for hardware verification that uses simulation and satisfiability checking techniques to determine the correctness of a symbolic test case on a circuit. The goal is to have coverage greater than that of random testing, but with the ease of use and predictability of directed testing. The user uses symbolic variables in simple directed tests to increase the input space that is explored. The algorithm, which is called quasi-symbolic simulation, simulates these tests using only scalar (0, 1, X) values internally, causing potentially conservative values to be generated at the outputs. Divide and conquer of the symbolic input space is used to resolve this conservativeness. In the best case, this method is as efficient as symbolic simulation using BDDs and, in the worst case, gives coverage and predictability at least as good as directed testing.
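The divide-and-conquer step can be sketched as follows: simulate with X in place of the symbolic inputs; if the checked output comes back X (a conservative result), split one symbolic variable into its 0 and 1 cases and recurse. The circuit, names, and recursion strategy below are purely illustrative assumptions, not the algorithm as published.

X = "X"

def and3(a, b):
    return 0 if 0 in (a, b) else (1 if (a, b) == (1, 1) else X)

def circuit(inputs):
    # Toy design under test: output is the AND of two inputs.
    return and3(inputs["a"], inputs["b"])

def verify(inputs, expected, symbolic_vars):
    result = circuit(inputs)
    if result != X:
        return result == expected          # scalar result: decide directly
    if not symbolic_vars:
        return False                       # conservative X with nothing left to split on
    var, rest = symbolic_vars[0], symbolic_vars[1:]
    # Case split: the test must pass for both assignments of the chosen variable.
    return all(verify({**inputs, var: v}, expected, rest) for v in (0, 1))

# 'a' is a scalar 0 and 'b' is symbolic: no split is needed, since 0 AND b is 0.
print(verify({"a": 0, "b": X}, expected=0, symbolic_vars=["b"]))        # True
# Both inputs symbolic: the X result forces case splits on 'a' and then 'b'.
print(verify({"a": X, "b": X}, expected=1, symbolic_vars=["a", "b"]))   # False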