Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors
- ACM Transactions on Computer Systems, 1991
"... Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend to produce large amounts of memory and interconnect contention, introducing performance bottlenecks that become marke ..."
Abstract
-
Cited by 573 (32 self)
- Add to MetaCart
(Show Context)
Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend to produce large amounts of memory and interconnect contention, introducing performance bottlenecks that become markedly more pronounced as applications scale. We argue that this problem is not fundamental, and that one can in fact construct busy-wait synchronization algorithms that induce no memory or interconnect contention. The key to these algorithms is for every processor to spin on separate locally-accessible flag variables, and for some other processor to terminate the spin with a single remote write operation at an appropriate time. Flag variables may be locally-accessible as a result of coherent caching, or by virtue of allocation in the local portion of physically distributed shared memory. We present a new scalable algorithm for spin locks that generates O(1) remote references per lock acquisition, independent of the number of processors attempting to acquire the lock. Our algorithm provides reasonable latency in the absence of contention, requires only a constant amount of space per lock, and requires no hardware support other than ...
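The local-spin queue lock this abstract describes is widely known as the MCS lock. Below is a minimal sketch of the idea in C11 atomics; the type and function names are ours, and memory-ordering details are simplified relative to a production implementation.

```c
/* Sketch of the local-spin queue lock idea (the "MCS lock") in C11.
 * Illustrative only: orderings are left at seq_cst for simplicity. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;            /* each thread spins on its own flag */
} mcs_node;

typedef _Atomic(mcs_node *) mcs_lock;

void mcs_acquire(mcs_lock *lock, mcs_node *me) {
    atomic_store(&me->next, (mcs_node *)NULL);
    atomic_store(&me->locked, true);
    /* Single remote operation: swap ourselves in as the new tail. */
    mcs_node *prev = atomic_exchange(lock, me);
    if (prev != NULL) {
        atomic_store(&prev->next, me);      /* link behind predecessor */
        while (atomic_load(&me->locked))    /* spin on a LOCAL flag    */
            ;
    }
}

void mcs_release(mcs_lock *lock, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* No known successor: try to swing the tail back to empty. */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(lock, &expected, NULL))
            return;
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                               /* successor is mid-enqueue */
    }
    /* One remote write terminates the successor's spin. */
    atomic_store(&succ->locked, false);
}
```

Note how the structure delivers the abstract's O(1) bound: acquisition performs one remote swap, and release performs at most one remote CAS plus one remote write, regardless of how many processors are queued.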
Weak Ordering -- A New Definition
1990
"... A memory model for a shared memory, multiprocessor commonly and often implicitly assumed by programmers is that of sequential consistency. This model guarantees that all memory accesses will appear to execute atomically and in program order. An alternative model, weak ordering, offers greater perfor ..."
Abstract
-
Cited by 260 (14 self)
- Add to MetaCart
(Show Context)
A memory model for a shared-memory multiprocessor commonly and often implicitly assumed by programmers is that of sequential consistency. This model guarantees that all memory accesses will appear to execute atomically and in program order. An alternative model, weak ordering, offers greater performance potential. Weak ordering was first defined by Dubois, Scheurich and Briggs in terms of a set of rules for hardware that have to be made visible to software. The central hypothesis of this work is that programmers prefer to reason about sequentially consistent memory, rather than having to think about weaker memory, or even write buffers. Following this hypothesis, we re-define weak ordering as a contract between software and hardware. By this contract, software agrees to some formally specified constraints, and hardware agrees to appear sequentially consistent to at least the software that obeys those constraints. We illustrate the power of the new definition with a set of software constraints that forbid data races and an implementation for cache-coherent systems that is not allowed by the old definition.
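A small illustration of the software side of such a contract, written by us in C11: the conflicting accesses to `data` are ordered through a labeled synchronization variable, making the program data-race-free, so hardware honoring the contract may present sequentially consistent behavior to it.

```c
/* Data-race-free sketch: all conflicting accesses to `data` are
 * ordered through the labeled synchronization variable `ready`. */
#include <stdatomic.h>

int data;              /* ordinary shared data */
atomic_int ready;      /* labeled synchronization variable */

void producer(void) {
    data = 42;                        /* ordinary write */
    atomic_store(&ready, 1);          /* synchronization write */
}

void consumer(void) {
    while (atomic_load(&ready) == 0)  /* synchronization read */
        ;
    int v = data;                     /* guaranteed to observe 42 */
    (void)v;
}
```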
Transactional Lock-Free Execution of Lock-Based Programs
- In Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2002
"... This paper is motivated by the difficulty in writing correct high-performance programs. Writing shared-memory multithreaded programs imposes a complex trade-off between programming ease and performance, largely due to subtleties in coordinating access to shared data. To ensure correctness programmer ..."
Abstract
-
Cited by 201 (9 self)
- Add to MetaCart
(Show Context)
This paper is motivated by the difficulty of writing correct high-performance programs. Writing shared-memory multithreaded programs imposes a complex trade-off between programming ease and performance, largely due to subtleties in coordinating access to shared data. To ensure correctness, programmers often rely on conservative locking at the expense of performance. The resulting serialization of threads is a performance bottleneck. Locks also interact poorly with thread scheduling and faults, resulting in poor system performance.
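The idea of executing lock-based critical sections speculatively, without serializing on the lock, later became known as lock elision. The sketch below uses x86 RTM intrinsics as a stand-in for the paper's hardware mechanism (the paper predates RTM); it is illustrative only and assumes RTM-capable hardware and compilation with -mrtm.

```c
/* Lock-elision sketch: attempt the critical section as a hardware
 * transaction; on abort, fall back to really taking the lock. */
#include <immintrin.h>
#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
int lock_held = 0;   /* read transactionally to detect a real holder */
long counter;

void increment(void) {
    if (_xbegin() == _XBEGIN_STARTED) {
        if (lock_held)        /* someone holds the real lock: abort  */
            _xabort(0xff);
        counter++;            /* critical section, run speculatively */
        _xend();
        return;
    }
    /* Fallback path: acquire the lock non-speculatively. */
    pthread_mutex_lock(&m);
    lock_held = 1;
    counter++;
    lock_held = 0;
    pthread_mutex_unlock(&m);
}
```

Threads that never conflict on `counter` commit concurrently, which is exactly the serialization the abstract identifies as the bottleneck of conservative locking.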
Adaptive cache coherency for detecting migratory shared data
- In Proc. of the 20th Int’l Symp. on Computer Architecture, 1993
"... ..."
(Show Context)
Adding Networks
2001
"... An adding network is a distributed data structure that supports a concurrent, lock-free, low-contention implementation of a fetch&add counter; a counting network is an instance of an adding network that supports only fetch&increment. We present a lower bound showing that adding networks ha ..."
Abstract
-
Cited by 118 (33 self)
- Add to MetaCart
(Show Context)
An adding network is a distributed data structure that supports a concurrent, lock-free, low-contention implementation of a fetch&add counter; a counting network is an instance of an adding network that supports only fetch&increment. We present a lower bound showing that adding networks have inherently high latency. Any adding network powerful enough to support addition by at least two values a and b, where |a| > |b| > 0, has sequential executions in which each token traverses Ω(n/c) switching elements, where n is the number of concurrent processes, and c is a quantity we call one-shot contention; for a large class of switching networks and for conventional counting networks the one-shot contention is constant. By contrast, counting networks have O(log n) latency [4,7]. This bound is tight. We present the first concurrent, lock-free, low-contention networked data structure that supports arbitrary fetch&add operations.
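For reference, the specification an adding network distributes is the one a centralized counter implements directly; a minimal C11 sketch (names ours) pins down the two operations the abstract distinguishes:

```c
/* Centralized reference semantics for the two counter operations. */
#include <stdatomic.h>

atomic_long counter;

/* fetch&add: add an arbitrary delta, return the pre-update value. */
long fetch_and_add(long delta) {
    return atomic_fetch_add(&counter, delta);
}

/* fetch&increment: the weaker operation a counting network supports. */
long fetch_and_increment(void) {
    return atomic_fetch_add(&counter, 1);
}
```

The paper's lower bound says that distributing the first operation over a switching network, to avoid this single hot-spot word, inherently costs latency that the second operation does not pay.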
Cooperative Shared Memory: Software and Hardware for Scalable Multiprocessors
- ACM Transactions on Computer Systems, 1992
"... We believe the paucity of massively-parallel, shared-memory machines follows from the lack of a shared-memory programming performance model that can inform programmers of the cost of operations (so they can avoid expensive ones) and can tell hardware designers which cases are common (so they can bui ..."
Abstract
-
Cited by 114 (29 self)
- Add to MetaCart
We believe the paucity of massively-parallel, shared-memory machines follows from the lack of a shared-memory programming performance model that can inform programmers of the cost of operations (so they can avoid expensive ones) and can tell hardware designers which cases are common (so they can build simple hardware to optimize them). Cooperative shared memory, our approach to shared-memory design, addresses this problem. Our initial implementation of cooperative shared memory uses a simple programming model, called Check-In/Check-Out (CICO), in conjunction with even simpler hardware, called Dir1SW. In CICO, programs bracket uses of shared data with a check-out directive marking the expected first use and a check-in directive terminating the expected use of the data. A cooperative prefetch directive helps hide communication latency. Dir1SW is a minimal directory protocol that adds little complexity to message-passing hardware, but efficiently supports programs written within the ...
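A sketch of the CICO annotation style, with the directives stubbed out as macros. The macro names are ours, modeled on the abstract's vocabulary; in a real system they would lower to coherence-protocol hints rather than no-ops.

```c
/* CICO-style annotations: bracket expected uses of shared data.
 * Stubbed as hints here; semantics are unaffected if they are ignored. */
#define CHECK_OUT_X(p) ((void)(p))  /* expect exclusive (writable) use */
#define CHECK_IN(p)    ((void)(p))  /* terminate expected use          */
#define PREFETCH_X(p)  ((void)(p))  /* cooperative prefetch hint       */

void scale_row(double *row, int n, double k) {
    CHECK_OUT_X(row);          /* mark the expected first use */
    for (int i = 0; i < n; i++)
        row[i] *= k;
    CHECK_IN(row);             /* done with this block of data */
}
```

Because the directives are performance hints, a program annotated this way runs correctly, if more slowly, on hardware that ignores them, which is what lets the directory protocol underneath stay minimal.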
Synchronization Without Contention
1991
"... Conventional wisdom holds that contention due to busy-wait synchronization is a major obstacle to scalability and acceptable performance in large shared-memory multiprocessors. We argue the contrary, and present fast, simple algorithms for contention-free mutual exclusion, reader-writer control, and ..."
Abstract
-
Cited by 95 (5 self)
- Add to MetaCart
Conventional wisdom holds that contention due to busy-wait synchronization is a major obstacle to scalability and acceptable performance in large shared-memory multiprocessors. We argue the contrary, and present fast, simple algorithms for contention-free mutual exclusion, reader-writer control, and barrier synchronization. These algorithms, based on widely available fetch_and_Φ instructions, exploit local access to shared memory to avoid contention. We compare our algorithms to previous approaches in both qualitative and quantitative terms, presenting their performance on the Sequent Symmetry and BBN Butterfly multiprocessors. Our results highlight the importance of local access to shared memory, provide a case against the construction of so-called "dance hall" machines, and suggest that special-purpose hardware support for synchronization is unlikely to be cost effective on machines with sequentially consistent memory.
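As one example of the local-spin style the abstract advocates, here is a sense-reversing barrier sketch in C11, written by us rather than taken from the paper: each arriving thread spins only on a flag it holds in its cache, and the last arrival terminates every spin with a single write.

```c
/* Sense-reversing barrier sketch. Initialize count = nthreads,
 * sense = 0, and each thread's private local_sense = 0. */
#include <stdatomic.h>

typedef struct {
    atomic_int count;      /* threads still to arrive this episode */
    atomic_int sense;      /* flips once per episode               */
    int nthreads;
} barrier_t;

void barrier_wait(barrier_t *b, int *local_sense) {
    *local_sense = 1 - *local_sense;           /* flip private sense   */
    if (atomic_fetch_sub(&b->count, 1) == 1) { /* last thread to arrive */
        atomic_store(&b->count, b->nthreads);  /* reset for next round */
        atomic_store(&b->sense, *local_sense); /* one write frees all  */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                  /* spin on cached flag  */
    }
}
```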
Efficient Distributed Shared Memory Based On Multi-Protocol Release Consistency
1994
"... A distributed shared memory (DSM) system allows shared memory parallel programs to be executed on distributed memory multiprocessors. The challenge in building a DSM system is to achieve good performance over a wide range of shared memory programs without requiring extensive modifications to the s ..."
Abstract
-
Cited by 68 (5 self)
- Add to MetaCart
A distributed shared memory (DSM) system allows shared memory parallel programs to be executed on distributed memory multiprocessors. The challenge in building a DSM system is to achieve good performance over a wide range of shared memory programs without requiring extensive modifications to the source code. The performance challenge translates into reducing the amount of communication performed by the DSM system to that performed by an equivalent message passing program. This thesis describes four novel techniques for reducing the communication overhead of DSM, including: (i) the use of software release consistency, (ii) support for multiple consistency protocols, (iii) a multiple writer protocol, and (iv) an update timeout mechanism. Release consistency allows modifications of shared data to be handled via a delayed update queue, which masks network latencies. Providing multiple cons...
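The multiple-writer protocol (item iii) is commonly implemented with page twins and diffs, so that concurrent writers to the same page do not falsely invalidate each other. A sketch of the diffing step, written by us; a real DSM would create the twin on the first write fault and run the diff at release time.

```c
/* Multiple-writer sketch: before its first write to a page, a node
 * saves a copy (the "twin"); at release time only the words that
 * differ from the twin are shipped, not the whole page. */
#define PAGE_WORDS 1024

typedef struct { unsigned off; unsigned val; } diff_entry;

/* Compare a dirty page against its twin; emit (offset, value) pairs. */
int make_diff(const unsigned *page, const unsigned *twin,
              diff_entry *out) {
    int n = 0;
    for (unsigned i = 0; i < PAGE_WORDS; i++)
        if (page[i] != twin[i])
            out[n++] = (diff_entry){ i, page[i] };
    return n;   /* only these n words travel over the network */
}
```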
Diffracting trees
- In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures, 1994
"... Shared counters are among the most basic coordination structures in multiprocessor computation, with applications ranging from barrier synchronization to concurrent-data-structure design. This article introduces diffracting trees, novel data structures for shared counting and load balancing in a dis ..."
Abstract
-
Cited by 64 (13 self)
- Add to MetaCart
(Show Context)
Shared counters are among the most basic coordination structures in multiprocessor computation, with applications ranging from barrier synchronization to concurrent-data-structure design. This article introduces diffracting trees, novel data structures for shared counting and load balancing in a distributed/parallel environment. Empirical evidence, collected on a simulated distributed shared-memory machine and several simulated message-passing architectures, shows that diffracting trees scale better and are more robust than both combining trees and counting networks, currently the most effective known methods for implementing concurrent counters in software. The use of a randomized coordination method together with a combinatorial data structure overcomes the resiliency drawbacks of combining trees. Our simulations show that to handle the same load, diffracting trees and counting networks should have a similar width w, yet the depth of a diffracting tree is O(log w), whereas counting networks have depth O(log² w). Diffracting trees have already been used to implement highly efficient producer/consumer queues, and we believe diffraction will prove to be an effective alternative paradigm to combining and queue-locking in the design of many concurrent data structures.
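A heavily simplified sketch of one diffracting balancer, the building block of these trees: two tokens that meet in the "prism" are diffracted to opposite outputs without touching the shared toggle bit; a lone token falls through to the toggle. Written by us for illustration, with a single prism slot and no waiting window or backoff.

```c
/* One diffracting balancer (simplified). Token ids must be nonzero. */
#include <stdatomic.h>

typedef struct {
    _Atomic int prism;   /* 0 = empty, else the id of a waiting token */
    atomic_int toggle;   /* serializes tokens that fail to pair off   */
} balancer;

/* Returns the output wire (0 or 1) the token should follow. */
int traverse(balancer *b, int my_id) {
    int expected = 0;
    /* Try to park in the prism so a partner can collide with us. */
    if (atomic_compare_exchange_strong(&b->prism, &expected, my_id)) {
        int me = my_id;
        /* Try to leave; failure means a partner diffracted us out. */
        if (!atomic_compare_exchange_strong(&b->prism, &me, 0))
            return 0;                 /* our partner took wire 1 */
    } else {
        /* Prism occupied: try to collide with the waiting token. */
        int waiting = expected;       /* id loaded by the failed CAS */
        if (atomic_compare_exchange_strong(&b->prism, &waiting, 0))
            return 1;                 /* diffracted: partner gets 0 */
    }
    /* No collision: fall through to the shared toggle bit. */
    return atomic_fetch_xor(&b->toggle, 1) & 1;
}
```

Collisions route pairs of tokens without any access to the toggle, which is how the tree sheds contention from its shared state under high load.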
Contention in Shared Memory Algorithms
1993
"... Most complexitymeasures for concurrent algorithms for asynchronous sharedmemory architectures focus on process steps and memory consumption. In practice, however, performance of multiprocessor algorithms is heavily influenced by contention, the extent to which processes access the same location at t ..."
Abstract
-
Cited by 63 (1 self)
- Add to MetaCart
(Show Context)
Most complexity measures for concurrent algorithms for asynchronous shared-memory architectures focus on process steps and memory consumption. In practice, however, performance of multiprocessor algorithms is heavily influenced by contention, the extent to which processes access the same location at the same time. Nevertheless, even though contention is one of the principal considerations affecting the performance of real algorithms on real multiprocessors, there are no formal tools for analyzing the contention of asynchronous shared-memory algorithms. This paper introduces the first formal complexity model for contention in multiprocessors. We focus on the standard multiprocessor architecture in which n asynchronous processes communicate by applying read, write, and read-modify-write operations to a shared memory. We use our model to derive two kinds of results: (1) lower bounds on contention for well known basic problems such as agreement and mutual exclusion, and (2) trade-offs betwe...