John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21--65, February 1991.
....primitive was developed. This proved to be much more efficient than the IPC-based barrier. Additional optimizations to the test-then-add barrier were investigated, including cache-line alignment of the barrier synchronization variable and exponential backoff while spin-waiting at the barrier [6]. The code is displayed in Figure 2. The global variable nt in Figure 2 stands for the number of threads, and is initialized prior to entry to the barrier. The global rollover is similarly initialized to 2 nt - 1. The alignment and size of the synchronization variable (to which the pointer ....
John M. Mellor-Crummey and Michael L. Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Transactions on Computer Systems, Vol 9, No 1, pages 21--65, February 1991.
....QOLB, IQOLB does not require any instruction set support, nor does it require any software changes. Software queue-based locking schemes were proposed by Anderson [9, 10] and Graunke and Thakkar [55]. Mellor-Crummey and Scott proposed MCS, an improvement to Anderson's algorithm. The MCS scheme [120, 121] is a software-based queued lock scheme. MCS adds requesters for a held lock into a software queue at the time of the request, using atomic operations such as swap and compare-and-swap to update the list. Arbitration for the eventual recipient of the lock is therefore performed in advance, first come, ....
....behavior. In spite of extensive research, OCC techniques have not been popular because of key limitations [124]. Lock-based synchronization. Lock-based synchronization has been extensively studied in the literature. These techniques attempt to optimize the lock and data transfer operations [10, 50, 81, 120, 141]. The techniques are not lock-free. These techniques suffer from locking overhead and serialization due to lock acquisitions. Martínez and Torrellas introduced Speculative Locks, allowing speculative threads to bypass a held lock and enter a critical section [117]. At any time the lock is always ....
John M. Mellor-Crummey and Michael L. Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Transactions on Computer Systems, 9(1):21--65, February 1991.
....released causes detrimental hot-spot contention. All the released waiters try to acquire the lock at once, exacerbating the wait times at that lock. Recently published techniques for more efficient spin-waiting on locks can be used to improve the performance of polling for highly contended locks [23]. These include exponential backoff and software queueing of spin waiters. [Table: Algorithm, Runtime (Kcycles), Ovh., Threads, with rows for MGrid and Jacobi; cell values garbled in extraction] ....
....to be an acceptable waiting algorithm. As pointed out in Section 1, this is because we investigate producer-consumer and barrier synchronization in addition to mutual exclusion synchronization, and because of the difference in the machine architectures and blocking costs. Other studies [1, 4, 11, 23] have focused on reducing bus (or network) interference caused when spinning is used as a waiting mechanism. These studies explored methods to reduce the overhead of memory contention while spin-waiting for locks and barriers. In cases of high lock contention, simple test-and-test-and-set [26] leads to ....
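As a rough illustration of the techniques these excerpts name, the following is a minimal C++ sketch of a test-and-test-and-set spin lock with exponential backoff. It is not taken from any of the cited papers. The class name `BackoffSpinLock`, the helper `run_backoff_demo`, and the delay constants are all hypothetical choices made for this example, and the default sequentially consistent atomics are used for simplicity.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Sketch of a test-and-test-and-set lock with exponential backoff.
// All names and constants here are illustrative assumptions.
class BackoffSpinLock {
    std::atomic<bool> held_{false};
public:
    void acquire() {
        int delay_us = 1;              // assumed initial backoff
        const int max_delay_us = 256;  // assumed backoff cap
        for (;;) {
            // "Test": spin on a plain read until the lock looks free.
            while (held_.load()) {
                std::this_thread::yield();
            }
            // "Test-and-set": attempt the atomic write only now.
            if (!held_.exchange(true)) {
                return;  // previous value was false, so we own the lock
            }
            // Lost the race: wait, then double the delay up to the cap.
            std::this_thread::sleep_for(std::chrono::microseconds(delay_us));
            delay_us = std::min(delay_us * 2, max_delay_us);
        }
    }
    void release() { held_.store(false); }
};

// Hypothetical driver: each thread increments a shared counter under the lock.
long run_backoff_demo(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    BackoffSpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.acquire();
                ++counter;  // protected by the lock
                lock.release();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Every waiter still polls the same shared flag, so a release can trigger a burst of retries; the backoff only spreads those retries out in time.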
John M. Mellor-Crummey and Michael L. Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Transactions on Computer Systems, 9(1):21--65, February 1991.
....Therefore, the home page for a body can be some other node than the ones that access it. The clustering benefit, direct and indirect, is not enough to avert an increase in the per-node protocol traffic as the clustering degree increases. Unlike Chapter 4, in these runs barnes uses MCS locks [MCS91] instead of the default Blizzard message locks, which were not designed to deal with multiprocessor nodes. The message locks are distributed among the nodes. Each lock or unlock request is implemented by sending an explicit message to the node where the lock resides. When multiprocessor nodes are ....
....were involved only two messages need to be exchanged. The util protocol library implements synchronization primitives typically used in PARMACS applications. The primitives supported include locks and barriers. The user can choose at compile time between the message and MCS [MCS91] lock implementations. In message locks, the default lock implementation, the locks are distributed in a round-robin fashion among the nodes. Then, each lock or unlock request is implemented by sending an explicit message to the node where the lock resides. MCS locks are shared-memory-based locks. Message locks ....
John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, February 1991.
.... algorithms: ours assuming a garbage collector (HF) or using reference counting (HF-RC), and Israeli and Rappoport's CAS1-based design as the only practical alternative from Figure 1 (IR). We also implemented two lock-based schemes using the queued spin lock design of Mellor-Crummey and Scott [12], either with one lock to protect the entire vector (MCS) or with fine-grain (i.e. per-entry) locks (MCS-FG). All our measurements exclude benchmark initialisation. We start timing after all threads are created and signal that they are executing; from this point the benchmark executes for two ....
John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21--65, February 1991.
.... location pointed to by the first argument is equal to the value 7 Forces: Queued locks resolve Fairness very well, Memory Bandwidth and Memory Size reasonably well, and Memory Latency rather poorly. Solution: Use a queued lock primitive such as the MCS lock shown in Figure 3 [MCS91a]. The idea behind the queued lock is that each spinning CPU has its own queue element to spin on, so that only the CPU that has just been granted the lock will incur cache misses to access the new lock state. This is in sharp contrast to the test-and-set lock's behavior, where every spinning CPU ....
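The queued-lock idea described in these excerpts (each waiter spins on its own queue element, linked into the queue with swap and compare-and-swap) can be sketched in C++ as follows. This is a minimal sketch under stated assumptions, not the code from Figure 3 or from the cited paper. The names `McsNode`, `McsLock`, and `run_mcs_demo` are hypothetical, the default sequentially consistent atomics are used for simplicity, and waiters call `yield()` rather than pure busy-waiting.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Per-thread queue element: a waiter spins only on its own `locked` flag.
struct McsNode {
    std::atomic<McsNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

// Sketch of an MCS-style queued lock; names are illustrative assumptions.
class McsLock {
    std::atomic<McsNode*> tail_{nullptr};  // last waiter, or nullptr if free
public:
    void acquire(McsNode& me) {
        me.next.store(nullptr);
        me.locked.store(true);
        // Atomic swap appends this node and returns the previous tail.
        McsNode* pred = tail_.exchange(&me);
        if (pred != nullptr) {
            pred->next.store(&me);  // link behind the predecessor
            while (me.locked.load()) {
                std::this_thread::yield();  // spin on our own flag only
            }
        }
    }
    void release(McsNode& me) {
        McsNode* succ = me.next.load();
        if (succ == nullptr) {
            // No visible successor: try to mark the lock free.
            McsNode* expected = &me;
            if (tail_.compare_exchange_strong(expected, nullptr)) {
                return;
            }
            // A successor has swapped in but not linked yet; wait for it.
            while ((succ = me.next.load()) == nullptr) {
                std::this_thread::yield();
            }
        }
        succ->locked.store(false);  // hand the lock to the next waiter
    }
};

// Hypothetical driver: each thread increments a shared counter under the lock.
long run_mcs_demo(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    McsLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            McsNode node;  // one queue element per thread
            for (int i = 0; i < iters; ++i) {
                lock.acquire(node);
                ++counter;  // protected by the lock
                lock.release(node);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Because the swap on the tail pointer fixes each waiter's position at request time, the lock is handed over in first-come, first-served order, and a release writes only to the successor's flag.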
John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ASPLOS, January 1991.
....other words, we ignore any cache displacements caused by the actions of the operating system or other processes running on the same processor. The first local-spin algorithms were algorithms in which read-modify-write primitives are used to enqueue blocked processes onto the end of a spin queue [12, 28, 47]. These algorithms, along with several related algorithms published more recently, are surveyed in Section 3.1. In each of these algorithms, a constant number of remote memory references is required per critical section execution. The algorithms vary in the synchronization primitives used, and ....
....is stored in a shared array. Each of these algorithms has O(1) time complexity under the CC model, but unbounded time complexity under the DSM model. In the third algorithm we consider, the spin queue is stored as a shared linked list. This algorithm, which was proposed by Mellor-Crummey and Scott [47], has O(1) time complexity under both the CC and DSM models. Algorithm TA. T. Anderson's algorithm [12], denoted Algorithm TA, is shown in Figure 2. Algorithm TA uses both fetch-and-inc and atomic-add. A fetch-and-inc primitive that takes an increment value as input can be used in place of ....
J. Mellor-Crummey and M. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, February 1991.
.... problems include assigning successive memory addresses to processors [18], balancing the computational load on a computer system while minimizing the maximum load on a server [6,35,37,39], and implementing barrier data structures in order to synchronize processes operating at different speeds [1, 24, 30,32]. In a seminal paper, Aspnes et al. [5] proposed balancing networks as a new approach to solving balancing problems. Balancing networks, resembling comparator networks (see, e.g., [15, Chapter 28] or [29, Section 5.3.4]), are constructed from simple multi-input, multi-output computing elements ....
J. M. Mellor-Crummey and M. L. Scott, "Algorithms for Scalable Synchronization on Shared Memory Multiprocessors," Technical Report 342, Department of Computer Science, University of Rochester, April 1990.
.... in the case of the spin-on-read mechanism indicates a complexity of O(n^2) in the bus traffic (where n is the total number of processors in the system) [2,9]. Similarly, it has been shown that other approaches like barrier synchronization with counters could achieve O(n log n) proportionality [2,10,11], and the CBL scheme with private caches [2] has a bus traffic complexity of O(n) (Table 2).
Table 2. Bus traffic (contention) complexities of different mechanisms.
Mechanism            Traffic
Spin                 Exponential
Spin-on-read         O(n^2)
CBL                  O(n)
Barrier (counters)   O(n log n)
SoCSU                constant
Therefore, ....
J. M. Mellor-Crummey, M.L. Scott, Algorithms for scalable synchronization on shared-memory multiprocessors, ACM Trans. Comput. Syst., 9, 1, February 1991, pp 21-65.
.... while the DASH project [Lenoski et al. 1990] was one of the earliest to introduce latency-hiding strategies (an issue now with uniprocessor systems). Tree-based barriers attempt to distribute the synchronization overhead, so a barrier does not become a hot spot for global contention for locks [Mellor-Crummey and Scott 1991]. 6.11 Exercises 6.1. You have two alternatives of similar price for buying a computer: a 4-processor system with 1 Gbyte of RAM and 20 Gbytes of disk, but which cannot be upgraded further; a 2-processor system with 256 Kbytes of RAM and 10 Gbytes of disk, which can be expanded to 20 ....
J. M. Mellor-Crummey and M. L. Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, ACM Trans. on Computer Systems, vol. 9, no. 1, February 1991, pp. 21--65.
....solutions that introduce blocking are penalised by locking that introduces priority inversion, deadlock scenarios and performance bottlenecks. The time that a process can spend blocked while waiting to get access to the critical section can form a substantial part of the algorithm execution time [5, 9, 10, 14]. There are two main reasons that locking is so expensive. The first reason is the convoying effect that blocking synchronisation suffers from: if a process holding the lock is preempted, any other process waiting for the lock is unable to perform any useful work until the process that holds the ....
J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems (TOCS), 9(1):21-65, Feb. 1991.
....the aim to lower the contention when the system is in a high congestion situation. These implementations give different execution times under different contention instances. But still the time spent by the processes on the synchronisation can form a substantial part of the program execution time [9, 15, 16, 18, 29]. The reason for this is that typical synchronisation is based on blocking that introduces performance bottlenecks because of busy waiting and convoying. Busy waiting tends to produce a large amount of memory and interconnection network contention. The convoying effect that takes place when a ....
J. M. Mellor-Crummey and M. L. Scott, Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, ACM Trans. on Computer Systems, 9(1), pp. 21-65, February 1991.
....a small percentage of requests from each processor destined for the particular memory module can severely degrade the effective communication bandwidth of a large system. Some early researchers in this area resorted to hardware approaches [3], [9], [14] and others sought software solutions [10], [11]. Hardware approaches [3] incorporate certain hardware in the interconnection network to trap and combine access requests heading toward the same memory location for hot-spot relief. However, the cost overhead due to added hardware poses a major concern. A less costly hardware combining technique ....
....on which it spins. This scheme successfully disperses processors waiting for a single spin location at a remote memory module and lets each waiting processor spin on a location situated at a different remote memory module, with all spin locations forming an array. Mellor-Crummey and Scott's [11] list-based queuing lock (called the MCS lock) was inspired by the Queue-On-Lock-Bit described in [14] but implemented in software. This scheme uses a fetch-and-store atomic instruction on a lock located in a remote memory module to link the involved processors to form a single waiting list. The ....
J.M. Mellor-Crummey and M.L. Scott, "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors," ACM Trans. Computer Systems, vol. 9, no. 1, pp. 21-65, Feb. 1991.
....to complete operations in a finite number of steps [44]. As we are more concerned with the efficiency of synchronization algorithms, wait-freedom is often too expensive to implement for synchronization. On the other hand, lock-free algorithms with careful implementation can be quite efficient [124]. In [75], the authors described an algorithm for constructing busy-wait synchronization that aims to reduce lock contention. It is well known that conservative parallel simulation cannot beat the critical path [50]. To make things worse, it is considered impractical to identify and therefore schedule ....
J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, 1991.
.... has yet occurred). While synchronization can be a dominant source of overhead, at least that portion of the overhead owing to non-productive usage of shared resources can often be eliminated to a large extent through the use of appropriate scalable synchronization primitives and algorithms [1], [10]. Process management overhead refers to the time required to create, destroy and schedule multiple units of sequential execution. Although operating system processes are expensive (as operating systems are slow), they may be used sparingly (typically one process per processor), relying instead on ....
J. M. Mellor-Crummey, M. L. Scott, "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors", ACM Transactions on Computer Systems, Vol. 9, No. 1 (February 1991), pp. 21-65.
.... Recent work on shared-memory mutual exclusion has shed some light on the following fundamental question: given a system model, what is the time complexity of an optimal mutual exclusion algorithm? With fetch-and-φ primitives the answer is trivial, because constant-time algorithms are already known [10, 17, 25]. However, with other primitives, there are still gaps between the best known algorithms and lower bounds. In sequential programming, time complexity is easily defined: one simply counts the total number of operations performed by an algorithm. However, in concurrent programming, this definition is ....
.... assuming each enabled read or write operation is executed within some constant time bound, termed a time unit [13]. Recent work on scalable mutual exclusion algorithms has shown that the most crucial factor in determining an algorithm's performance is the amount of interconnect traffic it generates [10, 17, 25, 30]. In light of this, we define the time complexity of a mutual exclusion algorithm to be the worst-case number of remote memory references by one process in order to enter and then exit its critical section. A remote memory reference is a shared variable access that requires an interconnect ....
J. Mellor-Crummey and M. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, February 1991.
....solutions that introduce blocking are penalised by locking that introduces priority inversion, deadlock scenarios and performance bottlenecks. The time that a process can spend blocked while waiting to get access to the critical section can form a substantial part of the algorithm execution time [7, 11, 12, 18]. There are two main reasons that locking is so expensive. The first reason is the convoying effect that blocking synchronisation suffers from: if a process holding the lock is preempted, any other process waiting for the lock is unable to perform any useful work until the process that holds the ....
J. M. Mellor-Crummey and M. L. Scott, Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, ACM Trans. on Computer Systems, 9(1), February 1991.
.... release lock( my pe ( Table 5: MCS Lock. The MCS Lock, as prototyped by Mellor-Crummey and Scott, guarantees FIFO ordering of lock acquisitions, spins on local variables only, requires a small constant amount of space per lock, and works equally well on machines with and without coherent caches [10]. We have adjusted the MCS Lock to the virtual shared Cray T3E memory model (see Appendix). The consistent implementation of the MCS lock is shown in Table 5. Notice that most of our lock-related variables are of type short; this entails no severe limitation on the use of our routines and faster ....
Mellor-Crummey, J. M. and Scott, M. L. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comp. Syst. C-9 (1), 1991, pp. 21--65.
John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21--65, February 1991.
John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM ToCS, 9(1):21--65, 1991.
John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. Technical Report 342, Computer Science Department, University of Rochester, Rochester, NY, April 1990.
John M. Mellor-Crummey and Michael L. Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Transactions on Computer Systems, Vol 9, No 1, pages 21--65, February 1991.