42 citations found.
J.M. Mellor-Crummey and M.L. Scott, "Synchronization without Contention," Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Systems, pp. 269-278, Apr. 1991.

This paper is cited in the following contexts:
An Empirical Evaluation of Performance-Memory Trade-offs in.. - Das, Fujimoto (1997)   (4 citations)

....predicates such as GVT computation or choosing one or more suitable events for cancelback are implemented by stopping all processors using barrier synchronization locks. An efficient and scalable barrier synchronization algorithm called the tournament barrier is used for fast synchronization [23]. The processors resume again with a barrier after such computations have been completed. The global computation within each pair of barriers is optimized as much as possible. Such computation is invoked only when the system runs out of memory by trying to send an event. After synchronizing at a ....

J.M. Mellor-Crummey and M.L. Scott, "Synchronization without Contention," Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Systems, pp. 269-278, Apr. 1991.


Mechanisms for Efficient Shared-Memory, Lock-Based Synchronization - Kagi (1999)   (2 citations)

....mechanism. It also presents the first quantitative evaluation of the locking primitive called QOLB [GVW89] with a microbenchmark and six shared-memory parallel applications, comparing its performance with sixteen previously proposed locking constructs including test&set, test&test&set [RS84], MCS [MCS91a, MCS91b], and the reactive synchronization algorithm [LA94]. Fourth, it discusses practical issues in implementing each of the identified locking mechanisms on current and future shared-memory multiprocessors. Fifth, I show that it is feasible to implement on today's hardware an efficient synchronization ....

....systems with arbitrary network topologies. Anderson [And89, And90] and Graunke and Thakkar [GT90] independently describe queue-based locking algorithms implemented entirely in software. Subsequently, Mellor-Crummey and Scott describe improvements to Anderson's algorithm in related papers [MCS91a, MCS91b]. These proposals allocate data structures in shared memory and insert processors in lists or circular arrays using atomic instructions such as swap or compare&swap to update the concurrent data structures correctly. The price of maintaining the queue in software is somewhat larger inefficiencies. ....

[Article contains additional citation context not shown here]

John M. Mellor-Crummey and Michael L. Scott. Synchronization without contention. In Proceedings of the Fourth Symposium on Architectural Support for Programming Languages and Operating Systems, pages 269--278, April 1991.


Compiling for Hierarchical Shared-Memory Multiprocessors - Martens, Jayasimha (1994)

....a NUMA architecture, though in the NUMA efficiently may be more difficult to define. In a typical NUMA machine commonLevel(L) would place the item in one of the processors in the list L. Ideally, to reduce the cost of busy waiting, the item would be placed in a consumer rather than the producer [25, 26]; this may require an additional parameter to commonLevel( Some coherence problems may be caused by a naive implementation of PEs writing into other PEs' memories; this writing will occur due to the fact that memory local to one PE will be in the hierarchy with respect to another PE. This ....

John M. Mellor-Crummey and Michael L. Scott. Synchronization without Contention. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 269--278, April 1991.


Efficient Techniques for Nested and Disjoint Barrier.. - Ramakrishnan, Scherson, .. (1999)   (4 citations)

....of message-passing protocols. Since dedicated hardware barrier trees are intrinsically parallel and have very low latency, they are usually an order of magnitude faster than software barriers. There exist numerous algorithms and methods in the literature for improved software barriers, including [1, 7, 10, 13, 14, 12, 20], but these are improvements on a mechanism that is inherently slow. Methods for masking the latency of barriers have also been proposed [5, 6]. These methods hide the synchronization overhead as well as the time spent waiting for other processors to reach the barrier. They depend on being able to ....

J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 269--278, April 1991.


A Comparison of SCI and Typhoon - Ackaouy, Rotenberg (1995)

....well) and it provides the base hardware on top of which QOLB can be implemented. Queue-based locking is an efficient mechanism for implementing spin locks in multiprocessors. QOLB maintains the queue in hardware and consequently does not incur the overhead of the all-software MCS algorithm [16]. QOLB also has the distinct advantage of being able to pass a data structure along with the lock when it is relinquished. 1.2 Typhoon: The Typhoon architecture approaches the problem of cache-coherent shared memory in a novel way, drawing from both software and hardware paradigms. The designers ....

J. M. Mellor-Crummey, and M. L. Scott, "Synchronization Without Contention," Proceedings of the 4th Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 269-278, April 1991.


The Raven Kernel: a Microkernel for Shared Memory Multiprocessors - Ritchie (1993)   (1 citation)

....cycles. This bottleneck appears to become a factor in systems with more than eight processors. Also, the cost of cache coherency in some systems can impose other bottlenecks to the system. An alternative to spin waiting involves techniques based on wait-free synchronization [Her91, Her90, MCS91b] and data structures known as lock-free objects [Ber91, MP91]. The idea here is optimistic: allow concurrent data accesses without blocking. After a modification to a data structure is made, the algorithm checks to see if structures are consistent, and if not, the operation is rolled back. ....

J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. In PROC of the Fourth ASPLOS, pages 269--278, Santa Clara, CA, 8-11 April 1991. In CAN 19:2, OSR 25 (special issue), and ACM SIGPLAN Notices 26:4.


Evaluating Synchronization on Shared Address Space.. - Kumar, Jiang.. (1999)   (11 citations)

....all processors except the first one in the queue can be delayed by using backoff proportional to each processor's position in the queue. When Fetch&Op is used to implement ticket lock, even the spinning is not in cache and the backoff decreases the amount of network traffic generated. MCS: An MCS lock [11] is also a fair lock that uses a distributed linked list to maintain the queue of waiters. Each waiter spins on a separate node of the linked list. This allows the processor releasing the lock to selectively signal only the processor waiting at the head of the queue, thereby avoiding the ....
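[Illustrative sketch, not taken from the cited papers.] The queue discipline described in this context (each waiter spins on its own list node, and the releaser signals only the head waiter) might look roughly like the following. The names `QNode`, `MCSLock`, and `run_demo` are hypothetical, and a `threading.Lock` stands in for the hardware swap and compare-and-swap instructions, which is an assumption made only so the sketch runs in plain Python.

```python
import threading
import time


class QNode:
    """Per-thread queue node; each waiter spins only on its own `locked` flag."""

    def __init__(self):
        self.next = None
        self.locked = False


class MCSLock:
    def __init__(self):
        self.tail = None                 # last node in the waiter queue, None if free
        self._atomic = threading.Lock()  # ASSUMPTION: emulates atomic swap / compare-and-swap

    def _swap_tail(self, node):
        with self._atomic:
            prev, self.tail = self.tail, node
            return prev

    def _cas_tail(self, expected, new):
        with self._atomic:
            if self.tail is expected:
                self.tail = new
                return True
            return False

    def acquire(self, node):
        node.next = None
        node.locked = True
        prev = self._swap_tail(node)     # enqueue ourselves at the tail
        if prev is not None:
            prev.next = node             # link behind the predecessor
            while node.locked:           # spin on our own node only
                time.sleep(0)

    def release(self, node):
        if node.next is None:
            if self._cas_tail(node, None):   # no successor: the lock becomes free
                return
            while node.next is None:         # a successor is mid-enqueue; wait for the link
                time.sleep(0)
        node.next.locked = False             # signal only the waiter at the head


def run_demo(num_threads=4, iterations=100):
    """Hypothetical driver: increments a shared counter under the lock."""
    lock = MCSLock()
    counter = [0]

    def worker():
        node = QNode()
        for _ in range(iterations):
            lock.acquire(node)
            value = counter[0]
            time.sleep(0)                # widen the race window inside the critical section
            counter[0] = value + 1
            lock.release(node)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Calling `run_demo()` should return `num_threads * iterations` if mutual exclusion holds. On real hardware the point of the design is that each spin variable can live in the waiter's local memory or cache; the Python version only mirrors the control flow.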

....does only slightly better. However, llscTickProp performs significantly better because of the proportional backoff, although it is hurt when the contention is low. As expected, llscMcs performs the best among the LL/SC locks. It is worth noting that the extra network transactions that llscMcs incurs [11] when there is only one waiting processor can be observed. The performance of the two Fetch&Op-based ticket locks is similar up to 16 processors, beyond which the reduced contention at memory due to proportional backoff pays off. Overall, the fopTicketProp lock performs the best under any contention ....

Mellor-Crummey, J. M., and Scott, M. L. Synchronization without contention. In Architectural Support for Programming Languages and Operating Systems (Santa Clara, California, April 8--11, 1991), pp. 269--278.


Job Scheduling in Multiprogrammed Parallel Systems - Feitelson (1997)   (16 citations)

.... IRIX [357, 106]. It is also used in thread packages such as Presto [59, 192] and in microtasking on multiprocessor Cray supercomputers [220]. Much progress has been made lately regarding the efficient implementation of locks, especially by way of exploiting local caches with hardware coherence [491, 19, 239, 246, 399]. The main idea behind these implementations is that it is possible for each PE to busywait on a variable in its own cache, thereby avoiding any extra load on the communication network. But the variable in the cache represents a variable in the shared memory, so when a PE releases the lock, all it ....

J. M. Mellor-Crummey and M. L. Scott, "Synchronization without contention". In 4th Intl. Conf. Architect. Support for Prog. Lang. & Operating Syst., pp. 269--278, Apr 1991.


Job Scheduling in Multiprogrammed Parallel Systems - Feitelson (1997)   (16 citations)

....system which is based on IRIX. It is also used in thread packages such as Presto, and in microtasking on multiprocessor Cray supercomputers. Much progress has been made lately regarding the efficient implementation of locks, especially by way of exploiting local caches with hardware coherence [297, 12, 143, 149, 242]. The main idea behind these implementations is that it is possible for each PE to busywait on a variable in its own cache, thereby avoiding any extra load on the communication network. But the variable in the cache represents a variable in the shared memory, so when a PE releases the lock, all it ....

J. M. Mellor-Crummey and M. L. Scott, "Synchronization without contention". In 4th Intl. Conf. Architect. Support for Prog. Lang. & Operating Syst., pp. 269--278, Apr 1991.


Communicators: Object-Based Multiparty Interactions for Parallel .. - Feitelson (1991)   (5 citations)

....region interaction in the fuzzy barrier example of fig. 8 is also a reading interaction. Thus processes do not need to be serialized when they exit the barrier region. The importance of this distinction is that the interactions can then be performed according to a readers-writers locking scheme [20, 27]: any number of reading interactions may be performed simultaneously, while only writing interactions require mutual exclusion. Note that the identification of the interactions as reading or writing is part of the implementation, and so is the readers-writers protocol. The programmer is shielded ....
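[Illustrative sketch, not taken from the cited papers.] The readers-writers rule stated in this context (any number of simultaneous readers, exclusive writers) can be sketched as below. This is a simple blocking, reader-preference variant built on `threading.Condition`; it is not the queue-based reader-writer lock of the cited work, and the class and method names are hypothetical.

```python
import threading


class ReadersWriterLock:
    """Many concurrent readers or one exclusive writer (reader preference)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0       # number of readers currently inside
        self._writer = False    # True while a writer holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writer:             # readers wait only for an active writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:          # last reader out wakes any waiting writer
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:   # writers need full exclusion
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

Because readers never wait for queued writers, a steady stream of readers can starve a writer in this variant; a fair scheme would need an explicit queue.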

J. M. Mellor-Crummey and M. L. Scott, "Synchronization without contention". In 4th Intl. Conf. Architect. Support for Prog. Lang. & Operating Syst., pp. 269--278, Apr 1991.


High Performance Switch Architectures For CC-Numa Multiprocessors - Iyer (1999)

....arrived out of order and holds it for later service. Such a situation is possible due to the existence of virtual channels and a round robin channel allocation policy. The synchronization method used in our simulations is based on spin locks using test and set operation with exponential backoff [49]. Exponential backoff improves the system throughput by enforcing delays between consecutive test and set operations based on an initial constant k that increases geometrically for every subsequent attempt. Barriers used in many of the applications were implemented using a shared counter. We also ....
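[Illustrative sketch, not taken from the cited papers.] The backoff policy described in this context (a delay that starts at a constant k and grows geometrically with each failed attempt) can be sketched as follows. The growth factor of 2, the cap, and all names are assumptions, and a non-blocking `threading.Lock.acquire` stands in for the test-and-set instruction.

```python
import threading
import time


def backoff_delays(k, attempts, factor=2, cap=None):
    """Delays before retries 1..attempts: k, k*factor, k*factor**2, ... (optionally capped)."""
    delays = []
    delay = k
    for _ in range(attempts):
        delays.append(delay if cap is None else min(delay, cap))
        delay *= factor
    return delays


class BackoffSpinLock:
    def __init__(self, k=1e-6, factor=2, cap=1e-3):
        self._flag = threading.Lock()   # ASSUMPTION: non-blocking acquire emulates test-and-set
        self.k = k
        self.factor = factor
        self.cap = cap

    def acquire(self):
        delay = self.k
        while not self._flag.acquire(blocking=False):   # test-and-set failed
            time.sleep(delay)                           # back off before the next attempt
            delay = min(delay * self.factor, self.cap)  # geometric growth, bounded by cap

    def release(self):
        self._flag.release()
```

The cap is a common practical addition so that a long-waiting thread does not sleep far past the moment the lock is released; the quoted text does not say whether the simulated system used one.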

J. M. Mellor-Crummey and M. L. Scott, "Synchronization Without Contention," In Proceedings of Fourth Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 269-278, Apr. 1991, ACM.


Nested Actions in Eos - Daynès, Gruber (1992)

....is mutual exclusion achieved. Test-and-set is now a common assembler instruction in a wide variety of processors; it is fast and requires only one machine word of storage. For long, the test-and-set instruction has been known to have poor performance in multiprocessors. This is no longer true though [12, 11]. However, test-and-set operations may significantly increase paging in persistent systems. Each test-and-set operation upon a memory word dirties the page it is in. Hence, if test-and-set operations are applied directly on store pages, paging is drastically increased. Similarly to locks, ....

J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. In ASPLOS, International Conf. on Architectural Support for Programming Languages and Operating Systems, pages 269--278, Santa Clara, CA (USA), April 1991.


Fast Locks in Distributed Shared Memory Systems - Hermannsson, Wittie (1994)

....test and set. Only one gets the lock. To reduce mis trials before rechecking a lock, an exponential backoff delay after release of a lock with test-test-and-set lessens contention [2]. The delay is locally doubled whenever the lock is unlocked, but an attempt to obtain it fails. Queue-based locks [9, 10, 2, 15, 16, 3, 18] are alternatives to retested locks. A lock request is sent to a lock owner. If the lock is free, permission is granted; if busy, the request is queued. When the lock is freed, the next queued process gets permission. Lock queues can be supported in hardware [9, 15] or in software [10, 2, 16, 3, ....

....2, 15, 16, 3, 18] are alternatives to retested locks. A lock request is sent to a lock owner. If the lock is free, permission is granted; if busy, the request is queued. When the lock is freed, the next queued process gets permission. Lock queues can be supported in hardware [9, 15] or in software [10, 2, 16, 3, 18]. Test and set with hardware support [20] is adequate on tightly coupled multiprocessors. Queue-based locks are needed in distributed memory systems to minimize network traffic after lock release. On multicomputers connected by networks, locating the lock owner is an issue. Distributed directory ....

J.M. Mellor-Crummey and M.L. Scott. Synchronization Without Contention. 4th Int. Conf. on Arch. Support for Prog. Lang. and Op. Sys. (ASPLOS), 269--278, April 1991.


Eager Combining: A Coherency Protocol for Increasing.. - Ricardo Bianchini (1994)   (6 citations)

....al. 1983] and the IBM RP3 [Pfister et al. 1985] can alleviate contention for spin locks by combining requests to a single memory location. Alternatively, spin locks can be implemented so as to spin on local memory only, thereby eliminating most remote references associated with synchronization [Mellor-Crummey and Scott, 1991]. Optimization algorithms can avoid contention by examining or updating the global solution infrequently. Linear algebra algorithms can exploit the properties of numerical equations to improve locality of reference, and as a side effect eliminate most producer-consumer sharing [Gallivan et al. ....

J. M. Mellor-Crummey and M. L. Scott, "Synchronization Without Contention," Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 269--278, April 1991.


Efficient Software Synchronization on Large Cache.. - Magnusson, Landin.. (1994)   (6 citations)

....is no possibility of starvation. Anderson's lock requires an atomic fetch-and-increment operation. The pseudo code for Anderson's lock is given in figure 1. is the modulus operator. N is the number of processors that can conceivably compete for the lock. As pointed out by Mellor-Crummey and Scott [MCS91b], Anderson's lock is incorrect unless the number of processors is [Figure 2: Graunke and Thakkar's lock (code listing not recoverable)] ....
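[Illustrative sketch, not taken from the cited papers.] An array-based queue lock of the kind this context attributes to Anderson (fetch-and-increment to take a slot, modulus to wrap around N slots) might be sketched as below. The names are hypothetical, and a `threading.Lock` stands in for the atomic fetch-and-increment. Python integers do not overflow, so this sketch sidesteps any counter wraparound concern; that is presumably the issue behind the correctness remark in the quoted text, though the snippet is too damaged to confirm it.

```python
import threading
import time


class AndersonLock:
    def __init__(self, n):
        self.n = n                        # max number of threads that may compete
        self.flags = [False] * n          # one spin slot per potential waiter
        self.flags[0] = True              # slot 0 starts out holding the go signal
        self.next_slot = 0
        self._atomic = threading.Lock()   # ASSUMPTION: emulates atomic fetch-and-increment

    def _fetch_and_increment(self):
        with self._atomic:
            ticket = self.next_slot
            self.next_slot += 1
            return ticket

    def acquire(self):
        slot = self._fetch_and_increment() % self.n
        while not self.flags[slot]:       # spin on our own slot only
            time.sleep(0)
        self.flags[slot] = False          # reset the slot for its next user
        return slot

    def release(self, slot):
        self.flags[(slot + 1) % self.n] = True   # hand the lock to the next slot


def run_anderson_demo(num_threads=4, iterations=100):
    """Hypothetical driver: increments a shared counter under the lock."""
    lock = AndersonLock(num_threads)
    counter = [0]

    def worker():
        for _ in range(iterations):
            slot = lock.acquire()
            value = counter[0]
            time.sleep(0)                 # widen the race window inside the critical section
            counter[0] = value + 1
            lock.release(slot)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

The sketch is only valid while at most `n` threads contend at once, which matches the quoted definition of N as the number of processors that can conceivably compete for the lock.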

....[Figure 2: Graunke and Thakkar's lock (code listing not recoverable)] an exponent of 2. This is easily corrected, and they present one such correction [MCS91b]. Anderson compared his queue lock with spin locks with various back-off strategies. In all cases, Anderson's queue lock performed worse for 1-4 processors, and better for 7 or more. His implementation on the Sequent Symmetry Model B, however, did not use an atomic fetch-and-increment, since ....

[Article contains additional citation context not shown here]

J.M. Mellor-Crummey and M.L. Scott. Synchronization Without Contention. In Proceedings of the 4th Annual Architectural Support for Programming Languages and Operating Systems, pages 269--278, 1991.


A Concurrent Fast-Fits Memory Manager - Johnson (1991)   (6 citations)

....to the W-only algorithm, trading variance in the execution time for increased concurrency. For this implementation, we assume a spin lock implementation of the R and W locks in which the head of the queue can be read by the processes, such as the one described by Mellor-Crummey and Scott [14]. The key to the simulation is the observation that in the RWU algorithm, at most one W lock will be in a node's lock queue at a time (except for the anchor, where there is no problem). The processes use the following protocol to place locks. In order to upgrade from a R lock to a U lock, the ....

J.M. Mellor-Crummey and M.L. Scott. Synchronization without contention. In Fourth Int'l Conference on Architectural Support for Programming Languages and Operating Systems, pages 269--278, 1991.


Performance Evaluation of Hierarchical Ring-Based Shared.. - Mark Holliday (1992)   (12 citations)

....spots locations. Significant progress has been made in reducing hot spot traffic, especially hot spot traffic due to synchronization. Techniques include separate synchronization networks (possibly with combining) [18] and hot-spot-free software algorithms that use distributed data structures [20]. Furthermore, flow control mechanisms may be useful, especially when hot senders (processors with usually high request rates or a high favoritism to the hot spot) are a factor. Evaluating alternative techniques for reducing hot spot traffic is outside the scope of this paper. Instead, we ....

....systems can become unstable at favorite memory probabilities on the order of 1 to 2 under reasonable request rates. If the memory queues are of inadequate length, significantly lower favorite memory probabilities can cause instability. The techniques proposed in the synchronization literature [20, 18] (such as separate synchronization networks with combining or software algorithms using distributed data structures) will likely reduce the likelihood of hot spots. The simulated system does provide flow control in that the number of cycles before a source processing module submits a retry is a ....

J.M. Mellor-Crummey and M.L. Scott. Synchronization without contention. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 269--278, Santa Clara, CA, April 1991.


Design and Implementation of a Multi-purpose Cluster System Network .. - Ang (1999)

....of shared memory systems software has no direct control over data movement, which occurs indirectly in response to cache misses. Although it simplifies programming, this feature also creates inefficiency, particularly for control-oriented communication. The work of Mellor-Crummey and Scott [74, 75] on shared-memory implementations of mutex lock and barrier provides an interesting illustration. At a meta level, the solution is to understand the behavior of the underlying coherence protocol, and craft algorithms that coax it into communicating in a way that is close to what a direct message ....

J. M. Mellor-Crummey and M. L. Scott. Synchronization Without Contention. In Proceedings of the Fourth International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS IV), pages 269 -- 278, Apr. 1991.


Implementation and Performance of Munin - Carter, Bennett, Zwaenepoel (1991)   (391 citations)

....read misses. 3 Implementation 3.1 Overview: Munin executes a distributed directory-based cache consistency protocol [1] in software, in which each directory entry corresponds to a single object. Munin also implements locks and barriers, using a distributed queue-based synchronization protocol [20, 26]. During compilation, the sharing annotations are read by the Munin preprocessor, and an auxiliary file is created for each input file. These auxiliary files are used by the linker to create a shared data segment and a shared data description table, which are appended to the Munin executable file. ....

....delayed update queue was used by the Myrias SPS multiprocessor [13]. It performed the copy-on-write and diff in hardware, but required a restricted form of parallelism to ensure correctness. Munin's implementation of locks is similar to existing implementations on shared memory multiprocessors [20, 26]. An alternative approach for parallel processing on distributed memory machines is to have the compiler produce a message-passing program starting from a sequential program, annotated by the programmer with data partitions [4, 30]. Given the static nature of compile-time analysis, these ....

John M. Mellor-Crummey and Michael L. Scott. Synchronization without contention. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Systems, pages 269--278, April 1991.


Techniques for Reducing Consistency-Related.. - Carter, Bennett.. (1993)   (59 citations)

....Each Munin node maintains a synchronization object directory, analogous to the data object directory, containing state information for the synchronization data. 3.5.1 Locks: Munin employs a queue-based implementation of locks similar to existing implementations on shared memory multiprocessors [21, 29]. This allows a thread to request ownership of a lock and block awaiting a reply, without repeated queries. The system associates an ownership token and a distributed queue with each lock. A probable owner mechanism like that described above is used to locate the token or end of the queue ....

J.M. Mellor-Crummey and M.L. Scott. Synchronization without contention. In Proceedings of the 4th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 269--278, April 1991.


Efficient Techniques for Fast Nested Barrier.. - Ramakrishnan, Scherson, .. (1995)   (2 citations)

....of message-passing protocols. Since dedicated hardware barrier trees are intrinsically parallel and have very low latency, they are usually an order of magnitude faster than software barriers. There exist numerous algorithms and methods in the literature for improved software barriers, including [1, 5, 7, 11, 12, 10, 17], but these are improvements on a mechanism that is inherently slow. Methods for masking the latency of barriers have also been proposed [3, 4]. These methods hide the synchronization overhead as well as the time spent waiting for other processors to reach the barrier. They depend on being able to ....

J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 269--278, April 1991.


SOFTQOLB: An Ultra-Efficient Synchronization Primitive for.. - Kägi, Goodman

.... that perform better than currently implemented alternatives have been proposed [5, 7, 26]. One such locking primitive is the Queue On Lock Bit (QOLB) synchronization primitive proposed by Goodman, Vernon, and Woest [5]. A study [9] has shown that QOLB performs better (up to 100 ) than MCS [17], an efficient all-software queue-based locking primitive inspired by the QOLB work. However, more efficient locking primitives in general and QOLB in particular are currently not available on multiprocessors because, as described, they require modifications to current-generation commodity ....

....and Mellor-Crummey and Scott proposed pure software solutions to minimize network traffic and synchronization access latencies. Mellor-Crummey and Scott (MCS) implement a queue as a software linked list, and use atomic operations such as SWAP and COMPARE-AND-SWAP to update the list correctly [17]. Anderson presented a scheme that implements a queue as a circular array [2]. Like QOLB, these algorithms also reduce the network traffic to a constant number of traversals per synchronization access (however these schemes require at least six network traversals versus one for QOLB [1] and ....

John M. Mellor-Crummey and Michael L. Scott. Synchronization Without Contention. In Proceedings of the Fourth Symposium on Architectural Support for Programming Languages and Operating Systems, pages 269--278, April 1991.


PROTEUS: A High-Performance Parallel-Architecture.. - Brewer, Dellarocas.. (1991)   (144 citations)

....The data has been normalized to Quinn's data to clarify the error in the Proteus results. .... results [CS90] that were measured on a Supernode multiprocessor [Nic88]. Proteus also reproduced the results published in Synchronization without Contention by Mellor-Crummey and Scott [MCS91]. This paper compared locking algorithms on both a Sequent Symmetry and a BBN Butterfly. In general, any effect that we expected to see has actually appeared. More importantly, all unexpected results have (so far) proven to be real effects rather than inaccuracies introduced by Proteus. For ....

J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), pages 269--278, April 1991.


Evaluating Switch Architectures and Memory Management Policies.. - Bhuyan Iyer

....cache controller detects if a message has arrived out of order and holds it to be serviced later. Such a situation is possible due to the existence of virtual channels. The synchronization method used in our simulations is based on spin locks using test and set operation with exponential backoff [19]. Barriers used in many of the applications were implemented using a shared counter. We also experimented with other types of barriers to reduce contention, however, our experience suggests that the main overhead in synchronization is the contention for the lock itself, not for the shared variable ....
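[Illustrative sketch, not taken from the cited papers.] A barrier built on a single shared counter, as mentioned in this context, is often written with sense reversal so that it can be reused across phases. The quoted text does not say which variant the simulations used, so the sense-reversal detail, the names, and the `threading.Lock` standing in for an atomic increment are all assumptions.

```python
import threading
import time


class CounterBarrier:
    """Centralized barrier: one shared counter plus a sense flag for reuse."""

    def __init__(self, n):
        self.n = n
        self.count = 0
        self.sense = False
        self._atomic = threading.Lock()   # ASSUMPTION: emulates an atomic increment

    def wait(self):
        target = not self.sense           # value the flag takes once this phase completes
        with self._atomic:
            self.count += 1
            last = self.count == self.n
        if last:
            self.count = 0                # reset before releasing the others
            self.sense = target           # flip the flag: everyone may proceed
        else:
            while self.sense != target:   # all waiters spin on the same shared flag
                time.sleep(0)


def run_phases(n=4, phases=5):
    """Hypothetical driver: returns how many times a thread passed the barrier early."""
    barrier = CounterBarrier(n)
    arrived = [0] * phases
    guard = threading.Lock()
    violations = []

    def worker():
        for p in range(phases):
            with guard:
                arrived[p] += 1
            barrier.wait()
            if arrived[p] != n:           # someone got through before all had arrived
                violations.append(p)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(violations)
```

Every waiter polls the same counter and flag, which is the source of the contention the quoted authors mention; tree and tournament barriers spread that load across many locations.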

J. M. Mellor-Crummey and M. L. Scott, "Synchronization Without Contention," In Proceedings of Fourth Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 269--278, Santa Clara, CA, April 1991, ACM.


Relaxed Consistency and Synchronization in Parallel Processors - Zucker (1992)   (3 citations)

....and other programs running on other processors. In the worst case it can take O(n^2) time for the system to stabilize [10]. Anderson [10] considers a number of ways to implement locks that do not exhibit under high loads the deleterious behavior of Test&Test&Set, as do Mellor-Crummey and Scott [76, 77]. However, I will review queuing locks from [56], which also deal with this problem, since I implement them in my simulator (see Section 2.3). In a simulation environment queuing locks will provide equivalent performance to MCS locks [76] (which are implemented in a very similar fashion) and are ....

....case of low lock contention it should have little impact, and in the case of high contention the superiority of the algorithm should be more important. Comparison of the lock algorithms I have reviewed and others has been done in Anderson [10], Graunke and Thakkar [56], and Mellor-Crummey and Scott [77]. However, those studies used artificial benchmarks, sometimes with artificially high levels of lock contention. The questions then are twofold: what are the locking patterns of real parallel programs, and how do more advanced locking algorithms work in the case of ....

John M. Mellor-Crummey and Michael L. Scott. Synchronization without contention. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 269--278, April 1991.


Execution Based Evaluation of Multistage Interconnection.. - Akhilesh Kumar

....[Figure 1: The system architecture; panels include the node configuration and the network interface, and (c) a wormhole router with virtual channels.] and wormhole routing. Also, we implement a directory-based cache coherence protocol [4] in our simulator and realistic synchronization techniques [5]. Virtual channels [1] are used in wormhole networks to avoid deadlocks and to improve link utilization and network throughput. In a recent paper [6] we evaluated virtual channels in 2-D torus wormhole networks and showed that they improve performance considerably. The following section presents ....

....the number of nodes. A write-invalidation protocol has been implemented. The protocol has been modified to make it work in a network that does not guarantee in-order delivery of messages. The synchronization method is based on spin locks using test-test-and-set operation with exponential backoff [5]. Barriers were implemented using a shared counter.

Packet switching
Packet buffers   FFT     FWA      LU      MATMUL   MP3D
1                599.6   3437.1   329.1   2022.6   794.5
2                275.7   1463.2   186.8   574.5    371.6
4                266.3   1433.9   185.9   555.1    352.6

Wormhole routing
Flit buffers     FFT     FWA      LU      MATMUL   MP3D
1                136.6   539.5    71.0    226.6    308.3
2                ....

J. M. Mellor-Crummey and M. L. Scott, "Synchronization Without Contention," In Proceedings of Fourth Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 269--278, Santa Clara, CA, April 1991, ACM.


An Evaluation of Fine-Grain Producer-Initiated.. - Abdel-Shafi, Hall.. (1997)   (11 citations)

....systems), our results are likely to be conservative for the improvements achievable using prefetching and remote writes in future systems. 3.2 Kernels and Applications Used: We use two synchronization kernels and five applications for our experiments. The two kernels are MCS locks [13] and tree barriers. We chose these kernels because they are of fundamental importance to shared-memory applications, and are well suited to producer-initiated communication. The five applications are Radix from the SPLASH-2 suite [26], Water and MP3D (without locking for cells) from the SPLASH ....

....to cache conflicts. There is also remaining overhead due to load imbalance, but such overhead is usually not directly targeted by memory system optimizations like remote writes or prefetches. 4.2 Synchronization Kernels. MCS locks: An MCS lock is efficient under moderate to high lock contention [13]. Processors needing a lock form a queue and spin locally on different variables. The processor holding the lock releases it by writing directly to the variable of the next processor in the queue, using WriteSend in the RW version. If no processors are waiting, then a global lock variable is ....

J. M. Mellor-Crummey and M. L. Scott. Synchronization Without Contention. In ASPLOS-IV, 1991.


Efficient Distributed Shared Memory Based On Multi-Protocol.. - Carter (1993)   (45 citations)

.... data item [BHJL86, DCM 90] Conventional shared memory implementations for synchronization operations such as locking can lead to high overheads in multiprocessor cache systems [ALL89] leading researchers to develop more efficient algorithms for synchronization operations [GVW89, HFM88, MCS91] Given the high overhead associated with sending even one extra message in a software DSM system, it is of paramount importance to provide a synchronization package separate from the shared memory implementation. Doing so allows us to optimize the mechanisms used to handle synchronization ....

.... in Ivy [Li86] or force system designers to augment their shared memory system with complex heuristics to support spinlocks as was done in Mether [MF90]. Locks: Munin employs a queue based implementation of locks similar to existing implementations on shared memory multiprocessors [GVW89, MCS91]. Using queue based locks allows a thread to request ownership of a lock and then block awaiting a reply without repeated queries. A distributed queue identifies the user threads waiting for each lock, wherein each enqueued thread knows only the identity of the thread that follows it on the ....

[Article contains additional citation context not shown here]

J.M. Mellor-Crummey and M.L. Scott. Synchronization without contention. In Proceedings of the 4th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 269--278, April 1991.


An Analysis of the Interactions of Overhead-Reducing .. - Kagi, Aboulenein.. (1995)   (Correct)

....simulations both for current technology and technology that we anticipate will be available five years hence. We find that QOLB (of which this study performs the first detailed simulations) shows a large and consistent improvement, much larger than that predicted by Mellor-Crummey and Scott [19]. The relaxation of memory ordering constraints also provides a consistent performance improvement. In accordance with prior results, we show that a more aggressive memory model produces more substantial performance improvements. The optimization for two-node sharing shows mixed results, ....

....until the processor finds it unlocked. On a multiprocessor, these repeated accesses often translate directly into network traffic that leads to heavy network contention and potentially severe performance degradation. We therefore compare two more advanced primitives, QOLB [25] and MCS locks [19]. The second class of optimizations that we examine consists of a range of memory models. Programmers naturally assume a memory model formally called sequential consistency, defined by Lamport [17]. The strict ordering of sequential consistency severely limits concurrency of memory ....

[Article contains additional citation context not shown here]

John M. Mellor-Crummey and Michael L. Scott. Synchronization Without Contention. In Proc. of the Fourth Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 269--278, April 1991.


Techniques for Reducing Consistency-Related.. - Carter, Bennett.. (1993)   (59 citations)  (Correct)

....of Munin's synchronization primitives cause their invoking thread to block on an acquire and cause the local delayed update queue to be purged on a release. 3.5.1 Locks: Munin employs a queue based implementation of locks similar to existing implementations on shared memory multiprocessors [30, 45]. This allows a thread to request ownership of a lock and block awaiting a reply, without repeated queries. The system associates an ownership token and a distributed queue with each lock. A probable owner mechanism like that described above is used to locate the token or the end of the queue ....

J.M. Mellor-Crummey and M.L. Scott. Synchronization without contention. In Proceedings of the 4th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 269--278, April 1991.


Impact of Switch Design on the Application Performance of.. - Bhuyan Wang (1998)   (2 citations)  (Correct)

....on a write operation. We have modified the coherence protocol where the cache controllers detect if a message has arrived out of order and holds it to be serviced later. The synchronization method used in our simulations is based on spin locks using test-and-set operation with exponential backoff [10]. Barriers used in many of the applications were implemented using a shared counter. We also experimented with other types of barriers to reduce contention, however, our experience suggests that the main overhead in synchronization is the contention for the lock itself, not for the shared variable ....

J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. Proceedings of Fourth Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, pages 269--278, April 1991.


Implementing Data-Parallel Software on Dataflow Hardware - Shaw (1993)   (2 citations)  (Correct)

....scales with the log of the size of the machine. If the machine is a bus based machine like the Sequent, the counter based implementation may be the best, since all remote memory requests are serialized by the bus. To be fair, these are not necessarily the best implementations for these machines. [38] describes several different algorithms for barrier synchronization implemented on the BBN Butterfly and a Sequent Symmetry. The best implementation (a dissemination barrier) on the Butterfly required about 110 microseconds on 32 processors, and the best implementation on the Sequent required ....

John M. Mellor-Crummey and Michael L. Scott. Synchronization Without Contention. In Proceedings of the Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, pages 269--278, April 1991.
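
The dissemination barrier mentioned in the excerpt above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation measured on the Butterfly: the names DisseminationBarrier and run_barrier_demo are hypothetical, and monotonically increasing per-round counters are used in place of the sense-reversing flags a hardware-oriented version would typically use.

```python
import threading
import time


class DisseminationBarrier:
    """In round k, thread i signals thread (i + 2**k) mod n and then waits
    for the signal from thread (i - 2**k) mod n."""

    def __init__(self, n):
        self.n = n
        # ceil(log2(n)) rounds, computed exactly with integer arithmetic.
        self.rounds = (n - 1).bit_length()
        # flags[i][k] counts signals thread i has received in round k.
        # Each counter has exactly one writer, so no atomic update is needed.
        self.flags = [[0] * self.rounds for _ in range(n)]
        self.episode = [0] * n  # barrier episodes completed per thread

    def wait(self, tid):
        self.episode[tid] += 1
        episode = self.episode[tid]
        for k in range(self.rounds):
            partner = (tid + (1 << k)) % self.n
            self.flags[partner][k] += 1  # notify the partner for this round
            while self.flags[tid][k] < episode:  # spin on our own counter
                time.sleep(0)


def run_barrier_demo(num_threads=5, phases=20):
    """Return how many times a thread left the barrier too early."""
    barrier = DisseminationBarrier(num_threads)
    arrived = [0] * num_threads
    violations = [0] * num_threads

    def worker(tid):
        for phase in range(1, phases + 1):
            arrived[tid] = phase  # record arrival before waiting
            barrier.wait(tid)
            # Every thread must have arrived at this phase by now.
            if any(a < phase for a in arrived):
                violations[tid] += 1

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(violations)
```

Each thread performs about log2(n) notifications per episode and spins only on counters that it alone reads, which is why the cost grows with the logarithm of the machine size rather than with a single shared counter.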


Evaluating Virtual Channels for Cache-Coherent Shared-Memory.. - Akhilesh Kumar (1996)   (4 citations)  (Correct)

....cache controllers detect if a message has arrived out of order and holds it to be serviced later. The scheme is similar to the scheme used in MIT Alewife system [16]. The synchronization method used in our simulations is based on spinlocks using test-and-test-and-set operation with exponential backoff [17]. Barriers used in many of the applications were implemented using a shared counter. 2.3 Simulation parameters The system parameters used in the simulation are listed in Table 1. We simulated an 8 × 8 torus network with 8KB of cache and 32KB of memory per node. A small cache size was ....

J. M. Mellor-Crummey and M. L. Scott, "Synchronization Without Contention," In Proceedings of ASPLOS IV, pp. 269--278, April 1991.
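
A test-and-test-and-set spinlock with exponential backoff, as named in the excerpt above, might look roughly like this. It is a hedged sketch rather than the simulator's code: the class TTASBackoffLock, the function run_ttas_demo, and the delay bounds are all hypothetical, and the atomic test-and-set instruction is simulated with a guard lock.

```python
import random
import threading
import time


class TTASBackoffLock:
    def __init__(self, min_delay=1e-6, max_delay=1e-3):
        self._flag = False
        self._guard = threading.Lock()  # stands in for atomic test-and-set
        self.min_delay = min_delay  # assumed bounds, chosen for illustration
        self.max_delay = max_delay

    def _test_and_set(self):
        # Atomically set the flag and return its previous value.
        with self._guard:
            old = self._flag
            self._flag = True
            return old

    def acquire(self):
        delay = self.min_delay
        while True:
            # "Test" phase: read-only spin until the lock looks free.
            while self._flag:
                time.sleep(0)
            # "Test-and-set" phase: try to claim it.
            if not self._test_and_set():
                return
            # Lost the race: back off for a random time, then double the bound.
            time.sleep(random.uniform(0, delay))
            delay = min(2 * delay, self.max_delay)

    def release(self):
        self._flag = False


def run_ttas_demo(num_threads=4, iterations=100):
    """Increment a shared counter under the lock and return its final value."""
    lock = TTASBackoffLock()
    counter = [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            counter[0] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Unlike a queue-based lock, every waiter here still polls the same variable; the read-only test and the growing backoff only reduce how often the contended write is attempted.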


Optimistic Synchronization in Distributed Shared Memory - Hermannsson, Wittie (1994)   (2 citations)  (Correct)

....section. 1.3 Synchronization Hardware primitives for repeated lock tests such as Test-and-set[3], Test-and-test-and-set[17], and their extensions[1] evolved on shared memory multiprocessors. In distributed systems repeatedly testing locks produces too much network traffic. Queue based locks[9, 10, 1, 5, 14, 2, 16] are alternatives. A lock request is sent to a lock owner. If the lock is free, permission is granted. If it is busy, the request is queued. When the lock becomes available, the next process in the queue gets permission. Lock queues can be supported in hardware[9, 5] or software[10, 1, 14, 2, 16] ....

....10, 1, 5, 14, 2, 16] are alternatives. A lock request is sent to a lock owner. If the lock is free, permission is granted. If it is busy, the request is queued. When the lock becomes available, the next process in the queue gets permission. Lock queues can be supported in hardware[9, 5] or software[10, 1, 14, 2, 16]. Queue based locks are needed in distributed memory systems, even those with local lock copies, to lessen network traffic after lock release. When moving from multiprocessors connected by busses to multicomputers connected by networks, locating the lock owner becomes an issue. Distributed ....

J.M. Mellor-Crummey and M.L. Scott. Synchronization Without Contention. 4th Int. Conf. on ASPLOS, 269--278, April 1991.


Performance Evaluation of Hierarchical Ring-Based Shared.. - Mark Holliday (1992)   (12 citations)  (Correct)

....spots locations. Significant progress has been made in reducing hot spot traffic, especially hot spot traffic due to synchronization. Techniques include separate synchronization networks (possibly with combining) [18] and hot spot free software algorithms that use distributed data structures [20]. Furthermore, flow control mechanisms may be useful, especially when hot senders (processors with usually high request rates or a high favoritism to the hot spot) are a factor. Evaluating alternative techniques for reducing hot spot traffic is outside the scope of this paper. Instead, we ....

....If the memory queues are of inadequate length, significantly lower favorite memory probabilities can cause instability. The techniques proposed in the synchronization literature (such as separate synchronization networks with combining or software algorithms using distributed data structures [20, 18]) may well reduce the likelihood of hot spots. The simulated system does provide flow control in that the number of cycles before a source processing module submits a retry is a function of the number of retries previously sent [27] More sophisticated flow control mechanisms could be considered. ....

J.M. Mellor-Crummey and M.L. Scott. Synchronization without contention. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 269--278, Santa Clara, CA, April 1991.


Hierarchical Message Stability Tracking Protocols - Guo, van Renesse, Vogels.. (1997)   (3 citations)  (Correct)

....in this section, we will show that the three basic protocols have their limitation in scalability. The most obvious way to improve scalability is to use hierarchy. Tree structures have been used in reliable multicast protocols in distributed systems [9, 14] and barrier synchronization algorithms [13] in parallel systems. To improve scalability significantly, we derive two structured stability tracking protocols by adding a spanning tree structure to the basic protocols. They are dubbed S CoordP and S Train since they are derived from CoordP and Train respectively. As we will see later, S ....

J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April 1991.


ASPEN: High-Performance Hardware Support for Distributed.. - Maxham (1994)   (Correct)

....lock mechanism in [25]. Queueing locks allow first in, first out access to a given memory location without the costly test and set spinning of traditional locks. An analysis of various software synchronization primitives for shared memory multiprocessors is given by Mellor-Crummey and Scott in [33]. 3.1.4 Communication The Aspen daughterboard differs from high speed network interfaces such as SHRIMP [8], which provide simple inter node write propagation, by fully implementing shared memory. Both Aspen and SHRIMP like interfaces snoop on their host's bus and use tags to determine whether to ....

J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. In Proceedings of the 4th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 269--278. ACM, April 1991.


Toward The Design Of Large-Scale, Shared-Memory Multiprocessors - Scott (1992)   (3 citations)  Self-citation (Scott)   (Correct)

....Cube Networks Contention for data and synchronization objects is a potential problem that must be addressed when designing very large systems. Synchronization can be handled in a variety of manners, both in hardware and in software. Examples include software combining [Yew87], software queueing [Mell91], and hardware queueing [Good89a]. Although synchronization issues are important, I do not address them in this thesis. I do address the problem of read contention, however, as it is closely related to the cache coherence mechanism. There are several situations in which concurrent read requests to ....

....caused by synchronization objects, it would be better to avoid the contention altogether. Mechanisms that support locks using software or hardware queueing can prevent lock contention, and can be used as primitives to support contention free barriers and other synchronization operations as well [Mell91, Good89a]. In addition, the QOLB hardware synchronization primitive [Good89a] can improve the performance of pruning cache directories by removing interference from migratory data. QOLB automatically migrates lock protected data from one cache to another, allowing the data to remain in globally modified ....

Mellor-Crummey, J. M. and M. L. Scott, Synchronization Without Contention, Proc. ASPLOS IV, April 1991, 269-278.


Fine-Grain Producer-Initiated Communication in Cache-Coherent.. - Abdel-Shafi (1997)   Self-citation (Mellor-crummey)   (Correct)

....is important for applications with high and/or bursty bandwidth requirements (e.g. Radix from the SPLASH 2 suite [WOT 95]). 3.3 Kernels and Applications Used We use two synchronization kernels and five applications for our experiments. The two kernels are MCS locks and tree barriers [MCS91a]. We chose these kernels because they are of fundamental importance to shared memory applications, and are well suited to producer initiated communication. The five applications are Radix from the SPLASH 2 suite [WOT 95], Water and MP3D (without locking for cells) from the SPLASH suite ....

....to cache conflicts. There is also remaining overhead due to load imbalance, but such overhead is usually not directly targeted by memory system optimizations like remote writes or prefetches. 4.2 Synchronization Kernels MCS locks: An MCS lock is efficient under moderate to high lock contention [MCS91a]. Processors needing a lock form a queue and spin locally on different variables. The processor holding the lock releases it by writing directly to the variable of the next processor in the queue. The passing of the lock to the next processor can be performed using a WriteSend operation. Using ....

[Article contains additional citation context not shown here]

J. M. Mellor-Crummey and M. L. Scott. Synchronization Without Contention. In Proceedings Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.


Scheduler-Conscious Synchronization - Kontothanassis, Wisniewski, Scott (1994)   (19 citations)  Self-citation (Scott)   (Correct)

.... (which guarantee that no process continues past a given point in a computation until all other processes have reached that point) Of particular interest in recent years have been scalable synchronization algorithms, which employ backoff or distributed data structures to minimize contention [1, 9, 11, 19, 20, 25, 26, 30, 31, 32, 38, 44, 45]. The purpose of backoff is to reduce the frequency with which spinning processes access a common synchronization variable. The purpose of distributed data structures is to allow each process to spin on a separate, locally accessible variable. Unfortunately, busy waiting in user level code tends ....

....running time of O(p) for the barrier algorithm, which becomes unacceptable as the number of processors p grows large. Several researchers have shown how to solve these problems by building scalable barriers, with log depth tree or FFT like patterns of point to point notifications among processes [3, 11, 20, 25, 30, 32, 38, 45]. Unfortunately, the deterministic notification patterns of scalable barriers may require that processes run in a different order from the one chosen by the scheduler. The problem is related to, but more severe than, the preemption while waiting problem in FIFO locks. With a lock the scheduler may ....

J. M. Mellor-Crummey and M. L. Scott. Synchronization Without Contention. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 269--278, Santa Clara, CA, April 1991.


Fast, Contention-Free Combining Tree Barriers for.. - Scott, Mellor-Crummey (1994)   Self-citation (Mellor-crummey Scott)   (Correct)

No context found.

J. M. Mellor-Crummey and M. L. Scott, "Synchronization Without Contention," Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991, pp. 269-278.


Identification And Optimization Of Sharing Patterns For Scalable.. - Kaxiras (1998)   (4 citations)  (Correct)

No context found.

John M. Mellor-Crummey, Michael L. Scott, "Synchronization Without Contention." In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV). April 8, 1991.

