| J. R. Goodman, M. K. Vernon, P. J. Woest, "A Set of Efficient Synchronization Primitives for a Large-Scale Shared-Memory Multiprocessor." In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS III), April 1989. |
....data. In this thesis, I propose optimizations derived from observing instructions, i.e. the behavior of the program itself. I will briefly discuss static and, at greater length, dynamic instruction-based optimizations for migratory sharing. 5.2.1 QOLB. Using the QOLB synchronization primitive [35] can be considered a static instruction-based scheme for migratory sharing. QOLB provides both synchronization for the critical section that protects the migratory data and automatic transfer of the data, i.e. QOLB is a migratory optimization on its own. However, data must be collocated with the ....
J. R. Goodman, M. K. Vernon, P. J. Woest, "A Set of Efficient Synchronization Primitives for a Large-Scale Shared-Memory Multiprocessor." In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS III), April 1989.
.... In practice, the performance of many shared memory algorithms is often limited by conflicts at certain widely shared memory locations, often called hot spots [30]. Reducing hot spot conflicts has been the focus of hardware architecture design [15, 16, 22, 29] and experimental work in software [5, 13, 14, 25, 27]. Counting networks are also non-blocking: processes that undergo halting failures or delays while using a counting network do not prevent other processes from making progress. This property is important because existing shared memory architectures are .... (Footnote 1: One can implement a balancer using a ....)
J. Goodman, M. Vernon, and P. Woest. A set of efficient synchronization primitives for a large-scale shared-memory multiprocessor. In 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, April 1989.
....short critical section. .... shared memory algorithms is often limited by conflicts at certain widely shared memory locations, often called hot spots [30]. Reducing hot spot conflicts has been the focus of hardware architecture design [15, 16, 22, 29] and experimental work in software [5, 13, 14, 25, 27]. Counting networks are also non-blocking: processes that undergo halting failures or delays while using a counting network do not prevent other processes from making progress. This property is important because existing shared memory architectures are themselves inherently asynchronous; process ....
J. Goodman, M. Vernon, and P. Woest. A set of efficient synchronization primitives for a large-scale shared-memory multiprocessor. In 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, April 1989.
....limitations on processor-to-memory bandwidth, performance suffers when too many processes attempt to access the same memory location at the same time. Such hot spot contention is well documented, and has been the subject of extensive research both in hardware [2, 11, 12, 20, 29] and in software [3, 9, 10, 27, 32]. • Latency: The time needed to choose a value is strongly affected by the number of variables a process must access. We will show that (not surprisingly) there is an inherent (inverse) relationship between the maximum contention at a variable and the number of variables accessed. • Waiting: ....
J. Goodman, M. Vernon, and P. Woest. A set of efficient synchronization primitives for a large-scale shared-memory multiprocessor. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, April 1989.
....efficient and flexible than the semaphore register method. A simpler multiprocessor, in terms of both hardware and system software, can be built upon the lock table mechanism. 10.2.3.4.2. Queue on SyncBit (QOSB). QOSB is a synchronization primitive that can be used to implement critical sections [31]. The primitive enqueues a processor on a lock variable (SyncBit), which is part of a cache line, to enforce a first-come, first-served discipline for the lock variable (and its associated critical section). The primitive was originally proposed for a large-scale, cache-coherent shared memory ....
J. R. Goodman, M. K. Vernon, and P. J. Woest, "A Set of Efficient Synchronization Primitives for a Large-Scale Shared-Memory Multiprocessor," in Proc. ASPLOS-III, Boston, MA, April 1989.
....of the tree. The nodes of a combining tree are realized by the implicit storage in the interconnect nodes, i.e. wait buffers. Another alternative is to use explicit storage in memory to construct the combining tree. This method of request combining, called software combining in the literature [5, 31, 32], is classified as PPC since the processors bear full responsibility for the combining of requests: the processors establish the combining set and distribute the results, and there are no demands on the network at all. In software combining, one shared location is divided into L locations which ....
....However, the L locations (nodes of the combining tree) must be distributed across the memory modules in order to alleviate excessive contention for a single memory module. Yew, Tzeng, and Lawrie show how software combining can be used for barrier operations [32]. Goodman, Vernon and Woest [5] and Johnson [12] extend the work of Yew, Tzeng, and Lawrie to carry out arbitrary Fetch&Φ operations with a software combining tree. Tang and Yew also provide several algorithms for traversing a combining tree where the type of memory access determines which algorithm is chosen (e.g. barrier ....
Goodman, J. R., Vernon, M. K., and Woest, P. J., "A Set of Efficient Synchronization Primitives for a Large-Scale Shared-Memory Multiprocessor," in Proceedings ASPLOS-III, Boston, MA, pp. 64-73, April 1989.
....limitations on processor-to-memory bandwidth, performance suffers when too many processors attempt to access the same memory location at the same time. Such hot spot contention is well documented, and has been the subject of extensive research both in hardware [2, 10, 11, 16, 22] and in software [3, 8, 9, 20, 25]. • Latency: The time needed to choose a value is strongly affected by the number of variables a processor must access. We will show that (not surprisingly) there is an inherent (inverse) relationship between the maximum contention at a variable and the number of variables accessed. • ....
J. Goodman, M. Vernon, and P. Woest. A set of efficient synchronization primitives for a large-scale shared-memory multiprocessor. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, April 1989.
.... In practice, the performance of many shared memory algorithms is often limited by conflicts at certain widely shared memory locations, often called hot spots [19]. Reducing hot spot conflicts has been the focus of hardware architecture design [1, 8, 12, 14, 11] and experimental work in software [3, 9, 10, 16, 20]. Counting networks are also non-blocking: processes that undergo halting failures or delays while using a counting network do not prevent other processes from making progress. This property is important because existing shared memory architectures are themselves inherently asynchronous; process ....
J. Goodman, M. Vernon, and P. Woest. A set of efficient synchronization primitives for a large-scale shared-memory multiprocessor. In 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, April 1989.
....make the shared data local to the processor where the process is executing. As the granularity of sharing becomes finer, these delays may dominate the time to complete a task. One mechanism that shows promise for reducing and/or eliminating all three of these latencies is the QOLB primitive [GVW89]. In this paper we explore hardware support for locks, critical sections and data exchange using QOLB. Even though software solutions exist for this type of synchronization [MCS91, And90], one of the main points of this paper is that there is a significant benefit in having such hardware support ....
....nodes. Due to the similarity between QOLB and the SCI implementation of cache coherence, it is natural to extend that implementation to include QOLB, and such an extension is provided as an option to the base SCI protocol. This paper also discusses and analyzes this extension. Previous work [GVW89] has described the use of QOLB primarily for eliminating contention over the interconnect. The present work discusses QOLB's ability to reduce memory latency by (1) making synchronization common operations more efficient through the elimination of most traversals of the interconnect and (2) by ....
[Article contains additional citation context not shown here]
J. R. Goodman, M. K. Vernon, and P. J. Woest. "A Set of Efficient Synchronization Primitives for a Large-Scale Shared-Memory Multiprocessor". In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS III). ACM, April 1989.
....as Fetch&Φ operations [GGKM83, BrMW85, ZhYe87]. Unfortunately, most synchronization algorithms which rely on these primitives require spinning over the interconnect or, as in the case of using Test-and-Test-and-Set to acquire a lock, cause excessive bus traffic. The QOLB primitive (see footnote 2) was proposed [GoVW89] to alleviate these problems, as well as to provide additional benefits by reducing memory latency. It is a shared-memory operation that adds a processor to a hardware queue of waiters for a line. QOLB allows a process to spin on a locally cached shadow copy of a line. When the actual line is ....
....maintaining QOLB queues is discussed in the section on performance. The QOLB mechanism allows for efficient process synchronization by providing a direct implementation of binary semaphores. (Footnote 2: QOLB (Queue On Lock Bit) is pronounced "Colby". It was originally called QOSB in [GoVW89] but was changed to be more precise.) As a non-blocking operation, QOLB can prefetch (i.e. make local) a line of data while a process performs useful work. Combining these two operations along with a simple software convention, QOLB becomes a synchronizing prefetch operation. ....
[Article contains additional citation context not shown here]
Goodman, J. R., M. K. Vernon, and P. J. Woest, "A Set of Efficient Synchronization Primitives for a Large-Scale Shared-Memory Multiprocessor," Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS III), April 1989, pp. 64-75.
....5. Comparison with Previous Approaches. Prefetching techniques have been the principal approach advocated in the literature for reducing read latencies (i.e. for overlapping communication and computation) in large-scale SM systems. Synchronized prefetching techniques have been proposed [Good89] for .... [Table 1: State Transitions for MP Requests in the SM MP Coherence Protocol. Columns: Request, Cache Block State, Next State, Action.] ....
Goodman, J. R., M. K. Vernon and P. J. Woest, "A Set of Efficient Synchronization Primitives for a Large-Scale Shared-Memory Multiprocessor", Proc. 3rd Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems, Boston, pp. 64-75, April 1989.
No context found.
J. Goodman, M. Vernon, and P. Woest, "A Set of Efficient Synchronization Primitives for a Large-Scale Shared-Memory Multiprocessor," International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 64-73, 1989.
CiteSeer.IST - Copyright Penn State and NEC