Andreas Nowatzyk, Gunes Aybay, Michael Browne, Edmund Kelly, David Lee, and Michael Parkin. The S3.mp Scalable Shared Memory Multiprocessor. Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences - Vol. I: Architecture, January, 1994, pp. 144-153.
....enabled by these enhancements serves to increase the overall throughput of CCs. To further enhance CC throughput, these three optimization techniques can be combined with previously proposed CC optimizations such as multiple protocol engines, pipelining, and split request response streams [1, 2, 8, 9, 11, 13, 14, 15]. Multiple protocol engines replicate the core processing engine of a CC to increase concurrency. Hardwired CCs are pipelined to increase their bandwidth. Split request response streams allocate dedicated hardware resources in a CC to process a request and a response in parallel. We evaluate ....
....protocol. This unit consists of protocol handlers that are invoked by the PE to process specific coherence transactions. 2.3 Previously Proposed Optimizations Prior studies proposed optimized CC designs that included multiple PEs per CC, pipelined PEs, and split request response streams [1, 2, 8, 9, 11, 13, 14, 15]. In this section, we briefly describe these CC organizations. 2.3.1 Multiple Protocol Engines The multiple PE optimization enhances the baseline CC architecture of Figure 2(a) to include multiple PEs (Figure 2(b)). Each PE handles protocol transactions directed to a different set of memory line ....
[Article contains additional citation context not shown here]
A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, B. Radke, and S. Vishin. The S3.mp Scalable Shared Memory Multiprocessor. In Proceedings of 1995.
....each set of L2 keeps an Overflow Counter with the number of lines mapped to this set that are currently in the overflow area (Figure 3(a)). If this counter is zero, the Table is not accessed and the miss proceeds normally. We organize the Table as a set-associative cache where, like in S3.mp [16], the tag array information of each set is stored in an additional line. Specifically, Figure 3(c) shows one set of a 3-way set-associative Table. The first line contains the address tags, access bits, and other information for the three lines that currently reside in the set. To be able to ....
A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, B. Radke, and S. Vishin. The S3.mp Scalable Shared Memory Multiprocessor. In Proceedings of the
....by two decoupled buses. The node controller sends messages to, and receives messages from, the network, and also handles all SLC misses from the local CPU. The node configuration is similar to a number of research and commercial systems, including Stanford's FLASH [24], Sun's S3.mp prototype [30], and the SGI Origin2000 [25]. A large direct-mapped cache is used for the SLC. Direct-mapped caches have more conflict misses due to their lack of associativity. However, the performance of direct-mapped caches for hits is better than set-associative caches because of the simpler hardware ....
.... node controller is tied up with one action and cannot perform another [19]. It is possible to reduce the occupancy for node controllers by introducing the ability to overlap the processing of requests, e.g. by pipelining [27] or by splitting the processing of local CPU misses and remote requests [30]. However, these techniques only reduce (rather than avoid) occupancy, which will still be a problem when there is contention for a node controller. The types of access to data that occur in shared memory applications have an effect on .... TABLE IV: Invalidation profiles (from the Splash 2 results ....
Andreas Nowatzyk, Gunes Aybay, Michael Browne, Edmund Kelly, Michael Parkin, Bill Radke, and Sanjay Vishin. The S3.mp scalable shared memory multiprocessor. In the International Conference on Parallel Processing, Vol. 1, pages 1-10, August 1995.
....is provided (e.g. routing) but we are mainly concerned with endpoint functionality in this paper. More recently, groups have proposed networks that sit in between these two extremes, i.e. they present a reliable interconnect that is flexible enough to be used in a LAN environment (e.g. [21]). Table 1 shows how the basic operations were implemented in some example systems that implement remote write on distributed memory hardware. The left hand columns represent systems that implement a shared memory programming model based on a weak consistency coherency protocol (e.g. [20] ....
Andreas Nowatzyk, Gunes Aybay, Michael Browne, Edmund Kelly, David Lee, and Michael Parkin. The S3.mp Scalable Shared Memory Multiprocessor. In Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences - Vol. I: Architecture, pages 144--153. IEEE, January 1994.
....the late 1980s. To demonstrate their effectiveness, several cache coherent non uniform memory access (CC-NUMA) hardware DSM machines were built in the research community (e.g. DASH [26], Alewife [2], FLASH [22], Typhoon [36]) and commercial machines followed (e.g. SGI Origin 2000 [23], Sun S3.mp [33], Sequent NUMA-Q [29], HP Exemplar [1], Data General Aviion [7]). At the same time, a large research effort produced a set of scientific benchmarks with which to evaluate DSM machines [48]. Most high performance hardware DSM machines have tightly integrated node or memory controllers that connect ....
....are performed. The key insight into solving the coherence problem in active memory systems is that the active memory controller controls both the coherence protocol and the fetching of the requested data by the processor. In architectures like the Stanford FLASH multiprocessor [22] and the S3.mp [33] the coherence protocol itself is programmable or extensible. Thus, it is possible to treat active memory support as an extension of the cache coherence protocol. In this case, the active memory controller can enforce coherence between the original and shadow address spaces. Although active ....
A. Nowatzyk et al. The S3.mp Scalable Shared Memory Multiprocessor. In Proceedings of the 24th International Conference on Parallel Processing, 1995.
....so, the data is provided to the secondary cache. Otherwise, the data is requested from the corresponding node and eventually stored in the remote cache before passing it to the processor. The remote cache is kept coherent at all times. This approach is used in current machines like Sun's S3.mp [6] and Sequent's STiNG [5]. In this paper, we call this machine organization NUMA RC, for non uniform memory access with remote cache. A second approach is to use a Cache Only Memory Architecture (COMA) organization. COMA machines [2, 3, 7, 8, 9] organize the distributed memories as caches called ....
A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, B. Radke, and S. Vishin. The S3.mp Scalable Shared Memory Multiprocessor. In Proceedings of the 1995 International Conference on Parallel Processing, pages I1--I10, August 1995.
....instructions lead to much lower protocol engine latency and occupancy. 2.5.2 Directory Storage The Piranha design supports directory data with virtually no memory space overhead by computing ECC at a coarser granularity and utilizing the unused bits for storing the directory information [31,38]. ECC is computed across 256 bit boundaries (typical is 64 bit) leaving us with 44 bits for directory storage per 64 byte line. Compared to having a dedicated external storage and datapath for directories, this approach leads to lower cost by requiring fewer components and pins, and provides ....
A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, W. Radke, and S. Vishin. The S3.mp Scalable Shared Memory Multiprocessor. In International Conference on Parallel Processing (ICPP'95), pages I.1 - I.10, July 1995.
....et al. [7] rely on the processor caches being write through or the memory controller being able to detect cache state transitions in the second level cache. These systems also do not provide a way for the user to control the protocol implemented by the hardware. 4 The S3.mp multiprocessor system [23] contains programmable micro controllers to implement DSM. This programmable feature is used to support both the CC-NUMA and SCOMA memory models. However, it does not provide a way for the user to select the model; this selection is performed when the machine is booted. There have been several ....
A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, B. Radke, and S. Vishin. The S3.mp scalable shared memory multiprocessor. In Proceedings of the 1995 International Conference on Parallel Processing, 1995.
....queue arbitrations, directory accesses, protocol processing, and state maintenance. The increased parallelism and overlapping of coherence operations offered by these approaches serves to decrease the overall occupancy of coherence controllers. Multiple PEs were employed in the Sun S3.mp [15] architecture. The coherence controller dedicates one protocol engine for handling transactions to local addresses and one for remote addresses. The architects of the Sequent STiNG [8] system also considered a similar approach as one of the ways to reduce controller occupancy. However, the impact ....
A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, B. Radke, and S. Vishin. The S3.mp Scalable Shared Memory Multiprocessor. In Proceedings of 1995 International Conference on Parallel Processing, August 1995.
....resources in a network of workstations for parallel programming. These solutions work reasonably well for certain applications (message passing, coarse grain parallelism) but generally they suffer from their huge software overhead. Recently, several research projects have suggested a new approach [3, 10]. By adding a moderate amount of hardware for a dedicated interconnection between the workstations, communication latencies can be lowered by a factor of 1000 (with respect to PVM on a LAN). Through the use of appropriate protocols (originally developed for large scale multiprocessors) the ....
Nowatzyk, A., Aybay, G., Browne, M., Kelly, E., Parkin, M., Radke, B., Vishin, S.: The S3.mp Scalable Shared Memory Multiprocessor. Proceedings of the 24th International Conference of Parallel Processing, Oconomowoc, WI, USA, August 1995.
....that node if it satisfies a progress criterion. This strategy guarantees forward progress per node, but not per line. However, virtually all modern processors have a bounded instruction issue window. Using this property, and that the protocol actions of a line do not interfere with those of another line [15], one can show that forward progress is guaranteed per each line as well as each remote node. ....
A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, B. Radke, and S. Vishin. The S3.mp scalable shared memory multiprocessor. In Proceedings of the 1995 International Conference on Parallel Processing, 1995.
....controller is not tightly coupled with the processor, the cache must be put into write through rather than write back mode so that stores to memory can be snooped by the network interface; this results in an increase in bus traffic between the cache and main memory. The S3.mp multiprocessor system [21] was developed with the goal of using a hardware supported DSM system in a spatially distributed system connected by a local area network. For the interconnect it used a new CMOS serial link which supported greater than 1 Gbit/s transfer rate. The shared memory hardware system was tightly coupled ....
A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, B. Radke, and S. Vishin. The S3.mp scalable shared memory multiprocessor. In Proceedings of the 1995 International Conference on Parallel Processing, 1995.
....advantages, even in the presence of a protocol processor. We are currently exploring these aspects. A possibility that remains unexplored in this study is to combine tag free local memory, like in SCC NUMA, with associative memory for fine grain replication of remote data, like in SC COMA. S3.mp [16] is an example of a hardware DSM based on this idea. This would seem to perfectly suit the needs of all benchmarks. Pages with strong processor affinity would be placed into the tag free local memory, whereas the other pages would be placed in round robin manner in the associative memory, so that ....
A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, B. Radke and S. Vishin. The S3.mp Scalable Shared Memory Multiprocessor. Proc. of the Int. Parallel Processing Conference, pp. I-1-I-10, 1995.
....the properties of COMA was done by the DICE group [16]. To our knowledge, there is no working prototype of a Flat COMA. The Illinois Aggressive COMA (I-ACOMA) [29] group is currently building one. A variation of the COMA, using a page grain in the allocation policy, is the Simple COMA [23]. S3.mp [20] has provided a testbed for its implementation. A variety of DSM systems have used the main processor for protocol processing. The Alewife project [4] was the first to experiment with software extensions to a NUMA protocol implemented mostly in hardware. The software protocol actions were written ....
A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, B. Radke and S. Vishin. The S3.mp Scalable Shared Memory Multiprocessor. Proc. of the Int. Parallel Processing Conference, pp. I-1-I-10, 1995.
....checking can even be implemented at application level. A drawback of this solution is the relatively slow response time to access violations from the operating system, which adds to the communication delay. Recently, hardware assisted solutions for workstations appeared in the form of plug-in cards [26, 8]. These cards implement an alternate shared address space which can be mapped into the user's address range. A common characteristic of these designs is the use of a proprietary low latency interconnect with a bandwidth in the 1 Gbit/s range. By completely bypassing the operating system the latency ....
Andreas Nowatzyk, Gunes Aybay, Michael Browne, Edmund Kelly, and Michael Parkin. The S3.mp Scalable Shared Memory Multiprocessor. Technical report, Sun Microsystems Computer Corporation, 1994.
....coherent access to memory that is distributed among the nodes of the multiprocessor. In DASH and Alewife, the cache coherence protocol is hardwired in custom hardware finite state machines (FSMs) within the coherence controllers. Instead of hardwiring protocol handlers, the Sun Microsystems S3.mp [8] multiprocessor uses hardware sequencers for modularity in implementing protocol handlers. Subsequent designs for scalable shared memory multiprocessors, such as the Stanford FLASH [4] and the Wisconsin Typhoon machines [11] have touted the use of programmable protocol processors instead of ....
....two protocol FSMs in the HWC implementation. We use the term protocol engine to refer to both the protocol processor in the PPC design and the protocol FSM in the HWC design. For distributing the protocol requests between the two engines, we use a policy similar to that used in the S3.mp system [8], where protocol requests for memory addresses on the local node are handled by one protocol engine (LPE) and protocol requests for memory addresses on remote nodes are handled by the other protocol engine (RPE). Only the LPE needs to access the directory. Figures 4 and 5 show the two engine HWC ....
A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, B. Radke, and S. Vishin. The S3.mp Scalable Shared Memory Multiprocessor. In Proceedings of 1995 International Conference on Parallel Processing, August 1995.
....through the memory controller. For multiprocessor configurations (both off-chip and integrated L2 cases) we assume the coherence controller (CC), network interface controller (NIC), and network routers (NR) are integrated tightly with the memory controller (MC) (e.g. as in the S3.mp design [14]). The tight integration of the coherence controller with the memory controller has several advantages. First, the latency for data in the local or home memory is reduced for protocols that involve the coherence controller in the access path of these cases. Second, the coherence controller can ....
....coherence controller in the access path of these cases. Second, the coherence controller can use part of the memory as its directory store. Furthermore, there are techniques, such as computing ECC across a larger number of bits and utilizing the unused bits for storing the directory information [14, 19], that provide the option of supporting the directory data with virtually no memory space overhead. Given the trend towards larger main memories, dedicated directory storage can become a significant cost factor. The above techniques lead to lower costs by requiring fewer components and pins, and ....
A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, B. Radke, and S. Vishin. The S3.mp Scalable Shared Memory Multiprocessor. In Proceedings of the 24th International Conference on Parallel Processing, August 1995.
....and give pointers to further work. 2 Reactive Proxies The severity of node controller contention is both application and architecture dependent [6]. Controllers can be designed so that there is multi threading of requests (e.g. the Sun S3.mp is able to handle two simultaneous transactions [12]) which slightly alleviates the occupancy problem but does not eliminate it. Some contention is inevitable, and will increase the latency of transactions. The key problem is that queue lengths at controllers, and hence contention, are nonuniformly distributed around the machine. One way of ....
Andreas Nowatzyk, Gunes Aybay, Michael Browne, Edmund Kelly, Michael Parkin, Bill Radke, and Sanjay Vishin. The S3.mp scalable shared memory multiprocessor. In Proceedings of the International Conference on Parallel Processing Vol. 1, pages 1--10, August 1995.
No context found.
Andreas Nowatzyk, Gunes Aybay, Michael Browne, Edmund Kelly, David Lee, and Michael Parkin. The S3.mp Scalable Shared Memory Multiprocessor. Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences - Vol. I: Architecture, January, 1994, pp. 144-153.
No context found.
Nowatzyk, A., et al.: The S3.mp Scalable Shared Memory Multiprocessor. In Proceedings of the 24th International Conference on Parallel Processing, 1995.
No context found.
A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, and M. Parkin. The S3.mp Scalable Shared Memory Multiprocessor. In Proceedings of the International Conference on Parallel Processing, volume I, pages 1--10, Aug. 1995.
No context found.
A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, D. Lee, and M. Parkin. The S3.mp Scalable Shared Memory Multiprocessor. In Proceedings of the 27th Hawaii International Conference on System Sciences. Volume 1: Architecture, pages 144--153, Los Alamitos, CA, USA, January 1994.
No context found.
A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, B. Radke, and S. Vishin. The S3.mp scalable shared memory multiprocessor. In Proceedings of the 1995 International Conference on Parallel Processing, pages I1--I10, August 1995.
No context found.
Andreas Nowatzyk, Gunes Aybay, Michael Browne, Edmund Kelly, Michael Parkin, Bill Radke, and Sanjay Vishin. The S3.mp scalable shared memory multiprocessor. In ICPP, 1995.