Ioannis Schoinas, Babak Falsafi, Mark D. Hill, James R. Larus, Christopher E. Lucas, Shubhendu S. Mukherjee, Steven K. Reinhardt, Eric Schnarr, and David A. Wood. Implementing fine-grain distributed shared memory on commodity SMP workstations. Technical Report 1307, March 1996.
....be lumped to a specific point in time, such as the lazy release consistency (LRC) protocol [20] Fine grain SW DSM systems with a more traditional cacheline sized coherence unit have also been implemented. Here, the access control check is either done by altering the error-correcting codes (ECC) [37] or by in line code snippets (small fragments of machine code) [37], [35] The small cache line size reduces the false sharing for these systems, but the explicit access control check adds extra latency for each load or store operation to global data. The most efficient access check reported to ....
....(LRC) protocol [20] Fine grain SW DSM systems with a more traditional cacheline sized coherence unit have also been implemented. Here, the access control check is either done by altering the error-correcting codes (ECC) [37] or by in line code snippets (small fragments of machine code) [37], [35] The small cache line size reduces the false sharing for these systems, but the explicit access control check adds extra latency for each load or store operation to global data. The most efficient access check reported to date is three extra instructions adding three extra cycles for each ....
[Article contains additional citation context not shown here]
I. Schoinas, B. Falsafi, M. D. Hill, J. R. Larus, C. E. Lucas, S. S. Mukherjee, S. K. Reinhardt, E. Schnarr, and D. A. Wood. Implementing Fine-Grain Distributed Shared Memory On Commodity SMP Workstations. Technical Report 1307, Computer Sciences Department, University of Wisconsin--Madison, March 1996.
....concerns of cluster based parallel computing, several software DSM systems have been built that do not rely on specialized hardware to provide programmers with shared memory. These systems include Ivy [28] TreadMarks [21] Munin [4] Brazos [43] CRL [18] MGS [49] CVM [20] Blizzard S [40], Shasta [39] Cashmere 2L [46] and SoftFLASH [9] The underlying principle in these machines is to leverage commodity parts particularly the use of commodity processors, node boards, networks, and operating systems to build a scalable DSM machine. Most software DSM systems rely on trapping ....
I. Schoinas, B. Falsafi, M. D. Hill, J. R. Larus, C. E. Lukas, S. S. Mukherjee, S. K. Reinhardt, E. Schnarr, and D. A. Wood. Implementing Fine-Grain Distributed Shared Memory on Commodity SMP Workstations. Technical Report 1307, University of Wisconsin Computer Sciences, March 1996.
....and the state table to be shared. 1 Servicing a request to the home by any processor on a node further requires sharing the directory state. A degenerate form of the above optimization involves dedicating one or more processors on the SMP for message or protocol processing only (as in Typhoon 0 [13]) Our current implementation exploits all of the above optimizations except eliminating local messages when the requester and home are colocated and load balancing the service of incoming requests. These optimizations are not 1 In some protocols, the owner processor must always be consulted on ....
....a relaxed memory model (e.g. Alpha, PowerPC, and Sparc) Again, the need for exact labeling sacrifices transparency. A few software or hybrid hardware software DSM systems have explored dedicating the second processor on a dual processor SMP node for protocol and message handling (e.g. Typhoon 0 [13], Home Based LRC [19]) These systems do not exploit any of the intra node data sharing and clustering benefits of SMP nodes. Furthermore, any speedup numbers reported for P processors must be qualified by the fact that the system actually uses 2P general purpose processors to achieve that ....
I. Schoinas, B. Falsafi, M. D. Hill, J. R. Larus, C. E. Lukas, S. S. Mukherjee, S. K. Reinhardt, E. Schnarr, and D. A. Wood. Implementing Fine-Grain Distributed Shared Memory on Commodity SMP Workstations. Technical Report 1307, University of Wisconsin Computer Sciences, Mar. 1996.
....many NIs per node. 6 Related Work Related work in network interface support for SVM has discussed how NIs can be used for several purposes: Fast communication to improve the performance of traditional send and receive communication. This type of support has been exploited in many SVM projects [18, 26, 33, 51, 46, 41, 40, 42] and is also used in our base system, HLRC SMP [40] Protocol processing in the network interface. This choice lies at the other end of the spectrum. The network interface can be used not only to avoid interrupting the compute processor but also to perform full blown protocol processing, ....
I. Schoinas, B. Falsafi, M. D. Hill, J. R. Larus, C. E. Lucas, S. S. Mukherjee, S. K. Reinhardt, E. Schnarr, and D. A. Wood. Implementing fine-grain distributed shared memory on commodity SMP workstations. Technical Report 1307, University of Wisconsin-Madison, Mar. 1996.
....application data and the state table to be shared. Servicing a request to the home by any processor on a node further requires sharing the directory state. A degenerate form of the above optimization involves dedicating one or more processors on the SMP for message processing only (as in Typhoon 0 [14]) The results in Section 4 are for an implementation that exploits all of the above optimizations except eliminating local messages when the requester and home are colocated and load balancing the handling of incoming messages. We implemented these latter optimizations, but did not see any ....
....to make efficient use of the broadcast and total message ordering properties of Memory Channel, while Shasta can run on more traditional networks as well. Cashmere achieves good speedups on a number of applications with large problem sizes. A few software or hybrid hardware software DSM systems [14, 20] have explored dedicating the second processor on a dual processor SMP node for message handling. These systems do not exploit any of the intra node data sharing and clustering benefits of SMP nodes. Furthermore, any speedup numbers reported for P processors must be qualified by the fact that the ....
I. Schoinas, B. Falsafi, M. D. Hill, J. R. Larus, C. E. Lukas, S. S. Mukherjee, S. K. Reinhardt, E. Schnarr, and D. A. Wood. Implementing Fine-Grain Distributed Shared Memory on Commodity SMP Workstations. Technical Report 1307, University of Wisconsin Computer Sciences, Mar. 1996.
....(SMP) have emerged in the last few years as one of the most attractive basic blocks for larger scale shared memory machines. From Pentium quads used in hardware systems, such as the Sequent NUMA Q [51] or the HAL S 1 [80] to dual SPARC modules used in hybrid DSM systems, such as Typhoon 0 [63][72], and to four processor AlphaServers used in all software DSM systems, such as Shasta [69] all SMP nodes consist of two or four processors connected by a bus or a crossbar to a uniform access memory. In light of this trend, it is important to understand the interaction between SC COMA s approach ....
I. Schoinas, B. Falsafi, M.D. Hill, J.R. Larus, C.E. Lukas, S.S. Mukherjee, S.K. Reinhardt, E. Schnarr, and D.A. Wood. Implementing Fine-Grain Distributed Shared Memory on Commodity SMP Workstations. Technical Report 1307, Computer Sciences Department, University of Wisconsin--Madison, February 1996.
....not CC NUMA become more useful for CC NUMA machines at larger scale [2] and how problem size affects these results. We would also like to look at the impact of these optimizations on systems that support fine grained coherence with either more commodity oriented controllers [16] or in software [10, 19], thus completing the performance portability picture, and to enlarge our coverage by including more applications in our suite. Finally, it may be interesting to examine how optimizations performed on the applications compare with the use of custom protocols to improve the performance of the ....
Schoinas I., Falsafi B., Hill M., Larus J., Lucas C., Mukherjee S., Reinhardt S., Schnarr E., and Wood D. Implementing Fine-Grain Distributed Shared Memory On Commodity SMP Workstations. Technical Report 1307, Computer Sciences Department, University of Wisconsin--Madison, March 1996.
....types of loads in the 64 bit Alpha architecture. We recently became aware that a form of the flag technique has been proposed for use in hardware in the StarT NG machine [5] The Blizzard S project has also recently incorporated the flag technique by adapting this idea from the StarT NG design [17]. With this optimization, the Blizzard S overhead is 3 instructions at most loads and 8 instructions at most stores. Run time overheads on a 66 MHz HyperSPARC processor (with a 8K first level data cache and 256K second level data cache) are reported for five applications, of which only one is a ....
....the Blizzard S overhead is 3 instructions at most loads and 8 instructions at most stores. Run time overheads on a 66 MHz HyperSPARC processor (with a 8K first level data cache and 256K second level data cache) are reported for five applications, of which only one is a SPLASH 2 application [17]. For the Barnes Hut application, the reported Blizzard S overhead on the Sparc is 1.6, while the Shasta overhead on the 275 MHz Alpha is 1.08. We have also measured the Shasta overhead for the appbt application. The Blizzard S overhead is 1.9, while the Shasta overhead is 1.19. There are several ....
I. Schoinas, B. Falsafi, M. D. Hill, J. R. Larus, C. E. Lukas, S. S. Mukherjee, S. K. Reinhardt, E. Schnarr, and D. A. Wood. Implementing Fine-Grain Distributed Shared Memory on Commodity SMP Workstations. Technical Report 1307, University of Wisconsin Computer Sciences, Mar. 1996.
....at a finer granularity. FGDSM systems achieve performance competitive to SVM systems [ZIS 97] without having to resort to weak consistency models. Blizzard CM 5 [SFL 94] developed by the author and others for the TMC CM 5, was the first FGDSM system on messaging passing hardware. Blizzard COW 1 [SFH 96] which is its direct descendant, was developed on the Wisconsin Cluster of Workstations (COW) and is the focus of this thesis. Digital's Shasta [SGT96] is another FGDSM system inspired by Blizzard CM 5. Unlike Shasta, Blizzard uses the Tempest interface [Rei94] to support software distributed ....
....target was calculated based on the fine grain tag value. In this way, condition codes were not modified. A similar access check sequence is still used today when better alternatives cannot be applied (Figure 2 1 (c) To address these problems, Eric Schnarr developed a new binary rewriting tool [SFH 96] based on the EEL executable rewriting library [LS95] Besides software access control, I have been using this tool for rewriting Blizzard executables for other purposes as well (e.g. implicit network polling; see Chapter 3) The EEL library provides extensive infrastructure to analyze and ....
[Article contains additional citation context not shown here]
Ioannis Schoinas, Babak Falsafi, Mark D. Hill, James R. Larus, Christopher E. Lucas, Shubhendu S. Mukherjee, Steven K. Reinhardt, Eric Schnarr, and David A. Wood. Implementing fine-grain distributed shared memory on commodity SMP workstations. Technical Report 1307, Computer Sciences Department, University of Wisconsin-Madison, March 1996.
....at user level. It requires adding 7 instructions at each back edge in an application's control flow graph to check a control register for message arrival. Because the T0 device supports cachable control registers, the common case (that no message has arrived) incurs an overhead of only 6 or 7 cycles [28]. When a message does arrive, the round trip time for the mechanism is 1.5 microseconds, which includes the cost of clearing the T0 register with an uncached store. The trade off between the two mechanisms clearly depends upon the frequency of message arrivals: for frequent messages, polling works ....
Ioannis Schoinas, Babak Falsafi, Mark D. Hill, James R. Larus, Christopher E. Lucas, Shubhendu S. Mukherjee, Steven K. Reinhardt, Eric Schnarr, and David A. Wood. Implementing Fine-Grain Distributed Shared Memory On Commodity SMP Workstations. Technical Report 1307, March 1996.
....user level. It requires adding 7 instructions at each back edge in an application's control flow graph to check a control register for message arrival. Because the T0 device supports cachable control registers, the common case (that no message has arrived) incurs an overhead of only 6 or 7 cycles [30]. When a message does arrive, the round trip time for the mechanism is 1.5 microseconds, which includes the cost of clearing the T0 register with an uncached store. The trade off between the two mechanisms clearly depends upon the frequency of message arrivals: for frequent messages, polling works ....
Ioannis Schoinas, Babak Falsafi, Mark D. Hill, James R. Larus, Christopher E. Lucas, Shubhendu S. Mukherjee, Steven K. Reinhardt, Eric Schnarr, and David A. Wood. Implementing Fine-Grain Distributed Shared Memory On Commodity SMP Workstations. Technical Report 1307, March 1996.
....can be performed in the kernel trap handler fast ( few thousand cycles) Using similar techniques, we have been able to reduce the roundtrip time for synchronous traps on 66 MHz HyperSparcs, from 101 µsecs with the standard Solaris 2.4 signal interface to 5 µsecs with optimized kernel interfaces [45]. Other researchers have reported similar results [40,50] However, specialized fast interfaces are not viable in the long run. First, we violate kernel structuring principles. Device specific support must be implemented at the lowest kernel levels where no public interfaces exist to support this ....
....Myrinet Results To demonstrate the feasibility of our approach, we present results from an implementation on real hardware (Myrinet) This implementation started from the same NI control program distributed by Berkeley. Subsequently, it was modified to support the messaging subsystem of Blizzard [46,44,45], our fine grain distributed shared memory system. To this date, it has been kept source level compatible with the Berkeley Active Message library. Less than 200 lines of C code in the NI control program and device driver were required to support address translation. The testbed consists of Sun ....
Ioannis Schoinas, Babak Falsafi, Mark D. Hill, James R. Larus, Christopher E. Lucas, Shubhendu S. Mukherjee, Steven K. Reinhardt, Eric Schnarr, and David A. Wood. Implementing Fine-Grain Distributed Shared Memory On Commodity SMP Workstations. Technical Report 1307, Computer Sciences Department, University of Wisconsin--Madison, March 1996.
....which implements the shared memory and message passing mechanisms specified by Tempest [15] On the CM 5, we used Blizzard E [19] which implements fine grain DSM entirely in software. On COW, we used Blizzard Typhoon 0, which uses a small hardware extension to implement fine grain DSM efficiently [18]. Each COW node is a 2 processor SMP, with one processor being used for computation and the other for communication. 7.1 DSMC DSMC, and its implementation in C with user defined reductions, are described in detail in Section 4. The hand optimized implementation also uses bulk messages to move ....
Ioannis Schoinas, Babak Falsafi, Mark D. Hill, James R. Larus, Christopher E. Lucas, Shubhendu S. Mukherjee, Steven K. Reinhardt, Eric Schnarr, and David A. Wood. Implementing Fine-Grain Distributed Shared Memory On Commodity SMP Workstations. Technical Report 1307, Computer Sciences Department, University of Wisconsin--Madison, March 1996.
....user level. It requires adding 7 instructions at each back edge in an application's control flow graph to check a control register for message arrival. Because the T0 device supports cachable control registers, the common case (that no message has arrived) incurs an overhead of only 6 or 7 cycles [28]. When a message does arrive, the round trip time for the mechanism is 1.5 microseconds, which includes the cost of clearing the T0 register with an uncached store. The trade off between the two mechanisms clearly depends upon the frequency of message arrivals: for frequent messages, polling works ....
Ioannis Schoinas, Babak Falsafi, Mark D. Hill, James R. Larus, Christopher E. Lucas, Shubhendu S. Mukherjee, Steven K. Reinhardt, Eric Schnarr, and David A. Wood. Implementing Fine-Grain Distributed Shared Memory On Commodity SMP Workstations. Technical Report 1307, March 1996.
....processors can access the interface SRAM directly using uncached memory operations; however, the interface processor can access host memory only via the DMA engine. Our prototype's interface processors run custom software derived from Berkeley's Active Messages implementation [14] Schoinas et al. [49] describe our modifications and enhancements for Tempest. As in the Berkeley implementation, our software allocates user accessible send and receive queues in the shared SRAM. Each entry holds header information, a few words of message data, and an optional pointer to more message data in host ....
I. Schoinas, B. Falsafi, M. D. Hill, J. R. Larus, C. E. Lucas, S. S. Mukherjee, S. K. Reinhardt, E. Schnarr, and D. A. Wood. "Implementing fine-grain distributed shared memory on commodity SMP workstations." Technical Report 1307, Computer Sciences Department, University of Wisconsin--Madison, Mar. 1996.
....have demonstrated efficient implementations of fine grain (i.e. coherence at 32 128 byte granularity) shared memory for message passing machines. For example, the Blizzard system at Wisconsin implements, in software, coherent shared memory on a CM 5 [32] and on a cluster of workstations [31]. A key question is whether a compiler is justified in incurring the overheads of a DSM based shared address space, in preference to low level message passing. This paper reports an experimental study that explores this question. In this study, our platform is a cluster of SPARCstation 20 ....
....alleviates the problem of throwing away expensively fetched remote data due to finite size of the level two cache. These experiments ran on a cluster of dual processor SPARCstation 20 workstations running Solaris 2.4 connected by a Myrinet network (all commodity parts) This implementation (see [31] for details) uses a small custom hardware device [28] that sits on the memory bus of each workstation and accelerates access control functions. Note that the coherence protocol itself is written in unprivileged software. Purely software implementations of fine grain access control also exist, but ....
Ioannis Schoinas, Babak Falsafi, Mark D. Hill, James R. Larus, Christopher E. Lucas, Shubhendu S. Mukherjee, Steven K. Reinhardt, Eric Schnarr, and David A. Wood. Implementing Fine-Grain Distributed Shared Memory On Commodity SMP Workstations. Technical Report 1307, Computer Sciences Department, University of Wisconsin--Madison, March 1996.
No context found.
Ioannis Schoinas, Babak Falsafi, Mark D. Hill, James R. Larus, Christopher E. Lucas, Shubhendu S. Mukherjee, Steven K. Reinhardt, Eric Schnarr, and David A. Wood. Implementing fine-grain distributed shared memory on commodity SMP workstations. Technical Report 1307, March 1996.
No context found.
I. Schoinas, B. Falsafi, M. D. Hill, J. R. Larus, C. E. Lucas, S. S. Mukherjee, S. K. Reinhardt, E. Schnarr, and D. A. Wood. Implementing Fine-Grain Distributed Shared Memory On Commodity SMP Workstations. Technical Report 1307, Computer Sciences Department, University of Wisconsin--Madison, March 1996.
No context found.
I. Schoinas, B. Falsafi, M. D. Hill, J. R. Larus, C. E. Lucas, S. S. Mukherjee, S. K. Reinhardt, E. Schnarr, and D. A. Wood. Implementing fine-grain distributed shared memory on commodity SMP workstations. Technical Report 1307, University of Wisconsin-Madison, March 1996.
No context found.
Ioannis Schoinas et al., "Implementing Fine-Grain Distributed Shared Memory on Commodity SMP Workstations." Technical Report TR-1307, University of Wisconsin-Madison, March 1996.