Sandhya Dwarkadas, Kourosh Gharachorloo, Leonidas Kontothanassis, Daniel J. Scales, Michael L. Scott, and Robert Stets. Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory. In Proceedings of the International Symposium on High Performance Computer Architecture, 1999.
.... Barnes, Raytrace, Water Nsquared (Water NS), Water Spatial (Water SP), Ocean (OceanRW), and Volume rendering (Volrend). Five of these programs were slightly modified to make them execute more efficiently; such modifications are usually made when the programs are executed on software DSM systems [11, 23, 25]. The actual modifications are shown in Table 7.1. Table 7.1: Modifications on SPLASH-2 programs. Program: Modification. LU Contig: owner of block (i; j) owner of block (j; i). FFT: sender-initiated transposition [23]. Raytrace: elimination of unused lock operation for ray ID [23]. Ocean: rowwise ....
.... Modifications on SPLASH-2 programs. Program: Modification. LU Contig: owner of block (i; j) owner of block (j; i). FFT: sender-initiated transposition [23]. Raytrace: elimination of unused lock operation for ray ID [23]. Ocean: rowwise partition (Ocean RW) [25]. Barnes: sequential tree construction [11]. Compiler: We used our optimizing compiler RCOP [37, 59], in which optimizing techniques for the UDSM ADSM are implemented. RCOP analyzes the source code of shared-memory parallel programs based on the LRC (lazy release consistency) model, solves dataflow equations for ....
S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. J. Scales, M. L. Scott, and R. Stets. Comparative evaluation of fine- and coarse-grain approaches for software distributed shared memory. In Proc. of the 5th Int. Symp. on High-Performance Computer Architecture (HPCA), pages 260--269, January 1999.
.... [Figure 8: Application speedups for Sun Enterprise E6000 (16 CPUs), 2-node CC-NUMA (2x8), and 2-node DSZOOM-WF (2x8). Applications: FFT, LU-c, LU-nc, Radix, Barnes, FMM, Ocean-c, Ocean-nc, Radiosity, Raytrace, Water-nsq, Water-sp, Average.] 2L [41, 10], CRL [19], GeNIMA [5], Ivy [26, 27], MGS [44], Munin [8], Shasta [35, 34, 32, 33, 10], SiroccoS [36], SoftFLASH [11], and TreadMarks [21]. Most of them suffer from synchronous interrupt protocol processing. We believe that many of these implementations would benefit from a more efficient ....
.... [Figure 8: Application speedups for Sun Enterprise E6000 (16 CPUs), 2-node CC-NUMA (2x8), and 2-node DSZOOM-WF (2x8).] 2L [41, 10], CRL [19], GeNIMA [5], Ivy [26, 27], MGS [44], Munin [8], Shasta [35, 34, 32, 33, 10], SiroccoS [36], SoftFLASH [11], and TreadMarks [21]. Most of them suffer from synchronous interrupt protocol processing. We believe that many of these implementations would benefit from a more efficient protocol implementation, such as the one described here. The DSZOOM-WF's basic approach is ....
S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. J. Scales, M. L. Scott, and R. Stets. Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory. In Proceedings of the 5th International Symposium on High-Performance Computer Architecture, pages 260--269, January 1999.
.... Over the last few years we have seen the development of several benchmark suites for shared memory parallel systems [25, 26, 30, 33]. These benchmark suites have proven to be invaluable research tools, providing a basis for comparison between various shared memory architectures (e.g. [13, 25]). Results from these benchmark suites have, however, also been taken as a measure of the performance that can be obtained on shared memory architectures for the classes of applications that these benchmark programs represent. This paper demonstrates that, ....
S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. J. Scales, M. L. Scott, and R. Stets. Comparative evaluation of fine- and coarse-grain approaches for software distributed shared memory. In Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, pages 260--269, Jan. 1999.
.... protocols [6] reduce the impact of the large page size, fine-grained sharing and false sharing remain problematic [1]. Fine-grained DSM systems have been built using code instrumentation, but they have been limited by the cost of instrumentation and lack of communication aggregation [8]. The system presented here, DOSA, uses the ability to distinguish pointers from data at run time to achieve efficient fine-grained sharing using VM support and without using instrumentation. It does so by introducing a level of indirection that allows objects to reside at different virtual memory ....
....a new language or API to the programmer to express distributed sharing, while DOSA does not. DOSA aims to provide transparent object sharing for existing typed languages, such as Java. Furthermore, none of Orca, Jade, COOL, or SAM use VM-based mechanisms for object sharing. Dwarkadas et al. [8] compared Cashmere, a coarse-grained system somewhat like TreadMarks, and Shasta, an instrumentation-based system, running on an identical platform: a cluster of four 4-way AlphaServers connected by a Memory Channel network. In general, Cashmere outperformed Shasta on coarse-grained ....
S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. J. Scales, M. L. Scott, and R. Stets. Comparative evaluation of fine- and coarse-grain approaches for software distributed shared memory. In Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, pages 260--269, Jan. 1999.
....is considerably (up to 150%) better than in TreadMarks. Unfortunately, there is no similarly available implementation of fine-grained shared memory, so an explicit comparison with such a system could not be made, but we offer some speculations based on published results comparing Cashmere [6], a coarse-grained system, to Shasta. The outline of the rest of this paper is as follows. Section 2: API and memory model. Section 3: Implementation and comparison with conventional systems. Section 4: Compiler optimizations for coarse-grained applications. Section 5: Experimental methodology. ....
....extensively to TreadMarks [1] and Shasta [11], using them as examples of coarse-grained and fine-grained DSM systems. Qualitatively similar comparisons can be made with other DSM systems [9, 5, 16]. We have also compared our work to the MultiView approach used in Millipede [8]. Dwarkadas et al. [6] compare Cashmere, a coarse-grained system, and Shasta running on an identical platform: a cluster of four four-way AlphaServers connected by a Memory Channel network. Both systems are designed to leverage the Memory Channel network and take advantage of the hardware shared memory within each SMP ....
S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. J. Scales, M. L. Scott, and R. Stets. Comparative evaluation of fine- and coarse-grain approaches for software distributed shared memory. In Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, pages 260--269, Jan. 1999.
....remote-write system area network [7], a PCI-based network with a peak point-to-point bandwidth of 75 MBytes/sec and a one-way, cache-to-cache latency for a 64-bit remote-write operation of 3.3 μsecs. Previous work has examined the performance of the system on a variety of standard benchmarks [4, 15], as well as a widely used genetic linkage analysis program [3]. In this paper, we demonstrate the utility of SDSM using our example application, TVD. Figure 4 shows the execution time of the application as the number of processors is varied from 1 to 32. The test case uses a 256x256 grid and ....
S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. J. Scales, M. L. Scott, and R. Stets. Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory. In Proc. of the 5th Intl. Symp. on High Performance Computer Architecture, Jan. 1999.
....in applications originally developed for hardware multiprocessors. In contrast, Cashmere's performance can degrade dramatically in the presence of fine-grain synchronization, but for more scalable applications it provides better overall performance. Our results were presented at HPCA in January [3]; a summary appears in figure 2. As shown by the stacked bars in several applications, performance differences between Cashmere and Shasta can sometimes be reduced by program modifications that address coherence and synchronization granularity. 2.3 Compiler Integration Work continued this past ....
S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. Scales, M. L. Scott, and R. Stets. Comparative evaluation of fine- and coarse-grain approaches for software distributed shared memory. In Proceedings of the Fifth High Performance Computer Architecture Symposium, pages 260--269, January 1999.
....hardware within SMP nodes, perhaps the most natural programming paradigm for these clusters is Software Distributed Shared Memory (SDSM) since it utilizes the hardware within a node efficiently. Several studies have already determined the positive impact of SMP-based clusters on SDSM performance [12, 14, 20, 21, 22, 25]. Many of these same studies utilized low-latency networks. However, the benefits of advanced network features (for example, remote memory access) have not been directly quantified. In this paper, we examine the impact of advanced networking features on the performance of the state-of-the-art ....
....memory subsystem to track data accesses, allows multiple concurrent writers, employs home nodes (i.e. maintains one master copy of each shared data page), and leverages shared memory within SMPs to reduce protocol overhead. In practice, Cashmere-2L has been shown to have very good performance [12, 17, 25]. Cashmere was originally designed for a cluster consisting of AlphaServer SMPs connected by a Compaq Memory Channel network, which offers low messaging latencies, write access to remote memory, inexpensive broadcast, and total ordering. Cashmere therefore attempted to maximize performance by ....
S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. J. Scales, M. L. Scott, and R. Stets. Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory. In Proceedings of the Fifth International Symposium on High Performance Computer Architecture, Orlando, FL, January 1999.
....accesses, allows multiple concurrent writers, employs home nodes (i.e. maintains one master copy of each shared data page), maintains a global page directory, and leverages shared memory within SMPs to reduce protocol overhead. In practice, Cashmere-2L has been shown to have very good performance [12, 28]. Cashmere was originally designed to maximize performance by placing shared data directly in remotely writable memory, using remote writes and broadcast to replicate the page directory among nodes, and relying on network total order and reliability to avoid acknowledging the receipt of metadata ....
....memory space. Methods to eliminate this restriction are a focus of ongoing research [7, 30]. 2. Protocol Variants and Implementation. Cashmere was designed for SMP clusters connected by a high-performance system area network, such as Compaq's Memory Channel [15]. Earlier work on Cashmere [12, 28] and other systems [12, 14, 22, 23, 24] has quantified the benefits of SMP nodes to SDSM performance. In this paper, we examine the performance impact of the special network features. We begin by providing an overview of the Memory Channel network and its programming interface. Following this ....
S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. J. Scales, M. L. Scott, and R. Stets. Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory. In Proc. of the 5th Intl. Symp. on High Performance Computer Architecture, Orlando, FL, Jan. 1999.
....track data accesses, allows multiple concurrent writers, employs home nodes (i.e. maintains one master copy of each shared data page) and a global page directory, and leverages shared memory within SMPs to reduce protocol overhead. In practice, Cashmere-2L has been shown to have very good performance [12, 28]. Cashmere was originally designed to maximize performance by placing shared data directly in remotely writable memory, using remote write and broadcast to replicate the page directory among nodes, and relying on network total order and reliability to avoid acknowledging the receipt of metadata ....
....migration optimization. Section 4 covers related work, and Section 5 outlines our conclusions. 2 Protocol Variants and Implementation. Cashmere was designed for SMP clusters connected by a high-performance system area network, such as Compaq's Memory Channel network [15]. Earlier work on Cashmere [12, 28] and other systems [12, 14, 22, 23, 24] has quantified the benefits of SMP nodes to SDSM performance. In this paper, we will examine the performance impact of the special network features. We begin by providing an overview of the Memory Channel network and its programming interface. Following this ....
S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. J. Scales, M. L. Scott, and R. Stets. Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory. University of Rochester CS TR 699, October 1998. Also available as Western Research Lab TR 98/7.
....spent handling coherence misses. Messaging time covers the time spent handling messages when the processor is not already stalled. Finally, Protocol time represents the remaining overhead introduced by the protocol. Additional performance data is available in an extended version of this paper [5], including detailed execution statistics for the two systems (e.g. the number of messages, amount of message traffic, and various protocol-specific measurements) and the execution time breakdown for the larger data set. 4.2.1 Coarse-Grain Access and Synchronization. The applications in this ....
....for either Shasta or Cashmere. Figure 4 presents the speedups for the modified applications along with the unmodified results for 16-processor runs with the large dataset sizes. The corresponding execution time breakdowns are not shown here, but are available in an extended version of the paper [5]. 4.3.1 Modifications for Shasta. The modifications we consider for Shasta are guaranteed not to alter program correctness, and can therefore be applied safely without a deep understanding of the application. This is consistent with Shasta's philosophy of transparency and simple portability. The ....
S. Dwarkadas, K. Gharachorloo, L. Kontothanassis, D. J. Scales, M. L. Scott, and R. Stets. Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory. University of Rochester CS TR 699, October 1998. Also available as Western Research Lab TR 98/7.