20 citations found.
S. C. Woo, J. P. Singh and J. L. Hennessy, The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors, in Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 219-229, October 4-7, 1994.

This paper is cited in the following contexts:
Integrating Non-blocking Synchronisation in Parallel.. - Tsigas, Zhang (2002)   (Correct)

....for each of the applications. [Figure 4: Applications and inputs. Ocean: 1026; radiosity: largeroom; volrend: 256x256x126; spark98: sf5.1.pack; water spatial: 1331 molecules; water nsquared: 1331 molecules.] 3.2 Application Description. Ocean simulates eddy currents in an ocean basin [27]. Both its inherent and induced (at page granularity) data referencing patterns generally involve one producer with one consumer. Volrend renders three dimensional volume data into an image using a ray casting method [17]. The volume data are read only. Its inherent data referencing pattern on ....

S. C. Woo, J. P. Singh and J. L. Hennessy, The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors, in Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 219-229, October 4-7, 1994.


Efficient Runtime Support for Cluster-Based Distributed Shared.. - Speight (1997)   (3 citations)  (Correct)

....SPLASH 2, and NAS). Two other applications were used in previous DSM studies evaluating the Munin and TreadMarks systems. Finally, ILINK is a genetics application that was recently used to find the gene responsible for Parkinson's disease. Descriptions of the SPLASH 2 applications appear in [50, 44, 51, 49]; NAS benchmarks are outlined in [6]; the two SPLASH applications are described in [45]; the Munin programs are described in [14]; and a detailed description of ILINK can be found in [17]. Here, we briefly summarize the applications studied. Table 4.1 gives relevant ....

S. Woo, J. Singh, and J. Hennessy. The performance advantages of integrating block data transfer in cache-coherent multiprocessors. In The Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 219--229, Oct 1994.


DSZOOM - Low Latency Software-Based Shared Memory - Radovic, Hagersten (2001)   (Correct)

....from the original Stanford University distribution, which were originally developed for hardware multiprocessors. The applications are: Barnes Hut (hierarchical N body method), FFT (complex 1 D version of the radix-√n six step FFT algorithm [Bai90]), LU (blocked LU decomposition, see [WSH94] for more details), CLU (blocked LU decomposition with contiguous allocation of data, more optimized version of LU), Radix (integer radix sort kernel), Radiosity (iterative hierarchical diffuse radiosity method [HSA91]), Raytrace (rendering of a three dimensional scene using ray tracing), Wa t ....

S. C. Woo, J. P. Singh, and J. L. Hennessy. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), pages 219--229, October 1994.


A Multiprotocol Communication Support for the Global.. - Nieplocha, Ju, Straatsma (2000)   (1 citation)  (Correct)

....is one of the kernel programs from SPLASH 2, to evaluate the performance of our approach. The LU program factors a dense matrix into the product of a lower triangular and an upper triangular matrix. The factorization uses blocking to exploit temporal locality w.r.t. individual submatrix elements [12]. Originally designed to run on shared memory systems, this benchmark can only be used on a single SMP node of the IBM SP. Some modifications were needed to use the global address space model. We also developed a Pthread version of the benchmark to evaluate the performance of our modifications ....

S.C. Woo, J.P. Singh, J.L. Hennessy, The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. Proc. 6th ASPLOS, 1994.


Performance Portability and Scalability in Shared-Address-Space.. - Jiang (2000)   (Correct)

....key patterns determine how applications interact with different system characteristics and granularities. Regular Applications LU performs the blocked LU factorization of a dense matrix. We begin with the non contiguous version of LU, which uses the natural 2 D arrays to represent the 2 D matrix [85]. Its inherent data sharing pattern (at word granularity) involves one producer with multiple consumers. Read and write accesses are both quite fine grained with these data structures. Since a page spans multiple sub rows from different blocks, it suffers false sharing and fragmentation in SVM ....

....involves one producer with multiple consumers. Read and write accesses are both quite fine grained with these data structures. Since a page spans multiple sub rows from different blocks, it suffers false sharing and fragmentation in SVM systems. Ocean simulates eddy currents in an ocean basin [85]. It consists largely of nearest neighbor calculations on regular grids, including a multi grid solver [11]. Both its inherent and induced (at page granularity) data referencing patterns generally involve one producer with one consumer. Read and write accesses are coarse grained internally to a ....

[Article contains additional citation context not shown here]

S. C. Woo, J. P. Singh, and J. Hennessy. The performance advantages of integrating block data transfer in cache-coherent multiprocessing. In The Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1994.


Architectural Mechanisms for Explicit Communication.. - Umakishore.. (1995)   (5 citations)  (Correct)

....controller hardware. The important point to note is that these mechanisms are integrated with the basic underlying directory based coherence maintenance. In this sense they are very different from the explicit communication primitives such as those proposed in MIT Alewife [19] or Stanford Flash [36], in that there is no address space management nor explicit coherence maintenance burden at the application level on the programmer for using our primitives. Of the proposed primitives, PSET WRITE provides explicit communication for the static case, SYNC WRITE provides explicit communication for ....

S. C. Woo, J. P. Singh, and J. L. Hennessy. The performance advantages of integrating block data transfer in cache-coherent multiprocessors. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1994.


Adaptive Granularity: Transparent Integration of Fine-Grain and.. - Park (1996)   (1 citation)  (Correct)

....this works well for fine grain data, bulk transfer of data can sometimes be more effective for some applications. Bulk transfer has several advantages over fine grain communications: fast pipelined data transfer, overlap of communication with computation, and replication of data in local memory [29]. To exploit the advantages of both fine grain and coarse grain communications, more recent shared memory machines such as Stanford FLASH and Wisconsin Typhoon have begun to integrate both models within a single architecture and to implement a coherence protocol in software rather than in ....

....machines such as Stanford FLASH and Wisconsin Typhoon have begun to integrate both models within a single architecture and to implement a coherence protocol in software rather than in hardware. In order to use the bulk transfer facility on the machine, several approaches such as explicit messages [12, 29] and a new programming model [13] have been proposed. With the explicit message approach, message passing communication primitives such as send receive or memory copy are used selectively to communicate coarse grain data and load store communications are used for fine grain data communications in ....

[Article contains additional citation context not shown here]

S. Woo, J. Singh, and J. Hennessy. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, 219--229, November 1994.


Adaptive Granularity: Transparent Integration of Fine- and.. - Daeyeon Park (1996)   (1 citation)  (Correct)

....grant CDA 9216321. some application, data bulk transfer can sometimes be more effective. Bulk transfer has several advantages over fine grain communication: 1) the pipelining of data transfers, 2) the overlapping of communication with computation, and 3) the replication of data in local memory [WooS94]. To exploit the advantages of fine grain and coarse grain communication, more recent shared memory machines such as Stanford FLASH and Wisconsin Typhoon have begun to integrate both models within a single architecture and to implement coherence protocols in software rather than in hardware. In ....

....and Wisconsin Typhoon have begun to integrate both models within a single architecture and to implement coherence protocols in software rather than in hardware. In order to use the bulk transfer facility on these machines, several approaches have been proposed such as explicit messages [Hein94][WooS94] and new programming models [ChanR94]. In explicit messages, message passing communication primitives such as send receive or memory copy are used selectively to communicate coarse grain data, while load store communication is used for fine grain data [WooS94]. In other words, two communication ....

[Article contains additional citation context not shown here]

S. Woo, J. Singh, and J. Hennessy, "The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors", in Proc. of the 6th ASPLOS Conf., 219-229, November 1994.


Architectural Mechanisms for Explicit.. - Ramachandran.. (1995)   (6 citations)  (Correct)

....controller hardware. The important point to note is that these mechanisms are integrated with the basic underlying directory based coherence maintenance. In this sense they are very different from the explicit communication primitives such as those proposed in MIT Alewife [22] or Stanford Flash [39], in that there is no address space management nor explicit coherence maintenance burden at the application level on the programmer for using our primitives. Of the proposed primitives, PSET WRITE provides explicit communication for the static case, SYNC WRITE provides explicit communication for ....

S. C. Woo, J. P. Singh, and J. L. Hennessy. The performance advantages of integrating block data transfer in cache-coherent multiprocessors. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1994.


The Sensitivity of Communication Mechanisms to.. - Chong, Barua.. (1998)   (10 citations)  (Correct)

....[1] Fugu [30] and the Wisconsin Typhoon [38] support several variants of shared memory and messaging styles. The availability of machines with multiple mechanisms has led to an increasing amount of insight on the effectiveness of the various mechanisms for different applications [8] [15] [44] [46] [21] [10]. Message passing mechanisms, usually in the form of user level active messages and efficient bulk transfer of data, offer good performance on programs with known communication patterns since data can be communicated when produced rather than when requested by the program, thereby ....

....Our study shows, however, that bandwidth across the bisection of the machine may become a critical cost in supporting shared memory on modern machines. Such costs will make message passing and specialized user level protocols [15] increasingly important as processor speeds increase. Woo et al. [46] compared bulk transfer with shared memory on simulations of the FLASH multiprocessor [19] running the SPLASH [18] suite. They found bulk transfer performance to be disappointing due to the high cost of initiating transfer and the difficulty in finding computation to overlap with the transfer. ....

S. C. Woo, J. P. Singh, and J. L. Hennessy. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI). ACM, October 1994.


An Integrated Compile-Time/Run-Time Software.. - Dwarkadas, Cox.. (1996)   (6 citations)  (Correct)

....bridge the gap by providing the flexibility of shared memory while taking advantage of bulk transfer. Several recent proposals for hardware shared memory machines include a message passing subsystem designed in part to allow applications to take advantage of bulk data transfer [19, 20]. Woo et al. [24] evaluate one such design in the context of the Flash system. There are many differences between their work and ours. The Flash bulk data transfer consists of multiple cache lines as opposed to multiple pages in our work, and the latencies used in the Flash simulation are much smaller than in our ....

S.C. Woo, J.P. Singh, and J.L. Hennessy. The performance advantages of integrating block data transfer in cache-coherent multiprocessors. In Proceedings of ASPLOS-6, October 1994.


An Evaluation of Fine-Grain Producer-Initiated.. - Abdel-Shafi, Hall.. (1997)   (11 citations)  (Correct)

....bulk transfer) primitives are primarily useful for regular, coarse grain data sharing patterns. In such cases, however, software prefetching is also highly effective and bulk transfer primitives have been shown to provide little additional performance benefit over prefetching for scientific codes [25]. In contrast, fine grain producer initiated primitives appear to be useful in certain cases where prefetching is inapplicable or insufficient. In particular, prefetching may not be effective for data references where either (1) the value to be read is not produced sufficiently early (before the ....

S. C. Woo et al. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. In ASPLOS-VI, 1994.


On the Use and Performance of Explicit Communication Primitives.. - Qin, Baer (1996)   (2 citations)  (Correct)

....message handler). This is in contrast with cache miss requests that need to be performed to completion. 4 Experimental Methodology. 4.1 Applications and Experiments. For our experiments, we selected three kernel applications, FFT, LU factorization, and RADIX sort from the SPLASH 2 benchmark suite [32, 33]. These applications have been coded with a CC NUMA system in mind, thus they already have some communication optimizations embedded in them. They also exhibit coarse grain regular communication patterns that can be exploited by the proposed communication primitives. Table 2 summarizes some ....

....yields execution times for the software optimization comparable, in fact even slightly lower, to those of the hardware implementation. 5.2 LU Factorization. The application: The SPLASH 2 parallel implementation of the LU factorization of a dense matrix has been optimized to exploit data locality [33]. Nonetheless, the serial sections of the application produce a fair amount of load imbalance. The input matrix is divided into submatrices, or blocks, which are assigned to processors in a 2D scatter decomposition fashion. In essence, every processor is responsible for factorizing the same number ....

S. C. Woo, J. P. Singh, and J. L. Hennessy. The performance advantages of integrating block data transfer in cache-coherent multiprocessors. In Proceedings of 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), pages 219--229, 1994.


Speeding up Irregular Applications in Shared-Memory.. - Zhang, Torrellas (1995)   (25 citations)  (Correct)

....adapts to the object size. Unfortunately, both schemes fail to use any object information that could be extracted, often very easily, from the source code. Instead of using help from the compiler or programmer, all is left to the hardware. Two other techniques, regions [3] and block transfers [19], use the opposite approach, namely they use the help of the software to identify objects. However, given the overheads involved, these techniques are designed for large objects. In addition, identifying the regions or blocks and annotating the application is likely to require significant ....

S. Woo, J. Singh, and J. Hennessy. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 219--229, October 1994.


An Integrated Shared-Memory / Message Passing API for.. - Speight, Abdel-Shafi, ..   (Correct)

....are implemented using the message layer to send and receive coherence packets. In addition to message passing, Flash supports cache coherent block transfer [9]. In an evaluation of the performance of block transfer in Flash, Woo et al. reported limited benefit for the applications studied [20]. The Wisconsin Typhoon [16] also integrates shared memory and message passing by exposing both models in the Tempest interface. Like Flash, Typhoon employs a processor located at the network interface to support the message interface. 5 CONCLUSIONS. This paper has described the Brazos Common ....

S.C. Woo, J.P. Singh, J.L. Hennessy. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems. p. 219-229, 1994.


Efficient Runtime Support for Cluster-Based Distributed Shared.. - Speight (1997)   (3 citations)  (Correct)

....SPLASH 2, and NAS). Two other applications were used in previous DSM studies evaluating the Munin and TreadMarks systems. Finally, ILINK is a genetics application that was recently used to find the gene responsible for Parkinson's disease. Descriptions of the SPLASH 2 applications appear in [50, 44, 51, 49]; NAS benchmarks are outlined in [6]; the two SPLASH applications are described in [45]; the Munin programs are described in [14]; and a detailed description of ILINK can be found in [17]. Here, we briefly summarize the applications studied. Table 4.1 gives relevant information about each of ....

S. Woo, J. Singh, and J. Hennessy. The performance advantages of integrating block data transfer in cache-coherent multiprocessors. In The Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 219--229, Oct 1994.


Hardware Support for Flexible Distributed Shared Memory - Reinhardt, al. (1998)   (1 citation)  (Correct)

....they directly support the higher level shared memory abstraction. In contrast, other systems that seek to integrate message passing and shared memory treat user level message passing as a complementary alternative to rather than a fundamental building block for shared memory communication [22, 27, 56]. To optimize communication in these systems, critical portions of the program must be rewritten in a message passing style. Of course, if desired, Tempest programmers can also dispense with shared memory and use messages directly for example, to implement synchronization primitives. This ....

S. C. Woo, J. P. Singh, and J. L. Hennessy. "The performance advantages of integrating block data transfer in cache-coherent multiprocessors." In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), pages 219--229, Oct. 1996.


Combining Compile-Time and Run-Time Support for.. - Dwarkadas, Lu.. (1999)   (3 citations)  (Correct)

....can be utilized by the run time, not only to optimize communication, but also to balance load [15]. Several recent proposals for hardware shared memory machines include a message passing subsystem designed in part to allow applications to take advantage of bulk data transfer [20] [21]. Woo et al. [33] evaluate one such design in the context of the Flash system. While Woo et al. focus on establishing the magnitude of the performance benefits of bulk data transfer with hardware based shared memory, we have explored in addition ways for the compiler to automate the use of the bulk data transfer ....

S.C. Woo, J.P. Singh, and J.L. Hennessy. The performance advantages of integrating block data transfer in cache-coherent multiprocessors. In Proceedings of the 6th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 219--231, October 1994.


OS Support for Improving Data Locality on CC-NUMA.. - Verghese, Devine.. (1996)   (14 citations)  (Correct)

No context found.

S. Woo, J. P. Singh, J. L. Hennesey. The performance advantages of integrating block data transfer in cache-coherent multiprocessors. In Proceedings, Architectural Support for Programming Languages and Operating Systems, pages 219-232, October 1994.