Steven Cameron Woo, Jaswinder Pal Singh, and John L. Hennessy. The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors. Stanford University Technical Report No. CSL-TR-93-593, December 1993.
....and bulk data transfer. They measured several versions of a simple Jacobian SOR code on an Alewife simulator, which supported both shared memory and message passing. Their results agreed with our finding (Section 5) that message passing and shared memory can perform equally well. Woo et al. [23] studied the implications of adding a message passing like block transfer facility to shared memory. They added this feature to an architectural simulator of the shared memory Stanford FLASH machine and modified five programs to use it. They found that block transfer was difficult to use ....
....a large volume of data between a producer and consumer. EM3D MP sends a couple hundred messages to transfer the data that requires several hundred thousand cache misses and many times that many protocol messages. Mechanisms for bulk data transfer and more efficient protocols have been proposed [23, 20]. We identified two major sources of overhead in shared memory programs. First is the cost of moving large quantities of data with a request response shared memory protocol and of updating these values with an invalidation based protocol. Second is the cost of synchronization, which, in many ....
Steven Cameron Woo, Jaswinder Pal Singh, and John L. Hennessy. The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors. Technical Report CSL-TR-93-593, Department of Computer Science, Stanford University, November 1993. To appear in ASPLOS VI.
....communication is usually more natural to the load store programming paradigm and we focus on it here (see Section 5.5). The performance advantages of sender initiated load store communication depend intimately on details of the architecture and application and are discussed in [17]. In our block transfer versions, we use the type of communication that is most natural to the application, which is sender initiated for all our applications except for Cholesky, where receiver initiated block transfer has some important advantages (see Section 5.3). 3 Architecture and ....
....alternatives for incorporating block transfer, often at successive levels of implementation complexity and performance gain. While we have experimented with many intermediate block transfer versions, we do not discuss them here for reasons of space. A more complete discussion can be found in [17]. We start with highly optimized load store versions of the applications, discuss the most effective ways to incorporate block transfer, and examine the performance benefits, trying to isolate their sources as much as possible. We also examine how the effectiveness of block transfer changes with ....
[Article contains additional citation context not shown here]
Steven Cameron Woo, Jaswinder Pal Singh, and John L. Hennessy. The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors. Technical Report CSL-TR-93-593, Stanford University, December 1993.
....which is then used to permute the keys for the next iteration. All to all communication occurs in the permutation step. The permutation is inherently a sender determined one, so that keys are communicated through writes rather than reads. The SPLASH 2 implementation is described more fully in [WSH93]. Ocean: The ocean simulation studies large scale ocean movements based on eddy and boundary currents, and is an enhanced version of the Ocean application in the SPLASH suite. The major differences between this version and the SPLASH version are: i) it is written in C rather than FORTRAN, ii) ....
....subgrids are allocated contiguously and locally in the nodes that own them, and (iv) it uses a red black Gauss Seidel multigrid technique based on that presented in [Bra77] whereas the SPLASH version uses a relaxed Gauss Seidel SOR solver. The SPLASH 2 implementation is described more fully in [WSH93]. Barnes: The BARNES application simulates the interaction of a system of bodies (galaxies or particles, for example) in three dimensions over a number of time steps, using the Barnes Hut hierarchical N body method. While similar to the Barnes Hut application in SPLASH, it differs in two respects: ....
Steven Cameron Woo, Jaswinder Pal Singh, and John L. Hennessy. The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors. Stanford University Technical Report No. CSL-TR-93-593, December 1993.
.... and machine sizes, that the hardware platform they simulated (the Thinking Machines CM 5) is now quite dated, and they used different programs with somewhat less challenging communication patterns than we do (e.g. none so challenging as FFT or Radix sorting). Another simulation study by Woo et al. [16] studied the impact of using a block transfer (message passing) facility to accelerate hardware coherent shared memory on a system that provides integrated support for block transfer. They found that block transfer did not promise to improve performance as greatly as had been expected. Both these ....
.... support earlier results from simulation studies that indicate that explicit message passing does not have substantial advantages on efficient modern hardware coherent multiprocessors over the native load store CC SAS model, even for regular applications with naturally coarse grained communication [3, 16]. These results are despite the fact that we do not use prefetching to hide remote access latency in our CC SAS programs (or to hide local access latency in all models) which among our applications would help in only one case (FFT) by 10 15 . In terms of programmability, we found that the CC SAS ....
Steven Cameron Woo, Jaswinder Pal Singh, and John L. Hennessy. The performance advantages of integrating message-passing in cache-coherent multiprocessors. In Proceedings of Architectural Support For Programming Languages and Operating Systems, 1994.
....to improve the communication to computation ratio, ii) grids are conceptually represented as 4 D arrays, with all subgrids allocated contiguously and locally in the nodes that own them, and (iii) it uses a red black Gauss Seidel multigrid equation solver [Bra77] rather than an SOR solver. See [WSH93] for more details. Radiosity: This application computes the equilibrium distribution of light in a scene using the iterative hierarchical diffuse radiosity method [HSA91]. A scene is initially modeled as a number of large input polygons. Light transport interactions are computed among these ....
Steven Cameron Woo, Jaswinder Pal Singh, and John L. Hennessy. The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors. Stanford University Technical Report No. CSL-TR-93-593, December 1993.
.... that read remote data when the problem partitions are large relative to the cache sizes, the write buffers are deep (since remote writes take longer to satisfy than local writes, the write buffers must be deeper to prevent the processor from stalling) and adequate network bandwidth is available [WSH93]. Because of these limitations, the version that is used in this work utilizes remote reads. During the three matrix transpose phases, all processors attempt to communicate with all other processors. Having each processor attempt to communicate with processor 0, then with processor 1, etc. is a ....
Steven Cameron Woo, Jaswinder Pal Singh, and John L. Hennessy. The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors. Stanford University Technical Report No. CSL-TR-93-593, December 1993.
....in Table 3.4. MP3D comes from the vector supercomputing world, is not optimized for parallel computation, and therefore causes a lot of communication. It is included for its value as a communication stress test. Descriptions of the applications can be found in: LU and FFT [RSG93]; Ocean and Radix [WSH93]; Barnes and MP3D [SWG92]. The problem sizes we use for the applications are realistic, and would be run on 16 processor machines in practice. However, owing to the costs of simulation, they clearly are not the largest problem sizes one would run. Some of the applications have two important ....
Steven Cameron Woo, Jaswinder Pal Singh, and John Hennessy. The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors. Technical Report CSL-TR-93-593, Stanford University, December 1993.
No context found.
Steven Cameron Woo, Jaswinder Pal Singh, and John L. Hennessy. The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors. Stanford University Technical Report No. CSL-TR-93-593, December 1993.