10 citations found.
S. Luna, "Implementing an Efficient Portable Global Memory Layer on Distributed Memory Multiprocessors," U. C. Berkeley Technical Report #CSD-94-810, May 1994.


This paper is cited in the following contexts:
Implementing Split-C on the Meiko Computing Surface 2.. - Chad Yoshikawa   (Correct)

....Results. November 15, 1994. Implementing Split-C on the Meiko Computing Surface 2: Preliminary Results. Chad Yoshikawa. 1.0 Introduction. Split-C, a split-phase variant of the C programming language, has been ported to many platforms including the CM-5 and an FDDI network of HP workstations [LUN94]. Traditionally, the language has been implemented on an Active Messages [vE93] substrate. In the Meiko CS-2 implementation, however, we are exploring an alternate communication layer, the Elan library. The Elan library [EL93] is a communications paradigm that utilizes completion events on a ....

S. Luna. Implementing an Efficient Portable Global Memory Layer on Distributed Memory Multiprocessors, Masters thesis, Computer Science Division EECS, U.C. Berkeley, May 1994.


HPAM: An Active Message layer for a Network of HP Workstations - University (1994)   (64 citations)  (Correct)

....memory copy. We did not view ordering as an essential property because it is not required for all higher level communication abstractions. For example, for a shared memory abstraction with explicit completion events (Split-C) [3], ordering is not as critical a property to support as reliability [9]. Ordering can be implemented cheaply on top of HPAM since the programmer is freed from reliability concerns. 3. Abstraction: Active Messages with Request Reply. Active messages present a simple mechanism to the programmer: each message contains the address of a user level handler (code segment) ....

Luna, Steve. "Implementing an Efficient Portable Global Memory Layer on Distributed Memory Multiprocessors", UCB Technical report UCB/CSD 94/#810, 1994


Modeling the Benefits of Mixed Data and Task Parallelism - Chakrabarti, Demmel, Yelick (1995)   (24 citations)  (Correct)

....are normalized to a BLAS 3 FLOP, and the model is fit to data generated from analytical models [9, 10, 23]. The curves were fit for 2 ≤ P ≤ 500 and 100 ≤ n = N^(1/2) ≤ 10000. An estimate of memory per processor in megabytes is given in the column marked M P. Estimates for α and β are in part from [23, 27, 18, 1]. of this magnitude (which we will call packing loss) may substantially mask the benefits which would otherwise be obtained from mixed parallelism. Furthermore, we know of no tighter analysis of this constant for a given graph. We are thus faced with the following problem. Data parallelism is ....

S. Luna. Implementing an efficient portable global memory layer on distributed memory multiprocessors. Technical Report UCB/CSD-94-810, University of California, Berkeley, CA 94720, May 1994.


Towards Modeling the Performance of a Fast Connected.. - Steven Lumetta   (9 citations)  (Correct)

....local and global objects helps us to apply the cost model for optimization. Using Split-C also gives our implementation portability. Versions of Split-C exist on the Cray T3D, the IBM SP-1 and SP-2, the Intel Paragon, the Thinking Machines Corp. CM-5, the Meiko CS-2, and networks of workstations [2, 20, 23, 29]. Although our algorithm accepts arbitrary graphs as input, obtaining optimal performance requires a reasonable partitioning of the graph across processors to enhance locality and load balancing. Partitioning techniques rely on the ability to determine properties of the graph structure. ....

S. Luna, "Implementing an Efficient Portable Global Memory Layer on Distributed Memory Multiprocessors," U. C. Berkeley Technical Report #CSD-94-810, May 1994.


Efficient Resource Scheduling in Multiprocessors - Chakrabarti (1996)   (1 citation)  (Correct)

....10 4 9 CM5 VU CMMD 1:4 Theta 10 4 103 IBM SP1 MPL 2:8 Theta 10 4 50 IBM SP2 MPL 1:2 Theta 10 4 60 Table 2.1: Estimates of message startup overhead α and transfer time per double β scaled to the peak floating point operation time for different machines. Estimates are in part from [107, 117, 88, 7]; the network software are described in these references. to be very large on most distributed memory machines, although reasonable bandwidth can be supported for sufficiently large messages [109, 106]. See Table 2.1 for some idea of the CPU and communication speeds of current multiprocessors. ....

....are normalized to a BLAS 3 FLOP, and the model is fit to data generated from analytical models [40, 42, 107]. The curves were fit for 2 ≤ P ≤ 500 and 100 ≤ n = N^(1/2) ≤ 10000. An estimate of memory per processor in megabytes is given in the column marked M P. Estimates for α and β are in part from [107, 117, 88, 7]. of processors, network latency, and network bandwidth. Using these given functions, we first estimate the parallel running time r(N, P) for a given machine and problem, then fit Equation (3.1) to it using Matlab. The results are presented in Table 3.1. 3.4.2 Regular task trees The second part ....

S. Luna. Implementing an efficient portable global memory layer on distributed memory multiprocessors. Technical Report UCB/CSD-94-810, University of California, Berkeley, CA 94720, May 1994.


Towards Modeling the Performance of a Fast Connected.. - Lumetta.. (1996)   (9 citations)  (Correct)

....let us focus on the process of optimization while hiding specific hardware details. Split-C also gives our implementation portability, with versions running on the Cray T3D, the IBM SP-1 and SP-2, the Intel Paragon, the Thinking Machines CM-5, the Meiko CS-2, and networks of workstations [2, 24, 27, 34]. 2.3 Parallel platforms. We consider three large scale parallel machines: the Cray T3D, the Meiko CS-2, and the Thinking Machines CM-5. These machines offer a range of computational and communication performance against which to evaluate the algorithm implementation. In each case, the Split-C ....

....roughly 20 µs [27, 34]. The CM-5 is based on the Cypress Sparc microprocessor, clocked at 33 MHz, with a 64 kB unified instruction and data cache. A Split-C global read involves issuing a CMAML active message to access the remote location and to reply with the value, taking approximately 12 µs [8, 24]. Traditional measures of the computational performance of the node, such as LINPACK, MFLOPS, and SPECmarks, offer little indication of the performance on this integer graph algorithm, which stresses the storage hierarchy, so instead we calibrate the local node performance empirically in Section ....

S. Luna, "Implementing an Efficient Portable Global Memory Layer on Distributed Memory Multiprocessors," U. C. Berkeley Technical Report #CSD-94-810, May 1994.


Portable Library Support for Irregular Applications - Wen (1995)   (2 citations)  (Correct)

....of the machines used in this work. For portability, we use the communication libraries provided by the vendors instead of the research prototypes such as the generic active messages (GAM) [CKK 94] developed by the NOW group. The measurements for the CM5, the Paragon, and the SP1 are from Luna [Lun94] and the measurements for the Sparc cluster are from Keeton et al. [KAP95]. On every machine, the communication time for small messages is entirely dominated by the send and receive overheads on each end. 3.1.2 Our approach. Although distributed memory architectures provide high performance and ....

....layer is free of any machine dependent code. The runtime layer has been ported on three types of communication libraries: active messages (CMAML), cooperative message passing (MPLp, NX), and BSD sockets using the TCP/IP protocol stack. We reuse code from GAM [LC95, CKK 94] and libsplitc [Lun94], the communication library for the Split-C language [CDG 93]. Most modifications to their code are for integrating the communication primitives with the Multipol thread layer. In the paragraphs that follow, we sketch the implementation of these ports and describe the source of their ....

Steve Luna. Implementing an efficient portable global memory layer on distributed memory multiprocessors. Master's thesis, Computer Science Division, University of California at Berkeley, 1994.


Connected Components on Distributed Memory Machines - Krishnamurthy, Lumetta.. (1994)   (12 citations)  (Correct)

....all of which are stars and are marked with unique values, apply a modified Shiloach-Vishkin algorithm. Iterate over the following steps until done: [Footnote 4: Implementations of the language exist on a variety of machines including the IBM SP-2, the Intel Paragon, the Cray T3D, and the Meiko CS-2 [1, 8, 10, 11].] [Figure 5. Collapsing remote edges.] By collapsing remote edges before entering the global phase, we reduce the amount of work required for each iteration of that ....

S. Luna, "Implementing an Efficient Portable Global Memory Layer on Distributed Memory Multiprocessors," U. C. Berkeley Technical Report #CSD-94-810, May 1994.


Mantis User's Guide, Version 1.0 - Steven Lumetta And   (Correct)

No context found.

S. Luna, "Implementing an Efficient Portable Global Memory Layer on Distributed Memory Multiprocessors," U. C. Berkeley Technical Report #CSD-94-810, May 1994.


HPAM: An Active Message layer for a Network of HP Workstations - Richard Martin   (Correct)

No context found.

Luna, Steve. "Implementing an Efficient Portable Global Memory Layer on Distributed Memory Multiprocessors", UCB Technical report UCB/CSD 94/#810, 1994


CiteSeer.IST - Copyright Penn State and NEC