69 citations found.
V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.-T. Ho, S. Kipnis, and M. Snir. CCL: A portable and tunable collective communication library for scalable parallel computers. IEEE Transactions on Parallel and Distributed Systems, 6(2):154--164, February 1995.

This paper is cited in the following contexts:

Scalability versus Execution Time in Scalable Systems - Sun (2002)   (Correct)

.... on a square 2 D torus with p processors (i.e. 2 D mesh, wraparound, square) [13]. If a hypercube topology or a multistage Omega network is assumed the communication cost would be log(p) r 12(p 1) b and log(p) r 8(p 1) n 1 b for single systems and systems with multiple right sides, respectively [12, 14]. 4.2. Scalability Analysis The scalability analysis of the PDD algorithm for solving single systems can be found in [11]. In the following, we give a scalability analysis of the PDD algorithm for solving systems with multiple right sides, where the number of right sides does not increase with ....

....SP2 than on the Paragon. This means the PPT algorithm has a better scalability on the SP2 than on the Paragon. The better scalability may be due to various reasons, including larger memory and more efficient all to all communication subroutines available on the SP2. Interested readers may refer to [14] for more information on all to all communications. The emphasis here is that when an algorithm is not ideally scalable, its scalability does vary with machine parameters. Range comparison is not only useful in algorithm or software development. It is also applicable in evaluating hardware ....

V. Bala et al., CCL: A portable and tunable collective communication library for scalable parallel computers, IEEE Trans. Parallel Distrib. Systems 6 (Feb. 1995), 154--164.


A Randomized Parallel Sorting Algorithm With an Experimental.. - Helman, Bader, JaJa (1998)   (3 citations)  (Correct)

....versions. Also, when reading or writing more than a single element, bulk data transports are provided with corresponding bulk read and bulk write primitives. Our collective communication primitives, described in detail in [4] are similar to those of the MPI [24] the IBM POWERparallel [7], and the Cray MPP systems [12] and, for example, include the following: transpose, bcast, gather, and scatter. Brief descriptions of these are as follows. The transpose primitive is an all to all personalized communication in which each processor has to send a unique block of data to every ....
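
The transpose primitive described in this excerpt can be illustrated with a small single-process simulation. This is only a sketch of the data movement, not the cited library's interface: the function name `transpose` and the list-of-lists layout are assumptions made for illustration, and no real message passing takes place.

```python
def transpose(send_blocks):
    """Simulate an all-to-all personalized exchange among p processors.

    send_blocks[i][j] is the block that processor i sends to processor j
    (hypothetical layout). The result recv_blocks[j][i] holds that same
    block as seen by the receiving processor j.
    """
    p = len(send_blocks)
    if any(len(row) != p for row in send_blocks):
        raise ValueError("each processor must supply exactly p blocks")
    # Processor j collects the j-th block from every sender i.
    return [[send_blocks[i][j] for i in range(p)] for j in range(p)]


# Three simulated processors; block "bij" travels from processor i to j.
blocks = [[f"b{i}{j}" for j in range(3)] for i in range(3)]
received = transpose(blocks)
```

Applying the exchange twice returns every block to its sender, which is why the operation is named after the matrix transpose.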

....different implementations of the communication primitives were allowed for each machine. Wherever possible, we tried to use the vendor supplied implementations. In fact, IBM does provide all of our communication primitives as part of its machine specific Collective Communication Library (CCL) [7] and MPI. As one might expect, they were faster than the high level SPLIT C implementation. Tables 1, 2, 3, and 4 display the performance of our sample sort as a function of input distribution for a variety of input sizes. In each case, the performance is essentially independent of the input ....

V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.-T. Ho, S. Kipnis, and M. Snir. CCL: A portable and tunable collective communication library for scalable parallel computers. IEEE Transactions on Parallel and Distributed Systems, 6, 2 (Feb. 1995), pp. 154--164.


A New Deterministic Parallel Sorting Algorithm With an.. - Helman, JaJa, Bader (1997)   (8 citations)  (Correct)

....versions. Also, when reading or writing more than a single element, bulk data transports are provided with corresponding bulk read and bulk write primitives. Our collective communication primitives, described in detail in [4] are similar to those of the MPI [16] the IBM POWERparallel [6], and the Cray MPP systems [9] and, for example, include the following: transpose, bcast, gather, and scatter. Brief descriptions of these are as follows. The transpose primitive is an all to all personalized communication in which each processor has to send a unique block of data to every ....

....different implementations of the communication primitives were allowed for each machine. Wherever possible, we tried to use the vendor supplied implementations. In fact, IBM does provide all of our communication primitives as part of its machine specific Collective Communication Library (CCL) [6] and MPI. As one might expect, they were faster than the high level SPLIT C implementation. Optimal Number of Samples s for Sorting on T3D Number of Processors int. proc. 8 16 32 64 128 Table 1: Optimal number of samples s for sorting the [WR] integer benchmark on the Cray T3D, for a ....

V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.-T. Ho, S. Kipnis, and M. Snir. CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers. IEEE Transactions on Parallel and Distributed Systems, 6:154--164, 1995.


A Randomized Parallel Sorting Algorithm With an Experimental.. - Helman, Bader, JaJa (1998)   (3 citations)  (Correct)

....versions. Also, when reading or writing more than a single element, bulk data transports are provided with corresponding bulk read and bulk write primitives. Our collective communication primitives, described in detail in [4] are similar to those of the MPI [24] the IBM POWERparallel [7], and the Cray MPP systems [12] and, for example, include the following: transpose, bcast, gather, and scatter. Brief descriptions of these are as follows. The transpose primitive is an all to all personalized communication in which each processor has to send a unique block of data to every ....

....different implementations of the communication primitives were allowed for each machine. Wherever possible, we tried to use the vendor supplied implementations. In fact, IBM does provide all of our communication primitives as part of its machine specific Collective Communication Library (CCL) [7] and MPI. As one might expect, they were faster than the high level Split C implementation. Tables 1--4 display the performance of our sample sort as a function of input distribution for a variety of input sizes. In each case, the performance is essentially independent of the input distribution. ....

V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C. T. Ho, S. Kipnis, and M. Snir, CCL: A portable and tunable collective communication library for scalable parallel computers, IEEE Transactions on Parallel and Distributed Systems 6, No. 2 (Feb. 1995), 154--164.


A New Deterministic Parallel Sorting Algorithm With an.. - Helman, JaJa, Bader (1996)   (8 citations)  (Correct)

....nonblocking versions. Also, when reading or writing more than a single element, bulk data transports are provided with corresponding bulk read and bulk write primitives. Our collective communication primitives, described in detail in [6] are similar to those of the MPI [17] the IBM POWERparallel [7], and the Cray MPP systems [9] and, for example, include the following: transpose, bcast, gather, and scatter. Brief descriptions of these are as follows. The transpose primitive is an all to all personalized communication in which each processor has to send a unique block of data to every ....

....different implementations of the communication primitives were allowed for each machine. Wherever possible, we tried to use the vendor supplied implementations. In fact, IBM does provide all of our communication primitives as part of its machine specific Collective Communication Library (CCL) [7] and MPI. As one might expect, they were faster than the high level Split C implementation. Optimal Number of Samples s for Sorting on T3D Number of Processors int. proc. 8 16 32 64 128 Table I: Optimal number of samples s for sorting the [WR] integer benchmark on the Cray T3D, for a ....

V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.-T. Ho, S. Kipnis, and M. Snir. CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers. IEEE Transactions on Parallel and Distributed Systems, 6:154--164, 1995.


Practical Parallel Algorithms for Personalized.. - Bader, Helman.. (1995)   (21 citations)  (Correct)

....non blocking versions. Also, when reading or writing more than a single element, bulk data transports are provided with corresponding bulk read and bulk write primitives. Our collective communication primitives, described in detail in [7] are similar to those of MPI [33] the IBM POWERparallel [9], and the Cray MPP systems [16] and, for example, include the following: transpose, bcast, gather, and scatter. Brief descriptions of these are as follows. The transpose primitive is an all to all personalized com Note that the NAS IS benchmark requires that the integers be ranked and not ....

V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.-T. Ho, S. Kipnis, and M. Snir. CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers. IEEE Transactions on Parallel and Distributed Systems, 6:154--164, 1995.


A Bandwidth Latency Tradeoff for Broadcast and Reduction - Sanders, Sibeyn   (Correct)

....over both simple algorithms approaches a factor two (3=2 for the simplex model) For some powerful network topologies, somewhat better algorithms are known. For Hypercubes, there is an elegant and fast algorithm which runs in time HC = k(1 = k(1 O( t log(P ) k) O(t log P ) [1, 4]. However, no similarly good algorithm was known for networks with low bisection bandwidth, e.g. meshes. Even for fully connected networks the best known algorithms for arbitrary P are quite complicated [2, 8] The fractional tree algorithm does not have this problem. In Sec. 4 we explain how ....

V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C. Ho, S. Kipnis, and M. Snir. CCL: A portable and tunable collective communication library for scalable parallel computers. IEEE Transactions on Parallel and Distributed Systems, 6(2):154--164, 1995.


A General-Purpose Model for Heterogeneous Computation - Williams (2000)   (Correct)

....blocks in a variety of parallel algorithms. Proper implementation of these collective communication operations is vital to the efficient execution of the parallel algorithms that use them. Collective communication for homogeneous parallel environments has been thoroughly researched over the years [BBC94, BGP94, MR95]. Collective operations designed for traditional parallel machines are not adequate for heterogeneous environments. As a result, we design and analyze six collective communication algorithms gather, scatter, reduction, prefix sums, one to all broadcast, and all to all broadcast for ....

....super i steps. Balancing these objectives is a nontrivial task. Nevertheless, HBSP k provides guidance on how to design efficient heterogeneous programs. 4. 3 HBSP k Collective Communication Algorithms Collective communication plays an important role in the development of parallel programs [BBC94, BGP94, MR95]. It simplifies the programming task, facilitates the implementation of efficient communication schemes, and promotes portability. In the following subsections, we design six HBSP k collective communication operations gather, scatter, reduction, prefix sums, one to all broadcast, and ....

Vasanth Bala, Jehoshua Bruck, Robert Cypher, Pablo Elustondo, Alex Ho, Ching-Tien Ho, Shlomo Kipnis, and Marc Snir. "CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers." In Proceedings of 8th International Parallel Processing Symposium, pp. 835--844, 1994.


Recent Advances of SKaMPI - Reussner   (Correct)

....le. All possible message length can be restricted to a multiple of this given value. Without access to the Cray T3E we would not have been able to detect this effect. 2. 2 The quality of algorithms for MPI Gather Collective operations play a crucial role in programming message passing systems [1]. Besides measuring the performance of collective operations, also an evaluation of their quality gives useful information for the software developer. The quality of a collective operation is given through the selection of an algorithm appropriate to the given hardware and through a good ....

Vasanth Bala, Jehoshua Bruck, Robert Cypher, Pablo Elustondo, Alex Ho, Ching-Tien Ho, Shlomo Kipnis, and Marc Snir. CCL: A portable and tunable collective communication library for scalable parallel computers. IEEE Transactions on Parallel and Distributed Systems, 6(2):154-164, February 1995.


Communication Modeling of Heterogeneous Networks.. - Banikazemi.. (1999)   (5 citations)  (Correct)

....However, most of the implementations, specially those used for NOW environments, implement the different collective operations on top of point to point operations. Different types of trees (e.g. binomial trees, sequential trees, and k trees) can be used for implementing these operations [6, 9, 10, 19]. We can classify the MPI collective communication operations into three major categories: one to many (such as MPI Bcast and MPI Scatter), many to one (such as MPI Gather), and many to many (such as MPI Allgather and MPI Alltoall). We present the analysis for some of the representative ....
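
As an illustration of how a tree can drive a one to many operation built from point to point sends, the sketch below computes the send schedule of a binomial tree broadcast. It is a generic textbook construction under stated assumptions, not the analysis from the cited paper: the function name `binomial_broadcast_schedule` and the round-by-round list format are hypothetical, and the code only plans the sends rather than performing them.

```python
def binomial_broadcast_schedule(p, root=0):
    """Plan a binomial tree broadcast among p ranks.

    Returns a list of rounds; each round is a list of (sender, receiver)
    pairs that can proceed in parallel. Ranks are renumbered relative to
    the root so that any rank may act as the source.
    """
    rounds = []
    have = 1  # relative ranks 0..have-1 already hold the message
    while have < p:
        pairs = []
        for rel in range(have):
            dst = rel + have  # each holder forwards to one new rank
            if dst < p:
                pairs.append(((rel + root) % p, (dst + root) % p))
        rounds.append(pairs)
        have *= 2  # the set of holders doubles every round
    return rounds


schedule = binomial_broadcast_schedule(8)
receivers = sorted(dst for rnd in schedule for _, dst in rnd)
```

Because the set of holders doubles each round, the schedule has ceil(log2 p) rounds, whereas a sequential tree in which the root sends to every other rank in turn needs p - 1 rounds.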

J. Bruck, R. Cypher, P. Elustando, A. Ho, C.T. Ho, V. Bala, S. Kipnis, and M. Snir. CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers. In Proceedings of the International Parallel Processing Symposium, 1994.


Modeling the Communication Behavior of Distributed Memory.. - Foschia, Rauber, Rünger   (Correct)

....only well suited for distributed memory machines but also for numerical applications. To express the communication, the programs use either machine specific communication libraries (like the NX 2 system on the Intel Paragon) or machine independent libraries like PVM [6] p4 [8] MPI [14] or CCL [3]. These libraries provide communication primitives for single node to single node transfers, for global communication operations like single broadcast and multi broadcast transmissions, and for global reduction operations like single accumulations or multi accumulations [7, 24] To write efficient ....

V. Bala, J. Bruck, R. Cypher, P. Elustondo, C.-T. Ho, S. Kipnis, and M. Snir. CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers. IEEE Transactions on Parallel and Distributed Systems, 6(2):154--164, 1995.


Hybrid Algorithms for Complete Exchange in 2D Meshes - Sundar, Jayasimha, Panda.. (1996)   (5 citations)  (Correct)

....problem in the context of specific applications such as parallel sorting [2, 1, 15, 22] Efficient implementation of complete exchange is a significant issue for designers of collective communication libraries. The question of performing collective communication efficiently has been addressed in [3, 4]. A survey of collective communication algorithms, including those for complete exchange, is given in [17] Multiphase algorithms have been proposed for the hypercube [7] and these same algorithms have been implemented on a 2D mesh [8] This paper has several features that distinguish it from ....

Vasanth Bala, Jehoshua Bruck, Robert Cypher, Pablo Elustondo, Alex Ho, Ching-Tien Ho, Shlomo Kipnis, and Marc Snir. CCL: A portable and tunable collective communication library for scalable parallel computers. Proceedings of the 8th International Parallel Processing Symposium, pages 835--844, 1994.


Scalable S-to-P Broadcasting on Message-Passing MPPs - Hambrusch, Khokhar, Liu (1998)   (1 citation)  (Correct)

....r rows and c columns. Algorithm Br xy dim selects the rows if r c and the columns if r c. In the algorithms described so far processors issue sends and receives to facilitate communication. We do not make use of existing communication operations generally available in communication libraries [1, 2, 7]. S to p broadcasting can easily be stated in terms of known communication operations. We considered two such approaches. The first one, Algorithm Xor, invokes an all to all personalized exchange communication [7] The second such approach results in Algorithm 2 Step. This algorithm performs the ....

V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.-T. Ho, S. Kipnis, and M. Snir, "CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers," Proc. 8-th IPPS, pp. 835-844, 1994.


ECO: Efficient Collective Operations for Communication on.. - Lowekamp, Beguelin (1995)   (26 citations)  (Correct)

.... high performance collective communication libraries, implementation of their library on several architectures [3] and discussions of other packages and approaches to collective communication [13] Bala, et al. describe a collective communication library originally designed for the IBM SP1 [2]. They discuss performance tuning issues, as well as a detailed discussion of the semantics of collective communication and group membership, including the correctness of collective operations. Considerable work has been done on collective operations, and multicast communication in general, which ....

Vasanth Bala, Jehoshua Bruck, Robert Cypher, Pablo Elustondo, Alex Ho, Ching-Tien Ho, Shlomo Kipnis, and Marc Snir. CCL: A portable and tunable collective communication library for scalable parallel computers. In Proceedings of 8th International Parallel Processing Symposium, pages 835--844. IEEE Comput. Soc. Press, 1994.


Efficient Collective Communication Operations in PVMe - Bernaschi, Iannello (1995)   (3 citations)  (Correct)

....approaches are possible. The simpler one is to implement the barrier by performing an empty reduce followed by an empty broadcast. If this solution is selected, an α tree can be used to speedup both steps. An alternative approach, is to use global algorithms like those based on circulant graphs [1] or generalized Fibonacci numbers [3]. 4 Experimental results In this section we report some preliminary results produced by the algorithm based on the α trees, in particular for the reduce operation. The value of α we selected is 0.6 which corresponds to the value predicted by using the ....

V. Bala, et al., "CCL: a portable and tunable collective communication library for scalable parallel computers", Procs. of the 8th Int. Conf. on Parallel Processing, IEEE, April 1994.


Efficient Implementation of Reduce-Scatter in MPI - Bernaschi, Iannello, Lauria (1998)   (2 citations)  (Correct)

....of LogP. In these papers experimental results confirming the theoretical analysis are reported. Algorithms developed in the framework of fully connected models have been extensively employed in general purpose, portable communication libraries. This approach has been followed in the design of CCL [2], an experimental communication library including all basic collective operations. Most of the results of this work have been incorporated in the IBM implementation of MPI we used in our experiments. Extensive performance data about advanced algorithms for broadcast, reduce and global combine are ....

V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.T. Ho, S. Kipnis, and M. Snir, "CCL: a portable and tunable collective communication library for scalable parallel computers", IEEE Trans. on Parallel and Distributed Systems , vol. 6, n. 2, Feb. 1995.


Randomized, Oblivious, Minimal Routing Algorithms for.. - Nesson (1995)   (Correct)

.... Machine Scientific Software Library (CMSSL) provided by Thinking Machines Corporation for its CM 200 and CM 5 systems [129] Recently, IBM has developed a portable library of communication primitives for its SP series of parallel computers, called the Collective Communication Library (CCL) [6]. Compiled Routing While customized libraries have many advantages, they have some weaknesses which directly affect the programmers who use them. Obviously, portability is a major issue if one plans to use an application on different parallel systems. However, even when confined to a single ....

V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.-T. Ho, S. Kipnis, and M. Snir, CCL: A Portable and Tunable Collective Communication Library for Parallel Computers, IEEE Trans. on Parallel and Distributed Systems, 6(2):154--164, February 1995.


First Year Report - Winstanley (1997)   (Correct)

....task creation or fault tolerance. There is a new version of the MPI standard, MPI 2, due to be released soon, will rectify many of these shortcomings. However, it will take time before the new standard is widely implemented. There are alternative communication libraries, such as Express and CCL [1]; these newer libraries are intended for SPMD programming and are designed specifically so that group communications are efficient. However, the stability and number of platforms supported by these libraries need to be investigated before committing to them instead of MPI. Although the communication ....

V Bala, J Bruck, R Cypher, P Elustondo, C T Ho, S Kipnis, and M Snir. CCL: A portable and tunable collective communication library for scalable parallel computers. IEEE Transactions on Parallel and Distributed Systems, 6(2):154--164, 1995.


Topology and Routing in Clusters: From Theory to Practice - Etsion, Raizman, Feitelson   (Correct)

....different communication patterns. component price PCI network interface 995. dual 8 way switch 2000. 3 foot cable 120. Table 1: List prices of Myrinet components as of November 1999. model [9] and for numerous algorithms for the implementation of collective communication primitives [10, 3, 4]. In contrast, we focus on general mechanisms that do not require detailed knowledge of network properties. Our work was inspired by well known theoretical work promoting ideas such as randomized routing, using multiple paths, and creating topologies with a high bisection bandwidth. We showed a ....

V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C-T. Ho, S. Kipnis, and M. Snir, "CCL: a portable and tunable collective communications library for scalable parallel computers". IEEE Trans. Parallel & Distributed Syst. 6(2), pp. 154--164, Feb 1995.


PCODE: An Efficient and Reliable Collective.. - Bruck, Dolev, Ho.. (1994)   Self-citation (Bruck Ho)   (Correct)

....bruck systems.caltech.edu Institute of CS Hebrew University Jerusalem, Israel dolev cs.huji.ac.il IBM Almaden Research Center 650 Harry Road San Jose, CA 95120 ho, strong almaden.ibm.com University of Maryland Institute of Advanced Computer Studies College Park, MD 20742 rimon umiacs.umd.edu Abstract Existing programming environments for clusters are typically built on top of a point to point communication layer (send and receive) over local area networks (LANs) and, as a result, suffer from poor performance in the collective communication part. For example, a ....

....via UDP broadcast. The experimental results we obtained indicate that the performance advantage of PCODE over the current point to point approach (TCP) can be as high as an order of magnitude on a cluster of 16 workstations. Supported in part by the NSF Young Investigator Award CCR 9457811, by a grant from the IBM Almaden Research Center, San Jose, California and by a grant from the AT&T Foundation. 1 Introduction Parallel computing on clusters of workstations and personal computers has very high potential, since it leverages existing hardware and software. In fact, there are ....

[Article contains additional citation context not shown here]

V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.T. Ho, S. Kipnis, and M. Snir, "CCL: A portable and tunable collective communication library for scalable parallel computers", International Parallel Processing Symposium, pp. 835-844, Cancun, Mexico, April 1994.


Scheduling Multiple Multicast for Heterogeneous Network of .. - Jan-Jan Wu Shih-Hsien (2000)   (Correct)

No context found.

V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.-T. Ho, S. Kipnis, and M. Snir. CCL: A portable and tunable collective communication library for scalable parallel computers. IEEE Transactions on Parallel and Distributed Systems, 6(2):154--164, February 1995.


Unresponsiveness-Tolerant Collective Communication - Pakin (2001)   (Correct)

No context found.

Vasanth Bala, Jehoshua Bruck, Robert Cypher, Pablo Elustondo, Alex Ho, Ching-Tien Ho, Shlomo Kipnis, and Marc Snir. CCL: A portable and tunable collective communication library for scalable parallel computers. IEEE Transactions on Parallel and Distributed Systems, 6(1):154--164, February 1995. Available from http://www.cs.jhu.edu/~cypher/pubs/ccl.ps.


Can Scatter Communication Benefit from Multidestination.. - Banikazemi, Panda   (Correct)

No context found.

V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.-T. Ho, S. Kipnis, and M. Snir. CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers. In Proceedings of the International Parallel Processing Symposium, 1994.


Scientific and Engineering Computation - Janusz Kowalik Editor   (Correct)

No context found.

J. Bruck, R. Cypher, P. Elustondo, A. Ho, C-T. Ho, V. Bala, S. Kipnis, and M. Snir. CCL: A portable and tunable collective communication library for scalable parallel computers. IEEE Trans. on Parallel and Distributed Systems, 6(2):154--164, 1995.


"Research sponsored in part by the Phillips Laboratory, .. - Under Cooperative..   (Correct)

No context found.

V. Bala; J. Bruck; R. Cypher; P. Elustondo; A. Ho; C-T. Ho; S. Kipnis; M. Snir, "CCL: A Portable and Tunable Collective Communication Library for Scalable Parallel Computers," IEEE Transactions on Parallel and Distributed Systems, Vol. 6, No. 2, Feb. 1995.

CiteSeer.IST - Copyright Penn State and NEC