P. K. McKinley and D. F. Robinson. Collective Communication in Wormhole-Routed Massively Parallel Computers. IEEE Computer, pages 39--50, Dec 1995.
....elements through interprocessor communication, which can be either one-to-one (unicast communication) or could involve a group of processors (collective communication). Unicast communication is concerned with sending a message from a source node to one destination. Collective communication [1] involves a group of processing nodes that intercommunicate in a specific manner. Examples of collective communication primitives are barrier synchronization, broadcast, gather, scatter, all-gather, all-to-all, global reduction, and scan. Because of the nature of parallel programming, which ....
P. K. McKinley, Y. Tsai, and D. F. Robinson, "Collective communication in wormhole-routed massively parallel computers," IEEE Computer, vol. 28, pp. 39--50, December 1995.
....blocks in a variety of parallel algorithms. Proper implementation of these collective communication operations is vital to the efficient execution of the parallel algorithms that use them. Collective communication for homogeneous parallel environments has been thoroughly researched over the years [BBC94, BGP94, MR95]. Collective operations designed for traditional parallel machines are not adequate for heterogeneous environments. As a result, we design and analyze six collective communication algorithms (gather, scatter, reduction, prefix sums, one-to-all broadcast, and all-to-all broadcast) for ....
....super^i-steps. Balancing these objectives is a nontrivial task. Nevertheless, HBSP^k provides guidance on how to design efficient heterogeneous programs. 4.3 HBSP^k Collective Communication Algorithms. Collective communication plays an important role in the development of parallel programs [BBC94, BGP94, MR95]. It simplifies the programming task, facilitates the implementation of efficient communication schemes, and promotes portability. In the following subsections, we design six HBSP^k collective communication operations: gather, scatter, reduction, prefix sums, one-to-all broadcast, and ....
Philip McKinley and David Robinson. "Collective Communication in Wormhole-Routed Massively Parallel Computers." IEEE Computer, 28(12):39--50, December 1995.
....addition, a rich set of collective operations (such as broadcast, multicast, global reduction, scatter, gather, complete exchange, and barrier synchronization) has been defined in the MPI standard. Collective communication and synchronization operations are frequently used in parallel applications [3, 5, 7, 13, 15, 18, 21]. Therefore, it is important that these operations are implemented on a given platform in the most efficient manner. Many research projects are currently focusing on improving the performance of point-to-point [17, 23] and collective [11, 16] communication operations on NOWs. However, none of ....
P. K. McKinley and D. F. Robinson. Collective Communication in Wormhole-Routed Massively Parallel Computers. IEEE Computer, pages 39--50, Dec 1995.
....connectivity in order to allow a message from any input port to be transferred to any output port. As in most system configurations, we assume a simple node switch connection with a one-port model, in which a switch (router) is connected to the local node via a pair of input and output channels [40]. The switch is responsible for entering, leaving, passing, and replicating messages. The crossbar connectivity within the switch allows simultaneous transmission of messages between different input and output channels. Hardware support for barrier synchronization is the barrier registers within ....
P. K. McKinley, Y.-J. Tsai, and D. F. Robinson, "Collective Communication in Wormhole-Routed Massively Parallel Computers," IEEE Computer, Vol. 28, No. 12, pp. 39-50, Dec. 1995.
....the BTM scheme is presented in Section 4. Conclusions and future work are discussed in Section 5. 2 Barrier Tree for Meshes (BTM). The proposed BTM scheme assumes a wormhole-capable [7, 8, 9] 2-D mesh network, where a router is connected to the local node via a pair of input and output channels [6]. Hardware support for barrier synchronization is provided using barrier registers within the routers. A similar concept has been assumed in most hardware-supported synchronization schemes [2, 3, 4]. We assume each barrier register can hold an entire synchronization message. This can be justified by ....
P. K. McKinley, Y.-J. Tsai, and D. F. Robinson, "Collective Communication in Wormhole-Routed Massively Parallel Computers," IEEE Computer, Vol. 28, No. 12, pp. 39-50, Dec. 1995.
....the BTM scheme is presented in Section 4. Conclusions and future work are discussed in Section 5. 2 Barrier Tree for Meshes (BTM). The proposed BTM scheme assumes a wormhole-capable [7, 8, 9] 2-D mesh network, where a router is connected to the local node via a pair of input and output channels [6]. Hardware support for barrier synchronization is provided using barrier registers within the routers. A similar concept has been assumed in most hardware-supported synchronization schemes [2, 3, 4]. We assume each barrier register can hold an entire synchronization message. This can be justified by ....
P. K. McKinley, Y.-J. Tsai, and D. F. Robinson, "Collective Communication in Wormhole-Routed Massively Parallel Computers," IEEE Computer, Vol. 28, No. 12, pp. 39-50, Dec. 1995.
....in current and future clusters can deliver significant performance benefits to the applications. 1 Introduction. Barrier synchronization is a common operation in parallel and distributed systems. It is a common synchronization operation in both message passing as well as shared memory systems [2, 11, 14, 10, 15, 13, 6]. An efficient implementation is important because during the operation, generally no computation can be .... (Footnote: This research is supported in part by an NSF Career Award MIP9502294, NSF Grants CCR 9704512 and EIA 9986052, an Ameritech Faculty Fellow Award, and grants from the Ohio Board of Regents.)
P. K. McKinley and D. F. Robinson. Collective Communication in Wormhole-Routed Massively Parallel Computers. IEEE Computer, pages 39--50, Dec 1995.
....the processing speed. In most contemporary parallel systems, low-dimensional wormhole-routed k-ary n-cube networks are adopted due to their low communication latency and high bandwidth [1]. Collective communication, which involves a group of intercommunicating nodes, is useful in various applications [2]. Several of these collective communications can be supported through an efficient implementation of multicast communication. Multicast communication is concerned with the delivery of a message from one source node to multiple destinations. The performance of multicast communications is ....
P. K. McKinley, Y. Tsai, and D. F. Robinson, "Collective communication in wormhole-routed massively parallel computers," IEEE Computer, vol. 28, pp. 39--50, Dec. 1995.
....Proper implementation of these basic communication operations is a key to the performance of parallel computers. Therefore, there has been a great deal of interest in their design and the study of their performance. Excellent surveys on collective communication algorithms can be found in [90, 53, 61]. Collective communication operations can be used for data movement, process control, or global operations. Data movement operations include broadcasting, multicasting, scattering, gathering, multinode broadcasting, and total exchange. Barrier synchronization is a type of process control. Global ....
.... and Dimopoulos have shown how total exchange can be done in Cayley graphs [41]. They have also presented collective communication algorithms on binary fat trees [42]. McKinley and his colleagues have surveyed collective communications on hypercubes, meshes, and tori in wormhole-routed networks [90]. Recently, Banikazemi and others have proposed efficient broadcasting and multicasting algorithms using communication capabilities of heterogeneous networks of workstations [15]. In the context of optical interconnection networks, Berthome and Ferreira [20, 21] have presented broadcasting and ....
P. K. McKinley and D. F. Robinson, "Collective Communication in Wormhole-Routed Massively Parallel Computers", IEEE Computer, December 1995, pp. 39-50.
....(such as the deadlock-free cut-through routing of packets [1, 31, 33]) are therefore being revisited for such irregular topologies. Collective communication is an important type of communication operation in parallel systems, and involves communication among groups of (3 or more) processes [20, 26]. Examples of collective operations include multicast [24], barrier synchronization [36], reduction, etc. The importance of such operations is underlined by the inclusion of several primitives for collective communication in the newly drafted Message Passing Interface (MPI) standard [23]. Such ....
....Efficient multicast algorithms are typically hierarchical in nature. This means that some destinations serve as intermediate sources, i.e., when they receive a message, they forward copies of it to other destinations. Many such hierarchical algorithms have been proposed in the literature [13, 20, 21, 30] to implement multicast. Figure 3 shows an example of a multicast from a source node to seven other destinations. In the figure, the numbers in brackets indicate the step numbers. [Figure 3: Example of a hierarchical multicast algorithm on a destination ....]
P. K. McKinley and D. F. Robinson. Collective Communication in Wormhole-Routed Massively Parallel Computers. IEEE Computer, pages 39--50, Dec 1995.
....paradigms require fast implementation of multicast and broadcast operations in order to support various application and system level data distribution functions. Multicast and broadcast also get used for other collective communication operations like barrier synchronization and global combining [24, 30]. Since broadcast is a special case of multicast (multicast to all nodes in the system) we will consider multicast for the remainder of this paper. However, it must be noted that all the developed algorithms and theories in this paper apply to broadcast as well. Current generation parallel ....
....nodes to implement contention-free multicast with a binomial-tree-based message pattern on an arbitrary irregular network with the UD routing scheme discussed in Section 2.3. 3.1 Contention-Free Multicast with Ordered Chain. Typically, binomial-tree-based algorithms have been used in the literature [24, 25] to implement multicast on meshes, tori, and hypercubes with an optimal number of communication start-ups (steps). Such an approach requires $\lceil \log_2 (d+1) \rceil$ communication steps for a multicast with d destinations. Besides the number of start-ups, an important factor which affects the overall multicast ....
P. K. McKinley and D. F. Robinson. Collective Communication in Wormhole-Routed Massively Parallel Computers. IEEE Computer, pages 39--50, Dec 1995.
....paradigms require fast implementation of multicast and broadcast operations in order to support various application and system-level data distribution functions. Multicast and broadcast are also used for other collective communication operations like barrier synchronization and global combining [7, 12]. Current generation parallel systems like IBM SP2 [15], Intel Paragon, Cray T3D, Ncube 3, J Machine, and Stanford FLASH use the wormhole-routing switching technique due to its inherent advantages like low-latency communication and reduced communication hardware overhead [10]. These systems use ....
....of nodes to implement contention-free multicast with a binomial-tree-based message pattern on an arbitrary irregular network with the routing scheme discussed in Section 2.3. 3.1 Contention-Free Multicast with Ordered Chain. Typically, binomial-tree-based algorithms have been used in the literature [7, 8] to implement multicast on meshes, tori, and hypercubes with an optimal number of communication start-ups (steps). Such an approach requires $\lceil \log_2 (d+1) \rceil$ communication steps for a multicast with d destinations. Besides the number of start-ups, an important factor which affects the overall multicast ....
P. K. McKinley and D. F. Robinson. Collective Communication in Wormhole-Routed Massively Parallel Computers. IEEE Computer, pages 39--50, Dec 1995.
....(such as the deadlock-free cut-through routing of packets [1, 36, 38]) are therefore being revisited for such irregular topologies. Collective communication is an important type of communication operation in parallel systems, and involves communication among groups of (2 or more) processes [24, 30]. Examples of collective operations include multicast [28], barrier synchronization [44], reduction, etc. The importance of these operations is underlined by the inclusion of several primitives for collective communication in the Message Passing Interface (MPI) standard [27]. Some collective ....
....Efficient multicast algorithms are typically hierarchical in nature. This means that some destinations serve as intermediate sources, i.e., when they receive a message, they forward copies of it to other destinations. Many such hierarchical algorithms have been proposed in the literature [16, 24, 25, 35] to implement multicast. Figure 3 shows an example of a multicast from a source node to seven other destinations. In the figure, the numbers in brackets indicate the step numbers. It can be easily observed that $\lceil \log_2 (n+1) \rceil$ communication steps are required for such a binomial-tree-based ....
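Editorial illustration (not part of any cited paper): the hierarchical multicast described in the context above can be sketched in a few lines of Python. The sketch assumes an idealized one-port model in which every node that already holds the message forwards it to exactly one new destination per step; it ignores channel contention and the destination ordering that the cited algorithms rely on. The function names binomial_multicast_schedule and expected_steps are hypothetical. Under these assumptions the number of steps for n destinations is ceil(log2(n + 1)), so a source with seven destinations needs three steps.

```python
def binomial_multicast_schedule(source, destinations):
    """Return a list of steps; each step is a list of (sender, receiver) pairs.

    Assumption: every current holder of the message sends to one new
    destination per step, so the number of holders doubles each step.
    """
    holders = [source]            # nodes that already have the message
    pending = list(destinations)  # nodes still waiting for it
    steps = []
    while pending:
        step = []
        # Snapshot the holders: nodes reached in this step start
        # forwarding only in the next step.
        for sender in list(holders):
            if not pending:
                break
            receiver = pending.pop(0)
            step.append((sender, receiver))
            holders.append(receiver)
        steps.append(step)
    return steps


def expected_steps(n):
    """ceil(log2(n + 1)) for n >= 0, computed with integer arithmetic."""
    return n.bit_length()


if __name__ == "__main__":
    schedule = binomial_multicast_schedule("src", [f"d{i}" for i in range(1, 8)])
    for number, step in enumerate(schedule, start=1):
        print(f"step {number}: {step}")
```

The integer form n.bit_length() is used in place of a floating-point logarithm so that the count stays exact for large n.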
P. K. McKinley and D. F. Robinson. Collective Communication in Wormhole-Routed Massively Parallel Computers. IEEE Computer, pages 39--50, Dec 1995.
....program. The standard packet size on which the following investigations are based is B_max = 1728 bytes. The packets are sent to the destination one after another in a pipelined way. All packets are transmitted along the same path in the interconnection network. The path is determined by XY routing [29]. The transfer is implemented by wormhole routing [31, 29] such that the transfer time does not depend on the length of the path or on the distance between the sending and the receiving processor. Application programs can run on any subset of the processors. Different application programs can ....
....investigations are based is B_max = 1728 bytes. The packets are sent to the destination one after another in a pipelined way. All packets are transmitted along the same path in the interconnection network. The path is determined by XY routing [29]. The transfer is implemented by wormhole routing [31, 29] such that the transfer time does not depend on the length of the path or on the distance between the sending and the receiving processor. Application programs can run on any subset of the processors. Different application programs can run on different disjoint subsets. These subsets need not ....
[Article contains additional citation context not shown here]
P.K. McKinley, Y. Tsai, and D.F. Robinson. Collective Communication in Wormhole-Routed Massively Parallel Computers. IEEE Computer, 28(12):39--50, 1995.
.... and the reduced communication hardware overhead [1]. These systems are often used to support distributed memory or distributed shared memory programming paradigms, which require a fast implementation of broadcast, multicast, barrier synchronization, and other collective communication operations [2]. While the hardware has been improved, there remains a wide gap between CPU speed and communication times. As one of the point design teams to develop NSF-sponsored Petaflop supercomputers [3, 4], our research group was challenged by the need to reduce the communication overhead required to ....
P. K. McKinley, Y. Tsai, and D. F. Robinson, "Collective communication in wormhole-routed massively parallel computers," IEEE Computer, vol. 28, pp. 39--50, December 1995.
....is increased, the software overhead of the node is more important to the system performance than the hardware latency of the network. A multidestination message-passing mechanism reduces the processor overhead as well as the network latency, but it necessitates hardware support at the router [5]. This leads us to ask the following question: is it possible to provide a new mechanism to deliver a message to multiple destinations with reduced software overhead by adding little hardware support to the router? In this paper, we propose a novel tree-based multidestination multicast scheme named ....
....because unicast messages can always be implemented under this scheme as a subset operation with only one destination. In order to support multidestination message passing, the router interface needs to have logic to concurrently absorb as well as forward flits at intermediate destination routers [5]. Current wormhole routers contain logic to absorb and forward flits [8]. Only a small amount of logic is required for routers to provide these features concurrently. To support the destination encoding in the header, each router also needs a small amount of additional logic to strip away a header ....
P. K. McKinley and D. F. Robinson, "Collective Communication in Wormhole-Routed Massively Parallel Computers," IEEE Computer, pp. 39-50, 1995.
....of communication primitives that can be used within basic modules includes single to single transfers and collective communication operations. Collective communication operations can be classified into three types according to their purpose: data movement, global computations and execution control [18]. Data movement operations include single broadcast, multi broadcast, scatter, gather, and total exchange operations. Global computations include both reduction and scan operations. The most common execution control operation is a synchronization operation. For each of these operations, a formula ....
P.K. McKinley, Y. Tsai, and D.F. Robinson. Collective Communication in Wormhole-Routed Massively Parallel Computers. IEEE Computer, 28(12):39--50, 1995.
....is one-to-one (unicast). Recent research has put much attention on collective communication, which incurs denser and heavier traffic on the network. Examples include one-to-all (broadcast), one-to-many (multicast), and all-to-all communications, and a large amount of work can be found in [2, 4, 5, 7, 11, 15, 16, 17, 18, 20, 22, 30]. Messages to be sent can be further classified as non-personalized (wherein all receivers will receive the same message from the same source) and personalized (wherein each receiver will receive a different message from the same source). Some of these communication patterns have also been implemented in ....
P. K. McKinley, Y.-J. Tsai, and D. F. Robinson. Collective communication in wormhole-routed massively parallel computers. IEEE Computer, 28(12):39-50, Dec. 1995.
....as the reference architecture in this paper. The importance of multicast communication, where the same message is delivered from a source node to an arbitrary number of destination nodes, is well understood. Its applications range from the implementation of collective communication operations [19], directly supported by systems like MPI [11], to a variety of system-level functions, such as barrier synchronization and cache coherence in distributed shared memory systems, among others. In spite of that, however, wormhole switches currently support only unicast (one-to-one) communication in ....
P. McKinley, Y. Tsai, and D. Robinson. Collective Communication in Wormhole-Routed Massively Parallel Computers. IEEE Computer, pages 39--50, December 1995.
....to reduce the cache coherence overhead. We assume that there are n processor nodes having the shared readable copies, i.e., there are n destination nodes in one cache invalidation transaction. The basic communication model of write-invalidate cache coherence protocols is collective communication [12], i.e., sending one-to-n messages from the source node (home node) to a set of destination nodes and collecting n-to-one messages from a set of nodes to the source node. In the interconnection network literature, the former communication pattern is known as multicast and the latter pattern as ....
....algorithm on the source node to find the next destinations first. Then those destination nodes apply the algorithm described in Section 2.2 to the dimension-order chain. So, at any time, the dimension order is maintained in every message, which is guaranteed to be arc-disjoint and contention-free [12]. Therefore, the TBM scheme is contention-free. Lemma 2: The TBM scheme will be deadlock-free if the underlying multidestination message-passing mechanism is deadlock-free. Proof: Lemma 2 is obviously correct. In [14], D. K. Panda has proved that in a k-ary n-cube system with the e-cube routing algorithm ....
P. K. McKinley and D. F. Robinson, "Collective Communication in Wormhole-Routed Massively Parallel Computers," IEEE Computer, pp. 39-50, 1995.
....to reduce the cache coherence overhead. We assume that there are n processor nodes having the shared readable copies, i.e., there are n destination nodes in one cache invalidation transaction. The basic communication model of write-invalidate cache coherence protocols is collective communication [12], i.e., sending one-to-n messages from the source node (home node) to a set of destination nodes and collecting n-to-one messages from a set of nodes to the source node. In the interconnection network literature, the former communication pattern is known as multicast and the latter pattern as ....
....algorithm on the source node to find the next destinations first. Then those destination nodes apply the algorithm described in Section 2.2 to the dimension-order chain. So, at any time, the dimension order is maintained in every message, which is guaranteed to be arc-disjoint and contention-free [12]. Therefore, the TBM scheme is contention-free. Lemma 2: The TBM scheme will be deadlock-free if the underlying multidestination message-passing mechanism is deadlock-free. Proof: Lemma 2 is obviously correct. In [14], D. K. Panda has proved that in a k-ary n-cube system with the e-cube routing algorithm ....
P. K. McKinley and D. F. Robinson. Collective communication in wormhole-routed massively parallel computers. IEEE Computer, pages 39--50, December 1995.
....The circuit-switched model will be the only one considered in this paper since it is the most widely used in current multicomputers and in most of the literature on complete exchange algorithms. The port model defines the number of channels connecting every processor with its local router [McTR95]. In a one-port model, every node can send and/or receive only one message at the same time. In an all-port model, the node can send and/or receive at the same time as many messages as external links connect the node to the network (2C links in a C-dimensional mesh/torus) as long as every ....
P.K. McKinley, Y. Tsai and D.F. Robinson, Collective Communication in Wormhole-Routed Massively Parallel Computers, IEEE Computer, December 1995, pp. 39-50.
....paradigms require fast implementation of multicast and broadcast operations in order to support various application and system-level data distribution functions. Multicast and broadcast are also used for other collective communication operations like barrier synchronization and global combining [22, 28]. Current generation parallel systems like IBM SP2 [33], Intel Paragon [16], Cray T3D [7], Ncube 3 [9], J Machine [26], and Stanford FLASH [11] use the wormhole-routing switching technique due to its inherent advantages like low-latency communication and reduced communication hardware overhead ....
....of nodes to implement contention-free multicast with a binomial-tree-based message pattern on an arbitrary irregular network with the routing scheme discussed in Section 2.2. 3.1 Contention-Free Multicast with Ordered Chain. Typically, binomial-tree-based algorithms have been used in the literature [22, 23] to implement multicast on meshes, tori, and hypercubes with an optimal number of communication start-ups (steps). Such an approach requires $\lceil \log_2 (d+1) \rceil$ communication steps for a multicast with d destinations. Besides the number of start-ups, an important factor which affects the overall ....
P. K. McKinley and D. F. Robinson. Collective Communication in Wormhole-Routed Massively Parallel Computers. IEEE Computer, pages 39--50, Dec 1995.
No context found.
P. K. McKinley and D. F. Robinson. Collective Communication in Wormhole-Routed Massively Parallel Computers. IEEE Computer, pages 39--50, Dec 1995.
No context found.
P. K. McKinley, Y.-J. Tsai, and D. Robinson, "Collective Communication in Wormhole-Routed Massively Parallel Computers," IEEE Computer, pages 39--50, December 1995.