MPI-FM: High Performance MPI on Workstation Clusters
Journal of Parallel and Distributed Computing, 1997
Cited by 71 (13 self)
Despite the emergence of high speed LANs, the communication performance available to applications on workstation clusters still falls short of that available on MPPs. A new generation of efficient messaging layers is needed to take advantage of the hardware performance and to deliver it to the application level. Communication software is the key element in bridging the communication performance gap separating MPPs and workstation clusters. MPI-FM is a high performance implementation of MPI for networks of workstations connected with a Myrinet network, built on top of the Fast Messages (FM) library. Based on the FM version 1.1 released in Fall 1995, MPI-FM achieves a minimum one-way latency of 19 µs and a peak bandwidth of 17.3 MByte/s with common MPI send and receive function calls. A direct comparison using published performance figures shows that MPI-FM running on SPARCstation 20 workstations connected with a relatively inexpensive Myrinet network outperforms the MPI implementations a...
Low-Latency Communication over ATM Networks using Active Messages
IEEE Micro, 1995
Cited by 68 (0 self)
Recent developments in communication architectures for parallel machines have made significant progress and reduced the communication overheads and latencies by over an order of magnitude as compared to earlier proposals. This paper examines whether these techniques can carry over to clusters of workstations connected by an ATM network even though clusters use standard operating system software, are equipped with network interfaces optimized for stream communication, do not allow direct protected user-level access to the network, and use networks without reliable transmission or flow control. In a first part, this paper describes the differences in communication characteristics between clusters of workstations built from standard hardware and software components and state-of-the-art multiprocessors. The lack of flow control and of operating system coordination affects the communication layer design significantly and requires larger buffers at each end than on multiprocessors. A second ...
Implementing Multidestination Worms in Switch-Based Parallel Systems: Architectural Alternatives and their Impact
1997
Cited by 24 (8 self)
Multidestination message passing has been proposed as an attractive mechanism for efficiently implementing multicast and other collective operations on direct networks. However, applying this mechanism to switch-based parallel systems is non-trivial. In this paper we propose alternative switch architectures with differing buffer organizations to implement multidestination worms on switch-based parallel systems. First, we discuss issues related to such implementation (deadlock-freedom, replication mechanisms, header encoding, and routing). Next, we demonstrate how an existing central-buffer-based switch architecture supporting unicast message passing can be enhanced to accommodate multidestination message passing. Similarly, implementing multidestination worms on an input-buffer-based switch architecture is discussed. Both of these implementations are evaluated against each other as well as against a software-based scheme using the central buffer organization. Simulation experiments und...
Efficient Broadcast and Multicast on Multistage Interconnection Networks using Multiport Encoding
In Proceedings of the 8th IEEE Symposium on Parallel and Distributed Processing, 1996
Cited by 18 (10 self)
This paper proposes a new approach for implementing fast multicast and broadcast in unidirectional and bidirectional multistage interconnection networks (MINs) with multiport encoded multidestination worms. For a MIN with n stages such worms use n header flits each. One flit is used for each stage of the network and it indicates the output ports to which a multicast message needs to be replicated. A multiport encoded worm with $(d_1, d_2, \ldots, d_n;\ 1 \le d_i \le k)$ degrees of replication for the respective stages is capable of covering $(d_1 \times d_2 \times \cdots \times d_n)$ destinations with a single communication start-up. In this paper a switch architecture is proposed for implementing multidestination worms without deadlock. Three grouping algorithms of varying complexity are presented to derive the associated multiport encoded worms for a multicast to an arbitrary set of destinations. Using these worms a multinomial tree-based scheme is proposed to implement the multicast. This s...
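As a rough illustration of the coverage claim in this abstract, the sketch below multiplies the per-stage replication degrees of a single worm. The function name and the example values are hypothetical, and k is assumed here to be the upper bound on each degree (for instance, the number of switch output ports); none of this is taken from the paper itself.

```python
from math import prod


def destinations_covered(degrees, k):
    """Destinations reached by one worm, given one replication degree per stage.

    Assumption: each degree d_i must satisfy 1 <= d_i <= k.
    """
    if any(not 1 <= d <= k for d in degrees):
        raise ValueError("each degree must satisfy 1 <= d_i <= k")
    # Coverage is the product of the per-stage replication degrees.
    return prod(degrees)


# Hypothetical 3-stage network with k = 4: degrees 2, 4 and 1.
print(destinations_covered([2, 4, 1], k=4))  # 8
```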
HIPIQS: A High-Performance Switch Architecture using Input Queuing
In Proceedings of the 12th International Parallel Processing Symposium, 1998
Cited by 17 (3 self)
Switch-based interconnects are used in a number of application domains including parallel system interconnects, local area networks, and wide area networks. However, very few switches have been designed that are suitable for more than one of these application domains. Such a switch must offer both extremely low latency and very high throughput for a variety of different message sizes. While some architectures with output queuing have been shown to perform extremely well in terms of throughput, their performance can suffer when used in systems where a significant portion of the packets are extremely small. On the other hand, architectures with input queuing offer limited throughput, or require fairly complex and centralized arbitration that increases latency. In this paper we present a new input queue-based switch architecture called HIPIQS (HIgh-Performance Input-Queued Switch). It offers low latency for a range of message sizes, and provides throughput comparable to that of output qu...
Selected problems of scheduling tasks in multiprocessor computing systems
PhD thesis, Instytut Informatyki Politechnika Poznanska, 1997
MPI-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems
2001
Cited by 12 (1 self)
The IBM RS/6000 SP system is one of the most cost-effective commercially available high performance machines. IBM RS/6000 SP systems support the Message Passing Interface standard (MPI) and LAPI. LAPI is a low level, reliable and efficient one sided communication API library, implemented on IBM RS/6000 SP systems. This paper explains how the high performance of the LAPI library has been exploited in order to implement the MPI standard more efficiently than the existing MPI. It describes how to avoid unnecessary data copies at both the sending and receiving sides for such an implementation. The resolution of problems arising from the mismatches between the requirements of the MPI standard and the features of LAPI is discussed. As a result of this exercise, certain enhancements to LAPI are identified to enable an efficient implementation of MPI on LAPI. The performance of the new implementation of MPI is compared with that of the underlying LAPI itself. The latency (in polling and interr...
Global Reduction in Wormhole k-ary n-cube Networks with Multidestination Exchange Worms
In International Parallel Processing Symposium, 1994
Cited by 11 (11 self)
This paper presents a new approach to implement global reduction operations (including barrier synchronization) in wormhole k-ary n-cubes. The novelty lies in using multidestination message passing mechanism instead of single destination (unicast) messages. Using pairwise exchange worms along each dimension, it is shown that global reduction and barrier synchronization operations, as defined by the Message Passing Interface (MPI) standard, can be implemented with n communication start-ups as compared to $2n\lceil \log_2 k \rceil$ start-ups required with unicast-based message passing. This leads to an asymptotic improvement by a factor of $\lceil 2\log_2 k \rceil$. For different values of communication start-up time and system size, the analysis indicates that the new framework can implement fast global reduction compared to the unicast-based scheme. Issues related to the new approach are studied and the required architectural modifications to the router interface are presented. The analysis indicates that as the...
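To make the start-up counts in this abstract concrete, the sketch below tabulates both schemes for one network size. It reads the unicast count as 2n⌈log2 k⌉, which is an interpretation of the abstract's figure; the function names and the example values of k and n are hypothetical.

```python
def ceil_log2(k):
    """Integer ceiling of log2(k) for k >= 1, avoiding floating point."""
    return (k - 1).bit_length()


def startups_unicast(k, n):
    """Start-ups with unicast messages, read as 2 * n * ceil(log2 k)."""
    return 2 * n * ceil_log2(k)


def startups_multidestination(n):
    """Start-ups with pairwise exchange worms: one per dimension."""
    return n


# Hypothetical 16-ary 2-cube (k = 16, n = 2).
k, n = 16, 2
print(startups_unicast(k, n))        # 16
print(startups_multidestination(n))  # 2
```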
Performance of Multistage Bus Networks for a Distributed Shared Memory Multiprocessor
1997
Cited by 10 (3 self)
A Multistage Bus Network (MBN) is proposed in this paper to overcome some of the shortcomings of the conventional multistage interconnection networks (MINs), single bus and hierarchical bus interconnection networks. The MBN consists of multiple stages of buses connected in a manner similar to the MINs and has the same bandwidth at each stage. A switch in an MBN is similar to that in a MIN switch except that there is a single bus connection instead of a crossbar. MBNs support bidirectional routing and there exists a number of paths between any source and destination pair. In this paper we develop self routing techniques for the various paths, present an algorithm to route a request along the path with minimum distance and analyze the probabilities of a packet taking different routes. Further, we derive a performance analysis of a synchronous packet-switched MBN in a distributed shared memory environment and compare the results with those of an equivalent bidirectional MIN (BMIN). Finall...

