Portals 3.0: Protocol Building Blocks for Low Overhead Communication
- in Proceedings of the 2002 Workshop on Communication Architecture for Clusters, 2002
- Cited by 38 (17 self)
This paper describes the evolution of the Portals message passing architecture and programming interface from its initial development on tightly-coupled massively parallel platforms to the current implementation running on a 1792-node commodity PC Linux cluster. Portals provides the basic building blocks needed for higher-level protocols to implement scalable, low-overhead communication. Portals has several unique characteristics that differentiate it from other high-performance system-area data movement layers. This paper discusses several of these features and illustrates how they can impact the scalability and performance of higher-level message passing protocols.
Optimizing bandwidth limited problems using one-sided communication and overlap
- In 20th International Parallel and Distributed Processing Symposium (IPDPS), 2006
- Cited by 32 (12 self)
This paper demonstrates that the one-sided communication used in languages like UPC can provide a significant performance advantage for bandwidth-limited applications. This is shown through communication microbenchmarks and a case study of UPC and MPI implementations of the NAS FT benchmark. Our optimizations rely on aggressively overlapping communication with computation, alleviating bottlenecks that typically occur when communication is isolated in a single phase. The new algorithms send more and smaller messages, yet the one-sided versions achieve a > 1.9× speedup over the base Fortran/MPI. Our one-sided versions show an average 15% improvement over the two-sided versions, due to the lower software overhead of one-sided communication, whose semantics are fundamentally lighter-weight than message passing. Our UPC results use Berkeley UPC with GASNet and demonstrate the scalability of that system, with performance approaching 0.5 TFlop/s on the FT benchmark with 512 processors.
VMI 2.0: A Dynamically Reconfigurable Messaging Layer for Availability, Usability, and Management
- Cited by 22 (0 self)
As system area networks (SANs) grow in size, and organizations pool their SANs over the wide area into even larger compute platforms (commonly known as "grids"), it becomes increasingly difficult both to manage and to exploit the available resources. The key issues in the space of grid computing are availability, reliability, and management. Availability is an issue, as network hardware is more likely to fail in a large network than in a small one. Usability is an issue, as different SANs use different networks, and inter-SAN communication frequently uses different networks from intra-SAN communication. And management is an issue, as it is more difficult to find and isolate problematic components of a large, heterogeneous system than a small, homogeneous one.
A New DMA Registration Strategy for Pinning-Based High Performance Networks
- 2003
- Cited by 19 (5 self)
This paper proposes a new memory registration strategy for supporting Remote DMA (RDMA) operations over pinning-based networks, as existing approaches are insufficient for efficiently implementing Global Address Space (GAS) languages. Although existing approaches often maximize bandwidth, they require levels of synchronization that discourage one-sided communication, and can have significant latency costs for small messages. The proposed Firehose algorithm attempts to expose one-sided, zero-copy communication as a common case, while minimizing the number of host-level synchronizations required to support remote memory operations. The basic idea is to reap the performance benefits of a Pin-Everything approach in the common case (without the drawbacks) and revert to a Rendezvous-based approach to handle the uncommon case. In all cases, the algorithm attempts to amortize the cost of synchronization and pinning over multiple remote memory operations, improving performance over Rendezvous by avoiding many handshaking messages and the cost of re-pinning recently used pages. Performance results are presented which demonstrate that the cost of two-sided handshaking and memory registration is negligible when the set of remotely referenced memory pages on a given node is smaller than the physical memory (where the entire working set can remain pinned), and for applications with larger working sets the performance degrades gracefully and consistently outperforms conventional approaches.
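The common-case/uncommon-case split described in this abstract can be illustrated with a toy model. The sketch below is not the Firehose algorithm itself: the class `PinCache`, the function `count_handshakes`, the LRU eviction policy, and the page-number trace are all assumptions made for illustration. It only shows how keeping recently used pages pinned lets repeated accesses skip the handshake.

```python
from collections import OrderedDict


class PinCache:
    """Toy model of a bounded set of pinned remote pages (hypothetical)."""

    def __init__(self, max_pinned):
        if max_pinned < 1:
            raise ValueError("max_pinned must be at least 1")
        self.max_pinned = max_pinned
        self.pinned = OrderedDict()  # page number -> True, least recent first
        self.handshakes = 0          # rendezvous-style round trips paid so far

    def access(self, page):
        """Return 'one-sided' if the page is already pinned, else 'rendezvous'."""
        if page in self.pinned:
            # Common case: page is pinned, so the transfer needs no handshake.
            self.pinned.move_to_end(page)
            return "one-sided"
        # Uncommon case: pay one handshake to pin the page, evicting the
        # least recently used page if the pin budget is exhausted.
        self.handshakes += 1
        if len(self.pinned) >= self.max_pinned:
            self.pinned.popitem(last=False)
        self.pinned[page] = True
        return "rendezvous"


def count_handshakes(trace, max_pinned):
    """Replay a sequence of page numbers and count the handshakes needed."""
    cache = PinCache(max_pinned)
    for page in trace:
        cache.access(page)
    return cache.handshakes
```

In this model, a trace whose pages all fit within the pin budget pays for a handshake only on the first touch of each page. How the real system behaves for larger working sets is described in the paper and is not modelled here.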
COMB: A portable benchmark suite for assessing MPI overlap
- IEEE Cluster, 2002
- Cited by 18 (6 self)
This paper describes a portable benchmark suite that assesses the ability of cluster networking hardware and software to overlap MPI communication and computation. The Communication Offload MPI-based Benchmark, or COMB, uses two different methods to characterize the ability of messages to make progress concurrently with computational processing on the host processor(s). COMB measures the relationship between overall MPI communication bandwidth and host CPU availability. In this paper, we describe the two different approaches used by the benchmark suite, and we present results from several systems. We demonstrate the utility of the suite by examining the results and comparing and contrasting different systems.
Protocols and Strategies for Optimizing Performance of Remote Memory Operations on Clusters
- In: Proc. Workshop Communication Architecture for Clusters (CAC02) of IPDPS’02, Ft, 2002
- Cited by 12 (2 self)
In this paper, we describe a software architecture for supporting remote memory operations on clusters with networks such as Myrinet or cLAN. When combined with protocols and strategies for efficient management of network and host resources, this architecture can both deliver high performance and match network protocols with the requirements of remote memory operations. The protocols and strategies address issues such as buffer memory consumption, management of GM tokens, dynamic memory registration, zero-copy data transfers and adaptive data streaming. For example, the adaptive data streaming technique bridges the performance gap between remote memory operations that target registered memory and those that use regular memory. Our approach relies on the standard unmodified system software and drivers for Myrinet and cLAN rather than on custom/alternative drivers and interfaces (e.g., AM [1], PM [2], BIP [3], and FM [4]) that replace the standard Myrinet Control Program (MCP) on the network interface card.
Netgauge: A Network Performance Measurement Framework
- Cited by 10 (5 self)
This paper introduces Netgauge, an extensible open-source framework for implementing network benchmarks. The structure of Netgauge abstracts and explicitly separates communication patterns from communication modules. As a result of this separation of concerns, new benchmark types and new network protocols can be added independently to Netgauge. We describe the rich set of pre-defined communication patterns and communication modules that are available in the current distribution. Benchmark results demonstrate the applicability of the current Netgauge distribution to different networks. An assortment of use-cases is used to investigate the implementation quality of selected protocols and protocol layers.
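The separation of communication patterns from communication modules mentioned in this abstract can be sketched in a few lines. The code below is an illustration only and does not reflect Netgauge's actual interfaces: `LoopbackModule`, `ping_pong`, and `run_benchmark` are hypothetical names, and the "network" is just an in-process queue.

```python
import time
from collections import deque


class LoopbackModule:
    """Hypothetical communication module: delivers messages to a local queue."""

    def __init__(self):
        self.inbox = deque()

    def send(self, payload):
        self.inbox.append(bytes(payload))

    def recv(self):
        return self.inbox.popleft()


def ping_pong(module, size, rounds):
    """Hypothetical pattern: mean seconds per send/recv pair of `size` bytes."""
    payload = bytes(size)
    start = time.perf_counter()
    for _ in range(rounds):
        module.send(payload)
        echoed = module.recv()
        if len(echoed) != size:
            raise RuntimeError("module returned a message of the wrong size")
    return (time.perf_counter() - start) / rounds


def run_benchmark(pattern, module, **params):
    """Combine any pattern with any module that offers send() and recv()."""
    return pattern(module, **params)
```

Because the pattern only calls `send` and `recv`, a module for a different transport could be substituted without touching `ping_pong`, and a new pattern could reuse `LoopbackModule` unchanged. A call such as `run_benchmark(ping_pong, LoopbackModule(), size=64, rounds=1000)` returns the measured mean time per round.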

