Implementation and performance analysis of non-blocking collective operations for MPI
- SC07, 2007
Cited by 21 (12 self)
Collective operations and non-blocking point-to-point operations have always been part of MPI. Although non-blocking collective operations are an obvious extension to MPI, there have been no comprehensive studies of this functionality. In this paper we present LibNBC, a portable high-performance library for implementing non-blocking collective MPI communication operations. LibNBC provides non-blocking versions of all MPI collective operations, is layered on top of MPI-1, and is portable to nearly all parallel architectures. To measure the performance characteristics of our implementation, we also present a microbenchmark for measuring both latency and overlap of computation and communication. Experimental results demonstrate that the blocking performance of the collective operations in our library is comparable to that of collective operations in other high-performance MPI implementations. Our library introduces a very low overhead between the application and the underlying MPI and thus, in conjunction with the potential to overlap communication with computation, offers the potential for optimizing real-world applications.
A Case for Non-Blocking Collective Operations
- In Frontiers of High Performance Computing and Networking - ISPA 2006 Workshops, 2006
Cited by 6 (1 self)
Non-blocking collective operations for MPI have been in discussion for a long time. We want to contribute to this discussion and to give a rationale for the usage of these operations and assess their possible benefits. A LogGP model for the CPU overhead of collective algorithms and a benchmark to measure it are provided and show a large potential to overlap communication and computation. We show that non-blocking collective operations can provide at least the same benefits as non-blocking point-to-point operations already do. Our claim is that actual CPU overhead for non-blocking collective operations depends on the message size and the communicator size and benefits especially highly scalable applications with huge communicators. We prove that the share of the overhead in the overall communication time of current blocking collective operations gets smaller with bigger communicators and larger messages. We show that the user-level CPU overhead is less than 10% for MPICH2 and LAM/MPI using TCP/IP communication, which leads us to the conclusion that, by using non-blocking collective communication, ideally 90% idle CPU time can be freed for the application.
Parallel scaling of Teter’s minimization for Ab Initio calculations
Cited by 3 (3 self)
We propose a parallelization scheme for the conjugate gradient method by Teter et al. and report a detailed analysis of its scalability. We use MPI collective operations exclusively to take advantage of optimized collective implementations with possible hardware support. Our parallel conjugate gradient calculation can be applied in addition to the already implemented parallelism in the application ABINIT. We propose distribution schemes for the band vectors and the 3D-FFT, and provide both a detailed runtime and scalability analysis and a model for the used collective operations. We use this model of collective communication to predict the parallel scaling and to show that the scalability is mostly limited by the communication. Our code scales up to 52 processors for a small 43-atom system and up to 120 processors for a larger 86-atom system for a single k-point on our test cluster. Our results suggest that non-blocking collective communication could be used to improve the application running time, especially for cluster computers.
Programming Distributed Memory Systems Using OpenMP
Cited by 2 (0 self)
OpenMP has emerged as an important model and language extension for shared-memory parallel programming. On shared-memory platforms, OpenMP offers an intuitive, incremental approach to parallel programming. In this paper, we present techniques that extend the ease of shared-memory parallel programming in OpenMP to distributed-memory platforms as well. First, we describe a combined compile-time/runtime system that uses an underlying Software Distributed Shared Memory System and exploits repetitive data access behavior in both regular and irregular program sections. We present a compiler algorithm to detect such repetitive data references and an API to an underlying software distributed shared memory system to orchestrate the learning and proactive reuse of communication patterns. Second, we introduce a direct translation of standard OpenMP into MPI message-passing programs for execution on distributed memory systems. We present key concepts and describe techniques to analyze and efficiently handle both regular and irregular accesses to shared data. Finally, we evaluate the performance achieved by our approaches on representative OpenMP applications.
On the Interference of Communication on Computation in Java
- In Proceedings of the Third International Workshop on Performance Modeling, Evaluation and Optimization of Parallel and Distributed Systems (PMEO-PDS, 2004
CR-07: Ordonnancement 2012–2013 DM DM d’ordonnancement
Due December 9, 2012. This assignment covers the article “Interference Aware Scheduling” by B. Kreaseck,

