  MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes (2002) [54 citations — 5 self]

by George Bosilca, Aurelien Bouteiller, Samir Djilali, Gilles Fedak, Cecile Germain, Thomas Herault, Vincent Neri, Anton Selikhov
In Supercomputing
http://www.sc2002.org/paperpdfs/pap.pap298.pdf

Abstract:

Global Computing platforms, large-scale clusters and future TeraGRID systems gather thousands of nodes to run parallel scientific applications. At this scale, node failures and disconnections are frequent events. This volatility reduces the MTBF of the whole system to the range of hours or minutes. We present MPICH-V, an automatic volatility-tolerant MPI environment based on uncoordinated checkpoint/rollback and distributed message logging. The MPICH-V architecture relies on Channel Memories, Checkpoint Servers and theoretically proven protocols to execute existing or new SPMD and Master-Worker MPI applications on volatile nodes. To evaluate its capabilities, we run MPICH-V within a framework in which the number of nodes, Channel Memories and Checkpoint Servers, as well as the node volatility, can be fully configured. We present a detailed performance evaluation of every component of MPICH-V and of its global performance for non-trivial parallel applications. Experimental results demonstrate good scalability and high tolerance to node volatility.
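The combination of uncoordinated checkpointing and message logging named in the abstract can be illustrated with a toy, single-process simulation. This is only a sketch of the general idea, not the MPICH-V protocol or its API: the class names `ChannelMemory` and `Worker`, their methods, and the integer "messages" are all hypothetical, and the checkpoint is held in a local variable rather than on a real checkpoint server. The assumption is that a stable store keeps every message addressed to a node, so a node restarted from an old checkpoint can re-read the messages it had already consumed after that checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class ChannelMemory:
    """Hypothetical stable store that logs every message, per receiver."""
    log: dict = field(default_factory=dict)

    def put(self, dest, payload):
        # Append the message to the receiver's log; nothing is ever discarded.
        self.log.setdefault(dest, []).append(payload)

    def get(self, dest, index):
        # Return the index-th message ever sent to dest, or None if absent.
        msgs = self.log.get(dest, [])
        return msgs[index] if index < len(msgs) else None


@dataclass
class Worker:
    """Hypothetical volatile node whose state is a running sum."""
    rank: int
    total: int = 0      # application state
    consumed: int = 0   # how many logged messages have been received

    def receive(self, cm):
        # Pull the next unread message for this rank from the log.
        msg = cm.get(self.rank, self.consumed)
        if msg is not None:
            self.total += msg
            self.consumed += 1

    def checkpoint(self):
        # Taken independently of other nodes (uncoordinated).
        return (self.total, self.consumed)

    @classmethod
    def restart(cls, rank, snapshot):
        total, consumed = snapshot
        return cls(rank=rank, total=total, consumed=consumed)


def run_demo():
    cm = ChannelMemory()
    for value in (1, 2, 3, 4):
        cm.put(0, value)

    w = Worker(rank=0)
    w.receive(cm)
    w.receive(cm)
    snapshot = w.checkpoint()   # state after two messages
    w.receive(cm)               # this progress is lost in the "crash"

    # Simulated crash: the worker object is discarded and rebuilt from
    # the checkpoint, then replays the logged messages it is missing.
    w = Worker.restart(0, snapshot)
    while w.consumed < 4:
        w.receive(cm)
    return snapshot, w.total
```

Calling `run_demo()` returns the saved snapshot together with the final sum; the restarted worker reaches the same sum it would have reached without the crash, because the third message is still available in the log even though it was consumed before the failure.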
