by George Bosilca, Aurelien Bouteiller, Samir Djilali, Gilles Fedak, Cecile Germain, Thomas Herault, Vincent Neri, Anton Selikhov
In Supercomputing
http://www.sc2002.org/paperpdfs/pap.pap298.pdf
Abstract:
Global Computing platforms, large-scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This Volatility reduces the MTBF of the whole system to the range of hours or minutes. We present MPICH-V, an automatic Volatility-tolerant MPI environment based on uncoordinated checkpoint/rollback and distributed message logging. The MPICH-V architecture relies on Channel Memories, Checkpoint Servers and theoretically proven protocols to execute existing or new SPMD and Master-Worker MPI applications on volatile nodes. To evaluate its capabilities, we run MPICH-V within a framework in which the number of nodes, Channel Memories and Checkpoint Servers, as well as the node Volatility, can be completely configured. We present a detailed performance evaluation of every component of MPICH-V and of its global performance for non-trivial parallel applications. Experimental results demonstrate good scalability and high tolerance to node Volatility.
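The claim that the system MTBF drops to hours or minutes at this scale can be illustrated with a minimal sketch. It assumes independent nodes with exponentially distributed failures, so that failure rates add and the system MTBF is the per-node MTBF divided by the node count. This model, the function name `system_mtbf_hours` and the sample figures are illustrative assumptions, not taken from the paper.

```python
def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Estimate the MTBF of a system in which any single node failure counts.

    Assumption (not from the paper): nodes fail independently with
    exponentially distributed lifetimes, so per-node failure rates add up
    and the system MTBF is node_mtbf / node_count.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if node_mtbf_hours <= 0:
        raise ValueError("node_mtbf_hours must be positive")
    return node_mtbf_hours / node_count


if __name__ == "__main__":
    # Hypothetical per-node MTBF of 1000 hours (roughly six weeks).
    for nodes in (10, 1000, 10000):
        hours = system_mtbf_hours(1000.0, nodes)
        print(f"{nodes:>6} nodes: system MTBF = {hours:g} h ({hours * 60:g} min)")
```

Under these assumptions, a node that fails once every 1000 hours yields a system-wide failure about once per hour at 1000 nodes, which is the regime the abstract describes.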