by Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill
http://iss.cs.cornell.edu/Publications/Papers/PPoPP2003.pdf
Abstract:
Because of increasing hardware and software complexity, the running time of many computational science applications is now more than the mean-time-to-failure of high-performance computing platforms. Therefore, computational science applications need to tolerate hardware failures. In this paper, we focus on the stopping failure model in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach. In this paper, we present a suitable protocol, and show how it can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.
Citations
|
1027
|
Distributed Algorithms
– Lynch
- 1996
|
|
796
|
Distributed snapshots: Determining global states of distributed systems
– Chandy, Lamport
- 1985
|
|
329
|
A Survey of Rollback-Recovery Protocols in Message-Passing Systems
– Elnozahy, Alvisi, et al.
- 1999
|
|
209
|
Libckpt: Transparent checkpointing under Unix
– Plank, Beck, et al.
- 1995
|
|
162
|
Manetho: Transparent rollback-recovery with low overhead, limited rollback, and fast output commit
– Elnozahy, Zwaenepoel
- 1992
|
|
72
|
Checkpoint and migration of UNIX processes in the Condor distributed processing system
– Litzkow, Tannenbaum, et al.
- 1997
|
|
65
|
Compiler-assisted checkpointing
– Beck, Plank, et al.
- 1994
|
|
60
|
Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations
– Agbaria, Friedman
- 1999
|
|
47
|
Portable Checkpointing for Heterogeneous Architectures. Fault-Tolerant Parallel and Distributed Systems
– Strumpen, Ramkumar
- 1998
|
|
43
|
Efficient transparent optimistic rollback recovery for distributed application programs
– Johnson
- 1993
|
|
38
|
Application level fault tolerance in heterogeneous networks of workstations
– Beguelin, Seligman, et al.
- 1997
|
|
37
|
A Network-Failure-Tolerant Message-Passing System for Terascale Clusters
– Graham, Choi, et al.
- 2002
|
|
27
|
On scalable and efficient distributed failure detectors
– Gupta, Chandra, et al.
|
|
27
|
Egida: An extensible toolkit for low-overhead fault-tolerance
– Rao, Alvisi, et al.
- 1999
|
|
11
|
The use of the MPI communication library in the NAS Parallel Benchmark
– Tabe, Stout
- 1999
|
|
1
|
Blue Gene project overview. Online at http://www.research.ibm.com/bluegene
– Research
- 2002
|