by Martin Schulz, Greg Bronevetsky, Rohit Fern, Daniel Marques, Keshav Pingali, Paul Stodghill
In Proceedings of SC2004
http://www.cs.cornell.edu/stodghil/papers/sc04.pdf
Abstract:
The running times of many computational science applications are much longer than the mean-time-to-failure of current high-performance computing platforms. Therefore, to run to completion, these applications must tolerate hardware failures. Checkpoint-and-restart (CPR) is the most commonly used scheme for accomplishing this: the state of the computation is saved periodically on stable storage, and when a hardware failure is detected, the computation is restarted from the most recently saved state. Most automatic CPR schemes in the literature can be classified as blocking, system-level checkpointing schemes because they take core-dump style snapshots of the computational state when all the processes are blocked at global barriers in the program. Unfortunately, a system that implements this style of checkpointing is tied to a particular platform; in addition, it cannot be used if there are no global barriers in the program. In our research project, we are exploring an alternative called non-blocking application-level checkpointing. In our approach, programs are transformed by a pre-processor so that they become self-checkpointing and self-restartable on any platform; there is also no assumption about the existence of global barriers in the code. In this paper, we describe our implementation of non-blocking application-level checkpointing. We present experimental results on both a Windows cluster and the Lemieux system at the Pittsburgh Supercomputing Center, and argue that these results demonstrate both the platform-independence and the scalability of our approach.
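The basic checkpoint-and-restart cycle described in the abstract (save state periodically, restart from the most recently saved state) can be illustrated with a minimal single-process sketch. This is not the authors' pre-processor or their non-blocking MPI protocol; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the simulated failure are all assumptions made for illustration only.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over path.

    The rename ensures a crash mid-write never corrupts the last good
    checkpoint (hypothetical helper, not from the paper).
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the most recently saved state, or a fresh one if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, every=10, fail_at=None):
    """Sum 0..n_steps-1, checkpointing every `every` steps.

    fail_at simulates a hardware failure by raising before that step runs.
    """
    state = load_checkpoint(path)  # resume if a checkpoint is present
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Here the application itself decides what constitutes its state (a step counter and a running total) rather than dumping the whole process image, which is the sense in which the sketch is application-level. Calling `run` again with the same path after a failure resumes from the last saved step instead of from step 0.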
Citations
1027 | Distributed Algorithms – Lynch - 1996
796 | Distributed snapshots: Determining global states of distributed systems – Chandy, Lamport - 1985
329 | A Survey of Rollback-Recovery Protocols in Message-Passing Systems – Elnozahy, Alvisi, et al. - 1999
185 | Direct Bulk-Synchronous Parallel Algorithms – Gerbessiotis, Valiant - 1992
162 | Manetho: Transparent rollback-recovery with low overhead, limited rollback, and fast output commit – Elnozahy, Zwaenepoel - 1992
129 | CoCheck: Checkpointing and process migration for MPI – Stellner - 1996
72 | Checkpoint and migration of UNIX processes in the Condor distributed processing system – Litzkow, Tannenbaum, et al. - 1997
65 | Compiler-assisted checkpointing – Beck, Plank, et al. - 1994
60 | Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations – Agbaria, Friedman - 1999
56 | FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World. EuroPVM/MPI User’s Group Meeting 2000 – Fagg, Dongarra - 2000
47 | Portable Checkpointing for Heterogeneous Architectures. Fault-Tolerant Parallel and Distributed Systems – Strumpen, Ramkumar - 1998
45 | Automated application-level checkpointing of MPI programs – Bronevetsky, Marques, et al. - 2003
38 | Application level fault tolerance in heterogeneous networks of workstations – Beguelin, Seligman, et al. - 1997
29 | HPL – a portable implementation of the high-performance Linpack benchmark for distributed-memory computers. http://www.netlib.org/benchmark/hpl – Petitet, Whaley, et al. - 2004
27 | Egida: An extensible toolkit for low-overhead fault-tolerance – Rao, Alvisi, et al. - 1999
25 | MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging – Bouteiller, Cappello, et al. - 2003
24 | CATCH – Compiler-Assisted Techniques for Checkpointing – Li, Fuchs - 1990
20 | Collective operations in an application-level fault tolerant MPI system – Bronevetsky, Marques, et al. - 2003
16 | Process Introspection: A Heterogeneous Checkpoint/Restart Mechanism Based on Automatic Code Modification – Ferrari, Chapin, et al. - 1997
14 | SRS: A framework for developing malleable and migratable parallel applications for distributed systems – Vadhiyar, Dongarra
7 | A checkpoint and recovery system for the Pittsburgh Supercomputing Center Terascale Computing System – Stone, Kochmar, et al. - 2001
2 | Source-code transformations for efficient reversibility – Perumalla, Fujimoto - 1999
1 | The smg2000 benchmark code. Available at http://www.llnl.gov/asci/purple/benchmarks/limited/smg – Carnes - 2001