Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs (2004)

by Martin Schulz, Greg Bronevetsky, Rohit Fern, Daniel Marques, Keshav Pingali, Paul Stodghill
In Proceedings of SC2004
http://www.cs.cornell.edu/stodghil/papers/sc04.pdf

Abstract:

The running times of many computational science applications are much longer than the mean time to failure of current high-performance computing platforms. Therefore, to run to completion, these applications must tolerate hardware failures. Checkpoint-and-restart (CPR) is the most commonly used scheme for accomplishing this: the state of the computation is saved periodically on stable storage, and when a hardware failure is detected, the computation is restarted from the most recently saved state. Most automatic CPR schemes in the literature can be classified as blocking, system-level checkpointing schemes because they take core-dump-style snapshots of the computational state when all the processes are blocked at global barriers in the program. Unfortunately, a system that implements this style of checkpointing is tied to a particular platform; in addition, it cannot be used if there are no global barriers in the program. In our research project, we are exploring an alternative called non-blocking application-level checkpointing. In our approach, programs are transformed by a pre-processor so that they become self-checkpointing and self-restartable on any platform; there is also no assumption about the existence of global barriers in the code. In this paper, we describe our implementation of non-blocking application-level checkpointing. We present experimental results on both a Windows cluster and the Lemieux system at the Pittsburgh Supercomputing Center, and argue that these results demonstrate both the platform-independence and the scalability of our approach.
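
The basic checkpoint-and-restart idea described in the abstract (save state periodically to stable storage, restart from the most recent save after a failure) can be illustrated with a minimal single-process sketch. This is not the authors' pre-processor or their non-blocking MPI protocol; it only shows application-level state saving in its simplest form. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter used to simulate a failure, and the toy accumulator workload are all hypothetical and chosen for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the application state to disk (hypothetical helper).

    The state is written to a temporary file first and then renamed,
    so a crash during the write never corrupts the previous checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the most recently saved state, or None if there is none."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Toy computation: sum the integers 0 .. total_steps - 1.

    The application itself decides what state to save (here, the loop
    counter and the accumulator), which is what makes this an
    application-level rather than a system-level checkpoint.
    fail_at is an assumption of this sketch: it simulates a hardware
    failure just before the given step is executed.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` once with `fail_at` set raises an exception partway through; calling it again with the same checkpoint path resumes from the last saved step instead of from the beginning. A parallel MPI program additionally has to coordinate the saved states of all processes and account for in-flight messages, which is the problem the paper addresses and which this sketch does not attempt.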
