We discuss the design and implementation of Egida, an object-oriented toolkit designed to support transparent rollback-recovery. Egida exports a simple specification language that can be used to express arbitrary rollback-recovery protocols. From this specification, Egida automatically synthesizes an implementation of the specified protocol by gluing together the appropriate objects from an available library of "building blocks". Egida is extensible and facilitates rapid implementation of rollback-recovery protocols with minimal programming effort. We have integrated Egida with the MPICH implementation of the MPI standard. Existing MPI applications can take advantage of Egida without any modifications. Fault tolerance is achieved transparently: all that is needed is a simple re-link of the MPI application with Egida. We demonstrate Egida's versatility both as a testbed and as an environment for developing new protocols by generating a few message logging protocols and evaluating their performance with a set of NAS benchmarks on a network of workstations.
|
617
|
Time, Clocks, and the Ordering of Events in a Distributed System
– Lamport
- 1978
|
|
594
|
The x-Kernel: An architecture for implementing network protocols
– Hutchinson, Peterson
- 1991
|
|
523
|
A high-performance, portable implementation of the MPI message passing interface standard
– Gropp, Lusk, et al.
- 1996
|
|
521
|
Virtual time and global states of distributed systems. Parallel and Distributed Algorithms
– Mattern
- 1989
|
|
460
|
Reliable communication in the presence of failures
– Birman, Joseph
- 1987
|
|
267
|
Optimistic recovery in distributed systems
– Strom, Yemini
- 1985
|
|
230
|
The Transis approach to high availability cluster communication
– Dolev, Malkhi
- 1996
|
|
216
|
Libckpt: Transparent checkpointing under Unix
– Plank, Beck, et al.
- 1995
|
|
201
|
Recovery in distributed systems using optimistic message logging and checkpointing
– Johnson, Zwaenepoel
- 1990
|
|
169
|
Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output commit
– Elnozahy, Zwaenepoel
- 1992
|
|
138
|
CoCheck: Checkpointing and Process Migration for MPI
– Stellner
- 1996
|
|
121
|
Sender-Based Message Logging
– Johnson, Zwaenepoel
- 1987
|
|
115
|
A message system supporting fault tolerance
– Borg, Baumbach, et al.
- 1983
|
|
96
|
Efficient distributed recovery using message logging
– Sistla, Welch
- 1989
|
|
93
|
Publishing: A reliable broadcast communication mechanism
– Powell, Presotto
- 1983
|
|
93
|
Design and Performance of Horus: A Lightweight Group Communications System
– Renesse, Hickey, et al.
- 1994
|
|
80
|
Fail-safe PVM: a portable package for distributed programming with transparent recovery
– Leon, Fisher, et al.
- 1993
|
|
70
|
Software implemented fault tolerance: Technologies and experience
– Huang, Kintala
- 1993
|
|
70
|
Volatile logging in n-fault-tolerant distributed systems
– Strom, Bacon, et al.
- 1988
|
|
64
|
Adaptive: A Dynamically Assembled Protocol Transformation, Integration, and Evaluation Environment. Concurrency: Practice and Experience
– Schmidt, Box, et al.
- 1993
|
|
63
|
Nonblocking and Orphan-Free Message Logging Protocols
– Alvisi, Hoppe, et al.
- 1993
|
|
63
|
A Configurable Membership Service
– Hiltunen, Schlichting
- 1998
|
|
44
|
Crash Recovery with Little Overhead
– Juang, Venkatesan
- 1991
|
|
40
|
User’s Guide for mpich, a Portable Implementation
– Gropp, Lusk
|
|
38
|
How to Recover Efficiently and Asynchronously when Optimism Fails
– Damani, Garg
- 1996
|
|
37
|
MIST: PVM with transparent migration and checkpointing
– Casas, Clark, et al.
- 1995
|
|
32
|
Quarterware for Middleware
– Singhai, Sane, et al.
- 1998
|
|
21
|
MPI: The Complete Reference. Scientific and Engineering Computation Series
– Snir, Otto, et al.
- 1996
|
|
20
|
COMERA: COM Extensible Remoting Architecture
– Wang, Lee
- 1998
|
|
18
|
RENEW: A Tool for Fast and Efficient Implementation of Checkpoint Protocols
– Neves, Fuchs
- 1998
|
|
17
|
The cost of recovery in message logging protocols
– Rao, Alvisi, et al.
|
|
14
|
Efficient Algorithms for Optimistic Crash Recovery
– Venkatesan, Juang
- 1994
|
|
12
|
An Object-Oriented Testbed for the Evaluation of Checkpointing and Recovery Systems
– Ramamurthy, Upadhyaya, et al.
- 1997
|
|
6
|
An Analysis of Communication-induced Checkpointing
– Alvisi, Elnozahy, et al.
- 1999
|
|
2
|
Hybrid Message Logging Protocols for Fast Recovery
– Rao, Alvisi, et al.
- 1998
|