  Egida: An extensible toolkit for low-overhead fault-tolerance (1999) [29 citations — 5 self]

by Sriram Rao, Lorenzo Alvisi, Harrick M. Vin
In Symposium on Fault-Tolerant Computing
http://www.cs.utexas.edu/ftp/pub/techreports/tr98-29.ps.Z

Abstract:

We discuss the design and implementation of Egida, an object-oriented toolkit designed to support transparent rollback-recovery. Egida exports a simple specification language that can be used to express arbitrary rollback-recovery protocols. From this specification, Egida automatically synthesizes an implementation of the specified protocol by gluing together the appropriate objects from an available library of "building blocks". Egida is extensible and facilitates rapid implementation of rollback-recovery protocols with minimal programming effort. We have integrated Egida with the MPICH implementation of the MPI standard. Existing MPI applications can take advantage of Egida without any modifications: fault-tolerance is achieved transparently, and all that is needed is a simple re-link of the MPI application with Egida. We demonstrate Egida's versatility both as a testbed and as an environment for developing new protocols by generating a few message logging protocols and evaluating their performance with a set of NAS benchmarks on a network of workstations.

Citations

617 Time, Clocks, and the Ordering of Events in a Distributed System – Lamport - 1978
594 The x-Kernel: An architecture for implementing network protocols – Hutchinson, Peterson - 1991
523 A high-performance, portable implementation of the MPI message passing interface standard – Gropp, Lusk, et al. - 1996
521 Virtual time and global states of distributed systems. Parallel and Distributed Algorithms – Mattern - 1989
460 Reliable communication in the presence of failures – Birman, Joseph - 1987
267 Optimistic recovery in distributed systems – Strom, Yemini - 1985
230 The Transis approach to high availability cluster communication – Dolev, Malkhi - 1996
216 Libckpt: Transparent checkpointing under Unix – Plank, Beck, et al. - 1995
201 Recovery in distributed systems using optimistic message logging and checkpointing – Johnson, Zwaenepoel - 1990
169 Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output commit – Elnozahy, Zwaenepoel - 1992
138 CoCheck: Checkpointing and Process Migration for MPI – Stellner - 1996
121 Sender-Based Message Logging – Johnson, Zwaenepoel - 1987
115 A message system supporting fault tolerance – Borg, Baumbach, et al. - 1983
96 Efficient distributed recovery using message logging – Sistla, Welch - 1989
93 Publishing: A reliable broadcast communication mechanism – Powell, Presotto - 1983
93 Design and Performance of Horus: A Lightweight Group Communications System – Renesse, Hickey, et al. - 1994
80 Fail-safe PVM: a portable package for distributed programming with transparent recovery – Leon, Fisher, et al.
70 Software implemented fault tolerance: Technologies and experience – Huang, Kintala - 1993
70 Volatile logging in n-fault-tolerant distributed systems – Strom, Bacon, et al. - 1988
64 Adaptive: A Dynamically Assembled Protocol Transformation, Integration, and Evaluation Environment. Concurrency: Practice and Experience – Schmidt, Box, et al. - 1993
63 Nonblocking and Orphan-Free Message Logging Protocols – Alvisi, Hoppe, et al. - 1993
63 A Configurable Membership Service – Hiltunen, Schlichting - 1998
44 Crash Recovery with Little Overhead – Juang, Venkatesan - 1991
40 User’s Guide for mpich, a Portable Implementation – Gropp, Lusk
38 How to Recover Efficiently and Asynchronously when Optimism Fails – Damani, Garg - 1996
37 MIST: PVM with transparent migration and checkpointing – Casas, Clark, et al. - 1995
32 Quarterware for Middleware – Singhai, Sane, et al. - 1998
21 MPI: The Complete Reference. Scientific and Engineering Computation Series – Snir, Otto, et al. - 1996
20 COMERA: COM Extensible Remoting Architecture – Wang, Lee - 1998
18 RENEW: A Tool for Fast and Efficient Implementation of Checkpoint Protocols – Neves, Fuchs - 1998
17 The cost of recovery in message logging protocols – Rao, Alvisi, et al.
14 Efficient Algorithms for Optimistic Crash Recovery – Venkatesan, Juang - 1994
12 An Object-Oriented Testbed for the Evaluation of Checkpointing and Recovery Systems – Ramamurthy, Upadhyaya, et al. - 1997
6 An Analysis of Communication-induced Checkpointing – Alvisi, Elnozahy, et al. - 1999
4 MPI: The Complete Reference. Scientific and Engineering Computation Series – Snir, Otto, et al. - 1996
2 Fail-safe PVM: A portable package for distributed programming with transparent recovery – Leon, Fisher, et al. - 1993
2 Hybrid Message Logging Protocols for Fast Recovery – Rao, Alvisi, et al. - 1998
2 RENEW: A Tool for Fast and Efficient Implementation of Checkpoint Protocols – Neves, Fuchs - 1998
1 Volatile Logging in n-Fault-Tolerant Distributed Systems – Strom, Bacon, et al. - 1988
1 How to Recover Efficiently and Asynchronously when Optimism Fails – Damani, Garg - 1996
1 A Configurable Membership Service – Hiltunen, Schlichting - 1998