Egida: An extensible toolkit for low-overhead fault-tolerance
- In Symposium on Fault-Tolerant Computing
, 1999
Cited by 34 (4 self)
We discuss the design and implementation of Egida, an object-oriented toolkit designed to support transparent rollback-recovery. Egida exports a simple specification language that can be used to express arbitrary rollback-recovery protocols. From this specification, Egida automatically synthesizes an implementation of the specified protocol by gluing together the appropriate objects from an available library of “building blocks”. Egida is extensible and facilitates rapid implementation of rollback-recovery protocols with minimal programming effort. We have integrated Egida with the MPICH implementation of the MPI standard. Existing MPI applications can take advantage of Egida without any modifications: fault-tolerance is achieved transparently. All that is needed is a simple re-link of the MPI application with Egida.
Message Logging in Mobile Computing
, 1999
Cited by 24 (6 self)
Dependable mobile computing is enhanced by independent recovery, low power consumption and no dependence on stable storage at the mobile host. Existing recovery protocols proposed for mobile environments typically create consistent global checkpoints that do not guarantee independent recovery and low power consumption. This paper demonstrates the advantages of message logging by describing a receiver based logging protocol. Checkpointing is utilized to limit log size and recovery latency. We compare the performance of our approach with that of existing mobile checkpointing and recovery algorithms in terms of failure free overhead and recovery time. We also describe a stable storage management scheme for mobile support stations. Garbage collection is achieved without direct participation of mobile hosts.
Coordinated checkpoint versus message log for fault tolerant MPI
- in IEEE International Conference on Cluster Computing (Cluster 2003). IEEE CS
, 2003
"... fault tolerant MPI ..."
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
- in IEEE International Conference on Cluster Computing (Cluster 2004). IEEE CS
, 2004
Cited by 16 (8 self)
Fault tolerance is a very important concern for critical high-performance applications using the MPI library. Several protocols provide automatic and transparent fault detection and recovery for message-passing systems, with different impacts on application performance and on the capacity to tolerate a high fault rate. In a recent paper, we demonstrated that the main differences between pessimistic sender-based message logging and coordinated checkpointing are 1) the communication latency and 2) the performance penalty in case of faults. Pessimistic message logging increases the latency because of additional blocking control messages. When faults occur at a high rate, coordinated checkpointing implies a higher performance penalty than message logging because of a higher stress on the checkpoint server. In this paper we extend this study to improved versions of the message logging and coordinated checkpoint protocols, which respectively reduce the latency overhead of pessimistic message logging and the server stress of coordinated checkpointing. We detail the protocols and their implementation in the new MPICH-V fault-tolerant framework. We compare their performance against the previous versions, and we compare the novel message logging protocols against the improved coordinated checkpointing protocol using the NAS benchmark on a typical high-performance cluster equipped with a high-speed network. The contribution of this paper is twofold: a) an original message logging protocol and an improved coordinated checkpointing protocol and b) the comparison between them.
Causality Tracking in Causal Message-Logging Protocols
- Comput
, 2002
Cited by 4 (0 self)
Causal message-logging protocols have several attractive properties: they introduce no blocking, send no additional messages over those sent by the application, and never create orphans.
Impact of event logger on causal message logging protocols for fault tolerant MPI
- In IPDPS ’05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) - Papers
, 2005
Cited by 3 (0 self)
Fault tolerance in MPI has become a main issue in the HPC community. Several approaches are envisioned, from user- or programmer-controlled fault tolerance to fully automatic fault detection and handling. For this last approach, several protocols have been proposed in the literature. In a recent paper, we demonstrated that uncoordinated checkpointing tolerates a higher fault frequency than coordinated checkpointing. Moreover, causal message logging protocols have proved to be the most efficient message logging technique. These protocols consist of piggybacking nondeterministic events on computation messages. Several protocols have been proposed in the literature. Their merits are usually evaluated using four metrics: a) piggybacking computation cost, b) piggyback size, c) application performance and d) fault recovery performance. In this paper, we investigate the benefit of using stable storage for logging message events in causal message logging protocols. To evaluate the advantage of this technique we implemented three protocols: 1) a classical causal message protocol proposed in Manetho, 2) a state-of-the-art protocol known as LogOn, 3) a light computation cost protocol called Vcausal. We demonstrate a major impact of this stable storage for the three protocols, on the four criteria, for micro benchmarks as well as for the NAS benchmark.
Hybrid Message Logging Protocols for Fast Recovery
Fast recovery has received little attention in the context of message logging protocols, which have instead focused on minimizing their overhead during failure-free executions.
Hybrid Message Logging Protocols for Fast Recovery
- In Digest of FastAbstracts of Fault-Tolerant Computing Symposium (FTCS-28)
, 1998
Fast recovery has received little attention in the context of message logging protocols, which have instead focused on minimizing their overhead during failure-free executions. As distributed computing becomes commonplace, and many more applications are faced with the current costs of high availability, there is a fresh need for recovery-based techniques that combine high performance during failure-free executions with fast recovery. As an initial step towards the development of these new techniques, we have implemented a sender-based pessimistic protocol [5], a receiver-based pessimistic protocol [2] and a causal protocol [1] and studied their performance during recovery [4]. All of these protocols log (1) the content and (2) the order of receipt of each message delivered by each process during its execution. Processes synchronously log on stable storage the order of receipt, encoded in tuples called determinants. Rec
EXTENDIBLE, LONG-LIVED TRANSACTION PROCESSING ON DISTRIBUTED AND MOBILE ENVIRONMENTS WITH RECOVERY GUARANTEES
, 2001
On the Calculation of the Checkpoint Interval in Run-Time for Parallel Applications
The growth in the number of components that compose parallel computers increases their fault frequency. In such systems, faults are no longer a rare event but a common problem, so some form of fault tolerance should be provided. In general, fault tolerance protocols rely on checkpoints. A common question surrounding checkpointing is how to define the checkpoint interval. Checkpoint interval models define variables that depend on application characteristics, e.g. the time needed to take a checkpoint. Using average values and/or statistical data to define these variables reduces the model’s accuracy. In this paper we propose a methodology to define at run time the values of the variables needed to calculate the checkpoint interval. With uncoordinated checkpointing, this interval can be defined individually for each process of the parallel application. The definition of the variables relies on measuring, at run time, the time spent on fault tolerance tasks. Experimental evaluation shows that our methodology reduces the overhead introduced by fault tolerance by more than 3% while the tested applications are running in a faulty environment.
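As a rough illustration of the idea in this abstract, the sketch below keeps a per-process running average of measured checkpoint costs and derives an interval from it. The abstract does not say which interval model the paper uses, so the sketch substitutes Young's well-known first-order approximation, interval = sqrt(2 * C * MTBF). The class name, method names, and the MTBF value are hypothetical and are not taken from the paper.

```python
import math


class CheckpointIntervalEstimator:
    """Per-process estimator (hypothetical name, not from the paper).

    Assumption: the interval model is Young's first-order approximation,
    interval = sqrt(2 * C * MTBF), where C is the mean measured cost of
    taking one checkpoint. The paper's own model may differ.
    """

    def __init__(self, mtbf_seconds: float) -> None:
        if mtbf_seconds <= 0:
            raise ValueError("mtbf_seconds must be positive")
        self.mtbf_seconds = mtbf_seconds
        self._total_cost = 0.0  # sum of measured checkpoint costs (seconds)
        self._samples = 0       # number of checkpoints measured so far

    def record_checkpoint(self, cost_seconds: float) -> None:
        """Record the measured time spent taking one checkpoint."""
        if cost_seconds < 0:
            raise ValueError("cost_seconds must be non-negative")
        self._total_cost += cost_seconds
        self._samples += 1

    def mean_cost(self) -> float:
        """Mean checkpoint cost measured so far for this process."""
        if self._samples == 0:
            raise ValueError("no checkpoint has been measured yet")
        return self._total_cost / self._samples

    def interval(self) -> float:
        """Suggested time between checkpoints, in seconds."""
        return math.sqrt(2.0 * self.mean_cost() * self.mtbf_seconds)


if __name__ == "__main__":
    # Assumed values: a 2-hour MTBF and two measured checkpoints.
    estimator = CheckpointIntervalEstimator(mtbf_seconds=7200.0)
    estimator.record_checkpoint(3.0)
    estimator.record_checkpoint(5.0)
    print(f"suggested interval: {estimator.interval():.1f} s")
```

Because each process owns its estimator, the interval can differ from one process to another, which matches the per-process definition the abstract describes for uncoordinated checkpointing.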

