13 citations found.
S. Rao, L. Alvisi, and H. M. Vin. Egida: An extensible toolkit for low-overhead fault-tolerance. In Symposium on Fault-Tolerant Computing, pages 48--55, 1999.

This paper is cited in the following contexts:
MPI/FT: A Model-based Approach to Low-overhead Fault .. - Batchu, Dandass..

....realized reliable MPI at the cost of high overheads that contradict the high performance goal of MPI. Most of these efforts and resulting middleware are based on black-box approaches that provide static abstractions of fault tolerance features and services to all types of parallel applications [1, 12, 26, 28]. These rigid approaches, however, can result in unnecessarily large overheads in support of unused features and limited the flexibility to applications that can tradeoff reliability in favor of performance. This paper presents MPI/FT [3], an MPI-based middleware that provides additional services ....

....it checkpoints the entire process state. Recovery of a failed process is performed by a user-level function. However, CoCheck only provides coarse-grained reliability for MPI because CoCheck does not have access to the message passing middleware's internal data structures. 2.3 Egida Egida [26] is an object-oriented toolkit for transparent logging, and rollback recovery. Egida is extensible and allows users to define their own logging and rollback recovery protocols. Implementations for the described protocol are synthesized by gluing pre-existing objects. Egida's log-based fault ....

S. Rao, L. Alvisi, and H.M. Vin. "Egida: An Extensible Toolkit for Low-overhead Fault-tolerance," Proc. 29th International Fault-Tolerant Computing Symposium, IEEE CS Press, Los Alamitos, California, pp. 48-55, 1999.


MPICH-V: Toward a Scalable Fault Tolerant MPI for.. - Bosilca, Bouteiller, .. (2002)   (1 citation)

....a transparent fault tolerance property for MPI applications. Only the upper and the lowest levels approaches provide automatic fault tolerance for MPI applications, transparent for the user. Others approaches require the user to manage fault tolerance by adapting its application. As discussed in [8, 21], many techniques of fault tolerance have been proposed in previous works, lying in various level of the software stacks : one solution is based on global coherent checkpoints, the MPI application being restarted to the last coherent checkpoint in case of faults (even in case of a single crash) ....

....operations (shrinking, rebuilding, aborting the communicator) according to various fault states. The main advantage of FT-MPI is its performance since it does not checkpoint neither MPI processes nor log MPI messages. Its main drawback is the lack of transparency for the programmer. Egida [21] is an object-oriented toolkit supporting transparent log-based rollback recovery. By integrating Egida with MPICH, existing MPI applications can take advantage of fault tolerance transparently without any modification. Egida implements failure detection and recovery at the low level layer. The ....

Sriram Rao, Lorenzo Alvisi, and Harrick M. Vin. Egida: An extensible toolkit for low-overhead fault-tolerance. In Symposium on Fault-Tolerant Computing, pages 48--55, 1999.


A Distributed Fault-Tolerant Asynchronous Algorithm for.. - Weerasinghe, Lipsky (2001)

....of tolerating processor failures, and communication timeouts. In the presence of such failures, applications should continue execution and produce correct results. A number of systems have been developed to incorporate fault tolerance into distributed applications. The systems, Starfish [2], Egida [9], Hector [10] and others allow MPI [1] applications to be executed in NoWs without any modifications. They use checkpointing and recovery mechanisms in which each process's state is saved periodically in a reliable storage. In the event of a process failure, either the affected process or the whole ....

S. Rao, L. Alvisi, H. M. Vin, "Egida: An Extensible Toolkit for Low-overhead Fault-Tolerance," In Proc. of the 8th Int. Conf. on Distributed Computing Systems, Netherlands, May 1998.


An Analysis of Communication-Induced Checkpointing - Alvisi, Elnozahy, Rao, Amir.. (1999)   (7 citations)  Self-citation (Rao Alvisi)

....study consists of four 300 MHz Pentium II-based workstations connected by a 100MB/s Ethernet. Each workstation has two processors, 512MB of RAM, and a 4GB disk used to implement stable storage. The machines ran Solaris 2.6, and used Sun's f77 and C compilers. The testbed is part of the Egida tool [23], which includes support for incremental checkpointing and implements non-blocking checkpointing by forking off a child process that writes the checkpoint to stable storage. The applications under study consist of four programs from the NPB 2.3 benchmark suite [15]. These programs represent common ....

....We are not aware, however, of any experimental work to investigate CIC along the lines we followed here. We would like to point out though that two implementations of coordinated checkpointing have used the idea of time stamping a message with the checkpointing interval as suggested by Briatico [6,23]. There are also several experimental evaluations that were performed on other styles of rollback-recovery such as message logging [4,17] and coordinated checkpointing [6,14,18,20,24,28] but comparing these efforts with the work presented here is out of the scope of this paper. 6. Conclusions ....

S. Rao and L. Alvisi and H. M. Vin. Egida: An Extensible Toolkit for Low-overhead Fault-tolerance. Technical Report TR 98-29, Dec. 1998.


A Survey of Rollback-Recovery Protocols in.. - Elnozahy, Alvisi.. (1996)   (161 citations)  Self-citation (Alvisi)

No context found.

S. Rao, L. Alvisi and H. Vin. "Egida: An extensible toolkit for low-overhead fault tolerance." In Proceedings of the Twenty Ninth International Symposium on Fault-Tolerant Computing, Jun. 1999.


The Cost of Recovery in Message Logging Protocols - Sriram Rao (1998)   (2 citations)  Self-citation (Rao Alvisi Vin)

....Sections 3.5 and 4. Section 5 introduces and evaluates hybrid logging protocols. Finally, Section 6 offers some concluding remarks. 2 Implementation We measure the cost of recovery in message logging protocols using Egida, an object-oriented toolkit for synthesizing rollback recovery protocols [16]. Egida supports a library of objects that implement a set of functionalities that are at the core of all log-based rollback recovery protocols; different rollback recovery protocols can be implemented by composing the objects in this library. Egida is integrated with the MPICH implementation of ....

S. Rao, L. Alvisi, and H. M. Vin. Egida: An Extensible Toolkit for Low-overhead Fault-tolerance. In Proceedings of the IEEE Fault-Tolerant Computing Symposium (FTCS-29), Madison, WI, June 1999.


Automated Application-level Checkpointing of MPI Programs - Bronevetsky, Marques.. (2003)

No context found.

S. Rao, L. Alvisi, and H. M. Vin. Egida: An extensible toolkit for low-overhead fault-tolerance. In Symposium on Fault-Tolerant Computing, pages 48--55, 1999.


Collective Operations in an Application-level.. - Bronevetsky.. (2003)

No context found.

S. Rao, L. Alvisi, and H. M. Vin. Egida: An extensible toolkit for low-overhead fault-tolerance. In Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, Madison, Wisconsin, June 15 - 18, 1999.


Dependable High Performance Computing on a Parallel.. - Blochinger, Bündgen, al. (2000)   (1 citation)

No context found.

S. Rao, L. Alvisi, and H.M. Vin. Egida: An extensible toolkit for low-overhead fault-tolerance. In Proceedings of IEEE International Conference on Fault-Tolerant Computing (FTCS), pages 48--55, June 1999.


Implementation and Evaluation of a Scalable.. - Schulz.. (2004)   (1 citation)

No context found.

S. Rao, L. Alvisi, and H. M. Vin. Egida: An extensible toolkit for low-overhead fault-tolerance. In Symposium on Fault-Tolerant Computing, pages 48--55, 1999.


Unknown - Automating

No context found.

S. Rao, L. Alvisi, and H. M. Vin. Egida: An extensible toolkit for low-overhead fault-tolerance. In Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, Madison, Wisconsin, June 15 - 18, 1999.


Reliability in LAM/MPI Requirements Specification - Lumsdaine, Squyres, Barrett (2002)

No context found.

Sriram Rao, Lorenzo Alvisi, and Harrick M. Vin. Egida: An Extensible Toolkit for Low-Overhead Fault-Tolerance. In Symposium on Fault-Tolerant Computing, pages 48--55, 1999.


Performance Modelling and Experimental Evaluation of Systems.. - Weerasinghe (2002)

No context found.

S. Rao, L. Alvisi, H. M. Vin, "Egida: An Extensible Toolkit for Low-overhead Fault-Tolerance", In Proc. of the 8th Int. Conf. on Distributed Computing Systems, Netherlands, May 1998.

CiteSeer.IST - Copyright Penn State and NEC