56 citations found.
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala. Checkpointing and its applications. In Proc. IEEE Fault-Tolerant Computing Symp., pages 22--31, June 1995.


This paper is cited in the following contexts:


SafetyNet: Improving the Availability of Shared Memory . . . - Sorin (2002)   (8 citations)

....Software Backward Error Recovery. Software checkpointing has also been used, but at radically different engineering costs. In Tandem NonStop machines, every process periodically checkpoints its state on another processor [38]. Work by Plank [32] and Wang and Hwang [44] uses software to periodically checkpoint applications to aid fault tolerance. These schemes differ in the degree of support required from the programmer, libraries, and operating system. At the link level, SCI [25] supports software retry of dropped or corrupted messages. SafetyNet differs from ....

Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala. Checkpointing and Its Applications. In Computing Systems, pages 22--31, June 1995.


Duplex: A Reusable Fault Tolerance Extension.. - Sharma, Chen, Li.. (2003)   (3 citations)

....kinds of rollback recovery techniques: checkpoint-based and logging-based. Checkpoint-based techniques periodically save the state of an executing process to a disk file from which it can be recovered after a failure. Examples of work on checkpoint-based techniques include Libckpt [13] and Libckp [14]. Checkpointing of process state is an expensive operation in the context of high-performance network access devices. Duplex provides a logging-based mechanism that keeps a persistent record of nondeterministic events, such as changes made to the device configuration. In the event of a failure, ....

Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C.M.R. Kintala. Checkpointing and its applications. In Symposium on Fault-Tolerant Computing, Pasadena, CA, pages 22--31, June 1995.


Transparent Fault Tolerance for CORBA - Narasimhan (1999)   (4 citations)

....of replicas, the hosts on which they are running, the status of each replica and the number of faults seen by the replica on a given host. This repository, which forms part of the state of the ReplicaManager, is periodically checkpointed. DOORS employs libraries for the transparent checkpointing [71] of applications; however, duplicate detection and suppression are not addressed. DoorMan is a management interface to DOORS that monitors DOORS and the underlying system in order to fine-tune the functioning of DOORS and to take corrective action by migrating objects whose hosts are suspected of ....

Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. M. R. Kintala. Checkpointing and its applications. In Proceedings of the 25th IEEE International Symposium on Fault-Tolerant Computing, pages 22--31, Pasadena, CA, June 1995.


Improving Availability with Recursive Micro-Reboots: A.. - Candea, Cutler, Fox (2003)   (1 citation)

....that of segregating and protecting state that needs to be persistent, while treating the rest as soft state. We see this approach reflected in recent work on soft-state/hard-state segregation in Internet services [39, 50] and we adopt it as a basic tenet for our restart/retry model. Checkpointing [101, 23, 99] employs dynamic data redundancy to create a believed-good snapshot of a program's state and, in case of failure, return the program to that state. An important challenge in checkpoint-based recovery is ensuring that the checkpoint is taken before the state has been corrupted [102]. Another ....

Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. M. R. Kintala. Checkpointing and its applications. In Proc. 25th International Symposium on Fault-Tolerant Computing, 1995.


State Synchronization and Recovery for Strongly.. - Narasimhan, Moser.. (2001)   (3 citations)

....[12] provides fault tolerance through a service approach, with CORBA objects that detect, and recover from, replica and processor faults. The system provides support for resource management based on the needs of the CORBA application. DOORS employs libraries for the transparent checkpointing [18] of applications; however, duplicate detection and suppression are not addressed. OGS, AQuA, Maestro and DOORS deal with the consistency of application-level state by having application objects inherit from an IDL interface with state retrieval and assignment methods similar to those of our ....

Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. M. R. Kintala. Checkpointing and its applications. In Proceedings of the 25th IEEE International Symposium on Fault-Tolerant Computing, pages 22--31, Pasadena, CA, June 1995.


Strongly Consistent Replication and Recovery of Fault-Tolerant.. - Narasimhan (2002)

....[12] provides fault tolerance through a service approach, with CORBA objects that detect, and recover from, replica and processor faults. The system provides support for resource management based on the needs of the CORBA application. DOORS employs libraries for the transparent checkpointing [18] of applications; however, duplicate detection and suppression are not addressed. The Interoperable Replication Logic (IRL) [5] also provides fault tolerance for CORBA applications through a service approach. One of the aims of IRL is to uphold CORBA's interoperability by supporting a ....

Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. M. R. Kintala. Checkpointing and its applications. In Proceedings of the 25th IEEE International Symposium on Fault-Tolerant Computing, pages 22--31, Pasadena, CA, June 1995.


Low-Cost Error Containment and Recovery for Onboard.. - Tai, Tso, Alkalai.. (2002)

....we take a crucial step in devising error containment and recovery methods by introducing the confidence-driven notion. This notion complements the message-driven (or communication-induced) approach employed by a number of existing checkpointing protocols for tolerating hardware faults [7] [8]. The resulting error containment and recovery protocol is thus both message-driven and confidence-driven (MDCD). In particular, the MDCD protocol is based on a two-tiered approach: First, we discriminate among software components with respect to our confidence in them and, second, during onboard ....

.... in the beginning of Section 4, in order to effectively mitigate the effects of software design faults in a distributed computing environment without imposing restriction on interprocess communication, we adapt the communication-induced checkpointing technique for hardware error recovery [7] [8] and complement the technique by introducing the confidence-driven notion. This is the most crucial step we take in deriving the distributed algorithms for low-cost error containment and recovery. The resulting checkpointing rule and algorithms thus ensure that the error recovery mechanisms can ....

Y.M. Wang et al., "Checkpointing and Its Applications," Digest 25th Ann. Int'l Symp. Fault-Tolerant Computing, pp. 22-31, June 1995.


On-Board Guarded Software Upgrading for Space Missions - Ann Tai Kam (1999)   (1 citation)

....called guarded software upgrading (GSU) that enables seamless and dependable on-board software upgrading and feasible for middleware implementation. The error containment and protection methods for GSU are based on checkpointing, message logging and rollback/roll-forward recovery techniques [6, 7, 8, 9] that are adapted and extended to accommodate the requirements from the X2000 architecture and applications. The same methodology can be applied to the two stages of guarded software upgrading, namely, on-board validation and guarded operation, as well as version switching for the transition from ....

Y. M. Wang et al., "Checkpointing and its applications," in Digest of the 25th Annual International Symposium on Fault-Tolerant Computing, (Pasadena, CA), pp. 22--31, June 1995.


Platform Independent Checkpointing of a C-Program in Execution - Gulwani, Tarachandani

....and have shown small space and time overheads. 1 Introduction Process checkpointing is a technique to store the state of a process during normal execution. Process checkpointing has been extensively used to provide support for software fault tolerance, process migration and playback debugging [17]. Most of the research on process checkpointing has focussed on homogeneous process checkpointing i.e. checkpointing the process for restart on the same machine or on a different machine with same architecture and running the same operating system. This allows the state information of the process ....

Yi-Min Wang, Yennun Huang, Kiem-Phong Vo, Pi-Yu Chung, Chandra Kintala, "Checkpointing and its applications", In 25th International Symposium on Fault-Tolerant Computing, June 1995


Processor Allocation and Checkpoint Interval Selection in.. - Plank, Thomason (2001)   (3 citations)

....[19] and CosMiC [8] where workstations are available for computations only when they are not in use by their owners. Failure and repair data was obtained by the authors of [8] and the checkpointing performance data was gleaned from performance results of CosMiC's transparent checkpointer libckp [35]. It is assumed that the copy-on-write optimization yields an 80 percent improvement in checkpoint overhead [24]. The failure rate of LOW is extremely high, which is typical of these environments, and as the data later show, they are not particularly conducive to this kind of parallel computing. ....

Y-M. Wang, Y. Huang, K-P. Vo, P-Y. Chung, and C. Kintala. Checkpointing and its applications. In 25th International Symposium on Fault-Tolerant Computing, pages 22--31, Pasadena, CA, June 1995.


Low-overhead Protocols for Fault-tolerant File Sharing - Lorenzo Alvisi Sriram (1998)

....a simple, uniform approach, which can provide low-overhead fault tolerance to applications in which communication is performed through message passing, file sharing, or a combination of the two. 1 Introduction Low-overhead rollback-recovery protocols such as checkpointing and message logging [2, 3, 9, 17, 18] have been extensively studied for message-passing applications. These protocols seek to tolerate common failures while minimizing the use of additional resources and the impact on performance during failure-free executions. In this paper, we focus on low-overhead protocols for applications in ....

Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. Kintala. Checkpointing and Its Applications. In Proceedings of the IEEE Fault-Tolerant Computing Symposium (FTCS-25), pages 22--31, Pasadena, CA, June 1995.


The Average Availability of Parallel Checkpointing Systems.. - Plank, Thomason (1999)   (1 citation)

....by CosMiC [6] where workstations are available for computations only when they are not in use by their owners. Failure and repair data was obtained by the authors of [6] and the checkpointing performance data was gleaned from performance results of CosMiC's transparent checkpointer libckp [27]. It is assumed that the copy-on-write optimization yields an 80 percent improvement in checkpoint overhead [18]. The failure rate of LOW is extremely high, which is typical of these environments. For each application, we selected a problem size that causes the computation to run between 14 and 20 hours ....

Y-M. Wang et al. Checkpointing and its applications. In 25th Int. Symp. on Fault-Tol. Comp., pp. 22--31, June 1995.


Towards Performability Modeling of Software Rejuvenation - Garg, van Moorsel (1996)   (1 citation)

....Transient software failures of this nature are reported in many instances in the field [1, 6, 8, 11]. The reason behind the Heisenbug's elusiveness, during testing as well as in the operational phase, is the dependence of their activation on the operational environment. Using the terminology in [13], the operational environment includes both the process state and the process environment, .... [Footnote: The material presented in this paper has been developed during Sachin Garg's summer internship at Bell Labs, Murray Hill, summer 1996.] [Footnote 1: Throughout the paper we use the term performability in a generic, ....]

Y.-M. Wang, Y. Huang, P. Vo, P.-Y. Chung, and C. Kintala, "Checkpointing and its applications," in 25th Symposium on Fault Tolerant Computer Systems, pp. 22--30, Pasadena, CA, 1995, IEEE, IEEE Computer Society.


System-Level versus User-Defined Checkpointing - Silva, Silva (1998)   (5 citations)

.... These two last schemes (transparent system-level and user-defined checkpointing) have their advantages and drawbacks and there has been some discussion about whether fault tolerance should be handled transparently by the operating system or should be provided on top of the operating system [6][7]. In this paper, we describe the pros and cons of both approaches. Section 2 will present a qualitative analysis between these two approaches. Section 3 refers the system-level checkpointing algorithm, while section 4 presents a user-defined checkpointing scheme. Section 5 presents the results of ....

....XDR format they can be used in heterogeneous architectures, while system-level checkpoints can only be migrated between homogeneous machines; programmer-induced recovery provides more flexibility. For instance, checkpoint recovery can also be used to tolerate software bugs as was proposed in [7]; user-defined checkpointing can be seen as a multipurpose technique: for fault tolerance, playback debugging or coarse-grained job swapping. To conclude, we do not claim that system-level checkpointing is worse than user-defined checkpointing. What we tried to prove was that user-defined ....

Y.M. Wang, Y. Huang, K.P. Vo, P.Y. Chung, C. Kintala. "Checkpointing and Its Applications", Proc. 25th Fault-Tolerant Computing Symposium, FTCS-25, pp. 22-31, July 1995


Controlling Recovery Time with Message Logging - Ssu, Yao, Fuchs

....in the presence of failures. These mathematical solutions are often not applicable due to the lack of accurate data on the probability distribution function of failures [1]. Current checkpoint libraries typically require application users to define a fixed time interval for checkpointing [2]. Since the checkpoint interval implies the approximate maximum recovery time for single-process applications, users who do not have accurate information on the mean time to failure (MTTF) determine the fixed checkpoint interval based on their preferred maximum recovery time. The maximum recovery ....

Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala, "Checkpointing and its Applications," Proceedings of IEEE Fault-Tolerant Computing Symposium, pp. 22--31, June 1995.


Design and Implementation of a Low-Overhead File.. - Pei, Wang, Shen, Zheng (2000)   Self-citation (Wang)

....by hiding the latency of flushing the buffer, this approach achieved an overhead lower than other approaches. 1. Introduction Checkpointing and recovery is a technique for saving process state during normal execution and restoring the saved state after a failure to reduce the amount of lost work [1]. Process state refers to everything that is included in a checkpoint in order to guarantee a successful recovery and it should include both volatile and persistent state [2]. Persistent state includes the status of all the user files related to the current execution of the process. The status of ....

....to the current execution of the process. The status of a file includes its content and its active information, i.e. its descriptor, access mode, the offset to which it is positioned, etc. Although supporting the correct rollback of persistent state has become the primary concern of many users [1], existing checkpoint libraries usually save and restore only active information [2, 3]. This is because it is unacceptably expensive to save all the content of user files into checkpoint due to their arbitrary size and number. This straightforward but incomplete way will result in inconsistent ....

[Article contains additional citation context not shown here]

Y.M. Wang, Y. Huang, et al. "Checkpointing and Its Applications", Proceedings of IEEE 25th Symposium on Fault-Tolerant Computing, June 1995, pp. 22-31.


An Empirical Study of Tracing Techniques - From Failure Analysis

No context found.

Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala. Checkpointing and its applications. In Proc. IEEE Fault-Tolerant Computing Symp, pages 22--31, June 1995.


Virtual Machine Based Heterogeneous Checkpointing - Adnan Agbaria Roy (2000)

No context found.

Y. M. Wang, Y. Huang, K. P. Vo, P.Y. Chung, and C. Kintala. Checkpointing and its Applications. In Proceedings of the 25th International Symposium on Fault-Tolerant Computing, pages 22--31, June 1995.


Software---Practice And Experience - Softw Pract Exper (2002)

No context found.

Wang YM, Huang Y, Vo KP, Chung PY, Kintala CMR. Checkpointing and its applications. Proceedings 25th IEEE International Symposium on Fault-Tolerant Computing, Pasadena, CA, June 1995. IEEE Computer Society: Los Alamitos, CA, 1995; 22--31.


Flashback: A Lightweight Extension for Rollback and .. - Srinivasan.. (2004)   (1 citation)

No context found.

Y. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala. Checkpointing and its applications. In FTCS-25, 1995.


Protecting Distributed Software Upgrades that Involve.. - Interface Changes Ann

No context found.

Y. M. Wang et al., "Checkpointing and its applications," in Digest of the 25th Annual International Symposium on Fault-Tolerant Computing, (Pasadena, CA), pp. 22--31, June 1995.


Using Lightweight Checkpoint/Recovery to Improve the Availability.. - Sorin (2002)

No context found.

Y-M. Wang, Y. Huang, K-P. Vo, P-Y. Chung, and C. Kintala. Checkpointing and Its Applications. In Proceedings of the 25th International Symposium on Fault-Tolerant Computing Systems, pages 22--31, June 1995.


Techniques for Efficiently Recording State Changes of a Computer.. - II (2001)

No context found.

Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. Kintala, "Checkpointing and its Applications," In 25th International Symposium on Fault-Tolerant Computing, pages 22-31, Pasadena, CA, June 1995.


Design, Implementation, and Performance of Checkpointing in.. - Agbaria, Plank

No context found.

Y-M. Wang, Y. Huang, K-P. Vo, P-Y. Chung, and C. Kintala. Checkpointing and its applications. In 25th International Symposium on Fault-Tolerant Computing, pages 22--31, Pasadena, CA, June 1995.



CiteSeer.IST - Copyright Penn State and NEC