Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala. Checkpointing and its applications. In Proc. IEEE Fault-Tolerant Computing Symp, pages 22--31, June 1995.
....Software Backward Error Recovery. Software checkpointing has also been used, but at radically different engineering costs. In Tandem NonStop machines, every process periodically checkpoints its state on another processor [38]. Work by Plank [32] and Wang and Hwang [44] uses software to periodically checkpoint applications to aid fault tolerance. These schemes differ in the degree of support required from the programmer, libraries, and operating system. At the link level, SCI [25] supports software retry of dropped or corrupted messages. SafetyNet differs from ....
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala. Checkpointing and Its Applications. In Computing Systems, pages 22--31, June 1995.
....kinds of rollback recovery techniques: checkpoint based and logging based. Checkpoint based techniques periodically save the state of an executing process to a disk file from which it can be recovered after a failure. Examples of work on checkpoint based techniques include Libckpt [13] and Libckp [14]. Checkpointing of process state is an expensive operation in the context of high performance network access devices. Duplex provides a logging based mechanism that keeps a persistent record of nondeterministic events, such as changes made to the device configuration. In the event of a failure, ....
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C.M.R. Kintala. Checkpointing and its applications. In Symposium on Fault-Tolerant Computing, Pasadena, CA, pages 22--31, June 1995.
....of replicas, the hosts on which they are running, the status of each replica and the number of faults seen by the replica on a given host. This repository, which forms part of the state of the ReplicaManager, is periodically checkpointed. DOORS employs libraries for the transparent checkpointing [71] of applications; however, duplicate detection and suppression are not addressed. DoorMan is a management interface to DOORS that monitors DOORS and the underlying system in order to fine tune the functioning of DOORS and to take corrective action by migrating objects whose hosts are suspected of ....
Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. M. R. Kintala. Checkpointing and its applications. In Proceedings of the 25th IEEE International Symposium on Fault-Tolerant Computing, pages 22--31, Pasadena, CA, June 1995.
....that of segregating and protecting state that needs to be persistent, while treating the rest as soft state. We see this approach reflected in recent work on soft state/hard state segregation in Internet services [39, 50] and we adopt it as a basic tenet for our restart retry model. Checkpointing [101, 23, 99] employs dynamic data redundancy to create a believed good snapshot of a program's state and, in case of failure, return the program to that state. An important challenge in checkpoint based recovery is ensuring that the checkpoint is taken before the state has been corrupted [102]. Another ....
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. M. R. Kintala. Checkpointing and its applications. In Proc. 25th International Symposium on Fault-Tolerant Computing, 1995.
....[12] provides fault tolerance through a service approach, with CORBA objects that detect, and recover from, replica and processor faults. The system provides support for resource management based on the needs of the CORBA application. DOORS employs libraries for the transparent checkpointing [18] of applications; however, duplicate detection and suppression are not addressed. OGS, AQuA, Maestro and DOORS deal with the consistency of application level state by having application objects inherit from an IDL interface with state retrieval and assignment methods similar to those of our ....
Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. M. R. Kintala. Checkpointing and its applications. In Proceedings of the 25th IEEE International Symposium on Fault-Tolerant Computing, pages 22--31, Pasadena, CA, June 1995.
....[12] provides fault tolerance through a service approach, with CORBA objects that detect, and recover from, replica and processor faults. The system provides support for resource management based on the needs of the CORBA application. DOORS employs libraries for the transparent checkpointing [18] of applications; however, duplicate detection and suppression are not addressed. The Interoperable Replication Logic (IRL) [5] also provides fault tolerance for CORBA applications through a service approach. One of the aims of IRL is to uphold CORBA's interoperability by supporting a ....
Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. M. R. Kintala. Checkpointing and its applications. In Proceedings of the 25th IEEE International Symposium on Fault-Tolerant Computing, pages 22--31, Pasadena, CA, June 1995.
....we take a crucial step in devising error containment and recovery methods by introducing the confidence driven notion. This notion complements the message driven (or communication induced) approach employed by a number of existing checkpointing protocols for tolerating hardware faults [7] [8]. The resulting error containment and recovery protocol is thus both message driven and confidence driven (MDCD). In particular, the MDCD protocol is based on a two tiered approach: First, we discriminate among software components with respect to our confidence in them and, second, during onboard ....
.... in the beginning of Section 4, in order to effectively mitigate the effects of software design faults in a distributed computing environment without imposing restriction on interprocess communication, we adapt the communication induced checkpointing technique for hardware error recovery [7] [8] and complement the technique by introducing the confidence driven notion. This is the most crucial step we take in deriving the distributed algorithms for low cost error containment and recovery. The resulting checkpointing rule and algorithms thus ensure that the error recovery mechanisms can ....
Y.M. Wang et al., "Checkpointing and Its Applications," Digest 25th Ann. Int'l Symp. Fault-Tolerant Computing, pp. 22-31, June 1995.
....called guarded software upgrading (GSU) that enables seamless and dependable on board software upgrading and feasible for middleware implementation. The error containment and protection methods for GSU are based on checkpointing, message logging and rollback roll forward recovery techniques [6, 7, 8, 9] that are adapted and extended to accommodate the requirements from the X2000 architecture and applications. The same methodology can be applied to the two stages of guarded software upgrading, namely, on board validation and guarded operation, as well as version switching for the transition from ....
Y. M. Wang et al., "Checkpointing and its applications, " in Digest of the 25th Annual International Symposium on Fault-Tolerant Computing, (Pasadena, CA), pp. 22--31, June 1995.
....and have shown small space and time overheads. 1 Introduction Process checkpointing is a technique to store the state of a process during normal execution. Process checkpointing has been extensively used to provide support for software fault tolerance, process migration and playback debugging [17]. Most of the research on process checkpointing has focussed on homogeneous process checkpointing, i.e. checkpointing the process for restart on the same machine or on a different machine with the same architecture and running the same operating system. This allows the state information of the process ....
Yi-Min Wang, Yennun Huang, Kiem-Phong Vo, Pi-Yu Chung, Chandra Kintala, "Checkpointing and its applications", In 25th International Symposium on Fault-Tolerant Computing, June 1995
....[19] and CosMiC [8] where workstations are available for computations only when they are not in use by their owners. Failure and repair data was obtained by the authors of [8] and the checkpointing performance data was gleaned from performance results of CosMiC's transparent checkpointer libckp [35]. It is assumed that the copy on write optimization yields an 80 percent improvement in checkpoint overhead [24]. The failure rate of LOW is extremely high, which is typical of these environments, and as the data later show, they are not particularly conducive to this kind of parallel computing. ....
Y-M. Wang, Y. Huang, K-P. Vo, P-Y. Chung, and C. Kintala. Checkpointing and its applications. In 25th International Symposium on Fault-Tolerant Computing, pages 22--31, Pasadena, CA, June 1995.
....a simple, uniform approach, which can provide low overhead fault tolerance to applications in which communication is performed through message passing, file sharing, or a combination of the two. 1 Introduction Low overhead rollback recovery protocols such as checkpointing and message logging [2, 3, 9, 17, 18] have been extensively studied for message passing applications. These protocols seek to tolerate common failures while minimizing the use of additional resources and the impact on performance during failure free executions. In this paper, we focus on low overhead protocols for applications in ....
Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. Kintala. Checkpointing and Its Applications. In Proceedings of the IEEE Fault-Tolerant Computing Symposium (FTCS-25), pages 22--31, Pasadena, CA, June 1995.
....by CosMiC [6] where workstations are available for computations only when they are not in use by their owners. Failure and repair data was obtained by the authors of [6] and the checkpointing performance data was gleaned from performance results of CosMiC's transparent checkpointer libckp [27]. It is assumed that the copy on write optimization yields an 80 percent improvement in checkpoint overhead [18]. The failure rate of LOW is extremely high, which is typical of these environments. For each application, we selected a problem size that causes the computation to run between 14 and 20 hours ....
Y-M. Wang et al. Checkpointing and its applications. In 25th Int. Symp. on Fault-Tol. Comp., pp. 22--31, June 1995.
....Transient software failures of this nature are reported in many instances in the field [1, 6, 8, 11]. The reason behind the Heisenbug's elusiveness, during testing as well as in the operational phase, is the dependence of their activation on the operational environment. Using the terminology in [13], the operational environment includes both the process state and the process environment, .... [Footnote: The material presented in this paper has been developed during Sachin Garg's summer internship at Bell Labs, Murray Hill, summer 1996.] [Footnote 1: Throughout the paper we use the term performability in a generic, ....]
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala, "Checkpointing and its applications," in 25th Symposium on Fault Tolerant Computer Systems, pp. 22--31, Pasadena, CA, 1995, IEEE, IEEE Computer Society.
.... These two last schemes (transparent system-level and user-defined checkpointing) have their advantages and drawbacks and there has been some discussion about whether fault tolerance should be handled transparently by the operating system or should be provided on top of the operating system [6][7]. In this paper, we describe the pros and cons of both approaches. Section 2 will present a qualitative analysis between these two approaches. Section 3 describes the system-level checkpointing algorithm, while section 4 presents a user defined checkpointing scheme. Section 5 presents the results of ....
....XDR format they can be used in heterogeneous architectures, while system level checkpoints can only be migrated between homogeneous machines; programmer induced recovery provides more flexibility. For instance, checkpoint recovery can also be used to tolerate software bugs as was proposed in [7]; user defined checkpointing can be seen as a multipurpose technique: for fault tolerance, playback debugging or coarse grained job swapping. To conclude, we do not claim that system level checkpointing is worse than user defined checkpointing. What we tried to prove was that user defined ....
Y.M. Wang, Y. Huang, K.P. Vo, P.Y. Chung, C. Kintala. "Checkpointing and Its Applications", Proc. 25th Fault-Tolerant Computing Symposium, FTCS-25, pp. 22-31, June 1995
....in the presence of failures. These mathematical solutions are often not applicable due to the lack of accurate data on the probability distribution function of failures [1]. Current checkpoint libraries typically require application users to define a fixed time interval for checkpointing [2]. Since the checkpoint interval implies the approximate maximum recovery time for single process applications, users who do not have accurate information on the mean time to failure (MTTF) determine the fixed checkpoint interval based on their preferred maximum recovery time. The maximum recovery ....
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala, "Checkpointing and its Applications," Proceedings of IEEE Fault-Tolerant Computing Symposium, pp. 22--31, June 1995.
....research area for enabling long running applications to be fault tolerant. Many basic checkpointing algorithms [6, 11] and optimization techniques [12] have been developed for uniprocessor and parallel computing systems, and several checkpointing libraries and systems have been implemented [1, 5, 8, 10, 14, 17, 18, 20, 22]. However, for the typical scientific user, actually using a checkpointing system is a difficult task. All systems require the user to port a library and recompile or relink their code subject to a number of restrictions imposed by the library. These restrictions range from strong typing of the ....
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala. Checkpointing and its applications. In 25th Int. Symp. on Fault-Tolerant Comp., pages 22--31, Pasadena, CA, June 1995.
....target programs. In particular, support for error detection and recovery has been added. The frontend acts as a voter for data that is collected from the replicated backends. The voter code includes management code for synchronization of the backends. For recovery, the libckp checkpointing package [11] is used to perform restarting and migration of target program processes that have been determined to be erroneous. The frontend graphical interface was also modified to allow the specification of voting parameters and recovery parameters. 3 Fault tolerance implementation The Prism tool described ....
Yi-Min Wang, Yennun Huang, Kiem-Phong Vo, Pi-Yu Chung, and Chandra Kintala. Checkpointing and its applications. In Proc. 25th Fault-Tolerant Computing Symposium, pages 22--31, 1995.
....hiding the latency of flushing the buffer, the MOB approach achieves an overhead lower than other approaches. 1. Introduction Checkpointing and recovery is a technique for saving process state during normal execution and restoring the saved state after a failure to reduce the amount of lost work [1]. Process state refers to everything that is included in a checkpoint in order to guarantee a successful recovery. Wang et al. [1] pointed out that the process state should include both volatile state and persistent state. Volatile state consists of the data segment, stack, the process registers ....
....and recovery is a technique for saving process state during normal execution and restoring the saved state after a failure to reduce the amount of lost work [1]. Process state refers to everything that is included in a checkpoint in order to guarantee a successful recovery. Wang et al. [1] pointed out that the process state should include both volatile state and persistent state. Volatile state consists of the data segment, stack, the process registers and the information about signals. Persistent state includes the status of all the user files related to the current execution of ....
Y.M. Wang, Y. Huang, et al. "Checkpointing and Its Applications", Proceedings of IEEE 25th Symposium on Fault-Tolerant Computing, June 1995, pp. 22-31.
....short-running programs. The user can reduce the percentage of execution time spent saving checkpoints by increasing the checkpoint interval. 4. Related work Checkpointing libraries such as Libckpt, Condor, and libckp run on several versions of Unix, but are not designed for multithreaded programs [10, 12, 14]. Plank and Li describe several algorithms for reducing latency when saving checkpoints on multiprocessor machines [11]. Their algorithms could be used to improve the efficiency of our checkpointing library. Many researchers have worked on checkpointing for distributed systems [2, 4, 5, 7, 8, 9, ....
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala. Checkpointing and its applications. In Proceedings of the International Symposium on Fault-Tolerant Computing, pages 22--31, June 1995.
....on one machine, then recover from the checkpoint on another machine. Condor runs on a number of operating systems including Solaris 2 and Linux. It does not support multithreaded programs or have freely available source. Libckp was developed at AT&T Bell Laboratories to checkpoint Unix processes [14]. Libckp saves files along with the checkpoint to guarantee they will be the same when the program recovers. It does not support multithreaded programs. Plank and Li describe several algorithms for reducing latency when saving checkpoints [11]. Their algorithms assume the operating system ....
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala. Checkpointing and its applications. In Proceedings of the International Symposium on Fault-Tolerant Computing, pages 22--31, June 1995.
....6 discusses the work in progress and the future plans. 2 Background Checkpointing and recovery schemes have been developed for database and process control systems as early as the seventies [10, 11]. Today the application domain encompasses a much wider area including memory rejuvenation [12], mobile computing [13] and multi media. Checkpointing is needed to consistently synchronize different types of multi media (voice, image and text) data that is transmitted during the processing of video on demand and video conferencing applications. When a failure is detected, the system can ....
Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. Kintala, "Checkpointing and its applications, " Proc. 25th International Symposium on Fault Tolerant Computing (FTCS-25), pp. 22--31, 1995.
....Computing and Systems (ICDCS 98), Amsterdam, Netherlands, May 1998 (to appear). Also available as Technical Report TR 98 01, Department of Computer Sciences, University of Texas, Austin, TX. 1 Introduction Low overhead fault tolerance protocols such as checkpointing and message logging [2, 3, 4, 13, 14, 17, 21, 27, 32, 30] have been extensively studied for message passing distributed applications. These protocols seek to tolerate common failures while minimizing the use of additional resources and the impact on performance during failure free executions. In this paper, we focus on low overhead protocols for ....
Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. Kintala. Checkpointing and Its Applications. In Proceedings of the IEEE Fault-Tolerant Computing Symposium (FTCS-25), pages 22--31, Pasadena, CA, June 1995.
....overhead for data conversion on the master process is eliminated. Receiver makes right has the disadvantage that each slave process must record all the data formats for every architecture. [Figure 8: Two address space tables (Table A and Table B; columns id, addr, size, num) in a slave process, for the variables declared in source code: int a, b[10]; double x[20], y; long z[4]; char str[100].] The disadvantage leads to more ....
.... right has the disadvantage that each slave process must record all the data formats for every architecture. [Figure 8: Two address space tables (Table A and Table B; columns id, addr, size, num) in a slave process, for the variables declared in source code: int a, b[10]; double x[20], y; long z[4]; char str[100].] The disadvantage leads to more complicated implementation and larger application size. However, this tradeoff is implemented ....
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala, "Checkpointing and its Applications", Proceedings of IEEE Fault-Tolerant Computing Symposium, pp. 22--31, June 1995.
....these approaches based on design diversity is the high cost of independently developing N different versions. This has led to the development of techniques that do not require redundancy in the form of design diversity, but are able to provide fault tolerance only against certain types of faults [1, 13, 14, 15, 29]. Many experiments have been done to evaluate the effectiveness, or validate assumptions in multiversion software, particularly the assumption about independence of failures [3, 5, 17, 18, 19]. Work has also been done to develop models to analyze the reliability provided by the two approaches to ....
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, C. Kintala, "Checkpointing and its applications" Proc. of 25th Intl. Symp. on Fault Tolerant Computing (FTCS-25), Pasadena, CA, June 1995.
....by CosMiC [9] where workstations are available for computations only when they are not in use by their owners. Failure and repair data was obtained by the authors of [9] and the checkpointing performance data was gleaned from performance results of CosMiC's transparent checkpointer libckp [35]. It is assumed that the copy on write optimization yields an 80 percent improvement in checkpoint overhead [24]. Note that the failure rate of LOW is extremely high, which is typical of these environments. [Figure: Running Time (hours) versus Number of processors for BT, LU, and EP.] ....
Y-M. Wang, Y. Huang, K-P. Vo, P-Y. Chung, and C. Kintala. Checkpointing and its applications. In 25th International Symposium on Fault-Tolerant Computing, pages 22--31, Pasadena, CA, June 1995.
....the size of data segment address space. Second, the source code of the application program is needed because it should be linked with a user level library and recompiled. Obtaining the source code is difficult and recompilation takes a long time or may require special compilation environments [15]. A checkpoint and recovery facility mechanism can be used in the following two different ways. One is user directed checkpoint and the other is user transparent checkpoint. In user directed checkpoint method, a user explicitly invokes system calls to take a checkpoint. The user decides when and ....
Y. M. Wang, Y. Huang, K.-P. Vo, P. Y. Chung, and C. Kintala, "Checkpointing and its applications," In Proc. IEEE Fault-Tolerant Computing Symposium (FTCS-25), pp. 22-31, Jun. 1995.
....checkpoints or message logs. Traditional checkpointing and message logging algorithms [5-12] are not directly applicable under such conditions. Previous proposals have suggested that checkpoints be sent back to Home Agents (HA) [13]. Others have proposed that stable storage on Mobile Support Stations (MSS) be used for checkpoints and message logs [14-16]. Because checkpoints and/or .... [Footnote: This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract DABT 63 96 C 0069, and in part by the Office of Naval Research under contract N00014 97 11013.]
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala, "Checkpointing and its Applications," Proceedings of the 25th International Symposium on Fault-Tolerant Computing, pp. 22--31, June 1995.
....in the 26th Symposium on Fault Tolerant Computing Systems, June 1996. The experimental work was done while both authors were at Carnegie Mellon University. 1 Introduction 1.1 Problem Description and Motivations Many log based rollback recovery protocols have been proposed in the literature [1, 2, 5, 6, 11, 14, 15, 17-22, 24, 26, 27, 29, 30, 34-41]. These protocols use a combination of process checkpointing and message logging to recover from failures. A checkpoint typically contains a snapshot of the process's state and sufficient information to restart the computation from the execution point at which the state was saved. The log ....
Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. Kintala. Checkpointing and its applications. In Proc. IEEE Fault-Tolerant Computing Symp., pages 22--31, June 1995.
....1 Introduction Checkpointing is the act of saving an intermediate state of a program to stable storage so that it may be resumed at a later time. It is a general technique that has been used in many applications including fault tolerant computing [1, 2] program debugging [3, 4] and others [5, 6, 7]. Checkpoints are used for fault tolerance in a straightforward manner: the programmer periodically checkpoints the execution of his or her program so that following a failure, the program may be restarted from the most recent checkpoint, thus minimizing the amount of lost work. Checkpoints are ....
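The periodic save-and-resume pattern described in this context can be sketched at the application level as follows. This is a minimal illustration under stated assumptions, not the mechanism of any library cited on this page: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the use of `pickle` for the state, and the toy summation workload are all hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically rename it over path.

    The rename keeps the previous checkpoint intact if a failure occurs
    while the new one is being written.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the most recently saved state, or default if none exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum 0..total_steps-1, checkpointing every checkpoint_every steps.

    fail_at is a hypothetical hook that simulates a crash at a given step.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same path after a failure resumes from the last saved step rather than from step 0, so only the work done since the most recent checkpoint is repeated.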
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala, "Checkpointing and its applications," in 25th International Symposium on Fault-Tolerant Computing, (Pasadena, CA), pp. 22--31, June 1995.
....workstation networks have become powerful computational resources, rivaling supercomputers in their utility for scientific programming. Traditionally, checkpointing and rollback recovery have been employed to provide fault tolerance for long running computations on all computing platforms (e.g. [4, 19, 21, 25, 27, 31]). By storing a checkpoint, a program limits the amount of re-execution necessary following a process or processor failure. In turn, this improves the program's running time in the presence of failures. How often to checkpoint is a question of paramount practical importance. If one checkpoints too ....
....start of the next checkpoint. The second, T, is defined to be I - C. If latency is equal to overhead, then T is the time between the end of one checkpoint and the beginning of the next checkpoint. Some checkpointing systems (e.g. [21, 19]) require the user to specify I, while others (e.g. [31]) require the user to specify T. When optimizations such as forked checkpointing are used, and L ≫ C, I is the more natural specification. However, all theoretical research on the optimal checkpoint interval assumes that T is specified. The difference between specifying I and T has a subtle ....
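The relationship between the two interval parameters can be made concrete with a small worked example. The sketch below assumes that I is measured from the start of one checkpoint to the start of the next (the definition is truncated in this context), that C is the checkpoint overhead, and that L is the checkpoint latency; the function names are hypothetical.

```python
def work_time_T(I, C):
    """T is defined as I minus C (all values in the same time unit)."""
    if C > I:
        raise ValueError("overhead C cannot exceed the interval I")
    return I - C


def start_to_start_I(T, C):
    """Inverse conversion: the interval I implied by a specified T."""
    return T + C


def gap_after_checkpoint(I, L):
    """Time from the completion of one checkpoint to the start of the next.

    Assumes a checkpoint that starts at time 0 completes at time L and the
    next checkpoint starts at time I.
    """
    if L > I:
        raise ValueError("latency L cannot exceed the interval I")
    return I - L


# Example in seconds: a checkpoint every 600 s with 45 s of overhead.
I, C = 600, 45
T = work_time_T(I, C)                 # 555
same = gap_after_checkpoint(I, L=C)   # equals T when latency == overhead
forked = gap_after_checkpoint(I, L=120)  # smaller than T when L > C
```

When latency exceeds overhead, as with forked checkpointing, T no longer matches the gap between the end of one checkpoint and the start of the next, which is one way to see why I is the easier quantity to specify in that case.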
Y-M. Wang et al. Checkpointing and its applications. In 25th Int. Symp. on Fault-Tolerant Comp., pages 22--31, June 1995.
....by hiding the latency of flushing the buffer, this approach achieved an overhead lower than other approaches. 1. Introduction Checkpointing and recovery is a technique for saving process state during normal execution and restoring the saved state after a failure to reduce the amount of lost work [1]. Process state refers to everything that is included in a checkpoint in order to guarantee a successful recovery and it should include both volatile and persistent state [2]. Persistent state includes the status of all the user files related to the current execution of the process. The status of ....
....to the current execution of the process. The status of a file includes its content and its active information, i.e. its descriptor, access mode, the offset to which it is positioned, etc. Although supporting the correct rollback of persistent state has become the primary concern of many users [1], existing checkpoint libraries usually save and restore only active information [2, 3]. This is because it is unacceptably expensive to save all the content of user files into a checkpoint due to their arbitrary size and number. This straightforward but incomplete way will result in inconsistent ....
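The inconsistency mentioned in this context can be illustrated with a short sketch that checkpoints only a file's active information (here, just its path and offset) and not its content. The helper names and the `demo` scenario are hypothetical and are not taken from any of the cited libraries.

```python
def snapshot_active_info(f):
    """Record only the file's active information, not its content."""
    return {"path": f.name, "offset": f.tell()}


def restore_active_info(info):
    """Reopen the file for update and reposition it at the saved offset."""
    f = open(info["path"], "r+b")
    f.seek(info["offset"])
    return f


def demo(path):
    """Checkpoint after one write, perform a second write, then roll back."""
    with open(path, "w+b") as f:
        f.write(b"AAAA")
        info = snapshot_active_info(f)  # checkpoint taken at offset 4
        f.write(b"BBBB")                # update made after the checkpoint
    # Simulated failure, followed by recovery from the checkpoint.
    with restore_active_info(info) as f:
        offset = f.tell()
        f.seek(0)
        content = f.read()
    return offset, content
```

After recovery the offset is back at its checkpointed value, but the bytes written after the checkpoint are still in the file, so the restored process sees file content that does not match its rolled-back memory state.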
Y.M. Wang, Y. Huang, et al. "Checkpointing and Its Applications", Proceedings of IEEE 25th Symposium on Fault-Tolerant Computing, June 1995, pp. 22-31.
Y. M. Wang, Y. Huang, K.P. Vo, P.Y. Chung, and C. Kintala. "Checkpointing and its applications." In Proceedings of the Twenty Fifth International Symposium on Fault-Tolerant Computing (FTCS-25), pp. 22-31, Jun. 1995.
....to choose its own checkpointing mechanism. In this section, we describe four different types of checkpoints that are currently available for use with the CosMiC system. 3.1. User transparent checkpoint library: libckp Libckp is a user transparent checkpoint library for Unix applications [20]. It can be linked with a user program to periodically save the program state on stable storage without requiring any modifications to the source code. The checkpointed program state includes (1) program counter, (2) program stack and stack pointer, (3) open file descriptors, (4) global and static ....
....function call that can be inserted into a user program to snapshot the entire process state at a user controlled program location. This can be useful for debugging purposes, or to avoid losing all useful work when a program abnormally exits due to temporary resource unavailability or glitch [20]. Second, libckp allows users to hook in an application specific recovery routine after data restoration but before the final program jump. This is useful for reestablishing whatever execution environment that is difficult to checkpoint, and for correcting state variables that should not have ....
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala. Checkpointing and its applications. In the 25th Intl. Symp. on Fault-Tolerant Computing, pages 22--31, June 1995.
....an existing checkpoint library into a transactional resource manager that can participate in such a global coordination. 1 Introduction In this paper, the term checkpointing refers to the action of recording critical memory and file state at a given point of program execution on stable storage [17]. One question is commonly asked by software developers of client server applications: If I already use database transactions for data persistence, why do I need checkpointing? Indeed, if an application can always store all state changes into a database at the end of each transaction, the ACID ....
....commit() or rollback() function of each resource manager based on that decision. 3.1 Implementing Checkpoint Libraries as Transactional Resource Managers Libft [9] is a checkpoint library for specifying and storing critical memory data. Libfcp, the persistent state checkpointing part of libckp [17], is a library for undoing file updates upon a rollback. Both libraries were originally designed for applications without transaction semantics. The main contribution of this paper is to consider critical memory and files as two private datastores accessed by a transaction processing ....
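The idea of undoing file updates upon a rollback, exposed through commit and rollback operations, can be sketched as a simple undo log. This is only an illustration under stated assumptions: the class `FileUndoLog` and its methods are hypothetical and do not reproduce the interface or implementation of Libfcp or libckp, and saving whole-file copies is a simplification that would be costly for large files.

```python
import os


class FileUndoLog:
    """Hypothetical undo log: remembers each file's content before the
    first update in a transaction so the update can be undone."""

    def __init__(self):
        # path -> original bytes, or None if the file did not exist
        self._saved = {}

    def write(self, path, data):
        """Overwrite path with data, saving the original content once."""
        if path not in self._saved:
            try:
                with open(path, "rb") as f:
                    self._saved[path] = f.read()
            except FileNotFoundError:
                self._saved[path] = None
        with open(path, "wb") as f:
            f.write(data)

    def commit(self):
        """Make the updates permanent by discarding the undo records."""
        self._saved.clear()

    def rollback(self):
        """Restore every updated file to its state at the last commit."""
        for path, original in self._saved.items():
            if original is None:
                if os.path.exists(path):
                    os.remove(path)
            else:
                with open(path, "wb") as f:
                    f.write(original)
        self._saved.clear()
```

A coordinator that has collected votes from all participants would call `commit()` or `rollback()` on an object like this in the same way it would on a database resource manager, which is the sense in which a file-undo facility can act as one more transactional participant.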
Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. Kintala. Checkpointing and its applications. In Proc. IEEE Fault-Tolerant Computing Symp., pages 22--31, June 1995.
....completion. 3.2. Exception Masking Exception masking hides exceptions from applications. It requires either application specific information for fixing exceptions or certain general software fault tolerance approaches such as design diversity [15], data diversity [1], or environment diversity [22] to bypass them. We distinguish three types of exception masking: retry, alternate, and fixing. 3.2.1 Masking exceptions through retry Certain exceptions are due to transient problems in the execution environment and can disappear in a subsequent retry. For example, on a multi tasking and ....
....on a network of machines. Many users routinely use CosMiC to execute long running programs with unpredictable varying memory usage. Valuable work may be lost if such a program simply exits upon an out of memory exception. A checkpoint before exit construct based on our checkpointing library libckp [22] has been proposed to minimize the amount of lost work. That scheme is not easy to use because source code must be modified to include the provided construct and the process must be manually migrated. Figure 7 shows how to overcome these disadvantages by using Xept to intercept malloc() calls and ....
[Article contains additional citation context not shown here]
Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. Kintala. Checkpointing and its applications. In Proc. IEEE Fault-Tolerant Computing Symp., pages 22--31, 1995.
....a process is interrupted when a timer expires and a snapshot of the entire state is saved. The checkpoint interval is on the order of tens of minutes to hours. Recovery is performed by restoring the checkpointed state, and the execution returns from the point at which the checkpoint was taken [17]. In contrast, a typical continuously running server application consists of an initialization step for both data and communication, followed by an infinite loop which receives a service request from a client, performs the requested processing, sends the results back to the client (if required) ....
Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. Kintala, "Checkpointing and its applications," in Proc. IEEE Fault-Tolerant Computing Symp., pp. 22--31, June 1995.
No context found.
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala. Checkpointing and its applications. In Proc. IEEE Fault-Tolerant Computing Symp, pages 22--31, June 1995.
No context found.
Y. M. Wang, Y. Huang, K. P. Vo, P.Y. Chung, and C. Kintala. Checkpointing and its Applications. In Proceedings of the 25th International Symposium on Fault-Tolerant Computing, pages 22--31, June 1995.
No context found.
Wang YM, Huang Y, Vo KP, Chung PY, Kintala CMR. Checkpointing and its applications. Proceedings 25th IEEE International Symposium on Fault-Tolerant Computing, Pasadena, CA, June 1995. IEEE Computer Society: Los Alamitos, CA, 1995; 22--31.
No context found.
Y. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala. Checkpointing and its applications. In FTCS-25, 1995.
No context found.
Y. M. Wang et al., "Checkpointing and its applications," in Digest of the 25th Annual International Symposium on Fault-Tolerant Computing, (Pasadena, CA), pp. 22--31, June 1995.
No context found.
Y-M. Wang, Y. Huang, K-P. Vo, P-Y. Chung, and C. Kintala. Checkpointing and Its Applications. In Proceedings of the 25th International Symposium on Fault-Tolerant Computing Systems, pages 22--31, June 1995.
No context found.
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C.M.R. Kintala. Checkpointing and its applications. In Symposium on Fault-Tolerant Computing, Pasadena, CA, pages 22--31, June 1995.
No context found.
Y. M. Wang, Y. Huang, K. P. Vo, P. Y. Chung, and C. Kintala, "Checkpointing and its Applications," In 25th International Symposium on Fault-Tolerant Computing, pages 22-31, Pasadena, CA, June 1995.
No context found.
Y-M. Wang, Y. Huang, K-P. Vo, P-Y. Chung, and C. Kintala. Checkpointing and its applications. In 25th International Symposium on Fault-Tolerant Computing, pages 22--31, Pasadena, CA, June 1995.
No context found.
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, C. Kintala, "Checkpointing and its applications", Proceedings of IEEE Fault-Tolerant Computing Symposium, June 1995, pp. 22--31.
No context found.
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, C. Kintala, "Checkpointing and its applications", Proceedings of IEEE Fault-Tolerant Computing Symposium, June 1995, pp. 22--31.
No context found.
Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. Kintala, "Checkpointing and its Applications," Proceedings of IEEE Fault-Tolerant Computing Symposium, pp. 22--31, June 1995.
No context found.
Y-M. Wang, Y. Huang, K-P. Vo, P-Y. Chung, and C. Kintala. Checkpointing and its applications. In 25th International Symposium on Fault-Tolerant Computing, pages 22--31, Pasadena, CA, June 1995.
No context found.
Y-M. Wang, Y. Huang, K-P. Vo, P-Y. Chung, and C. Kintala. Checkpointing and its applications. In 25th International Symposium on Fault-Tolerant Computing, pages 22--31, Pasadena, CA, June 1995.