20 citations found.
Y. M. Wang, Y. Huang, and W. K. Fuchs. Progressive retry for software error recovery in distributed systems. In Proc. 23rd Int. Conf. on Fault-Tolerant Computing (FTCS-23), pp. 138-144, Toulouse, France, 1993.

This paper is cited in the following contexts:
System Checkpointing using Reflection and Program Analysis - Whaley (2001)   (Correct)

....include any sort of program analysis to determine the extent of the checkpointing. There has been other work on checkpointing in the context of migrating applications [10], using extra processors for fault tolerance [12], post mortem and replay debugging, elimination of boundary condition errors [13], etc. Our work is very similar to user level transparent checkpointing techniques [11]. Such techniques usually work by compiling the application program with a special checkpointing library. Our technique, on the other hand, relies on program analysis and therefore can optimize the result for ....

Y. M. Wang, Y. Huang, and W. K. Fuchs. Progressive retry for software error recovery in distributed systems. In Proc. 23rd Int. Conf. on Fault-Tolerant Computing (FTCS-23), pp. 138-144, Toulouse, France, 1993.


Analysis of Preventive Maintenance in Transactions.. - Garg, Puliafito.. (1998)   (7 citations)  (Correct)

....dynamic data segments. Persistent state refers to all the user files related to a program's execution while the OS environment refers to resources that the program must access through the operating system, such as swap space, file systems, communication channels, keyboard, monitors, time etc. [27]. Typical transient failures occur because of design faults in software which result in unacceptable erroneous states in the OS environment of the process. Therefore, the key idea behind environment diversity is to modify the operating environment of the running process. Typically, this has been ....

Y. M. Wang, Y. Huang and W. K. Fuchs, "Progressive retry for software error recovery in distributed systems", in Proc. IEEE Fault Tolerant Computing Symposium, pp. 138-144, June 1993.


An Evaluation of the Recovery-Related Properties of Software Faults - Chandra (2000)   (Correct)

....same operation at a later time will usually succeed. Some techniques have also been suggested to induce changes in the environment to increase the success of a retry without affecting the program correctness. One such technique changes the message ordering to simulate changes in the environment [Wang93]. The fact that recovery protocols can recover from this class of faults should not preclude us from developing techniques to prevent these faults from happening in the first place. One such technique [Savage97] looks at detecting data race conditions in lock-based multithreaded programs. 4.8 ....

....and environment dependent transient faults. Several schemes have been proposed to enable generic recovery to work for a larger class of faults. For example, some recovery techniques seek to increase the nondeterminism in the application by reordering events such as message receives [Wang93]: these are basically techniques to induce change to the external environment. These do not transform environment independent faults into environment dependent faults. Rather, they increase the chance that an environment dependent fault will experience a different operating environment (order ....

Yi-Min Wang, W. Kent Fuchs, and Yennuan Huang. Progressive Retry for Software Error Recovery in Distributed Systems. In Proceedings of the 1993 Symposium on Fault-Tolerant Computing, June 1993.


Semantics of Recovery Lines for Backward Recovery in.. - Brzezinski, Helary.. (1995)   (Correct)

....to independent checkpointing ([3]). 5.2.1 Pessimistic message logging. Pessimistic (also called synchronous) message logging refers to the stable logging of all the application messages before they are processed ([6, 25, 30, 45]). Thus, assuming a piecewise deterministic program, recovery points can be constructed by rolling back application processes to their last checkpoints, and replaying sequentially all appropriate application messages. Of course, it is only possible if logs have been put in stable storage. It ....

Y-M. Wang, Y. Huang, W.K. Fuchs, Progressive retry for software error recovery in distributed systems, Proc. Fault Tolerant Computing Systems, 1993, pp. 138-144.


Algorithms for Building Fault-Tolerant Distributed Systems - Mitchell (1997)   (Correct)

....go in its execution. The benefit of causal methods, that is, creating no orphan states, is the cause of this. Pessimistic message logging schemes also have this problem. Optimistic methods roll back further, in general, and can be more easily programmed to roll back far enough to avoid Heisenbugs [69]. Rolling back past the bug, however, is not enough to ensure that eventually the failure will be avoided. The recovering process must try different combinations of message receive orders at each recovery attempt until it knows it is successful. Then there is the tradeoff as to how far to roll ....

Y.-M. Wang, Y. Huang, and W. K. Fuchs. Progressive retry for software error recovery in distributed systems. Proceedings of the 23rd Annual IEEE International Fault-Tolerant Computing Symposium, pages 138--144, July 1993.
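
The idea in the context above, rolling back and then trying different message receive orders on each recovery attempt, can be illustrated with a minimal Python sketch. This is not the algorithm from Wang, Huang, and Fuchs; it is a toy under stated assumptions: the names `retry_with_reordering` and `buggy_handler` are hypothetical, a failure is modelled as a `RuntimeError`, and every permutation of the logged messages is treated as a legal replay order (a real system would only consider orders that respect causal dependencies between messages).

```python
import copy
from itertools import permutations


def retry_with_reordering(checkpoint, logged_messages, handler, max_attempts=24):
    """Replay logged messages from a checkpoint, trying a new order per attempt.

    Assumption: any permutation of the log is an acceptable replay order.
    Returns (final_state, order_used) on success, or None if every attempt
    up to max_attempts fails.
    """
    for attempt, order in enumerate(permutations(logged_messages)):
        if attempt >= max_attempts:
            break
        state = copy.deepcopy(checkpoint)  # roll back to the checkpoint
        try:
            for msg in order:
                state = handler(state, msg)
        except RuntimeError:
            continue  # this receive order hit the bug; try the next one
        return state, list(order)
    return None


def buggy_handler(state, msg):
    """Hypothetical handler whose bug is triggered by one specific ordering."""
    if msg == "reset" and state and state[-1] == "write":
        raise RuntimeError("reset directly after write is mishandled")
    return state + [msg]


if __name__ == "__main__":
    print(retry_with_reordering([], ["write", "reset", "read"], buggy_handler))
```

The first attempt replays the log in its original order, so a failure that was caused by something other than ordering may already disappear on plain replay; reordering is only tried afterwards.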


Phoinix - A Fault-Tolerant Object Service in OMA - Liang, Chou, Yuan   (Correct)

....environment in which object implementations are constructed with desired fault tolerance capability in a semi automatic fashion. We categorize the fault tolerance capability into three levels: restart service (level one), checkpoint recovery service (level two), and replication service (level three) [22]. Object implementations in Phoinix with fault tolerance capability of level 1, 2, and 3 are called restart objects, logable objects, and replicated objects, respectively. As the names suggest, the restart object resumes the service as a fresh server after recovering from failure whereas the ....

....the audit trail before redoing requests. A programmer can redefine his own policy to reorder the logged requests. Huang has suggested that most transient software failures can be masked by redoing the past requests from the last checkpoint to the crash point in the audit trail in a different order [22]. Thus object implementations developed with Phoinix will be able to tolerate transient software failures provided the recovery process with this mechanism. One possible approach to support this mechanism in Phoinix is to modify or overload the PersistRequest() in the Fault Tolerance class ....

Y. M. Wang, Y. Huang and W. K. Fuchs, "Progressive Retry for Software Error Recovery in Distributed Systems," Proceedings of 23rd Fault-Tolerant Computing Symposium, 1993.


Consistent Logical Checkpointing - Nitin Vaidya (1994)   (2 citations)  (Correct)

....2 discusses the notion of a logical checkpoint. Section 3 presents a consistent checkpointing algorithm proposed by Chandy and Lamport [1]. Section 4 presents the basic principle behind the proposed approach; implementation issues are discussed in Section 6. Our approach is closely related to [1, 5, 9, 13], as discussed in Section 5. Section 7 concludes the report. 2 A Logical Checkpoint A process is said to be deterministic if its state depends only on its initial state and the messages delivered to it [10]. A deterministic process can take two types of checkpoints: a physical checkpoint or a ....

....saved on the stable storage. A process is said to have taken a logical checkpoint at time t 1 , if enough information is saved on the stable storage to allow the process state at time t 1 to be recovered. To the best of our knowledge, the term logical checkpoint was first introduced by Wang et al. [13, 14], who also presented one approach for taking a logical checkpoint. Now we present three approaches for taking a logical checkpoint at time t 1 . Although the three approaches are equivalent, each approach may be more attractive for some applications than the other approaches. Not all approaches ....

[Article contains additional citation context not shown here]

Y. M. Wang, Y. Huang, and W. K. Fuchs, "Progressive retry for software error recovery in distributed systems," in Digest of papers: The 23rd Int. Symp. Fault-Tolerant Comp., pp. 138--144, 1993.


An Overview of Checkpointing in Uniprocessor and Distributed.. - Plank (1997)   (18 citations)  (Correct)

....reaches a synchronization state that the programmer never envisioned. These states occur infrequently, and often as a result of boundary conditions which do not repeat themselves often. In such situations, checkpointing and rollback recovery can be employed to eliminate the effect of the bug [WHF93]. As in a fault tolerant execution, checkpoints are taken periodically, and if an error occurs due to one of these boundary conditions, the system is rolled back to the previous checkpoint. It is statistically unlikely that the same boundary conditions will occur, and thus that the software will ....

Y. M. Wang, Y. Huang, and W. K. Fuchs. Progressive retry for software error recovery in distributed systems. In 23rd International Symposium on Fault-Tolerant Computing, pages 138--144, June 1993.


Some Thoughts on Distributed Recovery - Vaidya (1994)   (2 citations)  (Correct)

....without blocking some processes. Plank [10] shows that staggering indeed reduces the overhead significantly for many applications. Here, we present a simple alternative for coordinated checkpointing that allows arbitrary staggering of checkpoints. The solution presented below is closely related to [2, 6, 10, 14], as discussed later. 7 Suggested Solution The solution suggested here can be summarized as follows: staggered checkpoints + coordinated message logging = consistent logical checkpoints. The basic idea is to coordinate logical checkpoints [14, 15] rather than physical checkpoints. A physical ....

....presented below is closely related to [2, 6, 10, 14], as discussed later. 7 Suggested Solution The solution suggested here can be summarized as follows: staggered checkpoints + coordinated message logging = consistent logical checkpoints. The basic idea is to coordinate logical checkpoints [14, 15] rather than physical checkpoints. A physical checkpoint of a process is taken by saving the process state on the stable storage. A logical checkpoint is taken by logging all the messages received by the process since its most recent physical checkpoint on the stable storage. Thus, a physical ....

[Article contains additional citation context not shown here]

Y. Wang, Y. Huang, and W. K. Fuchs, "Progressive retry for software error recovery in distributed systems," in Digest of papers: The 23rd Int. Symp. Fault-Tolerant Comp., pp. 138--144, 1993.


On Staggered Checkpointing - Vaidya (1996)   (1 citation)  (Correct)

....approach, that is similar to logical checkpointing. Our algorithm staggers the checkpoints, while the scheme in [9] does not allow staggering. [9] also assumes synchronized communication and an upper bound on communication delays; no such assumptions are made in the proposed scheme. Wang et al. [13] introduced the term logical checkpoint. They present an algorithm to determine a recovery line consisting of consistent logical checkpoints, after a failure occurs. This recovery line is used to recover from the failure. Their goal is to determine the latest consistent recovery line using the ....

....log contains the receive sequence number for the message as well as the entire message. [Figure 1: Physical checkpoint + message log = logical checkpoint] This approach is essentially identical to that presented by Wang et al. [13]. Figure 1 presents an example wherein process P takes a physical checkpoint at time t 0 . Messages M1, M2 and M3 are delivered to process P by time t 1 . To establish a logical checkpoint of process P at time t 1 , messages M1, M2 and M3 are logged on the stable storage. As process P is ....

Y. M. Wang, Y. Huang, and W. K. Fuchs, "Progressive retry for software error recovery in distributed systems," in 23rd Int. Symp. Fault-Tolerant Comp., pp. 138--144, 1993.


Staggered Consistent Checkpointing - Vaidya (1999)   (2 citations)  (Correct)

....that is similar to logical checkpointing. Our algorithm staggers the checkpoints, while the scheme in [11] does not allow staggering. [11] also assumes synchronized communication and an upper bound on communication delays; no such assumptions are made in the proposed scheme. Wang et al. [18] introduced the term logical checkpoint. They present an algorithm to determine a recovery line consisting of consistent logical checkpoints, after a failure occurs. This recovery line is used to recover from the failure. Their goal is to determine the latest consistent recovery line using the ....

....Approach 1: One approach for establishing a logical checkpoint at time t 1 is to take a physical checkpoint at some time t 0 t 1 and log (on stable storage) all messages delivered to the process between time t 0 and t 1 . This approach is essentially identical to that presented by Wang et al. [18]. Figure 1 presents an example wherein process P takes a physical checkpoint at time t 0 . Messages M1, M2 and M3 are delivered to process P by time t 1 . To establish a logical checkpoint of process P at time t 1 , messages M1, M2 and M3 are logged on the stable storage. We summarize this ....

Y. M. Wang, Y. Huang, and W. K. Fuchs, "Progressive retry for software error recovery in distributed systems," in Digest of papers: The 23rd Int. Symp. Fault-Tolerant Comp., pp. 138--144, 1993.
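
The logical checkpoint described in the contexts above (a physical checkpoint taken at an earlier time plus a stable log of every message delivered since) can be sketched in a few lines of Python. This is an illustrative toy, not code from any of the cited papers: the class and function names are hypothetical, stable storage is modelled as an in-memory object, and the process is assumed to be deterministic in the sense quoted above, so its state depends only on its initial state and the messages delivered to it.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Process:
    """Toy deterministic process: the state is a function of delivered messages."""
    state: int = 0

    def deliver(self, msg: int) -> None:
        # Any deterministic transition works; this one is order-sensitive.
        self.state = self.state * 31 + msg


@dataclass
class StableStorage:
    """Stand-in for stable storage (assumption: kept in memory for the sketch)."""
    physical: Optional[Process] = None
    log: List[int] = field(default_factory=list)


def take_physical_checkpoint(proc: Process, storage: StableStorage) -> None:
    """Save the full process state and start a fresh message log."""
    storage.physical = copy.deepcopy(proc)
    storage.log.clear()


def deliver_and_log(proc: Process, msg: int, storage: StableStorage) -> None:
    """Log the message before the process handles it."""
    storage.log.append(msg)
    proc.deliver(msg)


def recover_logical_checkpoint(storage: StableStorage) -> Process:
    """Rebuild the later state: restore the physical checkpoint, replay the log."""
    proc = copy.deepcopy(storage.physical)
    for msg in storage.log:
        proc.deliver(msg)
    return proc


if __name__ == "__main__":
    p, disk = Process(), StableStorage()
    p.deliver(5)
    take_physical_checkpoint(p, disk)   # physical checkpoint
    for m in (1, 2, 3):                 # messages M1, M2, M3
        deliver_and_log(p, m, disk)     # logical checkpoint after M3
    assert recover_logical_checkpoint(disk).state == p.state
```

The replay only reproduces the later state because the process is deterministic and the log preserves delivery order; a nondeterministic event in between would need its own log entry or a physical checkpoint.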


Guaranteed Deadlock Recovery: Deadlock Resolution with.. - Wang, Merritt.. (1995)   Self-citation (Wang)   (Correct)

....different execution paths to bypass the deadlock. Since rolling back any process (called the victim) involved in a deadlock cycle can break the cycle, we have the freedom to choose among multiple potential victims and hence multiple potential recovery lines. Our previous work on progressive retry [9, 10] applied the technique of checkpointing and message logging to recovering failed processes from software errors caused by unknown software bugs; message replaying and message reordering were employed as heuristics to bypass the software bugs. This paper shows that it is possible to guarantee error ....

....the resource after its use. For simplicity, we assume that resources themselves do not have states, and each resource manager always has a checkpoint before every message receiving event. This can be achieved by low cost critical data checkpointing or by message logging under piecewise determinism [9]. For the purpose of presentation, we first assume that all resource related messages are monitored by a central server. Distributed algorithms will be considered in a later section. The server maintains a wait for graph (WFG) [11] as follows: a WFG edge is drawn from P i to P j if P i sends r ....

[Article contains additional citation context not shown here]

Y. M. Wang, Y. Huang, and W. K. Fuchs, "Progressive retry for software error recovery in distributed systems," in Proc. IEEE Fault-Tolerant Computing Symp., pp. 138--144, June 1993.


Software Fault Tolerance in the Application Layer - Huang (1995)   (6 citations)  Self-citation (Huang)   (Correct)

....methods for design diversity are the recovery block approach (see Chapter 1) and the N version programming approach (see Chapter 2). However, the failures exhibited by those software faults can be transient, i.e. the failure may not recur if the software is reexecuted on the same input [Gra91, Wan93]; this is a frequently used technique in hardware to mask transient hardware failures. Sullivan and Chillarege [Sul92] also showed that a large percentage of software errors are triggered by peak conditions in workload, exception handling and timing. Such errors are likely to disappear when the ....

Y. M. Wang, Y. Huang and W. K. Fuchs. Progressive retry for software error recovery in distributed systems. In Proc. of 23rd International Symposium on Fault-Tolerant Computing (FTCS-23), pages 138--144, June 1993.


A Survey of Rollback-Recovery Protocols in.. - Elnozahy, Alvisi.. (1996)   (161 citations)  Self-citation (Wang)   (Correct)

No context found.

Y. M. Wang, Y. Huang and W. K. Fuchs. "Progressive retry for software error recovery in distributed systems." In Proceedings of the Twenty Third International Symposium on Fault-Tolerant Computing Systems, FTCS-23, pp. 138--144, Jun. 1993.


Guaranteed Deadlock Recovery: Deadlock Resolution with Rollback .. - Yi-Min Wang (1995)   Self-citation (Wang)   (Correct)

No context found.

Y. M. Wang, Y. Huang, and W. K. Fuchs. Progressive retry for software error recovery in distributed systems. In Proc. IEEE Fault-Tolerant Computing Symp., pages 138--144, June 1993.


A Survey of Rollback-Recovery Protocols in Message-Passing.. - Elnozahy, Johnson, Wang (1996)   (161 citations)  Self-citation (Wang)   (Correct)

....research work aims at providing the benefits of piecewise determinism (such as efficient output commit and recovery) without requiring applications to satisfy the piecewise deterministic model. It is based on the observation that piecewise determinism can be modeled as having a logical checkpoint [91, 179, 190] before every nondeterministic event. Therefore, checkpoint based rollback recovery can mimic piecewise determinism by taking an actual checkpoint before every nondeterministic event. The main challenge is how to reduce the number of checkpoints while still preserving desirable properties. It has ....

....take advantage of piecewise determinism at all. In practice, it is important to support systems consisting of both deterministic and nondeterministic processes [87, 90]. One challenge is to handle unreplayable nondeterministic events while still preserving the advantages of piecewise determinism [41, 184, 190]. Although most rollback recovery techniques were originally designed for tolerating hardware failures, they have also been applied to software and protocol error recovery [169, 184, 190, 193]. Rollback recovery in shared memory and distributed shared memory systems has also been extensively ....

[Article contains additional citation context not shown here]

Y. M. Wang, Y. Huang, and W. K. Fuchs. Progressive retry for software error recovery in distributed systems. In Proc. IEEE Fault-Tolerant Computing Symp., pages 138--144, June 1993.


Lazy Checkpoint Coordination for Bounding Rollback Propagation - Wang, Fuchs (1993)   (30 citations)  Self-citation (Wang Fuchs)   (Correct)

....shown that logging a nondeterministic event equivalently places a logical checkpoint [18] at the end of the ensuing state interval, and these extra logical checkpoints serve to eliminate the domino effect. Coordinated checkpointing achieves domino free recovery by sacrificing a certain degree of process autonomy and incurring run time and extra message overhead. Usually, whenever a ....

Y. M. Wang, Y. Huang, and W. K. Fuchs, "Progressive retry for software error recovery in distributed systems," in Proc. IEEE Fault-Tolerant Computing Symposium, pp. 138--144, June 1993.


Maximum and Minimum Consistent Global Checkpoints and Their.. - Yi-Min Wang (1995)   (2 citations)  Self-citation (Wang)   (Correct)

....order, even though the state is not directly checkpointed. For example, process P i in Figure 3 can recreate the state interval S i;x 1 by restoring the physical checkpoint C, and replaying m x , e and m y in that order. Equivalently, it can be modeled as having an additional logical checkpoint [19] anywhere inside S i;x 1 . For recovery applications, it usually suffices to place a logical checkpoint at the end of each state interval, as shown in Figure 3(b) For some debugging applications, it may be desirable to place an additional logical checkpoint immediately after every ....

....to bound the rollback distance, and take additional uncoordinated checkpoints to further localize the recovery [20, 21]. Studies have shown that, since most software errors in production software are transient, rollback retry can often provide an effective way of bypassing software bugs [19, 22, 23]. Suppose a software error is detected at the point marked X in Figure 4(a), possibly caused by an unexpected nondeterministic event. A diagnosis procedure examines the error symptom and determines that the system should roll back to a state containing checkpoints C and D to maximize the chance ....

Y. M. Wang, Y. Huang, and W. K. Fuchs, "Progressive retry for software error recovery in distributed systems," in Proc. IEEE Fault-Tolerant Computing Symp., pp. 138--144, June 1993.
