P. Leu and B. Bhargava. "Concurrent Robust Checkpointing and Recovery in Distributed Systems." Fourth International Conference on Data Engineering. 154--163. 1988.
....treating each nondeterministic influence as a message, logging it and replaying it during recovery. The message logging approach allows states of a process in addition to those saved in a checkpoint to be recovered. Recovery protocols based instead on checkpointing without message logging (e.g. [1, 5, 6, 7, 8, 15, 16, 17, 31]) can recover only process states that have been checkpointed, often forcing processes to roll back further than otherwise required after a failure. Message logging allows each process to be checkpointed less frequently, and may in general reduce failure free overhead since logging a message is ....
P. Leu and B. Bhargava. "Concurrent Robust Checkpointing and Recovery in Distributed Systems." Fourth International Conference on Data Engineering. 154--163. 1988.
....the rollback protocol can decide to discard the message only when the message is a knowable orphan. 1.3. Previous Work In this paper, we concentrate on rollback based on optimistic message logging and replay. Recovery protocols based instead on checkpointing without message logging (e.g. [1, 3, 4, 5, 8, 15, 16, 29]) may force processes to roll back further than otherwise required, since processes can only recover states that have been checkpointed. Recovery protocols based on pessimistic message logging (e.g. [2, 9, 11, 21]) can cause processes to delay execution until incoming messages are logged to ....
P. Leu and B. Bhargava. "Concurrent Robust Checkpointing and Recovery in Distributed Systems." Fourth International Conference on Data Engineering. 154--163. 1988.
....of the failed process is greater than the sequence numbers of the latest checkpoints of all the other processes, then no process other than the failed process needs to roll back; they only need to take a checkpoint. 6 Comparison With Existing Work The checkpointing algorithms proposed in [10, 3] have a two phase structure. This causes processes to suspend the normal computation for making checkpoint decisions which may greatly increase the overhead during normal computation. The QSA does not cause any such overhead and avoids domino effect completely during recovery. In Acharya et al.'s ....
B. Bhargava and P. Leu. "Concurrent Robust Checkpointing and Recovery in Distributed Systems". In Proc. of 4th IEEE Int. Conf. Data Eng., pages 154--163, February 1988.
....processes to take checkpoints. Therefore, consistent checkpointing suffers from high overhead associated with the checkpointing process. Much of previous work in consistent checkpointing has focused on minimizing the number of processes that must participate in taking a consistent checkpoint [5, 10, 13] or on reducing the number of messages required to synchronize the recording of a consistent checkpoint [22, 23]. However, these algorithms (called blocking algorithms) force all relevant processes in the system to block their computations during the checkpointing process. Checkpointing includes the ....
....[Figure 1: Checkpointing and dependency information] 2.3 Basic Idea of Non-blocking Algorithms Most of the existing consistent checkpointing algorithms [5, 10, 13] rely on the two phase protocol and save two kinds of checkpoints on the stable storage: tentative and permanent. In the first phase, the initiator takes a tentative checkpoint and forces all relevant processes to take tentative checkpoints. Each process informs the initiator whether it succeeded ....
[Article contains additional citation context not shown here]
P.Y. Leu and B. Bhargava. "Concurrent Robust Checkpointing and Recovery in Distributed Systems". Proc. 4th IEEE Int. Conf. on Data Eng., pages 154--163, 1988.
....example, if P1 fails after receiving message M, it would roll back to checkpoint C1,3 and as a result P2 will be forced to roll back to C2,3. 6 Comparison with Existing Algorithms In this section, we compare the performance of the MQSA with the existing algorithms. The algorithms proposed in [12, 2] have a two phase structure. This causes processes to suspend the normal computation for making checkpoint decisions which greatly increases the overhead during normal computation. The MQSA does not cause any such overhead and avoids domino effect completely during recovery. In [12] if multiple ....
B. Bhargava and P. Leu. "Concurrent Robust Checkpointing and Recovery in Distributed Systems". In Proc. of 4th IEEE Int. Conf. Data Eng., pages 154--163, February 1988.
....large delays. There has been much research in designing checkpointing algorithms [2, 4-8, 10-17]. However, all known algorithms fail to satisfy one or more of the above requirements. Algorithms [2, 7, 10] are designed for systems that use unreliable communication channels. Algorithms presented in [7, 8, 11, 12, 17] are blocking. Algorithms [14, 15] rely on the assumption that all the processor clocks are approximately synchronized, which limits the generality of these algorithms. Some of these algorithms [2, 4, 5] require O(n^2) communication messages, where n is the number of processors in the ....
P.Y. Leu, B. Bhargava. Concurrent Robust Checkpointing and Recovery in Distributed Systems. In Proc. of 4th IEEE Int. Conf. on Data Eng., pages 154--163, 1988.
....a consistent snapshot to a minimum. This can be achieved by forcing a minimal subset of nodes to take their local snapshots, and by employing data structures that impose low memory overheads. Consistent snapshot collection algorithms for static distributed systems have been proposed in [6, 7, 15, 12, 13, 16, 17]. The snapshot collection algorithm by Chandy and Lamport [6] forces every node to take its local snapshot. The underlying computation is allowed to proceed while the global snapshot is being collected. Snapshot collection algorithms in [7, 13, 16, 17] also force every node to take its snapshot. ....
P.-J. Leu and B. Bhargava. Concurrent Robust Checkpointing and Recovery in Distributed Systems. In Proceedings of the 4th International Conference on Data Engineering, pages 154--163, February 1988.
....a consistent state of the whole system [8]. After a failure, failed processes, as well as surviving processes, are rolled back to their last checkpoint. Consistent checkpointing techniques can further be divided into two sub-groups: blocking and non-blocking techniques. In blocking techniques [9, 10, 11], processes synchronize together when saving a checkpoint and are halted during the whole checkpointing protocol. In non-blocking techniques [12, 13, 14], each process takes a temporary checkpoint and resumes its execution. Later on, temporary checkpoints are made definitive when it is known that ....
P. Leu and B. Bhargava. Concurrent robust checkpointing and recovery in distributed systems. In Proc. of 4th International Conference on Data Engineering, pages 154--163, Los Angeles (CA), February 1988.
....to stable storage (storage for which data persists even in the event of failures). Because processes are not replicated, when a failure occurs, the saved checkpoint must be reloaded so that the entire computation does not have to restart from the beginning. The different checkpointing methods [2, 7, 10, 20, 26, 24, 41, 44, 54, 59, 61, 65] offer other tradeoffs which we will discuss in Section 1.3.3. The Manetho system employs both active replication and checkpointing [24]. Active replication is used for system processes (for example, name services) and a checkpointing scheme is designed for applications. As mentioned above, the ....
....for reducing delay of the application is the process recovery method itself. In addition to the time it takes to recover a failed process, some checkpointing methods [2, 41] can cause delays when sending and receiving certain messages (for example, messages received during recovery). Other methods [7, 20, 26, 41, 44, 59, 61, 65] can cause unfailed processes to go back to the state of a previous checkpoint when a failure occurs. Finally, checkpointing schemes impose blocking on all messages sent outside of the distributed computation. All three of these characteristics create additional latency and slow the application. ....
[Article contains additional citation context not shown here]
P. Leu and B. Bhargava. Concurrent robust checkpointing and recovery in distributed systems. Proceedings of the 4th International Conference on Data Engineering, pages 154--163, Feb 1988.
....[10, 12, 14]. Plank and Li describe several algorithms for reducing latency when saving checkpoints on multiprocessor machines [11]. Their algorithms could be used to improve the efficiency of our checkpointing library. Many researchers have worked on checkpointing for distributed systems [2, 4, 5, 7, 8, 9, 13]. Distributed checkpointing algorithms are primarily concerned with forming consistent global checkpoints state either by coordinating all of the processors or by building a consistent checkpoint from independent checkpoints. Our checkpointing library uses the thread library's memory consistency ....
P.-J. Leu and B. Bhargava. Concurrent robust checkpointing and recovery in distributed systems. In Proceedings of the International Conference on Data Engineering, pages 154--163, Feb. 1988.
....With POSIX threads a forked child will only have one thread. Thus the state of the thread library in the new process will be different from that in the original process. The recovery would have to account for the difference. Many researchers have worked on checkpointing for distributed systems [2, 3, 4, 6, 7, 9, 13]. Distributed checkpointing algorithms are primarily concerned with forming consistent global checkpoints state either by coordinating all of the processors or by building a consistent checkpoint .... [Figure: Execution Time (Seconds) versus Number of Threads]
P.-J. Leu and B. Bhargava. Concurrent robust checkpointing and recovery in distributed systems. In Proceedings of the International Conference on Data Engineering, pages 154--163, Feb. 1988.
....systems are notoriously difficult to design. An overview of the problems of fault tolerance can be found in [6]. A particularly difficult task is the saving of distributed system state. This problem was described in [21], initially solved in [4], and later addressed by a number of authors ([2, 3, 8, 13, 15, 20, 24, 29], to mention only a few). This multiplicity of solutions stems partly from assumptions of different failure semantics [6] and architectures. For example, in [7] it is assumed that processor clocks are synchronized within ε units of each other and that interprocess communication delays are ....
P-J Leu and B. Bhargava. Concurrent robust checkpointing and recovery in distributed systems. In Fourth Conference on Data Engineering, IEEE, pages 154--163, 1988.
....any assumption about the underlying micro-kernel nor about the machine on which the system ran (mono-processor, multiprocessor). Consistent checkpointing techniques also fall in two subgroups: blocking and non-blocking techniques. In blocking techniques [Tamir Sequin 84], [Koo Toueg 86], [Leu Bhargava 88], processes halt and synchronize with each other when saving a local checkpoint. To minimize halt time duration, several studies have focused on how to reduce the number of dependent processors involved in a checkpoint and the number of messages exchanged [Koo Toueg 86], [Ahamad Lin 89] ....
P. Leu & B. Bhargava. Concurrent robust checkpointing and recovery in distributed systems. In Proc. of 4th International Conference on Data Engineering, pages 154--163, Los Angeles (CA), February 1988.
....result in considerable message overhead and also may result in high checkpointing overhead. The QSA does not have any additional message overhead and the checkpointing overhead is nominal. A considerable body of work is available on checkpointing and recovery for static distributed systems [10, 14, 5, 17, 19, 20, 21, 23]. The checkpointing algorithms proposed in [14, 5] have a two phase structure. This causes processes to suspend the normal computation for making checkpoint decisions which may greatly increase the overhead during normal computation. The QSA does not cause any such overhead and avoids domino ....
....overhead. The QSA does not have any additional message overhead and the checkpointing overhead is nominal. A considerable body of work is available on checkpointing and recovery for static distributed systems [10, 14, 5, 17, 19, 20, 21, 23]. The checkpointing algorithms proposed in [14, 5] have a two phase structure. This causes processes to suspend the normal computation for making checkpoint decisions which may greatly increase the overhead during normal computation. The QSA does not cause any such overhead and avoids domino effect completely during recovery. Some of the recovery ....
B. Bhargava and P. Leu. "Concurrent Robust Checkpointing and Recovery in Distributed Systems". In Proc. of 4th IEEE Int. Conf. Data Eng., pages 154--163, February 1988.
....message was sent, recovery of a consistent system state may be impossible, since the outside world cannot in general be rolled back. Once the system can meet this guarantee, the message may be committed by releasing it to the outside world. Rollback recovery methods using consistent checkpointing [6, 8, 9, 11, 17, 18, 19, 21, 24, 27, 31, 34, 35] record the states of each process separately on stable storage as a process checkpoint, and coordinate the checkpointing of the processes such that a consistent system state is always recorded by the collection of process checkpoints. The checkpoint of a process records the address space of the ....
....process checkpoint for each consistent system state that may include a given state of a process. The protocol of Israel and Morris [11] does not inhibit sending or receiving messages during the execution of the algorithm, but does require FIFO channels. Leu and Bhargava have proposed a protocol [19] that does not require FIFO channels. The basic version of their algorithm inhibits processes sending messages during the algorithm, as in Koo and Toueg's algorithm, but an extension to the algorithm avoids any inhibition. However, their algorithm is more complicated than the commit algorithm ....
P.-J. Leu and B. Bhargava. Concurrent robust checkpointing and recovery in distributed systems. In Proceedings of the Fourth International Conference on Data Engineering, pages 154--163, February 1988.
.... programmer explicitly invokes the checkpointing routine and specifies consistent recovery lines [10]. Message logging: processes save their state independently and inter-process messages are logged [7, 12, 15, 21, 26, 27]. Coordinated checkpointing: all [4, 30] or an interacting set [5, 16, 17, 28] of processes save their states in a coordinated way. Hybrid techniques: a combination of the two previous techniques that merges their advantages [18, 24, 31, 32]. The latter three systems are user transparent. Overhead is associated with each class of techniques, both during failure free ....
....were taken, will be repeated. Kernel overhead: a part of the CPU time is used for the management of checkpoint and rollback related topics. Synchronisation overhead: other processes should sometimes suspend their operation until all processes finished their checkpointing or rollback [4, 5, 16, 17, 28, 30]. 2.1.2. Storage Overhead. Local memory and disc space are used to store the checkpoints: at least one complete (permanent) recovery line should be stored [16] together with the tentative checkpoint(s). If local memory is used for this, its usable part is significantly reduced [4]. Local ....
[Article contains additional citation context not shown here]
P.Y. Leu, B. Bhargava, "Concurrent Robust Checkpointing and Recovery in Distributed Systems", Proc. 4th IEEE Int. Conf. on Data Engineering, 1988, pp. 154--163.
.... to roll back to their latest checkpoint on stable storage in order to remain consistent with recovering processes [15]. Much of the previous work in consistent checkpointing has focused on minimizing the number of processes that must participate in taking a consistent checkpoint or in rolling back [1, 11, 15, 17]. Another issue that has received considerable attention is how to reduce the number of messages required to synchronize the consistent checkpoint [2, 5, 8, 16, 19, 24, 28, 29]. In this paper, we focus instead on the overhead of consistent checkpointing on the failure free running time of ....
....our environment. 5 Related Work Previous work in checkpointing has concentrated on issues such as reducing the number of messages required to synchronize a checkpoint [2, 5, 8, 16, 19, 24, 28, 29], limiting the number of hosts that have to participate in taking the checkpoint or in rolling back [1, 11, 15, 17], or using message logging to eliminate the need for synchronizing the checkpoints and to accelerate input-output interactions with the outside world [4, 13, 25]. There are very few empirical studies of consistent checkpointing and its performance. Bhargava et al. [3] reported on the performance ....
P. Leu and B. Bhargava. Concurrent robust checkpointing and recovery in distributed systems. In Proceedings of the International Conference on Data Engineering, February 1988.
....a consistent state of the whole system [8]. After a failure, failed processes, as well as surviving processes, are rolled back to their last checkpoint. Consistent checkpointing techniques can furthermore be divided into two sub-groups: blocking and non-blocking techniques. In blocking techniques [9, 10, 11], processes synchronize together when saving a checkpoint and are halted during the whole checkpointing protocol. In non-blocking techniques [12, 13, 14], each process takes a temporary checkpoint and resumes its execution. Later on, temporary checkpoints are made definitive when it is known that ....
P. Leu and B. Bhargava. Concurrent robust checkpointing and recovery in distributed systems. In Proc. of 4th International Conference on Data Engineering, pages 154--163, Los Angeles (CA), February 1988.
....a consistent snapshot to a minimum. This can be achieved by forcing a minimal subset of nodes to take their local snapshots, and by employing data structures that impose low memory overheads. Consistent snapshot collection algorithms for static distributed systems have been proposed in [5, 7, 8, 13, 14, 16, 17, 18]. The snapshot collection algorithm by Chandy and Lamport [7] forces every node to take its local snapshot. The underlying computation is allowed to proceed while the global snapshot is being collected. Snapshot collection algorithms in [8, 14, 17, 18] also force every node to take its snapshot. ....
P.-J. Leu and B. Bhargava. Concurrent Robust Checkpointing and Recovery in Distributed Systems. In Proceedings of the 4th International Conference on Data Engineering, pages 154--163, February 1988.
....of Q checkpoints, where Q is the maximum ratio of the checkpoint intervals of all the processes in the worst case. This bound on the rollback distance helps the process to decide garbage checkpoints asynchronously. 7. Comparison With Existing Work The checkpointing algorithms proposed in [10, 2] have a two phase structure. This causes processes to suspend the normal computation for making checkpoint decisions which greatly increases the overhead during normal computation. The QSA does not cause any such overhead and avoids domino effect completely during recovery. In Acharya et al.'s [1] ....
B. Bhargava and P. Leu. "Concurrent Robust Checkpointing and Recovery in Distributed Systems". In Proc. of 4th IEEE Int. Conf. Data Eng., pages 154--163, February 1988.
....about the underlying kernel and hence can be integrated straightforwardly into many existing systems. However, this is generally done either at the expense of performance, as a checkpoint operation induces processor synchronization during normal operation [Tamir Sequin 84], [Koo Toueg 86], [Leu Bhargava 88], or at the expense of memory, as more than one checkpoint has to be kept at the same time [Cristian Jahanian 91], [Li et al. 91], [Silva Silva 92]. To minimize synchronization, several studies have focused on how to reduce the number of dependent processors involved in a checkpoint and the ....
P. Leu & B. Bhargava. Concurrent robust checkpointing and recovery in distributed systems. In Proc. of 4th International Conference on Data Engineering, pages 154--163, Los Angeles (CA), February 1988.
....in order to force the creation of a recent recovery line. This improves garbage collection (of old recovery points) but introduces a coordination delay. [Footnote 1: Global snoopy bus for communication.] Numerous systems use coordinated checkpointing. Examples of localized coordination can be found in [24, 19, 28, 4, 48]. Globally coordinated checkpoints have been presented in [27, 34, 11, 29, 25, 41, 46, 47, 45]. The variations among these systems are in the optimizations applied, in how each attempts to reduce the amount of traffic required for coordination. Of special note are the algorithms presented in [37, ....
Leu, P.-J. and Bhargava, B. Concurrent Robust Checkpointing and Recovery in Distributed Systems. In: Fourth International Conference on Data Engineering. IEEE Computer Society, 1988, pp. 154--163.
No context found.
P.-J. Leu and B. Bhargava. "Concurrent robust checkpointing and recovery in distributed systems." In Proceedings of the International Conference on Data Engineering, pp. 154--163, Feb. 1988.
No context found.
P. Leu and B. Bhargava, "Concurrent Robust Checkpointing and Recovery in Distributed Systems," Proceedings of the Fourth International Conference on Data Engineering, pp. 154--163, 1988.
No context found.
P. Leu and B. Bhargava. Concurrent robust checkpointing and recovery in distributed systems. In Proceedings of the International Conference on Data Engineering, February 1988.