L.M. Silva and J.G. Silva. "Global Checkpointing for Distributed Programs". Proc. of the 11th Symp. on Reliable Distributed Systems, pages 155--162, Oct. 1992.
....CIC protocols are not new. The paper by Briatico et al. was perhaps the first to describe this style of checkpointing back in 1984 [2]. Other papers have described variations over their protocol [9,13,26] or used some protocol features to simplify the implementation of coordinated checkpointing [6,25]. Recently, CIC protocols have attracted an increasing interest in the research community, with new sophisticated protocols based on the Z-path theory [16]. But to our knowledge, there are no published implementation or evaluation reports of CIC. CIC protocols piggyback special information on ....
L.M. Silva and J.G. Silva. Global checkpointing for distributed programs. In Proceedings of IEEE Symposium on Reliable Distributed Systems, pp. 155--162, Oct. 1992.
....connected by a 155 Mbit/sec ATM. The results indicate that the protocol introduces very small overheads. 2. Related Work Several authors have proposed coordinated checkpoint protocols in the past. Most of these protocols exchange extra messages to coordinate the creation of new checkpoints [25, 10, 11, 18, 9, 2, 5]. More recently, time-based protocols were introduced that rely on approximately synchronized clocks or timers to avoid message coordination [26, 4, 12]. Time-based protocols save checkpoints periodically, whenever local timers expire. Tong et al. [26] proposed the first time-based protocol. This ....
L. M. Silva and J. G. Silva. Global checkpointing for distributed programs. In Proceedings of the 11th Symposium on Reliable Distributed Systems, pages 155--162, October 1992.
....schemes, domino-free recovery is achieved by sacrificing process autonomy and incurring extra message overhead during checkpointing. In this approach, processes synchronize their checkpointing activities so that a globally consistent set of checkpoints is always maintained in the system [5, 10, 12, 13]. The storage requirement for the checkpoints is minimum because each process keeps only one checkpoint in the stable storage at any given time. Process execution may have to be suspended during the checkpointing coordination as in [8, 10], resulting in performance degradation. 1 A globally ....
Luis Moura e Silva and João Gabriel Silva. "Global Checkpointing for Distributed Programs". In Proc. Symp. Reliable Distributed Systems, pages 155--162, 1992.
.... [Figure 16: table comparing checkpointing approaches (rows include environment-based and system-based); cell layout not recoverable.] There has been much research in designing checkpointing algorithms [2, 3, 13-18]. However, none of these algorithms satisfy all our requirements: algorithms [13, 15, 16] do not implement reliable communication channels. Algorithms presented in [2, 3, 15] are blocking. Algorithm [17] relies on the assumption that all the processor clocks are approximately synchronized, which ....
....[13, 15, 16] do not implement reliable communication channels. Algorithms presented in [2, 3, 15] are blocking. Algorithm [17] relies on the assumption that all the processor clocks are approximately synchronized, which limits the generality of these algorithms. Several non-blocking algorithms [14, 18] require less than O(n^2) communication messages. The Kai Li algorithm [14] performs well on multicomputers. It requires O(n log n) messages for hypercube-connected multicomputers and O(n) for mesh-connected multicomputers. However, this algorithm depends on the knowledge of the process ....
L. M. Silva and J. G. Silva. Global checkpointing for distributed programs. In Proc. of the 11th Symposium on Reliable Distributed Systems, pages 155-162, 1992.
....the checkpointing process. Checkpointing includes the time to trace the dependency tree and to save the states of processes on the stable storage, which may be long. Therefore, blocking algorithms may dramatically degrade the performance of the system [2, 6]. Recently, nonblocking algorithms [6, 19] have received considerable attention. In these algorithms, processes need not block during checkpointing by using a checkpointing sequence number to identify orphan messages. However, these algorithms [6, 19] assume that a distinguished initiator decides when to take a checkpoint. Therefore, ....
....degrade the performance of the system [2, 6]. Recently, nonblocking algorithms [6, 19] have received considerable attention. In these algorithms, processes need not block during checkpointing by using a checkpointing sequence number to identify orphan messages. However, these algorithms [6, 19] assume that a distinguished initiator decides when to take a checkpoint. Therefore, they suffer from the disadvantages of centralized algorithms, such as one-site failure, traffic bottleneck, etc. Moreover, these algorithms [6, 19] require all processes in the computation to take checkpoints ....
L.M. Silva and J.G. Silva. "Global Checkpointing for Distributed Programs". Proceedings of the 11th Symposium on Reliable Distributed Systems, pages 155--162, October 1992.
....schemes, domino-free recovery is achieved by sacrificing process autonomy and incurring extra message overhead during checkpointing. In this approach, processes synchronize their checkpointing activities so that a globally consistent set of checkpoints is always maintained in the system [6, 12, 15]. The storage requirement for the checkpoints is minimum because each process keeps only one checkpoint in the stable storage at any given time. Process execution may have to be suspended during the checkpointing coordination as in [10, 12], resulting in performance degradation. Paper Objectives ....
....the propagation of snapshot requests in the system. In our algorithm, several processes can simultaneously initiate and collect global snapshot successfully without restricting any other process from collecting global snapshot. The synchronous checkpointing algorithm of Silva and Silva [6] requires a fixed process to take a checkpoint and send request messages to all the other processes for taking a checkpoint consistent with its latest checkpoint; it also requires the sequence number of the latest checkpoint taken to be piggybacked with each computation message. This requires the ....
Luis Moura e Silva and João Gabriel Silva. "Global Checkpointing for Distributed Programs". In Proc. Symp. Reliable Distributed Systems, pages 155--162, 1992.
....where n is the number of processors in the system. This makes them unnecessarily slow as the number of participating nodes grows. In Wojcik's algorithm [13], each process has to log each message sent to other processes, which makes this algorithm inefficient. Several non-blocking algorithms [6, 16] require less than O(n^2) communication messages. The Kai Li algorithm [6] performs well on multicomputers. It requires O(n log n) messages for hypercube-connected multicomputers and O(n) for mesh-connected multicomputers. However, this algorithm depends on the knowledge of the process ....
....knowledge of the process interconnection topology, which is unusable in systems where the pattern by which processors are connected varies, as in systems constructed by interconnecting PCs or workstations. Furthermore, this algorithm requires communication channels to be FIFO. The Silva algorithm [16] requires only O(n) communication messages. However, it relies on the knowledge of fault detection latency, and message latency, which might be difficult to determine in case of the Internet-based distributed system. This paper describes a simple, scalable, non-blocking algorithm that requires O(n ....
L. M. Silva and J. G. Silva. Global checkpointing for distributed programs. In Proc. of the 11th Symposium on Reliable Distributed Systems, pages 155-162, 1992.
....and message replay. So, this type of approach focusses on reducing communication overhead during the checkpointing and message logging phases, and puts most work into the recovery phase. It is assumed in these systems that failures are infrequent. 2. Consistent checkpointing: This type of system [7, 10, 12, 17, 19, 22] attempts to construct a consistent distributed system state in a checkpointing phase. Checkpointing of processes is synchronised in such a way that the resulting set of checkpoints forms a consistent distributed system state; consequently, this makes rollback recovery less expensive. Compared to ....
....at which time the previous checkpoints can be discarded, resulting in more efficient use of stable storage. Systems using a one-phase commit protocol must always keep the two most recent checkpoints for each process. 3 System model 3.1 Assumption Our work is partially motivated by the systems [17, 18, 19], and focusses on the above issues which were not addressed in the previous systems. We make the following assumptions about the distributed environment on which our model is built: 1. nodes fail by stopping. The failed processes can be relocated to some other working node, and the process states ....
L.M. Silva and J.G. Silva. Global checkpointing for distributed programs. In Proceedings of the 11th Symposium on Reliable Distributed Systems, pages 155--162, October 1992.
....Consistent checkpointing techniques can further be divided into two sub-groups: blocking and non-blocking techniques. In blocking techniques [9, 10, 11], processes synchronize together when saving a checkpoint and are halted during the whole checkpointing protocol. In non-blocking techniques [12, 13, 14], each process takes a temporary checkpoint and resumes its execution. Later on, temporary checkpoints are made definitive when it is known that all processes have saved their temporary checkpoint, and that no message is in transit. It ....
L.M. Silva and J.G. Silva. Global checkpointing for distributed programs. In Proc. of the 11th Symposium on Reliable Distributed Systems, pages 155--162, Houston (TX), October 1992.
....systems are notoriously difficult to design. An overview of the problems of fault tolerance can be found in [6]. A particularly difficult task is the saving of distributed system state. This problem was described in [21], initially solved in [4], and later addressed by a number of authors ([2, 3, 8, 13, 15, 20, 24, 29], to mention only a few). This multiplicity of solutions stems partly from assumptions of different failure semantics [6] and architectures. For example, in [7] it is assumed that processor clocks are synchronized within ε units of each other and that interprocess communication delays are ....
L.M. Silva and J.G. Silva. Global checkpointing for distributed programs. In 11th Symposium on Reliable Distributed Systems, IEEE, pages 155--162, 1992.
....schemes, domino-free recovery is achieved by sacrificing process autonomy and incurring extra message overhead during checkpointing. In this approach, processes synchronize their checkpointing activities so that a globally consistent set of checkpoints is always maintained in the system [7, 14, 15]. The storage requirement for the checkpoints is minimum because each process keeps only one checkpoint in the stable storage at any given time. Synchronous checkpointing schemes involve high message overhead and any checkpointing method that involves high message overhead is not suitable for ....
Luis Moura e Silva and João Gabriel Silva. "Global Checkpointing for Distributed Programs". In Proc. Symp. Reliable Distributed Systems, pages 155--162, 1992.
....the domino effect [5]. First, in the methods based on (optimistic) message logging [6, 7, 8, 9], every application process saves its status independently, and the method keeps track of the inter-process messages to guarantee consistency. In the second approach, called coordinated checkpointing [5, 10, 11, 12, 13], a consistent view of the application is saved by coordinating the checkpointing between the different processes. [14] shows that the latter approach is better suited for number-crunching applications, as the overhead mainly occurs during the checkpointing itself, and not between consecutive ....
L.M. Silva, J.G. Silva, "Global Checkpointing for Distributed Programs", Proc. 11th Symp. Reliable Distributed Systems, Houston, TX, Oct. 1992, pp. 155-162
....for supporting fault-tolerant objects in distributed systems. Section 5 concludes the paper and discusses the future work. 2 Related Work In the literature, the subject of fault tolerance for distributed system is primarily addressed for process-based systems with asynchronous message passing [5, 7, 10, 11, 12, 13]. A few systems also consider synchronous message passing like remote procedure calls [6, 14]. Although there exists a duality between object-oriented systems and process (and messages) based systems [15, 16], not much work has been done in exploiting the structure and properties of the object ....
L. M. Silva and J. G. Silva, "Global checkpointing for distributed programs," In Symp. Reliab. Distr. Systems, pp. 155--162, October 1992.
....for distributed systems. These schemes are generally classified into two categories: synchronous and asynchronous. In synchronous checkpointing schemes, processes synchronize their checkpointing activities so that a globally consistent set of checkpoints is always maintained in the system [3, 5, 9]. The storage requirement for the checkpoints is minimum because each process keeps only one checkpoint in the stable storage at any given time. Major disadvantages of synchronous checkpointing are, i) process execution may have to be suspended during the checkpointing coordination as in [5] ....
Luis Moura e Silva and João Gabriel Silva. "Global Checkpointing for Distributed Programs". In Proc. Symp. Reliable Distributed Systems, pages 155--162, 1992.
....and inter-process messages are logged [7, 12, 15, 21, 26, 27]. Coordinated checkpointing: all [4, 30] or an interacting set [5, 16, 17, 28] of processes save their states in a coordinated way. Hybrid techniques: a combination of the two previous techniques that merges their advantages [18, 24, 31, 32]. The latter three systems are user transparent. Overhead is associated with each class of techniques, both during failure-free operation and after a fault is detected. This overhead, together with other characteristic features of a particular technique, determine what scheme is best suited for an ....
.... case an entire system state is checkpointed or restored at once [4, 30]. Load on the data network: control messages are sent and data messages may be enlarged to include information for the schemes (send/receive sequence numbers [15, 26, 27], incarnation numbers [12, 27, 32], or crash counters [24], etc.) or to obtain special communication protocols (e.g. two-phase commit protocols). Proc. of IASTED Int. Conf. on Modelling and Simulation, Pittsburgh, PA, May 10-12, 1993, pp. 262-265. Partly supported by ESPRIT project 6731 (FTMPS) and by IUAP 50. Senior Research Assistant of the ....
L.M. Silva, J.G. Silva, "Global Checkpointing for Distributed Programs", Proc. of 11th Symp. on Reliable Distributed Systems, Houston, Texas, Oct. 1992, pp. 155-162
....is necessary to guarantee that the application checkpoint is consistent and recoverable (section 3.2 explains these concepts). These protocols usually select one of the processes, the coordinator, to initiate the creation of the checkpoints and to ensure that each process saves its state [6, 14, 17, 19]. This task is accomplished with the exchange of a set of messages. The protocol adds information to each message to detect in-transit messages. Whenever an in-transit message arrives, the protocol saves it in stable storage, together with the state of the processes. Both types of protocols have ....
L. M. Silva and J. G. Silva. Global checkpointing for distributed programs. In Proceedings of the 11th Symposium on Reliable Distributed Systems, pages 155--162, October 1992.
....in a homogeneous environment, using system-specific techniques to efficiently capture consistent memory images from each process. Among the recent ones are Li, Naughton, and Plank's [22, 23, 26], which is designed to minimize the checkpointing overhead on multicomputers; Silva and Silva's [28], which takes into account the latency between failure occurrence and detection; and Leon, Fisher, and Steenkiste's [21], which is designed specifically for programs written in PVM. In most work on checkpointing for distributed systems, the primary focus is on attempting to minimize the cost of ....
Luis Silva and João Silva. Global checkpointing for distributed programs. In Proceedings of the 11th Symposium on Reliable Distributed Systems, pages 155--162. IEEE Computer Society Press, 1992.
....Consistent checkpointing techniques can furthermore be divided into two sub-groups: blocking and non-blocking techniques. In blocking techniques [9, 10, 11], processes synchronize together when saving a checkpoint and are halted during the whole checkpointing protocol. In non-blocking techniques [12, 13, 14], each process takes a temporary checkpoint and resumes its execution. Later on, temporary checkpoints are made definitive when it is known that all processes have saved their temporary checkpoint, and that no more message is in transit. It should be noticed that non-blocking checkpointing is more ....
L.M. Silva and J.G. Silva. Global checkpointing for distributed programs. In Proc. of the 11th Symposium on Reliable Distributed Systems, pages 155--162, Houston (TX), October 1992.
....for distributed systems. These schemes are generally classified into two categories: synchronous and asynchronous. In synchronous checkpointing schemes, processes synchronize their checkpointing activities so that a globally consistent set of checkpoints is always maintained in the system [6, 8, 12]. The storage requirement for the checkpoints is minimum because each process needs to keep at most two checkpoints (one committed and one possibly not committed) in stable storage at any given time. Major disadvantages of synchronous checkpointing are, i) process execution may have to be ....
Luis Moura e Silva and João Gabriel Silva. "Global Checkpointing for Distributed Programs". In Proceedings of Symposium on Reliable Distributed Systems, pages 155--162, 1992.
....we will describe a system-level checkpointing algorithm and a user-defined checkpointing scheme. Both schemes have been implemented in the same parallel system and we have conducted an experimental study to take some conclusions about the previous metrics. 3. System-level Checkpointing In [15] we have presented an algorithm to implement a coordinated global checkpoint for distributed applications. That algorithm has some similarities with another one presented in [11]. For lack of space we do not present details about the algorithm. The interested reader is referred to any of those two ....
L.M. Silva, J.G. Silva. "Global Checkpointing for Distributed Programs", Proc. 11th Symposium on Reliable Distributed Systems, Houston USA, pp. 155-162, October 1992
No context found.
L.M. Silva and J.G. Silva. "Global Checkpointing for Distributed Programs". Proc. of the 11th Symp. on Reliable Distributed Systems, pages 155--162, Oct. 1992.
No context found.
L.M. Silva and J.G. Silva. "Global Checkpointing for Distributed Programs". Proc. of the 11th Symp. on Reliable Distributed Systems, pages 155--162, Oct. 1992.
No context found.
L.M. Silva and J.G. Silva. "Global checkpointing for distributed programs." In Proceedings of IEEE Symposium on Reliable Distributed Systems, pp. 155--162, Oct. 1992.
No context found.
L.M. Silva & J.G. Silva. Global checkpointing for distributed programs. In Proc. of 11th Symposium on Reliable Distributed Systems, pages 155--162, Houston (TX), October 1992.
No context found.
L. M. Silva and J. G. Silva, "Global Checkpointing for Distributed Programs," Proceedings of the 11th Symposium on Reliable Distributed Systems, pp. 155--162, October 1992.