J. S. Plank. An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance. Technical Report UT-CS-97-372, Department of Computer Science, University of Tennessee, July 1997.
....migratable applications provide added functionality and flexibility to the scheduling and resource management systems for distributed computing. In order to achieve starting and stopping of the parallel applications, the state of the applications has to be checkpointed. Elnozahy [16] and Plank [29] have surveyed several checkpointing strategies for sequential and parallel applications. Checkpointing systems for sequential [30, 3] and parallel applications [15, 10, 4, 34, 20] have been built. Checkpointing systems are of different types depending on the transparency to the user and the ....
....on which the checkpoints are stored also fail when the machines on which the IBP depots are located fail. 5. The machine on which the RSS daemon is executing must be failure free for the duration of the application. 7 Related Work Checkpointing of parallel applications has been widely studied in [16, 29, 25], and checkpointing systems for parallel applications have been developed [12, 10, 33, 38, 31, 15, 20, 34, 3, 23, 20, 4, 22, 21, 27]. Some of the systems were developed for homogeneous systems [12, 11, 33, 34], while some checkpointing systems allow applications to be checkpointed and restarted on ....
James S. Plank. An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance. Technical Report UT-CS-97-372, 1997.
....has been other work on checkpointing in the context of migrating applications [10], using extra processors for fault tolerance [12], post-mortem and replay debugging, elimination of boundary condition errors [13], etc. Our work is very similar to user-level transparent checkpointing techniques [11]. Such techniques usually work by compiling the application program with a special checkpointing library. Our technique, on the other hand, relies on program analysis and therefore can optimize the result for size and speed. Our system also shares many of the same restrictions as user level ....
J. S. Plank. An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance. Tech Report UT-CS-97-372, 1997.
[Figure 7: Reliability as a Function of Internal Message Sending Rate. Axes: Reliability vs. Internal Message Sending Rate; series: MDCD, baseline.]
....protocol. We plan also to investigate the size reducing technique called incremental checkpointing [16], which can be applied to augment our useless checkpoint avoidance strategy for further performance overhead reduction. ....
J. S. Plank, "An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance," Technical Report UT-CS-97-372, Department of Computer Science, University of Tennessee, Knoxville, TN, July 1997.
....program construction. Furthermore, we would rather that the system perform the encoding and decoding of the state automatically, as opposed to requiring the programmer to do so. A number of general purpose approaches to enable state transfer have been developed. For example, checkpointing [Pla97] and general purpose persistence [PJW96] are means to generally and automatically capture a program's state for later restart, e.g. to support process migration [Smi88]. However, these approaches have a number of problems: 1. Like application specific state transfer, OS-level data structures ....
....are checked for accuracy with programmer-provided acceptance tests, while messages that would have been sent by the old version are logged. If an erroneous message is detected, the old version is restored. So that this switchover is semantically consistent, GSU employs checkpointing technology [Pla97] to checkpoint the state of the old version when it is known to be consistent, so that the system can roll back to that state on a failure. Checkpointing is also used to enable the new version to begin with the state of the old version. Unfortunately, enabling GSU error detection and recovery ....
[Article contains additional citation context not shown here]
James S. Plank. An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance. Technical Report UT-CS-97-372, 1997.
....performance, or other reasons. Strong mobility requires that the entire state of the running agent, including its execution stack, be saved prior to a move so that it can be restored once the agent has moved to its new location. The standard term describing this process is checkpointing [23]. Over the last few years, the more general concept of orthogonal persistence has also been developed by the research community [2]. The goal of orthogonal persistence research is to define language-independent principles and language-specific mechanisms by which persistence can be made available ....
Plank, J. S. (1997). An overview of checkpointing in uniprocessor and distributed systems focusing on implementation and performance. Technical report UT-CS-97-372. Department of Computer Science, University of Tennessee, July.
....is often adjusted to hedge against failures, and balance this loss against the periodic checkpoint overhead. CUMULVS's user-level approach to checkpointing is considered non-transparent because it requires the programmer to modify the application source to coordinate the checkpointing activity [11]. In this case, the application defines the point(s) in its computation where checkpoints can be consistently collected, across concurrent sets of tasks, and which data should be included in checkpoints. Consistency here relates to identifying a global state for all tasks such that the ....
J. S. Plank, "An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance," Technical Report UT-CS-97-372, Department of Computer Science, University of Tennessee, Knoxville, TN, July 1997.
....performance, or other reasons. Strong mobility requires that the entire state of the running agent, including its execution stack, be saved prior to a move so that it can be restored once the agent has moved to its new location. The standard term describing this process is checkpointing [22]. Over the last few years, the more general concept of orthogonal persistence has also been developed by the research community [2]. The goal of orthogonal persistence research is to define language-independent principles and language-specific mechanisms by which persistence can be made available ....
Plank, J. S. (1997). An overview of checkpointing in uniprocessor and distributed systems focusing on implementation and performance. Technical report UT-CS-97-372. Department of Computer Science, University of Tennessee, July.
....node to use computation performed by the failed node, the failed node periodically writes a checkpoint record containing its computation state. The checkpoint record contains sufficient information to restart computation from the point of writing the checkpoint record. A number of recent studies [24, 23] have addressed the issue of checkpointing a process and restarting it at a later time; the state of computation can be recovered by checkpointing the contents of the stack, the CPU context, and the values of private and heap variables. In this paper, we focus on checkpointing and restarting the ....
....the cost of recovering from a failure. A survey of recoverable distributed shared memory systems is provided in [18]. To support high availability, the run-time system has to save the state of the machine periodically. Mechanisms for checkpointing are discussed in [24] for a uniprocessor and in [23] for a cluster of machines. In coordinated checkpointing [28, 15], the system checkpoints the state of all threads simultaneously by stopping computation periodically. In order to avoid expensive checkpointing operations, some designs [21, 22, 8] checkpoint infrequently while logging changes to ....
James S. Plank. An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance. Technical report, University of Tennessee, July 1997.
....status of open files [4]. After a failure, the state of memory and registers can be reconstructed and the target program can be restarted from the most recent checkpoint. In a uniprocessor environment, a checkpoint and recovery facility can be provided either at the kernel level or at the user level [1, 3, 9]. A user-level checkpoint and recovery library can save the state of a process by linking the user program with a checkpointing library. This facility can provide flexibility to programmers, but there are also many restrictions. First, a user-level checkpoint library should take a checkpoint through a ....
....on the stable storage. The overhead of sequential checkpointing is the same as the time it takes to write the checkpoint file to disk [8]. Kckpt provides forked checkpointing to reduce the checkpoint overhead. Since the most time-consuming portions of checkpointing do not require the use of the CPU [9], a child process can take responsibility for writing the checkpoint to disk in the forked checkpointing scheme. Thus, the parent process can continue to execute the application program concurrently while the child process is writing the checkpoint to disk. In this approach, when a process takes a ....
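The forked checkpointing scheme described in the excerpt above can be illustrated with a minimal Python sketch. It is not the Kckpt implementation: the function names `forked_checkpoint` and `restore`, the use of `pickle` for the state, and the temporary-file rename are all assumptions made for illustration, and the sketch requires a POSIX system that provides `os.fork`.

```python
import os
import pickle


def forked_checkpoint(state, path):
    """Fork a child that writes `state` to `path`; return the child's pid.

    Hypothetical helper: the child sees a copy-on-write snapshot of the
    parent's memory as of the fork, so the parent may keep mutating `state`.
    """
    pid = os.fork()
    if pid == 0:
        # Child process: write the snapshot, then exit without returning
        # into the caller's code.
        status = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
                f.flush()
                os.fsync(f.fileno())
            # Rename last so a reader never observes a partial checkpoint.
            os.replace(tmp, path)
            status = 0
        finally:
            os._exit(status)
    # Parent process: continue computing while the child writes to disk.
    return pid


def restore(path):
    """Load the most recently completed checkpoint from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

A caller would typically invoke `forked_checkpoint(state, path)`, carry on with the computation, and later reap the child with `os.waitpid(pid, 0)` before relying on the checkpoint file.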
James S. Plank, "An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance", Technical Report of University of Tennessee, UT-CS-97-372, Jul. 1997.
No context found.
J. S. Plank. An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance. Technical Report UT-CS-97-372, Department of Computer Science, University of Tennessee, July 1997.
No context found.
J. S. Plank. An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance. Technical Report UT-CS-97-372, Department of Computer Science, University of Tennessee, July 1997.
No context found.
James S. Plank. An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance. Technical Report UT-CS-97-372, University of Tennessee, Department of Computer Science, 1997.
No context found.
J. S. Plank. An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance. Technical Report UT-CS-97-372, Department of Computer Science, University of Tennessee, July 1997.
No context found.
J. S. Plank. An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance. Technical Report UT-CS-97-372, Department of Computer Science, University of Tennessee, July 1997.
No context found.
J. S. Plank. An overview of checkpointing in uniprocessor and distributed systems, focusing on implementation and performance. Technical Report UT-CS-97-372, 1997.
No context found.
James S. Plank. An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance. Technical Report UT-CS-97-372, Department of Computer Science, University of Tennessee, July 1997.
No context found.
J. S. Plank, "An Overview of Checkpointing in Uniprocessor and Distributed Systems Focusing on Implementation and Performance," Tech. Report UT-CS-97-372, Department of Computer Science, University of Tennessee, Knoxville, Tenn., 1997.
No context found.
J.S. Plank, An Overview of Checkpointing in Uniprocessor and Distributed Systems Focusing on Implementation and Performance, Tech. Report UT-CS-97-372, Dept. of Computer Science, Univ. of Tennessee, Knoxville, Tenn., 1997.