P.E. Chung, Y. Huang, et al. "Checkpointing in CosMiC: a User-level Process Migration Environment", Pacific Rim International Symposium on Fault-Tolerant Systems, Dec. 1997, pp. 187-193.
....Libckp [1] makes a shadow copy when the portion of the file that existed at the previous checkpoint is about to be modified or when the file is about to be deleted. During rollbacks, the shadow copy can be used to restore the file to its correct content. 2) In-place update with undo logs. libfcp [5], winckp [6] and the SCR algorithm [7] use this approach. It intercepts all file operations and generates undo logs that restore the pre-modification data. When a rollback occurs, these undo logs are applied in reverse order to restore the original files. In this approach, a normal write operation ....
P.E. Chung, Y. Huang, et al. "Checkpointing in CosMiC: a User-level Process Migration Environment", Pacific Rim International Symposium on Fault-Tolerant Systems, Dec. 1997, pp. 187-193.
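The shadow-copy approach attributed to Libckp in the context above can be illustrated with a minimal Python sketch. This is not the libckp interface: the class `ShadowCopyFile` and its methods are hypothetical names, the file is assumed to exist at every checkpoint, and a whole-file copy is used for simplicity. The copy is made lazily, only when bytes that existed at checkpoint time are about to be overwritten or the file is about to be deleted. Pure appends are undone by truncating to the recorded size.

```python
import os
import shutil
import tempfile


class ShadowCopyFile:
    """Hypothetical wrapper for illustration; not the libckp interface."""

    def __init__(self, path):
        self.path = path
        self.shadow = None      # path of the shadow copy, once one is made
        self.ckpt_size = 0      # file size recorded at the last checkpoint
        self.checkpoint()

    def checkpoint(self):
        # A new checkpoint makes any older shadow copy obsolete.
        # Assumption: the file exists whenever a checkpoint is taken.
        if self.shadow is not None:
            os.remove(self.shadow)
            self.shadow = None
        self.ckpt_size = os.path.getsize(self.path)

    def _ensure_shadow(self):
        # Copy the file once per checkpoint interval, on first need.
        if self.shadow is None:
            fd, self.shadow = tempfile.mkstemp(suffix=".shadow")
            os.close(fd)
            shutil.copyfile(self.path, self.shadow)

    def write_at(self, offset, data):
        # Only writes that touch bytes present at checkpoint time need a shadow.
        if offset < self.ckpt_size:
            self._ensure_shadow()
        mode = "r+b" if os.path.exists(self.path) else "w+b"
        with open(self.path, mode) as f:
            f.seek(offset)
            f.write(data)

    def delete(self):
        self._ensure_shadow()
        os.remove(self.path)

    def rollback(self):
        # Restore old content from the shadow copy if one was made,
        # then cut off anything appended after the checkpoint.
        if self.shadow is not None:
            shutil.copyfile(self.shadow, self.path)
        with open(self.path, "r+b") as f:
            f.truncate(self.ckpt_size)
```

In this sketch an append-only interval never pays for a copy, which is the motivation for deferring the shadow copy until old data is actually at risk.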
....1/13.0 days 1/2.02 days 2.04 MB/sec 0.120 MB/sec LOW 1/70 min 1/75 min 1.00 MB/sec 0.200 MB/sec Table 3: Failure, repair and checkpointing data for the three processing environments. Finally, LOW is based on an idle workstation environment such as the ones supported by Condor [19] and CosMiC [8], where workstations are available for computations only when they are not in use by their owners. Failure and repair data was obtained by the authors of [8] and the checkpointing performance data was gleaned from performance results of CosMiC's transparent checkpointer libckp [35]. It is assumed ....
....three processing environments. Finally, LOW is based on an idle workstation environment such as the ones supported by Condor [19] and CosMiC [8], where workstations are available for computations only when they are not in use by their owners. Failure and repair data was obtained by the authors of [8], and the checkpointing performance data was gleaned from performance results of CosMiC's transparent checkpointer libckp [35]. It is assumed that the copy-on-write optimization yields an 80 percent improvement in checkpoint overhead [24]. The failure rate of LOW is extremely high, which is ....
P. E. Chung, Y. Huang, S. Yajnik, G. Fowler, K. P. Vo, and Y. M. Wang. Checkpointing in CosMiC: a user-level process migration environment. In Pacific Rim International Symposium on Fault-Tolerant Systems, December 1997.
....of selected resource management systems, with an emphasis on resource allocation for adaptive jobs, is given in Section 8. Section 9 concludes the paper. 2 Background On networks of machines that support both parallel jobs and interactive users, machine loads change over time. A number of studies [30, 34, 11, 28, 2, 9] indicate that in most institutions up to 60% of machines are idle at any given time. A machine is referred to as idle and, hence, available to participate in a computation when it is not used by its owner and its CPU cycles are mostly unused. However, the availability of machines is unpredictable ....
....requires is rsh access to remote hosts. To manage the transient availability of machines, systems such as Remote Unix, Sprite, and MOSIX utilize checkpointing and process migration to move processes once machines become unavailable. Coshell provides user-level process migration through CosMiC [9]. In contrast to ResourceBroker, the focus of these systems is to support sequential computations, and they do not make any special provision for parallel programs. 8.2 Static Allocation for Parallel Jobs A number of research and commercial products such as Condor [26], Utopia [36] (now LSF [32]) ....
E. Chung, Y. Huang, and S. Yajnik. Checkpointing in CosMiC: A user-level process migration environment. In Proceedings of Pacific Rim Symposium on Fault-Tolerant Computing, 1997.
....Figure 3. (a) Running time (RT_a) of the applications as a function of the number of processors. (b) Checkpoint size (CS_a) as a function of the number of processors. Finally, LOW is based on an idle workstation environment such as the one supported by CosMiC [6], where workstations are available for computations only when they are not in use by their owners. Failure and repair data was obtained by the authors of [6] and the checkpointing performance data was gleaned from performance results of CosMiC's transparent checkpointer libckp [27]. It is ....
....a) as a function of the number of processors. Finally, LOW is based on an idle workstation environment such as the one supported by CosMiC [6], where workstations are available for computations only when they are not in use by their owners. Failure and repair data was obtained by the authors of [6], and the checkpointing performance data was gleaned from performance results of CosMiC's transparent checkpointer libckp [27]. It is assumed that the copy-on-write optimization yields an 80% improvement in checkpoint overhead [18]. The failure rate of LOW is extremely high, which is typical of ....
P. E. Chung et al. Checkpointing in CosMiC: a user-level process migration environment. In Pac. Rim Int. Symp. on Fault-Tol. Systems, Dec. 1997.
....to encourage machine owners to donate the spare resources of their machines to the L-Bone. It is well known that most computers are not consistently or continuously utilized, and this has led to the successful implementations of cycle-stealing programming environments such as Condor [32] and CosMiC [14]. Our intent is for the L-Bone to be composed of dedicated storage resources, such as those allocated on the I2-DSI deployment machines, and contributed resources from individual and institutional machine owners. This mirrors the design of the academic Internet, with shared backbone links and ....
P. E. Chung, Y. Huang, S. Yajnik, G. Fowler, K. P. Vo, and Y. M. Wang. Checkpointing in CosMiC: a user-level process migration environment. In Pacific Rim International Symposium on Fault-Tolerant Systems, December 1997.
....is an important issue. Specifically, the CPU capacity of most workstations is rarely utilized by their owners. Separate studies have shown that if a computation may be structured so that it only uses idle cycles of privately owned machines, an enormous amount of computation may be performed [13,26,28]. The issue of using privately owned resources has been touched upon in NetSolve by the Condor servers, but ideally the brokering of such resources should be performed by the NetSolve agents, and then the migration co-managed by the servers and the agents. We plan to leverage off the similar ....
P. E. Chung, Y. Huang, S. Yajnik, G. Fowler, K. P. Vo, and Y. M. Wang. Checkpointing in CosMiC: a user-level process migration environment. In Pacific Rim International Symposium on Fault-Tolerant Systems, December 1997.
....Operation Buffering (MOB) to checkpoint user files. It is developed for our checkpoint and migration library libcsm, which is the underlying library of the ChaRM system [4]. Section 4 will compare the MOB approach with the only two reported implementations of file checkpointing, in libckp [1] and libfcp [5]. A summary and future work are given in Section 5. 2. Inconsistent rollbacks Existing checkpoint libraries usually select a straightforward but incomplete way in which only active information of user files is recorded at the time of checkpoint. When a rollback occurs, the active information is ....
....the file can be truncated to the correct size according to the recorded size. During RARW and RAD rollback, the shadow copy can be used to restore the file to its correct content. The file rollback functionality in libckp was optimized later and taken out as a separate file checkpoint library, libfcp [5]. It uses an in-place update with undo logs approach to checkpoint files. It intercepts all file operations except for read-only ones. When a file is opened for modifications, its size is recorded and an undo log of file truncation is generated. When the portion of the file that existed at ....
P.E. Chung, Y. Huang, et al. "Checkpointing in CosMiC: a User-level Process Migration Environment", Pacific Rim International Symposium on Fault-Tolerant Systems, Dec. 1997, pp. 187-193.
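The in-place update with undo logs described in the context above can be sketched as follows. This is only an illustration under stated assumptions and not the libfcp API: `UndoLoggedFile`, `open_for_modification`, `write_at` and `rollback` are hypothetical names, and only writes are handled (no deletes or explicit truncations). Opening for modification records the size as a truncation undo entry, each write first saves the pre-modification bytes of the region that existed at open time, and rollback applies the entries in reverse order.

```python
import os


class UndoLoggedFile:
    """Hypothetical wrapper for illustration; not the libfcp interface."""

    def __init__(self, path):
        self.path = path
        self.undo_log = []   # entries are applied last-in, first-out
        self.ckpt_size = 0

    def open_for_modification(self):
        # Record the current size and log an undo entry for truncation.
        self.ckpt_size = os.path.getsize(self.path)
        self.undo_log.append(("truncate", self.ckpt_size))

    def write_at(self, offset, data):
        # Save the old bytes, but only for the part of the write that
        # overlaps data present when the file was opened.
        end = min(offset + len(data), self.ckpt_size)
        if offset < end:
            with open(self.path, "rb") as f:
                f.seek(offset)
                old = f.read(end - offset)
            self.undo_log.append(("restore", offset, old))
        # The update itself happens in place.
        with open(self.path, "r+b") as f:
            f.seek(offset)
            f.write(data)

    def rollback(self):
        # Apply undo entries in reverse order of their creation.
        with open(self.path, "r+b") as f:
            while self.undo_log:
                entry = self.undo_log.pop()
                if entry[0] == "restore":
                    _, offset, old = entry
                    f.seek(offset)
                    f.write(old)
                else:
                    f.truncate(entry[1])
```

Reverse application matters when the same region is overwritten more than once: the most recent entry restores an intermediate value, and the earliest entry restores the original bytes last.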
....to migrate, the current process state will be saved in a machine-independent format and translated into a portable migration program. The migration program is then compiled, linked, loaded, and executed on the target machine. No implementation or performance evaluation was presented. Chung et al. [15] recently described a process migration facility, CosMiC, that implements heterogeneous migration through storing critical data structures in a machine-independent XDR (External Data Representation) [24] format. The critical data needed for recovery is predefined in a specification file and then ....
P. E. Chung, Y. Huang, S. Yajnik, G. Fowler, K.-P. Vo, and Y.-M. Wang, "Checkpointing in CosMiC: A User-level Process Migration Environment", Pacific Rim International Symposium on Fault-Tolerant Systems, pp. 187--193, Dec. 1997.
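To make the idea of storing critical data in a machine-independent XDR format concrete, here is a small Python sketch. It is not CosMiC's specification-file syntax or code: the `SPEC` list, the field names and the two helper functions are hypothetical. It relies only on standard XDR conventions: big-endian 4-byte integers, big-endian IEEE doubles, and length-prefixed strings padded to a multiple of 4 bytes.

```python
import struct

# Hypothetical "specification": names and XDR types of the critical data.
SPEC = [("iteration", "int"), ("residual", "double"), ("label", "string")]


def xdr_pack(spec, state):
    """Encode the listed fields of `state` in XDR byte order and padding."""
    out = bytearray()
    for name, kind in spec:
        value = state[name]
        if kind == "int":
            out += struct.pack(">i", value)          # 4-byte big-endian int
        elif kind == "double":
            out += struct.pack(">d", value)          # 8-byte big-endian double
        elif kind == "string":
            raw = value.encode("ascii")
            pad = -len(raw) % 4                      # pad to a 4-byte boundary
            out += struct.pack(">I", len(raw)) + raw + b"\x00" * pad
        else:
            raise ValueError("unsupported type: " + kind)
    return bytes(out)


def xdr_unpack(spec, buf):
    """Decode a buffer produced by xdr_pack back into a dictionary."""
    state, pos = {}, 0
    for name, kind in spec:
        if kind == "int":
            (state[name],) = struct.unpack_from(">i", buf, pos)
            pos += 4
        elif kind == "double":
            (state[name],) = struct.unpack_from(">d", buf, pos)
            pos += 8
        elif kind == "string":
            (n,) = struct.unpack_from(">I", buf, pos)
            pos += 4
            state[name] = buf[pos:pos + n].decode("ascii")
            pos += n + (-n % 4)
        else:
            raise ValueError("unsupported type: " + kind)
    return state
```

Because the byte layout is fixed by the encoding rather than by the host, a buffer written on one architecture can be decoded on another, which is what heterogeneous migration requires of the saved state.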
....24.8 MB/sec MIDDLE 1/13.0 days 1/2.02 days 2.04 MB/sec 0.120 MB/sec LOW 1/70 min 1/75 min 1.00 MB/sec 0.200 MB/sec Table 3: Failure, repair and checkpointing data for the three processing environments. Finally, LOW is based on an idle workstation environment such as the one supported by CosMiC [9], where workstations are available for computations only when they are not in use by their owners. Failure and repair data was obtained by the authors of [9] and the checkpointing performance data was gleaned from performance results of CosMiC's transparent checkpointer libckp [35]. It is assumed ....
....data for the three processing environments. Finally, LOW is based on an idle workstation environment such as the one supported by CosMiC [9], where workstations are available for computations only when they are not in use by their owners. Failure and repair data was obtained by the authors of [9], and the checkpointing performance data was gleaned from performance results of CosMiC's transparent checkpointer libckp [35]. It is assumed that the copy-on-write optimization yields an 80% improvement in checkpoint overhead [24]. Note that the failure rate of LOW is extremely high, which is ....
P. E. Chung, Y. Huang, S. Yajnik, G. Fowler, K. P. Vo, and Y. M. Wang. Checkpointing in CosMiC: a user-level process migration environment. In Pacific Rim International Symposium on Fault-Tolerant Systems, December 1997.
....status of open files [4]. After a failure, the state of memory and registers can be reconstructed and the target program can be restarted from the most recent checkpoint. In a uniprocessor environment, a checkpoint and recovery facility can be provided either at the kernel level or at the user level [1, 3, 9]. A user-level checkpoint and recovery library can save the state of a process by linking the user program with the checkpointing library. This facility gives programmers flexibility, but there are also many restrictions. First, a user-level checkpoint library should take a checkpoint through a ....
....Table 1: Result of Performance Evaluation .... popen() or signals [11], and the user application must be linked with Libckpt. CosMiC: CosMiC is equipped with four checkpoint libraries, namely Libck, Libfcp, Libft and Libst [1]. Libck is a transparent checkpoint library, Libfcp is a file checkpoint library, Libft is a critical data checkpoint library and Libst is a strong type checkpoint library. CosMiC focuses on user-level process migration and does not provide a simple solution for checkpointing a general user ....
P. E. Chung, Y. Huang, S. Yajnik, G. Fowler, K. P. Vo, and Y. M. Wang "Checkpointing in CosMiC: a User-level Process Migration Environment," In Proc. Pacific Rim Int. Symp. on Fault-Tolerant Systems, Dec. 1997.
....1 Introduction The number of networked machines in many organizations is rapidly growing, while frequently many are underutilized or even idle [29, 34, 10, 13, 27]. Such environments are therefore potentially excellent platforms for executing compute-intensive parallel jobs as guests, in addition to the regular applications for which the machines are designated. A parallel computation most benefitting from such platforms will exhibit one or both of the ....
....resources in a dynamic environment where jobs are dynamically introduced. The setting was as follows. An adaptive Calypso program initially utilized eight machines. Then, repeatedly, we introduced a sequential program that runs for t minutes, where t is chosen uniformly from the interval [1,10], and waited 100 seconds after each run. After five hours, we measured the idleness (total amount of time when a machine was not running a job) of the machines to be less than 1% (approximately 4 milliseconds to allocate and then to deallocate resources to that job). This number can be viewed two ....
E. Chung, Y. Huang, and S. Yajnik. Checkpointing in CosMiC: A user-level process migration environment. In Proc. of Pacific Rim Symp. on Fault-Tolerant Computing, 1997.
....that are only as good as the simulator's base assumptions. Thus, questions also may be raised concerning the validity of the simulations. For example, all uptimes are considered equivalent by the simulator, whereas in real life, machines undergo fluctuations in load and true availability [5, 18]. Further, the three networks studied may not be a true reflection of the average workstation network. Finally, the decision to treat reserved time in the CETUS network as downtime may deflate the true availability of the system. For example, a user might checkpoint his or her code just before ....
P. E. Chung et al. Checkpointing in CosMiC: a user-level process migration environment. In Pac. Rim Int. Symp. on Fault-Tol. Systems, Dec. 1997.
....else exit(COSMIC MIGRATE) void malloc(size t size) call xept call rv = NULL ckp b4 exit( Figure 7. Specification to provide checkpoint-before-exit. A different style of exception masking can be done with checkpoint and restart. We developed a system called CosMiC [2] that supports automatic job submission and migration on a network of machines. Many users routinely use CosMiC to execute long-running programs with unpredictably varying memory usage. Valuable work may be lost if such a program simply exits upon an out-of-memory exception. A ....
P.-Y. Chung, Y. Huang, S. Yajnik, G. S. Fowler, K. P. Vo, and Y. M. Wang. Checkpointing in CosMiC: A User-level Process Migration Environment. In Proc. Pacific Rim International Symposium on Fault-Tolerant Systems, Dec 1997.
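The checkpoint-before-exit idea in the context above (save the work when an allocation fails, then exit with a status that asks for migration) can be sketched in Python as follows. This is an analogy and not the specification language shown in the citing paper's Figure 7: the function `run_with_checkpoint_before_exit`, the use of `pickle`, and the numeric value of `COSMIC_MIGRATE` are assumptions made for illustration. The sketch also assumes that `step` returns a new state object without mutating its argument, so the last good state is still intact when the failure occurs.

```python
import pickle
import sys

# Hypothetical exit status meaning "restart me elsewhere from the checkpoint".
COSMIC_MIGRATE = 75


def run_with_checkpoint_before_exit(step, state, ckpt_path, max_steps):
    """Run `step` repeatedly; on an out-of-memory error, checkpoint and exit."""
    for _ in range(max_steps):
        try:
            # Assumption: step() returns a new state and leaves `state` untouched.
            state = step(state)
        except MemoryError:
            # Save the last good state instead of losing the work done so far.
            with open(ckpt_path, "wb") as f:
                pickle.dump(state, f)
            sys.exit(COSMIC_MIGRATE)
    return state


def resume(ckpt_path):
    """Load the state saved by run_with_checkpoint_before_exit."""
    with open(ckpt_path, "rb") as f:
        return pickle.load(f)
```

A scheduler that sees the migrate status could restart the job on a machine with more free memory and pass it the result of `resume`.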
P.E. Chung, Y. Huang, S. Yajnik, G. Fowler, K.-P. Vo, Y.-M. Wang, "Checkpointing in CosMiC: a user-level process migration environment", Proceedings of Pacific Rim International Symposium on Fault-Tolerant Systems, Dec. 1997, pp. 187-193.