20 citations found.
Y. Chen, K. Li, J.S. Plank, CLIP: A checkpointing tool for message-passing parallel programs, High Performance Networking and Computing (SC97), IEEE/ACM, November

This paper is cited in the following contexts:
SRS - A Framework for Developing Malleable and Migratable.. - Vadhiyar, Dongarra (2002)   (6 citations)

....Checkpointing systems for sequential [30, 3] and parallel applications [15, 10, 4, 34, 20] have been built. Checkpointing systems are of different types depending on their transparency to the user and the portability of the checkpoints. Transparent and semi-transparent checkpointing systems [30, 12, 34] hide the details of checkpointing and restoration of saved states from the users, but are not portable. Non-transparent checkpointing systems [23, 21, 27, 20] require users to make some modifications to their programs but are highly portable across systems. Checkpointing can also be ....

....are located fail. 5. The machine on which the RSS daemon is executing must be failure-free for the duration of the application. 7 Related Work Checkpointing of parallel applications has been widely studied [16, 29, 25], and checkpointing systems for parallel applications have been developed [12, 10, 33, 38, 31, 15, 20, 34, 3, 23, 20, 4, 22, 21, 27]. Some of the systems were developed for homogeneous systems [12, 11, 33, 34], while some checkpointing systems allow applications to be checkpointed and restarted on heterogeneous systems [15, 20, 3, 5, 23, 21, 27]. Calypso [5] and Plinda [23] require application writers to write their programs in ....

[Article contains additional citation context not shown here]

Y. Chen, K. Li, and J. S. Plank. CLIP: A Checkpointing Tool for Message-passing Parallel Programs. In SC97: High Performance Networking and Computing, San Jose, November 1997.


MPICH-V: Toward a Scalable Fault Tolerant MPI for.. - Bosilca, Bouteiller, .. (2002)   (1 citation)

....on the Chandy-Lamport algorithm. For an uncoordinated checkpoint, the environment sends a notification of the failure to all surviving processes. The application may then take decisions and corrective operations to continue execution (i.e. adapt the data set repartition and work distribution). Clip [7] is a user-level coordinated checkpoint library dedicated to Intel Paragon systems. This library can be linked to MPI codes to provide semi-transparent checkpointing. The user adds checkpoint calls to the code but does not need to manage the program state on restart. 2.2 Optimistic log A theoretical ....

Yuqun Chen, Kai Li, and James S. Plank. CLIP: A checkpointing tool for message-passing parallel programs. In Proceedings of the IEEE Supercomputing '97 Conference (SC97), November 1997.


The Average Availability of Parallel Checkpointing Systems.. - Plank, Thomason (1999)   (1 citation)

....For uniprocessor systems, selection of such an interval is for the most part a solved problem [19, 26]. There has been important research in parallel systems [12, 25, 28], but the results are less unified. To date, most checkpointing systems for long-running distributed-memory computations (e.g. [4, 5, 13, 22, 24]) are based on coordinated checkpointing [8]. At each checkpoint, the global state of all the processors is defined and stored to a highly available stable storage. If any processor fails, then a replacement processor is selected to take the place of the failed processor, and then all processors ....

....in Table 3 and below. HIGH is a high-performance environment characterized by low failure rates and excellent checkpointing performance. The failure and repair rates come from the PRINCETON data set in [19], where failures are infrequent, and the checkpointing performance data comes from CLIP [5], a checkpointer for the Intel Paragon, which has an extremely fast file system. In HIGH, C, L and R are equal because CLIP cannot implement the copy-on-write optimization. MEDIUM is a medium-performance workstation network such as the Ultra Sparc network from [3]. We use workstation failure data ....

Y. Chen et al. CLIP: A checkpointing tool for message-passing parallel programs. In SC97: High Perf. Networking and Comp., Nov. 1997.


System-Level versus User-Defined Checkpointing - Silva, Silva (1998)   (5 citations)

....completely transparent to the programmer, but also supports the use of checkpointing primitives to specify the contents of the checkpoint. It has proved to be a very effective tool and we have used it for checkpointing UNIX processes. Another semi-transparent checkpointing tool was presented in [22], although this one is restricted to the Intel Paragon multicomputer. Another proposal was presented in [23]. That paper presents an object-based DSM system called Dome. That system was implemented on top of PVM and provides a library of distributed objects for parallel programming. The main ....

Y.Chen, J.Plank, K.Li: "CLIP: A Checkpointing Tool for Message-Passing Parallel Programs", Proceedings of Supercomputing'97, San Jose, California, November 1997


Efficient and Flexible Fault Tolerance and Migration of.. - Kohl, Papadopoulos (1998)   (1 citation)

....application. Specific plans for future research in this area will also be discussed. 2 Background The CUMULVS approach to checkpointing has several advantages over traditional core image checkpointing. Many transparent checkpointing environments, such as CoCheck [13], MPVM/MIST [14, 15], CLIP [16], Fail-Safe PVM [17], Isis [18], Totem [19], Condor [20], and others [21], are designed for single-architecture programs. CoCheck works with PVM and MPI to save the entire binary image of a program and move it to another similar machine. While this system works well for fault recovery, the size of ....

Y. Chen, J. Plank, K. Li, "CLIP: A Checkpointing Tool for Message-Passing Parallel Programs," SC97: High Performance Computing & Networking, San Jose, CA, November 1997.


Portable Fault-Tolerant File I/O - Lyubashevskiy

....transaction. The implementation of Eden is by no means portable, since it is tightly integrated with the operating system. A number of checkpointing systems exist for homogeneous environments, such as libckpt [15], a supplement of the Condor system [11, 12], and the CLIP system for Intel Paragon [4]. Not only are these systems tied to a particular machine architecture, they also do not support transactional file operations. By saving only the name of the file and the current file position in their checkpoints, these systems limit the applications to read-only or write-only file I/O. There ....

Yuqun Chen, James S. Plank, and Kai Li. CLIP: A checkpointing tool for message-passing parallel programs. Technical Report TR-543-97, Princeton University, Computer Science Department, May 1997.


Processor Allocation and Checkpoint Interval Selection in.. - Plank, Thomason (2001)   (3 citations)  Self-citation (Plank)

....computation state to stable storage. When one or more processors fail, the application may be restarted from the most recent checkpoint, thereby reducing the amount of recomputation that must be performed. To date, most checkpointing systems for long-running distributed-memory computations (e.g. [1, 5, 6, 18, 26, 29, 32]) are based on coordinated checkpointing [11]. At each checkpoint, the global state of all the processors is defined and stored to a highly available stable storage. If any processor fails, a replacement processor is selected to take the place of the failed processor, and then all processors ....

....in Table 3 and below. HIGH is a high-performance cluster characterized by low failure rates and excellent checkpointing performance. The failure and repair rates come from the PRINCETON data set in [25], where failures are infrequent, and the checkpointing performance data comes from CLIP [6], a checkpointer for the Intel Paragon, which has an extremely fast file system. In HIGH, C, L and R are equal because CLIP cannot implement the copy-on-write optimization. MEDIUM is a medium-performance workstation cluster such as the Ultra Sparc cluster from [4]. We use workstation failure ....

Y. Chen, J. S. Plank, and K. Li. CLIP: A checkpointing tool for message-passing parallel programs. In SC97: High Performance Networking and Computing, San Jose, November 1997.


The Effect of Timeout Prediction and Selection on Wide.. - Plank, Wolski, Allen (2001)   (1 citation)  Self-citation (Plank)

....with the failure are beyond the scope of this paper. They may involve aborting the program and starting anew, attempting a reconnection of the socket to retry the communication, or perhaps performing rollback recovery to a saved state so that the loss of work due to the failure is minimized [1, 3, 4]. No method of dealing with the failure will be successful, however, unless the failure is properly identified. The default failure identification method in TCP/IP sockets is a method of probing called keep-alive. At regular intervals, if a socket connection is idle, the operating system of one ....

Y. Chen, J. S. Plank, and K. Li. CLIP: A checkpointing tool for message-passing parallel programs. In SC97: High Performance Networking and Computing, San Jose, November 1997.


Deploying Fault Tolerance and Task Migration with NetSolve - Plank, Casanova, Beck.. (1999)   (1 citation)  Self-citation (Plank)

....and load balancing become especially important. There has been a vast amount of research on embedding fault tolerance and load balancing into parallel and distributed computing platforms. Approaches that have been explored include user-transparent checkpointing and migration libraries (e.g. [10, 12, 38, 25]), programming paradigms that facilitate the task of fault tolerance or load balancing (e.g. [27, 35]), or modified algorithms for performing certain specific computations in a fault-tolerant manner (e.g. [7, 20, 30]). While the effectiveness of these techniques has been demonstrated experimentally, ....

....from the point of the checkpoint. Several checkpointing libraries have been written for performing coordinated checkpointing on various parallel computing platforms. For example, MIST [10] and CoCheck [38] provide transparent checkpointing for PVM and MPI programs on networks of workstations, CLIP [12] provides semi-transparent checkpointing for Intel Paragon programs, and the PUL library [35] provides non-transparent checkpointing for a certain class of parallel applications. Here, transparency refers to the amount of programmer involvement necessary to get checkpointing to work. As a first ....

Y. Chen, J. S. Plank, and K. Li. CLIP: A Checkpointing Tool for Message-Passing Parallel Programs. In SC97: High Performance Networking and Computing, San Jose, November 1997.


Design, Implementation, and Performance of Checkpointing in.. - Agbaria, Plank   Self-citation (Plank)

....to be fault tolerant. Many basic checkpointing algorithms [EAWJ99, MS99] and optimization techniques [Pla99] have been developed for uniprocessor and parallel computing systems, and several checkpointing libraries and systems have been implemented [EZ92, HKW95, PBKL95, TL95, WHV 95, Ste96, CPL97, RS97, AF99]. However, for the typical scientific user, actually using a checkpointing system is a difficult task. All systems require the user to ....

....[Figure 1: The structure of NetSolve applications] port a library and recompile or relink their code subject to a number of restrictions imposed by the library. These restrictions range from strong typing of the source code [RS97] to restricted file I/O [PBKL95, CPL97] to static linking of runtime libraries [AF99] to restricted communication patterns [CPL97]. One restriction shared by all checkpointers is that no connections to the outside world may be open while checkpointing is underway. Because of all of these factors, few scientific users actually employ ....

[Article contains additional citation context not shown here]

Y. Chen, J. S. Plank, and K. Li. CLIP: A checkpointing tool for message-passing parallel programs. In SC97: High Performance Networking and Computing, San Jose, November 1997.


Design, Implementation, and Performance of Checkpointing in.. - Agbaria, Plank   Self-citation (Plank)

....research area for enabling long-running applications to be fault tolerant. Many basic checkpointing algorithms [6, 11] and optimization techniques [12] have been developed for uniprocessor and parallel computing systems, and several checkpointing libraries and systems have been implemented [1, 5, 8, 10, 14, 17, 18, 20, 22]. However, for the typical scientific user, actually using a checkpointing system is a difficult task. All systems require the user to port a library and recompile or relink their code subject to a number of restrictions imposed by the library. These restrictions range from strong typing of the ....

....user, actually using a checkpointing system is a difficult task. All systems require the user to port a library and recompile or relink their code subject to a number of restrictions imposed by the library. These restrictions range from strong typing of the source code [17] to restricted file I/O [5, 14] to static linking of runtime libraries [1] to restricted communication patterns [5]. One restriction shared by all checkpointers is that no connections to the outside world may be open while checkpointing is underway. Because of all of these factors, few scientific users actually employ ....

[Article contains additional citation context not shown here]

Y. Chen, J. S. Plank, and K. Li. CLIP: A checkpointing tool for message-passing parallel programs. In SC97: High Performance Networking and Computing, San Jose, Nov. 1997.


Experimental Assessment of Workstation Failures and Their.. - Plank, Elwasif (1997)   (12 citations)  Self-citation (Plank)

....powerful computational resources, rivaling supercomputers in their utility for scientific programming. Traditionally, checkpointing and rollback recovery have been employed to provide fault tolerance for long-running computations on all computing platforms (e.g. [LS92, PBKL95, HKW95, Ste96, EJZ92, CPL97]). By storing a checkpoint, a program limits the amount of re-execution necessary following a process or processor failure. In turn, this improves the program's running time in the presence of failures. plank@cs.utk.edu. This material is based upon work supported by the National Science ....

Y. Chen, J. S. Plank, and K. Li. CLIP: A checkpointing tool for message-passing parallel programs. In SC97: High Performance Networking and Computing, San Jose, November 1997.


The Average Availability of Uniprocessor Checkpointing.. - Plank, Thomason (1998)   (1 citation)  Self-citation (Plank)

....except that in our model the first interval and the intervals following recovery are all I seconds, whereas in Vaidya's, they are T seconds. We have chosen our model since it conforms to the model implemented by two public-domain checkpointing implementations: libckpt [PBKL95] and CLIP [CPL97]. However, as will be shown, unless failures are quite frequent, the results from our model and Vaidya's do not differ significantly. 2.2 Assumptions Our model (along with all the others) makes a few assumptions that do not hold in real checkpointing systems. First is that C, L, and R are ....

Y. Chen, J. S. Plank, and K. Li. CLIP: A checkpointing tool for message-passing parallel programs. In SC97: High Performance Networking and Computing, San Jose, November 1997.


The Average Availability of Parallel Checkpointing Systems.. - Plank, Thomason (1999)   (1 citation)  Self-citation (Plank)

....For uniprocessor systems, selection of such an interval is for the most part a solved problem [25, 34]. There has been important research in parallel systems [17, 33, 36], but the results are less unified. To date, most checkpointing systems for long-running distributed-memory computations (e.g. [5, 7, 19, 26, 29, 32]) are based on coordinated checkpointing [12]. At each checkpoint, the global state of all the processors is defined and stored to a highly available stable storage. If any processor fails, then a replacement processor is selected to take the place of the failed processor, and then all processors ....

....in Table 3 and below. HIGH is a high-performance environment characterized by low failure rates and excellent checkpointing performance. The failure and repair rates come from the PRINCETON data set in [25], where failures are infrequent, and the checkpointing performance data comes from CLIP [7], a checkpointer for the Intel Paragon, which has an extremely fast file system. In HIGH, C, L and R are equal because CLIP cannot implement the copy-on-write optimization. MEDIUM is a medium-performance workstation network such as the Ultra Sparc network from [4]. We use workstation failure data ....

Y. Chen, J. S. Plank, and K. Li. CLIP: A checkpointing tool for message-passing parallel programs. In SC97: High Performance Networking and Computing, San Jose, November 1997.


An Overview of Checkpointing in Uniprocessor and Distributed.. - Plank (1997)   (18 citations)  Self-citation (Plank)

.... distributed systems
MIST/MPVM [CCK 95] | Fault tolerance/migration | Message-passing distributed systems
CoCheck [Ste96] | Fault tolerance/migration | Message-passing distributed systems
[CGS 96] | Fault tolerance | Distributed shared memory systems
Ickp [PL94] | Fault tolerance | Intel iPSC/860
CLIP [CPL97] | Fault tolerance | Intel Paragon
Table 2. Examples of transparent checkpointers
Uses of checkpointing. Checkpointing provides the backbone for many tools, enumerated below. This list is not exhaustive. For more uses of checkpointing, please see Reference [WHV 95]. Fault tolerance (rollback ....

Y. Chen, J. S. Plank, and K. Li. CLIP: A checkpointing tool for message-passing parallel programs. In SC97: High Performance Networking and Computing, San Jose, November 1997.


Experimental Assessment of Workstation Failures and Their.. - Plank, Elwasif (1997)   (12 citations)  Self-citation (Plank)

....workstation networks have become powerful computational resources, rivaling supercomputers in their utility for scientific programming. Traditionally, checkpointing and rollback recovery have been employed to provide fault tolerance for long-running computations on all computing platforms (e.g. [4, 19, 21, 25, 27, 31]). By storing a checkpoint, a program limits the amount of re-execution necessary following a process or processor failure. In turn, this improves the program's running time in the presence of failures. How often to checkpoint is a question of paramount practical importance. If one checkpoints too ....

Y. Chen, J. S. Plank, and K. Li. CLIP: A checkpointing tool for message-passing parallel programs. In SC97: High Performance Networking and Computing, Nov. 1997.


Fault-tolerant Parallel Applications with Dynamic Parallel.. - Gerlach, Hersch (2005)

No context found.

Y. Chen, K. Li, J.S. Plank, CLIP: A checkpointing tool for message-passing parallel programs, High Performance Networking and Computing (SC97), IEEE/ACM, November


Flashback: A Lightweight Extension for Rollback and .. - Srinivasan.. (2004)   (1 citation)

No context found.

Y. Chen, J. S. Plank, and K. Li. Clip: a checkpointing tool for message-passing parallel programs. In Proceedings of the 1997.


Fault Manager for Distributed Operating Environments Design.. - Sens (1998)

No context found.

Y. Chen, K. Li, and J.S. Plank, 'CLIP: A Checkpointing Tool for Message-passing Parallel Programs', Proceedings of High Performance Networking and Computing, November 1997.
