J.S. Plank, Y. Kim, J.J. Dongarra, Algorithm-Based Diskless Checkpointing for Fault Tolerant Matrix Operations, FTCS25, Proc. 25th Annual International Symposium on Fault-Tolerant Computing, pp. 351-360, 1995
....the algorithm, modify the encoded data concurrently with the original data, and check that the encoding is preserved at various points during the execution of the algorithm. Fault tolerant algorithms using this approach have been devised for several numerical algorithms ([1], [2], [3], [4], [5], [6], [7]). For a class of algorithms performing linear transformations on the data, a natural encoding to choose is the checksum encoding [1], where a checksum is computed of the data being operated on by the algorithm. The checksum is then transformed concurrently with the computations on the data ....
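The checksum encoding described in this excerpt can be illustrated with a small sketch. It assumes the linear transformation is a matrix product and uses a column-sum checksum; the function names (`matmul`, `column_checksum`, `checksum_preserved`) and the sample matrices are hypothetical and not taken from any of the cited papers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def column_checksum(m):
    """Return the row vector holding the sum of each column of m."""
    return [sum(col) for col in zip(*m)]


def checksum_preserved(a, b):
    """Compare the transformed checksum with the checksum of the result."""
    # The checksum row of A is transformed by the same operation as the data.
    transformed = matmul([column_checksum(a)], b)[0]
    return transformed == column_checksum(matmul(a, b))


# Illustrative integer data (an assumption), so the comparison is exact.
A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [10, 11, 12]]
print(checksum_preserved(A, B))
```

Integers keep the equality test exact; with floating-point data the comparison would need a tolerance.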
J. S. Plank, Y. Kim, and J. J. Dongarra, "Algorithm-based diskless checkpointing for fault-tolerant matrix operations," Proc. FTCS-25, June 1995.
....state, nor does it support reinitializing state at run time. Also, it does not include any sort of program analysis to determine the extent of the checkpointing. There has been other work on checkpointing in the context of migrating applications [10], using extra processors for fault tolerance [12], post mortem and replay debugging, elimination of boundary condition errors [13], etc. Our work is very similar to user-level transparent checkpointing techniques [11]. Such techniques usually work by compiling the application program with a special checkpointing library. Our technique, on the ....
J. S. Plank, Y. Kim, and J. J. Dongarra. Algorithm-based diskless checkpointing for fault-tolerant matrix operations. In FTCS-25: 25th International Symposium on Fault Tolerant Computing Digest of Papers, pp. 351-360, 1995.
.... regularly save global snapshots of the session state on stable storage such that a failed session can be restarted from the last saved state (potentially migrating tasks to other machines); stable storage can be implemented by disks or by replicated copies in the memories of multiple machines [11]. Some systems perform checkpointing transparently to the application, often on top of PVM [4] or MPI [12]. Other systems rely on application support for checkpointing [2, 10]. Many metacomputing and message passing frameworks include at least failure detection and handling services that enable ....
J. S. Plank, Y. Kim, and J. Dongarra. Algorithm-Based Diskless Checkpointing for Fault Tolerant Matrix Operations. In 25th International Symposium on Fault-Tolerant Computing, Pasadena, California, June 1995. IEEE CS Press.
.... global snapshots of the session state on stable storage such that a failed session can be restarted from the last saved state (potentially migrating tasks to other machines); stable storage can be implemented by disks or by replicated copies in the memories of multiple machines [17, 19]. Some systems perform checkpointing transparently to the application, often on top of PVM [6, 5, 9, 13] or MPI [18]. Other systems rely on application support for checkpointing [3, 16]. Most metacomputing and message passing frameworks include at least failure detection and handling services ....
James S. Plank, Youngbae Kim, and Jack Dongarra. Algorithm-Based Diskless Checkpointing for Fault Tolerant Matrix Operations. In 25th International Symposium on Fault-Tolerant Computing, Pasadena, California, June 1995. IEEE Computer Society Press.
....the same architecture and operating system. Moreover, NOWs are dynamic environments i.e. workstations frequently leave and enter the network due to their availability or failure. These factors make that it is necessary to find new solutions for the problems of parallelism like the fault tolerance [12]. It is also necessary to change the parallel programming methodology in order to take into account the adaptive nature of the applications, we refer that to parallel adaptive programming. An application programmed according to this philosophy is called a parallel adaptive application, namely ....
....For example, in [2] the authors show, with the help of the multi-block Navier-Stokes solver application implemented on a PVM NOW, that if the number of processors does not vary frequently the data redistribution cost is not important compared with the time required for the actual calculations. In [12], the authors experiment with a checkpointing algorithm for fault tolerance on a SUN NOW and an IBM SP2. The results on both systems are close together. In this paper, we experiment with adaptive parallelism on a numerical application, the block-based Gauss-Jordan method. A classical parallel version of this ....
James S. Plank, Youngbae Kim, and Jack J. Dongarra. Algorithm-Based Diskless Checkpointing for Fault Tolerant Matrix Operations. 25th Symposium on Fault-Tolerant Computing, Pasadena, CA, June 1995.
....to the application programmer, e.g. [12], [15]. However, a transparent scheme is unlikely to take advantage of points in an application where data to be saved is minimum, such as when data has just been written to disk for instance. One non-transparent scheme for the static partitioning approach [17] maintains a parity copy of distributed partitions of computation state. While performance for a Cholesky factorization of a 5000-element square matrix, at 1700 seconds employing 17 Sparc 2 machines, is similar to that recorded here, the computation is bounded by total memory and the approach here ....
James S. Plank, Youngbae Kim, and Jack J. Dongarra. Algorithm-based diskless checkpointing for fault tolerant matrix operations. In 25th International Symposium on Fault-Tolerant Computing, June 1995.
....to the application programmer, e.g. [11], [14]. However, a transparent scheme is unlikely to take advantage of points in an application where data to be saved is minimum, such as when data has just been written to disk for instance. One non-transparent scheme for the static partitioning approach [17] maintains a parity copy of distributed partitions of computation state. While performance for a Cholesky factorization of a 5000-element square matrix, at 1700 seconds employing 17 Sparc 2 machines, is similar to that recorded here, the computation is bounded by total memory and the approach here ....
James S. Plank, Youngbae Kim, and Jack J. Dongarra. Algorithm-based diskless checkpointing for fault tolerant matrix operations. Technical Report CS-94-268, University of Tennessee, December 1994.
....programs. Huang and Kintala [13] have built a system that provides several levels of software fault tolerance for client server applications. They demonstrate that these mechanisms are quite useful in providing appropriate system behavior in terms of availability and data consistency. Plank et al. [22] have a unique approach which uses diskless checkpointing. They utilize a parity processor rather than a disk for storing processor state. Upon failure the parity processor is able to reconstruct the state of the failed processor from the parity and the state of the remaining processors. ....
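The parity-processor recovery described in this excerpt can be sketched as follows. The sketch assumes that each processor's state is a byte string of equal length and that parity is bitwise XOR; the helper names (`xor_bytes`, `parity`, `recover`) and the sample states are hypothetical and are not the authors' implementation.

```python
from functools import reduce


def xor_bytes(x, y):
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(p ^ q for p, q in zip(x, y))


def parity(states):
    """Parity block held by the extra (parity) processor."""
    return reduce(xor_bytes, states)


def recover(surviving_states, parity_block):
    """Rebuild the failed processor's state from the parity and the survivors."""
    return reduce(xor_bytes, surviving_states, parity_block)


# Illustrative states (an assumption); real checkpoints would be memory images.
states = [b"proc0---", b"proc1---", b"proc2---"]
parity_block = parity(states)

lost = 1  # index of the failed processor
survivors = states[:lost] + states[lost + 1:]
recovered = recover(survivors, parity_block)
print(recovered == states[lost])
```

A single parity block tolerates only one failure at a time, which matches the single parity processor mentioned in the excerpt.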
James S. Plank, Youngbae Kim, and Jack J. Dongarra. Algorithm-Based Diskless Checkpointing for Fault Tolerant Matrix Operations. In FTCS-25, Pasadena, CA, June 1995.
....We term any such event a failure. Thus, on the wish list of scientific programmers is a way to perform computation efficiently on a NOW whose components are tolerant to failure. Recently, a fault tolerant computing paradigm based on diskless checkpointing [25] has been developed in the papers [16, 17, 23, 24]. The paradigm is based on checkpointing and rollback recovery using processor and memory redundancy without any reliance on disk. Its underlying idea is to adopt the N+1 parity, used by Gibson to provide reliability in RAID (Redundant Array of Inexpensive Disks) [12]. The paradigm is an ....
....stores an encoding of the application processors' checkpoints. When a processor failure occurs, an extra idle processor replaces the failed processor and recovers its data from remaining application processors and the checkpoint encoding. Recently, checkpointing techniques based on parity [23, 24] or checksum and reverse computation [17] have been used to incorporate fault tolerance into high-performance matrix operations. Throughout this paper, we call these techniques single checkpointing because they employ only one checkpointing processor. In this paper, we present a new technique ....
J. S. Plank, Y. Kim, and J. J. Dongarra. Algorithm-based diskless checkpointing for fault tolerant matrix operations. In The 25th International Symposium on Fault-Tolerant Computing, pages 351--360, Pasadena, CA, June 1995.
....be available for computation at one moment, but gone the next due to failure, load, or availability. We term any such event a failure. Thus, on the wish list of scientific programmers is a way to perform computation efficiently on a NOW whose components are prone to failure. Recently, the papers [23, 24] have developed such a fault tolerant computing paradigm. The paradigm is based on checkpointing and rollback recovery using processor and memory redundancy. It is called diskless checkpointing as it provides fault tolerance without any reliance on disk. For this paradigm, a parity-based ....
....rollback recovery enables a system with fail-stop failures [33] to tolerate failures by periodically saving the entire state and rolling back to the saved state if a failure occurs. Our technique for checkpointing and rollback recovery adopts the idea of algorithm-based diskless checkpointing [23]. If the program is executing on N processors, there is an (N+1)st processor called the checkpointing processor. At all points in time, a consistent checkpoint is held in the N processors in memory. A checksum (floating point addition) of the N checkpoints is held in the checkpointing processor. ....
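The floating-point checksum scheme in this excerpt can be sketched under simple assumptions: each of the N checkpoints is a list of floats of the same length, and the checkpointing processor stores their elementwise sum. The function names (`make_checksum`, `recover_failed`) and the data are hypothetical illustrations.

```python
def make_checksum(checkpoints):
    """Elementwise floating-point sum of the N processors' checkpoints."""
    return [sum(vals) for vals in zip(*checkpoints)]


def recover_failed(checksum, survivors):
    """Subtract the surviving checkpoints from the checksum, element by element."""
    return [c - sum(vals) for c, vals in zip(checksum, zip(*survivors))]


# Illustrative checkpoints for N = 3 processors (an assumption).
# The values are exactly representable in binary, so no rounding occurs here.
checkpoints = [[1.5, 2.0], [0.25, 4.0], [3.0, 8.5]]
checksum = make_checksum(checkpoints)  # held by the (N+1)st processor

failed = 2  # index of the failed processor
survivors = [c for i, c in enumerate(checkpoints) if i != failed]
restored = recover_failed(checksum, survivors)
print(restored)
```

Unlike bitwise parity, recovery by floating-point subtraction is generally subject to rounding error, so a restored checkpoint may differ slightly from the original for arbitrary data.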
J. S. Plank, Y. Kim, and J. J. Dongarra. Algorithm-based diskless checkpointing for fault tolerant matrix operations. In The 25th International Symposium on Fault-Tolerant Computing, pages 351--360, Pasadena, CA, 1995.
.... all at a low cost [PGK88, Gib92, CLG+94]. Since then, the technique has been used to design multicomputer and network file systems with high reliability and bandwidth [HO93, CLVW94] and to design fast checkpointing systems where extra processors provide reliability instead of disks [PL94, PKD95, CD96]. We call all such systems RAID-like systems. The above problem is central to all RAID-like systems. When storage is distributed among n devices, the chances of one of these devices failing becomes significant. To be specific, if the mean time before failure of one device is F, then the ....
J. S. Plank, Y. Kim, and J. Dongarra. Algorithm-based diskless checkpointing for fault tolerant matrix operations. In 25th International Symposium on Fault-Tolerant Computing, pages 351--360, Pasadena, CA, June 1995.
No context found.
J.S. Plank, Y. Kim, J.J. Dongarra, Algorithm-Based Diskless Checkpointing for Fault Tolerant Matrix Operations, FTCS25, Proc. 25th Annual International Symposium on Fault-Tolerant Computing, pp. 351-360, 1995
No context found.
J. S. Plank, Y. Kim and J.J. Dongarra. "Algorithm-based diskless checkpointing for fault-tolerant matrix computations." In Proceedings of the Twenty Fifth International Symposium on Fault-Tolerant Computing Systems, pp. 351-360, Jun. 1995.