Results 1 - 10 of 49
Diskless Checkpointing
1997
Cited by 91 (3 self)
Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.
CLIP: A Checkpointing Tool for Message-Passing Parallel Programs
1997
Cited by 60 (9 self)
Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpointer: CLIP. CLIP is a user-level library that provides semitransparent checkpointing for parallel programs on the Intel Paragon multicomputer. It is publicly available to Paragon users at no cost. Conceptually, checkpointing a multicomputer is quite straightforward. However, when creating an actual tool for checkpointing a complex machine like the Paragon, many more issues arise that require careful design decisions to be made. Sometimes ease-of-use must be sacrificed for efficiency and/or correctness. This paper details what these decisions are, and how they were made in CLIP. We also present performance data when checkpointing several long-running Paragon applications with CLIP. The bottom line is that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer like the Paragon with very good performance.
An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance
1997
Cited by 38 (0 self)
Checkpointing is the act of saving the state of a running program so that it may be reconstructed later in time. It is an important basic functionality in computing systems that paves the way for powerful tools in many fields of computer science. This article provides a comprehensive overview of checkpointing in uniprocessor and parallel processing systems, including definitions, uses of checkpointing, and implementation details. Also included in this overview is a brief discussion of checkpoint consistency, which is a major concern in parallel processing systems, and a thorough discussion of issues related to the performance of checkpointing. It is intended that the reader of this article should receive a thorough grounding in checkpointing, with enough detail to implement an efficient checkpointer if so desired.
The Cost of Recovery in Message Logging Protocols
1998
Cited by 24 (3 self)
Past research in message logging has focused on studying the relative overhead imposed by pessimistic, optimistic, and causal protocols during failure-free executions. In this paper, we give the first experimental evaluation of the performance of these protocols during recovery. We discover that, if a single failure is to be tolerated, pessimistic and causal protocols perform best, because they avoid rollbacks of correct processes. For multiple failures, however, the dominant factor in determining performance becomes where the recovery information is logged (i.e., at the sender, at the receiver, or replicated at a subset of the processes in the system) rather than when this information is logged (i.e., if logging is synchronous or asynchronous).
1 Introduction
Message-logging protocols (for example, [2, 3, 4, 6, 9, 10, 14, 15]) are popular techniques for building systems that can tolerate process crash failures. These protocols are built on the assumption that the state of a process is...
Improving the Performance of Coordinated Checkpointers on Networks of Workstations using RAID Techniques
1996
Cited by 22 (10 self)
Coordinated checkpointing systems are popular and general-purpose tools for implementing process migration, coarse-grained job swapping, and fault-tolerance on networks of workstations. Though simple in concept, there are several design decisions concerning the placement of checkpoint files that can impact the performance and functionality of coordinated checkpointers. Although several such checkpointers have been implemented for popular programming platforms like PVM and MPI, none have taken this issue into consideration. This paper addresses the issue of checkpoint placement and its impact on the performance and functionality of coordinated checkpointing systems. Several strategies, both old and new, are described and implemented on a network of SPARC-5 workstations running PVM. These strategies range from very simple to more complex, borrowing heavily from ideas in RAID (Redundant Arrays of Inexpensive Disks) fault-tolerance. The results of this paper will serve as a guide so that f...
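The RAID idea that the abstract above alludes to can be illustrated with a single XOR parity block computed over per-process checkpoints. The following is only a minimal sketch under stated assumptions: the helper names (`xor_bytes`, `parity_of`, `recover`) are hypothetical, checkpoints are assumed to be equal-length byte strings, and only one lost checkpoint is tolerated. It is not the placement strategy evaluated in the paper.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(checkpoints: list[bytes]) -> bytes:
    """Parity block: XOR of all checkpoints (assumed equal length)."""
    return reduce(xor_bytes, checkpoints)


def recover(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical per-process checkpoint images, padded to the same length.
checkpoints = [b"proc0-state", b"proc1-state", b"proc2-state"]
parity = parity_of(checkpoints)

# Simulate losing the checkpoint of process 1 and rebuilding it.
lost = 1
survivors = [c for i, c in enumerate(checkpoints) if i != lost]
rebuilt = recover(survivors, parity)
```

Because XOR is its own inverse, combining the parity block with every surviving checkpoint cancels them out and leaves the missing one. A real checkpointer would also have to handle unequal checkpoint sizes and decide where the parity block is stored.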
Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems
Journal of Parallel and Distributed Computing, 2001
Cited by 22 (0 self)
Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none address the issue of selecting runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper, we present a performance model for long-running parallel computations that execute with checkpointing enabled. We then discuss how it is relevant to today's parallel computing environments and software, and present case studies of using the model to select runtime parameters.
Keywords: Checkpointing, performance prediction, parameter selection, parallel computation, Markov chain, exponential failure and repair distributions.
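To make the notion of an optimal checkpointing interval concrete, the sketch below uses Young's classic first-order approximation, interval ≈ sqrt(2 × checkpoint cost × MTBF). This is a swapped-in textbook formula, not the Markov-chain model of the paper above. The function names and the numbers are hypothetical, and processor failures are assumed to be independent and exponentially distributed.

```python
import math


def system_mtbf(per_processor_mtbf_s: float, processors: int) -> float:
    """MTBF of the whole job, assuming independent exponential failures."""
    if processors < 1:
        raise ValueError("need at least one processor")
    return per_processor_mtbf_s / processors


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical figures: 50 s per checkpoint, 100,000 s MTBF per processor.
for n in (10, 40):
    mtbf = system_mtbf(100_000.0, n)
    print(n, "processors ->", round(young_interval(50.0, mtbf)), "s between checkpoints")
```

The loop hints at why processor allocation matters: adding processors lowers the job's MTBF, which in turn shortens the recommended interval and raises checkpointing overhead.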
Why Optimistic Message Logging Has Not Been Used In Telecommunications Systems
In Proc. IEEE Fault-Tolerant Computing Symp, 1995
Cited by 21 (4 self)
Much of the literature on message logging and checkpointing in the past decade has been based on a so-called optimistic approach [1] that places more emphasis on failure-free overhead than recovery efficiency. Our experience has shown that most telecommunications systems use a pessimistic approach because the main purpose of using message logging and checkpointing is to achieve fast and localized recovery, and the failure-free overhead of a pessimistic approach can often be made reasonably low by exploiting application-specific information.
1 A Brief Literature Survey
Much of the existing work on message logging and checkpointing assumes a piecewise deterministic (PWD) execution model [2]. Under the PWD assumption, each process execution is viewed as a number of state intervals bounded by nondeterministic message receiving events. Execution within each state interval is completely deterministic, and hence replayable. This allows the use of message logging as a form of checkpoint...
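The piecewise deterministic assumption described above can be sketched in a few lines: if message receipt is the only nondeterministic event and each message is logged, replaying the log in order reproduces the process state. The `Process` class, its integer state, and the update rule are hypothetical stand-ins for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    state: int = 0
    log: list[int] = field(default_factory=list)

    def deliver(self, message: int, record: bool = True) -> None:
        """Receive a message; this starts a new deterministic state interval."""
        if record:
            # Logging the nondeterministic event is what makes replay possible.
            self.log.append(message)
        # Deterministic computation within the state interval (toy update rule).
        self.state = self.state * 31 + message


def replay(log: list[int]) -> Process:
    """Rebuild a process from its initial state by re-delivering logged messages."""
    recovered = Process()
    for message in log:
        recovered.deliver(message, record=False)
    return recovered


original = Process()
for msg in [3, 1, 4]:
    original.deliver(msg)

recovered = replay(original.log)
```

In this toy version the log lives inside the process object; a real protocol has to keep it somewhere that survives the crash, and where and when it is written is what distinguishes the pessimistic and optimistic approaches.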
Stardust: an Environment for Parallel Programming on Networks of Heterogeneous Workstations
Journal of Parallel and Distributed Computing, 1996
Cited by 20 (1 self)
This paper describes Stardust, an environment for parallel programming on networks of heterogeneous machines. Stardust runs on distributed memory multicomputers and networks of workstations. Applications using Stardust can communicate both through message-passing and distributed shared memory. Stardust includes a mechanism for application reconfiguration. This mechanism is used for balancing the load of the machines hosting the application, as well as for tolerating machine restarts (anticipated or not). At reconfiguration time, application processes can migrate between heterogeneous machines, and the number of application processes can vary (increase or decrease) depending on the available resources. Stardust is currently implemented on a heterogeneous system including an Intel Paragon running Mach/OSF1 and a set of Pentiums running Chorus/classiX. The paper details the design and implementation of Stardust, as well as its performance.
Contact author: Isabelle Puaut, IRISA, Campus Uni...
Understanding The Message Logging Paradigm For Masking Process Crashes
1996
Cited by 20 (6 self)
... This dissertation presents the first such formal specification. From this specification, the two major classes of message-logging protocols, namely optimistic and pessimistic, are characterized. A third and new class of message-logging protocols, called causal, is introduced. A notion of optimality, based on three important performance metrics, is proposed, and it is shown that optimal implementations of causal message-logging protocols exist. In particular, it is shown that causal message-logging protocols combine the positive aspects of optimistic and pessimistic message logging
Algorithm-Based Diskless Checkpointing for Fault Tolerant Matrix Operations
1995
Cited by 20 (6 self)
This paper is an exploration of diskless checkpointing for distributed scientific computations. With the widespread use of the "Network Of Workstation" (NOW) platform for distributed computing, long-running scientific computations need to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several algorithms for distributed scientific computing, including Cholesky factorization, LU factorization, QR factorization, and preconditioned conjugate gradient. These implementations are able to run on PVM networks of at least N processors, and can complete with low overhead as long as any N processors remain functional. We discuss the details of how the algorithms are tuned for fault-tolerance, and present the performance results on a PVM network of SUN workstations, and on the IBM SP2.

