  Analysis of checkpointing schemes for multiprocessor systems (1993) [6 citations — 0 self]

by Avi Ziv, Jehoshua Bruck
Tech. Rep. RJ 9593, IBM Almaden Research Center
http://vigeland.paradise.caltech.edu/papers/etr003.ps

Abstract:

Parallel computing systems provide hardware redundancy that helps to achieve low-cost fault tolerance. In those systems, fault tolerance is achieved by duplicating the task on more than one processor and comparing the states of the processors at checkpoints. Many schemes that achieve fault tolerance exist, and most of them use checkpointing to reduce the time spent retrying a task. Performance evaluation for most of the schemes either relies on simulation results or uses a simplified fault model. This paper suggests a novel technique, based on a Markov Reward Model (MRM), for analyzing the performance of checkpointing schemes for fault tolerance. We show how this technique can be used to derive the average execution time of a task and other important parameters related to the performance of checkpointing schemes. Our analytical results closely match the values we obtained using a simulation program. We compare the average task completion time and total work of four checkpointing schemes: TMR, DMR-B-2, DMR-F-1 and RFCS. We show that, in general, increasing the number of processors reduces the average completion time but increases the total work done by the processors. Specifically, the TMR scheme, which uses three processors, is the quickest but does the most work, while the DMR-B-2 scheme, which uses only two processors, is the slowest of the four schemes but does the least work. However, when the times required for different operations differ greatly, those results can change. For example, when we assume that the schemes are implemented on workstations connected by a LAN and the time to move data between workstations is relatively long, the DMR-B-2 scheme can become quicker than the TMR scheme.
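
The abstract does not give the model itself, so the following is only a minimal sketch of the general idea of computing an expected completion time from a Markov chain with rewards. It is not the authors' model and does not reproduce any of the four schemes compared in the paper. It assumes a hypothetical two-processor scheme with simple rollback: the task is split into equal intervals, each processor suffers Poisson faults at rate `fault_rate` during execution, and any fault detected at the checkpoint comparison forces the interval to be repeated. All function and parameter names are illustrative.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def expected_completion_time(n_intervals, interval, checkpoint_cost, fault_rate):
    """Expected time to absorption for a simplified duplex rollback chain.

    Transient state i means "i intervals committed"; state n_intervals is
    absorbing. Every attempt earns a reward (time spent) of
    interval + checkpoint_cost, whether or not it succeeds.
    """
    # Assumption: faults are Poisson and only strike during execution,
    # so an attempt succeeds when neither of the two processors faults.
    p_ok = math.exp(-2.0 * fault_rate * interval)
    n = n_intervals
    # Build (I - Q), where Q holds transitions among transient states:
    # stay in i with probability 1 - p_ok, move to i + 1 with p_ok.
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 1.0 - (1.0 - p_ok)
        if i + 1 < n:
            a[i][i + 1] = -p_ok
    rewards = [interval + checkpoint_cost] * n
    # x = (I - Q)^{-1} r gives the expected accumulated reward from each
    # starting state; the task starts in state 0.
    return solve_linear(a, rewards)[0]


def closed_form(n_intervals, interval, checkpoint_cost, fault_rate):
    """Same quantity via the geometric number of attempts per interval."""
    p_ok = math.exp(-2.0 * fault_rate * interval)
    return n_intervals * (interval + checkpoint_cost) / p_ok
```

For this toy chain the linear solve and the closed form agree, which is a useful sanity check before moving to chains with more states (for example, ones that distinguish which processor failed or that model spare processors), where no simple closed form is available.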

Citations

422 Mathematica: A system for doing mathematics by computer – Wolfram - 1991
75 Queueing Systems Vol. I Theory – Kleinrock - 1975
42 Sequoia: A fault-tolerant tightly coupled multiprocessor for transaction processing – Bernstein - 1988
36 Rollback and recovery strategies for computer programs – Chandy, Ramamoorthy - 1972
19 Compiler-Assisted Static Checkpoint Insertion – Long, Fuchs, et al. - 1992
18 On the optimum checkpoint selection problem – Toueg, Babaoglu - 1984
15 On evaluating the cumulative performance distribution of fault-tolerant computer systems – Donatiello, Grassi - 1991
12 Fault Tolerance in Multiprocessor Systems without Dedicated Redundancy – Agrawal - 1988
12 Forward recovery using checkpointing in parallel systems – Long, Fuchs, et al. - 1990
12 Roll-forward checkpointing scheme: Concurrent retry with nondedicated spares – Pradhan, Vaidya - 1992
8 Fault-Tolerant Systems in Commercial Applications – Serlin - 1984
5 The Tandem Non-Stop System – Dimmer - 1985
5 The Analysis of Computer Systems Using Markov Reward Processes – Smith, Trivedi - 1990
5 Dependability measurement and modeling of a multicomputer system – Tang, Iyer - 1993
3 A multi-reward stochastic model for the completion time of parallel tasks – Bobbio - 1991
2 Modeling correlated transient failures in fault-tolerant systems – Krishna, Singh - 1989
1 Optimal number of checkpoints in checkpointing schemes – Ziv, Bruck - 1993