  Fault Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing

by James S. Plank, Youngbae Kim, Jack J. Dongarra
http://www.cs.utk.edu/~plank/plank/papers/ADCKP.ps.Z

Abstract:

Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load, or availability. As long as the cluster retains at least n processors and failures occur one at a time, the computation completes efficiently. We discuss the details of how the algorithms are tuned for fault tolerance and present performance results on a PVM network of Sun workstations connected by a fast, switched Ethernet.
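
The idea of using processor redundancy in place of stable storage can be illustrated with a small sketch. The abstract does not say how the redundant processor encodes the checkpoints, so the bytewise XOR parity used here is an assumption, chosen because it is one common way to survive a single failure. The function names (`xor_blocks`, `take_checkpoint`, `recover`) and the sample data are hypothetical, and everything runs in one process rather than over PVM.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def take_checkpoint(states):
    """Compute the parity block that a spare processor would keep in memory.

    Assumption: parity is a plain XOR of every compute processor's state.
    """
    return xor_blocks(states)


def recover(surviving_states, parity):
    """Rebuild the one lost state from the survivors plus the parity block."""
    return xor_blocks(surviving_states + [parity])


# Three simulated compute processors, each holding an equal-sized state.
states = [b"proc0-data", b"proc1-data", b"proc2-data"]
parity = take_checkpoint(states)

# Simulate the failure of processor 1 and rebuild its state without any disk.
lost = 1
survivors = states[:lost] + states[lost + 1:]
rebuilt = recover(survivors, parity)
```

With a single parity block, only one lost state can be rebuilt at a time, which matches the abstract's condition that failures occur singly.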
