Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load, or availability. As long as there are at least n processors in the cluster, and failures occur singly, the computation will complete in an efficient manner. We discuss the details of how the algorithms are tuned for fault tolerance and present the performance results on a PVM network of Sun workstations connected by a fast, switched Ethernet.
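The idea of using processor redundancy in place of stable storage can be illustrated with a small sketch. The abstract does not say which encoding the algorithms use, so the bytewise XOR parity below is an assumption chosen for illustration, and the names `parity` and `recover` are hypothetical. One extra processor holds the parity of the other processors' in-memory checkpoints, which is enough to rebuild the state of any single failed processor.

```python
from functools import reduce


def parity(blocks):
    """Return the bytewise XOR of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(surviving_blocks, parity_block):
    """Rebuild the one missing block from the survivors and the parity block.

    XOR-ing the parity with every surviving block cancels their
    contributions and leaves only the lost block.
    """
    return parity(list(surviving_blocks) + [parity_block])


# Hypothetical in-memory checkpoints of three application processors.
states = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# The redundant processor stores only the parity, never writing to disk.
checkpoint = parity(states)

# Simulate a single failure: processor 1 is lost.
lost = 1
survivors = states[:lost] + states[lost + 1:]
restored = recover(survivors, checkpoint)
assert restored == states[lost]
```

A single parity block tolerates only one failure between checkpoints, which matches the abstract's condition that failures occur singly.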