by Checkpointing, David B. Johnson, Willy Zwaenepoel
Download: http://www.cs.cmu.edu/~dbj/ftp/jalg.ps.gz
Abstract:
Message logging and checkpointing can provide fault tolerance in distributed systems in which all process communication is through messages. This paper presents a general model for reasoning about recovery in these systems. Using this model, we prove that the set of recoverable system states that have occurred during any single execution of the system forms a lattice, and that therefore, there is always a unique maximum recoverable system state, which never decreases. Based on this model, we present an algorithm for determining this maximum recoverable state, and prove its correctness. Our algorithm utilizes all logged messages and checkpoints, and thus always finds the maximum recoverable state possible. Previous recovery methods using optimistic message logging and checkpointing have not considered the existing checkpoints, and thus may not find this maximum state. Furthermore, by utilizing the checkpoints, some messages received by a process before it was checkpointed may not need to be logged. Using our algorithm also adds less communication overhead to the system than do previous methods. Our model and algorithm can be used with any message logging protocol, whether pessimistic or optimistic, but their full generality is only required with optimistic logging protocols.
Citations
1746 | Time, clocks, and the ordering of events in a distributed system – Lamport - 1978
1319 | Concurrency Control and Recovery in Database Systems – Bernstein, Hadzilacos, et al. - 1987
796 | Distributed snapshots: Determining global states of distributed systems – Chandy, Lamport - 1985
693 | Virtual time – Jefferson - 1985
425 | System Structure for Software Fault Tolerance – Randell - 1975
247 | Fail-Stop Processors: An Approach to Designing Fault-Tolerant Computing Systems – Schlichting, Schneider - 1982
241 | Checkpointing and Rollback-Recovery for Distributed Systems – Koo, Toueg - 1987
118 | Sender-Based Message Logging – Johnson, Zwaenepoel - 1987
117 | Fault Tolerance Under UNIX – Borg, Blau, et al. - 1989
112 | A Message System Supporting Fault Tolerance – Borg, Baumbach, et al. - 1983
93 | Efficient distributed recovery using message logging – Sistla, Welch - 1989
92 | Publishing: A Reliable Broadcast Communication Mechanism – Powell, Presotto - 1983
76 | Crash recovery in a distributed data storage system – Lampson, Sturgis - 1976
61 | State restoration in systems of communicating processes – Russell - 1980
55 | Transaction management in the R* distributed database management system – Mohan, Lindsay, et al. - 1986
54 | Recovery management in QuickSilver – Haskin, Malachi, et al. - 1988
42 | Distributed simulation and the time warp operating system – Jefferson, Beckman - 1987
30 | The state machine approach: A tutorial – Schneider - 1986
30 | Optimistic recovery in distributed systems – Strom, Yemini - 1985
26 | Reliable Object Storage to Support Atomic Actions – Oki, Liskov, et al. - 1985
11 | Distributed Transaction Processing and the Camelot System – Spector - 1987
5 | Virtual Time – Jefferson - 1985
2 | Distributed Simulation and the Time Warp Operating System – Jefferson, Beckman, et al. - 1987