Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing

by David B. Johnson, Willy Zwaenepoel
http://www.cs.cmu.edu/~dbj/ftp/jalg.ps.gz

Abstract:

Message logging and checkpointing can provide fault tolerance in distributed systems in which all process communication is through messages. This paper presents a general model for reasoning about recovery in these systems. Using this model, we prove that the set of recoverable system states that have occurred during any single execution of the system forms a lattice, and that therefore, there is always a unique maximum recoverable system state, which never decreases. Based on this model, we present an algorithm for determining this maximum recoverable state, and prove its correctness. Our algorithm utilizes all logged messages and checkpoints, and thus always finds the maximum recoverable state possible. Previous recovery methods using optimistic message logging and checkpointing have not considered the existing checkpoints, and thus may not find this maximum state. Furthermore, by utilizing the checkpoints, some messages received by a process before it was checkpointed may not need to be logged. Using our algorithm also adds less communication overhead to the system than do previous methods. Our model and algorithm can be used with any message logging protocol, whether pessimistic or optimistic, but their full generality is only required with optimistic logging protocols.
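
The lattice claim in the abstract can be illustrated with a small brute-force sketch. This is not the algorithm from the paper, and the representation is an assumption made for illustration: each process's execution is split into numbered state intervals, `dep[i][k][j]` is the highest state interval of process `j` that interval `k` of process `i` depends on, and `stable[i]` is the set of intervals of process `i` that could be rebuilt from a checkpoint plus logged messages. The names `DEP`, `STABLE`, and all function names are hypothetical.

```python
from itertools import product

def is_consistent(state, dep):
    """A system state is consistent if no process depends on a state
    interval of another process later than the one included in the state."""
    n = len(state)
    return all(dep[i][state[i]][j] <= state[j]
               for i in range(n) for j in range(n))

def recoverable_states(dep, stable):
    """Enumerate every combination of stable intervals that is consistent."""
    candidates = product(*(sorted(s) for s in stable))
    return [s for s in candidates if is_consistent(s, dep)]

def join(s, t):
    """Least upper bound of two system states: component-wise maximum."""
    return tuple(max(a, b) for a, b in zip(s, t))

def maximum_recoverable_state(dep, stable):
    """Join all recoverable states; the result is itself recoverable.
    Assumes the initial state (all zeros) is stable, so the list is non-empty."""
    states = recoverable_states(dep, stable)
    top = states[0]
    for s in states[1:]:
        top = join(top, s)
    assert top in states  # closure under join gives a unique maximum
    return top

# Hypothetical two-process history (assumed data, not from the paper).
DEP = [
    [[0, 0], [1, 1], [2, 1]],  # process 0: interval 1 received from P1's interval 1
    [[0, 0], [0, 1], [2, 2]],  # process 1: interval 2 received from P0's interval 2
]
STABLE = [{0, 1}, {0, 1, 2}]   # P0's interval 2 is not yet logged

print(recoverable_states(DEP, STABLE))
print(maximum_recoverable_state(DEP, STABLE))
```

In this example, interval 2 of process 1 is stable on its own but depends on interval 2 of process 0, which is not, so it cannot appear in any recoverable system state. Exhaustive enumeration is exponential in the number of processes and is only meant to make the definition concrete.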

Citations

1746 Time, clocks, and the ordering of events in a distributed system – Lamport - 1978
1319 Concurrency Control and Recovery in Database Systems – Bernstein, Hadzilacos, et al. - 1987
796 Distributed snapshots: Determining global states of distributed systems – Chandy, Lamport - 1985
693 Virtual time – Jefferson - 1985
425 System Structure for Software Fault Tolerance – Randell - 1975
247 Fail-Stop Processors: An Approach to Designing Fault-Tolerant Computing Systems – Schlichting, Schneider - 1982
241 Checkpointing and Rollback-Recovery for Distributed Systems – Koo, Toueg - 1987
118 Sender-Based Message Logging – Johnson, Zwaenepoel - 1987
117 Fault Tolerance Under UNIX – Borg, Blau, et al. - 1989
112 A Message System Supporting Fault Tolerance – Borg, Baumbach, et al. - 1983
93 Efficient distributed recovery using message logging – Sistla, Welch - 1989
92 PUBLISHING: A Reliable Broadcast Communication Mechanism – Powell, Presotto - 1983
76 Crash recovery in a distributed data storage system – Lampson, Sturgis - 1976
61 State restoration in systems of communicating processes – Russell - 1980
55 Transaction management in the R* distributed database management system – Mohan, Lindsay, et al. - 1986
54 Recovery management in QuickSilver – Haskin, Malachi, et al. - 1988
42 Distributed simulation and the time warp operating system – Jefferson, Beckman - 1987
30 The state machine approach: A tutorial – Schneider - 1986
30 Optimistic recovery in distributed systems – Strom, Yemini - 1985
26 Reliable Object Storage to Support Atomic Actions – Oki, Liskov, et al. - 1985
11 Distributed Transaction Processing and the Camelot System – Spector - 1987
5 Virtual Time – Jefferson - 1985
2 Distributed Simulation and the Time Warp Operating System – Jefferson, Beckman, et al. - 1987