Message logging is a popular technique for building low-overhead protocols that tolerate process crash failures. Past research in message logging has focused on studying the relative overhead imposed by pessimistic, optimistic, and causal protocols during failure-free executions. In this paper, we give the first experimental evaluation of the performance of these protocols during recovery. We discover that, if a single failure is to be tolerated, pessimistic and causal protocols perform best, because they avoid rollbacks of correct processes. For multiple failures, however, the dominant factor in determining performance becomes where the recovery information is logged (i.e., at the sender, at the receiver, or replicated at a subset of the processes in the system) rather than when this information is logged (i.e., whether logging is synchronous or asynchronous). From our results, we distil a few lessons that can guide the design of message-logging protocols that combine low overhead during failure-free executions with fast recovery.