This thesis describes a technique to build replicated services that combines Byzantine fault tolerance with work on abstract data types. Tolerating Byzantine faults is important because software errors are a major cause of outages and they can make faulty replicas behave arbitrarily. Abstraction hides implementation details to enable the reuse of existing service implementations and to improve the ability to mask software errors. We improve resilience to software errors by enabling the recovery of faulty replicas using state stored in replicas with distinct implementations; using an opportunistic N-version programming technique that runs distinct, off-the-shelf implementations at each replica to reduce the probability of common mode failures; and periodically repairing each replica using an abstract view of the state stored by the correct replicas in the group, which improves tolerance to faults due to software aging. We have built two replicated services that demonstrate the use of this technique. The first is