Results 1 - 10 of 598
Practical Byzantine fault tolerance, 1999
"... This paper describes a new replication algorithm that is able to tolerate Byzantine faults. We believe that Byzantinefault-tolerant algorithms will be increasingly important in the future because malicious attacks and software errors are increasingly common and can cause faulty nodes to exhibit arbi ..."
Cited by 673 (15 self)
This paper describes a new replication algorithm that is able to tolerate Byzantine faults. We believe that Byzantine-fault-tolerant algorithms will be increasingly important in the future because malicious attacks and software errors are increasingly common and can cause faulty nodes to exhibit arbitrary behavior. Whereas previous algorithms assumed a synchronous system or were too slow to be used in practice, the algorithm described in this paper is practical: it works in asynchronous environments like the Internet and incorporates several important optimizations that improve the response time of previous algorithms by more than an order of magnitude. We implemented a Byzantine-fault-tolerant NFS service using our algorithm and measured its performance. The results show that our service is only 3% slower than a standard unreplicated NFS.
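The quorum arithmetic underlying such Byzantine-fault-tolerant replication is compact enough to show directly; the sketch below (Python, with hypothetical helper names, not code from the paper) captures the standard n = 3f + 1 replica bound and the client's f + 1 matching-reply rule:

```python
# Sketch of the quorum arithmetic behind a PBFT-style protocol. Helper names
# are hypothetical; the actual protocol's pre-prepare/prepare/commit message
# flow is not reproduced here.
from collections import Counter

def min_replicas(f: int) -> int:
    """Minimum replicas to tolerate f Byzantine faults: n = 3f + 1."""
    return 3 * f + 1

def quorum_size(f: int) -> int:
    """2f + 1 matching messages guarantee at least f + 1 came from correct
    replicas, so any two quorums intersect in a correct replica."""
    return 2 * f + 1

def accept_reply(replies: list, f: int):
    """Client side: accept a result once f + 1 identical replies arrive,
    since at most f of them can come from faulty replicas."""
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None

assert min_replicas(1) == 4 and quorum_size(1) == 3
assert accept_reply(["ok", "ok", "bad"], f=1) == "ok"
assert accept_reply(["ok", "bad"], f=1) is None
```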
The dangers of replication and a solution - Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, 1996
"... Update anywhere-anytime-anyway transactional replication has unstable behavior as the workload scales up: a ten-fold increase in nodes and traflc gives a thousand fold increase in deadlocks or reconciliations. Master copy replication (primary copyj schemes reduce this problem. A simple analytic mode ..."
Cited by 575 (3 self)
Update anywhere-anytime-anyway transactional replication has unstable behavior as the workload scales up: a ten-fold increase in nodes and traffic gives a thousand-fold increase in deadlocks or reconciliations. Master copy replication (primary copy) schemes reduce this problem. A simple analytic model demonstrates these results. A new two-tier replication algorithm is proposed that allows mobile (disconnected) applications to propose tentative update transactions that are later applied to a master copy. Commutative update transactions avoid the instability of other replication schemes.
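The scaling claim can be made concrete: a ten-fold increase in nodes and traffic producing a thousand-fold increase in deadlocks corresponds to cubic growth. A toy illustration (the base rate is an arbitrary constant, not a figure from the paper's analytic model):

```python
# Toy illustration of the abstract's scaling claim: if the deadlock or
# reconciliation rate grows with the cube of system scale, then 10x nodes
# and traffic means 1000x deadlocks.

def deadlock_rate(scale: float, base_rate: float = 1.0) -> float:
    return base_rate * scale ** 3

for s in (1, 10, 100):
    print(f"scale {s:>3}x -> relative deadlock rate {deadlock_rate(s):,.0f}x")
# scale   1x -> relative deadlock rate 1x
# scale  10x -> relative deadlock rate 1,000x
# scale 100x -> relative deadlock rate 1,000,000x
```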
Coda: A Highly Available File System for a Distributed Workstation Environment - IEEE Transactions on Computers, 1990
"... Coda is a file system for a large-scale distributed computing environment composed of Unix workstations. It provides resiliency to server and network failures through the use of two distinct but complementary mechanisms. One mechanism, server replication,stores copies of a file at multiple servers ..."
Cited by 530 (45 self)
Coda is a file system for a large-scale distributed computing environment composed of Unix workstations. It provides resiliency to server and network failures through the use of two distinct but complementary mechanisms. One mechanism, server replication, stores copies of a file at multiple servers. The other mechanism, disconnected operation, is a mode of execution in which a caching site temporarily assumes the role of a replication site. Disconnected operation is particularly useful for supporting portable workstations. The design of Coda optimizes for availability and performance, and strives to provide the highest degree of consistency attainable in the light of these objectives. Measurements from a prototype show that the performance cost of providing high availability in Coda is reasonable.
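A minimal sketch of what disconnected operation means in practice, with hypothetical class and method names rather than Coda's actual interfaces: the cache keeps serving files while the servers are unreachable and logs updates for later reintegration.

```python
# Sketch of Coda-style disconnected operation: a caching client keeps serving
# cached files while disconnected and logs updates for reintegration with the
# servers. Names and structure are hypothetical and greatly simplified.

class DisconnectedCache:
    def __init__(self):
        self.cache = {}        # path -> contents cached while connected
        self.replay_log = []   # updates made while disconnected

    def read(self, path):
        return self.cache[path]          # serve from cache; a miss would fail

    def write(self, path, data, connected: bool):
        self.cache[path] = data
        if not connected:
            self.replay_log.append((path, data))  # defer until reconnection

    def reintegrate(self, server):
        """On reconnection, replay logged updates against the server copy."""
        for path, data in self.replay_log:
            server[path] = data          # real Coda detects conflicts here
        self.replay_log.clear()

server = {"/doc": b"v1"}
c = DisconnectedCache()
c.cache["/doc"] = server["/doc"]
c.write("/doc", b"v2", connected=False)   # edit while disconnected
c.reintegrate(server)
assert server["/doc"] == b"v2"
```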
Nested Transactions: An Approach to Reliable Distributed Computing, 1981
"... Distributed computing systems are being built and used more and more frequently. This distributod computing revolution makes the reliability of distributed systems an important concern. It is fairly well-understood how to connect hardware so that most components can continue to work when others are ..."
Cited by 517 (4 self)
Distributed computing systems are being built and used more and more frequently. This distributed computing revolution makes the reliability of distributed systems an important concern. It is fairly well-understood how to connect hardware so that most components can continue to work when others are broken, and thus increase the reliability of a system as a whole. This report addresses the issue of providing software for reliable distributed systems. In particular, we examine how to program a system so that the software continues to work in the face of a variety of failures of parts of the system. The design presented
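The abstract stops short of describing the mechanism itself; as a rough, hypothetical illustration of what the titular nested transactions provide, a subtransaction can abort without aborting its parent, and a child's effects become permanent only when the top-level transaction commits:

```python
# Toy sketch of nested-transaction semantics (hypothetical API, not the
# report's design): child commits are provisional, child aborts are isolated.

class Txn:
    def __init__(self, parent=None):
        self.parent, self.effects = parent, []

    def do(self, effect):
        self.effects.append(effect)

    def commit(self):
        if self.parent:                          # child commit is provisional:
            self.parent.effects += self.effects  # effects move up one level
        else:
            for e in self.effects:               # top-level commit applies them
                e()

    def abort(self):
        self.effects.clear()                     # discard only this subtree's work

log = []
top = Txn()
top.do(lambda: log.append("step 1"))
child = Txn(parent=top)
child.do(lambda: log.append("risky step"))
child.abort()                # failure stays isolated to the subtransaction
top.commit()
assert log == ["step 1"]
```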
Small Byzantine Quorum Systems - Distributed Computing, 2001
"... In this paper we present two protocols for asynchronous Byzantine Quorum Systems (BQS) built on top of reliable channels---one for self-verifying data and the other for any data. Our protocols tolerate Byzantine failures with fewer servers than existing solutions by eliminating nonessential work in ..."
Cited by 468 (49 self)
In this paper we present two protocols for asynchronous Byzantine Quorum Systems (BQS) built on top of reliable channels---one for self-verifying data and the other for any data. Our protocols tolerate Byzantine failures with fewer servers than existing solutions by eliminating nonessential work in the write protocol and by using read and write quorums of different sizes. Since engineering a reliable network layer on an unreliable network is difficult, two other possibilities must be explored. The first is to strengthen the model by allowing synchronous networks that use time-outs to identify failed links or machines. We consider running synchronous and asynchronous Byzantine Quorum protocols over synchronous networks and conclude that, surprisingly, "self-timing" asynchronous Byzantine protocols may offer significant advantages for many synchronous networks when network time-outs are long. We show how to extend an existing Byzantine Quorum protocol to eliminate its dependency on reliable networking and to handle message loss and retransmission explicitly.
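The quorum-size reasoning behind protocols like these follows standard Byzantine quorum constraints; the sketch below shows the generic intersection conditions, not the paper's specific parameter choices:

```python
# Standard Byzantine-quorum constraints (not the paper's exact parameters):
# read quorum qr and write quorum qw over n servers, up to f Byzantine.

def valid_quorums(n: int, qr: int, qw: int, f: int, self_verifying: bool) -> bool:
    overlap = qr + qw - n                 # guaranteed read/write intersection
    # Self-verifying (signed) data: one correct server in the overlap suffices,
    # since forged values fail verification. Generic data: correct, up-to-date
    # servers in the overlap must outnumber the f potentially faulty ones.
    needed = f + 1 if self_verifying else 2 * f + 1
    live = max(qr, qw) <= n - f           # quorums still form with f servers down
    return overlap >= needed and live

# With f = 1, signed data works with only n = 4 servers; generic data needs
# more, illustrating why quorum sizes are a design parameter.
assert valid_quorums(n=4, qr=3, qw=3, f=1, self_verifying=True)
assert not valid_quorums(n=4, qr=3, qw=3, f=1, self_verifying=False)
```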
Practical Byzantine fault tolerance and proactive recovery - ACM Transactions on Computer Systems, 2002
"... Our growing reliance on online services accessible on the Internet demands highly available systems that provide correct service without interruptions. Software bugs, operator mistakes, and malicious attacks are a major cause of service interruptions and they can cause arbitrary behavior, that is, B ..."
Cited by 410 (7 self)
Our growing reliance on online services accessible on the Internet demands highly available systems that provide correct service without interruptions. Software bugs, operator mistakes, and malicious attacks are a major cause of service interruptions and they can cause arbitrary behavior, that is, Byzantine faults. This article describes a new replication algorithm, BFT, that can be used to build highly available systems that tolerate Byzantine faults. BFT can be used in practice to implement real services: it performs well, it is safe in asynchronous environments such as the Internet, it incorporates mechanisms to defend against Byzantine-faulty clients, and it recovers replicas proactively. The recovery mechanism allows the algorithm to tolerate any number of faults over the lifetime of the system provided fewer than 1/3 of the replicas become faulty within a small window of vulnerability. BFT has been implemented as a generic program library with a simple interface. We used the library to implement the first Byzantine-fault-tolerant NFS file system, BFS. The BFT library and BFS perform well because the library incorporates several important optimizations, the most important of which is the use of symmetric cryptography to authenticate messages. The performance results show that BFS performs 2% faster to 24% slower than production implementations of the NFS protocol that are not replicated. This supports our claim that the
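The symmetric-cryptography optimization the abstract credits for BFT's performance replaces one public-key signature with a vector of MACs, one per receiver. A simplified sketch using Python's hmac module (key management is elided; the real system refreshes keys during proactive recovery):

```python
# Sketch of MAC "authenticators": the sender MACs the message once per
# receiver with pairwise session keys, avoiding public-key signatures on the
# critical path. Keys below are toy values; real systems derive and rotate them.

import hmac, hashlib

def make_authenticator(msg: bytes, pairwise_keys: dict) -> dict:
    """One MAC per replica, computed with the key shared with that replica."""
    return {rid: hmac.new(k, msg, hashlib.sha256).digest()
            for rid, k in pairwise_keys.items()}

def verify_mac(msg: bytes, key: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(hmac.new(key, msg, hashlib.sha256).digest(), tag)

keys = {0: b"k0", 1: b"k1", 2: b"k2"}        # toy pairwise session keys
auth = make_authenticator(b"REQUEST", keys)
assert verify_mac(b"REQUEST", keys[1], auth[1])
assert not verify_mac(b"TAMPERED", keys[1], auth[1])
```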
Exploiting virtual synchrony in distributed systems, 1987
"... We describe applications of a virtually synchronous environment for distributed programming, which underlies a collection of distributed programming tools in the 1SIS2 system. A virtually synchronous environment allows processes to be structured into process groups, and makes events like broadcasts ..."
Cited by 360 (30 self)
We describe applications of a virtually synchronous environment for distributed programming, which underlies a collection of distributed programming tools in the ISIS2 system. A virtually synchronous environment allows processes to be structured into process groups, and makes events like broadcasts to the group as an entity, group membership changes, and even migration of an activity from one place to another appear to occur instantaneously -- in other words, synchronously. A major advantage to this approach is that many aspects of a distributed application can be treated independently without compromising correctness. Moreover, user code that is designed as if the system were synchronous can often be executed concurrently. We argue that this approach to building distributed and fault-tolerant software is more straightforward, more flexible, and more likely to yield correct solutions than alternative approaches.
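A toy model of the guarantee being described: broadcasts and view (membership) changes are totally ordered relative to each other, so every surviving member delivers each message in the same view. Names and structure here are illustrative, not the ISIS API:

```python
# Toy model of view-synchronous delivery: no broadcast straddles a membership
# change, which is what makes group events appear to occur instantaneously.

class Group:
    def __init__(self, members):
        self.view, self.members = 1, set(members)
        self.delivered = {m: [] for m in members}

    def broadcast(self, msg):
        # Every current member delivers the message in the current view.
        for m in self.members:
            self.delivered[m].append((self.view, msg))

    def change_view(self, joined=(), left=()):
        # The view change is ordered relative to broadcasts, so all members
        # agree on which messages preceded it.
        self.members = (self.members | set(joined)) - set(left)
        for m in joined:
            self.delivered.setdefault(m, [])
        self.view += 1

g = Group(["a", "b"])
g.broadcast("m1")
g.change_view(joined=["c"])
g.broadcast("m2")
assert g.delivered["a"] == [(1, "m1"), (2, "m2")]
assert g.delivered["c"] == [(2, "m2")]      # c sees only post-join traffic
```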
Feasibility of a Serverless Distributed File System Deployed on an Existing Set of Desktop PCs - Proc. ACM SIGMETRICS, 2000
"... We consider an architecture for a serverless distributed file system that does not assume mutual trust among the client computers. The system provides security, availability, and reliability by distributing multiple encrypted replicas of each file among the client machines. To assess the feasibility ..."
Cited by 325 (9 self)
We consider an architecture for a serverless distributed file system that does not assume mutual trust among the client computers. The system provides security, availability, and reliability by distributing multiple encrypted replicas of each file among the client machines. To assess the feasibility of deploying this system on an existing desktop infrastructure, we measure and analyze a large set of client machines in a commercial environment. In particular, we measure and report results on disk usage and content; file activity; and machine uptimes, lifetimes, and loads. We conclude that the measured desktop infrastructure would passably support our proposed system, providing availability on the order of one unfilled file request per user per thousand days.
Keywords: Serverless distributed file system architecture, personal computer
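The availability figure invites a back-of-the-envelope check: a request goes unfilled only when every replica of a file is simultaneously unavailable. A sketch with an illustrative uptime figure (not a measurement from the paper):

```python
# Back-of-the-envelope availability arithmetic for a replicated file: assuming
# independent machine availability, P(request unfilled) = (1 - uptime)^replicas.
# The 90% uptime below is illustrative, not a figure from the paper's traces.

def unavailability(machine_uptime: float, replicas: int) -> float:
    """Probability that all replicas of a file are down at once."""
    return (1.0 - machine_uptime) ** replicas

p = unavailability(0.90, 3)               # 3 encrypted replicas per file
print(f"P(request unfilled) = {p:.0e}")    # 1e-03: roughly one miss per
                                           # thousand requests
```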
Optimistic replication - ACM Computing Surveys, 2005
"... Data replication is a key technology in distributed data sharing systems, enabling higher availability and performance. This paper surveys optimistic replication algorithms that allow replica contents to diverge in the short term, in order to support concurrent work practices and to tolerate failure ..."
Cited by 290 (19 self)
Data replication is a key technology in distributed data sharing systems, enabling higher availability and performance. This paper surveys optimistic replication algorithms that allow replica contents to diverge in the short term, in order to support concurrent work practices and to tolerate failures in low-quality communication links. The importance of such techniques is increasing as collaboration through wide-area and mobile networks becomes popular. Optimistic replication techniques are different from traditional “pessimistic” ones. Instead of synchronous replica coordination, an optimistic algorithm propagates changes in the background, discovers conflicts after they happen and reaches agreement on the final contents incrementally. We explore the solution space for optimistic replication algorithms. This paper identifies key challenges facing optimistic replication systems — ordering operations, detecting and resolving conflicts, propagating changes efficiently, and bounding replica divergence — and provides a comprehensive survey of techniques developed for addressing these challenges.
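One of the classic conflict-detection techniques covered by such surveys is the version vector: each replica counts the updates it originates, and comparing vectors distinguishes stale copies from genuinely concurrent (conflicting) ones. A minimal sketch:

```python
# Minimal version-vector comparison, a standard optimistic-replication
# technique: each replica id maps to the count of updates it has originated.

def compare(a: dict, b: dict) -> str:
    keys = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return "concurrent"        # divergent updates: a conflict to resolve
    if a_ahead:
        return "a newer"
    if b_ahead:
        return "b newer"
    return "equal"

assert compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1}) == "a newer"
assert compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 3}) == "concurrent"
```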
Time Synchronization in Wireless Sensor Networks, 2003
"... OF THE DISSERTATION University of California, Los Angeles, 2003 Professor Deborah L. Estrin, Chair active research in large-scale networks of small, wireless, low-power sensors and actuators. Time synchronization is a critical piece of infrastructure in any dis- tributed system, but wirel ..."
Cited by 256 (11 self)
[Abstract of the dissertation, University of California, Los Angeles, 2003; Professor Deborah L. Estrin, Chair.] [...] active research in large-scale networks of small, wireless, low-power sensors and actuators. Time synchronization is a critical piece of infrastructure in any distributed system, but wireless sensor networks make particularly extensive use [...] physical world. However, while the clock accuracy and precision requirements are often stricter in sensor networks than in traditional distributed systems, energy and channel constraints limit the resources available to meet these goals.