Unreliable Failure Detectors for Reliable Distributed Systems
 Journal of the ACM
, 1996
Cited by 1089 (19 self)
We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties — completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].
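To make the completeness and accuracy properties in the abstract above concrete, here is a minimal sketch of a timeout-based detector. It is an illustration under stated assumptions, not the paper's construction: the class name `HeartbeatDetector`, its methods, the heartbeat mechanism, and the timeout-doubling rule are all hypothetical choices. A peer that stops sending heartbeats ends up suspected permanently (completeness), while a peer suspected by mistake is un-suspected and given a longer timeout, so mistakes become rarer over time (a form of eventual accuracy, assuming message delays are eventually bounded).

```python
class HeartbeatDetector:
    """Hypothetical timeout-based failure detector (illustrative sketch only)."""

    def __init__(self, peers, initial_timeout):
        # Per-peer timeout, so one slow peer does not affect the others.
        self.timeout = {p: initial_timeout for p in peers}
        self.last_seen = {p: 0.0 for p in peers}
        self.suspected = set()

    def on_heartbeat(self, peer, now):
        """Record a heartbeat; a suspected peer that speaks again was a mistake."""
        self.last_seen[peer] = now
        if peer in self.suspected:
            self.suspected.discard(peer)
            self.timeout[peer] *= 2  # back off so the same mistake is less likely

    def check(self, now):
        """Suspect every peer whose last heartbeat is older than its timeout."""
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)


detector = HeartbeatDetector(["p", "q"], initial_timeout=1.0)
detector.on_heartbeat("p", now=0.5)
detector.on_heartbeat("q", now=0.5)
first = detector.check(now=2.0)      # both are late, so both are suspected
detector.on_heartbeat("p", now=2.1)  # p was only slow: un-suspect it, double its timeout
second = detector.check(now=3.5)     # q stays silent and remains suspected
```

The detector is allowed to be wrong about `p` for a while, which is the sense in which such detectors are "unreliable".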
Consensus in the presence of partial synchrony
 JOURNAL OF THE ACM
, 1988
Cited by 521 (19 self)
The concept of partial synchrony in a distributed system is introduced. Partial synchrony lies between the cases of a synchronous system and an asynchronous system. In a synchronous system, there is a known fixed upper bound Δ on the time required for a message to be sent from one processor to another and a known fixed upper bound Φ on the relative speeds of different processors. In an asynchronous system no fixed upper bounds Δ and Φ exist. In one version of partial synchrony, fixed bounds Δ and Φ exist, but they are not known a priori. The problem is to design protocols that work correctly in the partially synchronous system regardless of the actual values of the bounds Δ and Φ. In another version of partial synchrony, the bounds are known, but are only guaranteed to hold starting at some unknown time T, and protocols must be designed to work correctly regardless of when time T occurs. Fault-tolerant consensus protocols are given for various cases of partial synchrony and various fault models. Lower bounds that show in most cases that our protocols are optimal with respect to the number of faults tolerated are also given. Our consensus protocols for partially synchronous processors use new protocols for fault-tolerant "distributed clocks" that allow partially synchronous processors to reach some approximately common notion of time.
Gossip-based aggregation in large dynamic networks
 ACM TRANS. COMPUT. SYST
, 2005
Cited by 262 (43 self)
As computer networks increase in size, become more heterogeneous and span greater geographic distances, applications must be designed to cope with the very large scale, poor reliability, and often, with the extreme dynamism of the underlying network. Aggregation is a key functional building block for such applications: it refers to a set of functions that provide components of a distributed system access to global information including network size, average load, average uptime, location and description of hotspots, and so on. Local access to global information is often very useful, if not indispensable for building applications that are robust and adaptive. For example, in an industrial control application, some aggregate value reaching a threshold may trigger the execution of certain actions; a distributed storage system will want to know the total available free space; load-balancing protocols may benefit from knowing the target average load so as to minimize the load they transfer. We propose a gossip-based protocol for computing aggregate values over network components in a fully decentralized fashion. The class of aggregate functions we can compute is very broad and includes many useful special cases such as counting, averages, sums, products, and extremal values. The protocol is suitable for extremely large and highly dynamic systems due to its proactive structure: all nodes receive the aggregate value continuously, thus being able to track
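The averaging case mentioned in the abstract above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's protocol: it assumes synchronous rounds, uniformly random peer selection, and a pairwise exchange in which both nodes adopt the mean of their two estimates. The function name `gossip_average` and its parameters are hypothetical.

```python
import random


def gossip_average(values, rounds, seed=0):
    """Simulate pairwise averaging: in each round every node contacts one
    random peer and both replace their estimates with the pair's mean."""
    rng = random.Random(seed)  # fixed seed keeps the simulation repeatable
    estimates = list(values)
    n = len(estimates)
    for _ in range(rounds):
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1  # shift so that node i never selects itself
            mean = (estimates[i] + estimates[j]) / 2.0
            estimates[i] = estimates[j] = mean
    return estimates


initial = [10.0, 0.0, 4.0, 6.0, 0.0, 10.0, 2.0, 8.0]
estimates = gossip_average(initial, rounds=30)
```

Each exchange preserves the sum of the two estimates, so the global average is unchanged while the spread between nodes shrinks. Every node's local estimate therefore approaches the true average (5.0 for the values above) without any central coordinator.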
The Consensus Problem in Unreliable Distributed Systems (A Brief Survey)
, 2000
Cited by 127 (3 self)
Agreement problems involve a system of processes, some of which may be faulty. A fundamental problem of fault-tolerant distributed computing is for the reliable processes to reach a consensus. We survey the considerable literature on this problem that has developed over the past few years and give an informal overview of the major theoretical results in the area.
A new fault-tolerant algorithm for clock synchronization
 INFORMATION AND COMPUTATION
, 1988
The asynchronous computability theorem for t-resilient tasks
 In Proceedings of the 1993 ACM Symposium on Theory of Computing
, 1993
Cited by 103 (15 self)
We give necessary and sufficient combinatorial conditions characterizing the computational tasks that can be solved by N asynchronous processes, up to t of which can fail by halting. The range of possible input and output values for an asynchronous task can be associated with a high-dimensional geometric structure called a simplicial complex. Our main theorem characterizes computability in terms of the topological properties of this complex. Most notably, a given task is computable only if it can be associated with a complex that is simply connected with trivial homology groups. In other words, the complex has “no holes!” Applications of this characterization include the first impossibility results for several longstanding open problems in distributed computing, such as the “renaming” problem of Attiya et al., the “k-set agreement” problem of Chaudhuri, and a generalization of the approximate agreement problem.
Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments
 ACM Computing Surveys
, 1999
Cited by 97 (9 self)
Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims at structuring the area and thus guiding readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this can help to reveal inherently fundamental structures that contribute to understanding and unifying methods and terminology. By doing this, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.
Easy Impossibility Proofs for Distributed Consensus Problems
 DISTRIBUTED COMPUTING
, 1986
Cited by 96 (8 self)
Easy proofs are given, of the impossibility of solving several consensus problems (Byzantine agreement, weak agreement, Byzantine firing squad, approximate agreement and clock synchronization) in certain communication graphs. It is shown that, in the presence of m faults, no solution to these problems exists for communication graphs with fewer than 3m+1 nodes or less than 2m+1 connectivity. While some of these results had previously been proved, the new proofs are much simpler, provide considerably more insight, apply to more general models of computation, and (particularly in the case of clock synchronization) significantly strengthen the results.
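The two thresholds in the abstract above are easy to check numerically. The sketch below is only an arithmetic illustration: the function name `max_faults_not_ruled_out` is hypothetical, and it reports the largest m that the stated impossibility bounds do not exclude (at least 3m+1 nodes and connectivity at least 2m+1). The abstract itself states only the impossibility direction, so the result is an upper limit on m, not a guarantee that a protocol exists.

```python
def max_faults_not_ruled_out(nodes, connectivity):
    """Largest m satisfying nodes >= 3m + 1 and connectivity >= 2m + 1."""
    if nodes < 1 or connectivity < 0:
        raise ValueError("nodes must be positive and connectivity non-negative")
    by_nodes = (nodes - 1) // 3                # from nodes >= 3m + 1
    by_connectivity = (connectivity - 1) // 2  # from connectivity >= 2m + 1
    return max(0, min(by_nodes, by_connectivity))


# Four fully connected nodes (connectivity 3) pass both bounds for one fault.
four_nodes = max_faults_not_ruled_out(nodes=4, connectivity=3)
# Three nodes fall below 3m + 1 for m = 1, so no fault can be tolerated.
three_nodes = max_faults_not_ruled_out(nodes=3, connectivity=2)
```

Note that adding nodes alone does not help: with ten nodes but connectivity 3, the connectivity bound still limits m to 1.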
Fault-tolerance in collaborative sensor networks for target detection
 IEEE Transactions on Computers
, 2004
Cited by 86 (4 self)
Collaboration in sensor networks must be fault-tolerant due to the harsh environmental conditions in which such networks can be deployed. This paper focuses on finding algorithms for collaborative target detection that are efficient in terms of communication cost, precision, accuracy, and number of faulty sensors tolerable in the network. Two algorithms, namely, value fusion and decision fusion, are identified first. When comparing their performance and communication overhead, decision fusion is found to become superior to value fusion as the ratio of faulty sensors to fault-free sensors increases. As robust data fusion requires agreement among nodes in the network, an analysis of fully distributed and hierarchical agreement is also presented. The impact of hierarchical agreement on communication cost and system failure probability is evaluated and a method for determining the number of tolerable faults is identified. Index Terms: Collaborative target detection, decision fusion, fault tolerance, sensor networks, value fusion.
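The difference between the two fusion styles named in the abstract above can be sketched as follows. This is a deliberately simplified illustration, not the paper's algorithms: it assumes value fusion averages the raw readings before thresholding, and decision fusion lets each sensor threshold its own reading before a strict majority vote. The function names, the threshold, and the sample readings are all hypothetical.

```python
def value_fusion(readings, threshold):
    """Fuse raw values first: average all readings, then apply the threshold."""
    return sum(readings) / len(readings) >= threshold


def decision_fusion(readings, threshold):
    """Decide locally first: each sensor votes, then a strict majority wins."""
    votes = sum(1 for reading in readings if reading >= threshold)
    return votes * 2 > len(readings)


# No target is present, but one faulty sensor reports a wildly large value.
faulty = [0.1, 0.2, 0.1, 0.15, 100.0]
value_alarm = value_fusion(faulty, threshold=1.0)        # the outlier dominates the mean
decision_alarm = decision_fusion(faulty, threshold=1.0)  # the outlier is a single vote
```

In this toy case a single arbitrary reading is enough to swing the averaged value, whereas it contributes only one vote to the majority decision. That is one intuition for why local decisions can hold up better as the share of faulty sensors grows.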
Understanding Protocols for Byzantine Clock Synchronization
, 1987
Cited by 80 (0 self)
All published fault-tolerant clock synchronization protocols are shown to result from refining a single paradigm. This allows the different clock synchronization protocols to be compared and permits presentation of a single correctness analysis that holds for all. The paradigm is based on a reliable time source that periodically causes events; detection of such an event causes a processor to reset its clock. In a distributed system, the reliable time source can be approximated by combining the values of processor clocks using a generalization of a "fault-tolerant average", called a convergence function. The performance of a clock synchronization protocol based on our paradigm can be quantified in terms of the two parameters that characterize the behavior of the convergence function used: accuracy and precision.
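One simple way to picture a "fault-tolerant average" of clock readings is a trimmed mean: discard the m largest and m smallest readings, then average the rest. The sketch below assumes that form. Whether it matches the exact convergence functions analyzed in the paper is an assumption, and the function name and parameters are hypothetical.

```python
def fault_tolerant_average(readings, m):
    """Trimmed mean: drop the m lowest and m highest readings, average the rest."""
    if m < 0 or len(readings) <= 2 * m:
        raise ValueError("need more than 2*m readings to discard m from each end")
    kept = sorted(readings)[m:len(readings) - m]
    return sum(kept) / len(kept)


# Three clocks agree closely; a fourth, faulty clock reports a wild value.
clock_readings = [10.0, 10.2, 9.9, 1000.0]
resync_value = fault_tolerant_average(clock_readings, m=1)
```

With m = 1 the faulty reading of 1000.0 is discarded along with the lowest reading, so the result stays within the range of the correct clocks. A plain mean of the same readings would be pulled far outside that range.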