Results 1 - 10 of 363
Unreliable Failure Detectors for Reliable Distributed Systems
- Journal of the ACM, 1996
Cited by 1094 (19 self)
We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties — completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].
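To make the idea of an "unreliable" detector concrete, the following minimal sketch shows a module that outputs a set of suspected processes and is allowed to revise it. It is not taken from the paper: the class name `TimeoutFailureDetector`, the heartbeat mechanism, and the timeout-doubling rule are assumptions chosen purely for illustration. The detector wrongly suspects a slow process, then withdraws the suspicion and becomes more patient once a late heartbeat arrives.

```python
class TimeoutFailureDetector:
    """Hypothetical heartbeat-based detector; it may suspect correct processes."""

    def __init__(self, processes, initial_timeout=1.0):
        self.timeout = {p: initial_timeout for p in processes}
        self.last_heard = {p: 0.0 for p in processes}
        self.suspected = set()

    def on_heartbeat(self, p, now):
        """Record a heartbeat; undo a mistaken suspicion and relax the timeout."""
        self.last_heard[p] = now
        if p in self.suspected:
            self.suspected.discard(p)
            self.timeout[p] *= 2  # assumed rule: be more patient after a mistake

    def check(self, now):
        """Suspect every process that has been silent longer than its timeout."""
        for p, heard in self.last_heard.items():
            if now - heard > self.timeout[p]:
                self.suspected.add(p)
        return set(self.suspected)


fd = TimeoutFailureDetector(["p1", "p2"])
fd.on_heartbeat("p1", now=0.5)
fd.on_heartbeat("p2", now=0.5)
suspects_early = fd.check(now=2.0)  # both have been silent for 1.5 > 1.0
fd.on_heartbeat("p1", now=2.1)      # p1 was only slow, not crashed
suspects_late = fd.check(now=3.5)   # p1 is within its doubled timeout; p2 is not
```

In this sketch a crashed process stops sending heartbeats and therefore stays suspected, which loosely corresponds to completeness, while the mistaken suspicion of `p1` shows the kind of error that accuracy properties constrain.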
The process group approach to reliable distributed computing
- Communications of the ACM, 1993
Cited by 572 (19 self)
The difficulty of developing reliable distributed software is an impediment to applying distributed computing technology in many settings. Experience with the Isis system suggests that a structured approach based on virtually synchronous process groups yields systems that are substantially easier to develop, exploit sophisticated forms of cooperative computation, and achieve high reliability. This paper reviews six years of research on Isis, describing the model, its implementation challenges, and the types of applications to which Isis has been applied.
1 Introduction
One might expect the reliability of a distributed system to follow directly from the reliability of its constituents, but this is not always the case. The mechanisms used to structure a distributed system and to implement cooperation between components play a vital role in determining how reliable the system will be. Many contemporary distributed operating systems have placed emphasis on communication performance, overlooking the need for tools to integrate components into a reliable whole. The communication primitives supported give generally reliable behavior, but exhibit problematic semantics when transient failures or system configuration changes occur. The resulting building blocks are, therefore, unsuitable for facilitating the construction of systems where reliability is important. This paper reviews six years of research on Isis, a system that provides tools to support the construction of reliable distributed software. The thesis underlying Isis is that development of reliable distributed software can be simplified using process groups and group programming tools. This paper motivates the approach taken, surveys the system, and discusses our experience with real applications.
Horus: A flexible group communication system
- Comm. of the ACM, 1996
Cited by 431 (28 self)
innovative system offering application developers an extensively flexible group communication model is described. The emergence of process-group environments for distributed computing represents a promising step toward robustness for mission-critical distributed applications. Process groups have a “natural” correspondence with data or services that have been replicated for availability or as part of a coherent cache. They can be used to support highly available security domains, and group mechanisms fit well with an emerging generation of intelligent network and collaborative work applications.
Group Communication Specifications: A Comprehensive Study
- ACM Computing Surveys, 1999
Cited by 370 (15 self)
View-oriented group communication is an important and widely used building block for many distributed applications. Much current research has been dedicated to specifying the semantics and services of view-oriented Group Communication Systems (GCSs). However, the guarantees of different GCSs are formulated using varying terminologies and modeling techniques, and the specifications vary in their rigor. This makes it difficult to analyze and compare the different systems. This paper provides a comprehensive set of clear and rigorous specifications, which may be combined to represent the guarantees of most existing GCSs. In the light of these specifications, over thirty published GCS specifications are surveyed. Thus, the specifications serve as a unifying framework for the classification, analysis and comparison of group communication systems. The survey also discusses over a dozen different applications of group communication systems, shedding light on the usefulness of the p...
The Transis Approach to High Availability Cluster Communication
- Communications of the ACM, 1996
Cited by 252 (14 self)
Introduction
In the local elections system of the municipality of "Wiredville", several computers were used to establish an electronic town hall. The computers were linked by a network. When an issue was put to a vote, voters could manually feed their votes into any of the computers, which replicated the updates to all of the other computers. Whenever the current tally was desired, any computer could be used to supply an up-to-the-moment count. On the night of an important election, a room with one of the computers became crowded with lobbyists and politicians. Unexpectedly, someone accidentally stepped on the network wire, cutting communication between two parts of the network. The vote counting stopped until the network was repaired, and the entire tally had to be restarted from scratch. This would not have happened if the vote-counting system had been built with partitions in mind. After the unexpected severance, vote counting could have continued at all t
A Gossip-Style Failure Detection Service
- 1998
Cited by 246 (23 self)
Failure Detection is valuable for system management, replication, load balancing, and other distributed services. To date, Failure Detection Services scale badly in the number of members that are being monitored. This paper describes a new protocol based on gossiping that does scale well and provides timely detection. We analyze the protocol, and then extend it to discover and leverage the underlying network topology for much improved resource utilization. We then combine it with another protocol, based on broadcast, that is used to handle partition failures.
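The abstract does not spell out the mechanism, so the following is only a minimal sketch of one common way to build gossip-based failure detection: each member keeps a table of heartbeat counters, periodically increments its own counter, sends the table to a peer, and suspects any member whose counter has not increased for a while. The class name `GossipMember`, the parameter `t_fail`, and the fixed send schedule are assumptions for illustration. A real gossip protocol would pick the target at random; the schedule is fixed here only to keep the example reproducible.

```python
class GossipMember:
    """Hypothetical member of a gossip-style failure detection group."""

    def __init__(self, name, t_fail=5):
        self.name = name
        self.t_fail = t_fail  # assumed: silence (in ticks) before suspecting
        # member -> (highest heartbeat counter seen, local time it last increased)
        self.table = {name: (0, 0)}

    def tick(self, now):
        """Increment this member's own heartbeat counter."""
        count, _ = self.table[self.name]
        self.table[self.name] = (count + 1, now)

    def digest(self):
        """The message that is gossiped: member -> heartbeat counter."""
        return {m: count for m, (count, _) in self.table.items()}

    def merge(self, remote, now):
        """Adopt any counter that is higher than the locally known one."""
        for m, count in remote.items():
            if m not in self.table or count > self.table[m][0]:
                self.table[m] = (count, now)

    def suspected(self, now):
        """Members whose counter has not increased for more than t_fail ticks."""
        return {m for m, (_, seen) in self.table.items()
                if m != self.name and now - seen > self.t_fail}


order = ["a", "b", "c"]
members = {n: GossipMember(n) for n in order}
alive = set(order)
for now in range(1, 13):
    if now == 4:
        alive.discard("c")  # c crashes silently before round 4
    live = [n for n in order if n in alive]
    for n in live:
        members[n].tick(now)
    for i, n in enumerate(live):
        target = live[(i + 1) % len(live)]  # fixed schedule instead of random choice
        members[target].merge(members[n].digest(), now)

suspects_at_a = members["a"].suspected(12)
suspects_at_b = members["b"].suspected(12)
```

Because counters spread transitively, each member sends one message per round regardless of group size. The topology discovery and the broadcast-based partition handling mentioned in the abstract are not modeled here.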
Building Secure and Reliable Network Applications
- 1996
Cited by 230 (16 self)
Abstractly, the remote procedure call problem, which an RPC protocol undertakes to solve, consists of emulating LPC using message passing. LPC has a number of "properties" -- a single procedure invocation results in exactly one execution of the procedure body, the result returned is reliably delivered to the invoker, and exceptions are raised if (and only if) an error occurs. Given a completely reliable communication environment, which never loses, duplicates, or reorders messages, and given client and server processes that never fail, RPC would be trivial to solve. The sender would merely package the invocation into one or more messages, and transmit these to the server. The server would unpack the data into local variables, perform the desired operation, and send back the result (or an indication of any exception that occurred) in a reply message. The challenge, then, is created by failures. Were it not for the possibility of process and machine crashes, an RPC protocol capable of overcomi...
Total order broadcast and multicast algorithms: Taxonomy and survey
- ACM Computing Surveys, 2004
A High Performance Totally Ordered Multicast Protocol
- 1994
Cited by 195 (4 self)
This paper presents the Reliable Multicast Protocol (RMP). RMP provides a totally ordered, reliable, atomic multicast service on top of an unreliable multicast datagram service such as IP Multicasting. RMP is fully and symmetrically distributed so that no site bears an undue portion of the communication load. RMP provides a wide range of guarantees, from unreliable delivery to totally ordered delivery, to K-resilient, majority resilient, and totally resilient atomic delivery. These QoS guarantees are selectable on a per packet basis. RMP provides many communication options, including virtual synchrony, a publisher/subscriber model of message delivery, a client/server model of delivery, an implicit naming service, mutually exclusive handlers for messages, and mutually exclusive locks.
Extended Virtual Synchrony
- in Proceedings of the IEEE 14th International Conference on Distributed Computing Systems, 1994
Cited by 188 (43 self)
We formulate a model of extended virtual synchrony that defines a group communication transport service for multicast and broadcast communication in a distributed system. The model extends the virtual synchrony model of the Isis system to support continued operation in all components of a partitioned network. The significance of extended virtual synchrony is that, during network partitioning and remerging and during process failure and recovery, it maintains a consistent relationship between the delivery of messages and the delivery of configuration changes across all processes in the system and provides well-defined self-delivery and failure atomicity properties. We describe an algorithm that implements extended virtual synchrony and construct a filter that reduces extended virtual synchrony to virtual synchrony.
1 Introduction
In many applications in distributed systems messages must be disseminated to multiple destinations. To achieve better performance, protocols have been deve...