Unreliable Failure Detectors for Reliable Distributed Systems
- Journal of the ACM, 1996
Cited by 1094 (19 self)
We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties — completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].
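As a rough illustration of the two properties named in the abstract, the sketch below shows a timeout-based detector that can wrongly suspect a slow process and later retract the suspicion. It is not the construction from the paper: the class name `HeartbeatDetector`, the heartbeat scheme, and the timeout-doubling rule are all assumptions chosen for illustration. Completeness corresponds to a crashed process eventually staying suspected; accuracy limits how often a live process is suspected.

```python
class HeartbeatDetector:
    """Hypothetical timeout-based failure detector (illustrative only)."""

    def __init__(self, processes, initial_timeout=1.0):
        # Per-process timeout, last heartbeat time, and current suspect set.
        self.timeout = {p: initial_timeout for p in processes}
        self.last_seen = {p: 0.0 for p in processes}
        self.suspected = set()

    def heartbeat(self, p, now):
        """Record a heartbeat from process p received at time `now`."""
        self.last_seen[p] = now
        if p in self.suspected:
            # A suspected process turned out to be alive: the detector made
            # a mistake. Retract the suspicion and wait longer next time.
            self.suspected.discard(p)
            self.timeout[p] *= 2

    def check(self, now):
        """Suspect every process whose heartbeat is overdue; return suspects."""
        for p, seen in self.last_seen.items():
            if now - seen > self.timeout[p]:
                self.suspected.add(p)
        return set(self.suspected)


fd = HeartbeatDetector(["p1", "p2"])
fd.heartbeat("p1", now=0.5)
fd.heartbeat("p2", now=0.5)
first = fd.check(now=2.0)    # both heartbeats are overdue, so both are suspected
fd.heartbeat("p1", now=2.1)  # p1 was only slow: suspicion retracted, timeout doubled
second = fd.check(now=3.0)   # p2 stays silent and remains suspected
```

Passing `now` explicitly instead of reading a clock keeps the sketch deterministic; a real deployment would use a monotonic clock and actual network messages.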
A Survey of Rollback-Recovery Protocols in Message-Passing Systems
- 1996
Cited by 716 (22 self)
... this paper, we use the terms event logging and message logging interchangeably ...
The x-Kernel: An Architecture for Implementing Network Protocols
- IEEE Transactions on Software Engineering, 1991
Cited by 662 (21 self)
This paper describes a new operating system kernel, called the x-kernel, that provides an explicit architecture for constructing and composing network protocols. Our experience implementing and evaluating several protocols in the x-kernel shows that this architecture is both general enough to accommodate a wide range of protocols, yet efficient enough to perform competitively with less structured operating systems.
1 Introduction
Network software is at the heart of any distributed system. It manages the communication hardware that connects the processors in the system and it defines abstractions through which processes running on those processors exchange messages. Network software is extremely complex: it must hide the details of the underlying hardware, recover from transmission failures, ensure that messages are delivered to the application processes in the appropriate order, and manage the encoding and decoding of data. To help manage this complexity, network software is divi...
Lightweight causal and atomic group multicast
- ACM Transactions on Computer Systems, 1991
Horus: A flexible group communication system
- Communications of the ACM, 1996
Cited by 431 (28 self)
... innovative system offering application developers an extensively flexible group communication model is described. The emergence of process-group environments for distributed computing represents a promising step toward robustness for mission-critical distributed applications. Process groups have a “natural” correspondence with data or services that have been replicated for availability or as part of a coherent cache. They can be used to support highly available security domains, and group mechanisms fit well with an emerging generation of intelligent network and collaborative work applications.
Transis: A Communication Sub-System for High Availability
- 1992
Cited by 363 (47 self)
This paper describes Transis, a communication sub-system for high availability. Transis is a transport layer package that supports a variety of reliable multicast message passing services between processors. It provides highly tuned multicast and control services for scalable systems with arbitrary topology. The communication domain comprises a set of processors that can initiate multicast messages to a chosen subset. Transis delivers them reliably and maintains the membership of connected processors automatically, in the presence of arbitrary communication delays, of message losses and of processor failures and joins. The contribution of this paper is in providing an aggregate definition of communication and control services over broadcast domains. The main benefit is the efficient implementation of these services using the broadcast capability. In addition, the membership algorithm has a novel approach in handling partitions and remerging; in allowing the regular flow of messages...
The Transis Approach to High Availability Cluster Communication
- Communications of the ACM, 1996
Cited by 252 (14 self)
Introduction
In the local elections system of the municipality of "Wiredville", several computers were used to establish an electronic town hall. The computers were linked by a network. When an issue was put to a vote, voters could manually feed their votes into any of the computers, which replicated the updates to all of the other computers. Whenever the current tally was desired, any computer could be used to supply an up-to-the-moment count. On the night of an important election, a room with one of the computers became crowded with lobbyists and politicians. Unexpectedly, someone accidentally stepped on the network wire, cutting communication between two parts of the network. The vote counting stopped until the network was repaired, and the entire tally had to be restarted from scratch. This would not have happened if the vote-counting system had been built with partitions in mind. After the unexpected severance, vote counting could have continued at all t
A Reliable Dissemination Protocol for Interactive Collaborative Applications
- 1995
Cited by 235 (10 self)
The widespread availability of networked multimedia workstations and PCs has caused a significant interest in the use of collaborative multimedia applications. Examples of such applications include distributed shared whiteboards, group editors, and distributed games or simulations. Such applications often involve many participants and typically require a specific form of multicast communication called dissemination in which a single sender must reliably transmit data to multiple receivers in a timely fashion. This paper describes the design and implementation of a reliable multicast transport protocol called TMTP (Tree-based Multicast Transport Protocol). TMTP exploits the efficient best-effort delivery mechanism of IP multicast for packet routing and delivery. However, for the purpose of scalable flow and error control, it dynamically organizes the participants into a hierarchical control tree. The control tree hierarchy employs restricted nacks with suppression and an expanding ring search to distribute the functions of state management and error recovery among many members, thereby allowing scalability to large numbers of receivers. An Mbone-based implementation of TMTP spanning the United States and Europe has been tested and experimental results are presented.
Building Secure and Reliable Network Applications
- 1996
Cited by 230 (16 self)
Abstractly, the remote procedure call problem, which an RPC protocol undertakes to solve, consists of emulating LPC using message passing. LPC has a number of "properties" -- a single procedure invocation results in exactly one execution of the procedure body, the result returned is reliably delivered to the invoker, and exceptions are raised if (and only if) an error occurs. Given a completely reliable communication environment, which never loses, duplicates, or reorders messages, and given client and server processes that never fail, RPC would be trivial to solve. The sender would merely package the invocation into one or more messages, and transmit these to the server. The server would unpack the data into local variables, perform the desired operation, and send back the result (or an indication of any exception that occurred) in a reply message. The challenge, then, is created by failures. Were it not for the possibility of process and machine crashes, an RPC protocol capable of overcomi...
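The failure-free case described in this excerpt can be sketched in a few lines. This is a minimal illustration, not code from the book: the procedure table `PROCEDURES`, the JSON message layout, and the names `pack_call`, `serve`, `rpc`, and `RemoteError` are all hypothetical, and the "network" is a direct function call that never loses, duplicates, or reorders messages.

```python
import json

# Hypothetical procedures exported by the server.
PROCEDURES = {
    "add": lambda a, b: a + b,
    "div": lambda a, b: a / b,
}


class RemoteError(Exception):
    """Raised on the client when the server reports an exception."""


def pack_call(name, *args):
    """Client side: package the invocation into a single request message."""
    return json.dumps({"proc": name, "args": list(args)})


def serve(request):
    """Server side: unpack the request, run the procedure, build the reply."""
    call = json.loads(request)
    try:
        result = PROCEDURES[call["proc"]](*call["args"])
        return json.dumps({"ok": True, "result": result})
    except Exception as exc:
        # Send back an indication of the exception instead of a result.
        return json.dumps({"ok": False, "error": type(exc).__name__})


def rpc(name, *args):
    """Emulate a local call: one request, one execution, one reply."""
    reply = json.loads(serve(pack_call(name, *args)))
    if not reply["ok"]:
        raise RemoteError(reply["error"])
    return reply["result"]


total = rpc("add", 2, 3)
try:
    rpc("div", 1, 0)
    failure = None
except RemoteError as err:
    failure = str(err)
```

Nothing here handles lost messages or crashed processes, which is exactly where the excerpt locates the real difficulty of RPC.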
Total order broadcast and multicast algorithms: Taxonomy and survey
- ACM Computing Surveys, 2004