Using Process Groups to Implement Failure Detection in Asynchronous Environments (1991)

by Aleta M. Ricciardi, Kenneth P. Birman

Results 11 - 20 of 128

On the Impossibility of Group Membership

by Tushar Deepak Chandra, Vassos Hadzilacos, Sam Toueg, Bernadette Charron-Bost, 1996
Abstract - Cited by 146 (5 self)
We prove that the primary-partition group membership problem cannot be solved in asynchronous systems with crash failures, even if one allows the removal or killing of non-faulty processes that are erroneously suspected to have crashed.
1 Introduction: The problem of group membership has been the focus of much theoretical and experimental work on fault-tolerant distributed systems. A group membership protocol manages the formation and maintenance of a set of processes called a group. For example, a group may be a set of processes that are cooperating towards a common task (e.g., the primary and backup servers of a database), a set of processes that share a common interest (e.g., clients that subscribe to a particular newsgroup), or the set of all processes in the system that are currently deemed to be operational. In general, a process may leave a group because it failed, it voluntarily requested to leave, or it is forcibly expelled by other members of the group. Similarly, a proces...

Consul: A Communication Substrate for Fault-Tolerant Distributed Programs

by Shivakant Mishra, Larry L. Peterson, Richard D. Schlichting - Distributed Systems Engineering Journal, 1991
Abstract - Cited by 118 (22 self)
Replicating important services on multiple processors in a distributed architecture is a common technique for constructing dependable computing systems. This paper describes a communication substrate, called Consul, that facilitates the development of such systems by providing a collection of fundamental abstractions for constructing fault-tolerant programs based on replicated processing. These abstractions include a multicast service, a membership service, and a recovery service. Consul is unique in two respects. First, its services are implemented using a collection of algorithms that exploit the partial (or causal) ordering of messages exchanged in the system. Such algorithms are generally more efficient than those that depend on a total ordering of events. Second, its underlying architecture is configurable, thereby allowing a system to be structured according to the needs of the application. The paper sketches Consul's architecture, presents the algorithms used by its pr...

Weak-Consistency Group Communication and Membership

by Richard Andrew Golding, 1992
Abstract - Cited by 92 (7 self)
Many distributed systems for wide-area networks can be built conveniently, and operate efficiently and correctly, using a weak consistency group communication mechanism. This mechanism organizes a set of principals into a single logical entity, and provides methods to multicast messages to the members. A weak consistency distributed system allows the principals in the group to differ on the value of shared state at any given instant, as long as they will eventually converge to a single, consistent value. A group containing many principals and using weak consistency can provide the reliability, performance, and scalability necessary for wide-area systems.
I have developed a framework for constructing group communication systems, for classifying existing distributed system tools, and for constructing and reasoning about a particular group communication model. It has four components: message delivery, message ordering, group membership, and the application. Each component may have a different implementation, so that the group mechanism can be tailored to application requirements. The framework supports a new message delivery protocol, called timestamped anti-entropy, which provides reliable, eventual message delivery; is efficient; and tolerates most transient processor and network failures. It can be combined with message ordering implementations that provide ordering guarantees ranging from unordered to total, causal delivery. A new group membership protocol completes the set, providing temporarily inconsistent membership views resilient to up to k simultaneous principal failures. The Refdbms distributed bibliographic database system, which has been constructed using this framework, is used as an example. Refdbms databases can be replicated on many different sites, using the group communication system described here.
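The idea of eventual delivery through pairwise exchange can be illustrated with a minimal sketch. It is not the timestamped anti-entropy protocol from the thesis: the `Replica` class, its per-origin `summary` of highest timestamps, and the `anti_entropy` function are hypothetical names, and the sketch assumes that each session delivers every missing message, with no failures or message loss.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A principal holding a message log (hypothetical, simplified model)."""
    name: str
    clock: int = 0
    # log maps (origin, timestamp) -> payload
    log: dict = field(default_factory=dict)
    # summary[origin] = highest timestamp seen from that origin
    summary: dict = field(default_factory=dict)

    def write(self, payload):
        """Record a locally originated message with the next local timestamp."""
        self.clock += 1
        self.log[(self.name, self.clock)] = payload
        self.summary[self.name] = self.clock

    def missing_for(self, peer_summary):
        """Return the messages the peer has not seen, judged by its summary."""
        return sorted(
            (key, payload)
            for key, payload in self.log.items()
            if key[1] > peer_summary.get(key[0], 0)
        )

    def receive(self, messages):
        """Store received messages and advance the summary."""
        for (origin, ts), payload in messages:
            self.log[(origin, ts)] = payload
            self.summary[origin] = max(self.summary.get(origin, 0), ts)


def anti_entropy(a, b):
    """One pairwise session: each side sends what the other lacks."""
    to_b = a.missing_for(b.summary)
    to_a = b.missing_for(a.summary)
    b.receive(to_b)
    a.receive(to_a)


a, b = Replica("a"), Replica("b")
a.write("x=1")
a.write("x=2")
b.write("y=7")
anti_entropy(a, b)  # afterwards both replicas hold the same three messages
```

Between sessions the two logs may differ, which is the temporary inconsistency the abstract allows; repeated sessions among all pairs drive the group toward a single shared log.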

Membership Algorithms for Multicast Communication Groups

by Yair Amir, Danny Dolev, Shlomo Kramer, Dalia Malki - In 6th Intl. Workshop on Distributed Algorithms proceedings (WDAG-6), (LCNS , 1992
Abstract - Cited by 70 (15 self)
We introduce a membership protocol that maintains the set of currently connected machines in an asynchronous and dynamic environment. The protocol handles both failures and joining of machines. It operates within a multicast communication sub-system. It is well known that solving the membership problem in an asynchronous environment when faults may be present is impossible. In order to circumvent this difficulty, our approach may, on rare occasions, unjustly remove live (but not active) machines from the membership. The benefit is that our protocol always terminates within a finite time. In addition, if a machine is inadvertently taken out of the membership, it can rejoin it right away using the membership protocol. Despite the asynchrony, configuration changes are logically synchronized with all the regular messages in the system, and appear virtually synchronous to the application layer. The protocol presented here supports partitions and merges. When partitions and merging occur, the protoco...

Hive: Fault Containment for Shared-Memory Multiprocessors

by John Chapin, et al., 1995
Abstract - Cited by 65 (8 self)
Reliability and scalability are major concerns when designing operating systems for large-scale shared-memory multiprocessors. In this paper we describe Hive, an operating system with a novel kernel architecture that addresses these issues. Hive is structured as an internal distributed system of independent kernels called cells. This improves reliability because a hardware or software fault damages only one cell rather than the whole system, and improves scalability because few kernel resources are shared by processes running on different cells. The Hive prototype is a complete implementation of UNIX SVR4 and is targeted to run on the Stanford FLASH multiprocessor. This paper focuses on Hive's solution to the following key challenges: (1) fault containment, i.e. confining the effects of hardware or software faults to the cell where they occur, and (2) memory sharing among cells, which is required to achieve application performance competitive with other multiprocessor operating systems. Fault containment in a shared-memory multiprocessor requires defending each cell against erroneous writes caused by faults in other cells. Hive prevents such damage by using the FLASH firewall, a write permission bit-vector associated with each page of memory, and by discarding potentially corrupt pages when a fault is detected. Memory sharing is provided through a unified file and virtual memory page cache across the cells, and through a unified free page frame pool. We report early experience with the system, including the results of fault injection and performance experiments using SimOS, an accurate simulator of FLASH. The effects of faults were contained to the cell in which they occurred in all 49 tests where we injected fail-stop hardware faults, and in all 20 tests where we injected kernel data corruption.
The Hive prototype executes test workloads on a four-processor four-cell system with between 0% and 11% slowdown as compared to SGI IRIX 5.2 (the version of UNIX on which it is based).
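The firewall mechanism described above can be pictured with a small software model. This is only a sketch under stated assumptions, not Hive's or FLASH's implementation: the `Firewall` class, the one-bit-per-cell layout, and the `contain_fault` helper are hypothetical, and the real firewall is enforced by hardware rather than by Python code.

```python
class Firewall:
    """Per-page write-permission bit-vector (assumed: one bit per cell)."""

    def __init__(self, num_pages, num_cells):
        self.num_cells = num_cells
        self.bits = [0] * num_pages  # bits[page] has bit c set if cell c may write

    def grant(self, page, cell):
        self.bits[page] |= 1 << cell

    def revoke(self, page, cell):
        self.bits[page] &= ~(1 << cell)

    def may_write(self, page, cell):
        return bool((self.bits[page] >> cell) & 1)

    def pages_writable_by(self, cell):
        return [p for p, b in enumerate(self.bits) if (b >> cell) & 1]


def contain_fault(firewall, memory, faulty_cell):
    """Discard every page the faulty cell could have written, then revoke it.

    `memory` is a hypothetical dict mapping page number to contents;
    a discarded page is marked with None.
    """
    discarded = firewall.pages_writable_by(faulty_cell)
    for page in discarded:
        memory[page] = None          # contents are potentially corrupt
        firewall.revoke(page, faulty_cell)
    return discarded


fw = Firewall(num_pages=4, num_cells=2)
fw.grant(0, 0)   # page 0 writable by cell 0 only
fw.grant(1, 1)   # page 1 writable by cell 1 only
fw.grant(2, 0)   # page 2 shared for writing by both cells
fw.grant(2, 1)
memory = {0: "a", 1: "b", 2: "c", 3: "d"}
lost = contain_fault(fw, memory, faulty_cell=1)
```

The point of the sketch is the containment rule: only pages whose bit-vector admitted the faulty cell are treated as suspect, so pages that cell could never write survive the fault untouched.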

Uniform Reliable Multicast in a Virtually Synchronous Environment

by André Schiper, Alain Sandoz - In IEEE 13th Intl. Conf. Distributed Computing Systems, 1993
Abstract - Cited by 65 (19 self)
This paper presents the definition and solution to the uniform reliable multicast problem in the virtually synchronous environment defined by the Isis system. A uniform reliable multicast of a message m has the property that if m has been received by any destination process (faulty or not), then m is received by all processes that reach a decision. Uniform reliable multicast provides a solution to the distributed commit problem. The paper defines two multicast primitives in the virtually synchronous model: reliable multicast (called view-atomic) and uniform reliable multicast (called uniform view-atomic). The view-atomic multicast is used to implement the uniform view-atomic primitive. As view-atomicity is based on the concept of process group membership, the paper establishes a connection between the process group membership and the distributed commit problems.
1 Introduction: A distributed application is composed of processes communicating through message passing. Point to point is t...

Tangler: A Censorship-Resistant Publishing System Based On Document Entanglements

by Marc Waldman - In Proceedings of the 8th ACM Conference on Computer and Communications Security, 2001
Abstract - Cited by 60 (0 self)
We describe the design of a censorship-resistant system that employs a unique document storage mechanism. Newly published documents are dependent on the blocks of previously published documents. We call this dependency an entanglement. Entanglement makes replication of previously published content an intrinsic part of the publication process. Groups of files, called collections, can be published together and named in a host-independent manner. Individual documents within a collection can be securely updated in such a way that future readers of the collection see and tamper-check the updates. The system employs a self-policing network of servers designed to eject non-compliant servers and prevent them from doing more harm than good.

An Asynchronous Membership Protocol that Tolerates Partitions

by Danny Dolev, Dalia Malki, Ray Strong, 1993
Abstract - Cited by 55 (7 self)
This paper presents a membership protocol for maintaining the set of operational and connected machines in agreement. The protocol operates in an asynchronous environment prone to crash failures, omission failures and network partitions. The protocol is suitable for systems with machines that communicate via broadcast (or multicast) messages. It supports continued operation with partitions and provides the mechanism for merging of partitions. The principles of the protocol presented here have been successfully incorporated into the Transis system [3, 2], the Totem system [4], and the Horus system [29]. The membership protocol presented here is integrated in the communication system, such that the notifications of membership changes are delivered to the application among the stream of regular messages. Changes to the membership are coordinated with the delivery of regular messages in the system. This valuable approach was presented in [7, 9] in the context of a primary-partition system,...

A Framework for Partitionable Membership Service

by Danny Dolev, Dalia Malki, Ray Strong, 1995
Abstract - Cited by 54 (6 self)
This paper presents a framework for a membership service that operates in a partitionable environment and supports partitionable operation, which is a form of distributed operation in which multiple network components that are (temporarily) disconnected from each other operate autonomously. The service assumes an asynchronous environment and must tolerate crash failures, omission failures and network partitions. The principles of partitionable operation that we present here have been incorporated in the Transis system [13, 1], the Totem system [3], and the Horus system [19]. The paper discusses applications built in these projects, and relates them to the membership service definition. We introduce a distinction between partial and complete installations of system views that makes feasible what we believe are the strongest possible requirements for causal order and virtual synchrony. We propose our specification of partitionable membership service as a standard against which other memb...

Adaptive Distributed and Fault-Tolerant Systems

by Matti A. Hiltunen, Richard D. Schlichting - International Journal of Computer Systems Science and Engineering, 1995
Abstract - Cited by 52 (5 self)
An adaptive computing system is one that modifies its behavior based on changes in the environment. Since sites connected by a local-area network inherently have to deal with network congestion and the failure of other sites, distributed systems can be viewed as an important subclass of adaptive systems. As such, use of adaptive methods in this context has the same potential advantages of improved efficiency and structural simplicity as for adaptive systems in general. This paper describes a model for adaptive systems that can be applied in many scenarios arising in distributed and fault-tolerant systems. This model divides the adaptation process into three different phases---change detection, agreement, and action---that can be used to describe existing algorithms that deal with change, as well as to develop new adaptive algorithms. In addition to clarifying the logical structure of such algorithms, this model can also serve as a unifying implementation framework. Several ad...