Results 1 - 10
of
1,592
Distributed Snapshots: Determining Global States of Distributed Systems
- ACM Transactions on Computer Systems
, 1985
"... This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation. Many problems in distributed systems can be cast in terms of the problem of detecting global states. For instance, the global state detection algorithm helps to s ..."
Abstract
-
Cited by 929 (5 self)
- Add to MetaCart
This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation. Many problems in distributed systems can be cast in terms of the problem of detecting global states. For instance, the global state detection algorithm helps to solve an important class of problems: stable property detection. A stable property is one that persists: once a stable property becomes true it remains true thereafter. Examples of stable properties are “computation has terminated, ” “ the system is deadlocked ” and “all tokens in a token ring have disappeared. ” The stable property detection problem is that of devising algorithms to detect a given stable property. Global state detection can also be used for checkpointing. Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems-distributed applications; distributed databases; network operating systems; D.4.1 [Operating Systems]: Process Management-concurrency; deadlocks, multiprocessing/multiprogramming; mutual exclusion; scheduling; synchronization; D.4.5 [Operating Systems]: Reliability-backup procedures; checkpoint/restart; fault-tolerance; verification
Virtual time
- ACM Transactions on Programming Languages and Systems
, 1985
"... Virtual time is a new paradigm for organizing and synchronizing distributed systems which can be applied to such problems as distributed discrete event simulation and distributed database concur-rency control. Virtual time provides a flexible abstraction of real time in much the same way that virtua ..."
Abstract
-
Cited by 790 (5 self)
- Add to MetaCart
Virtual time is a new paradigm for organizing and synchronizing distributed systems which can be applied to such problems as distributed discrete event simulation and distributed database concur-rency control. Virtual time provides a flexible abstraction of real time in much the same way that virtual memory provides an abstraction of real memory. It is implemented using the Time Warp mechanism, a synchronization protocol distinguished by its reliance on lookahead-rollback, and by its implementation of rollback via antimessages.
A Highly Adaptive Distributed Routing Algorithm for Mobile Wireless Networks
, 1997
"... We present a new distributed routing protocol for mobile, multihop, wireless networks. The protocol is one of a family of protocols which we term "link reversal" algorithms. The protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting ..."
Abstract
-
Cited by 746 (3 self)
- Add to MetaCart
We present a new distributed routing protocol for mobile, multihop, wireless networks. The protocol is one of a family of protocols which we term "link reversal" algorithms. The protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed l i nk reversals. The protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. In these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. This capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. This desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. We refer to the protocol as the Temporally-Ordered Routing Algorithm (TORA).
The part-time parliament
- ACM Transactions on Computer Systems
, 1998
"... Digital Equipment Corporation Recent archaeological discoveries on the island of Paxos reveal that the parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their frequent forays from t ..."
Abstract
-
Cited by 516 (11 self)
- Add to MetaCart
Digital Equipment Corporation Recent archaeological discoveries on the island of Paxos reveal that the parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their frequent forays from the chamber and the forgetfulness of their messengers. The Paxon parliament’s protocol provides a new way of implementing the state-machine approach to the design of distributed systems.
Reliable Communication in the Presence of Failures
- ACM Transactions on Computer Systems
, 1987
"... The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols at ..."
Abstract
-
Cited by 480 (13 self)
- Add to MetaCart
The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency, while respecting application-specific delivery ordering constraints, and have varying cost and performance that depend on the degree of ordering desired. In particular, a protocol that enforces causal delivery orderings is introduced and shown to be a valuable alternative to conventional asynchronous communication protocols. The facility also ensures that the processes belonging to a fault-tolerant process group will observe consistent orderings of events affecting the group as a whole, including process failures, recoveries, migration, and dynamic changes to group properties like member rankings. A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher level algorithms made possible by our approach.
Eraser: a dynamic data race detector for multithreaded programs
- ACM Transaction of Computer System
, 1997
"... Multi-threaded programming is difficult and error prone. It is easy to make a mistake in synchronization that produces a data race, yet it can be extremely hard to locate this mistake during debugging. This paper describes a new tool, called Eraser, for dynamically detecting data races in lock-based ..."
Abstract
-
Cited by 478 (2 self)
- Add to MetaCart
Multi-threaded programming is difficult and error prone. It is easy to make a mistake in synchronization that produces a data race, yet it can be extremely hard to locate this mistake during debugging. This paper describes a new tool, called Eraser, for dynamically detecting data races in lock-based multi-threaded programs. Eraser uses binary rewriting techniques to monitor every shared memory reference and verify that consistent locking behavior is observed. We present several case studies, including undergraduate coursework and a multi-threaded Web search engine, that demonstrate the effectiveness of this approach. 1
Internet Time Synchronization: the Network Time Protocol
- IEEE Transactions on Communications
, 1991
"... This paper describes the Network Time Protocol (NTP), which is designed to distribute time information in a large, diverse internet system operating at speeds from mundane to lightwave. It uses a symmetric architecture in which a distributed subnet of time servers operating in a self-organizing, hie ..."
Abstract
-
Cited by 420 (12 self)
- Add to MetaCart
This paper describes the Network Time Protocol (NTP), which is designed to distribute time information in a large, diverse internet system operating at speeds from mundane to lightwave. It uses a symmetric architecture in which a distributed subnet of time servers operating in a self-organizing, hierarchical configuration synchronizes local clocks within the subnet and to national time standards via wire, radio or calibrated atomic clock. The servers can also redistribute time information within a network via local routing algorithms and time daemons. This paper also discusses the architecture, protocol and algorithms, which were developed over several years of implementation refinement and resulted in the designation of NTP as an Internet Standard protocol. The NTP synchronization system, which has been in regular operation in the Internet for the last several years, is described along with performance data which shows that timekeeping accuracy throughout most portions of the Internet can be ordinarily maintained to within a few milliseconds, even in cases of failure or disruption of clocks, time servers or networks. Keywords: network clock synchronization, standard time distribution, fault-tolerant architecture, maximumlikelihood principles, disciplined oscillator, internet protocol.
Fine-grained network time synchronization using reference broadcasts
, 2002
"... Permission is granted for noncommercial reproduction of the work for educational or research purposes. ..."
Abstract
-
Cited by 419 (26 self)
- Add to MetaCart
Permission is granted for noncommercial reproduction of the work for educational or research purposes.
Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System
- In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles
, 1995
"... Bayou is a replicated, weakly consistent storage system designed for a mobile computing environment that includes portable machines with less than ideal network connectivity. To maximize availability, users can read and write any accessible replica. Bayou's design has focused on supporting apphcatio ..."
Abstract
-
Cited by 352 (10 self)
- Add to MetaCart
Bayou is a replicated, weakly consistent storage system designed for a mobile computing environment that includes portable machines with less than ideal network connectivity. To maximize availability, users can read and write any accessible replica. Bayou's design has focused on supporting apphcation-specific mechanisms to detect and resolve the update conflicts that naturally arise in such a system, ensuring that replicas move towards eventual consistency, and defining a protocol by which the resolution of update conflicts stabilizes. It includes novel methods for conflict detection, called dependency checks, and per-write conflict resolution based on client-provided merge procedures. To guarantee eventual consistency, Bayou servers must be able to rollback the effects of previously executed writes and redo them according to a global senalization order. Furthermore, Bayou permits clients to observe the results of all writes received by a server, Including tentative writes whose conflicts have not been ultimately resolved. This paper presents the motivation for and design of these mechanisms and describes the experiences gained with an initial implementation of the system.
Transis: A Communication Sub-System for High Availability
, 1992
"... This paper describes Transis, a communication sub-system for high availability. Transis is a transport layer package that supports a variety of reliable multicast message passing services between processors. It provides highly tuned multicast and control services for scalable systems with arbitrary ..."
Abstract
-
Cited by 337 (46 self)
- Add to MetaCart
This paper describes Transis, a communication sub-system for high availability. Transis is a transport layer package that supports a variety of reliable multicast message passing services between processors. It provides highly tuned multicast and control services for scalable systems with arbitrary topology. The communication domain comprises of a set of processors that can initiate multicast messages to a chosen subset. Transis delivers them reliably and maintains the membership of connected processors automatically, in the presence of arbitrary communication delays, of message losses and of processor failures and joins. The contribution of this paper is in providing an aggregate definition of communication and control services over broadcast domains. The main benefit is the efficient implementation of these services using the broadcast capability. In addition, the membership algorithm has a novel approach in handling partitions and remerging; in allowing the regular flow of messages...

