Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments
ACM Computing Surveys, 1999
Cited by 94 (9 self)
Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims at structuring the area and thus guiding readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this can help to reveal inherently fundamental structures that contribute to understanding and unifying methods and terminology. By doing this, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.
Detection of global predicates: Techniques and their limitations
Distributed Computing, 1998
Cited by 47 (7 self)
We show that the problem of predicate detection in distributed systems is NP-complete. In the past, efficient algorithms have been developed for special classes of predicates such as stable predicates, observer-independent predicates, and conjunctive predicates. We introduce a class of predicates, semilinear predicates, which properly contains all of the above classes. We first discuss stable, observer-independent and semilinear classes of predicates and their relationships with each other. We also study closure properties of these classes with respect to conjunction and disjunction. Finally, we discuss algorithms for detection of predicates in these classes. We provide a nondeterministic detection algorithm for each class of predicate. We show that each class can be equivalently characterized by the degree of nondeterminism present in the algorithm. Stable predicates are defined as those that can be detected by an algorithm with the most nondeterminism. All other classes can be derived by appropriately constraining the nondeterminism in this algorithm.
Preventing useless checkpoints in distributed computations
In Proceedings of the IEEE Symposium on Reliable Distributed Systems, 1997
Cited by 28 (0 self)
A useless checkpoint is a local checkpoint that cannot be part of a consistent global checkpoint. This paper addresses the following important problem. Given a set of processes that take (basic) local checkpoints in an independent and unknown way, the problem is to design a communication-induced checkpointing protocol that directs processes to take additional local (forced) checkpoints to ensure that no local checkpoint is useless. A general and efficient protocol answering this problem is proposed. It is shown that several existing protocols that solve the same problem are particular instances of it. The design of this general protocol is motivated by the use of communication-induced checkpointing protocols in “consistent global checkpoint”-based distributed applications. Detection of stable or unstable properties, rollback-recovery, and determination of distributed breakpoints are examples of such applications.
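The definition of a useless checkpoint can be illustrated with a small brute-force check. The following Python sketch is not the protocol proposed in the paper; it only enumerates every global checkpoint and reports local checkpoints that appear in no consistent one. The encoding is an assumption made for illustration: checkpoint 0 is the initial state, interval k of a process lies between its checkpoints k and k+1, and each message is a hypothetical tuple (sender, send_interval, receiver, recv_interval).

```python
from itertools import product


def is_consistent(global_ckpt, messages):
    """A global checkpoint is consistent if it has no orphan message.

    A message is orphan when it is received before the receiver's
    checkpoint but sent after the sender's checkpoint.
    """
    for sender, send_iv, receiver, recv_iv in messages:
        sent_after = send_iv >= global_ckpt[sender]
        received_before = recv_iv < global_ckpt[receiver]
        if sent_after and received_before:
            return False
    return True


def useless_checkpoints(num_ckpts, messages):
    """Return (process, checkpoint) pairs that belong to no consistent global checkpoint."""
    useful = set()
    for g in product(*(range(k) for k in num_ckpts)):
        if is_consistent(g, messages):
            useful.update(enumerate(g))
    all_ckpts = {(p, c) for p, k in enumerate(num_ckpts) for c in range(k)}
    return sorted(all_ckpts - useful)


# Illustrative two-process run (assumed data): P0 has 2 checkpoints, P1 has 3.
# P0 sends m2 in interval 0; P1 receives it before its checkpoint 1.
# P1 sends m1 after its checkpoint 1; P0 receives it before its checkpoint 1.
num_ckpts = [2, 3]
messages = [(0, 0, 1, 0), (1, 1, 0, 0)]
result = useless_checkpoints(num_ckpts, messages)
print(result)
```

In this run, checkpoint 1 of P1 cannot be paired with any checkpoint of P0 without creating an orphan message, which is the situation a forced checkpoint is meant to prevent. The enumeration is exponential in the number of processes and is meant only to make the definition concrete.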
Techniques to Tackle State Explosion in Global Predicate Detection
IEEE Transactions on Software Engineering, 2001
Cited by 22 (0 self)
Global predicate detection, which is an important problem in testing and debugging distributed programs, is very hard due to the combinatorial explosion of the global state space. This paper presents several techniques to tackle the state explosion problem in detecting whether an arbitrary predicate is true at some consistent global state of a distributed system. We present space efficient online algorithms for detecting. We then improve the performance of our algorithms, both in space and time, by increasing the granularity of the execution step from an event to a sequence of events in each process. Index Terms: Distributed systems, global states, global predicates, lattice, space complexity, global intervals.
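The baseline that suffers from state explosion can be sketched as a breadth-first walk over the lattice of consistent global states. This Python sketch is a plain enumeration, not one of the space-efficient algorithms from the paper. It assumes each event carries a vector clock whose i-th entry counts the events of process i, and the names `clocks`, `states`, and `possibly` are hypothetical.

```python
from collections import deque


def is_consistent(cut, clocks):
    """cut[i] is the number of events of process i included in the cut.

    Vector clocks grow along a process, so checking the last included
    event of each process is enough.
    """
    for i, count in enumerate(cut):
        if count == 0:
            continue
        vc = clocks[i][count - 1]
        if any(vc[j] > cut[j] for j in range(len(cut))):
            return False
    return True


def possibly(clocks, states, predicate):
    """Return a consistent cut whose global state satisfies predicate, or None."""
    n = len(clocks)
    start = (0,) * n
    seen = {start}
    queue = deque([start])
    while queue:
        cut = queue.popleft()
        global_state = tuple(states[i][cut[i]] for i in range(n))
        if predicate(global_state):
            return cut
        # Advance one process by one event, keeping only consistent cuts.
        for i in range(n):
            if cut[i] < len(clocks[i]):
                nxt = cut[:i] + (cut[i] + 1,) + cut[i + 1:]
                if nxt not in seen and is_consistent(nxt, clocks):
                    seen.add(nxt)
                    queue.append(nxt)
    return None


# Assumed example: P0 sends a message (event 1) then does a local step;
# P1 does a local step, then receives the message (vector clock (1, 2)).
clocks = [[(1, 0), (2, 0)], [(0, 1), (1, 2)]]
# states[i][k] is the local state of process i after its first k events.
states = [[0, 1, 2], [0, 1, 2]]

found = possibly(clocks, states, lambda s: s == (2, 1))
missing = possibly(clocks, states, lambda s: s == (0, 2))
print(found, missing)
```

The `seen` set can grow to the size of the whole lattice, which is exponential in the number of processes in the worst case; this is the cost that motivates the techniques described in the abstract.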
Detecting global predicates in distributed systems with clocks
In Proceedings of the 11th International Workshop on Distributed Algorithms (WDAG’97), 1997
Cited by 20 (2 self)
This paper proposes a framework for detecting global state predicates in systems of processes with approximately-synchronized real-time clocks. Timestamps from these clocks are used to define two orderings on events: "definitely occurred before" and "possibly occurred before". These orderings lead naturally to definitions of 3 distinct detection modalities, i.e., 3 meanings of "predicate held during a computation", namely: Poss db→ ("possibly held"), Def db→ ("definitely held"), and Inst ("definitely held in a specific global state"). This paper defines these modalities and gives efficient algorithms for detecting them. The algorithms are based on algorithms of Garg and Waldecker, Alagar and Venkatesan, Cooper and Marzullo, and Fromentin and Raynal. Complexity analysis shows that under reasonable assumptions, these real-time-clock-based detection algorithms are less expensive than detection algorithms based on Lamport's happened-before ordering. Sample applications are given to illustrate the benefits of this approach. Key words: global predicate detection, consistent global states, distributed debugging, real-time monitoring
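One plausible reading of the two orderings can be sketched in Python. The abstract does not give the formal definitions, so everything below is an assumption: clocks on different processes are taken to agree within a known bound `eps`, "definitely occurred before" is taken to mean the timestamps differ by more than that bound, and "possibly occurred before" is taken as the negation of the reverse definite ordering. The `Event` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    process: int
    timestamp: float  # reading of the local, approximately synchronized clock


def definitely_before(e, f, eps):
    """Assumed rule: e precedes f in real time whatever the clock error is."""
    # Events on the same process share one clock, so no error bound applies.
    bound = 0.0 if e.process == f.process else eps
    return e.timestamp + bound < f.timestamp


def possibly_before(e, f, eps):
    """Assumed rule: e may precede f unless f definitely precedes e."""
    return not definitely_before(f, e, eps)


eps = 0.5
a = Event(process=0, timestamp=10.0)
b = Event(process=1, timestamp=10.3)
c = Event(process=1, timestamp=12.0)

print(definitely_before(a, b, eps), possibly_before(b, a, eps))
print(definitely_before(a, c, eps), possibly_before(c, a, eps))
```

Under these assumptions, events a and b are too close to order definitely, so each possibly occurred before the other, while a definitely occurred before c. Fewer unordered pairs mean fewer global states to consider, which is consistent with the cost advantage the abstract claims over happened-before-based detection.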
Efficient Detection of Global Properties in Distributed Systems Using Partial-Order Methods
In Proceedings of the 12th International Conference on Computer-Aided Verification (CAV), volume 1855 of Lecture Notes in Computer Science, 2000
Cited by 19 (1 self)
A new approach is presented for detecting whether a computation of an asynchronous distributed system satisfies Poss Φ (read "possibly Φ"), meaning the system could have passed through a global state satisfying property Φ. Previous general-purpose algorithms for this problem explicitly enumerate the set of global states through which the system could have passed during the computation. The new approach is to represent this set symbolically, in particular, using ordered binary decision diagrams. We describe an implementation of this approach, suitable for offline detection of properties, and compare its performance to the enumeration-based algorithm of Alagar & Venkatesan. In typical cases, the new algorithm is significantly faster. We have measured over 400-fold speedup in some cases.
Predicate control for active debugging of distributed programs
1998
Cited by 18 (9 self)
Existing approaches to debugging distributed systems involve a cycle of passive observation followed by computation replaying. We propose predicate control as an active approach to debugging such systems. The predicate control approach involves a cycle of observation followed by controlled replaying of computations, based on observation. We formalize the predicate control problem for both offline and online scenarios. We prove that offline predicate control for general boolean predicates is NP-hard. However, we provide an efficient solution for offline predicate control for the class of disjunctive predicates. We further solve online predicate control for disjunctive predicates under certain restrictions on the system. Lastly, we demonstrate how both offline and online predicate control facilitate distributed debugging by allowing the programmer to control computations to maintain global safety properties.
Detecting Temporal Logic Predicates on the Happened-Before Model
In Proc. of the International Parallel and Distributed Processing Symposium (IPDPS), Fort, 2001
Cited by 14 (6 self)
... in distributed computing. In this paper we describe new predicate detection algorithms for certain temporal logic predicates. We use a temporal logic, CTL, for specifying properties of a distributed computation and interpret it on a finite lattice of global states. We present solutions to the predicate detection of linear and observer-independent predicates under the EG and AG operators of CTL. For linear predicates we develop polynomial-time predicate detection algorithms which exploit the structure of finite distributive lattices. For observer-independent predicates we prove that predicate detection is NP-complete under the EG operator and co-NP-complete under the AG operator. We also present polynomial-time algorithms for a CTL operator called until, for which such algorithms did not exist. Finally, our work unifies many earlier results in predicate detection in a single framework.
Observation and Control for Debugging Distributed Computations
In Proceedings of the International Workshop on Automated Debugging (AADEBUG), 1997
Cited by 13 (0 self)
I present a general framework for observing and controlling a distributed computation and its applications to distributed debugging. Algorithms for observation are useful in distributed debugging to stop a distributed program under certain undesirable global conditions. I present the main ideas required for developing efficient algorithms for observation. Algorithms for control are useful in debugging to restrict the behavior of the distributed program to suspicious executions. They are also useful when a programmer wants to test a distributed program under certain conditions. I present different models and their limitations for controlling distributed computations.
Communication-Induced Determination of Consistent Snapshots
1997
Cited by 13 (0 self)
A classical way to determine consistent snapshots is to use Chandy-Lamport's algorithm. This algorithm relies on specific control messages that allow processes to synchronize local checkpoint determination and message recording so that the resulting snapshot is consistent. This paper investigates a communication-induced approach to determine consistent snapshots. In such an approach, control information is carried by application messages. Two abstract necessary and sufficient conditions are stated: one associated with global checkpoint consistency, the other associated with message recording. A general protocol is derived from these abstract conditions. Actually, this general protocol can be instantiated in distinct ways, giving rise to a family of communication-induced snapshot protocols. This general protocol shows there is an intrinsic tradeoff between the number of forced checkpoints and the number of recorded messages. Finally, a particular instantiation of the general...