Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments
- ACM Computing Surveys, 1999
- Cited by 94 (9 self)
Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims at structuring the area and thus guiding readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this can help to reveal inherently fundamental structures that contribute to understanding and unifying methods and terminology. By doing this, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.
On slicing a distributed computation
- In Proceedings of the 21st IEEE International Conference on Distributed Computing Systems (ICDCS), 2001
- Cited by 67 (20 self)
We introduce the notion of a slice of a distributed computation. A slice of a distributed computation with respect to a global predicate is a computation which captures those and only those consistent cuts of the original computation which satisfy the global predicate. We show that a slice exists for a global predicate iff the predicate is a regular predicate. We then give an efficient algorithm for computing the slice and show applications of slicing to testing and debugging of distributed programs.
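As a rough illustration of the objects involved (not the efficient slicing algorithm the abstract refers to), the following Python sketch enumerates the consistent cuts of a tiny computation by brute force and keeps those that satisfy a global predicate. The cut representation (a tuple of per-process event counts), the message encoding, and the names `consistent_cuts` and `satisfying_cuts` are assumptions made for this example. It only lists the cuts a slice would have to capture; it does not build the slice itself.

```python
from itertools import product


def consistent_cuts(num_events, messages):
    """Yield every consistent cut as a tuple of per-process event counts.

    num_events[i] is the number of events on process i (hypothetical encoding).
    messages is a list of ((sender, send_idx), (receiver, recv_idx)) pairs with
    1-based event indices. A cut is consistent when every receive event it
    contains is accompanied by the matching send event.
    """
    for cut in product(*(range(n + 1) for n in num_events)):
        if all(cut[s] >= si for (s, si), (r, ri) in messages if cut[r] >= ri):
            yield cut


def satisfying_cuts(num_events, messages, predicate):
    """Return the consistent cuts on which the global predicate holds."""
    return [c for c in consistent_cuts(num_events, messages) if predicate(c)]


# Two processes with two events each; the first event of process 0 sends a
# message that is received by the second event of process 1.
EXAMPLE_EVENTS = [2, 2]
EXAMPLE_MESSAGES = [((0, 1), (1, 2))]

if __name__ == "__main__":
    # Illustrative predicate: both processes have executed equally many events.
    print(satisfying_cuts(EXAMPLE_EVENTS, EXAMPLE_MESSAGES, lambda c: c[0] == c[1]))
```

The enumeration is exponential in the number of processes, so it is only useful for checking small examples by hand.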
How to Recover Efficiently and Asynchronously when Optimism Fails
- In Proceedings of the 16th International Conference on Distributed Computing Systems, 1996
- Cited by 49 (5 self)
We propose a new algorithm for recovering asynchronously from failures in a distributed computation. Our algorithm is based on two novel concepts - a fault-tolerant vector clock to maintain causality information in spite of failures, and a history mechanism to detect orphan states and obsolete messages. These two mechanisms together with checkpointing and message-logging are used to restore the system to a consistent state after a failure of one or more processes. Our algorithm is completely asynchronous. It handles multiple failures, does not assume any message ordering, causes the minimum amount of rollback and restores the maximum recoverable state with low overhead. Earlier optimistic protocols lack one or more of the above properties.
Detection of global predicates: Techniques and their limitations
- Distributed Computing, 1998
- Cited by 47 (7 self)
We show that the problem of predicate detection in distributed systems is NP-complete. In the past, efficient algorithms have been developed for special classes of predicates such as stable predicates, observer-independent predicates, and conjunctive predicates. We introduce a class of predicates, semi-linear predicates, which properly contains all of the above classes. We first discuss stable, observer-independent and semi-linear classes of predicates and their relationships with each other. We also study closure properties of these classes with respect to conjunction and disjunction. Finally, we discuss algorithms for detection of predicates in these classes. We provide a non-deterministic detection algorithm for each class of predicates. We show that each class can be equivalently characterized by the degree of non-determinism present in the algorithm. Stable predicates are defined as those that can be detected by an algorithm with the most non-determinism. All other classes can be derived by appropriately constraining the non-determinism in this algorithm.
Detection of Strong Unstable Predicates in Distributed Programs
- IEEE Transactions on Parallel and Distributed Systems, 1996
- Cited by 44 (9 self)
This paper discusses detection of global predicates in a distributed program. A run of a distributed program results in a set of sequential traces, one for each process. These traces may be combined to form many global sequences consistent with the single run of the program. A strong global predicate is true in a run if it is true for all global sequences consistent with the run. We present algorithms which detect if the given strong global predicate became true in a run of a distributed program.
1 Introduction
Detection of global predicates is a fundamental problem in distributed computing. It arises in the designing, debugging and testing of distributed programs. Global predicates can be classified into two types - stable and unstable. A stable predicate is one which never turns false once it becomes true. An unstable predicate is one without such a property. Its value may alternate between true and false. Detection of stable predicates has been addressed in the literature by means ...
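The definition of a strong global predicate can be made concrete with a small brute-force check. The Python sketch below is not one of the paper's algorithms: it assumes that "true for a global sequence" means the predicate holds at some global state along that sequence, models a global state as a tuple of per-process event counts, and uses the hypothetical name `holds_on_every_sequence`. It searches for a global sequence that reaches the final state while avoiding every state where the predicate holds; if no such sequence exists, the predicate is strong for this run.

```python
def holds_on_every_sequence(num_events, messages, predicate):
    """Return True if every global sequence passes through a state satisfying predicate.

    num_events[i] is the number of events on process i (hypothetical encoding).
    messages is a list of ((sender, send_idx), (receiver, recv_idx)) pairs with
    1-based event indices.
    """
    def consistent(cut):
        # A receive may be included only if its matching send is included.
        return all(cut[s] >= si for (s, si), (r, ri) in messages if cut[r] >= ri)

    start = tuple(0 for _ in num_events)
    final = tuple(num_events)
    if predicate(start):
        return True

    # Depth-first search restricted to states where the predicate is false.
    stack, seen = [start], {start}
    while stack:
        cut = stack.pop()
        if cut == final:
            return False  # a complete sequence exists on which the predicate never held
        for i in range(len(cut)):
            if cut[i] < num_events[i]:
                nxt = cut[:i] + (cut[i] + 1,) + cut[i + 1:]
                if nxt not in seen and consistent(nxt) and not predicate(nxt):
                    seen.add(nxt)
                    stack.append(nxt)
    return True


if __name__ == "__main__":
    # One event per process. With a message from process 0 to process 1, the
    # state (1, 0) lies on the only global sequence; without it, it can be skipped.
    print(holds_on_every_sequence([1, 1], [((0, 1), (1, 1))], lambda c: c == (1, 0)))
    print(holds_on_every_sequence([1, 1], [], lambda c: c == (1, 0)))
```

The two calls differ only in whether the message exists, which shows how message ordering can force every global sequence through a given state.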
Distributed Algorithms for Detecting Conjunctive Predicates
- In Proc. of the IEEE International Conference on Distributed Computing Systems, 1994
- Cited by 30 (7 self)
This paper discusses efficient distributed detection of global conjunctive predicates in a distributed program. Previous work in detection of such predicates is based on a checker process. The checker process requires O(n^2 m) time and space where m is the number of messages sent or received by any process and n is the number of processes over which the predicate is defined. In this paper, we introduce token-based algorithms which distribute the computation and space requirements of the detection procedure. The distributed algorithm has O(n^2 m) time, space and message complexity, distributed such that each process performs O(nm) work. We describe another distributed algorithm with O(Nm) total work, where N is the total number of processes in the system. The relative values of n and N determine which algorithm is more efficient for a specific application.
1 Introduction
Detection of a global predicate is a fundamental problem in distributed computing. This problem arises in many ...
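To give a feel for what detecting a conjunctive predicate involves, here is a minimal centralized sketch in Python. It is written in the spirit of a checker process and is not the token-based algorithm of the paper, nor is it tuned to the complexity bounds quoted above. It assumes each process reports the vector clocks of the local states in which its local predicate holds, with component i of a clock on process i counting that process's own events; the name `detect_conjunctive` and this encoding are hypothetical. A candidate state is discarded once another process's candidate has already seen a later event of its process.

```python
def detect_conjunctive(candidates):
    """Find one local state per process, all mutually consistent, or return None.

    candidates[i] lists, in execution order, the vector clocks (tuples) of the
    local states of process i in which its local predicate holds.
    """
    n = len(candidates)
    if any(not c for c in candidates):
        return None  # some local predicate never held

    head = [0] * n  # index of the current candidate state for each process
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                vi = candidates[i][head[i]]
                vj = candidates[j][head[j]]
                # Process j's candidate already depends on a later event of
                # process i, so i's candidate cannot share a consistent cut
                # with it or with any later candidate of j.
                if vj[i] > vi[i]:
                    head[i] += 1
                    if head[i] == len(candidates[i]):
                        return None
                    changed = True
    return [candidates[i][head[i]] for i in range(n)]


if __name__ == "__main__":
    # Process 0 sends a message at its first event; process 1 receives it at
    # its first event, so process 1's state (1, 1) has seen that send.
    print(detect_conjunctive([[(0, 0), (1, 0)], [(1, 1)]]))
    print(detect_conjunctive([[(0, 0)], [(1, 1)]]))
```

In the second call the only candidate of process 0 precedes the send that process 1 has already received, so no consistent cut satisfies both local predicates.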
On detecting global predicates in distributed computations
- In Proceedings of the 21st IEEE International Conference on Distributed Computing Systems (ICDCS), 2001
- Cited by 23 (10 self)
Monitoring of global predicates is a fundamental problem in asynchronous distributed systems. This problem arises in various contexts such as design, testing and debugging, and fault-tolerance of distributed programs. In this paper, we establish that the problem of determining whether there exists a consistent cut of a computation that satisfies a predicate in k-CNF, k ≥ 2, in which no two clauses contain variables from the same process is NP-complete in general. A polynomial-time algorithm to find the consistent cut, if it exists, that satisfies the predicate for special cases is provided. We also give algorithms, albeit exponential, that can be used to achieve an exponential reduction in time over existing techniques for solving the general version. Furthermore, we present an algorithm to determine whether there exists a consistent cut of a computation for which the sum x_1 + x_2 + ... + x_n exactly equals some constant k, where each x_i is an integer variable on process p_i such that it is incremented or decremented by at most one at each step. As a corollary, any symmetric global predicate on boolean variables such as absence of simple majority and exclusive-or of local predicates can now be detected. Additionally, the problem is proved to be NP-complete if each x_i can be changed by an arbitrary amount at each step. Our results solve the previously open problems in predicate detection proposed in [7] and bridge the wide gap between the known tractability and intractability results that existed until now.
Faster Possibility Detection by Combining Two Approaches
- In Proceedings of the Workshop on Distributed Algorithms (WDAG), 1995
- Cited by 21 (2 self)
A new algorithm is presented for detecting whether a particular computation of an asynchronous distributed system satisfies Poss Φ (read “possibly Φ”), meaning the system could have passed through a global state satisfying Φ. Like the algorithm of Cooper and Marzullo, Φ may be any global state predicate; and like the algorithm of Garg and Waldecker, Poss Φ is detected quite efficiently if Φ has a certain structure. The new algorithm exploits the structure of some predicates Φ not handled by Garg and Waldecker’s algorithm to detect Poss Φ more efficiently than is possible with any algorithm that, like Cooper and Marzullo’s, evaluates Φ on every global state through which the system could have passed. A second algorithm is also presented for off-line detection of Poss Φ. It uses Strassen’s scheme for fast matrix multiplication. The intrinsic complexity of off-line and on-line detection of Poss Φ is discussed.
Re-execution of Distributed Programs to Detect Bugs Hidden by Racing Messages
- In Proceedings of the International Conference on System Sciences, 1997
- Cited by 20 (2 self)
Finding errors in non-deterministic programs is complicated by the fact that an anomaly may occur during one program execution, and not the next. Our objective is to provide a practical yet powerful testing environment for distributed systems, using re-execution. We focus on re-executing the program under a strictly different message ordering. We show that messages are grouped into waves, such that any two messages from different waves must always be received in the same order. We provide an algorithm that produces a re-execution that maximizes the number of reordered pairs of message delivery events. We also provide an efficient online algorithm for detecting racing messages.
Macrodebugging: Global Views of Distributed Program Execution
- Cited by 20 (1 self)
Creating and debugging programs for wireless embedded networks (WENs) is notoriously difficult. Macroprogramming is an emerging technology that aims to address this problem by providing high-level programming abstractions. We present MDB, the first system to support the debugging of macroprograms. MDB allows the user to set breakpoints and step through a macroprogram using a source-level debugging interface similar to GDB, a process we call macrodebugging. A key challenge of MDB is to step through a macroprogram in sequential order even though it executes on the network in a distributed, asynchronous manner. Besides allowing the user to view distributed state, MDB also provides the ability to search for bugs over the entire history of distributed states. Finally, MDB allows the user to make hypothetical changes to a macroprogram and to see the effect on distributed state without the need to redeploy, execute, and test the new code. We show that macrodebugging is both easy and efficient: MDB consumes few system resources and requires few user commands to find the cause of bugs. We also provide a lightweight version of MDB called MDB Lite that can be used during the deployment phase to reduce resource consumption while still eliminating the possibility of heisenbugs: changes in the manifestation of bugs caused by enabling or disabling the debugger.