Results 1–10 of 21
Group Communication Specifications: A Comprehensive Study
ACM Computing Surveys, 1999
Cited by 369 (14 self)
Abstract:
View-oriented group communication is an important and widely used building block for many distributed applications. Much current research has been dedicated to specifying the semantics and services of view-oriented Group Communication Systems (GCSs). However, the guarantees of different GCSs are formulated using varying terminologies and modeling techniques, and the specifications vary in their rigor. This makes it difficult to analyze and compare the different systems. This paper provides a comprehensive set of clear and rigorous specifications, which may be combined to represent the guarantees of most existing GCSs. In the light of these specifications, over thirty published GCS specifications are surveyed. Thus, the specifications serve as a unifying framework for the classification, analysis and comparison of group communication systems. The survey also discusses over a dozen different applications of group communication systems, shedding light on the usefulness of the p...
Specifying and Using a Partitionable Group Communication Service
ACM Transactions on Computer Systems, 1997
Cited by 113 (21 self)
Abstract:
Group communication services are becoming accepted as effective building blocks for the construction of fault-tolerant distributed applications. Many specifications for group communication services have been proposed. However, there is still no agreement about what these specifications should say, especially in cases where the services are partitionable, that is, where communication failures may lead to simultaneous creation of groups with disjoint memberships, such that each group is unaware of the existence of any other group. In this paper ...
Work-competitive scheduling for cooperative computing with dynamic groups
SIAM Journal on Computing, 2005
Cited by 18 (5 self)
Abstract:
The problem of cooperatively performing a set of t tasks in a decentralized computing environment subject to failures is one of the fundamental problems in distributed computing. The setting with partitionable networks is especially challenging, as algorithmic solutions must accommodate the possibility that groups of processors become disconnected (and, perhaps, reconnected) during the computation. The efficiency of task-performing algorithms is often assessed in terms of work: the total number of tasks, counting multiplicities, performed by all of the processors during the computation. In general, the scenario where the processors are partitioned into g disconnected components causes any task-performing algorithm to have work Ω(t · g) even if each group of processors performs no more than the optimal number of Θ(t) tasks. Given that such pessimistic lower bounds apply to any scheduling algorithm, we pursue a competitive analysis. Specifically, this paper studies a simple randomized scheduling algorithm for p asynchronous processors, connected by a dynamically changing communication medium, to complete t known tasks. The performance of this algorithm is compared against that of an omniscient offline algorithm with full knowledge of the future changes in the communication medium. The paper describes a notion of computation width, which associates a natural number with a history of changes in the communication medium, and shows both upper and lower bounds on work-competitiveness in terms of this quantity. Specifically, it is shown that the simple randomized algorithm obtains the competitive ratio (1 + cw/e), where cw is the computation width and e is the base of the natural logarithm (e = 2.7182...); this competitive ratio is then shown to be tight.
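As an illustrative aside (a sketch of the quantities in this abstract, not code from the cited paper; the function names are mine), the Ω(t · g) observation and the (1 + cw/e) ratio amount to simple arithmetic:

```python
import math

def partitioned_work(t, g):
    """Each of the g disconnected groups is unaware of the others, so even a
    group that schedules perfectly must perform all t tasks itself; total
    work therefore counts every task once per group, i.e. t * g."""
    return g * t

def competitive_ratio(cw):
    """The tight competitive ratio stated in the abstract for the simple
    randomized scheduler, as a function of the computation width cw."""
    return 1 + cw / math.e

# 3 disconnected groups on 20 tasks: work t*g even at Theta(t) per group.
assert partitioned_work(20, 3) == 60
# A history of computation width e gives ratio exactly 2.
assert competitive_ratio(math.e) == 2.0
```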
Distributed Cooperation during the Absence of Communication
2001
Cited by 17 (9 self)
Abstract:
This paper presents a study of a distributed cooperation problem under the assumption that processors may not be able to communicate for a prolonged time. The problem for n processors is defined in terms of t tasks that need to be performed efficiently and that are known to all processors. The results of this study characterize the ability of the processors to schedule their work so that when some processors establish communication, the wasted (redundant) work these processors have collectively performed prior to that time is controlled. The lower bound for wasted work presented here shows that for any set of schedules there are two processors such that when they complete t1 and t2 tasks respectively the number of redundant tasks is Ω(t1t2/t). For n = t and for schedules longer than √n, the number of redundant tasks for two or more processors must be at least 2. The upper bound on pairwise waste for schedules of length √n is shown to be 1. Our efficient deterministic schedule construction is motivated by design theory. To obtain linear length schedules, a novel deterministic and efficient construction is given. This construction has the property that pairwise wasted work increases gracefully as processors progress through their schedules. Finally our analysis of a random scheduling solution shows that with high probability pairwise waste is well behaved at all times: specifically, two processors having completed t1 and t2 tasks, respectively, are guaranteed to have no more than t1t2/t + δ redundant tasks, where δ = O(log n + √((t1t2/t) log n)).
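The t1t2/t quantity above can be sanity-checked numerically: if two isolated processors each follow an independent uniformly random schedule, their completed prefixes of lengths t1 and t2 overlap in t1t2/t tasks in expectation. The Monte Carlo sketch below is my own illustration, not the paper's construction:

```python
import random

def redundant_tasks(t, t1, t2, rng):
    """Two processors work in isolation, each following an independent
    uniformly random permutation of the t tasks.  After completing t1 and
    t2 tasks respectively, the redundant work is the overlap of the two
    schedule prefixes (each prefix is a uniform random subset)."""
    s1 = rng.sample(range(t), t1)  # prefix of a random schedule
    s2 = rng.sample(range(t), t2)
    return len(set(s1) & set(s2))

rng = random.Random(1)
t, t1, t2, trials = 100, 50, 40, 20000
mean = sum(redundant_tasks(t, t1, t2, rng) for _ in range(trials)) / trials
# The sample mean should sit close to t1*t2/t = 20, the quantity the
# paper's upper and lower bounds are stated against.
assert abs(mean - t1 * t2 / t) < 0.5
```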
Cooperative Computing with Fragmentable and Mergeable Groups
J. Discrete Algorithms, 2000
Cited by 16 (6 self)
Abstract:
This work considers the problem of performing a set of N tasks on a set of P cooperating message-passing processors (P ≤ N). The processors use a group communication service (GCS) to coordinate their activity in the setting where dynamic changes in the underlying network topology cause the processor groups to change over time. GCSs have been recognized as effective building blocks for fault-tolerant applications in such settings. Our results explore the efficiency of fault-tolerant cooperative computation using GCSs. Prior investigation of this area by Dolev et al. [8] focused on competitive lower bounds, non-redundant task allocation schemes and work-efficient algorithms in the presence of fragmentation regroupings. In this work we investigate work-efficient and message-efficient algorithms for fragmentation and merge regroupings. We present an algorithm that uses GCSs and implements a coordinator-based strategy. This algorithm is motivated by the results in [8]. It achieves similar work complexity of O(N · f + N) for fragmentations, where f is the number of new groups created by dynamic fragmentations.
The Complexity of Synchronous Iterative Do-All with Crashes
2001
Cited by 14 (5 self)
Abstract:
Do-All is the problem of performing N tasks in a distributed system of P failure-prone processors [9]. Many distributed and parallel algorithms have been developed for this basic problem and several algorithm simulations have been developed by iterating Do-All algorithms. The efficiency of the solutions for Do-All is measured in terms of work complexity where all processing steps taken by the processors are counted. Work is ideally expressed as a function of N, P, and f, the number of processor crashes. However, the known lower bounds and the upper bounds for extant algorithms do not adequately show how work depends on f. We present the first nontrivial lower bounds for Do-All that capture the dependence of work on N, P and f. For the model of computation where processors are able to make perfect load-balancing decisions locally, we also present matching upper bounds. Thus we give the first complete analysis of Do-All for this model. We define the r-iterative Do-All problem that abstracts the repeated use of Do-All such as found in algorithm simulations. Our f-sensitive analysis enables us to derive a tight bound for r-iterative Do-All work (that is stronger than the r-fold work complexity of a single Do-All). Our approach that models perfect load-balancing allows for the analysis of specific algorithms to be divided into two parts: (i) the analysis of the cost of tolerating failures while performing work, and (ii) the analysis of the cost of implementing load-balancing. We demonstrate the utility and generality of this approach by improving the analysis of two known efficient algorithms. We give an improved analysis of an efficient message-passing algorithm (algorithm AN [5]). We also derive a new and complete analysis of the best known Do-All algorithm for...
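As a toy model of the perfect-load-balancing setting described above (my own sketch under simplifying assumptions, not the algorithms analyzed in the paper), each synchronous round the alive processors attempt distinct remaining tasks, and the steps of processors that crash mid-round are wasted work:

```python
def do_all_work(n_tasks, p, crashes_per_round):
    """Toy synchronous Do-All with an idealized global load balancer: each
    round the alive processors attempt distinct remaining tasks (one task
    per processor); processors that crash this round do so before their
    task takes effect, so their steps count as work but complete nothing.
    Returns (total work, tasks left over when the rounds run out)."""
    remaining, alive, work = n_tasks, p, 0
    for crashed in crashes_per_round:
        if remaining == 0 or alive == 0:
            break
        attempting = min(alive, remaining)
        work += attempting                 # every attempted step is work
        lost = min(crashed, attempting)    # crashed attempts are wasted
        remaining -= attempting - lost
        alive -= crashed
    return work, remaining

# Failure-free run: work is exactly N.
assert do_all_work(100, 10, [0] * 10) == (100, 0)
# One crash in the first round wastes exactly one step: work N + 1.
assert do_all_work(100, 10, [1] + [0] * 12) == (101, 0)
```

Even this crude model makes the abstract's point visible: work grows with the crash count f beyond the failure-free cost, which is what an f-sensitive analysis quantifies.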
Optimal Scheduling for Disconnected Cooperation
2001
Cited by 9 (3 self)
Abstract:
We consider a distributed environment consisting of n processors that need to perform t tasks. We assume that communication is initially unavailable and that processors begin work in isolation. At some unknown point of time an unknown collection of processors may establish communication. Before processors begin communication they execute tasks in the order given by their schedules. Our goal is to schedule work of isolated processors so that when communication is established for the first time, the number of redundantly executed tasks is controlled. We quantify worst case redundancy as a function of processor advancements through their schedules. In this work we refine and simplify an extant deterministic construction for schedules with n ≤ t, and we develop a new analysis of its waste. The new analysis shows that for any pair of schedules, the number of redundant tasks can be controlled for the entire range of t tasks. Our new result is asymptotically optimal: the tails of these schedules are within a 1 + O(n^{-1/4}) factor of the lower bound. We also present two new deterministic constructions, one for t ≥ n, and the other for t ≥ n^{3/2}, which substantially improve pairwise waste for all prefixes of length t/√n, and offer near optimal waste for the tails of the schedules. Finally, we present bounds for waste of any collection of k ≥ 2 processors for both deterministic and randomized constructions.
The Do-All Problem with Byzantine Processor Failures
Theoretical Computer Science, 2003
Cited by 9 (3 self)
Abstract:
Do-All is the abstract problem of using n processors to cooperatively perform m independent tasks in the presence of failures. This problem can be used as the cornerstone in identifying aspects of the tradeoff between efficiency and fault-tolerance in cooperative computing and in developing efficient and fault-tolerant algorithms for distributed cooperative applications. Many algorithms have been developed for Do-All in various models of computation, including message-passing, partitionable networks, and shared-memory models and under various failure models. However, to the best of our knowledge, Do-All has not been studied under Byzantine processor failures, where a faulty processor may exhibit completely unconstrained behavior. Byzantine failures model any arbitrary type of processor malfunction, including for example, failures of individual components within the processors.
The Complexity of Distributed Cooperation in the Presence of Failures
In Proceedings of the International Conference on Principles of Distributed Computing, 2000
Cited by 6 (3 self)
Abstract:
We consider the problem of performing N tasks in a distributed system of P processors that are subject to failures. An optimal solution would have the system perform Θ(N) tasks, however extant results show that this is impossible (when N = P) in the presence of an omniscient failure-inducing adversary. In this work we present complexity results for this problem for the case when the processors are assisted by a perfect load-balancing oracle in an attempt to foil the adversary. We generalize several existing results for this setting and we present new bounds for this problem that shed light on the behavior of deterministic distributed cooperation algorithms for the cases when the adversary imposes comparatively few failures. When a solution for the cooperation problem is implemented using a global load-balancing strategy, our results can be used to explain the work complexity of the solution. In particular, if the complexity of implementing the load-balancing oracle is known for a given model of computation, then our results can be directly translated into complexity results for that model.
Challenges in evaluating distributed algorithms
In Proceedings of the International Workshop on Future Directions in Distributed Computing, 2002