Results 11 - 20 of 1,208
Concurrent Online Tracking of Mobile Users
- J. ACM, 1991
"... This paper deals with the problem of maintaining a distributed directory server, that enables us to keep track of mobile users in a distributed network in the presence of concurrent requests. The paper uses the graph-theoretic concept of regional matching for implementing efficient tracking mechanis ..."
Abstract - Cited by 231 (7 self)
This paper deals with the problem of maintaining a distributed directory server that enables us to keep track of mobile users in a distributed network in the presence of concurrent requests. The paper uses the graph-theoretic concept of regional matching to implement efficient tracking mechanisms. The communication overhead of our tracking mechanism is within a polylogarithmic factor of the lower bound.
Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail
- Distributed Computing, 1994
"... The paper shows that characterizing the causal relationship between significant events is an important but non-trivial aspect for understanding the behavior of distributed programs. An introduction to the notion of causality and its relation to logical time is given; some fundamental results concern ..."
Abstract - Cited by 230 (3 self)
The paper shows that characterizing the causal relationship between significant events is an important but non-trivial aspect of understanding the behavior of distributed programs. An introduction to the notion of causality and its relation to logical time is given, and some fundamental results concerning the characterization of causality are presented. Recent work on the detection of causal relationships in distributed computations is surveyed. The issue of observing distributed computations in a causally consistent way and the basic problems of detecting global predicates are discussed. To illustrate the major difficulties, some typical monitoring and debugging approaches are assessed, and it is demonstrated how their feasibility is severely limited by the fundamental problem of mastering the complexity of causal relationships.
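The abstract refers to causality and its relation to logical time without giving a construction. As a minimal illustration (not taken from the paper), the vector-clock sketch below shows the standard way the happened-before relation between events is characterized; all names are illustrative.

```python
# Minimal vector-clock sketch (illustrative): each process keeps one counter
# per process; comparing two timestamps decides the causal relation.

def new_clock(n):
    """Vector timestamp for a system of n processes, all counters at zero."""
    return [0] * n

def tick(clock, pid):
    """Advance the local component for an internal event or before a send."""
    clock[pid] += 1

def merge_on_receive(clock, msg_clock, pid):
    """On receipt, take the componentwise maximum, then advance the local slot."""
    for i, v in enumerate(msg_clock):
        clock[i] = max(clock[i], v)
    clock[pid] += 1

def happened_before(a, b):
    """True iff the event stamped a causally precedes the event stamped b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a, b):
    """Neither event causally precedes the other."""
    return not happened_before(a, b) and not happened_before(b, a)
```

Incomparable (concurrent) timestamps are exactly the cases that make the monitoring and debugging approaches assessed in the paper difficult.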
Building Secure and Reliable Network Applications
1996
"... ly, the remote procedure call problem, which an RPC protocol undertakes to solve, consists of emulating LPC using message passing. LPC has a number of "properties" -- a single procedure invocation results in exactly one execution of the procedure body, the result returned is reliably deliv ..."
Abstract - Cited by 230 (16 self)
Abstractly, the remote procedure call problem, which an RPC protocol undertakes to solve, consists of emulating LPC using message passing. LPC has a number of "properties": a single procedure invocation results in exactly one execution of the procedure body, the result returned is reliably delivered to the invoker, and exceptions are raised if (and only if) an error occurs. Given a completely reliable communication environment, which never loses, duplicates, or reorders messages, and given client and server processes that never fail, RPC would be trivial to solve. The sender would merely package the invocation into one or more messages and transmit these to the server. The server would unpack the data into local variables, perform the desired operation, and send back the result (or an indication of any exception that occurred) in a reply message. The challenge, then, is created by failures. Were it not for the possibility of process and machine crashes, an RPC protocol capable of overcomi...
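The abstract's failure-free scenario translates almost directly into code. The sketch below illustrates only that trivial case under the stated assumptions (a channel that never loses, duplicates, or reorders messages, and processes that never fail); the queues, encoding, and names are hypothetical, not the book's implementation.

```python
# Trivial RPC under the failure-free assumption: a perfectly reliable
# in-process "channel" (a queue) and client/server code that never crashes.
import pickle
import queue
import threading

request_channel = queue.Queue()   # client -> server
reply_channel = queue.Queue()     # server -> client

def server(procedures):
    """Serve requests forever: unpack, execute exactly once, reply."""
    while True:
        name, args = pickle.loads(request_channel.get())   # unpack into locals
        try:
            result = ("ok", procedures[name](*args))        # one execution
        except Exception as exc:                            # report exceptions
            result = ("error", repr(exc))
        reply_channel.put(pickle.dumps(result))             # reliable reply

def rpc(name, *args):
    """Client stub: package the invocation, send it, wait for the reply."""
    request_channel.put(pickle.dumps((name, args)))
    status, value = pickle.loads(reply_channel.get())
    if status == "error":
        raise RuntimeError(value)
    return value

# Usage sketch:
# threading.Thread(target=server,
#                  args=({"add": lambda a, b: a + b},), daemon=True).start()
# rpc("add", 2, 3)  # -> 5
```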
The Performance of Consistent Checkpointing
- In Proceedings of the 11th Symposium on Reliable Distributed Systems, 1992
"... Consistent checkpointing provides transparent fault tolerance for long-running distributed applications. In this paper we describe performance measurements of an implementation of consistent checkpointing. Our measurements show that consistent checkpointing performs remarkably well. We executed eigh ..."
Abstract - Cited by 229 (10 self)
Consistent checkpointing provides transparent fault tolerance for long-running distributed applications. In this paper we describe performance measurements of an implementation of consistent checkpointing. Our measurements show that consistent checkpointing performs remarkably well. We executed eight compute-intensive distributed applications on a network of 16 diskless Sun-3/60 workstations, comparing the performance without checkpointing to the performance with consistent checkpoints taken at 2-minute intervals. For six of the eight applications, the running time increased by less than 1% as a result of the checkpointing. The highest overhead measured for any of the applications was 5.8%. Incremental checkpointing and copy-on-write checkpointing were the most effective techniques in lowering the running time overhead. These techniques reduce the amount of data written to stable storage and allow the checkpoint to proceed concurrently with the execution of the processes. The overhead ...
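The abstract credits copy-on-write checkpointing with letting the checkpoint proceed concurrently with execution. One common way to realize that on Unix, shown below purely as an illustration (the paper's actual implementation is not reproduced here), is to fork() so the child holds a copy-on-write snapshot and writes it to stable storage while the parent keeps computing.

```python
# Copy-on-write checkpointing sketch (Unix only, illustrative): fork() gives
# the child a copy-on-write snapshot of the parent's memory, so the snapshot
# can be written to stable storage while the parent keeps computing.
import os
import pickle

def checkpoint(state, path="checkpoint.bin"):
    """Return immediately in the parent; the child persists the snapshot."""
    pid = os.fork()
    if pid == 0:                      # child: owns a frozen view of the state
        with open(path, "wb") as f:
            pickle.dump(state, f)     # write the snapshot to stable storage
        os._exit(0)                   # exit without parent cleanup handlers
    return pid                        # parent: resume the computation at once

# Usage sketch: pid = checkpoint(app_state); ...; os.waitpid(pid, 0) later.
```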
Condor and the Grid
"... Since 1984, the Condor project has helped ordinary users to do extraordinary computing. Today, the project continues to explore the social and technical problems of cooperative computing on scales ranging from the desktop to the world-wide computational grid. In this chapter, we provide the history ..."
Abstract - Cited by 227 (37 self)
Since 1984, the Condor project has helped ordinary users to do extraordinary computing. Today, the project continues to explore the social and technical problems of cooperative computing on scales ranging from the desktop to the world-wide computational grid. In this chapter, we provide the history and philosophy of the Condor project and describe how it has interacted with other projects and evolved along with the field of distributed computing. We outline the core components of the Condor system and describe how the technology of computing must reflect the sociology of communities. Throughout, we reflect on the lessons of experience and chart the course travelled by research ideas as they grow into production systems.
CoCheck: Checkpointing and Process Migration for MPI
- In Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), 1996
"... Checkpointing of parallel applications can be used as the core technology to provide process migration. Both, checkpointing and migration, are an important issue for parallel applications on networks of workstations. The CoCheck environment which we present in this paper introduces a new approach to ..."
Abstract - Cited by 224 (4 self)
Checkpointing of parallel applications can be used as the core technology to provide process migration. Both checkpointing and migration are important issues for parallel applications on networks of workstations. The CoCheck environment which we present in this paper introduces a new approach to providing checkpointing and migration for parallel applications. In contrast to existing systems, CoCheck sits on top of the message passing library rather than inside it, and achieves consistency at a level above the message passing system. It uses an existing single-process checkpointer which is available for a wide range of systems. Hence, CoCheck can be easily adapted both to different message passing systems and to new machines.
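The abstract's key idea, achieving consistency above the message-passing layer before invoking an existing single-process checkpointer, can be sketched as a drain protocol: each process announces it will send nothing more, buffers application messages still in flight, and checkpoints once every peer's channel is empty. The code below is a hypothetical illustration of that idea over mpi4py (assuming ordered channels and a spare tag), not CoCheck's actual interface.

```python
# Drain-then-checkpoint sketch (illustrative, not CoCheck's API).
from mpi4py import MPI

READY_TAG = 999          # control tag assumed unused by the application

def drain_and_checkpoint(comm, save_local_state, pending):
    """Flush in-flight application messages, then checkpoint locally.

    pending: list collecting application messages drained during the
    protocol so they can be redelivered after the checkpoint."""
    rank, size = comm.Get_rank(), comm.Get_size()
    for peer in range(size):
        if peer != rank:
            comm.send(None, dest=peer, tag=READY_TAG)   # "nothing more from me"
    seen = 0
    status = MPI.Status()
    while seen < size - 1:                              # until every peer is drained
        msg = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == READY_TAG:
            seen += 1                                   # that peer's channel is empty
        else:
            pending.append((status.Get_source(), msg))  # buffer a late app message
    save_local_state()    # e.g. hand off to a single-process checkpointer
```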
Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing
1988
"... In a distributed system using message logging and checkpointing to provide fault tolerance, there is always a unique maximum recoverable system state, regardless of the message logging protocol used. The proof of this relies on the observation that the set of system states that have occurred during ..."
Abstract - Cited by 224 (14 self)
In a distributed system using message logging and checkpointing to provide fault tolerance, there is always a unique maximum recoverable system state, regardless of the message logging protocol used. The proof of this relies on the observation that the set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. The maximum recoverable system state never decreases, and if all messages are eventually logged, the domino effect cannot occur. This paper presents a general model for reasoning about recovery in such a system and, based on this model, an efficient algorithm for determining the maximum recoverable system state at any time. This work unifies existing approaches to fault tolerance based on message logging and checkpointing, and improves on existing methods for optimistic recovery in distributed systems.
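As an illustration of the abstract's central objects (not the paper's efficient algorithm), the sketch below encodes each process state's causal dependencies as a vector and treats a global state as recoverable when no component depends on a state beyond the cut; by the paper's lattice result, the componentwise join of all such cuts is itself consistent and is the maximum recoverable state. The data layout and names are assumptions made here for the example.

```python
# Illustrative encoding (not the paper's algorithm): dep[i][k][j] = latest
# state index of process j that state k of process i causally depends on.
# A cut (one state index per process) is consistent when no chosen state
# depends on a state beyond the cut.
from itertools import product

def consistent(cut, dep):
    n = len(cut)
    return all(dep[i][cut[i]][j] <= cut[j] for i in range(n) for j in range(n))

def max_recoverable(recoverable, dep):
    """recoverable[i] = state indices of process i that are checkpointed or
    replayable from logged messages. Returns the componentwise join of all
    consistent recoverable cuts; by the lattice result this join is itself
    consistent, so it is the maximum recoverable system state."""
    best = None
    for cut in product(*recoverable):
        if consistent(cut, dep):
            best = cut if best is None else tuple(max(c, b) for c, b in zip(cut, best))
    return best
```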
Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback and Fast Output Commit
- IEEE Transactions on Computers, 1992
"... Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message loggin ..."
Abstract - Cited by 209 (11 self)
Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message logging, namely limited rollback and fast output commit, and the advantage of optimistic message logging, namely low failure-free overhead. These advantages come at the expense of a complex recovery scheme.
Consistent Detection of Global Predicates
- In Proceedings of the ACM/ONR Workshop on Parallel and Distributed Debugging, 1991
"... A fundamental problem in debugging and monitoring is detecting whether the state of a system satisfies some predicate. If the system is distributed, then the resulting uncertainty in the state of the system makes such detection, in general, ill-defined. This paper presents three algorithms for detec ..."
Abstract - Cited by 165 (3 self)
A fundamental problem in debugging and monitoring is detecting whether the state of a system satisfies some predicate. If the system is distributed, then the resulting uncertainty in the state of the system makes such detection, in general, ill-defined. This paper presents three algorithms for detecting global predicates in a well-defined way. These algorithms do so by interpreting predicates with respect to the communication that has occurred in the system. Briefly, the first algorithm determines that the predicate was possibly true at some point in the past; the second determines that the predicate was definitely true in the past; while the third algorithm establishes that the predicate is currently true, but to do so it may delay the execution of certain processes. Our approach is in contrast to the considerable body of work that uses temporal predicates (i.e., predicates expressed over process histories) for distributed monitoring. Temporal predicates are more powerful, but also more complex to use. In many cases, the condition that the programmer wishes to monitor is simply and intuitively viewed as a predicate over the "instantaneous" state of the system. Using the possibly/definitely/currently interpretation, such a predicate becomes well-defined, without requiring it to be recast using temporal formulas. Further, our algorithms may be more efficient than techniques that use a notion of explicit time or process histories.
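To make the "possibly" interpretation concrete, the naive sketch below (an illustration, not the authors' three algorithms) assumes each recorded local state carries a vector timestamp and enumerates combinations of local states, accepting a combination as a consistent global state when no process has observed more of another process than that process itself has.

```python
# Naive "possibly(phi)" sketch (illustrative): histories[i] = list of
# (local_state, vector_timestamp) pairs for process i. A combination of
# local states forms a consistent global state iff, for all i and j,
# stamps[i][i] >= stamps[j][i].
from itertools import product

def consistent_cut(stamps):
    n = len(stamps)
    return all(stamps[i][i] >= stamps[j][i] for i in range(n) for j in range(n))

def possibly(histories, predicate):
    """True iff predicate held in some consistent global state of the run."""
    for combo in product(*histories):
        states = [s for s, _ in combo]
        stamps = [v for _, v in combo]
        if consistent_cut(stamps) and predicate(states):
            return True
    return False

# Example call (hypothetical data): possibly(histories, lambda xs: sum(xs) == 0)
```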