An Integrated Experimental Environment for Distributed Systems and Networks
- In Proc. of the Fifth Symposium on Operating Systems Design and Implementation
, 2002
Cited by 688 (41 self)
Three experimental environments traditionally support network and distributed systems research: network emulators, network simulators, and live networks. The continued use of multiple approaches highlights both the value and inadequacy of each. Netbed, a descendant of Emulab, provides an experimentation facility that integrates these approaches, allowing researchers to configure and access networks composed of emulated, simulated, and wide-area nodes and links. Netbed's primary goals are ease of use, control, and realism, achieved through consistent use of virtualization and abstraction.
NFTAPE: A Framework for Assessing Dependability in Distributed Systems with Lightweight Fault Injectors
- In Proceedings of the IEEE International Computer Performance and Dependability Symposium
, 2000
Cited by 50 (0 self)
Many fault injection tools are available for dependability assessment. Although these tools are good at injecting a single fault model into a single system, they suffer from two main limitations for use in distributed systems: (1) no single tool is sufficient for injecting all necessary fault models; (2) it is difficult to port these tools to new systems. NFTAPE, a tool for composing automated fault injection experiments from available lightweight fault injectors, triggers, monitors, and other components, helps to solve these problems. We have conducted experiments using NFTAPE with several types of lightweight fault injectors, including driver-based, debugger-based, target-specific, simulation-based, hardware-based, and performance-fault injections. Two example experiments are described in this paper. The first uses a hardware fault injector with a Myrinet LAN; the other uses a Software Implemented Fault Injection (SWIFI) fault injector to target a space-imaging application. Keywords...
ORCHESTRA: A Fault Injection Environment for Distributed Systems
- In 26th International Symposium on Fault-Tolerant Computing (FTCS
, 1996
Cited by 43 (2 self)
This paper reports on ORCHESTRA, a portable fault injection environment for testing implementations of distributed protocols. The paper focuses on architectural features of ORCHESTRA that provide portability, minimize intrusiveness on target protocols, and support testing of real-time systems. ORCHESTRA is based on a simple yet powerful framework, called script-driven probing and fault injection, for the evaluation and validation of the fault-tolerance and timing characteristics of distributed protocols. ORCHESTRA was initially developed on the Real-Time Mach operating system and later ported to other platforms including Solaris and SunOS, and has been used to conduct extensive experiments on several protocol implementations. A novel feature of the Real-Time Mach implementation of ORCHESTRA is that it utilizes certain features of the Real-Time Mach operating system to quantify and compensate for intrusiveness of the fault injection mechanism. In addition to describing the overall orc...
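The idea of script-driven probing and fault injection can be illustrated with a rough sketch. It is not the tool's actual interface: the names `Message`, `FaultInjectionLayer`, and `drop_every_nth_ack` are hypothetical, and the "script" is assumed to be a plain Python function that inspects each outgoing message and either passes it on or drops it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Message:
    kind: str   # e.g. "DATA" or "ACK" (illustrative message types)
    seq: int


# Assumption: a script is a function that returns the message to forward,
# or None to drop it. The real tool's script format is not reproduced here.
Script = Callable[[Message], Optional[Message]]


@dataclass
class FaultInjectionLayer:
    """Hypothetical layer inserted between a protocol and the network."""
    send_script: Script
    log: List[str] = field(default_factory=list)

    def send(self, msg: Message, deliver: Callable[[Message], None]) -> None:
        out = self.send_script(msg)
        if out is None:
            # The script chose to inject a message-loss fault.
            self.log.append(f"dropped {msg.kind}#{msg.seq}")
            return
        self.log.append(f"passed {out.kind}#{out.seq}")
        deliver(out)


def drop_every_nth_ack(n: int) -> Script:
    """Build a script that drops every n-th ACK and passes everything else."""
    count = 0

    def script(msg: Message) -> Optional[Message]:
        nonlocal count
        if msg.kind == "ACK":
            count += 1
            if count % n == 0:
                return None
        return msg

    return script


delivered: List[Message] = []
layer = FaultInjectionLayer(send_script=drop_every_nth_ack(2))
for m in [Message("ACK", 1), Message("ACK", 2), Message("DATA", 3),
          Message("ACK", 4), Message("ACK", 5)]:
    layer.send(m, delivered.append)
```

Because the target protocol only calls `send`, the fault behaviour lives entirely in the script, which is one way to keep the code under test unmodified.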
Experiments on Six Commercial TCP Implementations Using a Software Fault Injection Tool
- Software Practice and Experience
, 1997
Cited by 40 (2 self)
TCP, the de facto standard transport protocol in today’s operating systems, is a very robust protocol that adapts to various network characteristics, packet loss, link congestion, and even significant differences in vendor implementations. This paper describes a set of experiments performed on six different vendor TCP implementations using ORCHESTRA, a tool for testing and fault injection of communication protocols. These experiments uncovered violations of the TCP protocol specification, and illustrated differences in the philosophies of various vendors in their implementations of TCP. The paper summarizes several lessons learned about the TCP implementations through these experiments. KEY WORDS: TCP; distributed systems; communication protocols; fault injection tool; protocol testing
Loki: A State-Driven Fault Injector for Distributed Systems
, 2000
Cited by 32 (5 self)
Distributed applications can fail in subtle ways that depend on the state of multiple parts of a system. This complicates the validation of such systems via fault injection, since it suggests that faults should be injected based on the global state of the system. In Loki, fault injection is performed based on a partial view of the global state of a distributed system, i.e., faults injected in one node of the system can depend on the state of other nodes. Once faults are injected, a post-runtime analysis, using off-line clock synchronization, is used to place events and injections on a single global timeline and to determine whether the intended faults were properly injected. Finally, experiments containing successful fault injections are used to estimate the specified measures. In addition to reviewing briefly the concepts behind Loki and its organization, we detail Loki's user interface. In particular, we describe the graphical user interfaces for specifying state machines and faults, for executing a campaign, and for verifying whether the faults were properly injected.
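The post-runtime check described in this abstract can be sketched roughly as follows. This is a simplified illustration, not Loki's implementation: the `Event` record, the per-node constant clock offsets, and the function names are assumptions, and real off-line clock synchronization would yield bounds on each timestamp rather than exact values.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    node: str
    local_time: float   # timestamp read from the node's own clock
    kind: str           # "state" (state change) or "inject" (fault injection)
    value: str          # new state name, or name of the injected fault


def to_global(events: List[Event],
              offsets: Dict[str, float]) -> List[Tuple[float, Event]]:
    """Place all events on one timeline.

    Assumption: each node's clock differs from the reference clock by a
    known constant offset, so global_time = local_time - offset.
    """
    timeline = [(e.local_time - offsets[e.node], e) for e in events]
    return sorted(timeline, key=lambda pair: pair[0])


def injection_was_correct(events: List[Event],
                          offsets: Dict[str, float],
                          fault: str,
                          required: Dict[str, str]) -> bool:
    """Check that `fault` was injected while every node named in `required`
    was in its required state (a partial view of the global state)."""
    current: Dict[str, str] = {}
    for _, e in to_global(events, offsets):
        if e.kind == "state":
            current[e.node] = e.value
        elif e.kind == "inject" and e.value == fault:
            return all(current.get(n) == s for n, s in required.items())
    return False  # the fault never appears in the logs


events = [
    Event("A", 10.0, "state", "LEADER"),
    Event("B", 12.5, "state", "FOLLOWER"),
    Event("B", 13.0, "inject", "crash"),
    Event("A", 11.5, "state", "IDLE"),
]
ok = injection_was_correct(events, {"A": 0.0, "B": 2.0}, "crash", {"A": "LEADER"})
late = injection_was_correct(events, {"A": 0.0, "B": 0.0}, "crash", {"A": "LEADER"})
```

With node B's clock running 2.0 ahead, the crash lands at global time 11.0 while A is still `LEADER`, so `ok` is true. If the offset is ignored, the injection appears to occur after A has moved to `IDLE`, so `late` is false and the experiment would be discarded.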
A Global-State-Triggered Fault Injector for Distributed System Evaluation
, 2002
Cited by 17 (2 self)
Validation of the dependability of distributed systems via fault injection is gaining importance, because distributed systems are being increasingly used in environments with high dependability requirements. The fact that distributed systems can fail in subtle ways that depend on the state of multiple parts of the system suggests that a global-state-based fault injection mechanism should be used to validate them. However, global-state-based fault injection is challenging, since it is very difficult in practice to maintain the global state of a distributed system at runtime with minimal intrusion into the system execution. This paper presents Loki, a global-state-based fault injector, which has been designed with the goals of low intrusion, high precision, and high
Centralized Failure Injection for Distributed, Fault-Tolerant Protocol Testing
- In Proceedings of the 17th IEEE International Conference on Distributed Computing Systems (ICDCS’97)
, 1997
Cited by 15 (1 self)
We describe a centralized approach to testing that distributed fault-tolerant protocols satisfy their safety and timeliness specifications in the presence of the very failures they are designed to tolerate. Cesium is a testing environment based on the centralized simulation of distributed executions and failures. Processes are run in a single address space while providing the appearance of a truly distributed execution. The human tester can force the occurrence of arbitrary failures and security attacks. The implementations under test are not instrumented for testing purposes, and their source codes need not be available. We prove that Cesium can execute exactly the set of runs feasible in the real distributed system being simulated. We also show that there are safety and timeliness properties in the specifications of many existing distributed protocols that cannot be tested in practical distributed systems. All of these properties can, however, be accurately tested by Cesium without ...
Orchestra: A probing and fault injection environment for testing protocol implementations
- In Proc. of Computer Performance and Dependability Symposium
, 1996
Experimental Evaluation of the Unavailability Induced by a Group Membership Protocol
- In Dependable Computing EDCC-4: Proceedings of the 4th European Dependable Computing Conference
, 2002
Cited by 10 (3 self)
Group communication is an important paradigm for building highly available distributed systems. However, group membership operations often require the system to block message traffic, causing system services to become unavailable. This makes it important to quantify the unavailability induced by membership operations. This paper experimentally evaluates the blocking behavior of the group membership protocol of the Ensemble group communication system using a novel global-state-based fault injection technique. In doing so, we demonstrate how a layered distributed protocol such as the Ensemble group membership protocol can be modeled in terms of a state machine abstraction, and show how the resulting global state space can be used to specify fault triggers and define important measures on the system. Using this approach, we evaluate the cost associated with important states of the protocol under varying workload and group size. We also evaluate the sensitivity of the protocol to the occurrence of a second correlated crash failure during its operation.
Group Communication Protocols under Errors
- in 1992 (Swiss Federal Statistical Office
, 2003
Cited by 9 (2 self)
This paper provides a systematic, experimental study of GCS protocols under a variety of error models. The targeted system is Ensemble [7], which is a popular GCS developed at Cornell University. Ensemble was written in the OCAML dialect of the ML language so that it would be amenable to automated proof checking; an automated construction of its formal specifications from the source code is presented in [8]. To the best of our knowledge, no previous work has pursued a thorough characterization of Ensemble's behavior under real errors; notwithstanding, the system has been widely used