AQuA: An Adaptive Architecture that Provides Dependable Distributed Objects
, 1998
Cited by 119 (19 self)
Building dependable distributed systems from commercial off-the-shelf components is of growing practical importance. For both cost and production reasons, there is interest in approaches and architectures that facilitate building such systems. The AQuA architecture is one such approach; its goal is to provide adaptive fault tolerance to CORBA applications by replicating objects. The AQuA architecture allows application programmers to request desired levels of dependability during applications' runtimes. It provides fault tolerance mechanisms to ensure that a CORBA client can always obtain reliable services, even if the CORBA server object that provides the desired services suffers from crash failures and value faults. AQuA includes a replicated dependability manager that provides dependability management by configuring the system in response to applications’ requests and changes in system resources due to faults. It uses Maestro/Ensemble to provide group communication services. It contains a gateway to intercept standard CORBA IIOP messages to allow any
Exploring Failure Transparency and the Limits of Generic Recovery
- In Proc. 4th USENIX Symposium on Operating Systems Design and Implementation
, 2000
Cited by 46 (7 self)
We explore the abstraction of failure transparency in which the operating system provides the illusion of failure-free operation. To provide failure transparency, an operating system must recover applications after hardware, operating system, and application failures, and must do so without help from the programmer or unduly slowing failure-free performance. We describe two invariants that must be upheld to provide failure transparency: one that ensures sufficient application state is saved to guarantee the user cannot discern failures, and another that ensures sufficient application state is lost to allow recovery from failures affecting application state. We find that several real applications get failure transparency in the presence of simple stop failures with overhead of 0-12%. Less encouragingly, we find that applications violate one invariant in the course of upholding the other for more than 90% of application faults and 3-15% of operating system faults, rendering transparent recovery impossible for these cases.
Understanding and dealing with operator mistakes in internet services
- In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI ’04)
, 2004
Cited by 42 (12 self)
Operator mistakes are a significant source of unavailability in modern Internet services. In this paper, we first characterize these mistakes by performing an extensive set of experiments using human operators and a realistic three-tier auction service. The mistakes we observed range from software misconfiguration, to fault misdiagnosis, to incorrect software restarts. We next propose to validate operator actions before they are made visible to the rest of the system. We demonstrate how to accomplish this task via the creation of a validation environment that is an extension of the online system, where components can be validated using real workloads before they are migrated into the running service. We show that our prototype validation system can detect 66% of the operator mistakes that we have observed.
Improving Availability with Recursive Micro-Reboots: A Soft-State System Case Study
, 2003
Cited by 35 (4 self)
Even after decades of software engineering research, complex computer systems still fail. This paper makes the case for increasing research emphasis on dependability and, specifically, on improving availability by reducing time-to-recover. All software fails at some point, so systems must be able to recover from failures. Recovery itself can fail too, so systems must know how to intelligently retry their recovery. We present here a recursive approach, in which a minimal subset of components is recovered first
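The abstract is cut off in this listing, so the following is only a minimal sketch of the general idea it names: restart the smallest affected component first and widen the scope only if the failure persists. The `Component` class, the `recursive_recover` function, and the health-check callback are hypothetical names chosen for illustration, not the paper's implementation. Restarting a component is modeled as incrementing a counter; a real system would also restart everything nested inside the component being restarted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Component:
    """Hypothetical node in a restart hierarchy (leaf -> subsystem -> system)."""
    name: str
    parent: Optional["Component"] = None
    restarts: int = 0

    def restart(self) -> None:
        # Stand-in for a real restart; here we only count attempts.
        self.restarts += 1


def recursive_recover(
    failed: Component, is_healthy: Callable[[], bool]
) -> Optional[Component]:
    """Restart the smallest component first, then each enclosing parent in turn.

    Returns the component whose restart made the health check pass,
    or None if even the outermost restart did not help.
    """
    node: Optional[Component] = failed
    while node is not None:
        node.restart()
        if is_healthy():
            return node
        node = node.parent  # widen the recovery scope and retry
    return None
```

Under these assumptions, the cost of a recovery grows only as far up the hierarchy as the fault actually requires, which is how a scheme of this shape would shorten time-to-recover for faults confined to a small component.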
DARX - A Framework for the Fault-Tolerant Support of Agent Software
- In 14th International Symposium on Software Reliability Engineering (ISSRE’2003)
, 2003
Cited by 29 (3 self)
This paper presents DARX, our framework for building applications that provide adaptive fault tolerance. It relies on the fact that multi-agent platforms constitute a very strong basis for decentralized software that is both flexible and scalable, and makes the assumption that the relative importance of each agent varies during the course of the computation. DARX brings together solutions that facilitate the creation of multi-agent applications in a large-scale context. Its most important feature is adaptive replication: replication strategies are applied on a per-agent basis with respect to transient environment characteristics such as the importance of the agent for the computation, the network load, or the mean time between failures.
NFTAPE: A Framework for Assessing Dependability in Distributed Systems with Lightweight Fault Injectors
- In Proceedings of the IEEE International Computer Performance and Dependability Symposium
, 2000
Cited by 29 (0 self)
Many fault injection tools are available for dependability assessment. Although these tools are good at injecting a single fault model into a single system, they suffer from two main limitations for use in distributed systems: (1) no single tool is sufficient for injecting all necessary fault models; (2) it is difficult to port these tools to new systems. NFTAPE, a tool for composing automated fault injection experiments from available lightweight fault injectors, triggers, monitors, and other components, helps to solve these problems. We have conducted experiments using NFTAPE with several types of lightweight fault injectors, including driver-based, debugger-based, target-specific, simulation-based, hardware-based, and performance-fault injections. Two example experiments are described in this paper. The first uses a hardware fault injector with a Myrinet LAN; the other uses a Software Implemented Fault Injection (SWIFI) fault injector to target a space-imaging application.
Automatic Instruction-Level Software-Only Recovery Methods
- In International Conference on Dependable Systems and Networks (DSN’06)
, 2006
Cited by 17 (1 self)
As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. Computer architects have typically addressed reliability issues by adding redundant hardware, but these techniques are often too expensive to be used widely. Software-only reliability techniques have shown promise in their ability to protect against soft errors without any hardware overhead. However, existing low-level software-only fault tolerance techniques have only addressed the problem of detecting faults, leaving recovery largely unaddressed. In this paper, we present the concept, implementation, and evaluation of automatic, instruction-level, software-only recovery techniques, as well as various specific techniques representing different trade-offs between reliability and performance. Our evaluation shows that these techniques fulfill the promises of instruction-level, software-only fault tolerance by offering a wide range of flexible recovery options.
Containment Units: A Hierarchically Composable Architecture for Adaptive Systems
- In Proceedings of the 10th International Symposium on the Foundations of Software Engineering
, 2002
Cited by 15 (1 self)
Software is increasingly expected to run in a variety of environments. The environments themselves are often dynamically changing when using mobile computers or embedded systems, for example. Network bandwidth, available power, or other physical conditions may change, necessitating the use of alternative algorithms within the software, and changing resource mixes to support the software. We present Containment Units as a software architecture useful for recognizing environmental changes and dynamically reconfiguring software and resource allocations to adapt to those changes. We present examples of Containment Units used within robotics along with the results of actual executions, and the application of static analysis to obtain assurances that those Containment Units can be expected to demonstrate the robustness for which they were designed.
Configurable and Reconfigurable Group Services in a Component Based Middleware Environment
- PROC. INTERNATIONAL SRDS (SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS) WORKSHOP ON DEPENDABLE AND GROUP COMMUNICATION (DSMGC)
, 2000
Cited by 14 (6 self)
... importance of group based distributed applications such as media dissemination, computer supported collaborative work or fault tolerance through replication. However, most distributed object based middleware platforms, which are increasingly being used as an implementation environment for such applications, fail to provide suitable support for group applications in their full generality. In this paper we describe a component based approach to the provision of group services in a middleware environment in which application tailored group services can be built by defining particular configurations of components or by incrementally modifying existing configurations. In addition, our approach uses reflective capabilities of the middleware platform to support the run-time reconfiguration of existing and running group applications.
An Architectural Framework for Providing Reliability and Security Support
- In DSN
, 2004
Cited by 13 (3 self)
This paper explores hardware-implemented error-detection and security mechanisms embedded as modules in a hardware-level framework called the Reliability and Security Engine (RSE), which is implemented as an integral part of a modern microprocessor. The RSE interacts with the processor through an input/output interface. The CHECK instruction, a special extension of the instruction set architecture of the processor, is the interface of the application with the RSE. The detection mechanisms described here in detail are: (1) the Memory Layout Randomization (MLR) Module, which randomizes the memory layout of a process in order to foil attackers who assume a fixed system layout, thus protecting against many security threats, (2) the Data Dependency Tracking (DDT) Module, which tracks the dependencies among threads of a process and maintains checkpoints of shared memory pages in order to roll back the threads when an offending (potentially malicious) thread is terminated, (3) the Instruction Checker Module (ICM), which checks an instruction for its validity or the control-flow of the program just as the instruction enters the pipeline for execution, and (4) the Adaptive Heartbeat Monitor (AHBM), which enables heart-beating for checking the liveness of operating systems and/or application processes/threads. Performance simulations for the studied modules indicate low overhead of the proposed solutions.

