Results 1 - 10
of
95
An Empirical Study of Operating System Errors
, 2001
"... We present a study of operating system errors found by automatic, static, compiler analysis applied to the Linux and OpenBSD kernels. Our approach differs from previ-ous studies that consider errors found by manual inspec-tion of logs, testing, and surveys because static analysis is applied uniforml ..."
Abstract
-
Cited by 199 (5 self)
- Add to MetaCart
We present a study of operating system errors found by automatic, static, compiler analysis applied to the Linux and OpenBSD kernels. Our approach differs from previ-ous studies that consider errors found by manual inspec-tion of logs, testing, and surveys because static analysis is applied uniformly to the entire kernel source, though our approach necessarily considers a less comprehensive variety of errors than previous studies. In addition, au-tomation allows us to track errors over multiple versions of the kernel source to estimate how long errors remain in the system before they are fixed. We found that device drivers have error rates up to three to seven times higher than the rest of the ker-nel. We found that the largest quartile of functions have error rates two to six times higher than the small-est quartile. We found that the newest quartile of files have error rates up to twice that of the oldest quartile, which provides evidence that code "hardens " over time. Finally, we found that bugs remain in the Linux kernel an average of 1.8 years before being fixed. 1
Efficient Detection of All Pointer and Array Access Errors
, 1994
"... We present a pointer and array access checking technique that provides complete error coverage through a simple set of program transformations. Our technique, based on an extended safe pointer representation, has a number of novel aspects. Foremost, it is the first technique that detects all spatial ..."
Abstract
-
Cited by 195 (1 self)
- Add to MetaCart
We present a pointer and array access checking technique that provides complete error coverage through a simple set of program transformations. Our technique, based on an extended safe pointer representation, has a number of novel aspects. Foremost, it is the first technique that detects all spatial and temporal access errors. Its use is not limited by the expressiveness of the language; that is, it can be applied successfully to compiled or interpreted languages with subscripted and mutable pointers, local references, and explicit and typeless dynamic storage management, e.g., C. Because it is a source level transformation, it is amenable to both compile- and run-time optimization. Finally, its performance, even without compile-time optimization, is quite good. We implemented a prototype translator for the C language and analyzed the checking overheads of six non-trivial, pointer intensive programs. Execution overheads range from 130 % to 540%; with text and data size overheads typically below 100%.
The Rio File Cache: Surviving Operating System Crashes
- In Proc. 7th Intl. Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS
, 1996
"... Abstract: One of the fundamental limits to high-performance, high-reliability file systems is memory’s vulnerability to system crashes. Because memory is viewed as unsafe, systems periodically write data back to disk. The extra disk traffic lowers performance, and the delay period before data is saf ..."
Abstract
-
Cited by 105 (13 self)
- Add to MetaCart
Abstract: One of the fundamental limits to high-performance, high-reliability file systems is memory’s vulnerability to system crashes. Because memory is viewed as unsafe, systems periodically write data back to disk. The extra disk traffic lowers performance, and the delay period before data is safe lowers reliability. The goal of the Rio (RAM I/O) file cache is to make ordinary main memory safe for persistent storage by enabling memory to survive operating system crashes. Reliable memory enables a system to achieve the best of both worlds: reliability equivalent to a write-through file cache, where every write is instantly safe, and performance equivalent to a pure write-back cache, with no reliability-induced writes to disk. To achieve reliability, we protect memory during a crash and restore it during a reboot (a “warm ” reboot). Extensive crash tests show that even without protection, warm reboot enables memory to achieve reliability close to that of a write-through file system while performing 20 times faster. Rio makes all writes immediately permanent, yet performs faster than systems that lose 30 seconds of data on a crash: 35% faster than a standard delayed-write file system and 8 % faster than a system that delays both data and metadata. For applications that demand even higher levels of reliability, Rio’s optional protection mechanism makes memory even safer than a write-through file system while while lowering performance 20 % compared to a pure write-back system. 1
Microreboot - A Technique for Cheap Recovery
, 2004
"... A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we sep ..."
Abstract
-
Cited by 94 (2 self)
- Add to MetaCart
A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we separate process recovery from data recovery to enable microrebooting -- a fine-grain technique for surgically recovering faulty application components, without disturbing the rest of the application.
Chameleon: A Software Infrastructure For Adaptive Fault Tolerance In Distributed Systems
, 1998
"... The project has benefited through some demonstrations and presentations given to people from the computer industry, and the discussions they generated. Of them, mention must be made of Pankaj Mehra of Tandem Labs, Roger Lee, Robert Ferraro and Jagdish Patel of the Jet Propulsion Laboratory, Larry Ja ..."
Abstract
-
Cited by 80 (11 self)
- Add to MetaCart
The project has benefited through some demonstrations and presentations given to people from the computer industry, and the discussions they generated. Of them, mention must be made of Pankaj Mehra of Tandem Labs, Roger Lee, Robert Ferraro and Jagdish Patel of the Jet Propulsion Laboratory, Larry Jack and Chet Markiewicz of Honeywell Inc., and Haim Levendel of Motorola. iii TABLE OF CONTENTS CHAPTER PAGE 1 INTRODUCTION : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1 2 RELATED WORK : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 3 3 CHAMELEON OVERVIEW : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 14 3.1 Behavioral Overview : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 15 3.1.1 Initialization of the Chameleon Environment : : : : : : : : : : : : : : : : 15 3.1.2 Interpreting User-Specified Depen
Treating bugs as allergies -- a safe method to survive software failures
- IN SOSP
, 2005
"... Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Previous approaches for surviving software failures suffer from several limitations, including requiring application restructuring, failing to address deterministic software bugs, unsafely spe ..."
Abstract
-
Cited by 69 (6 self)
- Add to MetaCart
Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Previous approaches for surviving software failures suffer from several limitations, including requiring application restructuring, failing to address deterministic software bugs, unsafely speculating on program execution, and re-quiring a long recovery time. This paper
Hive: Fault Containment for Shared-Memory Multiprocessors
, 1995
"... Reliability and scalability are major concerns when designing operating systems for large-scale shared-memory multiprocessors. In this paper we describe Hive, an operating system with a novel kernel architecture that addresses these issues Hive is structured as an internal distributed system of inde ..."
Abstract
-
Cited by 65 (8 self)
- Add to MetaCart
Reliability and scalability are major concerns when designing operating systems for large-scale shared-memory multiprocessors. In this paper we describe Hive, an operating system with a novel kernel architecture that addresses these issues Hive is structured as an internal distributed system of independent kernels called cells. This improves reliabihty because a hardwme or software fault damages only one cell rather than the whole system, and improves scalability because few kernel resources are shared by processes running on different cells. The Hive prototype is a complete implementation of UNIX SVR4 and is targeted to run on the Stanford FLASH multiprocessor. This paper focuses on Hive’s solutlon to the following key challenges: ( 1) fault containment, i.e. confining the effects of hardware or software faults to the cell where they occur, and (2) memory sharing among cells, which is requmed to achieve application performance competitive with other multiprocessor operating systems. Fault containment in a shared-memory multiprocessor requmes defending each cell against erroneous writes caused by faults in other cells. Hive prevents such damage by using the FLASH jirewzdl, a write permission bit-vector associated with each page of memory, and by discarding potentially corrupt pages when a fault is detected. Memory sharing is provided through a unified file and virtual memory page cache across the cells, and through a umfied free page frame pool. We report early experience with the system, including the results of fault injection and performance experiments using SimOS, an accurate simulator of FLASH, The effects of faults were contained to the cell in which they occurred m all 49 tests where we injected fail-stop hardware faults, and in all 20 tests where we injected kernel data corruption. The Hive prototype executes test workloads on a four-processor four-cell system with between 0’%.and 11YOslowdown as compared to SGI IRIX 5.2 (the version of UNIX on which it is based).
SafeDrive: Safe and recoverable extensions using language-based techniques
- In OSDI’06
, 2006
"... We present SafeDrive, a system for detecting and recovering from type safety violations in software extensions. SafeDrive has low overhead and requires minimal changes to existing source code. To achieve this result, SafeDrive uses a novel type system that provides finegrained isolation for existing ..."
Abstract
-
Cited by 58 (4 self)
- Add to MetaCart
We present SafeDrive, a system for detecting and recovering from type safety violations in software extensions. SafeDrive has low overhead and requires minimal changes to existing source code. To achieve this result, SafeDrive uses a novel type system that provides finegrained isolation for existing extensions written in C. In addition, SafeDrive tracks invariants using simple wrappers for the host system API and restores them when recovering from a violation. This approach achieves finegrained memory error detection and recovery with few code changes and at a significantly lower performance cost than existing solutions based on hardware-enforced domains, such as Nooks [33], L4 [21], and Xen [13], or software-enforced domains, such as SFI [35]. The principles used in SafeDrive can be applied to any large system with loadable, error-prone extension modules. In this paper we describe our experience using SafeDrive for protection and recovery of a variety of Linux device drivers. In order to apply SafeDrive to these device drivers, we had to change less than 4 % of the source code. SafeDrive recovered from all 44 crashes due to injected faults in a network card driver. In experiments with 6 different drivers, we observed increases in kernel CPU utilization of 4–23 % with no noticeable degradation in end-to-end performance. 1
Detecting Application-Level Failures in Component-based Internet Services
, 2004
"... Pinpoint is an application-generic framework for using statistical learning techniques to detect and localize likely application-level failures in component-based Internet services. Assuming that most of the system is working most of the time, Pinpoint looks for anomalies in low-level behaviors that ..."
Abstract
-
Cited by 53 (10 self)
- Add to MetaCart
Pinpoint is an application-generic framework for using statistical learning techniques to detect and localize likely application-level failures in component-based Internet services. Assuming that most of the system is working most of the time, Pinpoint looks for anomalies in low-level behaviors that are likely to reflect high-level application faults, and correlates these anomalies to their potential causes within the system. In our experiments, Pinpoint correctly detected and localized over 70-88% of the faults, depending on the type of fault, we injected into our testbed system, as compared to the 50-70% detected by current techniques. By demonstrating the applicability of statistical learning and providing an application-generic platform on which additional machine learning techniques can be applied to the problem of fast failure detection, we hope to hasten the adoption of statistical approaches to dependability for complex software systems.
The Recovery Box: Using Fast Recovery to Provide High Availability in the UNIX Environment
- In Proceedings USENIX Summer Conference
, 1992
"... As organizations with high system availability requirements move to UNIX, the elimination of down-time in the UNIX environment becomes a more important issue. Designing for fast recovery, rather than crash prevention, can provide low-cost highlyavailable systems without sacrificing performance or si ..."
Abstract
-
Cited by 52 (2 self)
- Add to MetaCart
As organizations with high system availability requirements move to UNIX, the elimination of down-time in the UNIX environment becomes a more important issue. Designing for fast recovery, rather than crash prevention, can provide low-cost highlyavailable systems without sacrificing performance or simplicity. In Sprite, a UNIX-like distributed operating system, we accomplish this fast recovery in part through the use of a recovery box: a stable area of memory in which the system stores carefully selected pieces of system state, and from which the system can be regenerated quickly. Error detection using checksums allows the system to revert to its traditional reboot sequence if the recovery box data is corrupted during system failure. Recent statistics about the types and frequencies of operating system failures indicate that fast recovery using the recovery box will be possible most of the time. Using our recovery box implementation, a Sprite file server recovers in 26 seconds and a da...

