Results 1 - 10
of
716
ReVirt: Enabling Intrusion Analysis through Virtual-Machine Logging and Replay
- In Proceedings of the 2002 Symposium on Operating Systems Design and Implementation (OSDI
, 2002
"... Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein. Current ..."
Abstract
-
Cited by 469 (26 self)
- Add to MetaCart
(Show Context)
Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein. Current system loggers have two problems: they depend on the integrity of the operating system being logged, and they do not save sufficient information to replay and analyze attacks that include any non-deterministic events. ReVirt removes the dependency on the target operating system by moving it into a virtual machine and logging below the virtual machine. This allows ReVirt to replay the system’s execution before, during, and after an intruder compromises the system, even if the intruder replaces the target operating system. ReVirt logs enough information to replay a long-term execution of the virtual machine instruction-by-instruction. This enables it to provide arbitrarily detailed observations about what transpired on the system, even in the presence of non-deterministic attacks and executions. ReVirt adds reasonable time and space overhead. Overheads due to virtualization are imperceptible for interactive use and CPU-bound workloads, and 13-58 % for kernel-intensive workloads. Logging adds 0-8 % overhead, and logging
Vigilante: End-to-End Containment of Internet Worm Epidemics
, 2008
"... Worm containment must be automatic because worms can spread too fast for humans to respond. Recent work proposed network-level techniques to automate worm containment; these techniques have limitations because there is no information about the vulnerabilities exploited by worms at the network level. ..."
Abstract
-
Cited by 304 (6 self)
- Add to MetaCart
(Show Context)
Worm containment must be automatic because worms can spread too fast for humans to respond. Recent work proposed network-level techniques to automate worm containment; these techniques have limitations because there is no information about the vulnerabilities exploited by worms at the network level. We propose Vigilante, a new end-to-end architecture to contain worms automatically that addresses these limitations. In Vigilante, hosts detect worms by instrumenting vulnerable programs to analyze infection attempts. We introduce dynamic data-flow analysis: a broad-coverage host-based algorithm that can detect unknown worms by tracking the flow of data from network messages and disallowing unsafe uses of this data. We also show how to integrate other host-based detection mechanisms into the Vigilante architecture. Upon detection, hosts generate self-certifying alerts (SCAs), a new type of security alert that can be inexpensively verified by any vulnerable host. Using SCAs, hosts can cooperate to contain an outbreak, without having to trust each other. Vigilante broadcasts SCAs over an overlay network that propagates alerts rapidly and resiliently. Hosts receiving an SCA protect themselves by generating filters with vulnerability condition slicing: an algorithm that performs dynamic analysis of the vulnerable program to identify control-flow conditions that lead
BugNet: Continuously recording program execution for deterministic replay debugging
- In ISCA
, 2005
"... Significant time is spent by companies trying to reproduce and fix the bugs that occur for released code. To assist developers, we propose the BugNet architecture to continuously record information on production runs. The information collected before the crash of a program can be used by the develop ..."
Abstract
-
Cited by 182 (16 self)
- Add to MetaCart
(Show Context)
Significant time is spent by companies trying to reproduce and fix the bugs that occur for released code. To assist developers, we propose the BugNet architecture to continuously record information on production runs. The information collected before the crash of a program can be used by the developers working in their execution environment to deterministically replay the last several million instructions executed before the crash. BugNet is based on the insight that recording the register file contents at any point in time, and then recording the load values that occur after that point can enable deterministic replaying of a program’s execution. BugNet focuses on being able to replay the application’s execution and the libraries it uses, but not the operating system. But our approach provides the ability to replay an application’s execution across context switches and interrupts. Hence, BugNet obviates the need for tracking program I/O, interrupts and DMA transfers, which would have otherwise required more complex hardware support. In addition, BugNet does not require a final core dump of the system state for replaying, which significantly reduces the amount of data that must be sent back to the developer. 1
Diskless Checkpointing
, 1997
"... Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkp ..."
Abstract
-
Cited by 158 (3 self)
- Add to MetaCart
Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.
Flashback: A Lightweight Extension for Rollback and Deterministic Replay for Software Debugging
- In USENIX Annual Technical Conference, General Track
, 2004
"... Unfortunately, finding software bugs is a very challenging task because many bugs are hard to reproduce. While debugging a program, it would be very useful to rollback a crashed program to a previous execution point and deterministically re-execute the "buggy " code region. However ..."
Abstract
-
Cited by 155 (7 self)
- Add to MetaCart
Unfortunately, finding software bugs is a very challenging task because many bugs are hard to reproduce. While debugging a program, it would be very useful to rollback a crashed program to a previous execution point and deterministically re-execute the "buggy " code region. However, most previous work on rollback and replay support was designed to survive hardware or operating system failures, and is therefore too heavyweight for the fine-grained rollback and replay needed for software debugging. This paper presents Flashback, a lightweight OS extension that provides fine-grained rollback and replay to help debug software. Flashback uses shadow processes to efficiently roll back in-memory state of a process, and logs a process ' interactions with the system to support deterministic replay. Both shadow processes and logging of system calls are implemented in a lightweight fashion specifically designed for the purpose of software debugging. We have implemented a prototype of Flashback in the Linux operating system. Our experimental results with micro-benchmarks and real applications show that Flashback adds little overhead and can quickly roll back a debugged program to a previous execution point and deterministically replay from that point.
SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery
- In Proceedings of the 29th Annual International Symposium on Computer Architecture
, 2002
"... We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint~recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multi-ple, globally consistent checkpoints of the state of a shared memo ..."
Abstract
-
Cited by 137 (10 self)
- Add to MetaCart
(Show Context)
We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint~recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multi-ple, globally consistent checkpoints of the state of a shared memory muhiprocessor (i.e., processors, memor3; and coherence permissions), and it recovers to a pre-fault checkpoint of the system and re-executes if a fault is detected. SafetyNet efficiently coordinates checkpoints across the system in logical time and uses "logically atomic " coherence transactions to free checkpoints of transient coherence state. SafetyNet minimizes perfor-mance overhead by pipelining checkpoint validation with subsequent parallel execution. We illustrate SafetyNet avoiding system crashes due to either dropped coherence messages or the loss of an inter-connection network switch (and its buffered messages). Using full-system simulation of a 16-way muhiprocessor running commercial workloads, we find that SafetyNet (a) adds statistically insignificant runtime overhead in the common-case of fault-free execution, and (b) avoids a crash when tolerated faults occur. 1
MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes
- In Supercomputing
, 2002
"... Global Computing platforms, large scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This Volatility reduces the MTBF of the whole system in the range of hours or min ..."
Abstract
-
Cited by 136 (9 self)
- Add to MetaCart
(Show Context)
Global Computing platforms, large scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This Volatility reduces the MTBF of the whole system in the range of hours or minutes.
Execution replay of multiprocessor virtual machines
- In VEE
, 2008
"... Abstract Execution replay of virtual machines is a technique which has many important applications, including debugging, fault-tolerance, and security. Execution replay for single processor virtual machines is well-understood, and available commercially. With the advancement of multi-core architect ..."
Abstract
-
Cited by 124 (4 self)
- Add to MetaCart
(Show Context)
Abstract Execution replay of virtual machines is a technique which has many important applications, including debugging, fault-tolerance, and security. Execution replay for single processor virtual machines is well-understood, and available commercially. With the advancement of multi-core architectures, however, multiprocessor virtual machines are becoming more important. Our system, SMP-ReVirt, is the first system to log and replay a multiprocessor virtual machine on commodity hardware. We use hardware page protection to detect and accurately replay sharing between virtual cpus of a multi-cpu virtual machine, allowing us to replay the entire operating system and all applications. We have tested our system on a variety of workloads, and find that although sharing under SMPReVirt is expensive, for many workloads and applications, including debugging, the overhead is acceptable.
ReVive: CostEffective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors
- In ISCA-02
, 2002
"... This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of availability, performance, and hardware cost. ReVive performs checkpointing, logging, and distributed parity protection, all me ..."
Abstract
-
Cited by 120 (13 self)
- Add to MetaCart
(Show Context)
This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of availability, performance, and hardware cost. ReVive performs checkpointing, logging, and distributed parity protection, all memory-based. It enables recovery from a wide class of errors, including the permanent loss of an entire node. To maintain high performance, ReVive includes specialized hardware that performs frequent operations in the background, such as log and parity updates. To keep the cost low, more complex checkpointing and recovery functions are performed in software, while the hardware modifications are limited to the directory controllers of the machine. Our simulation results on a 16-processor system indicate that the average error-free execution time overhead of using ReVive is only 6.3%, while the achieved availability is better than 99.999 % even when the errors occur as often as once per day. 1
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
- in Proceedings, LACSI Symposium, Sante Fe
, 2003
"... As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback ..."
Abstract
-
Cited by 109 (10 self)
- Add to MetaCart
(Show Context)
As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernellevel process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI. 1