Abstract: Even after decades of software engineering research, complex computer systems still fail. This paper makes the case for increasing research emphasis on dependability and, specifically, on improving availability by reducing time-to-recover. All software fails at some point, so systems must be able to recover from failures. Recovery itself can fail too, so systems must know how to intelligently retry their recovery. We present here a recursive approach, in which a minimal subset of components is...
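Illustrative sketch: the abstract is cut off before it finishes describing the recursive approach, so the following is only one plausible reading, not the paper's algorithm. It assumes a hypothetical restart tree in which recovery first restarts the smallest suspect component and, if the failure persists, retries with the next larger enclosing subsystem. The names `Node`, `recover`, `restart`, and `healthy` are assumptions made for illustration and do not come from the paper.

```python
from typing import Callable, Iterable, List, Optional


class Node:
    """One node of a hypothetical restart tree (an assumption, not the paper's API).

    Restarting a node is taken to restart everything beneath it as well.
    """

    def __init__(self, name: str, children: Iterable["Node"] = ()) -> None:
        self.name = name
        self.parent: Optional["Node"] = None
        self.children: List["Node"] = list(children)
        for child in self.children:
            child.parent = self


def recover(
    suspect: Node,
    restart: Callable[[Node], None],
    healthy: Callable[[], bool],
) -> List[str]:
    """Restart the suspect node; while the failure persists, retry one level up.

    Returns the names of the nodes restarted, smallest first. Raises
    RuntimeError if restarting the root still leaves the system unhealthy.
    """
    attempted: List[str] = []
    node: Optional[Node] = suspect
    while node is not None:
        restart(node)  # cheapest recovery attempt at this level
        attempted.append(node.name)
        if healthy():
            return attempted
        node = node.parent  # recovery failed: widen the restart scope
    raise RuntimeError("restarting the whole system did not cure the failure")
```

A caller would supply its own `restart` action and `healthy` check; the sketch only captures the escalation order, in which a failed recovery is retried over a progressively larger part of the system.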
Context of citations to this paper:
.... one or more components have failed, it instructs a recovery agent, inside the JBoss application server, to micro reboot the suspect components [2]. To keep clients from failing while parts of the system are rebooting, our stall proxy will intercept new client connections and...
Cited by:
A Survey of Fault-Tolerance and Fault-Recovery Techniques in.. - Treaster (2005)
Crash-Only Software - Candea, Fox (2003)
JAGR: An Autonomous Self-Recovering Application Server - Candea, Kiciman, Zhang.. (2003)
Similar documents (at the sentence level):
10.8%: Reducing Recovery Time in a Small Recursively.. - Candea, Cutler, Fox, .. (2002)
Active bibliography (related documents):
2.2: Recursive Restartability: Turning the Reboot Sledgehammer into.. - Candea, Fox (2001)
1.2: Session State: Beyond Soft State - Ling, Kiciman, Fox (2004)
1.1: Designing for High Availability and Measurability - Candea, Fox (2001)
Similar documents based on text:
1.9: When Does Fast Recovery Trump High Reliability? - Fox, Patterson
0.2: Toward Recovery-Oriented Computing - Fox (2002)
0.2: Vassal: Loadable Scheduler Support for Multi-Policy Scheduling - Candea, Jones (1998)
Related documents from co-citation:
3: Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - Candea, Fox
2: Fail-stutter fault tolerance - Arpaci-Dusseau, Arpaci-Dusseau - 2001
BibTeX entry:
G. Candea, J. Cutler, and A. Fox. Improving availability with recursive micro-reboots: A soft-state system case study. Performance Evaluation Journal, Summer 2003. to appear. http://citeseer.ist.psu.edu/candea03improving.html
@misc{ candea03improving,
author = "G. Candea and J. Cutler and A. Fox",
title = "Improving availability with recursive micro-reboots: A soft-state system
case study",
text = "G. Candea, J. Cutler, and A. Fox. Improving availability with recursive
micro-reboots: A soft-state system case study. Performance Evaluation Journal,
Summer 2003. to appear.",
year = "2003",
url = "citeseer.ist.psu.edu/candea03improving.html" }
Citations (may not include all citations):
901: Transaction processing: concepts and techniques - Gray, Reuter - 1993
833: A reliable multicast framework for light-weight sessions and.. - Floyd, Jacobson et al. - 1995
733: RSVP: A New Resource Reservation Protocol - Zhang, Deering et al. - 1993
444: Mach: A new kernel foundation for UNIX development - Accetta, Baron et al. - 1986
373: The Design and Implementation of a Log-Structured File Syste.. - Rosenblum, Ousterhout - 1991
345: Notes on data base operating systems - Gray - 1978
301: End-to-end routing behavior in the Internet - Paxson - 1996
253: Programming Perl - Wall, Schwartz - 1991
235: Practical Byzantine fault tolerance - Castro, Liskov - 1999
200: Cluster-based scalable network services - Fox, Gribble et al. - 1997
193: The Mythical Man-Month - Brooks - 1995
161: The packet filter: An efficient mechanism for user-level net.. - Mogul, Rashid et al. - 1987
149: The synchronization of periodic routing messages - Floyd, Jacobson - 1994
144: The transaction concept: Virtues and limitations - Gray - 1981
139: Recursive functions of symbolic expressions and their comput.. - McCarthy - 1959
123: Leases: An efficient fault-tolerant mechanism for distribute.. - Gray, Cheriton - 1989
106: Reliable Computer Systems: Design and Evaluation - Siewiorek, Swarz - 1998
84: distributed data structures for internet service constructio.. - Gribble, Brewer et al. - 2000
83: The design of the Postgres storage system - Stonebraker - 1987
79: Why do computers stop and what can be done about it - Gray - 1986
76: Software---Practice and Experience - Wirth, language - 1988
75: Lessons from giant-scale services - Brewer - 2001
73: File system design for an NFS file server appliance - Hitz, Lau et al. - 1994
73: An analysis of Internet content delivery systems - Saroiu, Gummadi et al. - 2002
69: DISCO: running commodity operating systems on scalable multi.. - Bugnion, Devine et al. - 1997
68: A NonStop kernel - Bartlett - 1981
66: The Rio file cache: Surviving operating system crashes - Chen, Ng et al. - 1996
62: Scale and performance in the Denali isolation kernel - Whitaker, Shaw et al. - 2002
57: Software implemented fault tolerance: Technologies and exper.. - Huang, Kintala - 1993
56: Checkpointing and its applications - Wang, Huang et al. - 1995
55: Simula---an Algol-based simulation language - Dahl, Nygaard - 1966
55: Fault-Tolerant Computer System Design - Pradhan - 1995
49: Survey of virtual machine research - Goldberg - 1974
45: Recursive restartability: Turning the reboot sledgehammer in.. - Candea, Fox - 2001
44: Concurrent error detection using watchdog processors---a sur.. - Mahmood, McCluskey - 1988
43: Hive: Fault containment for shared-memory multiprocessors - Chapin, Rosenblum et al. - 1995
39: Software Fault Tolerance - Lyu - 1995
38: An empirical study of operating systems errors - Chou, Yang et al. - 2001
37: Chameleon: a software infrastructure for adaptive fault tole.. - Kalbarczyk, Iyer et al. - 1999
36: Software rejuvenation: Analysis - Huang, Kintala et al. - 1995
35: Pinpoint: Problem determination in large - Chen, Kiciman et al. - 2002
34: Free transactions with Rio Vista - Lowell, Chen - 1997
31: The performance of µ-kernel-based systems - Hartig, Hohmuth et al.
30: Berkeley DB - Olson, Bostic et al. - 1999
28: Recovery oriented computing - Patterson, Brown et al. - 2002
27: Measuring system and software reliability using an automated.. - Murphy, Gent - 1995
26: The Recovery Box: Using fast recovery to provide high availa.. - Baker, Sullivan - 1992
25: sparse mode protocol: Specification - Deering, Estrin et al. - 1996
25: Self-monitoring and self-adapting operating systems - Seltzer, Small - 1997
25: Rollback and recovery strategies for computer programs - Chandy, Ramamoorthy - 1972
25: an and S. McCanne. A model, analysis, and protocol framework.. - Ram - 1999
24: Information and control in gray-box systems - Arpaci-Dusseau, Arpaci-Dusseau - 2001
23: Analysis of software rejuvenation using Markov regenerative .. - Garg, Puliafito et al. - 1995
21: Heartbeat: a timeout-free failure detector for quiescent rel.. - Aguilera, Chen et al. - 1997
18: Exploring failure transparency and the limits of generic rec.. - Lowell, Chandra et al. - 2000
18: Why do internet services fail - Oppenheimer, Ganapathi et al. - 2003
16: Undo for operators: Building an undoable e-mail store - Brown, Patterson - 2003
13: com---Facing a world crisis - LeFebvre - 2001
13: A methodology for detection and estimation of software aging - Garg, Moorsel et al. - 1998
12: Personal Communication - Brewer - 2001
12: Personal Communication - Brewer - 2000
11: Minimizing completion time of a program by checkpointing and.. - Garg, Huang et al. - 1996
9: Beyond fault tolerance - Chou - 1997
9: Crash-only software - Candea, Fox - 2003
9: parallel enterprise server G5 fault tolerance: A historical .. - Spainhower, Gregg - 1999
9: Fail-stutter fault tolerance - Arpaci-Dusseau, Arpaci-Dusseau - 2001
8: Recovery blocks in action: A system supporting high reliabil.. - Anderson, Kerr - 1976
7: What really happened on Mars - Reeves - 1998
7: Automatic failure-path inference: A generic introspection te.. - Candea, Delgado et al.
7: Fast-Start: Quick fault recovery in Oracle - Lahiri, Ganesh et al. - 2001
7: Optimizing preventative service of software products - Adams - 1984
6: Using fault model enforcement to improve availability - Nagaraja, Bianchini et al. - 2002
6: JAGR: An autonomous self-recovering application server - Candea, Keyani et al. - 2003
5: Distributed computing with BEA WebLogic server - Jacobs - 2003
5: Making smart investments to reduce unplanned downtime - Scott - 1999
4: Luna: A flexible Java protection system - Hawblitzel, von Eicken - 2002
4: JEE platform specification - EE, http et al. - 2002
4: SAPPHIRE - Stanford's first amateur satellite - Swartwout, Twiggs - 1998
4: Personal communication - Pal - 2002
4: The role of Linux in reducing the cost of enterprise computi.. - Gillen, Kusnetzki et al. - 2002
4: System reliability and availability drivers of Tru64 UNIX - Murphy, Davies - 1999
4: The smart ship is not enough - DiGiorgio - 1998
3: Experimental evaluation of the REE SIFT environment for spac.. - Whisnant, Iyer et al. - 2002
3: Increasing relevance of memory hardware errors - Milojicic, Messer et al. - 2000
3: Multitasking without compromise: A virtual machine evolution - Czajkowski, Daynes - 2001
3: General Accounting Office - defense, led et al. - 1992
3: Computer Architecture: A Quantitative Approach - Hennessy, Patterson - 2002
3: Reducing the Cost of Spacecraft Ground Systems and Operation.. - Miau, Holdaway - 2000
3: The Fibre Channel Consultant: A Comprehensive Introduction - Kembel - 1998
2: Presentation at IFIP WG 10 - Spainhower, systems et al. - 2002
2: Modeling of online service availability perceived by Web use.. - Xie, Sun et al. - 2002
2: Reliability on the cheap: How I learned to stop worrying and.. - Acharya - 2002
2: Springer Verlag - Cousot, Analysis - 2001
2: Sustainable infrastructures: How IT services can address the.. - Adams, Igou et al. - 2001
2: IBM director software rejuvenation - Machines - 2001
1: Unicenter CA-SYSVIEW realtime performance management - Associates - 2002
1: Verification and Validation of Modern Software-Intensive Sys.. - Schulmeyer, MacKenzie - 2000
1: Applying the lessons of Internet services to space systems - Cutler, Fox et al. - 2001
1: The case for a middle tier storage layer - Ling, Fox - 2003
1: Tivoli monitoring resource model reference - Machines - 2002
1: Integrated and correlated enterprise management with the ope.. - Packard - 2002
1: Design of recovery strategies for a fault-tolerant no - Willett - 1982
1: IRIX Checkpoint and Restart Operation Guide - Tuthill, Johnson et al. - 1999
CiteSeer.IST - Copyright Penn State and NEC