Improving Availability with Recursive Micro-Reboots: A Soft-State System Case Study (2003)
George Candea, James Cutler, Armando Fox




Abstract: Even after decades of software engineering research, complex computer systems still fail. This paper makes the case for increasing research emphasis on dependability and, specifically, on improving availability by reducing time-to-recover. All software fails at some point, so systems must be able to recover from failures. Recovery itself can fail too, so systems must know how to intelligently retry their recovery. We present here a recursive approach, in which a minimal subset of components is...

Context of citations to this paper:

.... one or more components have failed, it instructs a recovery agent, inside the JBoss application server, to micro reboot the suspect components [2]. To keep clients from failing while parts of the system are rebooting, our stall proxy will intercept new client connections and...

Cited by:
A Survey of Fault-Tolerance and Fault-Recovery Techniques in.. - Treaster (2005)
Crash-Only Software - Candea, Fox (2003)
JAGR: An Autonomous Self-Recovering Application Server - Candea, Kiciman, Zhang.. (2003)

Similar documents (at the sentence level):
10.8%:   Reducing Recovery Time in a Small Recursively.. - Candea, Cutler, Fox, .. (2002)

Active bibliography (related documents):
2.2:   Recursive Restartability: Turning the Reboot Sledgehammer into.. - Candea, Fox (2001)
1.2:   Session State: Beyond Soft State - Ling, Kiciman, Fox (2004)
1.1:   Designing for High Availability and Measurability - Candea, Fox (2001)

Similar documents based on text:
1.9:   When Does Fast Recovery Trump High Reliability? - Fox, Patterson
0.2:   Toward Recovery-Oriented Computing - Fox (2002)
0.2:   Vassal: Loadable Scheduler Support for Multi-Policy Scheduling - Candea, Jones (1998)

Related documents from co-citation:
3:   Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - Candea, Fox
2:   Fail-stutter fault tolerance - Arpaci-Dusseau, Arpaci-Dusseau - 2001

BibTeX entry:

G. Candea, J. Cutler, and A. Fox. Improving availability with recursive micro-reboots: A soft-state system case study. Performance Evaluation Journal, Summer 2003. to appear. http://citeseer.ist.psu.edu/candea03improving.html

@misc{ candea03improving,
  author = "G. Candea and J. Cutler and A. Fox",
  title = "Improving availability with recursive micro-reboots: A soft-state system
    case study",
  text = "G. Candea, J. Cutler, and A. Fox. Improving availability with recursive
    micro-reboots: A soft-state system case study. Performance Evaluation Journal,
    Summer 2003. to appear.",
  year = "2003",
  url = "citeseer.ist.psu.edu/candea03improving.html" }
