  Reducing recovery time in a small recursively restartable system (2002) [6 citations — 0 self]

by George Candea, James Cutler, Armando Fox, Rushabh Doshi, Priyank Garg, Rakesh Gowda
In DSN
http://roc.cs.berkeley.edu/papers/rr_mercury.pdf

Abstract:

We present ideas on how to structure software systems for high availability by considering MTTR/MTTF characteristics of components in addition to the traditional criteria, such as functionality or state sharing. Recursive restartability (RR), a recently proposed technique for achieving high availability, exploits partial restarts at various levels within complex software infrastructures to recover from transient failures and rejuvenate software components. Here we refine the original proposal and apply the RR philosophy to Mercury, a COTS-based satellite ground station that has been in operation for over 2 years. We develop three techniques for transforming component group boundaries such that time-to-recover is reduced, hence increasing system availability. We also further RR by defining the notions of an oracle, restart group and restart policy, while showing how to reason about system properties in terms of restart groups. From our experience with applying RR to Mercury, we draw design guidelines and lessons for the systematic application of recursive restartability to other software systems amenable to RR.
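The abstract's link between shorter time-to-recover and higher availability can be illustrated with the standard steady-state relation availability = MTTF / (MTTF + MTTR). The sketch below is only an illustration under that assumption; the function name and the numbers are hypothetical and are not taken from the paper or from Mercury.

```python
def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability: expected fraction of time a component is up.

    mttf: mean time to failure; mttr: mean time to recover (same units).
    """
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)


# Hypothetical values in minutes, chosen only for illustration.
MTTF = 600.0
before = availability(MTTF, mttr=30.0)  # slower, coarse-grained restart
after = availability(MTTF, mttr=6.0)    # faster, partial restart

print(f"before: {before:.4f}, after: {after:.4f}")
```

With MTTF held fixed, any reduction in MTTR raises the ratio, which is the sense in which restructuring restart groups for faster recovery improves availability.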

Citations

252 Mach: A New Kernel Foundation For UNIX Development – Accetta, Baron, et al. - 1986
102 Lessons from Giant-Scale Services – Brewer - 2001
90 Software rejuvenation: Analysis, module and applications – Huang, Kintala, et al. - 1995
76 Recursive restartability: Turning the reboot sledgehammer into a scalpel – Candea, Fox - 2001
47 Denali: Lightweight Virtual Machines for Distributed and Networked Applications – Whitaker, Shaw, et al. - 2002
38 Analysis of software rejuvenation using Markov regenerative stochastic Petri net – Garg, Puliafito, et al. - 1995
26 Recovery oriented computing (roc): Motivation, definition, techniques – Patterson, Brown, et al. - 2002
11 The event heap: An enabling infrastructure for interactive workspaces – Johanson, Fox, et al. - 2001
6 What really happened on Mars? RISKS-19.49 – Reeves - 1998
5 The smart ship is not enough – DiGiorgio - 1998
5 SAPPHIRE - Stanford's first amateur satellite – Swartwout, Twiggs - 1998
4 A utility-centered approach to building dependable infrastructure services – Candea, Fox - 2002
3 editors. Reducing the – Miau, Holdaway - 2000
1 Making sound tradeoffs in state management – Candea, Fox - 2002