Accepting Failure: Availability through Repair-centric System Design (2001)
Aaron Brown



Abstract: Motivated by the lack of rapid improvement in the availability of Internet server systems, we introduce a new philosophy for designing highly-available systems that better reflects the realities of the Internet service environment. Our approach, denoted repair-centric system design, is based on the belief that failures are inevitable in complex, human-administered systems, and thus we focus on detecting and repairing failures quickly and effectively, rather than just trying to avoid them....

Cited by:
ROC-1: Hardware Support for Recovery-Oriented Computing - Oppenheimer, Brown.. (2002)

Active bibliography (related documents):
1.9:   Recovery Oriented Computing (ROC): Motivation.. - Patterson, Brown, .. (2002)
1.3:   To Err is Human - Brown, Patterson (2001)
1.3:   Embracing Failure: A Case for Recovery-Oriented Computing (ROC) - Brown, Patterson (2001)

Similar documents based on text:
0.7:   When Does Fast Recovery Trump High Reliability? - Fox, Patterson
0.5:   Toward Recovery-Oriented Computing - Fox (2002)
0.1:   Reducing Recovery Time in a Small Recursively.. - Candea, Cutler, Fox, .. (2002)

BibTeX entry:

A. Brown. Accepting failure: availability through repair-centric system design. U.C. Berkeley Qualifying Exam Proposal, 2001. http://citeseer.ist.psu.edu/brown01accepting.html

@misc{ brown01accepting,
  author = "A. Brown",
  title = "Accepting failure: availability through repair-centric system design",
  text = "A. Brown. Accepting failure: availability through repair-centric system
    design. U.C. Berkeley Qualifying Exam Proposal, 2001.",
  year = "2001",
  url = "citeseer.ist.psu.edu/brown01accepting.html"
}

Citations (may not include all citations):
1575   Computer Architecture: A Quantitative Approach - Hennessy, Patterson - 2001
200   Cluster-based Scalable Network Services - Fox, Gribble et al. - 1997
199   The Paradyn Parallel Performance Measurement Tool - Miller, Callaghan - 1995
189   ARIES: A Transaction Recovery Method Supporting Fine-Granula.. - Mohan, Haderle et al. - 1992
180   A Survey of Rollback-Recovery Protocols in Message-Passing S.. - Elnozahy, Johnson et al. - 1996
105   The Ninja architecture for robust Internet-scale systems and.. - Gribble, Welsh et al.
79   Why Do Computers Stop and What Can Be Done About It - Gray - 1986
68   ACM Transactions on Computer Systems - Borg, Blau et al. - 1989
66   The Rio File Cache: Surviving Operating System Crashes - Chen, Ng et al.
45   Recursive Restartability: Turning the Reboot Sledgehammer in.. - Candea, Fox
44   Normal Accidents: Living with High-Risk Technologies - Perrow - 1999
36   Software Rejuvenation: Analysis - Huang, Kintala et al.
27   Measuring System and Software Reliability using an Automated.. - Murphy, Gent - 1995
27   Towards Availability Benchmarks: A Case Study of Software RA.. - Brown, Patterson
26   High Speed and Robust Event Correlation - Yemini, Kliger - 1996
24   Virtual Services: A New Abstraction for Server Consolidation - Reumann, Mehra et al.
19   and Scalable Tolerant Systems - Fox, Brewer et al. - 1999
18   Exploring Failure Transparency and the Limits of Generic Rec.. - Lowell, Chandra et al. - 2000
17   Fault Isolation and Event Correlation for Integrated Fault M.. - Kätker, Paterok - 1997
17   Integrated Event Management: Event Correlation Using Depende.. - Gruschke - 1998
12   An Active Approach to Characterizing Dynamic Dependencies fo.. - Brown, Kar et al. - 2001
11   Probabilistic modeling of computer system availability - Goyal, Lavenberg et al. - 1987
10   Managing Application Services over Service Provider Networks.. - Kar, Keller et al.
9   Low-Overhead Recovery for General Applications - Lowell, Chen et al. - 1998
9   An Alarm Correlation and Fault Identification Scheme Based o.. - Choi, Choi et al.
7   Human Factors: Understanding People-System Relationships - Kantowitz, Sorkin - 1983
6   Auto-diagnosis of Field Problems in an Appliance Operating S.. - Banga
6   Characterizing Large Storage Systems: Error Behavior and Per.. - Talagala - 1999
5   No Time for DOWNTIME: IT Managers feel the heat to prevent out.. - Sweeney - 2000
5   Prevention of Online Crashes is No Easy Fix - Menn - 1999
5   the Necessity of On-line-BIST in Safety-Critical Application.. - Steininger, Scherrer
5   A Fault-Tolerant CMOS Mainframe - Spainhower, Gregg
4   Specifying Reliability in the Disk Drive Industry: No More M.. - Elerath
4   They Write the Right Stuff - Fishman - 1996
4   Human Performance: What Improvement from Human Reliability A.. - Pope - 1986
3   Personal communication - Bartlett - 2001
3   Dependability at the User Interface - Maxion, deChambeau - 1995
3   business redefines infrastructure needs - Fisher - 2000
3   File System Design for an NFS Server Appliance - Hitz, Lau et al. - 1995
3   RecoveryServiceability System Test Improvement IBM ES Based.. - Merenda, System et al.
2   Mitigating Operator-Induced Unavailability by Matching Impre.. - Maxion, Syme - 1996
2   Field Experience in Maintenance - Christensen, Howard
2   Towards Availability and Maintainability Benchmarks: A Case .. - Brown - 2001
2   New Problems in Fault-Tolerant Computing - Goldberg
2   Human Detection and Diagnosis of System Failures: Proceeding.. - Rasmussen, Rouse - 1981
1   Rules: Advice on government - Rumsfeld - 2001
1   San Francisco: Morgan-Kauffmann - Gray, Reuter et al. - 1993
1   Developing Reliable Software - Keene, Lane et al.
1   The Use of Flow Models for Automated Plant Diagnosis - Lind
1   Human Behavior Modeling in Train Control Systems - Joshi, Kaufman et al.
1   High-Availability Transaction Processing: Practical Experien.. - Bowles, Dobbins
1   Failure Detection in Dynamic Systems - Wickens, Kessel
1   Design for Fault-Tolerance in System ES/9000 Model - Spainhower, Isenberg et al.
1   A Protocol-centric Design for Architecting Large Storage Sys.. - Howard, Berube et al.
1   A Fault-Finding Training Programme for Continuous Plant Oper.. - Marshall, Shepherd
1   Training for Fault Diagnosis in an Industrial Process Plant - Duncan
1   Is paper safer? The role of paper flight strips in air traffic control - MacKay - 1999

Documents on the same site (http://roc.cs.berkeley.edu):
Architecture, operation, and dependability of large-scale .. - Oppenheimer, Patterson (2002)
Availability Benchmarking of a Database System - Brown
Recovery Oriented Computing (ROC): Motivation.. - Patterson, Brown, .. (2002)
