Abstract: Motivated by the lack of rapid improvement in the availability of Internet server systems,
we introduce a new philosophy for designing highly-available systems that better reflects
the realities of the Internet service environment. Our approach, denoted repair-centric
system design, is based on the belief that failures are inevitable in complex, human-administered
systems, and thus we focus on detecting and repairing failures quickly and effectively,
rather than just trying to avoid them....
Cited by:
ROC-1: Hardware Support for Recovery-Oriented Computing - Oppenheimer, Brown.. (2002)
Active bibliography (related documents):
1.9: Recovery Oriented Computing (ROC): Motivation.. - Patterson, Brown, .. (2002)
1.3: To Err is Human - Brown, Patterson (2001)
1.3: Embracing Failure: A Case for Recovery-Oriented Computing (ROC) - Brown, Patterson (2001)
Similar documents based on text:
0.7: When Does Fast Recovery Trump High Reliability? - Fox, Patterson
0.5: Toward Recovery-Oriented Computing - Fox (2002)
0.1: Reducing Recovery Time in a Small Recursively.. - Candea, Cutler, Fox, .. (2002)
BibTeX entry:
A. Brown. Accepting failure: availability through repair-centric system design. U.C. Berkeley Qualifying Exam Proposal, 2001. http://citeseer.ist.psu.edu/brown01accepting.html
@misc{ brown01accepting,
author = "A. Brown",
title = "Accepting failure: availability through repair-centric system design",
text = "A. Brown. Accepting failure: availability through repair-centric system
design. U.C. Berkeley Qualifying Exam Proposal, 2001.",
year = "2001",
url = "citeseer.ist.psu.edu/brown01accepting.html" }
Citations (may not include all citations):
1575: Computer Architecture: A Quantitative Approach - Hennessy, Patterson - 2001
200: Cluster-based Scalable Network Services - Fox, Gribble et al. - 1997
199: The Paradyn Parallel Performance Measurement Tool - Miller, Callaghan - 1995
189: ARIES: A Transaction Recovery Method Supporting Fine-Granula.. - Mohan, Haderle et al. - 1992
180: A Survey of Rollback-Recovery Protocols in Message-Passing S.. - Elnozahy, Johnson et al. - 1996
105: The Ninja architecture for robust Internet-scale systems and.. - Gribble, Welsh et al.
79: Why Do Computers Stop and What Can Be Done About It - Gray - 1986
68: ACM Transactions on Computer Systems - Borg, Blau et al. - 1989
66: The Rio File Cache: Surviving Operating System Crashes - Chen, Ng et al.
45: Recursive Restartability: Turning the Reboot Sledgehammer in.. - Candea, Fox
44: Normal Accidents: Living with High-Risk Technologies - Perrow - 1999
36: Software Rejuvenation: Analysis - Huang, Kintala et al.
27: Measuring System and Software Reliability using an Automated.. - Murphy, Gent - 1995
27: Towards Availability Benchmarks: A Case Study of Software RA.. - Brown, Patterson
26: High Speed and Robust Event Correlation - Yemini, Kliger - 1996
24: Virtual Services: A New Abstraction for Server Consolidation - Reumann, Mehra et al.
19: and Scalable Tolerant Systems - Fox, Brewer et al. - 1999
18: Exploring Failure Transparency and the Limits of Generic Rec.. - Lowell, Chandra et al. - 2000
17: Fault Isolation and Event Correlation for Integrated Fault M.. - tker, Paterok - 1997
17: Integrated Event Management: Event Correlation Using Depende.. - Gruschke - 1998
12: An Active Approach to Characterizing Dynamic Dependencies fo.. - Brown, Kar et al. - 2001
11: Probabilistic modeling of computer system availability - Goyal, Lavenberg et al. - 1987
10: Managing Application Services over Service Provider Networks.. - Kar, Keller et al.
9: Low-Overhead Recovery for General Applications - Lowell, Chen et al. - 1998
9: An Alarm Correlation and Fault Identification Scheme Based o.. - Choi, Choi et al.
7: Human Factors: Understanding People-System Relationships - Kantowitz, Sorkin - 1983
6: Auto-diagnosis of Field Problems in an Appliance Operating S.. - Banga
6: Characterizing Large Storage Systems: Error Behavior and Per.. - Talagala - 1999
5: No Time for DOWNTIME IT Managers feel the heat to prevent out.. - Sweeney - 2000
5: Prevention of Online Crashes is No Easy Fix - Menn - 1999
5: the Necessity of On-line-BIST in Safety-Critical Application.. - Steininger, Scherrer
5: A Fault-Tolerant CMOS Mainframe - Spainhower, Gregg
4: Specifying Reliability in the Disk Drive Industry: No More M.. - Elerath
4: They Write the Right Stuff - Fishman - 1996
4: Human Performance: What Improvement from Human Reliability A.. - Pope - 1986
3: Personal communication - Bartlett - 2001
3: Dependability at the User Interface - Maxion, deChambeau - 1995
3: business redefines infrastructure needs - Fisher - 2000
3: File System Design for an NFS Server Appliance - Hitz, Lau et al. - 1995
3: RecoveryServiceability System Test Improvement IBM ES Based.. - Merenda, System et al.
2: Mitigating Operator-Induced Unavailability by Matching Impre.. - Maxion, Syme - 1996
2: Field Experience in Maintenance - Christensen, Howard
2: Towards Availability and Maintainability Benchmarks: A Case .. - Brown - 2001
2: New Problems in Fault-Tolerant Computing - Goldberg
2: Human Detection and Diagnosis of System Failures: Proceeding.. - Rasmussen, Rouse - 1981
1: Rules: Advice on government - Rumsfeld - 2001
1: San Francisco: Morgan-Kauffmann - Gray, Reuter et al. - 1993
1: Developing Reliable Software - Keene, Lane et al.
1: The Use of Flow Models for Automated Plant Diagnosis - Lind
1: Human Behavior Modeling in Train Control Systems - Joshi, Kaufman et al.
1: High-Availability Transaction Processing: Practical Experien.. - Bowles, Dobbins
1: Failure Detection in Dynamic Systems - Wickens, Kessel
1: Design for Fault-Tolerance in System ES/9000 Model - Spainhower, Isenberg et al.
1: A Protocol-centric Design for Architecting Large Storage Sys.. - Howard, Berube et al.
1: A Fault-Finding Training Programme for Continuous Plant Oper.. - Marshall, Shepherd
1: Training for Fault Diagnosis in an Industrial Process Plant - Duncan
1: The role of paper flight strips in air traffic control - MacKay, safer - 1999
Documents on the same site (http://roc.cs.berkeley.edu):
Architecture, operation, and dependability of large-scale .. - Oppenheimer, Patterson (2002)
Availability Benchmarking of a Database System - Brown
Recovery Oriented Computing (ROC): Motivation.. - Patterson, Brown, .. (2002)