A large-scale study of failures in high-performance computing systems
Bianca Schroeder and Garth A. Gibson, Computer Science Department, Carnegie Mellon ...



View or download:
cmu.edu/PDLFTP/stray/dsn06.pdf

From: cmu.edu/Publications/index


Abstract: Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations is publicly available. This paper analyzes failure data recently made publicly available by one of the largest high-performance computing sites. The data has been collected over the past 9 years at Los Alamos National Laboratory and includes 23000 failures recorded on more than 20 different systems, mostly large clusters of SMP and...

Active bibliography (related documents):
4.9:   A Large-Scale Study of Failures in High-Performance.. - Bianca Schroeder Garth
0.4:   Networked Windows NT System Field Failure Data Analysis - Xu, Kalbarczyk, Iyer (1999)
0.3:   Using Fault Injection and Modeling to Evaluate the .. - Nagaraja, Li.. (2003)

BibTeX entry:

@misc{ garth-largescale,
  author = "Bianca Schroeder and Garth A. Gibson",
  title = "A Large-Scale Study of Failures in High-Performance Computing Systems",
  url = "citeseer.ist.psu.edu/746871.html"
}

Citations (may not include all citations):
454   Self-similarity through high-variability: statistical analys.. - Willinger, Taqqu et al. - 1997
79   Why do computers stop and what can be done about it - Gray - 1986
46   The condor distributed processing system - Tannenbaum, Litzkow - 1995
38   A longitudinal survey of internet host reliability - Long, Muir et al. - 1995
27   Measuring system and software reliability using an automated.. - Murphy, Gent - 1995
26   A census of tandem system availability between - Gray - 1985
25   Failure data analysis of a LAN of Windows NT based computers - Kalyanakrishnam, Kalbarczyk et al. - 1999
23   A case for two-level distributed recovery schemes - Vaidya - 1995
20   Experimental assessment of workstation failures and their im.. - Plank, Elwasif - 1998
20   Measurement and modeling of computer reliability as affected.. - Iyer, Rossetti et al. - 1986
18   Why do internet services fail - Oppenheimer, Ganapathi et al. - 2003
11   Improving cluster availability using workstation validation - Heath, Martin et al. - 2002
10   Performance analysis of two time-based coordinated checkpoin.. - Kavanaugh, Sanders - 1997
9   Error log analysis: Statistical modeling and heuristic trend.. - Lin, Siewiorek - 1990
7   Networked Windows NT system field failure data analysis - Xu, Kalbarczyk et al. - 1999
6   Checkpointing in distributed computing systems - Wong, Franklin - 1996
5   Modeling machine availability in enterprise and wide-area di.. - Nurmi, Brevik et al. - 2005
4   Analysis of workload influence on dependability - Meyer, Wei - 1988
2   Subtleties in tolerating correlated failures - Nath, Yu et al. - 2006
2   Lifecycle analysis using software defects per million - Mullen - 2005
2   Performance implications of failures in large-scale cluster .. - Zhang, Squillante et al. - 2004
2   and reliability of digital computing systems - Castillo, Siewiorek et al. - 1981
2   Failure data analysis of a large-scale heterogeneous server .. - Sahoo, Sivasubramaniam et al. - 2004
2   Failure analysis and modelling of a VAX cluster system - Tang, Iyer et al. - 1990

Documents on the same site (http://www.pdl.cmu.edu/Publications/index.html):
Blurring the Line Between OSes and Storage Devices - Ganger (2001)
Compiler-Based I/O Prefetching for Out-of-Core Applications - Brown, Mowry, Krieger (2001)
My cache or yours? Making storage more exclusive - Wong, Wilkes (2002)
