Results 1 - 10 of 88
Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
2007
"... Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million. In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance ..."
Abstract
-
Cited by 245 (9 self)
Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million. In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. About 100,000 disks are covered by this data, some for an entire lifetime of five years. The data include drives with SCSI and FC, as well as SATA interfaces. The mean time to failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%. We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF. We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. That is, replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years. Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks. Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.
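The datasheet arithmetic this abstract refers to is easy to reproduce. The short Python sketch below is an editorial illustration rather than code from the paper: it converts a datasheet MTTF into a nominal annual failure rate under the usual constant-rate (exponential) assumption and contrasts it with the 2-4% field replacement rates reported above.

```python
import math

HOURS_PER_YEAR = 8760  # 24 * 365

def nominal_afr(mttf_hours: float) -> float:
    """Nominal annual failure rate implied by a datasheet MTTF, assuming a
    constant (exponential) failure rate, i.e. AFR = 1 - exp(-8760 / MTTF)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)

for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:>9,} h -> nominal AFR {nominal_afr(mttf):.2%}")

# MTTF 1,000,000 h gives a nominal AFR of ~0.87% (the abstract's "at most 0.88%"
# is the simple ratio 8760 / 1,000,000); the 2-4% (and up to 13%) replacement
# rates observed in the field are several times higher than this datasheet figure.
```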
A large-scale study of failures in high-performance computing systems
- in Proceedings of the International Conference on Dependable Systems and Networks (DSN), 2006
"... Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations is publicly available. This paper analyzes failure data recently made publicy available by one of the largest high-performance computing ..."
Abstract
-
Cited by 206 (7 self)
Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations is publicly available. This paper analyzes failure data recently made publicly available by one of the largest high-performance computing sites. The data has been collected over the past 9 years at Los Alamos National Laboratory and includes 23,000 failures recorded on more than 20 different systems, mostly large clusters of SMP and NUMA nodes. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find, for example, that average failure rates differ wildly across systems, ranging from 20–1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.
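The distributional claims in this abstract (Weibull inter-failure times with decreasing hazard, lognormal repair times) can be checked against trace data with standard fitting tools. The sketch below is not the authors' code; it uses synthetic data as a stand-in for the LANL records and scipy's maximum-likelihood fits.

```python
from scipy import stats

# Synthetic stand-ins for the field data (hours); the real inputs would be the
# LANL inter-failure and repair-time records described above.
time_between_failures = stats.weibull_min.rvs(c=0.7, scale=200, size=2000, random_state=0)
repair_times = stats.lognorm.rvs(s=1.0, scale=6.0, size=2000, random_state=1)

# Weibull fit: a shape parameter c < 1 corresponds to a decreasing hazard rate,
# i.e. the longer a system has survived since its last failure, the lower its
# instantaneous failure rate -- the behaviour reported in the abstract.
c, loc, scale = stats.weibull_min.fit(time_between_failures, floc=0)
print(f"Weibull shape c = {c:.2f} ({'decreasing' if c < 1 else 'increasing'} hazard rate)")

# Lognormal fit for repair times; the scale parameter is the median repair time.
s, loc, scale = stats.lognorm.fit(repair_times, floc=0)
print(f"lognormal sigma = {s:.2f}, median repair time = {scale:.1f} h")
```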
Understanding Failures in Petascale Computers
"... With petascale computers only a year or two away there is a pressing need to anticipate and compensate for a probable increase in failure and application interruption rates. Researchers, designers and integrators have available to them far too little detailed information on the failures and interrup ..."
Abstract
-
Cited by 103 (6 self)
With petascale computers only a year or two away, there is a pressing need to anticipate and compensate for a probable increase in failure and application interruption rates. Researchers, designers and integrators have far too little detailed information available to them on the failures and interruptions that even smaller terascale computers experience. The information that is available suggests that application interruptions will become far more common in the coming decade, and the largest applications may surrender large fractions of the computer’s resources to taking checkpoints and restarting from a checkpoint after an interruption. This paper reviews sources of failure information for compute clusters and storage systems, projects failure rates and the corresponding decrease in application effectiveness, and discusses coping strategies such as application-level checkpoint compression and system-level process-pairs fault-tolerance for supercomputing. The need for a public repository for detailed failure and interruption records is particularly pressing, as projections from one architectural family of machines to another are widely disputed. To this end, this paper introduces the Computer Failure Data Repository and issues a call for failure history data to publish in it.
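The projected "decrease in application effectiveness" can be illustrated with the textbook Young/Daly checkpoint-interval approximation. This is a generic first-order model, not necessarily the projection method used in the paper, and the checkpoint times and MTBF values below are made-up inputs.

```python
import math

def optimal_interval(checkpoint_time_s: float, mtbf_s: float) -> float:
    """Young's approximation for the checkpoint interval that minimizes waste."""
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_s)

def effective_utilization(checkpoint_time_s: float, mtbf_s: float) -> float:
    """Fraction of machine time spent on useful work, counting checkpoint
    overhead and expected rework after failures (first-order model)."""
    tau = optimal_interval(checkpoint_time_s, mtbf_s)
    waste = checkpoint_time_s / tau + tau / (2.0 * mtbf_s)
    return max(0.0, 1.0 - waste)

# Hypothetical numbers: a 10-minute checkpoint, with system MTBF shrinking as
# node counts grow toward petascale.
for mtbf_hours in (24, 8, 2):
    u = effective_utilization(600, mtbf_hours * 3600)
    print(f"MTBF {mtbf_hours:>2} h -> ~{u:.0%} of machine time is useful work")
```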
The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems
- in Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010
"... With the increasing functionality and complexity of distributed systems, resource failures are inevitable. While numerous models and algorithms for dealing with failures exist, the lack of public trace data sets and tools has prevented meaningful comparisons. To facilitate the design, validation, an ..."
Abstract
-
Cited by 68 (18 self)
With the increasing functionality and complexity of distributed systems, resource failures are inevitable. While numerous models and algorithms for dealing with failures exist, the lack of public trace data sets and tools has prevented meaningful comparisons. To facilitate the design, validation, and comparison of fault-tolerant models and algorithms, we have created the Failure Trace Archive (FTA) as an online public repository of availability traces taken from diverse parallel and distributed systems. Our main contributions in this study are the following. First, we describe the design of the archive, in particular the rationale of the standard FTA format, and the design of a toolbox that facilitates automated analysis of trace data sets. Second, applying the toolbox, we present a uniform comparative analysis with statistics and models of failures in nine distributed systems. Third, we show how different interpretations of these data sets can result in different conclusions. This emphasizes the critical need for the public availability of trace data and methods for their analysis.
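The kind of automated analysis the FTA toolbox supports can be sketched over a simplified availability-interval trace. The record layout below is a placeholder, not the actual FTA format (which the archive documents); the sketch only illustrates the per-node statistics one would typically derive.

```python
from collections import defaultdict

# Hypothetical simplified records (node_id, start_s, end_s) marking intervals in
# which a node was available; treat this layout as a stand-in for real FTA traces.
intervals = [
    ("n1", 0, 3_600), ("n1", 4_000, 20_000),
    ("n2", 0, 50_000), ("n2", 60_000, 90_000),
]

per_node = defaultdict(list)
for node, start, end in intervals:
    per_node[node].append((start, end))

for node, spans in sorted(per_node.items()):
    spans.sort()
    up = sum(end - start for start, end in spans)
    horizon = spans[-1][1] - spans[0][0]
    # Gaps between consecutive availability intervals count as unavailability events.
    gaps = [nxt - end for (_, end), (nxt, _) in zip(spans, spans[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    print(f"{node}: availability {up / horizon:.1%}, "
          f"{len(gaps)} unavailability event(s), mean downtime {mean_gap:.0f} s")
```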
Exploiting Availability Prediction in Distributed Systems
2006
"... Loosely-coupled distributed systems have significant scale and cost advantages over more traditional architectures, but the availability of the nodes in these systems varies widely. Availability modeling is crucial for predicting per-machine resource burdens and understanding emergent, system-wide p ..."
Abstract
-
Cited by 61 (2 self)
Loosely-coupled distributed systems have significant scale and cost advantages over more traditional architectures, but the availability of the nodes in these systems varies widely. Availability modeling is crucial for predicting per-machine resource burdens and understanding emergent, system-wide phenomena. We present new techniques for predicting availability and test them using traces taken from three distributed systems. We then describe three applications of availability prediction. The first, availability-guided replica placement, reduces object copying in a distributed data store while increasing data availability. The second shows how availability prediction can improve routing in delay-tolerant networks. The third combines availability prediction with virus modeling to improve forecasts of global infection dynamics.
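For a flavor of what per-machine availability prediction means in practice, the sketch below shows one deliberately simple history-based predictor. It is offered only as an editorial illustration; the paper evaluates its own, more sophisticated predictors against real traces, and the trace here is invented.

```python
# Predict availability from recent per-machine history; everything here
# (window size, toy trace) is a hypothetical illustration.

def predict_up(history: list[bool], window: int = 24) -> float:
    """Estimate of the probability that the machine is up in the next slot,
    taken as the fraction of 'up' observations in the recent window."""
    recent = history[-window:] if history else []
    return sum(recent) / len(recent) if recent else 0.5  # no data -> uninformative prior

# Toy trace: hourly up/down samples for one machine.
trace = [True] * 20 + [False] * 2 + [True] * 10
print(f"predicted P(up next hour) = {predict_up(trace):.2f}")
# Replica-placement or routing policies can then favour machines with high
# predicted availability, as the abstract's applications do.
```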
Predicting bounds on queuing delay for batch-scheduled parallel machines
- in Proceedings of PPoPP 2006
"... Most space-sharing parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. In many cases, users wishing to use these batch-queued resources have accounts at multiple sites and have the option of choosing at which site or s ..."
Abstract
-
Cited by 47 (10 self)
Most space-sharing parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. In many cases, users wishing to use these batch-queued resources have accounts at multiple sites and have the option of choosing at which site or sites to submit a parallel job. In such a situation, the amount of time a user’s job will wait in any one batch queue can significantly impact the overall time a user waits from job submission to job completion. In this work, we explore a new method for providing end-users with predictions for the bounds on the queuing delay individual jobs will experience. We evaluate this method using batch scheduler logs for distributed-memory parallel machines that cover a 9-year period at 7 large HPC centers. Our results show that it is possible to predict delay bounds reliably for jobs in different queues, and for jobs requesting different ranges of processor counts. Using this information, scientific application developers can intelligently decide where to submit their parallel codes in order to minimize overall turnaround time.
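One standard nonparametric way to turn a history of observed waits into an upper bound with stated confidence is to pick an order statistic using the binomial distribution. The sketch below illustrates that general idea under an i.i.d. assumption; it is not claimed to be the exact procedure of the paper, and the wait-time history is invented.

```python
from scipy.stats import binom

def quantile_upper_bound(waits: list[float], q: float = 0.95, conf: float = 0.95) -> float:
    """Smallest order statistic of the observed waits that upper-bounds the true
    q-quantile with at least the requested confidence, using the fact that the
    number of observations below the true q-quantile is Binomial(n, q)."""
    xs = sorted(waits)
    n = len(xs)
    for k in range(1, n + 1):
        # P(k-th order statistic >= true q-quantile) = P(Binomial(n, q) <= k - 1)
        if binom.cdf(k - 1, n, q) >= conf:
            return xs[k - 1]
    raise ValueError("not enough samples for the requested quantile/confidence")

# Hypothetical queue-wait history in minutes.
history = [3, 5, 7, 8, 10, 12, 15, 20, 25, 30, 45, 60, 90, 120, 180] * 10
print(f"95th-percentile wait bounded by {quantile_upper_bound(history)} min (95% confidence)")
```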
QBETS: Queue bounds estimation from time series
- in Proceedings of Job Scheduling Strategies for Parallel Processing (JSSPP), 2007
"... Most space-sharing parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. Because these machines are typically “spaceshared,” each job must wait in a queue until sufficient processor resources become available to service ..."
Abstract
-
Cited by 35 (5 self)
Most space-sharing parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. Because these machines are typically “space-shared,” each job must wait in a queue until sufficient processor resources become available to service it. In production computing settings, the queuing delay (experienced by users as the time between when the job is submitted and when it begins execution) is highly variable. Users often find this variability a drag on productivity, as it makes planning difficult and intellectual continuity hard to maintain. In this work, we introduce an on-line system for predicting batch-queue delay and show that it generates correct and accurate bounds for queuing delay for batch jobs from 11 machines over a 9-year period. Our system comprises 4 novel and interacting components: a predictor based on nonparametric inference; an automated change-point detector; machine-learned, model-based clustering of jobs having similar characteristics; and an automatic downtime detector to identify systemic failures that affect job queuing delay. We compare the correctness and accuracy of our system against various previously used prediction techniques and show that our new method outperforms them for all machines we have available for study.
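"Correctness" for bounds of this kind is naturally measured as empirical coverage: the fraction of jobs whose actual delay fell at or below the bound predicted for them before they ran. The sketch below is a generic online evaluation harness of that kind, not the QBETS implementation; the plugged-in predictor (a plain empirical 95th percentile) and the trace are placeholders.

```python
from typing import Callable, Sequence

def coverage(waits: Sequence[float],
             predictor: Callable[[Sequence[float]], float],
             warmup: int = 50) -> float:
    """Replay a wait-time trace online: before each job, predict a bound from the
    history seen so far, then check whether the actual wait stayed below it."""
    hits = total = 0
    for i in range(warmup, len(waits)):
        bound = predictor(waits[:i])
        hits += waits[i] <= bound
        total += 1
    return hits / total

def p95(history):
    # Simple stand-in predictor: the empirical 95th percentile of the history.
    xs = sorted(history)
    return xs[int(0.95 * (len(xs) - 1))]

trace = [3, 5, 7, 8, 10, 12, 15, 20, 25, 30, 45, 60, 90, 120, 180] * 20
print(f"empirical coverage of the 95% bound: {coverage(trace, p95):.1%}")
# A correct 95% bound should reach roughly 95% coverage on a stationary trace;
# change points and downtimes are what break this in practice, hence the
# additional components the abstract describes.
```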
On the dynamic resource availability in grids
- in Proceedings of the 8th IEEE/ACM International Conference on Grid Computing, 2007
"... Currently deployed grids gather together thousands of computational and storage resources for the benefit of a large community of scientists. However, the large scale, the wide geographical spread, and at times the decision of the rightful resource owners to commit the capacity elsewhere, raises s ..."
Abstract
-
Cited by 30 (11 self)
Currently deployed grids gather together thousands of computational and storage resources for the benefit of a large community of scientists. However, the large scale, the wide geographical spread, and at times the decision of the rightful resource owners to commit the capacity elsewhere raise serious resource availability issues. Little is known about the characteristics of grid resource availability, and about the impact of resource unavailability on the performance of grids. In this work, we take first steps in addressing this twofold lack of information. First, we analyze a long-term availability trace and assess the resource availability characteristics of Grid’5000, an experimental grid environment of over 2,500 processors. The average utilization for the studied trace increases by almost 5% when availability is considered. Based on the results of the analysis, we further propose a model for grid resource availability. Our analysis and modeling results show that grid computational resources become unavailable at a high rate, negatively affecting the ability of grids to execute long jobs. Second, through trace-based simulation, we show evidence that resource availability can have a severe impact on the performance of grid systems. The results of this step show that the performance of a grid system can improve when availability is taken into consideration, and that human administration of availability-change information results in 10-15 times more job failures than an automated monitoring solution, even for a lightly utilized system.
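The observation that high unavailability rates hurt long jobs has a simple first-order illustration: if a resource fails at a roughly constant rate (a simplifying assumption made here for illustration, not the paper's trace-derived model), the chance that a job runs to completion without interruption decays exponentially with its length.

```python
import math

def completion_probability(job_hours: float, mtbf_hours: float) -> float:
    """P(no interruption during the job) under a constant-failure-rate assumption."""
    return math.exp(-job_hours / mtbf_hours)

# Hypothetical per-resource MTBF; the paper derives availability characteristics
# from the Grid'5000 trace rather than assuming a distribution.
mtbf = 100.0
for job in (1, 10, 50, 200):
    print(f"{job:>3} h job -> P(uninterrupted) = {completion_probability(job, mtbf):.2f}")
```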
Automatic methods for predicting machine availability in desktop Grid and peer-to-peer systems
- in Proceedings of the 4th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID’04), 2004
"... In this paper, we examine the problem of predicting machine availability in desktop and enterprise computing environments. Predicting the duration that a machine will run until it restarts (availability duration) is critically useful to application scheduling and resource characterization in federat ..."
Abstract
-
Cited by 28 (1 self)
In this paper, we examine the problem of predicting machine availability in desktop and enterprise computing environments. Predicting the duration that a machine will run until it restarts (availability duration) is critically useful to application scheduling and resource characterization in federated systems. We describe one parametric model fitting technique and two non-parametric prediction techniques, comparing their accuracy in predicting the quantiles of empirically observed machine availability distributions. We describe each method analytically and evaluate its precision using a synthetic trace of machine availability constructed from a known distribution. To detail their practical efficacy, we apply them to machine availability traces from three separate desktop and enterprise computing environments, and evaluate each method in terms of the accuracy with which it predicts availability in a trace driven simulation. Our results indicate that availability duration can be predicted with quantifiable confidence bounds and that these bounds can be used as conservative bounds on lifetime predictions. Moreover, a non-parametric method based on a binomial approach generates the most accurate estimates.
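The comparison the abstract describes, parametric fits versus non-parametric estimates of availability-duration quantiles, can be sketched as follows. The Weibull choice and the synthetic data are placeholders, and the plain empirical quantile is only a rough stand-in for the binomial-based method the paper finds most accurate.

```python
import numpy as np
from scipy import stats

# Synthetic availability durations (hours) standing in for measured ones.
durations = stats.weibull_min.rvs(c=0.8, scale=30, size=500, random_state=1)

q = 0.05  # a low quantile gives a conservative lower bound on how long a
          # machine is likely to stay up, useful for scheduling decisions.

# Parametric estimate: fit a Weibull and read off its 5th percentile.
c, loc, scale = stats.weibull_min.fit(durations, floc=0)
parametric = stats.weibull_min.ppf(q, c, loc=loc, scale=scale)

# Non-parametric estimate: the empirical 5th percentile.
empirical = float(np.quantile(durations, q))

print(f"5th-percentile availability duration: Weibull fit {parametric:.1f} h, "
      f"empirical {empirical:.1f} h")
```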
Fault-aware scheduling for bag-of-tasks applications on desktop grids
- in Proceedings of GRID ’06, IEEE Computer Society
"... Abstract — Desktop Grids have proved to be a suitable platform for the execution of Bag-of-Tasks applications but, being characterized by a high resource volatility, require the availability of scheduling techniques able to effectively deal with resource failures and/or unplanned periods of unavaila ..."
Abstract
-
Cited by 25 (2 self)
Desktop Grids have proved to be a suitable platform for the execution of Bag-of-Tasks applications but, being characterized by a high resource volatility, require the availability of scheduling techniques able to effectively deal with resource failures and/or unplanned periods of unavailability. In this paper we present a set of fault-aware scheduling policies that, rather than just tolerating faults as done by traditional fault-tolerant schedulers, exploit the information concerning resource availability to improve application performance. The performance of these strategies has been compared via simulation with that attained by traditional fault-tolerant schedulers. Our results, obtained by considering a set of realistic scenarios modeled after real Desktop Grids, show that our approach results in better application performance and resource utilization.
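As a toy illustration of exploiting availability information in scheduling (rather than the specific policies studied in the paper), the sketch below assigns each task to a host whose predicted remaining availability covers the task's expected runtime. Host names, predictions and runtimes are hypothetical.

```python
tasks = [("t1", 2.0), ("t2", 5.0), ("t3", 1.0)]   # (task, estimated runtime in hours)
hosts = {"h1": 1.5, "h2": 6.0, "h3": 3.0}         # predicted remaining availability in hours

def fault_aware_assign(tasks, hosts):
    assignments = {}
    free = dict(hosts)
    # Longest tasks first; each goes to the free host with the smallest prediction
    # that still covers its runtime (best fit), falling back to the largest prediction.
    for task, runtime in sorted(tasks, key=lambda t: -t[1]):
        fitting = [h for h, avail in free.items() if avail >= runtime]
        chosen = min(fitting, key=free.get) if fitting else max(free, key=free.get)
        assignments[task] = chosen
        free.pop(chosen)
    return assignments

print(fault_aware_assign(tasks, hosts))
# {'t2': 'h2', 't1': 'h3', 't3': 'h1'} -- each task lands on a host expected to
# stay up long enough, which is the intuition behind fault-aware (as opposed to
# merely fault-tolerant) scheduling.
```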