| M. Sullivan and R. Chillarege. "Software Defects and their Impact on Systems Availability: A Study of Field Failures in Operating Systems," FTCS 1991: 2-9. |
....modified the corresponding fields within a descriptor before passing it up to VIPL library. We considered three types of corrupted parameters: passing of NULL pointers, off by N for data pointers, and off by N for buffer sizes. N in all cases were in the range of 0 to 100 bytes, which was observed in [34] to be the dominant range for offset errors. 5 PRESS Behavior Under Single Fault Loads We now apply the first phase of our methodology to evaluate the performability of PRESS. In particular, we measure and explain the behavior of all 5 versions of PRESS under single faults injected in isolation. ....
....the performability of the different versions of PRESS. Recall that to make the modeling tractable, we assume that faults in different components are not correlated and all fault arrivals are exponentially distributed. We have done our best to derive meaningful parameters from the available data [11, 12, 15, 21, 35, 34, 36]. However, data is sparse, particularly for application level errors. Thus, we examine performability for a range, once per day to once per month, of MTTFs for application level faults. In addition, because we have multiple classes of errors, we divided the application fault rate between these ....
[Article contains additional citation context not shown here]
M. Sullivan and R. Chillarege. Software Defects and their Impact on System Availability - A Study of Field Failures in Operating Systems. In Proceedings of the 21st International Symposium on Fault-Tolerant Computing (FTCS-21), pages 2--9, Montreal, Canada, 1991.
....main group to cause the automatic reboot of that node. While this is an extreme example of FME, it does improve the availability of PRESS substantially, as well as reduces the need for operator coverage. 8 Related Work There has been extensive work in analyzing faults and how they impact systems [11, 31, 17]. Studies benchmarking system behavior under fault loads include [15, 19]. Unfortunately, these works do not provide a good understanding of how one would estimate overall system availability under a given fault load. There has also been a large number of system availability studies. Two ....
M. Sullivan and R. Chillarege. Software defects and their impact on system availability - a study of field failures in operating systems. 21st Int. Symp. on Fault-Tolerant Computing (FTCS-21), pages 2--9, 1991.
....2.1 Description of Faults We first describe the types of faults we inject into the operating system. Our primary goal in designing these faults is to generate a wide variety of operating system crashes. Our models are derived from studies of commercial operating systems and databases [66], [65], [43] and from prior models used in fault injection studies [10], [39], [38], [15]. The faults we inject range from low-level hardware faults, such as flipping bits in memory, to high-level software faults, such as memory allocation errors. Table 1 shows examples of how real world programming ....
....assignment statements by changing the source or destination register. We corrupt conditional constructs by deleting branches. We also delete random instructions (both branch and nonbranch). The last and most extensive category of faults imitate specific programming errors in the operating system [65]. These are targeted more at specific programming errors than the previous fault category. Table 1 provides a summary of the programming errors we inject. The implementation details can be found in [52]. We collect data on 100 crashes (each using a different random seed) for each of the 15 fault ....
[Article contains additional citation context not shown here]
M. Sullivan and R. Chillarege, "Software Defects and Their Impact on System Availability--A Study of Field Failures in Operating Systems," Proc. 1991.
....have studied the problem of fault tolerance extensively. A full treatment of this body of work is beyond the scope of this paper. Instead, we concentrate on efforts that have focused on improving the availability of cluster based services. Of course, work analyzing how faults impact systems [14, 19, 31, 32], as well as empirical measurement of actual fault rates [2, 16, 23, 18, 24] are necessary background for a model based quantification effort such as ours. Our methodology and infrastructure seem to be the first directed to quantifying the availability impact of a range of techniques as applied ....
M. Sullivan and R. Chillarege. Software Defects and their Impact on System Availability - A Study of Field Failures in Operating Systems. In Proceedings of the 21st International Symposium on Fault-Tolerant Computing (FTCS-21), pages 2-9, Montreal, Canada, 1991.
....the corresponding fields within a descriptor before passing it up to VIPL library. We considered three types of corrupted parameters: passing of NULL pointers, off by N for data pointers, and off by N for buffer sizes. N in all cases were in the range of 0 to 100 bytes, which was observed in [34] to be the dominant range for offset errors. 5 PRESS Behavior Under Single Fault Loads We now apply the first phase of our methodology to evaluate the performability of PRESS. In particular, we measure and explain the behavior of all 5 versions of PRESS under single faults injected in ....
....the performability of the different versions of PRESS. Recall that to make the modeling tractable, we assume that faults in different components are not correlated and all fault arrivals are exponentially distributed. We have done our best to derive meaningful parameters from the available data [11, 12, 15, 21, 35, 34, 36]. However, data is sparse, particularly for application level errors. Thus, we examine performability for a range, once per day to once per month, of MTTFs for application level faults. In addition, because we have multiple classes of errors, we divided the application fault rate between these ....
[Article contains additional citation context not shown here]
M. Sullivan and R. Chillarege. Software Defects and their Impact on System Availability - A Study of Field Failures in Operating Systems. In Proceedings of the 21st International Symposium on Fault-Tolerant Computing (FTCS-21), pages 2--9, Montreal, Canada, 1991.
....the performability of the different versions of PRESS. Recall that to make the modeling tractable, we assume that faults in different components are not correlated and all fault arrivals are exponentially distributed. We have done our best to derive meaningful parameters from the available data [25], [26], [27], [18], [12], [28], [17]. A duration of 5 minutes was assumed for the operator intervention stage E and restart stage F. B. Evaluation Metrics Our model computes two metrics to evaluate each server. The first is the unavailability, which is the average fraction of requests dropped. ....
M. Sullivan and R. Chillarege, "Software Defects and their Impact on System Availability - A Study of Field Failures in Operating Systems," 21st Int. Symp. on Fault-Tolerant Computing (FTCS-21), pp. 2--9, 1991.
....to VIPL library. We considered three types of corrupted parameters: passing of NULL pointers, off by N for data pointers, and off by N for buffer sizes. N in all cases were in the range of 0 to 100 bytes, which has been observed by Sullivan and Chillarege to be the dominant range for offset errors [41]. 5 PRESS Behavior Under Single Fault Loads We now apply the first phase of our methodology to evaluate the performability of PRESS. In particular, we measure and explain the behavior of all 5 versions of PRESS under single faults injected in isolation. 5 5.1 Experimental Setup In all ....
....the performability of the different versions of PRESS. Recall that to make the modeling tractable, we assume that faults in different components are not correlated and all fault arrivals are exponentially distributed. We have done our best to derive meaningful parameters from the available data [41, 13, 16, 27, 42, 43, 20]. However, data is sparse, particularly for application level errors. Thus, we examine performability for a range, once per day to once per month, of MTTFs for application level faults. In addition, because we have multiple classes of errors, we divided the fault rate between these errors ....
[Article contains additional citation context not shown here]
M. Sullivan and R. Chillarege. Software defects and their impact on system availability - a study of field failures in operating systems. 21st Int. Symp. on Fault-Tolerant Computing (FTCS-21), pages 2--9, 1991.
....distributed time to failure. Those proposed in [8, 20, 23] can accommodate general distributions but only for the specific aging effect they capture. Generally distributed time to failure, as well as the service rate being an arbitrary function of time are allowed in [9]. It has been noted [22] that transient failures are partly caused by overload conditions. Only the model presented in [9] captures the effect of load on aging. Existing models also differ in the measures being evaluated. In [8, 23] software with a finite mission time is considered. In the [17, 6, 7, 9] measures of ....
M. Sullivan and R. Chillarege. Software Defects and Their Impact on System Availability - A Study of Field Failures in Operating Systems. In Proc. 21st IEEE Intl. Symposium on Fault-Tolerant Computing, pages 2-9, 1991.
....faults in the field is of utmost importance for our work, as this is just the kind of faults we want to emulate by fault injection. The software dependability of Tandem systems are studied in [3, 4] and the impact of software defects on the availability of a large IBM system is presented in [11]. An important contribution to promote the collection and study of observed faults is the Orthogonal Defect Classification (ODC) [12]. ODC is a classification schema for software faults (i.e. defects) in which defects are classified into non overlapping attributes and used as a source of ....
M. Sullivan and R. Chillarege, "Software defects and their impact on systems availability -- A study of field failures on operating systems", Proc. 21st Fault Tolerant Comp. Symp., FTCS-21, pp. 2-9, Jun. 1991.
....systems. This study showed that transients follow a Weibull distribution rather than occur at a constant rate as frequently assumed. Gray [3] presented results from a census of Tandem systems. Chillarege [2] presented a study of the impact of failures on customers and the fault lifetimes. Sullivan [18], [19] examined software defects occurring in operating systems and databases (based on field data). An in depth overview of experimental and analytical techniques for analysis of computer systems dependability can be found in [8]. 3 Field Data and Analysis Parameters This section presents data ....
M.S. Sullivan and R. Chillarege, "Software Defects and Their Impact on System Availability --- A Study of Field Failures in Operating Systems," Proc. 21st Int. Symp. Fault-Tolerant Computing, pp. 2-9, June 1991.
....Iyer [15] looked at 200 memory dumps of field software failures in the Tandem GUARDIAN 90 operating system collected over 1 year. They focused on the effectiveness of fault detection and recovery, and classifying errors by type (uninitialized variables, race conditions). Sullivan and Chillarege [23] examined MVS operating system failures, classifying error causes and manifestations. They randomly sampled 250 reports (out of a population of 3000) gathered over a five year period. Their main focus was on measuring errors caused by memory corruption versus everything else. They found that the ....
M. Sullivan and R. Chillarege. Software Defects and Their Impact on System Availability -- A Study of Field Failures in Operating Systems. In 21st International Symposium on Fault Tolerant Computing, June 1991.
....Iyer [15] looked at 200 memory dumps of field software failures in the Tandem GUARDIAN 90 operating system collected over 1 year. They focused on the effectiveness of fault detection and recovery, and classifying errors by type (uninitialized variables, race conditions). Sullivan and Chillarege [23] examined MVS operating system failures, classifying error causes and manifestations. They randomly sampled 250 reports (out of a population of 3000) gathered over a five year period. Their main focus was on measuring errors caused by memory corruption versus everything else. They found that the ....
M. Sullivan and R. Chillarege. Software Defects and Their Impact on System Availability -- A Study of Field Failures in Operating Systems. In 21st International Symposium on Fault Tolerant Computing, June 1991.
....Iyer [15] looked at 200 memory dumps of field software failures in the Tandem GUARDIAN 90 operating system collected over 1 year. They focused on the effectiveness of fault detection and recovery, and classifying errors by type (uninitialized variables, race conditions). Sullivan and Chillarege [23] examined MVS operating system failures, classifying error causes and manifestations. They randomly sampled 250 reports (out of a population of 3000) gathered over a five year period. Their main focus was on measuring errors caused by memory corruption versus everything else. They found that the ....
M. Sullivan and R. Chillarege. Software Defects and Their Impact on System Availability -- A Study of Field Failures in Operating Systems. In 21st International Symposium on Fault Tolerant Computing, June 1991.
....and can capture a multitude of cluster system characteristics, failure behavior, and performability measures, which we are just beginning to explore. 1. Introduction Software aging Unplanned computer system outages are more likely to be the result of software failures than of hardware failures [1, 2]. Moreover, software often exhibits an increasing failure rate over time, typically because of increasing and unbounded resource consumption, data corruption, and numerical error accumulation. This constitutes a ....
....we use an Erlang approximation. The models proposed in [7, 22, 23] can accommodate general distributions, but only for the specific aging effect they capture. Generally distributed time to failure, and the service rate being an arbitrary function of time, are allowed in [24]. It has been noted [2] that transient failures are partly caused by overload conditions, but only the model presented in [24] captures the effect of load on aging. In [25] an availability model of a two node cluster is described. Different failure scenarios are modeled, and the availability is analyzed. While that ....
M. Sullivan and R. Chillarege, "Software Defects and Their Impact on System Availability---A Study of Field Failures in Operating Systems," Proceedings of the 21st IEEE International Symposium on Fault-Tolerant Computing, 1991, pp. 2--9.
No context found.
M. Sullivan and R. Chillarege. "Software Defects and their Impact on Systems Availability: A Study of Field Failures in Operating Systems," FTCS 1991: 2-9.
No context found.
M. Sullivan and R. Chillarege, Software Defects and their Impact on System Availability - A Study of Field Failures in Operating Systems, Proceedings of International Symposium on Fault-Tolerant Computing (1991).
No context found.
M. Sullivan and R. Chillarege, Software Defects and their Impact on System Availability - A Study of Field Failures in Operating Systems, Proceedings of International Symposium on Fault-Tolerant Computing (1991).
No context found.
M. Sullivan and R. Chillarege. Software defects and their impact on system availability -- a study of field failures in operating systems. In Proc. 21st International Symposium on Fault-Tolerant Computing, Montreal, Canada, 1991.
No context found.
M. Sullivan and R. Chillarege. Software Defects and their Impact on System Availability - A Study of Field Failures in Operating Systems. In Proceedings of 21st International Symposium on Fault-Tolerant Computing, June 1991.
No context found.
M. Sullivan and R. Chillarege, "Software defects and their impact on system availability - a study of field failure in operating systems", Proc. of Fault tolerant Computing Symp., FTCS-21, June 1991, pp. 2-9.
No context found.
Mark Sullivan, Ram Chillarege. Software Defects and their Impact on System Availability - A Study of Field Failures in Operating Systems. 21st International Symposium on Fault-Tolerant Computing (FTCS-21), 1991.
No context found.
M. Sullivan and R. Chillarege, "Software defects and their impact on systems availability -- A study of field failures on operating systems", Proceedings of the 21st IEEE Fault Tolerant Computing Symposium, FTCS-21, pp. 2-9, June 1991.
No context found.
M. Sullivan and R. Chillarege. Software defects and their impact on system availability -- a study of field failures in operating systems. In Proc. 21st International Symposium on Fault-Tolerant Computing, Montreal, Canada, 1991.
No context found.
M. Sullivan and R. Chillarege, "Software defects and their impact on system availability - a study of field failures in operating systems," 21st Int. Symp. on Fault-Tolerant Computing (FTCS-21), pp. 2--9, 1991.
No context found.
Mark Sullivan and Ram Chillarege. Software defects and their impact on system availability -- a study of field failures in operating systems. In Digest of the 21st International Symposium on Fault Tolerant Computing, pages 2--9, June 1991.