J. Gray. "A Census of Tandem System Availability between 1985.
....patterns and no deliberate attacks. Some techniques for evaluating calm day workloads are relatively well understood. Request traces have long been used to benchmark systems. And a number of studies have quantified environmental factors such as hardware, maintenance, and environmental failures [13], Internet failures [18, 15, 9, 1] and Internet performance variability [29]. Several recent studies have used faultloads derived from such studies to examine end-to-end service availability [9, 28]. To deepen system understanding under calm day scenarios, additional research is needed. On the ....
.... studies suggest that Internet routing interruption durations are heavy-tailed, meaning that long interruptions are rare but account for a significant fraction of overall interruption time [15, 9]. Other examples of rare but stressful events that should be considered include power outages [13], system upgrades and maintenance [13, 25, 4] and internal hardware and software failures. Deliberate black box attacks (RS2): These are scenarios where an external adversary deliberately attempts to impede service delivery by accessing the service by sending either a large number of normal ....
J. Gray. A Census of Tandem System Availability Between 1985.
....appropriate and so have proposed many alternative messaging based protocols, e.g. [28, 33]. The key difference from the MPP and SAN networks is that like TCP, these protocols viewed packet loss as signaling congestion. There has been extensive work in analysing faults and how they impact systems [13, 22, 34]. However, the focus of these studies was not on the communication system. Studies benchmarking system behavior under fault loads include [20, 24]. However, these works do not provide a good understanding of how one would estimate overall system availability under a given fault load. System ....
J. Gray. A Census of Tandem System Availability Between 1985.
....main group to cause the automatic reboot of that node. While this is an extreme example of FME, it does improve the availability of PRESS substantially, as well as reduces the need for operator coverage. 8 Related Work There has been extensive work in analyzing faults and how they impact systems [11, 31, 17]. Studies benchmarking system behavior under fault loads include [15, 19]. Unfortunately, these works do not provide a good understanding of how one would estimate overall system availability under a given fault load. There has also been a large number of system availability studies. Two ....
J. Gray. A Census of Tandem System Availability Between 1985.
....real world programming errors can manifest themselves as some of the faults we inject in our experiments. None of the errors shown above would be caught during compilation. We concentrate on software faults because studies have shown that software has become the dominant cause of system outages [23, 24]. We classify injected faults into three categories: bit flips, low level software faults, and high level software faults. Unless otherwise stated, we inject 10 faults for each run to increase the chances that a fault will be triggered. Most crashes occurred within 10 seconds from the time ....
....alternate paths to tolerate a wide variety of hardware failures [1]. Finally, many papers have examined the performance advantages and uses of reliable memory [17, 6, 12, 2, 45, 46]. 7.2 Field Studies of Failures Studies have shown that software is the dominant cause of system outages [23, 24] and several studies have investigated system software errors. Sullivan and Chillarege classify software faults in the MVS operating system; in particular, they analyze faults that corrupt program memory (overlays) [65]. Lee and Iyer study and classify software failures in Tandem's Guardian ....
J. Gray, "A Census of Tandem System Availability between 1985.
....code and the DBMS is an inescapable trend in the evolution of database management systems. Finally, error trends lead to a greater concern for software errors in general. A study of field errors in Tandem computers reflects the industry wide trend of improvement in hardware reliability [33]. However, this increase in reliability is not matched in software. [Footnote: Stored procedures also represent close coupling, but are usually specified in a restricted language.] The result is that between 1985 and 1989, the percentage of errors attributed to software rose from 33% to 62%, even while the ....
....protection. We then review some other techniques from the fault tolerance literature which address similar problems. We finish by presenting the related work on recovery from logical corruption in database systems. 2.3.1 Failure Studies The results of a study of field errors in Tandem computers [33] was reviewed in the Introduction, and showed that the percentage of errors attributable to software rose from 33% to 62% between 1985 and 1989. In his thesis, Sullivan reviews field errors recorded by IBM in internal databases for the MVS operating system and two database products, IMS and DB2 ....
J. Gray. A census of Tandem system availability between 1985.
....before they fail during normal operation. Online testing will catch those failures that are unlikely to be created in a test situation, for example those that are scale or configuration dependent. redundancy: replicating data, computational functionality, and/or networking functionality [5]. Using sufficient redundancy often prevents component failures from turning into service failures. fault injection and load testing: testing error handling code and system response to overload by artificially introducing failure and overload, before deployment or in the production system [18] ....
....tools to check that low level (e.g. per component) configuration files meet constraints expressed in terms of the desired high level service behavior [13]. Such tools could prevent faulty configurations in deployed systems. component isolation: increasing isolation between software components [5]. Isolation can prevent a component failure from turning into a service failure by preventing cascading failures. proactive restart: periodic prophylactic rebooting of hardware and restarting of software [7]. This can prevent faulty components with latent errors due to resource leaks from ....
J. Gray. A census of Tandem system availability between 1985.
....have studied the problem of fault tolerance extensively. A full treatment of this body of work is beyond the scope of this paper. Instead, we concentrate on efforts that have focused on improving the availability of cluster based services. Of course, work analyzing how faults impact systems [14, 19, 31, 32], as well as empirical measurement of actual fault rates [2, 16, 23, 18, 24] are necessary background for a model based quantification effort such as ours. Our methodology and infrastructure seem to be the first directed to quantifying the availability impact of a range of techniques as applied ....
J. Gray. A Census of Tandem System Availability Between 1985.
....appropriate and so have proposed many alternative messaging based protocols, e.g. [28, 33]. The key difference from the MPP and SAN networks is that like TCP, these protocols viewed packet loss as signaling congestion. There has been extensive work in analysing faults and how they impact systems [13, 22, 34]. However, the focus of these studies was not on the communication system. Studies benchmarking system behavior under fault loads include [20, 24]. However, these works do not provide a good understanding of how one would estimate overall system availability under a given fault load. System ....
J. Gray. A Census of Tandem System Availability Between 1985.
....distinguish two types of failures: temporary failures and permanent failures. They are currently distinguished simply by their duration: a crash becomes permanent when a node is suspected to have failed continuously for more than two weeks. Given that the vast majority of failures are temporary [11, 3], we set two different goals. For temporary failures, we try to reduce the recovery cost. For permanent failures, we try to clean all data structures associated with the failed node so that the system runs as if the node had never existed in the first place. 6.1 Recovering from temporary ....
Jim Gray. A census of Tandem system availability between 1985.
....alternative messaging based protocols. A few of the originals include [12, 36, 40]. The key difference from the MPP and SAN networks models is that these protocols, like TCP, viewed packet loss to signal congestion. There has been extensive work in analysing faults and how they impact systems [17, 41, 28]. However, the focus of these studies was not on the communication system. Studies benchmarking system behavior under fault loads include [25, 30]. However, these works do not provide a good understanding of how one would estimate overall system availability under a given fault load. System ....
J. Gray. A Census of Tandem System Availability Between 1985.
....recently started a routing registry consistency project which aims to improve the consistency of the registry information [36]. 8. RELATED WORK There have been numerous other studies of faults in computer systems. Notable among these are Gray's studies on failures with Tandem's computer system [15, 16], in which he discovered that the biggest causes of outages were software bugs (62%) followed by operations (15%). A decade ago, Danzig et al. analysed DNS traces and discovered that a large fraction of traffic was due to bugs and poor implementation choices [10] and a more recent study confirms ....
J. Gray. A Census of Tandem System Availability Between 1985.
....workstation node increases in availability with time. The discrepancy between software failure and node reboots leads to a search for other causes. Studies of Tandem machines have suggested that in addition to hardware and software, the human operators also play a key role in generating faults [11, 12]. However, the context of these works was limited to specialized machines in tightly controlled environments. Indeed, one paper [4] argues administrator and operator errors will soon become the dominant factor in component failures. More recent work in the Windows NT context validates these ....
....predicting component failures using time as the predictor. However, our data shows that the assumed direction of the prediction behind rejuvenation techniques, that components decay over time, may be flawed, at least with respect to machines viewed in their entirety. Given our results and those in [12, 17, 25], it seems most probable that the countless number of event types that can lead to workstation failures, far outweigh the effect of individual component decay over time. However, our results did not cover timescales on the order of many years. It may be the case that if we extended our ....
J. Gray. A Census of Tandem System Availability Between 1985.
....of two months, elicited inquiries from only six system administrators. failure, but at the expense of a small bias in favor of highly available hosts. Analyses of MTTF and the causes of failure have usually been confined to specific systems. Recent studies include analyses of Tandem systems [4, 5] and the IBM XA system [8]. Research covering heterogeneous systems is less common. Few such studies have appeared in the open literature, although it is certain that most companies perform reliability studies of their products internally. The difficulty in assembling sufficient data and applying ....
J. Gray, "A census of Tandem system availability between 1985.
....fault injection technologies can emulate hardware faults, either transient or permanent. However, the emulation of software faults is still a rather obscure step. Software faults are recognized as the major cause of system outages. Existing studies show a clear predominance of software faults [3, 4], and given the huge complexity of today's software the weight of software faults tends to increase, which makes clear the relevance of extending the fault injection technologies to the 1 Work partially supported by FCT, Praxis XXI, contract No. 2 2.1 TIT 1570 95 and grant No. BD 5636 95. ....
....studies available for the development phase. Nevertheless, the study of the effects of actual software faults in the field is of utmost importance for our work, as this is just the kind of faults we want to emulate by fault injection. The software dependability of Tandem systems are studied in [3, 4] and the impact of software defects on the availability of a large IBM system is presented in [11]. An important contribution to promote the collection and study of observed faults is the Orthogonal Defect Classification (ODC) [12]. ODC is a classification schema for software faults (i.e. ....
J. Gray, "A Census of Tandem Systems Availability Between
....their detection. Lin [12] and Tsao [23] focused on trend analysis in error logs. McConnel [17] presented results of an analysis of transient errors in computer systems. This study showed that transients follow a Weibull distribution rather than occur at a constant rate as frequently assumed. Gray [3] presented results from a census of Tandem systems. Chillarege [2] presented a study of the impact of failures on customers and the fault lifetimes. Sullivan [18, 19] examined software defects occurring in operating systems and databases (based on field data). An in-depth overview of ....
J. Gray, "A Census of Tandem System Availability between
No context found.
J. Gray. "A Census of Tandem System Availability between 1985.
No context found.
J. Gray, "A census of tandem system availability between 1985.
No context found.
J. Gray, A Census of Tandem System Availability between 1985.
No context found.
J. Gray. A census of tandem system availability between 1985.
No context found.
J. Gray. A census of tandem system availability between 1985.
No context found.
J. Gray, A Census of Tandem System Availability between 1985.
No context found.
J.N. Gray. A Census of Tandem System Availability between 1985.
No context found.
J. Gray, "A Census of Tandem Systems Availability Between 1985.
No context found.
J. Gray, "A Census of Tandem System Availability between
No context found.
J. Gray, "A census of Tandem system availability between