25 citations found.
M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer. Failure data analysis of a LAN of Windows NT based computers. In SRDS-18, 1999.

This paper is cited in the following contexts:
Evaluating the Impact of Communication.. - Nagaraja.. (2003)   (Correct)

....the performability of the different versions of PRESS. Recall that to make the modeling tractable, we assume that faults in different components are not correlated and all fault arrivals are exponentially distributed. We have done our best to derive meaningful parameters from the available data [11, 12, 15, 21, 35, 34, 36]. However, data is sparse, particularly for application level errors. Thus, we examine performability for a range, once per day to once per month, of MTTFs for application level faults. In addition, because we have multiple classes of errors, we divided the application fault rate between these ....

M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer. Failure Data Analysis of a LAN of Windows NT Based Computers. In Proceedings of the 18th Symposium on Reliable and Distributed Systems (SRDS '99), 1999.


Compiler-directed Program-fault Coverage for.. - Fu, Martin.. (2003)   (1 citation)  (Correct)

....code [18]. We focus on analyzing the ability of software to handle hardware and operating system faults; we leave the testing of functionality vs. requirements to traditional testing techniques. We concentrate on I/O hardware faults since they are much more common than CPU or memory faults [31, 19]. We also focus on resource exhaustion faults and faults due to corruption of operating system data structures by bugs in the operating system. Our approach can be applied to software components as well as entire programs. We use compiler analyses to identify code blocks that are vulnerable to ....

M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer. Failure Data Analysis of a LAN of Windows NT Based Computers. In Proceedings of the 18th Symposium on Reliable and Distributed Systems (SRDS '99), 1999.


Using Fault Injection and Modeling to Evaluate the .. - Nagaraja, Li.. (2003)   (2 citations)  (Correct)

....rutgers.edu Research mendosus . Table 5 provides a flavor of this data, listing the throughput and duration of each phase of our 7 stage model for VIA PRESS for two types of faults. The MTTFs and MTTRs shown in Table 4 were chosen based on previously reported faults and fault rates [13, 16, 32]. Note that we do not model all the faults that we can inject because there are no reliable statistics for some of them, e.g. application hangs. Finally, our environmental assumptions are that operator response time for stage E is 5 minutes and cluster reset time for stage F is 5 minutes. Recall ....

....these works do not provide a good understanding of how one would estimate overall system availability under a given fault load. There has also been a large number of system availability studies. Two approaches that are used most often include empirical measurements of actual fault rates [3, 13, 20, 16, 23] and a rich set of stochastic process models that describe system dependencies, fault likelihoods over time, and performance [10, 21, 30]. Compared to these complex stochastic models, our models are much simpler, and thus more accessible to practitioners. This stems from our more limited goal of ....

M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer. Failure Data Analysis of a LAN of Windows NT Based Computers. In Proceedings of the 18th Symposium on Reliable and Distributed Systems (SRDS '99), 1999.


Why do Internet services fail, and what can be done.. - Oppenheimer, Ganapathi, .. (2003)   (9 citations)  (Correct)

.... SunOS workstations but divided problem root cause coarsely into network, non-disk machine problems, and disk-related machine problems [21]. Kalyanakrishnam studied six months of event logs from a LAN of Windows NT workstations used for mail delivery, to determine the causes of machines rebooting [9]. He found that most problems were software-related, and that average downtime was two hours. In a closely related study, Xu examined a network of Windows NT workstations used for 24x7 enterprise infrastructure services, again by studying the Windows NT event log entries related to system reboots ....

M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer. Failure data analysis of a LAN of Windows NT based computers. 18th IEEE Symposium on Reliable Distributed Systems, 1999.


Quantifying and Improving the Availability of.. - Nagaraja.. (2003)   (Correct)

.... [Table 1 fragment: Application hang, 2 months, 3 minutes; Front end failure, 6 months, 3 minutes. Table 1: Failures and their MTTFs and MTTRs. Application hang and crash together represent an MTTF of 1 month for application failures.] ....rates from previous works which empirically observed the fault rates of many systems [2, 16, 23, 18, 24]. We use Mendosus [21] to inject the expected fault load. Mendosus's network emulation system allows us to differentiate between intra-cluster communication and client-server communication when injecting network-related faults. Thus, the clients are never disturbed by faults injected into the ....

....of this body of work is beyond the scope of this paper. Instead, we concentrate on efforts that have focused on improving the availability of cluster based services. Of course, work analyzing how faults impact systems [14, 19, 31, 32] as well as empirical measurement of actual fault rates [2, 16, 23, 18, 24], are necessary background for a model based quantification effort such as ours. Our methodology and infrastructure seem to be the first directed to quantifying the availability impact of a range of techniques as applied to cluster based services. One of the first works on the subject [13] argued ....

M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer. Failure Data Analysis of a LAN of Windows NT Based Computers. In Proceedings of the 18th Symposium on Reliable and Distributed Systems (SRDS '99), 1999.


Evaluating the Impact of Communication.. - Nagaraja.. (2002)   (Correct)

....the performability of the different versions of PRESS. Recall that to make the modeling tractable, we assume that faults in different components are not correlated and all fault arrivals are exponentially distributed. We have done our best to derive meaningful parameters from the available data [11, 12, 15, 21, 35, 34, 36]. However, data is sparse, particularly for application level errors. Thus, we examine performability for a range, once per day to once per month, of MTTFs for application level faults. In addition, because we have multiple classes of errors, we divided the application fault rate between these ....

M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer. Failure Data Analysis of a LAN of Windows NT Based Computers. In Proceedings of the 18th Symposium on Reliable and Distributed Systems (SRDS '99), 1999.


Using Fault Model Enforcement to Improve Availability - Nagaraja, Bianchini.. (2002)   (4 citations)  (Correct)

....one must further abstract away from reality. To reason about a real system, we usually model all components as fail stop. We also hope that failure rates and recovery occur with exponential distributions, even though there is strong empirical evidence against this, at least for workstations [17] [18]. III. FAULT MODEL ENFORCEMENT Given the previous description of complex computer systems, creating reasonably accurate abstractions of them seems to be an impossibly complex task. There are too many different subsystems, none of which any one person fully understands, connected by bewildering ....

....of the different versions of PRESS. Recall that to make the modeling tractable, we assume that faults in different components are not correlated and all fault arrivals are exponentially distributed. We have done our best to derive meaningful parameters from the available data [25], [26], [27], [18], [12], [28], [17]. A duration of 5 minutes was assumed for the operator intervention stage E and restart stage F. B. Evaluation Metrics Our model computes two metrics to evaluate each server. The first is the unavailability, which is the average fraction of requests dropped. We use ....

M. Kalyanakrishnam, Zbigniew Kalbarczyk, and Ravishankar Iyer, "Failure Data Analysis of a LAN of Windows NT Based Computers," in Proceedings of the 18th Symposium on Reliable and Distributed Systems (SRDS '99), 1999.


Evaluating the Impact of Communication.. - Nagaraja.. (2003)   (Correct)

....the performability of the different versions of PRESS. Recall that to make the modeling tractable, we assume that faults in different components are not correlated and all fault arrivals are exponentially distributed. We have done our best to derive meaningful parameters from the available data [41, 13, 16, 27, 42, 43, 20]. However, data is sparse, particularly for application level errors. Thus, we examine performability for a range, once per day to once per month, of MTTFs for application level faults. In addition, because we have multiple classes of errors, we divided the fault rate between these errors ....

M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer. Failure Data Analysis of a LAN of Windows NT Based Computers. In Proceedings of the 18th Symposium on Reliable and Distributed Systems (SRDS '99), 1999.


Using Fault Injection to Evaluate the.. - Nagaraja, Li.. (2003)   (3 citations)  (Correct)

....rutgers.edu Research mendosus . Table 5 provides a flavor of this data, listing the throughput and duration of each phase of our 7 phase model for VIA PRESS for two types of faults. The MTTFs and MTTRs shown in Table 6 were chosen based on previously reported faults and fault rates [14, 16, 28]. Note that we do not model all the faults that we can inject because there are no reli.... [Table fragment: Phase; Switch Failure: Throughput (reqs/sec), Duration (secs); Application Crash: Throughput (reqs/sec), Duration (secs). Values as extracted: A 892.40 75 1889.10 10; B 0 3143.55 145; C 1106.70 3525 4537.60 25; D 0 4789.13 45; E ....]

M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer. Failure Data Analysis of a LAN of Windows NT Based Computers. In Proceedings of the 18th Symposium on Reliable and Distributed Systems (SRDS '99), 1999.


On Correlated Failures in Survivable Storage Systems - Bakkaloglu, Wylie, Wang.. (2002)   (7 citations)  (Correct)

....since the correlation level indicates no correlation the accuracy of the model will be low. 4.5 Related Availability Modeling Work In the literature, there are several approaches to modeling the availability of a single machine. Most of these studies are based on Markov Chain models [Garg1999][Kalyanakrishnam1999]. For example, in [Kalyanakrishnam1999] each state represents a level of functionality of the machine, such as reboot , connectivity problems , adapter problems , disk problems , shutdown etc. and the weight of transitions between states is determined empirically. There are similar ....

....no correlation the accuracy of the model will be low. 4.5 Related Availability Modeling Work In the literature, there are several approaches to modeling the availability of a single machine. Most of these studies are based on Markov Chain models [Garg1999][Kalyanakrishnam1999]. For example, in [Kalyanakrishnam1999], each state represents a level of functionality of the machine, such as reboot , connectivity problems , adapter problems , disk problems , shutdown etc. and the weight of transitions between states is determined empirically. There are similar state based modeling studies for multiple ....

[Article contains additional citation context not shown here]

M. Kalyanakrishnam, Z. Kalbarczyk, R. Iyer "Failure Data Analysis of LAN of Windows NT Based Computers", Proc. of 18th Symposium on Reliable and Distributed Systems, SRDS '99, Lausanne, Switzerland, pp.178-187, 1999


Improving Cluster Availability Using Workstation Validation - Heath, Martin, Nguyen (2002)   (2 citations)  (Correct)

....However, they did not extend the result to workstation clusters in an Internet Service setting. Also, even though they found Weibull shape parameters of less than 1, they did not use this information to propose any validation-based strategies for masking failures. Another study by the same group [16] produced a detailed state machine model of a workstation node, and thus requires fairly detailed knowledge of Windows NT. Another recent work examined the failure of components of an online image service [1, 21]. That work focused more on the failure rates of the node components rather than on ....

M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer. Failure Data Analysis of LAN of Windows NT Based Computers. In 18th Symposium on Reliable and Distributed Systems, SRDS '99, pages 178--187, 1999.


The Shape of Failure - Heath, Martin, Nguyen (2001)   (1 citation)  (Correct)

....Work This short investigation does not include an exhaustive list of related work. Rather, in this section we present recent work most closely related to our own. Perhaps the closest work related to this study is a recent work characterizing the behavior of Microsoft Windows NT machines [3]. That study produced a detailed state machine model of a workstation node, and thus requires fairly detailed knowledge of Windows NT. Another recent work examined the failure of components of an online image service [1]. That study focused more on the hardware components rather than the entire ....

M. Kalyanakrishnam, Z. Kalbarczyk, R. Iyer. Failure Data Analysis of LAN of Windows NT Based Computers. In 18th Symposium on Reliable and Distributed Systems, SRDS '99 (1999), pp. 178--187.


Failure analysis of an ORB in presence of faults - Marsden, Fabre (2001)   (Correct)

....and measures. There are two main measurement-based approaches to obtaining information on a system's dependability: the observation of a large set of systems in operation, as in [Kalyanakrishnam et al. 1999]. This approach relies on error information obtained either from logs maintained by system administrators or from automatic monitoring mechanisms provided by the system. By analysing the data, one can obtain information on the nature and frequency of failures, and on the type of usage that led to ....

M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer. Failure data analysis of a LAN of Windows NT based computers. In Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems (SRDS '99), pages 178--189, Washington - Brussels - Tokyo, October 1999. IEEE.


The Shape of Failure - Heath, Martin, Nguyen (2001)   (1 citation)  (Correct)

....Work This short investigation does not include an exhaustive list of related work. Rather, in this section we present recent work most closely related to our own. Perhaps the closest work related to this study is a recent work characterizing the behavior of Microsoft Windows NT machines [3]. That study produced a detailed state machine model of a workstation node, and thus requires fairly detailed knowledge of Windows NT. Another recent work examined the failure of components of an on line image service [1]. That study focused more on the hardware components rather than the entire ....

M. Kalyanakrishnam, Z. Kalbarczyk, R. Iyer. Failure Data Analysis of LAN of Windows NT Based Computers. In 18th Symposium on Reliable and Distributed Systems, SRDS '99 (1999), pp. 178-187.


The Shape of Failure - Taliver Heath Richard (2001)   (1 citation)  (Correct)

....Work This short investigation does not include an exhaustive list of related work. Rather, in this section we present recent work most closely related to our own. Perhaps the closest work related to this study is a recent work characterizing the behavior of Microsoft Windows NT machines [3]. That study produced a detailed state machine model of a workstation node, and thus requires fairly detailed knowledge of Windows NT. Another recent work examined the failure of components of an on line image service [1]. That study focused more on the hardware components than on the entire ....

M. Kalyanakrishnam, Z. Kalbarczyk, R. Iyer. Failure Data Analysis of LAN of Windows NT Based Computers. In 18th Symposium on Reliable and Distributed Systems, SRDS '99 (1999), pp. 178-187.


The Evolution of Dependable Computing at the.. - Iyer, Sanders, Patel, ..   Self-citation (Kalbarczyk Iyer)   (Correct)

No context found.

M. Kalyanakrishnam, Z. Kalbarczyk, R. Iyer, "Failure Data Analysis of LAN of Windows NT Based Computers," Proc. of 18th Symp. on Reliable and Distributed Systems, SRDS `99, Lausanne, Switzerland, 1999, pp. 178-187.


A Large-Scale Study of Failures in High-Performance.. - Bianca Schroeder Garth   (Correct)

No context found.

M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer. Failure data analysis of a LAN of Windows NT based computers. In SRDS-18, 1999.


State Maintenance and its Impact on the.. - Gama, Nagaraja..   (Correct)

No context found.

M. Kalyanakrishnam, Zbigniew Kalbarczyk, and Ravishankar Iyer. Failure Data Analysis of a LAN of Windows NT Based Computers. In Proceedings of the 18th Symposium on Reliable and Distributed Systems (SRDS '99), October 1999.


A Dependability Benchmark for OLTP - Application Environments Marco (2003)   (Correct)

No context found.

M. Kalyanakrishnam, Z. Kalbarczyk, R. Iyer, "Failure Data Analysis of a LAN of Windows NT Based Computers", Symposium on Reliable Distributed Database Systems, SRDS18, October, Switzerland, pp. 178-187, 1999.


Assessing the Dependability of OGSA Middleware by Fault Injection - Looker, Xu (2003)   (2 citations)  (Correct)

No context found.

M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer, "Failure Data Analysis of a LAN of Windows NT based Computers," presented at Reliable distributed systems, Lausanne, Switzerland, 1999.


Dependability of CORBA Systems: Service Characterization - Marsden, Fabre, Arlat (2002)   (1 citation)  (Correct)

No context found.

M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer. Failure data analysis of a LAN of Windows NT based computers. In Proc. of the Symposium on Reliable Distributed Systems (SRDS'99), pages 178--189, Washington - Brussels - Tokyo, Oct. 1999. IEEE.


Characterization Approaches for CORBA Systems by Fault.. - Marsden, Fabre, Arlat (2002)   (Correct)

No context found.

M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer, "Failure data analysis of a LAN of Windows NT based computers," in Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems (SRDS '99), Washington - Brussels - Tokyo, Oct. 1999, pp. 178--189, IEEE.


Failure Mode Analysis of CORBA Service Implementations - Marsden, Fabre (2001)   (Correct)

No context found.

M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer, Failure data analysis of a LAN of Windows NT based computers, in Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems (SRDS '99), (Washington - Brussels - Tokyo), pp. 178--189, IEEE, Oct. 1999.

CiteSeer.IST - Copyright Penn State and NEC