Capturing, indexing, clustering, and retrieving system history
In SOSP, 2005
Cited by 65 (5 self)
Keywords: system performance, Bayesian networks, information retrieval, problem signatures. We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristics of a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously observed state. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different installations of the same site are similar or distinct. We show that the naive approach to constructing these signatures based on simply recording the actual "raw" values of collected measurements is ...
Automated Known Problem Diagnosis with Event Traces
In EuroSys, 2006
Cited by 36 (5 self)
Computer problem diagnosis remains a serious challenge to users and support professionals. Traditional troubleshooting methods rely heavily on human intervention, which makes the process inefficient and the results inaccurate even for solved problems and contributes significantly to user dissatisfaction. We propose to use system behavior information such as system event traces to build correlations with solved problems, instead of using only vague text descriptions as in existing practice. The goal is to enable automatic identification of the root cause of a problem if it is a known one, which would in turn lead to its resolution. By applying statistical learning techniques to classify system call sequences, we show in four case examples that our approach can recognize root causes with considerable accuracy.
Shadow configuration as a network management primitive
In SIGCOMM, 2008
Cited by 12 (0 self)
Configurations for today’s IP networks are becoming increasingly complex. As a result, configuration management is becoming a major cost factor for network providers, and configuration errors are becoming a major cause of network disruptions. In this paper, we present and evaluate the novel idea of shadow configurations. Shadow configurations allow a configuration to be evaluated before deployment and thus can reduce potential network disruptions. We demonstrate using a real implementation that shadow configurations can be implemented with low overhead.
Troubleshooting Chronic Conditions in Large IP Networks
2008
Cited by 9 (7 self)
Chronic network conditions are caused by performance-impairing events that occur intermittently over an extended period of time. Such conditions can cause repeated performance degradation for customers, and can sometimes even turn into serious hard failures. It is therefore critical to troubleshoot and repair chronic network conditions in a timely fashion in order to ensure high reliability and performance in large IP networks. Today, troubleshooting chronic conditions is often performed manually, making it a tedious, time-consuming, and error-prone process. In this paper, we present NICE (Network-wide Information Correlation and Exploration), a novel infrastructure that enables the troubleshooting of chronic network conditions by detecting and analyzing statistical correlations across multiple data sources. NICE uses a novel circular permutation test to determine the statistical significance of a correlation. It also allows flexible analysis at various spatial granularities (e.g., link, router, or network level). We validate NICE using real measurement data collected at a tier-1 ISP network. The results are quite positive. We then apply NICE to troubleshoot real network issues in the tier-1 ISP network. In all three case studies conducted so far, NICE successfully uncovers previously unknown chronic network conditions, resulting in improved network operations.
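The abstract above names a circular permutation test but does not describe it. As a rough illustration only, the following Python sketch shows one generic way such a test can be set up for two equal-length event series: one series is rotated against the other, and the unshifted correlation is compared with the correlations obtained at every other rotation. The function names, the use of Pearson correlation, and the p-value formula are assumptions made for this sketch; they are not taken from NICE.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def circular_permutation_test(x, y):
    """Hypothetical helper: return (observed correlation, one-sided p-value).

    The p-value is the fraction of circular shifts of y, counting the
    unshifted series itself, whose correlation with x is at least as
    large as the observed one.
    """
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("series must have equal length of at least 3")
    observed = pearson(x, y)
    n = len(y)
    # Rotating y keeps its internal structure (burstiness, autocorrelation)
    # while breaking its alignment with x.
    shifted = [pearson(x, y[k:] + y[:k]) for k in range(1, n)]
    as_extreme = sum(1 for r in shifted if r >= observed)
    p_value = (1 + as_extreme) / (1 + len(shifted))
    return observed, p_value


# Two identical binary event series with events at positions 2, 7, and 13.
events = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
observed, p_value = circular_permutation_test(events, list(events))
```

Rotation, rather than random shuffling, is used here because it preserves each series' own temporal pattern, so a small p-value points to alignment between the two series and not merely to both series being bursty.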
Active probing approach for fault localization in computer networks
In E2EMON’06, 2006
Cited by 5 (5 self)
Active probing is an active network monitoring technique that has potential for developing effective solutions for fault localization. In this paper we use active probing to present an approach to developing tools for fault localization. We discuss the design issues involved and propose an architecture for building such a tool. We describe an algorithm for probe set selection for problem detection and present simulation results to show its effectiveness. We demonstrate through analysis and experiments that active probing has the potential to greatly reduce the probe traffic and the fault diagnosis time. Keywords: active probing; fault localization; active monitoring; probe station selection; problem detection; problem determination.
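The abstract mentions an algorithm for probe set selection without giving details. As a hedged illustration of the general problem, the Python sketch below uses a plain greedy set-cover heuristic: it repeatedly picks the probe whose path covers the most nodes not yet covered. This is a standard heuristic substituted for illustration, not the algorithm from the paper, and the probe names, node names, and `select_probes` function are hypothetical.

```python
def select_probes(probe_paths, nodes):
    """Greedy set cover: choose probes until every node lies on a chosen probe path.

    probe_paths maps a probe name to the set of nodes its path traverses.
    Returns (chosen probe names in selection order, nodes no probe can cover).
    """
    uncovered = set(nodes)
    chosen = []
    while uncovered:
        # Iterating in sorted order makes tie-breaking deterministic:
        # among equally good probes, the alphabetically first one wins.
        best = max(sorted(probe_paths),
                   key=lambda p: len(uncovered & set(probe_paths[p])))
        gain = uncovered & set(probe_paths[best])
        if not gain:
            break  # the remaining nodes are not on any probe path
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Hypothetical topology: four candidate probes over five nodes.
paths = {
    "p1": {"a", "b", "c"},
    "p2": {"c", "d"},
    "p3": {"d", "e"},
    "p4": {"a"},
}
chosen, missed = select_probes(paths, {"a", "b", "c", "d", "e"})
```

In this example the heuristic picks `p1` first (three new nodes) and then `p3` (two new nodes), so two probes suffice to detect a failure at any of the five nodes.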
Principle Components and Importance Ranking of Distributed Anomalies
Machine Learning, 2004
Cited by 5 (0 self)
Correlations between locally averaged host observations, at different times and places, hint at information about the associations between the hosts in a network. These smoothed, pseudo-continuous time-series imply relationships with entities in the wider environment. For anomaly detection, mining this information might provide a valuable source of observational experience for determining comparative anomalies or rejecting false anomalies. The difficulties with distributed analysis lie in collating the distributed data and in comparing observables on different hosts, in different frames of reference. In the present work, we examine two methods (Principal Component Analysis and Eigenvector Centrality) that shed light on the usefulness of comparing data destined for different locations in a network.
Toward Optimal Network Fault Correction via End-to-End Inference
2006
Cited by 5 (0 self)
We consider an end-to-end approach of inferring network faults that manifest in multiple protocol layers, with an optimization goal of minimizing the expected cost of correcting all faulty nodes. Instead of first checking the most likely faulty nodes as in conventional fault localization problems, we prove that an optimal strategy should start with checking one of the candidate nodes, which are identified based on a potential function that we develop. We propose several efficient heuristics for inferring the best node to be checked in large-scale networks. By extensive simulation, we show that we can infer the best node in at least 95%, and that checking first the candidate nodes rather than the most likely faulty nodes can decrease the checking cost of correcting all faulty nodes by up to 25%. Index Terms: network management, network diagnosis and correction, fault localization and repair, reliability engineering.
Test-based diagnosis: Tree and matrix representations
In Proc. 9th IFIP/IEEE IM, 2005
Cited by 4 (0 self)
A common problem encountered in many application scenarios is how to represent some prior knowledge about a system in order to determine its true state as efficiently as possible. The information is typically in the form of tests, or questions about the system. Each test can potentially reduce our uncertainty about the system’s state. The problem is to represent the information capturing the dependence between tests, their outcomes, and possible states in an efficiently navigable way to aid diagnosis. The most common such representation is a flowchart with leaf nodes corresponding to possible states, and non-leaf nodes corresponding to tests about the state. The problem with flowcharts is that they are notoriously difficult to maintain. Additional knowledge often has to be manually integrated as the system changes, making it impossible to keep track of all possible decision paths, let alone optimize the flow to maximize performance. We propose an efficient method for optimizing an existing flowchart based on a conversion to an auxiliary matrix representation. The main goal of the paper is to show a synergy between the two representations in the hope that this will help practitioners choose a better strategy for their applications. We show that such a conversion suggests ways to improve both representations, ways that were not envisioned when using each representation alone. Finally, we show that the two representations are informationally equivalent in the sense that one can be transformed into the other so that, if both are used as black boxes, one would not be able to tell them apart, regardless of which state the system is in.
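To make the flowchart-to-matrix idea concrete, the Python sketch below converts a small diagnostic flowchart into a state-by-test table. It is an illustration under assumptions, not the paper's method: the tuple encoding of the flowchart, the `flowchart_to_matrix` function, and the example tests and states are all hypothetical. Each leaf (state) becomes a row, and the row records the outcome of every test on the path that leads to that leaf.

```python
def flowchart_to_matrix(node, path=None, matrix=None):
    """Flatten a diagnostic flowchart into {state: {test: outcome}}.

    A leaf is a string naming a system state. A non-leaf node is a tuple
    (test_name, {outcome: child_node}). Tests that are not on the path
    to a state are simply absent from that state's row.
    """
    path = {} if path is None else path
    matrix = {} if matrix is None else matrix
    if isinstance(node, str):
        # Leaf: record the test outcomes that lead to this state.
        matrix[node] = dict(path)
        return matrix
    test, branches = node
    for outcome, child in branches.items():
        flowchart_to_matrix(child, {**path, test: outcome}, matrix)
    return matrix


# Hypothetical two-test flowchart for a connectivity problem.
flowchart = ("ping", {
    "fail": "link down",
    "ok": ("dns", {"fail": "dns outage", "ok": "healthy"}),
})
matrix = flowchart_to_matrix(flowchart)
```

Once the knowledge is in this tabular form, tests can be compared or reordered by how well they separate the rows, which is awkward to do directly on the flowchart.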
Algorithm Design and Application of Service-Oriented Event Correlation
Cited by 1 (0 self)
The timely and efficient management of faults that affect the quality of services delivered to customers is an important issue for service providers with respect to their business goals. It includes the diagnosis of service faults, which deals with the localization of their root causes within the subservices and resources that are part of the service realization. In this paper, our service-oriented event correlation approach, which uses event correlation techniques to automate diagnosis on the service layer, is detailed. Our algorithm for the hybrid rule-based/case-based correlation methodology, which also includes recently proposed active probing techniques, is presented, as well as its prototypical implementation at the Leibniz Supercomputing Center. This implementation is not limited to a small test environment, but has been carried out for the requirements of the environment of this large service provider.
Localization of IP Links Faults Using Overlay Measurements
In Proceedings of IEEE ICC, 2008
Cited by 1 (1 self)
Accurate fault detection and localization is essential to the efficient and economical operation of ISP networks. In addition, it affects the performance of Internet applications such as VoIP and online gaming. Fault detection algorithms typically depend on spatial correlation to produce a set of fault hypotheses, the size of which grows with lost and spurious symptoms and with the overlap among network paths. The network administrator is left with the task of accurately locating and verifying these fault scenarios, which is tedious and time-consuming. In this paper, we formulate the problem of finding a set of overlay paths that can debug the set of suspected faulty IP links. These overlay paths are chosen from the set of existing measurement paths, which makes overlay measurements meaningful and useful for fault debugging. We study the overlap among overlay paths using various real-life Internet topologies of the two major service carriers in the U.S. We found that, with a reasonable number of concurrent failures, it is possible to identify the location of the IP link faults with a 60% to 95% success rate. Finally, we identify some interesting research problems in this area.

