Results 1 - 10 of 64
Anomaly Detection: A Survey
, 2007
Cited by 69 (1 self)
Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and more succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
In-network outlier detection in wireless sensor networks
- In ICDCS
, 2006
Cited by 14 (3 self)
To address the problem of unsupervised outlier detection in wireless sensor networks, we develop an algorithm that (1) is flexible with respect to the outlier definition, (2) works in-network with a communication load proportional to the outcome, and (3) reveals its outcome to all sensors. We examine the algorithm’s performance using simulation with real sensor data streams. Our results demonstrate that the algorithm is accurate and imposes a reasonable communication load and level of power consumption.
Bayesian Statistics
- in WWW', Computing Science and Statistics
, 1989
Cited by 13 (0 self)
This dissertation presents two topics from opposite disciplines: one is from a parametric realm and the other is based on nonparametric methods. The first topic is a jackknife maximum likelihood approach to statistical model selection and the second one is a convex hull peeling depth approach to nonparametric massive multivariate data analysis. The second topic includes simulations and applications on massive astronomical data. First, we present a model selection criterion, minimizing the Kullback-Leibler distance by using the jackknife method. Various model selection methods have been developed to choose a model of minimum Kullback-Leibler distance to the true model, such as the Akaike information criterion (AIC), Bayesian information criterion (BIC), minimum description length (MDL), and bootstrap information criterion. Likewise, the jackknife method chooses a model of minimum Kullback-Leibler distance through bias reduction. This bias, which is inevitable in model
Quantile Regression Forests
- JOURNAL OF MACHINE LEARNING RESEARCH
, 2006
Cited by 13 (0 self)
Random Forests were introduced as a Machine Learning tool in Breiman (2001) and have since proven to be very popular and powerful for high-dimensional regression and classification. For regression, Random Forests give an accurate approximation of the conditional mean of a response variable. It is shown here that Random Forests provide information about the full conditional distribution of the response variable, not only about the conditional mean. Conditional quantiles can be inferred with Quantile Regression Forests, a generalisation of Random Forests. Quantile Regression Forests give a non-parametric and accurate way of estimating conditional quantiles for high-dimensional predictor variables. The algorithm is shown to be consistent. Numerical examples suggest that the algorithm is competitive in terms of predictive power.
Enhancing Data Analysis with Noise Removal
Cited by 11 (4 self)
Removing objects that are noise is an important goal of data cleaning as noise hinders most types of data analysis. Most existing data cleaning methods focus on removing noise that is the result of low-level data errors arising from an imperfect data collection process, but data objects that are irrelevant or only weakly relevant can also significantly hinder data analysis. Thus, if the goal is to enhance the data analysis as much as possible, these objects should also be considered as noise, at least with respect to the underlying analysis. Consequently, there is a need for data cleaning techniques that remove both types of noise. Because data sets can contain a large amount of noise, these techniques also need to be able to discard a potentially large fraction of the data. This paper explores four techniques intended for noise removal to enhance data analysis in the presence of high noise levels. Three of
Preventing information leaks in email
- In Proceedings of SIAM International Conference on Data Mining (SDM-07)
, 2007
Cited by 9 (5 self)
The widespread use of email has raised serious privacy concerns. A critical issue is how to prevent email information leaks, i.e., when a message is accidentally addressed to non-desired recipients. This is an increasingly common problem that can severely harm individuals and corporations; for instance, a single email leak can potentially cause expensive lawsuits, brand reputation damage, negotiation setbacks and severe financial losses. In this paper we present the first attempt to solve this problem. We begin by redefining it as an outlier detection task, where the unintended recipients are the outliers. Then we combine real email examples (from the Enron Corpus) with carefully simulated leak-recipients to learn textual and network patterns associated with email leaks. This method was able to detect email leaks in almost 82% of the test cases, significantly outperforming all other baselines. More importantly, in a separate set of experiments we applied the proposed method to the task of finding real cases of email leaks. The result was encouraging: a variation of the proposed technique was consistently successful in finding two real cases of email leaks. Not only does this paper introduce the important problem of email leak detection, but it also presents an effective solution that can be easily implemented in any email client, with no changes on the email server side.
Anomaly Detection in a Mobile Communication Network
- Proceedings of the NAACSOS
, 2006
Cited by 8 (5 self)
Cell phone networks produce a massive volume of service usage data which, when combined with location data, can be used to pinpoint emergency situations that cause changes in network usage. Such a change may be the result of an increased number of people trying to call friends or family to tell them what is happening, or of a decrease in network usage caused by people being unable to use the network. Such events are anomalies, and managing emergencies effectively requires identifying anomalies quickly. This problem is difficult due to the rate at which very large volumes of data are produced. In this paper, we discuss the use of data stream clustering algorithms for anomaly detection.
Mining for misconfigured machines in grid systems
- In KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
, 2006
Cited by 6 (0 self)
Grid systems are proving increasingly useful for managing the batch computing jobs of organizations. One well-known example is Intel, whose internally developed NetBatch system manages tens of thousands of machines. The size, heterogeneity, and complexity of grid systems make them very difficult, however, to configure. This often results in misconfigured machines, which may adversely affect the entire system. We investigate a distributed data mining approach for detection of misconfigured machines. Our Grid Monitoring System (GMS) non-intrusively collects data from all sources (log files, system services, etc.) available throughout the grid system. It converts raw data to semantically meaningful data and stores this data on the machine it was obtained from, limiting incurred overhead and allowing scalability. Afterwards, when analysis is requested, a distributed outlier detection algorithm is employed to identify misconfigured machines. The algorithm itself is implemented as a recursive workflow of grid jobs. It is especially suited to grid systems, in which the machines might be unavailable most of the time and often fail altogether.
In-network outlier cleaning for data collection in sensor networks
- In CleanDB, Workshop in VLDB 2006
, 2006
Cited by 6 (1 self)
Outliers are very common in the environmental data monitored by a sensor network consisting of many inexpensive, low-fidelity, and frequently failing sensors. The limited battery power and costly data transmission have introduced a new challenge for outlier cleaning in sensor networks: it must be done in-network to avoid spending energy on transmitting outliers. In this paper, we propose an in-network outlier cleaning approach, including wavelet-based outlier correction and neighboring DTW (Dynamic Time Warping) distance-based outlier removal. The cleaning process is accomplished during the multi-hop data forwarding process, and makes use of the neighboring relation in the hop-count based routing algorithm. Our approach guarantees that most of the outliers can be either corrected, or removed from further transmission within 2 hops. We have simulated a spatially and temporally correlated environmental area, and evaluated the outlier cleaning approach in it. The results show that our approach can effectively clean the sensing data and reduce outlier traffic.
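The DTW distance named in this abstract can be illustrated with the textbook dynamic-programming recurrence. The sketch below is not the paper's algorithm: the absolute-difference cost, the helper `far_from_all_neighbors`, and its `threshold` parameter are assumptions introduced here only to show how a DTW-based neighbor comparison might flag a reading series.

```python
from math import inf


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute-difference cost."""
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def far_from_all_neighbors(series, neighbor_series, threshold):
    """Hypothetical rule: flag a series whose DTW distance to every
    neighboring sensor's series exceeds the threshold."""
    return all(dtw_distance(series, nb) > threshold for nb in neighbor_series)
```

A sensor whose recent readings warp cheaply onto at least one neighbor's readings is treated as consistent under this rule; the threshold would have to be tuned for the deployment.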
Wat: Finding top-k discords in time series database
- In Proceedings of 7th SIAM International Conference on Data Mining
, 2007
Cited by 5 (0 self)
Finding discords in a time series database is an important problem in a great variety of applications, such as space shuttle telemetry, mechanical industry, biomedicine, and financial data analysis. However, most previous methods for this problem suffer from too many parameter settings, which are difficult for users to choose. The best known approach to our knowledge that has comparatively fewer parameters still requires users to choose a word size for the compression of subsequences. In this paper, we propose a Haar wavelet and augmented trie based algorithm to mine the top-K discords from a time series database, which can dynamically determine the word size for compression. Due to the characteristics of the Haar wavelet transform, our algorithm has greater pruning power than previous approaches. Through experiments with some annotated datasets, the effectiveness and efficiency of our algorithm are both attested.
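For readers unfamiliar with the Haar wavelet transform this abstract relies on, here is a minimal sketch of the standard averaging-and-differencing decomposition. It is only the generic transform, not the paper's trie-based discord algorithm; the function name `haar_transform` and the unnormalized (divide-by-two) convention are choices made here for illustration.

```python
def haar_transform(values):
    """Full Haar decomposition of a sequence whose length is a power of two.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    out = [float(v) for v in values]
    length = n
    while length > 1:
        half = length // 2
        # Pairwise averages carry the coarse shape to the next level.
        avgs = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        # Pairwise half-differences keep the detail lost by averaging.
        diffs = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:length] = avgs + diffs
        length = half
    return out
```

The leading coefficients summarize a subsequence at coarse resolution, so comparing only a prefix of the coefficients gives a cheap approximation, which is the kind of property that makes multi-resolution pruning possible.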

