Results 1–10 of 1,043
Anomaly Detection: A Survey, 2007
Cited by 511 (5 self)
Abstract:
Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been developed for specific application domains, while others are more generic. This survey provides a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we identify the key assumptions used by the techniques to differentiate between normal and anomalous behavior; when applying a given technique to a particular domain, these assumptions serve as guidelines for assessing its effectiveness in that domain. For each category, we provide a basic anomaly detection technique and then show how the existing techniques in that category are variants of it. This template offers an easy and succinct understanding of the techniques in each category. Further, for each category we identify the advantages and disadvantages of its techniques, and we discuss their computational complexity, an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and of how techniques developed in one area can be applied in domains for which they were not originally intended.
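As a concrete instance of the survey's "basic technique per category" template, nearest-neighbor-based detection scores each point by its distance to its k-th nearest neighbor, so points far from all neighbors are flagged. A minimal sketch; the function name, the choice of k, and the toy data are illustrative assumptions, not taken from the survey:

```python
import numpy as np

def knn_anomaly_scores(X, k=3):
    """Score each point by the distance to its k-th nearest neighbor.

    Higher scores mean the point sits far from its neighbors, i.e. it is
    more anomalous under the nearest-neighbor assumption.
    """
    X = np.asarray(X, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]       # pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))  # Euclidean distance matrix
    np.fill_diagonal(dists, np.inf)             # ignore self-distance
    return np.sort(dists, axis=1)[:, k - 1]     # k-th smallest per row

# A tight cluster plus one distant point: the distant point scores highest.
X = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [5, 5]]
scores = knn_anomaly_scores(X, k=2)
```

Thresholding these scores (e.g. at a quantile) turns the scores into binary anomaly labels, which is the usual last step in this family of techniques.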
Perspectives on system identification
Plenary talk, Proceedings of the 17th IFAC World Congress, Seoul, South Korea, 2008
Cited by 160 (3 self)
Abstract:
System identification is the art and science of building mathematical models of dynamic systems from observed input-output data. It can be seen as the interface between the real world of applications and the mathematical world of control theory and model abstractions; as such, it is a ubiquitous necessity for successful applications. System identification is a very large topic, with different techniques depending on the character of the models to be estimated: linear, nonlinear, hybrid, nonparametric, etc. At the same time, the area can be characterized by a small number of leading principles, e.g. to look for sustainable descriptions through proper decisions in the triangle of model complexity, information content in the data, and effective validation. The area has many facets, and there are many approaches and methods; a tutorial or a survey in a few pages is not quite possible. Instead, this presentation aims to give an overview of the “science” side, i.e. basic principles and results, and to point to open problem areas on the practical “art” side of how to approach and solve a real problem.
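A common entry point to building such models from input-output data is least-squares identification of a linear ARX model, regressing the current output on past outputs and inputs. The sketch below, with an illustrative noiseless first-order system, is an assumption of mine and not from the talk:

```python
import numpy as np

def identify_arx(u, y, na=1, nb=1):
    """Least-squares fit of an ARX model
        y[t] + a1*y[t-1] + ... + a_na*y[t-na] = b1*u[t-1] + ... + b_nb*u[t-nb]
    returning theta = [a1..a_na, b1..b_nb]."""
    n = max(na, nb)
    rows, targets = [], []
    for t in range(n, len(y)):
        rows.append([-y[t - i] for i in range(1, na + 1)] +
                    [u[t - i] for i in range(1, nb + 1)])
        targets.append(y[t])
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return theta

# Noiseless first-order system y[t] = 0.5*y[t-1] + 2*u[t-1],
# i.e. a1 = -0.5 and b1 = 2 in the ARX parameterization above.
rng = np.random.default_rng(0)
u = rng.standard_normal(50)
y = np.zeros(50)
for t in range(1, 50):
    y[t] = 0.5 * y[t - 1] + 2.0 * u[t - 1]
theta = identify_arx(u, y)
```

With noise-free data and a persistently exciting input, the least-squares estimate recovers the true parameters exactly; real identification work then adds the validation and complexity-selection decisions the talk emphasizes.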
Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures
Cited by 134 (23 self)
Abstract:
The last decade has witnessed tremendous growth of interest in applications that deal with querying and mining of time series data. Numerous representation methods for dimensionality reduction and similarity measures geared towards time series have been introduced. Each work introducing a particular method has made specific claims and, aside from occasional theoretical justifications, provided quantitative experimental observations. However, for the most part, the comparative aspects of these experiments were too narrowly focused on demonstrating the benefits of the proposed methods over some of the previously introduced ones. In order to provide a comprehensive validation, we conducted an extensive set of time series experiments, reimplementing 8 different representation methods and 9 similarity measures and their variants, and testing their effectiveness on 38 time series data sets from a wide variety of application domains. In this paper, we give an overview of these different techniques and present our comparative experimental findings regarding their effectiveness. Our experiments have provided both a unified validation of some of the existing achievements and, in some cases, evidence that certain claims in the literature may be unduly optimistic.
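Two of the simplest measures in this design space are Euclidean distance and dynamic time warping (DTW). The sketch below is illustrative, not the paper's implementations, and shows why DTW can be preferable when patterns are shifted in time:

```python
import numpy as np

def dtw_distance(a, b):
    """Textbook O(n*m) dynamic time warping with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# The same pulse, shifted in time: Euclidean distance is large because it
# compares points index by index, while DTW aligns the peaks.
a = [0, 0, 1, 2, 1, 0, 0, 0]
b = [0, 0, 0, 0, 1, 2, 1, 0]
euclidean = np.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
warped = dtw_distance(a, b)
```

Here `warped` is 0 while `euclidean` is about 3.16, which is the kind of qualitative difference a representation-and-measure comparison has to account for.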
Top 10 algorithms in data mining, 2007
Cited by 113 (2 self)
Abstract:
This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, k-NN, Naive Bayes, and CART. These top 10 algorithms are among the most influential data mining algorithms in the research community. For each algorithm, we provide a description, discuss its impact, and review current and further research on it. These 10 algorithms cover classification, ...
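For one of the listed algorithms, k-Means, the classic Lloyd iteration alternates assigning points to the nearest center with moving each center to the mean of its points. A minimal sketch; the function name, initialization scheme, and toy data are illustrative assumptions, not the paper's pseudocode:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm: assign points to the nearest center,
    then move each center to the mean of its assigned points."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Squared distances from every point to every center, then argmin.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):  # skip empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated blobs are recovered as two clusters.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(X, k=2)
```

Random initialization is the weak point of plain Lloyd's algorithm; much of the follow-up research the paper reviews (e.g. seeding strategies) addresses exactly that.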
Cost-sensitive boosting for classification of imbalanced data, 2007
Cited by 78 (1 self)
Abstract:
Classification of data with an imbalanced class distribution significantly limits the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. The difficulty and frequent occurrence of the class imbalance problem indicate the need for extra research efforts. The objective of this paper is to investigate meta-techniques applicable to most classifier learning algorithms, with the aim of advancing the classification of imbalanced data. The AdaBoost algorithm is reported to be a successful meta-technique for improving classification accuracy. The insight gained from a comprehensive analysis of the AdaBoost algorithm, in terms of its advantages and shortcomings in tackling the class imbalance problem, leads to the exploration of three cost-sensitive boosting algorithms, developed by introducing cost items into the learning framework of AdaBoost. Further analysis shows that one of the proposed algorithms tallies with stagewise additive modelling in statistics to minimize the cost exponential loss. These boosting algorithms are also studied with respect to their weighting strategies towards different types of samples and their effectiveness in identifying rare cases, through experiments on several real-world medical data sets where the class imbalance problem prevails.
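The core idea, inserting a per-class cost into AdaBoost's weight update so that misclassified minority-class samples gain weight faster, can be sketched as follows. Placing the cost inside the exponent is only loosely in the spirit of the paper's variants; the function names, the exhaustive stump learner, and the toy data are illustrative assumptions, and the paper's exact update rules differ:

```python
import numpy as np

def stump_predict(X, f, thr, pol):
    """Axis-aligned decision stump: pol * (+1 if X[:, f] <= thr else -1)."""
    return pol * np.where(X[:, f] <= thr, 1, -1)

def best_stump(X, y, w):
    """Exhaustively pick the stump with minimum weighted error."""
    best, best_err = None, np.inf
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                err = w[stump_predict(X, f, thr, pol) != y].sum()
                if err < best_err:
                    best, best_err = (f, thr, pol), err
    return best, best_err

def cs_adaboost(X, y, cost_pos=3.0, cost_neg=1.0, rounds=15):
    """AdaBoost with a per-class cost in the weight update: misclassified
    positives (the rare class, y == +1) gain weight cost_pos times as fast."""
    X, y = np.asarray(X, float), np.asarray(y)
    w = np.ones(len(y)) / len(y)
    cost = np.where(y == 1, cost_pos, cost_neg)
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = stump_predict(X, *stump)
        w *= np.exp(-alpha * y * pred * cost)  # cost-weighted update
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def ensemble_predict(ensemble, X):
    X = np.asarray(X, float)
    score = sum(a * stump_predict(X, *s) for a, s in ensemble)
    return np.where(score >= 0, 1, -1)

# Tiny separable example with labels in {-1, +1}.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [-1, -1, 1, 1]
model = cs_adaboost(X, y)
```

With `cost_pos > cost_neg`, the ensemble's weights are biased toward recovering rare positive cases, which is the behavior the paper evaluates on imbalanced medical data.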
Revisiting the foundations of Artificial Immune Systems: a problem-oriented perspective
In Hart (Eds.), Artificial Immune Systems (Proc. ICARIS 2003), LNCS 2787, 2003
Cited by 63 (29 self)
Abstract:
This paper advocates a problem-oriented approach to the design of Artificial Immune Systems (AIS) for data mining. By a problem-oriented approach we mean that, in real-world data mining applications, the design of an AIS should take into account the characteristics of the data to be mined together with the application domain: the components of the AIS, such as its representation, affinity function and immune process, should be tailored to the data and the application. This contrasts with the majority of the literature, where a very generic AIS algorithm for data mining is developed with little or no concern for tailoring the components of the AIS to the data to be mined or the application domain. To support this problem-oriented approach, we provide an extensive critical review of the current literature on AIS for data mining, focusing on the data mining tasks of classification and anomaly detection. We discuss several important lessons to be taken from the natural immune system for designing new AIS that are considerably more adaptive than current ones. Finally, we conclude the paper with a summary of seven limitations of current AIS for data mining and ten suggested research directions.
Statistical significance of communities in networks
Physical Review E
Cited by 58 (2 self)
Abstract:
Community structure is one of the main structural features of networks, revealing both their internal organization and the similarity of their elementary units. Despite the large variety of methods proposed to detect communities in graphs, there is a great need for multipurpose techniques able to handle different types of datasets and the subtleties of community structure. In this paper we present OSLOM (Order Statistics Local Optimization Method), the first method capable of detecting clusters in networks while accounting for edge directions, edge weights, overlapping communities, hierarchies and community dynamics. It is based on the local optimization of a fitness function expressing the statistical significance of clusters with respect to random fluctuations, estimated with tools of Extreme and Order Statistics. OSLOM can be used alone or as a refinement procedure for partitions/covers delivered by other techniques. We have also implemented sequential algorithms combining OSLOM with other fast techniques, so that the community structure of very large networks can be uncovered. Our method performs comparably to the best existing algorithms on artificial benchmark graphs, and several applications to real networks are shown as well. OSLOM is implemented in a freely available software package.
Discovering important people and objects for egocentric video summarization
In CVPR, 2012
Cited by 56 (5 self)
Abstract:
We present a video summarization approach for egocentric or “wearable” camera data. Given hours of video, the proposed method produces a compact storyboard summary of the camera wearer’s day. In contrast to traditional keyframe selection techniques, the resulting summary focuses on the most important objects and people with which the camera wearer interacts. To accomplish this, we develop region cues indicative of high-level saliency in egocentric video, such as nearness to hands, gaze, and frequency of occurrence, and learn a regressor to predict the relative importance of any new region based on these cues. Using these predictions and a simple form of temporal event detection, our method selects frames for the storyboard that reflect the key object-driven happenings. Critically, the approach is neither camera-wearer-specific nor object-specific; the learned importance metric need not be trained for a given user or context, and it can predict the importance of objects and people never seen previously. Our results with 17 hours of egocentric data show the method’s promise relative to existing techniques for saliency and summarization.
Similarity Measures for Categorical Data: A Comparative Evaluation, 2008
Cited by 53 (3 self)
Abstract:
Measuring similarity or distance between two entities is a key step in many data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well-understood, but for categorical data the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances, but their relative performance has not been evaluated. In this paper we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that, while no one measure dominates the others for all types of problems, some measures achieve consistently high performance.
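Two measures such a comparison typically covers are the overlap baseline and the data-driven inverse occurrence frequency (IOF). The sketch below uses their commonly stated definitions; the function names and toy data are illustrative assumptions, not the paper's code:

```python
import math
from collections import Counter

def overlap(x, y):
    """Baseline: fraction of attributes on which the two instances match."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def iof(x, y, data):
    """Inverse Occurrence Frequency: a match scores 1; a mismatch scores
    1 / (1 + log f(a) * log f(b)), so mismatches between frequent values
    count as less similar.  Assumes both values occur in `data`."""
    sims = []
    for k, (a, b) in enumerate(zip(x, y)):
        if a == b:
            sims.append(1.0)
        else:
            f = Counter(row[k] for row in data)  # per-attribute frequencies
            sims.append(1.0 / (1.0 + math.log(f[a]) * math.log(f[b])))
    return sum(sims) / len(x)

# Same pair of instances, two verdicts: overlap treats the x/y mismatch as
# a hard 0, while IOF softens it based on how often the values occur.
data = [["a", "x"], ["a", "y"], ["a", "x"], ["b", "y"]]
s_overlap = overlap(["a", "x"], ["a", "y"])
s_iof = iof(["a", "x"], ["a", "y"], data)
```

That the two measures rank the same pair differently is precisely why a task-grounded comparison (here, outlier detection) is needed to choose among them.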
An Online Outlier Detection Technique for Wireless Sensor Networks using Unsupervised Quarter-Sphere Support Vector Machine
Cited by 52 (2 self)
Abstract:
The main challenge faced by outlier detection techniques designed for wireless sensor networks is achieving a high detection rate and a low false alarm rate while keeping resource consumption in the network to a minimum. In this paper, we propose an online outlier detection technique with low computational complexity and memory usage, based on an unsupervised centered quarter-sphere support vector machine, for real-time environmental monitoring applications of wireless sensor networks. The proposed approach is completely local; it thus saves communication overhead and scales well with the number of nodes deployed. We take advantage of the spatial correlations that exist in the sensor data of adjacent nodes to reduce the false alarm rate in real time. Experiments with both synthetic and real data collected from the Intel Berkeley Research Laboratory show that our technique achieves better mining performance, in terms of parameter selection using different kernel functions, than an earlier offline outlier detection technique designed for wireless sensor networks.
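The geometric intuition behind a centered quarter-sphere detector is that, after centering, each point is characterized only by its distance from the origin, and anything beyond a learned radius is flagged. A linear-kernel sketch of that intuition follows; the real technique solves a one-class SVM optimization in a kernel feature space, and the quantile-based radius, function name, and toy data here are illustrative assumptions only:

```python
import numpy as np

def quarter_sphere_outliers(X, nu=0.05):
    """Linear-kernel sketch of the centered quarter-sphere idea: center
    the data, compute each point's norm, set the radius at the (1 - nu)
    quantile of the norms, and flag points outside the radius."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X - X.mean(axis=0), axis=1)
    radius = np.quantile(norms, 1 - nu)
    return radius, norms > radius

# A small grid of normal readings plus one faulty sensor value (last point).
X = [[0.1 * i, 0.1 * j] for i in range(-2, 3) for j in range(-2, 3)] + [[5.0, 5.0]]
radius, flags = quarter_sphere_outliers(X)
```

Because only the norms (or, in the kernelized case, kernel evaluations) are needed, each node can run this locally, which is the property the paper exploits to avoid communication overhead.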