Anomaly Detection: A Survey
, 2007
Cited by 69 (1 self)
Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and more succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
Clustering and Prediction of Mobile User Routes from Cellular Data
- in PKDD 2005
, 2005
Cited by 9 (0 self)
Location-awareness and prediction of future locations is an important problem in pervasive and mobile computing. In cellular systems (e.g., GSM) the serving cell is easily available as an indication of the user location, without any additional hardware or network services. With this location data and other context variables we can determine places that are important to the user, such as work and home. We devise online algorithms that learn routes between important locations and predict the next location when the user is moving. We incrementally build clusters of cell sequences to represent physical routes. Predictions are based on destination probabilities derived from these clusters. Other context variables such as the current time can be integrated into the model. We evaluate the model with real location data, and show that it achieves good prediction accuracy with relatively little memory, making the algorithms suitable for online use in mobile environments.
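To make the idea of "destination probabilities derived from clusters of cell sequences" concrete, here is a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the clustering or scoring details, so the overlap-based weighting, the function name `destination_probabilities`, and the `(cells, destination)` data layout are all assumptions made purely for illustration.

```python
from collections import Counter


def destination_probabilities(partial_route, past_routes):
    """Estimate P(destination | cells seen so far).

    past_routes is a list of (cells, destination) pairs (hypothetical layout).
    Each stored route votes for its destination, weighted by the fraction of
    the currently observed cells that it also contains (an assumed heuristic).
    """
    seen = set(partial_route)
    weights = Counter()
    for cells, destination in past_routes:
        overlap = len(seen & set(cells)) / len(seen) if seen else 0.0
        weights[destination] += overlap
    total = sum(weights.values())
    if total == 0:
        return {}  # nothing observed yet, or no stored route matches
    return {dest: weight / total for dest, weight in weights.items()}


# Usage: two stored routes to "work", one to "gym"; the user has passed c1, c2.
past = [
    (["c1", "c2", "c3"], "work"),
    (["c1", "c2", "c3"], "work"),
    (["c1", "c4", "c5"], "gym"),
]
probs = destination_probabilities(["c1", "c2"], past)
```

A real online system would presumably merge similar routes into clusters rather than storing every trip, which is what keeps memory use low in the setting the abstract describes.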
Mining for outliers in sequential databases
- in ICDM, 2006
Cited by 7 (1 self)
The mining of outliers (or anomaly detection) in large databases remains an active area of research with many potential applications. Over the last several years many novel methods have been proposed to efficiently and accurately mine for outliers. In this paper we propose a unique approach to mine for sequential outliers using Probabilistic Suffix Trees (PST). The key insight that underpins our work is that we can distinguish outliers from non-outliers by only examining the nodes close to the root of the PST. Thus, if the goal is to just mine outliers, then we can drastically reduce the size of the PST and reduce its construction and query time. In our experiments, we show that on a real data set consisting of protein sequences, by retaining less than 5% of the original PST we can retrieve all the
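The intuition that only nodes near the root matter can be illustrated with a depth-bounded context model, sketched below in Python. This is a simplified stand-in for a pruned PST, not the authors' implementation: the flat dictionary of contexts, the additive smoothing, and the names `build_counts` and `avg_log_likelihood` are assumptions for illustration. Sequences are assumed to be strings so that contexts are hashable.

```python
from collections import defaultdict
import math


def build_counts(sequences, max_depth):
    """Count next-symbol occurrences for every context of length 0..max_depth.

    Keeping max_depth small corresponds to retaining only nodes near the root.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for depth in range(min(max_depth, i) + 1):
                counts[seq[i - depth:i]][symbol] += 1
    return counts


def avg_log_likelihood(seq, counts, max_depth, alphabet_size, smoothing=1.0):
    """Length-normalised log-likelihood; low values suggest an outlier."""
    total = 0.0
    for i, symbol in enumerate(seq):
        # Back off from the longest context to the first one seen in training.
        context = ""
        for depth in range(min(max_depth, i), -1, -1):
            candidate = seq[i - depth:i]
            if candidate in counts:
                context = candidate
                break
        seen = counts.get(context, {})
        # Additive smoothing so unseen symbols get a small non-zero probability.
        prob = (seen.get(symbol, 0) + smoothing) / (
            sum(seen.values()) + smoothing * alphabet_size
        )
        total += math.log(prob)
    return total / max(len(seq), 1)


# Usage: train on alternating sequences, then score a typical and an odd one.
training = ["abababab", "babababa", "abababab"]
model = build_counts(training, max_depth=2)
normal_score = avg_log_likelihood("ababab", model, 2, alphabet_size=3)
odd_score = avg_log_likelihood("aaccca", model, 2, alphabet_size=3)
```

Sequences whose score falls well below that of typical training sequences would be flagged as candidate outliers; the threshold is application dependent and not specified by the abstract.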
Extracting key-substring-group features for text classification
- In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’06)
, 2006
Cited by 6 (0 self)
In many text classification applications, it is appealing to treat every document as a string of characters rather than a bag of words. Previous research in this area mostly focused on different variants of generative Markov chain models. Although discriminative machine learning methods like Support Vector Machine (SVM) have been quite successful in text classification with word features, it is neither effective nor efficient to apply them straightforwardly, taking all substrings in the corpus as features. In this paper, we propose to partition all substrings into statistical equivalence groups, and then pick those groups which are important (in the statistical sense) as features (named key-substring-group features) for text classification. In particular, we propose a suffix tree based algorithm that can extract such features in linear time (with respect to the total number of characters in the corpus). Our experiments on English, Chinese and Greek datasets show that SVM with key-substring-group features can achieve outstanding performance for various text classification tasks.
Annotating proteins by mining protein interaction networks
- Bioinformatics
, 2006
doi:10.1093/bioinformatics/btl221
Agile: A general approach to detect transitions in evolving data streams
- In ICDM
, 2004
Cited by 4 (0 self)
In many applications such as e-commerce, system diagnosis and telecommunication services, data arrives in streams at a high speed. It is common that the underlying process generating the stream may change over time, either as a result of the fundamental evolution or in response to some external stimulus. Detecting these changes is a very challenging problem of great practical importance. The overall volume of the stream usually far exceeds the available main memory and access to the data stream is typically performed via a linear scan in ascending order of the indices of the records. In this paper, we propose a novel approach, AGILE, to monitor streaming data and to detect distinguishable transitions of the underlying processes. AGILE has many advantages over the traditional Hidden Markov Model, e.g., AGILE only requires one scan of the data.
Substructure Clustering on Sequential 3d Object Datasets
- In Proc. of ICDE
, 2004
Cited by 2 (0 self)
[...] of sequential 3d objects. A sequential 3d object is a set of points located in a three dimensional space that are linked up to form a sequence. Given a set of sequential 3d objects, our aim is to find significantly large substructures which are present in many of the sequential 3d objects. Unlike traditional subspace clustering methods in which objects are compared based on values in the same dimension, the matching dimensions between two 3d sequential objects are affected by both the translation and rotation of the objects and are thus not well defined. Instead, similarity between the objects is judged by computing a structural distance measurement called [...] which requires proper alignment (including translation and rotation) of the objects. As the computation of [...] is expensive, we propose a new measure called [...] which is shown experimentally to approximate [...]. Based on [...], we define a new clustering model called [...] and devise an algorithm for discovering all maximum [...] in a 3d sequential dataset. Experiments are conducted to illustrate the efficiency and effectiveness of our algorithm.
Evaluating Protein Motif Significance Measures: A Case Study on Prosite Patterns
Cited by 2 (2 self)
The existence of preserved subsequences in a set of related protein sequences suggests that they might play a structural and functional role in protein mechanisms. Due to its exploratory approach, the mining process tends to deliver a large number of motifs. Therefore it is critical to have methods that identify relevant, significant motifs. Many measures of interest and significance have been proposed. However, since motifs have a wide range of applications, the choice of appropriate significance measures is application dependent. Some measures show consistent results, being highly correlated, while others show disagreements. In this paper we review existing measures and study their behavior in order to assist the selection of the most appropriate set of measures. An experimental evaluation of the measures for high quality patterns from the Prosite database is presented.
Relational Sequence Clustering for Aggregating Similar Agents
Many clustering methods are based on flat descriptions, while data regarding real-world domains include heterogeneous objects related to each other in multiple ways. For instance, in the field of Multi-Agent Systems, multiple agents interact with the environment and with other agents. In this case, in order to act effectively, an agent should be able to recognise the behaviours adopted by other agents. Actions taken by an agent are sequential, and thus its behaviour can be expressed as a sequence of actions. Inferring knowledge about competing and/or companion agents by observing their actions is very beneficial for constructing a behavioural model of the agent population. In this paper we propose a clustering method for relational sequences that is able to aggregate companion agent behaviours. The algorithm has been tested on a real-world dataset, proving its validity.
Modeling High-Level Behavior Patterns for Precise Similarity Analysis of Software
The analysis of software similarity has many applications such as detecting code clones, software plagiarism, code theft, and polymorphic malware. Because source code is often unavailable and code obfuscation is used to avoid detection, there has been much research on developing effective models to capture runtime behavior to aid detection. Existing models focus on low-level information such as dependency or the mere occurrence of function calls, and suffer from poor precision, poor scalability, or both. To overcome the limitations of existing models, this paper introduces a precise and succinct behavior representation that characterizes high-level object-accessing patterns as regular expressions. We first distill a set of high-level patterns (the alphabet Σ of the regular language) based on two pieces of information: function call patterns to access objects and typestate information of the objects. Then we abstract a runtime trace of a program P into a regular expression e over the pattern alphabet Σ to produce P's behavior signature. We show that software instances derived from the same code exhibit similar behavior signatures and develop effective algorithms to cluster and match behavior signatures. To evaluate the effectiveness of our behavior model, we have applied it to the similarity analysis of polymorphic malware. Our results on a large malware collection demonstrate that our model is both precise and succinct for effective and scalable matching and detection of polymorphic malware.

