Frequent Closed Sequence Mining without Candidate Maintenance, 2007
Cited by 7 (0 self)
Previous studies have presented convincing arguments that a frequent pattern mining algorithm should mine only the closed frequent patterns rather than all of them, because doing so yields a more compact yet complete result set as well as better efficiency. However, most of the previously developed closed pattern mining algorithms work under the candidate maintenance-and-test paradigm, which is inherently costly in both runtime and space usage when the support threshold is low or the patterns become long. In this paper, we present BIDE, an efficient algorithm for mining frequent closed sequences without candidate maintenance. It adopts a novel sequence closure checking scheme called BI-Directional Extension and prunes the search space more deeply than previous algorithms by using the BackScan pruning method. A thorough performance study with both sparse and dense, real and synthetic data sets has demonstrated that BIDE significantly outperforms the previous algorithm: it consumes one or more orders of magnitude less memory and can be more than an order of magnitude faster. It is also linearly scalable in terms of database size.
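To make the notion of a closed sequence concrete, here is a minimal Python sketch. It is not BIDE: it enumerates every frequent subsequence by brute force and then filters, which is exactly the kind of candidate maintenance the abstract says BIDE avoids. The function names and the toy database are assumptions made for illustration. A frequent sequence is treated as closed when no proper supersequence has the same support.

```python
from itertools import combinations


def is_subsequence(pattern, sequence):
    """True if the items of pattern appear in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Number of database sequences that contain pattern as a subsequence."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def frequent_sequences(database, min_support):
    """Brute-force enumeration of frequent subsequences (toy inputs only)."""
    candidates = set()
    for seq in database:
        for length in range(1, len(seq) + 1):
            for positions in combinations(range(len(seq)), length):
                candidates.add(tuple(seq[i] for i in positions))
    frequent = {}
    for pattern in candidates:
        count = support(pattern, database)
        if count >= min_support:
            frequent[pattern] = count
    return frequent


def closed_sequences(frequent):
    """Keep patterns that have no longer supersequence with equal support."""
    closed = {}
    for pattern, count in frequent.items():
        absorbed = any(
            len(other) > len(pattern)
            and other_count == count
            and is_subsequence(pattern, other)
            for other, other_count in frequent.items()
        )
        if not absorbed:
            closed[pattern] = count
    return closed


# Hypothetical toy database: each string is one sequence of single-letter items.
database = ["abc", "abc", "ab"]
frequent = frequent_sequences(database, min_support=2)
closed = closed_sequences(frequent)
print(closed)
```

On this toy input, seven subsequences are frequent but only two are closed: ('a', 'b') with support 3 and ('a', 'b', 'c') with support 2. The compact result still lets a reader recover the support of every other frequent pattern.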
A Brief Survey on Sequence Classification
Cited by 3 (0 self)
Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and anomaly detection. Unlike classification on feature vectors, sequences do not have explicit features. Even with sophisticated feature selection techniques, the dimensionality of potential features may still be very high, and the sequential nature of features is difficult to capture. This makes sequence classification a more challenging task than classification on feature vectors. In this paper, we present a brief review of the existing work on sequence classification. We summarize sequence classification in terms of methodologies and application domains. We also provide a review of several extensions of the sequence classification problem, such as early classification on sequences and semi-supervised learning on sequences.
Mining Frequent Arrangements of Temporal Intervals (under consideration for publication in Knowledge and Information Systems), 2008
Cited by 2 (1 self)
The problem of discovering frequent arrangements of temporal intervals is studied. It is assumed that the database consists of sequences of events, where an event occurs during a time interval. The goal is to mine temporal arrangements of event intervals that appear frequently in the database. The motivation of this work is the observation that in practice most events are not instantaneous but occur over a period of time, and different events may occur concurrently. Thus, many practical applications require mining such temporal correlations between intervals, including the linguistic analysis of annotated data from American Sign Language as well as network and biological data. Three efficient methods to find frequent arrangements of temporal intervals are described; the first two are tree-based and use breadth-first and depth-first search to mine the set of frequent arrangements, whereas the third one is prefix-based. These methods apply efficient pruning techniques that include a set of constraints that add user-controlled focus to the mining process. Moreover, based on the extracted patterns, a standard method for mining association rules is employed that applies different interestingness measures to evaluate the significance of the discovered
Finding Top-n Emerging Sequences to Contrast Sequence Sets
Cited by 2 (1 self)
Comparing groups or sets is the main focal issue in statistics, and data mining research has also focused on automatically identifying values and instances that differ significantly across groups, known as contrast sets. Whether in traditional statistics or in the work on contrast sets, the comparison is made on nominal data. There is very little work on contrasting sets of event sequences. In this paper we introduce the notion of emerging sequences: sequences that, when taken from a set of sequences A and placed in a set of sequences B, would be considered abnormal outliers in B and thus distinguish set A from set B. We present approaches for finding such emerging sequences efficiently and introduce an algorithm for discovering the topmost emerging sequences.
Efficiently Mining Closed Subsequences with Gap Constraints, 2008
Cited by 2 (0 self)
Mining frequent subsequence patterns from sequence databases is a typical data mining problem, and various efficient sequential pattern mining algorithms have been proposed. In many problem domains (e.g., biology), the frequent subsequences confined by predefined gap requirements are more meaningful than general sequential patterns. In this paper we re-examine the closed sequential pattern mining problem by introducing gap constraints. The most challenging parts of this task are constrained pattern closure checking and pruning of the unpromising search space. Inspired by some state-of-the-art closed or constrained sequential pattern mining algorithms, we propose an efficient approach to finding the complete set of closed sequential patterns with gap constraints. The approach combines the newly devised constrained pattern closure checking scheme and pruning techniques with the pattern-growth-based subsequence enumeration framework. Our extensive performance study shows that our approach is very efficient in mining frequent closed subsequences with gap constraints.
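As an illustration of what a gap constraint means for subsequence matching, the following Python sketch checks whether a pattern occurs in a sequence when the number of skipped items between consecutive matched items must lie between min_gap and max_gap. This is only the matching step under an assumed definition of "gap"; it is not the closure checking or pruning scheme the abstract refers to, and the function names and parameters are hypothetical.

```python
def matches_with_gap(pattern, sequence, min_gap=0, max_gap=1):
    """True if pattern occurs in sequence with min_gap..max_gap skipped
    items between every pair of consecutive matched items."""
    if not pattern:
        return True
    # Track every position where the current pattern prefix can end.
    # Keeping all positions (not just the first) matters: a greedy
    # leftmost match can miss a valid later occurrence.
    ends = {i for i, item in enumerate(sequence) if item == pattern[0]}
    for wanted in pattern[1:]:
        next_ends = set()
        for i in ends:
            lo = i + 1 + min_gap
            hi = min(i + 1 + max_gap, len(sequence) - 1)
            for j in range(lo, hi + 1):
                if sequence[j] == wanted:
                    next_ends.add(j)
        ends = next_ends
        if not ends:
            return False
    return True


def gap_support(pattern, database, min_gap=0, max_gap=1):
    """Number of sequences that contain pattern under the gap constraint."""
    return sum(matches_with_gap(pattern, s, min_gap, max_gap) for s in database)


# Hypothetical toy database of single-letter item sequences.
database = ["ABAC", "ABBC", "AC"]
print(gap_support("AC", database, max_gap=1))
```

With max_gap=1, "ABAC" matches through its second A, "ABBC" does not match because two items separate A and C, and "AC" matches directly, so the support is 2. Raising max_gap to 2 admits "ABBC" as well.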
Interesting-Phrase Mining for Ad-Hoc Text Analytics
Cited by 1 (1 self)
Large text corpora with news, customer mail and reports, or Web 2.0 contributions offer great potential for enhancing business-intelligence applications. We propose a framework for performing text analytics on such data in a versatile, efficient, and scalable manner. While much of the prior literature has emphasized mining keywords or tags in blogs or social-tagging communities, we emphasize the analysis of interesting phrases. These include named entities, important quotations, market slogans, and other multi-word phrases that are prominent in a dynamically derived ad-hoc subset of the corpus, e.g., being frequent in the subset but relatively infrequent in the overall corpus. We develop preprocessing and indexing methods for phrases, paired with new search techniques for the top-k most interesting phrases in ad-hoc subsets of the corpus. Our framework is evaluated using a large-scale real-world corpus of New York Times news articles.
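The idea of a phrase being "frequent in the subset but relatively infrequent in the overall corpus" can be sketched in Python as a ratio of document frequencies. The scoring formula, the whitespace tokenizer, the min_subset_docs cutoff, and all names below are assumptions made for illustration; the abstract does not give the framework's actual interestingness measure or its indexing methods, and this sketch scans the corpus directly instead of using an index.

```python
import heapq
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token phrases in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def doc_frequency(docs, n):
    """For each n-gram, the number of documents that contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(ngrams(doc.lower().split(), n)))
    return counts


def top_k_phrases(corpus, subset_ids, n=2, k=3, min_subset_docs=2):
    """Rank phrases by (share of subset docs) / (share of all docs)."""
    subset = [corpus[i] for i in subset_ids]
    global_df = doc_frequency(corpus, n)
    local_df = doc_frequency(subset, n)
    scored = []
    for phrase, local in local_df.items():
        if local < min_subset_docs:
            continue  # too rare in the subset to count as prominent
        score = (local / len(subset)) / (global_df[phrase] / len(corpus))
        scored.append((score, phrase))
    return heapq.nlargest(k, scored)


# Hypothetical four-document corpus; the ad-hoc subset is documents 0 and 1.
corpus = [
    "the rate cut surprised markets",
    "markets cheered the rate cut",
    "the weather was mild",
    "the rate of rain was low",
]
result = top_k_phrases(corpus, subset_ids=[0, 1])
print(result)
```

Here "rate cut" scores 2.0 because it appears in every subset document but only half of the corpus, while "the rate" scores lower because it also appears outside the subset.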
Efficient Mining of Contrast Patterns and Their Applications to Classification
Data mining is one of the most important areas in the 21st century, with many wide-ranging applications. These include medicine, finance, commerce and engineering. Pattern mining is amongst the most important and challenging techniques employed in data mining. Patterns are collections of items which satisfy certain properties. Emerging Patterns are those whose frequencies change significantly from one dataset to another. They represent strong contrast knowledge and have been shown to be very successful for constructing accurate and robust classifiers. In this paper, we examine various kinds of contrast patterns. We also investigate efficient pattern mining techniques and discuss how to exploit patterns to construct effective classifiers.
Mining Conditional Contrast Patterns, 2009
This chapter considers the problem of “conditional contrast pattern mining.” It is related to contrast mining, where one considers the mining of patterns/models that contrast two or more datasets, classes, conditions, time periods, and so forth. Roughly speaking, conditional contrasts capture situations where a small change in patterns is associated with a big change in the matching data of the patterns. More precisely, a conditional contrast is a triple (B, F
An Occurrence based Approach to Mine Emerging Sequences
An important purpose of sequence analysis is to find the distinguishing characteristics of sequence classes. Emerging Sequences (ESs), subsequences that are frequent in sequences of one group and less frequent in the sequences of another, can contrast sequences of different classes and thus facilitate sequence classification. Different approaches have been developed to extract ESs, in which various mining criteria are applied. In our work we compare Emerging Sequences fulfilling different constraints. By measuring ESs with their occurrences, introducing a gap constraint, and keeping the uniqueness of items, our ESs demonstrate desirable discriminative power. Evaluated against two mining algorithms based on support and on subsequences without gap constraints, the experiments on two types of datasets show that the ESs fulfilling our selection criteria achieve a satisfactory classification accuracy: an average F-measure of 93.2% is attained when the experiments are performed on 11 datasets.
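A simple way to picture an emerging sequence is to compare a pattern's relative support in a target group with its relative support in a background group. The Python sketch below uses sequence-level support and a growth-rate threshold, which is an assumption for illustration: the abstract describes an occurrence-based measure with a gap constraint and item uniqueness, none of which is reproduced here. All names and thresholds are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern appear in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def relative_support(pattern, group):
    """Fraction of sequences in group that contain pattern."""
    return sum(is_subsequence(pattern, s) for s in group) / len(group)


def growth_rate(pattern, target, background):
    """Ratio of target support to background support."""
    sup_target = relative_support(pattern, target)
    sup_background = relative_support(pattern, background)
    if sup_background == 0:
        # Absent from the background: infinite growth if present in the target.
        return float("inf") if sup_target > 0 else 0.0
    return sup_target / sup_background


def emerging(patterns, target, background, min_support=0.5, min_growth=2.0):
    """Patterns frequent in target whose support grows enough over background."""
    return [
        p for p in patterns
        if relative_support(p, target) >= min_support
        and growth_rate(p, target, background) >= min_growth
    ]


# Hypothetical toy groups of single-letter item sequences.
target = ["abc", "abd", "acb"]
background = ["bca", "cba", "bac"]
print(emerging(["ab", "bc", "ca"], target, background))
```

On these toy groups only "ab" qualifies: it occurs in every target sequence and in no background sequence, whereas "bc" and "ca" fall below the minimum support in the target group.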
Removing Manually Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books, 2009
Collaborative work on unstructured or semi-structured documents, such as in literature corpora or source code, often involves agreed-upon templates containing metadata. These templates are not consistent across users and over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to automatically identify a large fraction of the templates, thus reducing the burden on programmers. We investigate the case of the Project Gutenberg™ corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases, though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.

