Results 1-10 of 139
gSpan: Graph-Based Substructure Pattern Mining
2002
Abstract

Cited by 650 (34 self)
We investigate new approaches for frequent graph-based pattern mining in graph datasets and propose a novel algorithm called gSpan (graph-based Substructure pattern mining), which discovers frequent substructures without candidate generation. gSpan builds a new lexicographic order among graphs and maps each graph to a unique minimum DFS code as its canonical label. Based on this lexicographic order, gSpan adopts the depth-first search strategy to mine frequent connected subgraphs efficiently. Our performance study shows that gSpan substantially outperforms previous algorithms, sometimes by an order of magnitude.
Efficiently Mining Frequent Trees in a Forest
2002
Abstract

Cited by 213 (6 self)
Mining frequent trees is very useful in domains like bioinformatics, web mining, mining semistructured data, and so on. We formulate the problem of mining (embedded) subtrees in a forest of rooted, labeled, and ordered trees. We present TreeMiner, a novel algorithm to discover all frequent subtrees in a forest, using a new data structure called scope-list. We contrast TreeMiner with a pattern matching tree mining algorithm (PatternMatcher). We conduct detailed experiments to test the performance and scalability of these methods. We find that TreeMiner outperforms the pattern matching approach by a factor of 4 to 20, and has good scaleup properties. We also present an application of tree mining to analyze real web logs for usage patterns.
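The abstract names the scope-list but does not define it. As a rough illustration only, the sketch below assumes the common convention that a node's scope is the interval from its own preorder index to the preorder index of its rightmost descendant, so that an ancestor test (the relation embedded subtrees need) reduces to interval containment. The nested-tuple tree encoding and the names `scopes` and `is_ancestor` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical encoding: a rooted, ordered, labeled tree is (label, [children]).
Tree = Tuple[str, list]
Scope = Tuple[str, int, int]  # (label, start, end)


def scopes(tree: Tree) -> List[Scope]:
    """Return one (label, start, end) triple per node, in preorder.

    start is the node's preorder index; end is the preorder index of its
    rightmost descendant (end == start for a leaf). This convention is an
    assumption made for this sketch.
    """
    out: List[Scope] = []

    def visit(node: Tree) -> int:
        label, children = node
        start = len(out)
        out.append((label, start, start))  # placeholder end, fixed below
        end = start
        for child in children:
            end = visit(child)  # the last child visited determines the end
        out[start] = (label, start, end)
        return end

    visit(tree)
    return out


def is_ancestor(a: Scope, b: Scope) -> bool:
    """True if node a is a proper ancestor of node b (scope containment)."""
    return a[1] < b[1] and b[2] <= a[2]


# Small example tree: A has children B and E; B has children C and D.
example = ("A", [("B", [("C", []), ("D", [])]), ("E", [])])
node_scopes = scopes(example)
```

With these scopes, checking whether one node lies below another needs only two integer comparisons and no tree traversal.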
An efficient algorithm for discovering frequent subgraphs
IEEE Transactions on Knowledge and Data Engineering, 2002
Abstract

Cited by 120 (7 self)
Over the years, frequent itemset discovery algorithms have been used to find interesting patterns in various application areas. However, as data mining techniques are increasingly applied to nontraditional domains, existing frequent pattern discovery approaches cannot be used, because the transaction framework that these algorithms assume cannot effectively model the datasets in these domains. An alternative way of modeling the objects in these datasets is to represent them using graphs. Within that model, one way of formulating the frequent pattern discovery problem is as that of discovering subgraphs that occur frequently over the entire set of graphs. In this paper we present a computationally efficient algorithm, called FSG, for finding all frequent subgraphs in large graph datasets. We experimentally evaluate the performance of FSG using a variety of real and synthetic datasets. Our results show that despite the underlying complexity associated with frequent subgraph discovery, FSG is effective in finding all frequently occurring subgraphs in datasets containing over 200,000 graph transactions and scales linearly with respect to the size of the dataset. Index Terms: data mining, scientific datasets, frequent pattern discovery, chemical compound datasets.
Spin: Mining maximal frequent subgraphs from graph databases
In KDD, 2004
Abstract

Cited by 99 (12 self)
One fundamental challenge for mining recurring subgraphs from semistructured data sets is the overwhelming abundance of such patterns. In large graph databases, the total number of frequent subgraphs can become too large to allow a full enumeration using reasonable computational resources. In this paper, we propose a new algorithm that mines only maximal frequent subgraphs, i.e., subgraphs that are not part of any other frequent subgraph. This may exponentially decrease the size of the output set in the best case; in our experiments on practical data sets, mining maximal frequent subgraphs reduces the total number of mined patterns by two to three orders of magnitude. Our method first mines all frequent trees from a general graph database and then reconstructs all maximal subgraphs from the mined trees. Using two chemical structure benchmarks and a set of synthetic graph data sets, we demonstrate that, in addition to decreasing the output size, our algorithm can achieve a fivefold speedup over the current state-of-the-art subgraph mining algorithms.
XRules: An Effective Structural Classifier for XML Data
2003
Abstract

Cited by 90 (7 self)
XML documents have recently become ubiquitous because of their applicability in a wide range of applications. Classification is an important problem in the data mining domain, but current classification methods for XML documents use IR-based methods in which each document is treated as a bag of words. Such techniques ignore a significant amount of information hidden inside the documents. In this paper we discuss the problem of rule-based classification of XML data by using frequent discriminatory substructures within XML documents. Such a technique is more capable of finding the classification characteristics of documents. In addition, the technique can also be extended to cost-sensitive classification. We show the effectiveness of the method with respect to other classifiers. We note that the methodology discussed in this paper is applicable to any kind of semistructured data.
Efficient discovery of frequent unordered trees
In First International Workshop on Mining Graphs, Trees and Sequences, 2003
Abstract

Cited by 53 (4 self)
Recently, an algorithm called Freqt was introduced which enumerates all frequent induced subtrees in an ordered data tree. We propose a new algorithm for mining unordered frequent induced subtrees. We show that the complexity of enumerating unordered trees is not higher than the complexity of enumerating ordered trees; a strategy for determining the frequency of unordered trees is introduced.
Discovering Frequent Substructures In Large Unordered Trees
In Proc. of the 6th Intl. Conf. on Discovery Science, 2003
Abstract

Cited by 51 (6 self)
In this paper, we study a data mining problem of discovering frequent substructures in a large collection of semistructured data, where both the patterns and the data are modeled by labeled unordered trees. An unordered tree is a directed acyclic graph with a specified node called the root, and all nodes but the root have at most one parent. Each node is labeled by a symbol drawn from an alphabet. Such unordered trees can be seen as either a generalization of itemsets in relational databases or an efficient specialization of attributed graphs in graph mining. They are also useful in various applications such as analysis of chemical compounds and mining hyperlink structures on the Web. Introducing novel definitions of the support and the canonical form for unordered trees, we present an efficient algorithm called Unot that computes all labeled unordered trees appearing in a collection of data trees with frequency above a user-specified threshold. We prove that the algorithm enumerates each frequent pattern T in O(kb n) per pattern, where k is the size of T, b is the branching factor of the data tree, and n is the total number of occurrences of T in the data trees. The keys of the algorithm are efficient enumeration of all unordered trees in canonical form and incremental computation of the occurrences, based on a powerful design technique known as the reverse search.
The Complexity of Mining Maximal Frequent Itemsets and Maximal Frequent Patterns
In KDD ’04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004
Abstract

Cited by 50 (0 self)
Mining maximal frequent itemsets is one of the most fundamental problems in data mining. In this paper we study the complexity-theoretic aspects of maximal frequent itemset mining, from the perspective of counting the number of solutions. We present the first formal proof that the problem of counting the number of distinct maximal frequent itemsets in a database of transactions, given an arbitrary support threshold, is #P-complete, thereby providing strong theoretical evidence that the problem of mining maximal frequent itemsets is NP-hard. This result is of particular interest since the associated decision problem of checking the existence of a maximal frequent itemset is in P. We also extend our complexity analysis to other similar data mining problems dealing with complex data structures, such as sequences, trees, and graphs, which have attracted intensive research interest in recent years. Normally, in these problems a partial order among frequent patterns can be defined in such a way as to preserve the downward closure property, with maximal frequent patterns being those without any successor with respect to this partial order. We investigate several variants of these mining problems in which the patterns of interest are subsequences, subtrees, or subgraphs, and show that the associated problems of counting the number of maximal frequent patterns are all either #P-complete or #P-hard.
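To make the terms concrete, the following is a minimal brute-force sketch of what "maximal frequent itemset" and "downward closure" mean on a toy transaction database. It is exponential in the number of items and is not the paper's method (the paper is a complexity analysis); the function names, the toy database `db`, and the support threshold are all assumptions made for illustration.

```python
from itertools import combinations
from typing import FrozenSet, List, Set


def frequent_itemsets(transactions: List[Set[str]], min_support: int) -> Set[FrozenSet[str]]:
    """Enumerate every non-empty itemset contained in >= min_support transactions."""
    items = sorted(set().union(*transactions))
    frequent: Set[FrozenSet[str]] = set()
    for size in range(1, len(items) + 1):
        found_any = False
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            support = sum(1 for t in transactions if candidate <= t)
            if support >= min_support:
                frequent.add(candidate)
                found_any = True
        if not found_any:
            # Downward closure: if no itemset of this size is frequent,
            # no larger itemset can be frequent either.
            break
    return frequent


def maximal(frequent: Set[FrozenSet[str]]) -> Set[FrozenSet[str]]:
    """Keep only itemsets with no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}


# Toy database (hypothetical): five transactions over items a, b, c.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
freq = frequent_itemsets(db, min_support=3)
max_freq = maximal(freq)
```

Here every single item and every pair is frequent at support 3, but `{a, b, c}` occurs only twice, so the three pairs are the maximal frequent itemsets.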
Indexing and Mining Free Trees
Proceedings of the 2003 IEEE International Conference on Data Mining (ICDM’03), 2003
Abstract

Cited by 49 (7 self)
Tree structures are used extensively in domains such as computational biology, pattern recognition, computer networks, and so on. In this paper, we present an indexing technique for free trees and apply this indexing technique to the problem of mining frequent subtrees. We first define a novel representation, the canonical form, for rooted trees and extend the definition to free trees. We also introduce another concept, the canonical string, as a simpler representation for free trees in their canonical forms. We then apply our tree indexing technique to the frequent subtree mining problem and present FreeTreeMiner, a computationally efficient algorithm that discovers all frequently occurring subtrees in a database of free trees. Our mining algorithm is a variation of the traditional a priori method for mining frequent itemsets. We study the performance and the scalability of our algorithms through extensive experiments based on both synthetic data and datasets from two real applications: a dataset of chemical compounds and a dataset of Internet multicast trees. The experiments show that our algorithm scales linearly in the cardinality of the database.
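The abstract does not spell out its canonical form or canonical string, so the sketch below substitutes a well-known alternative rather than the paper's definition: encode a rooted unordered tree by sorting the encodings of its children, and handle a free (unrooted) tree by rooting it at its center node or nodes and keeping the smallest encoding. The adjacency-dict representation, the function names, and the assumption that labels contain no parentheses are all choices made for this illustration.

```python
from typing import Dict, List, Optional


def rooted_canonical(adj: Dict[int, List[int]], labels: Dict[int, str],
                     root: int, parent: Optional[int] = None) -> str:
    """Encode the subtree at root; sorting child encodings removes sibling order."""
    children = sorted(
        rooted_canonical(adj, labels, child, root)
        for child in adj[root] if child != parent
    )
    # Assumes labels contain no "(" or ")" so the encoding stays unambiguous.
    return labels[root] + "(" + "".join(children) + ")"


def centers(adj: Dict[int, List[int]]) -> List[int]:
    """Peel leaves layer by layer; the one or two nodes left are the centers."""
    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    remaining = len(adj)
    leaves = [v for v, d in degree.items() if d <= 1]
    while remaining > 2:
        remaining -= len(leaves)
        next_leaves = []
        for leaf in leaves:
            for neighbor in adj[leaf]:
                degree[neighbor] -= 1
                if degree[neighbor] == 1:
                    next_leaves.append(neighbor)
        leaves = next_leaves
    return leaves


def free_canonical(adj: Dict[int, List[int]], labels: Dict[int, str]) -> str:
    """Canonical string of a free tree: smallest encoding over its centers."""
    return min(rooted_canonical(adj, labels, c) for c in centers(adj))


# Two numberings of the same labeled star (hypothetical data).
adj1 = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
labels1 = {0: "C", 1: "C", 2: "O", 3: "N"}
adj2 = {0: [2], 1: [2], 2: [0, 1, 3], 3: [2]}
labels2 = {0: "N", 1: "O", 2: "C", 3: "C"}
```

Because isomorphic free trees yield the same string, the string can serve as a dictionary key when counting how often a candidate subtree occurs.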
Efficient data mining for maximal frequent subtrees
In ICDM ’03: Proceedings of the Third IEEE International Conference on Data Mining, 2003
Abstract

Cited by 43 (0 self)
A new type of tree mining is defined in this paper, which uncovers maximal frequent induced subtrees from a database of unordered labeled trees. A novel algorithm, PathJoin, is proposed. The algorithm uses a compact data structure, FST-Forest, which compresses the trees and still keeps the original tree structure. PathJoin generates candidate subtrees by joining the frequent paths in FST-Forest. Such candidate subtree generation is localized and thus substantially reduces the number of candidate subtrees. Experiments with synthetic data sets show that the algorithm is effective and efficient.