Output Space Sampling for Graph Patterns
2009
Cited by 22 (5 self)
Recent interest in graph pattern mining has shifted from finding all frequent subgraphs to obtaining a small subset of frequent subgraphs that are representative, discriminative or significant. The main motivation is to cope with the scalability problem that graph mining algorithms suffer from when mining databases of large graphs. Another motivation is to obtain a succinct output set that is informative and useful. In the same spirit, researchers have also proposed sampling-based algorithms that sample the output space of the frequent patterns to obtain representative subgraphs. In this work, we propose a generic sampling framework based on the Metropolis-Hastings algorithm to sample the output space of frequent subgraphs. Our experiments on various sampling strategies show the versatility, utility and efficiency of the proposed sampling approach.
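As a rough illustration of the idea in this abstract, the sketch below runs a Metropolis-Hastings random walk over a small, hand-built pattern lattice. It is not the authors' framework: the function name `mh_walk`, the toy lattice of itemset-like patterns (used here in place of real subgraphs), and the uniform target are all assumptions made for the example.

```python
import random

def mh_walk(neighbors, weight, start, steps, rng):
    """Metropolis-Hastings random walk on an undirected state graph.

    neighbors: dict mapping each state to a list of adjacent states
               (adjacency must be symmetric)
    weight:    function returning the unnormalized, positive target
               probability of a state
    """
    x = start
    visited = []
    for _ in range(steps):
        # Proposal: pick a neighbor uniformly, so q(x, y) = 1 / deg(x).
        y = rng.choice(neighbors[x])
        # Acceptance ratio: pi(y) * q(y, x) / (pi(x) * q(x, y)).
        ratio = (weight(y) * len(neighbors[x])) / (weight(x) * len(neighbors[y]))
        if rng.random() < min(1.0, ratio):
            x = y
        visited.append(x)
    return visited

# Hypothetical toy lattice: states are frequent patterns, edges connect
# a pattern to its immediate sub- and super-patterns.
toy_lattice = {
    "": ["a", "b", "c"],
    "a": ["", "ab"],
    "b": ["", "ab"],
    "c": [""],
    "ab": ["a", "b"],
}

if __name__ == "__main__":
    walk = mh_walk(toy_lattice, lambda s: 1.0, "", 50000, random.Random(0))
    for state in toy_lattice:
        print(repr(state), walk.count(state) / len(walk))
```

Passing a different `weight` function (for example, one that favors larger or more discriminative patterns) changes the target distribution without changing the walk itself, which is the kind of flexibility a generic sampling framework would need.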
Effective feature construction by maximum common subgraph sampling
Machine Learning, 2011
Cited by 6 (1 self)
The standard approach to feature construction and predictive learning in molecular datasets is to employ computationally expensive graph mining techniques and to bias the feature search exploration using frequency or correlation measures. These features are then typically employed in predictive models that can be constructed using, for example, SVMs or decision trees. We take a different approach: rather than mining for all optimal local patterns, we extract features from the set of pairwise maximum common subgraphs. The maximum common subgraphs are computed under the block-and-bridge-preserving subgraph isomorphism from the outerplanar examples in polynomial time. We empirically observe a significant increase in predictive performance when using maximum common subgraph features instead of correlated local patterns on 60 benchmark datasets from NCI. Moreover, we show that when we randomly sample the pairs of graphs from which to extract the maximum common subgraphs, we obtain a smaller set of features that still allows the same predictive performance as methods that exhaustively enumerate all possible patterns. The sampling strategy turns out to be a very good compromise between a slight decrease in predictive performance (although still remaining comparable with state-of-the-art methods) and a significant runtime reduction (two orders of magnitude on a popular medium-size chemoinformatics dataset). This suggests that maximum common subgraphs are interesting and meaningful features.
Raedt, Maximum common subgraph mining: a fast and effective approach towards feature generation
in 7th International Workshop on Mining and Learning with Graphs, 2009
Cited by 6 (3 self)
There exists a wide variety of local graph mining approaches that search for frequent, correlated or closed patterns in graphs. These methods typically return very large sets of patterns which can then be used as features to build classifiers. Here we take a different approach: rather than mining for all local patterns, we randomly sample from the set of maximum common subgraphs. The advantages are that maximum common subgraphs are easier to compute than frequent or correlated patterns, and that the resulting features lead to classification models that achieve significantly better predictive performance than models built on the patterns returned by traditional mining approaches.
Approximate Graph Mining with Label Costs
Cited by 2 (1 self)
Many real-world graphs have complex labels on the nodes and edges. Mining only exact patterns yields limited insights, since it may be hard to find exact matches. However, in many domains it is relatively easy to define a cost (or distance) between different labels. Using this information, it becomes possible to mine a much richer set of approximate subgraph patterns, which preserve the topology but allow bounded label mismatches. We present novel and scalable methods to efficiently solve the approximate isomorphism problem. We show that approximate mining yields interesting patterns in several real-world graphs ranging from IT and protein interaction networks to protein structures.
Sampling Minimal Frequent Boolean (DNF) Patterns
Cited by 1 (1 self)
We tackle the challenging problem of mining the simplest Boolean patterns from categorical datasets. Instead of complete enumeration, which is typically infeasible for this class of patterns, we develop effective sampling methods to extract a representative subset of the minimal Boolean patterns (in disjunctive normal form, DNF). We make both theoretical and practical contributions, which allow us to prune the search space based on provable properties. Our approach can provide a near-uniform sample of the minimal DNF patterns. We also show that the mined minimal DNF patterns are very effective when used as features for classification.
Randomly Sampling Maximal Itemsets
Pattern mining techniques generally enumerate many uninteresting and redundant patterns. To obtain less redundant collections, techniques exist that give condensed representations of these collections. However, the proposed techniques often rely on complete enumeration of the pattern space, which can be prohibitive in terms of time and memory. Sampling can be used to filter the output space of patterns without explicit enumeration. We propose a framework for random sampling of maximal itemsets from transactional databases. The presented framework can use any monotonically decreasing measure as the interestingness criterion for this purpose. Moreover, we use an approximation measure to guide the search for maximal sets to different parts of the output space. We show in our experiments that the method can rapidly generate small collections of patterns with good quality. The sampling framework has been implemented in the interactive visual data mining tool called MIME, enabling users to quickly sample a collection of patterns and analyze the results.
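The basic idea of reaching a maximal itemset by random walk can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework from the paper: it uses plain support as the monotonically decreasing measure, picks extensions uniformly at random instead of using the guiding approximation measure mentioned in the abstract, and the names `support` and `sample_maximal_itemset` are invented for the example.

```python
import random

def support(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def sample_maximal_itemset(transactions, min_support, rng):
    """Random walk from the empty set up to a maximal frequent itemset.

    Assumes the empty set is frequent, i.e. min_support <= len(transactions).
    The resulting distribution over maximal itemsets is NOT uniform.
    """
    items = sorted(set().union(*transactions))
    current = frozenset()
    while True:
        # Items whose addition keeps the set frequent.
        candidates = [
            i for i in items
            if i not in current
            and support(current | {i}, transactions) >= min_support
        ]
        if not candidates:
            # Support only shrinks as items are added, so having no frequent
            # single-item extension means current is maximal.
            return current
        current = current | {rng.choice(candidates)}

if __name__ == "__main__":
    db = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
    rng = random.Random(1)
    for _ in range(5):
        print(sorted(sample_maximal_itemset(db, 2, rng)))
```

Each call costs a number of support computations proportional to the length of the walk times the number of items, and never enumerates the full pattern space.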
Direct Pattern Sampling with Respect to Pattern Frequency
We present an exact and highly scalable sampling algorithm that can be used as an alternative to exhaustive local pattern discovery methods. It samples patterns according to their frequency of occurrence and can substantially improve efficiency and controllability of the pattern discovery processes. While previous sampling approaches mainly rely on the Markov chain Monte Carlo method, our procedure is direct, i.e. a non-process-simulating sampling algorithm. The advantages of this direct method are an almost optimal time complexity per pattern as well as an exactly controlled distribution of the produced patterns. In addition, we present experimental results which demonstrate that these procedures can improve the accuracy of pattern-based models similar to frequent sets and often also lead to substantial gains in terms of scalability. An extended version of this paper shows modifications of the algorithm presented here to sample by other frequency-related distributions, namely area, squared frequency and a class discriminativity measure.
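One way to sample itemsets directly in proportion to their frequency, without simulating a Markov chain, is a two-step draw: pick a transaction with weight equal to its number of subsets, then return a uniformly random subset of it. The sketch below illustrates that scheme; the function name is an assumption, and whether it matches the paper's procedure in every detail is not confirmed by the abstract alone.

```python
import random

def sample_pattern_by_frequency(transactions, rng):
    """Draw one itemset with probability proportional to its support.

    Why this works: an itemset F is returned with probability
    sum over transactions t containing F of (2^|t| / Z) * (1 / 2^|t|),
    which equals support(F) / Z, where Z is the sum of 2^|t| over all t.
    The empty set is a possible outcome.
    """
    # Step 1: pick a transaction t with weight 2^|t| (its number of subsets).
    weights = [2 ** len(t) for t in transactions]
    t = rng.choices(transactions, weights=weights, k=1)[0]
    # Step 2: return a uniformly random subset of t by keeping each item
    # independently with probability 1/2. Sorting fixes the iteration order
    # so that a seeded generator gives reproducible draws.
    return frozenset(item for item in sorted(t) if rng.random() < 0.5)

if __name__ == "__main__":
    db = [frozenset("ab"), frozenset("a"), frozenset("abc")]
    rng = random.Random(0)
    for _ in range(5):
        print(sorted(sample_pattern_by_frequency(db, rng)))
```

Each draw takes time linear in the database size for the weighted choice (or less with precomputed cumulative weights) plus time linear in the chosen transaction, and successive draws are independent.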
Journeys to Data Mining: Experiences from 15 Renowned Researchers. Mohamed Medhat Gaber (Editor)
Infrastructure Pattern Discovery in Configuration Management Databases via Large Sparse Graph Mining
A configuration management database (CMDB) can be considered to be a large graph representing the IT infrastructure entities and their interrelationships. Mining such graphs is challenging because they are large, complex, and multi-attributed, and have many repeated labels. These characteristics pose challenges for graph mining algorithms, due to the increased cost of subgraph isomorphism (for support counting), and graph isomorphism (for eliminating duplicate patterns). The notion of pattern frequency or support is also more challenging in a single graph, since it has to be defined in terms of the number of its (potentially, exponentially many) embeddings. We present CMDB-Miner, a novel two-step method for mining infrastructure patterns from CMDB graphs. It first samples the set of maximal frequent patterns, and then clusters them to extract the representative infrastructure patterns. We demonstrate the effectiveness of CMDB-Miner on real-world CMDB graphs.