Mining significant graph patterns by leap search
- in SIGMOD ’08
Cited by 21 (4 self)
With ever-increasing amounts of graph data from disparate sources, there has been a strong need for exploiting significant graph patterns with user-specified objective functions. Most objective functions are not antimonotonic, which can defeat all frequency-centric graph mining algorithms. In this paper, we give the first comprehensive study of a general mining method that aims to find the most significant patterns directly. Our new mining framework, called LEAP (Descending Leap Mine), is developed to exploit the correlation between structural similarity and significance similarity in such a way that the most significant pattern can be identified quickly by searching dissimilar graph patterns. Two novel concepts, structural leap search and frequency descending mining, are proposed to support leap search in graph pattern space. Our new mining method reveals that the widely adopted branch-and-bound search in the data mining literature is not the best, thus sketching a new picture of scalable graph pattern discovery. Empirical results show that LEAP achieves orders of magnitude speedup in comparison with the state-of-the-art method. Furthermore, graph classifiers built on mined patterns outperform the up-to-date graph kernel method in terms of efficiency and accuracy, demonstrating the high promise of such patterns.
Near-optimal supervised feature selection among frequent subgraphs
- In SIAM Int’l Conf. on Data Mining, 2009
Cited by 9 (0 self)
Graph classification is an increasingly important step in numerous application domains, such as function prediction of molecules and proteins, computerised scene analysis, and anomaly detection in program flows. Among the various approaches proposed in the literature, graph classification based on frequent subgraphs is a popular branch: Graphs are represented as (usually binary) vectors, with components indicating whether a graph contains a particular subgraph that is frequent across the dataset. On large graphs, however, one faces the enormous problem that the number of these frequent subgraphs may grow exponentially with the size of the graphs, but only few of them possess enough discriminative power to make them
Direct mining of discriminative and essential frequent patterns via model-based search tree
- In KDD, 2008
Cited by 7 (3 self)
Frequent patterns provide solutions to datasets that do not have well-structured feature vectors. However, frequent pattern mining is non-trivial since the number of unique patterns is exponential but many are non-discriminative and correlated. Currently, frequent pattern mining is performed in two sequential steps: enumerating a set of frequent patterns, followed by feature selection. Although many methods have been proposed in the past few years on how to perform each separate step efficiently, there is still limited success in eventually finding highly compact and discriminative patterns. The culprit is the inherent nature of this widely adopted two-step approach. This paper discusses these problems and proposes a new and different method. It builds a decision tree that partitions the data onto different
Fast subtree kernels on graphs
Cited by 7 (1 self)
In this article, we propose fast subtree kernels on graphs. On graphs with n nodes, m edges, and maximum degree d, these kernels, which compare subtrees of height h, can be computed in O(mh), whereas the classic subtree kernel by Ramon & Gärtner scales as O(n^2 4^d h). Key to this efficiency is the observation that the Weisfeiler-Lehman test of isomorphism from graph theory elegantly computes a subtree kernel as a byproduct. Our fast subtree kernels can deal with labeled graphs, scale up easily to large graphs and outperform state-of-the-art graph kernels on several classification benchmark datasets in terms of accuracy and runtime.
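
The idea in this abstract, that Weisfeiler-Lehman label refinement yields a subtree kernel as a byproduct, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `wl_features` and `wl_subtree_kernel` and the graph representation (adjacency dict plus label dict) are assumptions made for illustration. The sketch also keeps refined labels as nested tuples instead of compressing them to short identifiers, so it does not reach the O(mh) runtime quoted above.

```python
from collections import Counter


def wl_features(adj, labels, h):
    """Count node labels seen during h rounds of Weisfeiler-Lehman refinement.

    adj:    dict mapping each node to a list of its neighbours (hypothetical format)
    labels: dict mapping each node to its initial label
    Returns a Counter keyed by (iteration, label).
    """
    current = dict(labels)
    feats = Counter((0, lab) for lab in current.values())
    for it in range(1, h + 1):
        refined = {}
        for v in adj:
            # New label = own label plus the sorted multiset of neighbour labels.
            refined[v] = (current[v], tuple(sorted(current[u] for u in adj[v])))
        current = refined
        feats.update((it, lab) for lab in current.values())
    return feats


def wl_subtree_kernel(g1, g2, h=2):
    """Dot product of the WL label-count vectors of two graphs (adj, labels)."""
    f1 = wl_features(g1[0], g1[1], h)
    f2 = wl_features(g2[0], g2[1], h)
    return sum(count * f2[key] for key, count in f1.items())


# Two small graphs with identical node labels: a triangle and a 3-node path.
triangle = ({0: [1, 2], 1: [0, 2], 2: [0, 1]}, {0: "A", 1: "A", 2: "A"})
path = ({0: [1], 1: [0, 2], 2: [1]}, {0: "A", 1: "A", 2: "A"})
print(wl_subtree_kernel(triangle, path, h=1))
```

Each refinement round corresponds to subtree patterns one level taller, so the kernel value counts pairs of nodes whose height-i neighbourhood patterns match for every i up to h.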
Discriminative Clustering by Regularized Information Maximization
Cited by 3 (1 self)
Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method.
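
The three terms named in this abstract (class separation, class balance, classifier complexity) can be sketched as a small objective function. This is an illustrative reading under stated assumptions, not the paper's code: the information term is taken to be the entropy of the average predicted class distribution minus the average entropy of the per-example predictions, and the complexity term is assumed to be a squared-L2 penalty with weight `lam`. The names `information_objective` and `rim_objective` are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def information_objective(cond_probs):
    """Class balance minus class separation cost.

    cond_probs: list of per-example class distributions p(y | x_i).
    """
    n = len(cond_probs)
    k = len(cond_probs[0])
    # Average prediction over the data set: high entropy means balanced classes.
    marginal = [sum(p[c] for p in cond_probs) / n for c in range(k)]
    # Average per-example entropy: low entropy means confident, separated classes.
    mean_conditional_entropy = sum(entropy(p) for p in cond_probs) / n
    return entropy(marginal) - mean_conditional_entropy


def rim_objective(cond_probs, weights, lam):
    """Information term minus an assumed squared-L2 complexity penalty."""
    penalty = lam * sum(w * w for w in weights)
    return information_objective(cond_probs) - penalty


# Confident and balanced predictions score log(2); uniform predictions score 0.
print(information_objective([[1.0, 0.0], [0.0, 1.0]]))
print(information_objective([[0.5, 0.5], [0.5, 0.5]]))
```

Maximizing the information term alone rewards any confident, balanced partition, which is why a complexity penalty is needed to prefer simple decision boundaries.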
Graph Classification via Topological and Label Attributes
Cited by 1 (0 self)
Graph classification is an important data mining task, and various graph kernel methods have been proposed recently for this task. These methods have proven to be effective, but they tend to have high computational overhead. In this paper, we propose an alternative approach to graph classification that is based on feature-vectors constructed from different global topological attributes, as well as global label features. The main idea here is that the graphs from the same class should have similar topological and label attributes. Our method is simple and easy to implement, and via a detailed comparison on real benchmark datasets, we show that our topological and label feature-based approach delivers better or competitive classification accuracy, and is also substantially faster than other graph kernels. It is the most effective method for large unlabeled graphs.
GPM: A Graph Pattern Matching Kernel with Diffusion for Chemical Compound Classification
Classifying chemical compounds is an active topic in drug design and other cheminformatics applications. Graphs are general tools for organizing information from heterogeneous sources and have been applied in modelling many kinds of biological data. With the fast accumulation of chemical structure data, building highly accurate predictive models for chemical graphs emerges as a new challenge. In this paper, we demonstrate a novel technique called the Graph Pattern Matching kernel (GPM). Our idea is to leverage existing frequent pattern discovery methods and explore their application to kernel classifiers (e.g. support vector machines) for graph classification. In our method, we first identify all frequent patterns from a graph database. We then map subgraphs to graphs in the database and use a diffusion process to label nodes in the graphs. Finally, the kernel is computed using a set matching algorithm. We performed experiments on 16 chemical structure data sets and compared our method to other major graph kernels. The experimental results demonstrate excellent performance of our method.
Capacity Control for Partially Ordered Feature Sets
Partially ordered feature sets appear naturally in many classification settings with structured input instances, for example, when the data instances are graphs and a feature tests whether a specific substructure occurs in the instance. Since such features are partially ordered according to an “is substructure of” relation, the information in those datasets is stored in an intrinsically redundant form. We investigate how this redundancy affects the capacity control behavior of linear classification methods. From a theoretical perspective, it can be shown that the capacity of this hypothesis class does not decrease for worst case distributions. However, if the data generating distribution assigns lower probabilities to instances in the lower levels of the hierarchy induced by the partial order, the capacity of the hypothesis class can be bounded by a smaller term. For itemset, subsequence and subtree features in particular, the capacity is finite even when an infinite number of features is present. We validate these results empirically on three graph datasets and show that the limited capacity of linear classifiers on such data makes underfitting rather than overfitting the more prominent capacity control problem. To avoid underfitting, we propose using more general substructure classes with “elastic edges” and we demonstrate how such broad feature classes can be used with large datasets.
Comparison of Chemical Descriptors for Protein–Chemical Interaction Prediction
Predicting protein–chemical interaction has been an important and challenging task in the bioinformatics community, and there are many related applications in biomedical research, including QSAR modelling and novel lead discovery. A fundamental hypothesis for predicting protein–chemical interaction is that chemical compounds sharing chemical similarity should also share protein target profiles, and the critical question is hence how to measure the distance (or similarity) between two chemicals. An increasing number of chemical descriptors have been invented in the past decades. As chemical descriptors play a critical role in predicting protein–chemical interaction, it is of great importance to compare chemical descriptors and evaluate their performance in such predictions. In this paper, we report our case study comparing the performance of DRAGON descriptors, the frequent subgraph-based descriptors (FFSM), and the signature molecular descriptor in predicting protein–chemical interaction using support vector machines over a large number of data sets. Our experiments demonstrate that FFSM and signature descriptors outperform most DRAGON descriptor classes, and that wisely selecting chemical descriptors is beneficial for predicting protein–chemical interaction.
Key Words: protein–chemical interaction, chemical descriptors, molecular graph, cross validation, support vector machines

