Frequent Sub-Structure-Based Approaches for Classifying Chemical Compounds
- In Proceedings of ICDM’03, 2003
- Cited by 140 (6 self)
In this paper we study the problem of classifying chemical compound datasets. We present a sub-structure-based classification algorithm that decouples the sub-structure discovery process from the classification model construction and uses frequent subgraph discovery algorithms to find all topological and geometric sub-structures present in the dataset. The advantage of our approach is that during classification model construction, all relevant sub-structures are available, allowing the classifier to intelligently select the most discriminating ones. The computational scalability is ensured by the use of highly efficient frequent subgraph discovery algorithms coupled with aggressive feature selection. Our experimental evaluation on eight different classification problems shows that our approach is computationally scalable and outperforms existing schemes by 10% to 35% on average.
Graph mining: laws, generators, and algorithms
- ACM Comput. Surv. (CSUR), 2006
- Cited by 132 (7 self)
How does the Web look? How could we tell an abnormal social network from a normal one? These and similar questions are important in many fields where the data can intuitively be cast as a graph; examples range from computer networks to sociology to biology and many more. Indeed, any M:N relation in database terminology can be represented as a graph. A lot of these questions boil down to the following: “How can we generate synthetic but realistic graphs?” To answer this, we must first understand what patterns are common in real-world graphs and can thus be considered a mark of normality/realism. This survey gives an overview of the incredible variety of work that has been done on these problems. One of our main contributions is the integration of points of view from physics, mathematics, sociology, and computer science. Further, we briefly describe recent advances on some related and interesting graph problems.
Eigenspace-based Anomaly Detection in Computer Systems
- Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2004
- Cited by 51 (4 self)
We report on an automated runtime anomaly detection method at the application layer of multi-node computer systems. Although several network management systems are available in the market, none of them have sufficient capabilities to detect faults in multi-tier Web-based systems with redundancy. We model a Web-based system as a weighted graph, where each node represents a “service” and each edge represents a dependency between services. Since the edge weights vary greatly over time, the problem we address is that of anomaly detection from a time sequence of graphs. In our method, we first extract a feature vector from the adjacency matrix that represents the activities of all of the services. The heart of our method is to use the principal eigenvector of the eigenclusters of the graph. Then we derive a probability distribution for an anomaly measure defined for a time-series of directional data derived from the graph sequence. Given a critical probability, the threshold value is adaptively updated using a novel online algorithm. We demonstrate that a fault in a Web application can be automatically detected and the faulty services are identified without using detailed knowledge of the behavior of the system.
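The idea of using a principal eigenvector of the service dependency matrix as a feature vector can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the shifted power iteration, and the score "one minus the cosine between two consecutive eigenvectors" are assumptions chosen for illustration, and the paper's probabilistic model and adaptive threshold are not reproduced.

```python
import math

def principal_eigenvector(matrix, iterations=200):
    """Approximate the principal eigenvector of a symmetric, non-negative matrix.

    Power iteration is run on (matrix + I); the identity shift keeps the same
    eigenvectors but avoids oscillation when eigenvalues come in +/- pairs.
    """
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [v[i] + sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def anomaly_score(prev_matrix, curr_matrix):
    """Hypothetical score: 1 - cosine between the two unit-length eigenvectors."""
    u = principal_eigenvector(prev_matrix)
    r = principal_eigenvector(curr_matrix)
    return 1.0 - sum(a * b for a, b in zip(u, r))

# Illustrative dependency weights among three services (made-up numbers).
normal = [[0, 5, 1], [5, 0, 1], [1, 1, 0]]
busier = [[0, 10, 2], [10, 0, 2], [2, 2, 0]]    # same pattern, doubled load
faulty = [[0, 0.1, 1], [0.1, 0, 5], [1, 5, 0]]  # traffic shifts to other services

print(anomaly_score(normal, busier))
print(anomaly_score(normal, faulty))
```

Because the eigenvector is normalized, it behaves as directional data: scaling all edge weights leaves the score near zero, while a change in the pattern of dependencies raises it.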
Indexing and Mining Free Trees
- Proceedings of the 2003 IEEE International Conference on Data Mining (ICDM’03), 2003
- Cited by 49 (7 self)
Tree structures are used extensively in domains such as computational biology, pattern recognition, computer networks, and so on. In this paper, we present an indexing technique for free trees and apply this indexing technique to the problem of mining frequent subtrees. We first define a novel representation, the canonical form, for rooted trees and extend the definition to free trees. We also introduce another concept, the canonical string, as a simpler representation for free trees in their canonical forms. We then apply our tree indexing technique to the frequent subtree mining problem and present FreeTreeMiner, a computationally efficient algorithm that discovers all frequently occurring subtrees in a database of free trees. Our mining algorithm is a variation of the traditional a priori method for mining frequent itemsets. We study the performance and the scalability of our algorithms through extensive experiments based on both synthetic data and datasets from two real applications: a dataset of chemical compounds and a dataset of Internet multicast trees. The experiments show that our algorithm scales linearly in the cardinality of the database.
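To make the notion of a canonical string concrete, here is a minimal sketch using a well-known textbook encoding (sorted bracket strings, rooted at the tree's center). It is not the canonical form defined in the paper; the function names are hypothetical, and the sketch assumes unlabeled trees with at least one edge.

```python
from collections import defaultdict

def rooted_canonical(adj, root, parent=None):
    """Bracket string for the unordered tree rooted at `root`.

    Child encodings are sorted, so the order in which children are stored
    does not affect the result.
    """
    child_codes = sorted(
        rooted_canonical(adj, child, root) for child in adj[root] if child != parent
    )
    return "(" + "".join(child_codes) + ")"

def tree_centers(adj):
    """Return the one or two center vertices of a free tree by peeling leaves."""
    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    leaves = [v for v, d in degree.items() if d <= 1]
    remaining = len(adj)
    while remaining > 2:
        remaining -= len(leaves)
        next_leaves = []
        for leaf in leaves:
            for neighbor in adj[leaf]:
                degree[neighbor] -= 1
                if degree[neighbor] == 1:
                    next_leaves.append(neighbor)
        leaves = next_leaves
    return leaves

def free_canonical(edges):
    """Canonical string of a free (unrooted) tree given as an edge list."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # A free tree has one or two centers; take the smaller encoding.
    return min(rooted_canonical(adj, center) for center in tree_centers(adj))

star_a = [(1, 2), (2, 3), (2, 4)]
star_b = [("a", "b"), ("a", "c"), ("a", "d")]
path = [(1, 2), (2, 3), (3, 4)]
print(free_canonical(star_a) == free_canonical(star_b))  # isomorphic trees
print(free_canonical(star_a) == free_canonical(path))    # non-isomorphic trees
```

Two free trees receive the same string exactly when they are isomorphic, which is what lets a miner index candidate subtrees by string instead of comparing structures pairwise.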
A Survey of Frequent Subgraph Mining Algorithms
- The Knowledge Engineering Review, 2004
- Cited by 29 (1 self)
Graph mining is an important research area within the domain of data mining. The field of study concentrates on the identification of frequent subgraphs within graph data sets. The research goals are directed at: (i) effective mechanisms for generating candidate subgraphs (without generating duplicates) and (ii) how best to process the generated candidate subgraphs so as to identify the desired frequent subgraphs in a way that is computationally efficient and procedurally effective. This paper presents a survey of current research in the field of frequent subgraph mining, and proposed solutions to address the main research issues.
MARGIN: Maximal Frequent Subgraph Mining
- Cited by 23 (1 self)
The exponential number of possible subgraphs makes the problem of frequent subgraph mining a challenge. Maximal frequent mining has triggered much interest since the size of the set of maximal frequent subgraphs is much smaller than that of the set of frequent subgraphs. We propose an algorithm that mines the maximal frequent subgraphs while pruning the lattice space considerably. This reduces the number of isomorphism computations, which are the kernel of all frequent subgraph mining problems. Experimental results validate the utility of the technique proposed.
Subdue: compression-based frequent pattern discovery in graph data
- Proceedings of the 1st international workshop on open, 2005
- Cited by 20 (0 self)
A majority of the existing algorithms which mine graph datasets target complete, frequent sub-graph discovery. We describe the graph-based data mining system Subdue, which focuses on the discovery of sub-graphs which are not only frequent but also compress the graph dataset, using a heuristic algorithm. The rationale behind the use of a compression-based methodology for frequent pattern discovery is to produce a smaller number of highly interesting patterns rather than to generate a large number of patterns from which interesting patterns need to be identified. We perform an experimental comparison of Subdue with the graph mining systems gSpan and FSG on the Chemical Toxicity and the Chemical Compounds datasets that are provided with gSpan. We present results on the performance of the Subdue system on the Mutagenesis and the KDD 2003 Citation Graph datasets. An analysis of the results indicates that Subdue can efficiently discover best-compressing frequent patterns which are fewer in number but can be of higher interest.
Comparison of graph-based and logic-based multi-relational data mining
- ACM SIGKDD Explorations Newsletter
- Cited by 13 (2 self)
The goal of this paper is to generate insights about the differences between graph-based and logic-based approaches to multi-relational data mining by performing a case study of the graph-based system Subdue and the inductive logic programming system CProgol. We identify three key factors for comparing graph-based and logic-based multi-relational data mining; namely, the ability to discover structurally large concepts, the ability to discover semantically complicated concepts, and the ability to effectively utilize background knowledge. We perform an experimental comparison of Subdue and CProgol on the Mutagenesis domain and various artificially generated Bongard problems. Experimental results indicate that Subdue can significantly outperform CProgol while discovering structurally large multi-relational concepts. It is also observed that CProgol is better at learning semantically complicated concepts and it tends to use background knowledge more effectively than Subdue.
Parallel Algorithms for Mining Frequent Structural Motifs in Scientific Data
- In ACM International Conference on Supercomputing (ICS), 2004
- Cited by 13 (2 self)
Discovery of important substructures from molecules is an important data mining problem. The basic motivation is that the structure of a molecule has a role to play in its biochemical function. There is interest in finding important, often recurrent, substructures both within a single molecule and across a class of molecules. Recently, we have developed a general purpose suite of algorithms – the MotifMiner Toolkit – that can mine for structural motifs in a wide range of biomolecular datasets. While the algorithms have proven to be extremely useful in their ability to identify novel substructures, the algorithms themselves are quite time consuming. There are two reasons for this: i) inherently the algorithm suffers from the curse of subgraph isomorphism; and ii) handling noise effects (e.g. protein structure data) results in a significant slowdown. To address this problem, in this paper we propose parallelization strategies in a cluster environment for the above algorithms. We identify key optimizations that handle load imbalance, scheduling, and communication overheads. Results show that the optimizations are quite effective and that we are able to obtain good speedup on moderate sized clusters.
Mining Fragments with Fuzzy Chains in Molecular Databases
- University of Pisa, 2004
- Cited by 12 (5 self)
This paper discusses methods to discover frequent, discriminative connected subgraphs (fragments) in a database of molecular structures. We present an extension to a well-known algorithm that allows for the discovery of fragments that contain chains of atoms of varying length. This is particularly important for real-world applications (for example drug discovery or synthetic success prediction) where the exact length of chains connecting two or more otherwise rigid substructures is not critical for the biological or chemical activity of the overall substructure. We demonstrate how the proposed extension successfully discovers fragments with several polymethylene bridges.