Maximal Biclique Subgraphs and Closed Pattern Pairs of the Adjacency Matrix: A One-to-One Correspondence and Mining Algorithms
, 2007
Cited by 24 (8 self)
Maximal biclique (also known as complete bipartite) subgraphs can model many applications in web mining, business, and bioinformatics. Enumerating maximal biclique subgraphs from a graph is a computationally challenging problem, as the size of the output can become exponentially large with respect to the vertex number when the graph grows. In this paper, we efficiently enumerate them through the use of closed patterns of the adjacency matrix of the graph. For an undirected graph G without self-loops, we prove that: (i) the number of closed patterns in the adjacency matrix of G is even; (ii) the number of the closed patterns is precisely double the number of maximal biclique subgraphs of G; and (iii) for every maximal biclique subgraph, there always exists a unique pair of closed patterns that matches the two vertex sets of the subgraph. Therefore, the problem of enumerating maximal bicliques can be solved by using efficient algorithms for mining closed patterns, which are algorithms extensively studied in the data mining field. However, this direct use of existing algorithms causes a duplicated enumeration. To achieve high efficiency, we propose an O(mn) time-delay algorithm for a non-duplicated enumeration, in particular for enumerating those maximal bicliques with a large size, where m and n are the number of edges and vertices of the graph respectively. We evaluate the high efficiency of our algorithm by comparing it to state-of-the-art algorithms on three categories of graphs: randomly generated graphs, benchmarks, and a real-life protein interaction network. In this paper, we also prove that if self-loops are allowed in a graph, then the number of closed patterns in the adjacency matrix is not necessarily even; but the maximal bicliques are exactly the same as those of the graph after removing all the self-loops.
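The correspondence claimed in (ii) and (iii) of this abstract can be checked by brute force on a toy graph. The sketch below is not the paper's O(mn) enumeration algorithm. The function names and the example graph are illustrative assumptions, and the sketch assumes a pattern counts as closed when it equals the common neighborhood of its own common neighborhood, with both sets non-empty.

```python
from itertools import combinations


def common_neighbors(adj, vertices):
    """Vertices adjacent to every member of `vertices`."""
    result = set(adj)
    for v in vertices:
        result &= adj[v]
    return frozenset(result)


def nonempty_subsets(nodes):
    """All non-empty subsets of `nodes` as frozensets (exponential, toy sizes only)."""
    return [frozenset(c)
            for k in range(1, len(nodes) + 1)
            for c in combinations(nodes, k)]


def closed_patterns(adj):
    """Non-empty sets P with a non-empty common neighborhood whose closure is P."""
    found = set()
    for p in nonempty_subsets(sorted(adj)):
        occ = common_neighbors(adj, p)
        if occ and common_neighbors(adj, occ) == p:
            found.add(p)
    return found


def is_biclique(adj, a, b):
    """True if every vertex of `a` is adjacent to every vertex of `b`."""
    return all(v in adj[u] for u in a for v in b)


def maximal_bicliques(adj):
    """Unordered pairs {A, B} of complete bipartite subgraphs that cannot be extended."""
    nodes = sorted(adj)
    subsets = nonempty_subsets(nodes)
    result = set()
    for a in subsets:
        for b in subsets:
            if a & b or not is_biclique(adj, a, b):
                continue
            extendable = any(
                is_biclique(adj, a | {v}, b) or is_biclique(adj, a, b | {v})
                for v in nodes if v not in a and v not in b
            )
            if not extendable:
                result.add(frozenset((a, b)))
    return result


# Hypothetical example: a 4-cycle 0-1-2-3-0 with a pendant vertex 4 attached to 3.
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2, 4}, 4: {3}}
closed = closed_patterns(adj)
bicliques = maximal_bicliques(adj)
print(len(closed), len(bicliques))
```

The two enumerations are computed independently, so comparing their sizes (and checking that both sides of every maximal biclique appear among the closed patterns) is a genuine check of the stated correspondence on this small input.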
MARGIN: Maximal Frequent Subgraph Mining
Cited by 23 (1 self)
The exponential number of possible subgraphs makes the problem of frequent subgraph mining a challenge. Maximal frequent mining has triggered much interest since the size of the set of maximal frequent subgraphs is much smaller than that of the set of frequent subgraphs. We propose an algorithm that mines the maximal frequent subgraphs while pruning the lattice space considerably. This reduces the number of isomorphism computations, which is the kernel of all frequent subgraph mining problems. Experimental results validate the utility of the technique proposed.
Detecting conserved interaction patterns in biological networks
 Journal of Computational Biology
, 2006
Cited by 17 (4 self)
Molecular interaction data plays an important role in understanding biological processes at a modular level by providing a framework for understanding cellular organization, functional hierarchy, and evolutionary conservation. As the quality and quantity of network and interaction data increases rapidly, the problem of effectively analyzing this data becomes significant. Graph theoretic formalisms, commonly used for these analysis tasks, often lead to computationally hard problems due to their relation to subgraph isomorphism. This paper presents an innovative new algorithm, MULE, for detecting frequently occurring patterns and modules in biological networks. Using an innovative graph simplification technique based on ortholog contraction, which is ideally suited to biological networks, our algorithm renders these problems computationally tractable and scalable to large numbers of networks. We show, experimentally, that our algorithm can extract frequently occurring patterns in metabolic pathways and protein interaction networks from the KEGG, DIP, and BIND databases within seconds. When compared to existing approaches, our graph simplification technique can be viewed either as a pruning heuristic, or a closely related, but computationally simpler task. When used as a pruning heuristic, we show that our technique reduces effective graph sizes significantly, accelerating existing techniques by several orders of magnitude! Indeed, for most of the test cases, existing techniques could not even be applied without our pruning step. When used as a stand-alone analysis technique, MULE is shown to convey significant biological insights at near-interactive rates. The software, sample input graphs, and detailed results for comprehensive analysis of nine eukaryotic PPI networks are available at www.cs.purdue.edu/homes/koyuturk/mule. Key words: graph mining, frequent subgraph discovery, evolution, modular conservation.
Canonical Forms for Frequent Graph Mining
 Proc. 30th Ann. Conf. German Classification Society (GfKl 2006)
, 2006
Cited by 17 (6 self)
A core problem of approaches to frequent graph mining, which are based on growing subgraphs into a set of graphs, is how to avoid redundant search. A powerful technique for this is a canonical description of a graph, which uniquely identifies it, and a corresponding test. I introduce a family of canonical forms based on systematic ways to construct spanning trees. I show that the canonical form used in gSpan [14] is a member of this family, and that MoSS/MoFa [1, 3] is implicitly based on a different member, which I make explicit and exploit in the same way.
Novel Approaches for Analyzing Biological Networks
 Journal of Combinatorial Optimization
, 2005
Cited by 15 (3 self)
This paper proposes clique relaxations to identify clusters in biological networks. In particular, the maximum n-clique and maximum n-club problems on an arbitrary graph are introduced and their recognition versions are shown to be NP-complete. In addition, integer programming formulations are proposed and the results of sample numerical experiments performed on biological networks are reported.
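The abstract does not spell out the two relaxations, so the sketch below assumes the standard graph-theoretic definitions: a vertex set is an n-clique if every pair of its members is within distance n in the whole graph, and an n-club if the subgraph induced by the set has diameter at most n. It only verifies a given set; it does not solve the maximum versions, and all names and the hexagon example are illustrative.

```python
from collections import deque


def bfs_distances(adj, source, allowed=None):
    """Hop distances from `source`; if `allowed` is given, never leave that vertex set."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist and (allowed is None or v in allowed):
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def is_n_clique(adj, s, n):
    """Assumed definition: pairwise distances measured in the whole graph are <= n."""
    for u in s:
        dist = bfs_distances(adj, u)
        if any(dist.get(v, float("inf")) > n for v in s):
            return False
    return True


def is_n_club(adj, s, n):
    """Assumed definition: pairwise distances inside the induced subgraph are <= n."""
    s = set(s)
    for u in s:
        dist = bfs_distances(adj, u, allowed=s)
        if any(dist.get(v, float("inf")) > n for v in s):
            return False
    return True


# Hypothetical example: a 6-cycle 0-1-2-3-4-5-0.
hexagon = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
alternate = {0, 2, 4}
print(is_n_clique(hexagon, alternate, 2), is_n_club(hexagon, alternate, 2))
```

Under these assumed definitions the alternate vertices of the hexagon show why the two notions differ: their shortest connecting paths run through vertices outside the set, which an n-clique permits and an n-club does not.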
A novel approach for efficient supergraph query processing on graph databases
 In EDBT
Cited by 13 (0 self)
In recent years, large amounts of data modeled by graphs, namely graph data, have been collected in various domains. Efficiently processing queries on graph databases has attracted a lot of research attention. Supergraph queries are a new and important kind of query in practice. A supergraph query, q, on a graph database D retrieves all graphs in D such that q is a supergraph of them. Because the number of graphs in databases is large and subgraph isomorphism testing is NP-complete, efficiently processing such queries is a big challenge. This paper first proposes an optimal compact method for organizing graph databases. Common subgraphs of the graphs in a database are stored only once in the compact organization of the database, in order to reduce the overall cost of subgraph isomorphism tests from stored graphs to queries during query processing. Then, an exact algorithm and an approximate algorithm for generating a significant feature set with optimal order are proposed to construct indices on graph databases. The optimal order on the feature set reduces the number of subgraph isomorphism tests during query processing. Based on the compact organization of graph databases, a novel algorithm for testing subgraph isomorphisms from multiple graphs to one graph is presented. Finally, based on all these techniques, a query processing method is proposed. Analytical and experimental results show that the proposed algorithms outperform the existing similar algorithms by one to two orders of magnitude.
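To make the query semantics concrete, here is a naive baseline that answers a supergraph query by testing every stored graph against q. It is not the paper's compact organization or index. It assumes unlabeled, undirected graphs and non-induced subgraph isomorphism, tries every injective vertex mapping, and is only usable on very small inputs. All names and the example database are hypothetical.

```python
from itertools import permutations


def is_subgraph(g_nodes, g_edges, q_nodes, q_edges):
    """True if g maps injectively into q so that every edge of g lands on an edge of q."""
    q_edge_set = {frozenset(e) for e in q_edges}
    for image in permutations(q_nodes, len(g_nodes)):
        mapping = dict(zip(g_nodes, image))
        if all(frozenset((mapping[u], mapping[v])) in q_edge_set
               for u, v in g_edges):
            return True
    return False


def supergraph_query(database, q_nodes, q_edges):
    """Names of all stored graphs that are contained in the query graph q."""
    return [name for name, (nodes, edges) in database.items()
            if is_subgraph(nodes, edges, q_nodes, q_edges)]


# Hypothetical database D of small graphs, stored as (nodes, edges).
database = {
    "triangle": ([0, 1, 2], [(0, 1), (1, 2), (2, 0)]),
    "path3": ([0, 1, 2], [(0, 1), (1, 2)]),
    "square": ([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (3, 0)]),
    "star3": ([0, 1, 2, 3], [(0, 1), (0, 2), (0, 3)]),
}

# Query q: a triangle a-b-c with a pendant vertex d attached to c.
query_nodes = ["a", "b", "c", "d"]
query_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]

answer = supergraph_query(database, query_nodes, query_edges)
print(answer)
```

Each stored graph costs one full isomorphism test here, which is the cost the abstract says its shared-subgraph organization and ordered feature index are designed to reduce.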
Discovering Correlated Spatio-Temporal Changes in Evolving Graphs
 UNDER CONSIDERATION FOR PUBLICATION IN KNOWLEDGE AND INFORMATION SYSTEMS
, 2007
Cited by 13 (3 self)
Graphs provide powerful abstractions of relational data, and are widely used in fields such as network management, web page analysis and sociology. While many graph representations of data describe dynamic and time-evolving relationships, most graph mining work treats graphs as static entities. Our focus in this paper is to discover regions of a graph that are evolving in a similar manner. To discover regions of correlated spatio-temporal change in graphs, we propose an algorithm called cSTAG. Whereas most clustering techniques are designed to find clusters that optimise a single distance measure, cSTAG addresses the problem of finding clusters that optimise both temporal and spatial distance measures simultaneously. We show the effectiveness of cSTAG using a quantitative analysis of accuracy on synthetic data sets, as well as demonstrating its utility on two large, real-life data sets, where one is the routing topology of the Internet, and the other is the dynamic graph of files accessed together on the 1998 World Cup official website.
Clustering Document Images Using a Bag of Symbols Representation
 In Proceedings 8th International Conference on Document Analysis and Recognition
, 2005
Cited by 12 (2 self)
Document image classification is an important step in document image analysis. Based on classification results, we can tackle other tasks such as indexing, understanding or navigation in document collections. Using a document representation and an unsupervised classification method, we may group documents that, from the user's point of view, constitute valid clusters. The semantic gap between a domain-independent document representation and the user's implicit representation can lead to unsatisfactory results.
Distributed mining of molecular fragments
 Proc. of IEEE DMGrid, Workshop on Data Mining and Grid of IEEE ICDM
, 2004
Cited by 11 (3 self)
In real-world applications, sequential algorithms of data mining and data exploration are often unsuitable for datasets with enormous size, high dimensionality and complex data structure. Grid computing promises unprecedented opportunities for unlimited computing and storage resources. In this context there is the necessity to develop high performance distributed data mining algorithms. However, the computational complexity of the problem and the large amount of data to be explored often make the design of large scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well-known National Cancer Institute’s HIV-screening dataset. We present experimental results on a small-scale computing environment.
Advanced Pruning Strategies to Speed Up Mining Closed Molecular Fragments
 Proc. IEEE Conf. on Systems, Man and Cybernetics (SMC 2004, The Hague)
Cited by 9 (7 self)
In recent years several algorithms for mining frequent subgraphs in graph databases have been proposed, with a major application area being the discovery of frequent substructures of biomolecules. Unfortunately, most of these algorithms still struggle with fairly long execution times if larger substructures or molecular fragments are desired. In this paper we describe two advanced pruning strategies (equivalent sibling pruning and perfect extension pruning) that can be used to speed up the MoFa algorithm (introduced in [2]) in the search for closed molecular fragments, as we demonstrate with experiments on the NCI’s HIV database.