Results 1 - 10 of 130
PEGASUS: A PetaScale Graph Mining System Implementation and Observations
 IEEE INTERNATIONAL CONFERENCE ON DATA MINING
, 2009
Cited by 128 (26 self)
In this paper, we describe PEGASUS, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node and finding the connected components. As the size of graphs reaches several Giga-, Tera- or Petabytes, the necessity for such a library grows too. To the best of our knowledge, PEGASUS is the first such library, implemented on top of the HADOOP platform, the open source version of MAPREDUCE. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components, etc.) are essentially a repeated matrix-vector multiplication. In this paper we describe a very important primitive for PEGASUS, called GIM-V (Generalized Iterated Matrix-Vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up on the number of available machines, (b) linear running time on the number of edges, and (c) more than 5 times faster performance over the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web Graphs, thanks to Yahoo!, with ≈ 6.7 billion edges. Keywords: PEGASUS; graph mining; hadoop
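The claim that many graph mining operations are a repeated matrix-vector multiplication can be illustrated with a small single-machine sketch. This is not the PEGASUS implementation on HADOOP. The names `gim_v`, `combine2`, `combine_all` and `assign`, and the edge-list representation, are assumptions chosen for this example. Replacing "multiply" by "take the neighbour's label" and "sum" by `min` turns the multiplication into connected-component labelling.

```python
from typing import Callable, List, Tuple

# One nonzero matrix entry: (row i, column j, value m_ij).
Entry = Tuple[int, int, float]


def gim_v(entries: List[Entry], vector: list,
          combine2: Callable, combine_all: Callable, assign: Callable) -> list:
    """One generalized matrix-vector step (illustrative sketch).

    For every row i, combine2 is applied to each (m_ij, v[j]) pair,
    combine_all merges the partial results of the row, and assign
    decides the new v[i] from the old value and the merged result.
    Rows without nonzero entries keep their old value, which is a
    simplification made by this sketch.
    """
    partial = [[] for _ in vector]
    for i, j, m in entries:
        partial[i].append(combine2(m, vector[j]))
    return [
        assign(vector[i], combine_all(partial[i])) if partial[i] else vector[i]
        for i in range(len(vector))
    ]


def matvec(entries: List[Entry], vector: List[float]) -> List[float]:
    """Ordinary matrix-vector product expressed with the three operations."""
    return gim_v(entries, vector,
                 combine2=lambda m, vj: m * vj,
                 combine_all=sum,
                 assign=lambda old, new: new)


def connected_components(n: int, undirected_edges: List[Tuple[int, int]]) -> List[int]:
    """Label every node with the smallest node id in its component."""
    # Store each undirected edge in both directions (symmetric matrix).
    entries = [(i, j, 1) for i, j in undirected_edges]
    entries += [(j, i, 1) for i, j in undirected_edges]
    labels = list(range(n))
    while True:
        new_labels = gim_v(entries, labels,
                           combine2=lambda m, vj: vj,  # pass the neighbour's label
                           combine_all=min,            # smallest neighbour label
                           assign=min)                 # keep the smaller of old and new
        if new_labels == labels:  # fixed point reached
            return labels
        labels = new_labels


components = connected_components(5, [(0, 1), (1, 2), (3, 4)])
product = matvec([(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)], [1.0, 2.0])
```

The loop stops when a full step changes no label, so the number of iterations is bounded by the longest shortest path inside a component.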
Spin: Mining maximal frequent subgraphs from graph databases
 IN KDD
, 2004
Cited by 99 (12 self)
One fundamental challenge for mining recurring subgraphs from semi-structured data sets is the overwhelming abundance of such patterns. In large graph databases, the total number of frequent subgraphs can become too large to allow a full enumeration using reasonable computational resources. In this paper, we propose a new algorithm that mines only maximal frequent subgraphs, i.e. subgraphs that are not a part of any other frequent subgraph. This may exponentially decrease the size of the output set in the best case; in our experiments on practical data sets, mining maximal frequent subgraphs reduces the total number of mined patterns by two to three orders of magnitude. Our method first mines all frequent trees from a general graph database and then reconstructs all maximal subgraphs from the mined trees. Using two chemical structure benchmarks and a set of synthetic graph data sets, we demonstrate that, in addition to decreasing the output size, our algorithm can achieve a fivefold speedup over the current state-of-the-art subgraph mining algorithms.
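The definition of a maximal frequent subgraph given above can be made concrete with a small sketch. It is not the tree-based algorithm of the paper: it assumes the frequent patterns are already known, and it assumes uniquely labelled vertices so that "is a subgraph of" reduces to a subset test on edge sets (in general this requires a subgraph isomorphism check). The name `maximal_patterns` is hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Edge = Tuple[str, str]       # an edge between two uniquely labelled vertices
Pattern = FrozenSet[Edge]    # a pattern is represented by its edge set


def maximal_patterns(frequent: Iterable[Pattern]) -> Set[Pattern]:
    """Keep only patterns that are not a proper part of another frequent pattern.

    Assumption of this sketch: vertex labels are unique, so containment
    is a proper-subset test (p < q) on the edge sets.
    """
    patterns = set(frequent)
    return {p for p in patterns if not any(p < q for q in patterns)}


ab = frozenset({("a", "b")})
abc = frozenset({("a", "b"), ("b", "c")})
cd = frozenset({("c", "d")})
result = maximal_patterns([ab, abc, cd])
```

Here `ab` is dropped because it is contained in `abc`, which shows how reporting only maximal patterns shrinks the output.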
Fast best-effort pattern matching in large attributed graphs
 In KDD
, 2007
Cited by 53 (14 self)
We focus on large graphs where nodes have attributes, such as a social network where the nodes are labelled with each person’s job title. In such a setting, we want to find subgraphs that match a user query pattern. For example, a ‘star’ query would be, “find a CEO who has strong interactions with a Manager, a Lawyer, and an Accountant, or another structure as close to that as possible”. Similarly, a ‘loop’ query could help spot a money laundering ring. Traditional SQL-based methods, as well as more recent graph indexing methods, will return no answer when an exact match does not exist. Our method can find exact, as well as near-matches, and it will present them to the user in our proposed ‘goodness’ order. For example, our method tolerates indirect paths between, say, the ‘CEO’ and the ‘Accountant’ of the above sample query, when direct paths do not exist. Its second feature is scalability. In general, if the query has nq nodes and the data graph has n nodes, the problem needs polynomial time complexity O(n^nq), which is prohibitive. Our G-Ray (“Graph X-Ray”) method finds high-quality subgraphs in time linear on the size of the data graph. Experimental results on the DBLP author-publication graph (with 356K nodes and 1.9M edges) illustrate both the effectiveness and scalability of our approach. The results agree with our intuition, and the speed is excellent. It takes 4 seconds on average for a 4-node query on the DBLP graph.
K-Automorphism: A general framework for privacy preserving network publication
 In VLDB
, 2009
Cited by 50 (1 self)
The growing popularity of social networks has generated interesting data management and data mining problems. An important concern in the release of these data for study is their privacy, since social networks usually contain personal information. Simply removing all identifiable personal information (such as names and social security numbers) before releasing the data is insufficient. It is easy for an attacker to identify the target by performing different structural queries. In this paper we propose k-automorphism to protect against multiple structural attacks and develop an algorithm (called KM) that ensures k-automorphism. We also discuss an extension of KM to handle “dynamic” releases of the data. Extensive experiments show that the algorithm performs well in terms of the protection it provides.
What is Frequent in a Single Graph
 University of Florence, Italy
Cited by 42 (0 self)
Pattern mining has been studied in different types of data, starting from itemsets up to highly structured data such as relational data or hypergraphs. Usually the setting is such that a multiset of these structures is given and the aim is to find patterns that can be mapped onto at least a minimum number of
Discovering Frequent Geometric Subgraphs
 In IEEE Intl. Conference on Data Mining ’02
, 2002
Cited by 38 (1 self)
As data mining techniques are being increasingly applied to non-traditional domains, existing approaches for finding frequent itemsets cannot be used as they cannot model the requirements of these domains. An alternate way of modeling the objects in these data sets is to use a graph to model the database objects. Within that model, the problem of finding frequent patterns becomes that of discovering subgraphs that occur frequently over the entire set of graphs. In this paper we present a computationally efficient algorithm for finding frequent geometric subgraphs in a large collection of geometric graphs. Our algorithm is able to discover geometric subgraphs that can be rotation, scaling and translation invariant, and it can accommodate inherent errors on the coordinates of the vertices. We evaluated the performance of the algorithm using a large database of over 20,000 real two-dimensional chemical structures, and our experimental results show that our algorithm requires relatively little time, can accommodate low support values, and scales linearly on the number of transactions.
Mining graph evolution rules
 In: ECML/PKDD
, 2009
Cited by 37 (4 self)
In this paper we introduce graph-evolution rules, a novel type of frequency-based pattern that describes the evolution of large networks over time, at a local level. Given a sequence of snapshots of an evolving graph, we aim at discovering rules describing the local changes occurring in it. Adopting a definition of support based on minimum image, we study the problem of extracting patterns whose frequency is larger than a minimum support threshold. Then, similar to the classical association rules framework, we derive graph-evolution rules from frequent patterns that satisfy a given minimum confidence constraint. We discuss merits and limits of alternative definitions of support and confidence, justifying the chosen framework. To evaluate our approach we devise GERM (Graph Evolution Rule Miner), an algorithm to mine all graph-evolution rules whose support and confidence are greater than given thresholds. The algorithm is applied to analyze four large real-world networks (i.e., two social networks, and two co-authorship networks from bibliographic data), using different time granularities. Our extensive experimentation confirms the feasibility and utility of the presented approach. It further shows that different kinds of networks exhibit different evolution rules, suggesting the usage of these local patterns to globally discriminate different kinds of networks.
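The step from frequent patterns to rules can be sketched in the style of the classical association rules framework the abstract refers to. This is an illustration, not GERM itself. It assumes that pattern supports have already been computed, that a candidate rule is a (body, head) pair of pattern identifiers, that confidence is the ratio support(head) / support(body), and that thresholds are inclusive. The function name `derive_rules` and the string identifiers are hypothetical.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str, float]  # (body id, head id, confidence)


def derive_rules(support: Dict[str, int],
                 candidates: List[Tuple[str, str]],
                 min_support: int,
                 min_confidence: float) -> List[Rule]:
    """Keep candidate rules whose head is frequent and whose confidence is high enough.

    Assumptions of this sketch: confidence = support(head) / support(body),
    and both thresholds are inclusive.
    """
    rules = []
    for body, head in candidates:
        s_body = support.get(body, 0)
        s_head = support.get(head, 0)
        if s_head < min_support or s_body == 0:
            continue  # head is not frequent, or the ratio is undefined
        confidence = s_head / s_body
        if confidence >= min_confidence:
            rules.append((body, head, confidence))
    return rules


# Toy supports: pattern "A" occurs 10 times, its extensions "AB" and "AC" less often.
toy_support = {"A": 10, "AB": 6, "AC": 2}
toy_rules = derive_rules(toy_support, [("A", "AB"), ("A", "AC")],
                         min_support=3, min_confidence=0.5)
```

In this toy input only the rule from "A" to "AB" survives, because "AC" falls below the support threshold.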
Pattern mining in frequent dynamic subgraphs
 IN ICDM
, 2006
Cited by 36 (2 self)
Graph-structured data is becoming increasingly abundant in many application domains. Graph mining aims at finding interesting patterns within this data that represent novel knowledge. While current data mining deals with static graphs that do not change over time, coming years will see the advent of an increasing number of time series of graphs. In this article, we investigate how pattern mining on static graphs can be extended to time series of graphs. In particular, we are considering dynamic graphs with edge insertions and edge deletions over time. We define frequency in this setting and provide algorithmic solutions for finding frequent dynamic subgraph patterns. Existing subgraph mining algorithms can be easily integrated into our framework to make them handle dynamic graphs. Experimental results on real-world data confirm the practical feasibility of our approach.
Complete and accurate clone detection in graph-based models
 in 31st Int. Conf. on Softw. Eng., 2009
Cited by 32 (1 self)
Model-Driven Engineering (MDE) has become an important development framework for many large-scale software systems. Previous research has reported that, as in traditional code-based development, cloning also occurs in MDE. However, there has been little work on clone detection in models with the limitations on detection precision and completeness. This paper presents ModelCD, a novel clone detection tool for Matlab/Simulink models that is able to efficiently and accurately detect both exactly matched and approximate model clones. The core of ModelCD is two novel graph-based clone detection algorithms that are able to systematically and incrementally discover clones with a high degree of completeness, accuracy, and scalability. We have conducted an empirical evaluation with various experimental studies on many real-world systems to demonstrate the usefulness of our approach and to compare the performance of ModelCD with existing tools.