RDF-3X: a RISC-style engine for RDF
Proc. VLDB Endowment, 2008
Cited by 149 (11 self)
RDF is a data representation format for schema-free structured information that is gaining momentum in the context of Semantic-Web corpora, life sciences, and also Web 2.0 platforms. The "pay-as-you-go" nature of RDF and the flexible pattern-matching capabilities of its query language SPARQL entail efficiency and scalability challenges for complex queries including long join paths. This paper presents the RDF-3X engine, an implementation of SPARQL that achieves excellent performance by pursuing a streamlined RISC-style architecture with carefully designed, puristic data structures and operations. The salient points of RDF-3X are: 1) a generic solution for storing and indexing RDF triples that completely eliminates the need for physical-design tuning, 2) a powerful yet simple query processor that leverages fast merge joins to the largest possible extent, and 3) a query optimizer for choosing optimal join orders using a cost model based on statistical synopses for entire join paths. The performance of RDF-3X, in comparison to the previously best state-of-the-art systems, has been measured on several large-scale datasets with more than 50 million RDF triples and benchmark queries that include pattern matching and long join paths in the underlying data graphs.
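The abstract above attributes much of the engine's speed to merge joins over sorted data. As a rough illustration only (this is not RDF-3X code), the following sketch joins two inputs that are assumed to be already sorted on the join key. The function name `merge_join` and the `(key, payload)` tuple layout are hypothetical choices made for this example.

```python
from typing import Any, List, Tuple

Row = Tuple[Any, Any]  # (join key, payload); layout assumed for illustration


def merge_join(left: List[Row], right: List[Row]) -> List[Tuple[Any, Any, Any]]:
    """Join two lists of (key, payload) rows that are both sorted by key."""
    out: List[Tuple[Any, Any, Any]] = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1  # left key too small: advance the left cursor
        elif lk > rk:
            j += 1  # right key too small: advance the right cursor
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lp in left[i:i_end]:
                for _, rp in right[j:j_end]:
                    out.append((lk, lp, rp))
            i, j = i_end, j_end
    return out
```

Because both cursors only move forward, each input is scanned once, apart from the cross product emitted for duplicate keys.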
PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations
IEEE International Conference on Data Mining, 2009
Cited by 128 (26 self)
In this paper, we describe PEGASUS, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node and finding the connected components. As the size of graphs reaches several giga-, tera- or petabytes, the necessity for such a library grows too. To the best of our knowledge, PEGASUS is the first such library, implemented on top of the HADOOP platform, the open source version of MAPREDUCE. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components etc.) are essentially a repeated matrix-vector multiplication. In this paper we describe a very important primitive for PEGASUS, called GIM-V (Generalized Iterated Matrix-Vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up on the number of available machines, (b) linear running time on the number of edges, and (c) more than 5 times faster performance over the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web Graphs, thanks to Yahoo!, with ≈ 6.7 billion edges. Keywords: PEGASUS; graph mining; Hadoop
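The abstract above says that connected components and similar tasks reduce to a repeated matrix-vector multiplication, but it does not say how the multiplication is generalized. The sketch below assumes one common formulation in which three user-supplied operations replace "multiply", "sum" and "write back". The names `combine2`, `combine_all` and `assign` are illustrative assumptions, and this single-machine loop does not reflect the library's Hadoop implementation.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[int, int, float]  # (row i, column j, matrix entry m_ij)


def gim_v_step(edges: List[Edge], vector: Dict[int, float],
               combine2: Callable[[float, float], float],
               combine_all: Callable[[List[float]], float],
               assign: Callable[[float, float], float]) -> Dict[int, float]:
    """One generalized matrix-vector step over a sparse edge list."""
    partial: Dict[int, List[float]] = {i: [] for i in vector}
    for i, j, m in edges:
        partial[i].append(combine2(m, vector[j]))  # generalized "m_ij * v_j"
    # Rows without entries keep their old value (an assumption of this sketch).
    return {i: assign(vector[i], combine_all(p)) if p else vector[i]
            for i, p in partial.items()}


def connected_components(nodes: List[int],
                         undirected_edges: List[Tuple[int, int]]) -> Dict[int, int]:
    """Label each node with the smallest node id reachable from it."""
    # Store both directions so labels can flow either way along an edge.
    edges = [(a, b, 1.0) for a, b in undirected_edges]
    edges += [(b, a, 1.0) for a, b in undirected_edges]
    labels: Dict[int, float] = {v: float(v) for v in nodes}
    while True:
        new = gim_v_step(edges, labels,
                         combine2=lambda m, vj: m * vj,  # pass neighbor label
                         combine_all=min,                # smallest neighbor label
                         assign=min)                     # keep the smaller label
        if new == labels:  # fixed point reached: no label changed
            return {v: int(x) for v, x in new.items()}
        labels = new
```

Swapping the three operations (for example a damped sum instead of `min`) would turn the same loop into a PageRank-style iteration, which is the sense in which one primitive can cover several mining tasks.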
Mining graph evolution rules
In: ECML/PKDD, 2009
Cited by 37 (4 self)
In this paper we introduce graph-evolution rules, a novel type of frequency-based pattern that describes the evolution of large networks over time, at a local level. Given a sequence of snapshots of an evolving graph, we aim at discovering rules describing the local changes occurring in it. Adopting a definition of support based on minimum image we study the problem of extracting patterns whose frequency is larger than a minimum support threshold. Then, similar to the classical association rules framework, we derive graph-evolution rules from frequent patterns that satisfy a given minimum confidence constraint. We discuss merits and limits of alternative definitions of support and confidence, justifying the chosen framework. To evaluate our approach we devise GERM (Graph Evolution Rule Miner), an algorithm to mine all graph-evolution rules whose support and confidence are greater than given thresholds. The algorithm is applied to analyze four large real-world networks (i.e., two social networks, and two co-authorship networks from bibliographic data), using different time granularities. Our extensive experimentation confirms the feasibility and utility of the presented approach. It further shows that different kinds of networks exhibit different evolution rules, suggesting the usage of these local patterns to globally discriminate different kinds of networks.
Patterns on the Connected Components of Terabyte-Scale Graphs
Cited by 10 (6 self)
How do connected components evolve? What are the regularities that govern the dynamic growth process and the static snapshot of the connected components? In this work, we study patterns in connected components of large, real-world graphs. First, we study one of the largest static Web graphs with billions of nodes and edges and analyze the regularities among the connected components using GFD (Graph Fractal Dimension) as our main tool. Second, we study several time-evolving graphs and find dynamic patterns and rules that govern the dynamics of connected components. We analyze the growth rates of top connected components and study their relation over time. We also study the probability that a newcomer absorbs to disconnected components as a function of the current portion of the disconnected components and the degree of the newcomer. Finally, we propose a generative model that explains both the dynamic growth process and the static regularities of connected components.
Graph Classification Based on Pattern Co-occurrence
Cited by 7 (2 self)
Subgraph patterns are widely used in graph classification, but their effectiveness is often hampered by the large number of patterns or lack of discrimination power among individual patterns. We introduce a novel classification method based on pattern co-occurrence to derive graph classification rules. Our method employs a pattern exploration order such that the complementary discriminative patterns are examined first. Patterns are grouped into co-occurrence rules during the pattern exploration, leading to an integrated process of pattern mining and classifier learning. By taking advantage of co-occurrence information, our method can generate strong features by assembling weak features. Unlike previous methods that invoke the pattern mining process repeatedly, our method only performs pattern mining once. In addition, our method produces a more interpretable classifier and shows better or competitive classification effectiveness in terms of accuracy and execution time.
GRAMI: Frequent Subgraph and Pattern Mining in a Single Large Graph
Cited by 7 (0 self)
Mining frequent subgraphs is an important operation on graphs; it is defined as finding all subgraphs that appear frequently in a database according to a given frequency threshold. Most existing work assumes a database of many small graphs, but modern applications, such as social networks, citation graphs, or protein-protein interactions in bioinformatics, are modeled as a single large graph. In this paper we present GRAMI, a novel framework for frequent subgraph mining in a single large graph. GRAMI undertakes a novel approach that only finds the minimal set of instances to satisfy the frequency threshold and avoids the costly enumeration of all instances required by previous approaches. We accompany our approach with a heuristic and optimizations that significantly improve performance. Additionally, we present an extension of GRAMI that mines frequent patterns. Compared to subgraphs, patterns offer a more powerful version of matching that captures transitive interactions between graph nodes (like friend of a friend) which are very common in modern applications. Finally, we present CGRAMI, a version supporting structural and semantic constraints, and AGRAMI, an approximate version producing results with no false positives. Our experiments on real data demonstrate that our framework is up to 2 orders of magnitude faster and discovers more interesting patterns than existing approaches.
The RDF-3X engine for scalable . . .
2009
Cited by 2 (0 self)
RDF is a data model for schema-free structured information that is gaining momentum in the context of Semantic-Web data, life sciences, and also Web 2.0 platforms. The “pay-as-you-go” nature of RDF and the flexible pattern-matching capabilities of its query language SPARQL entail efficiency and scalability challenges for complex queries including long join paths. This paper presents the RDF-3X engine, an implementation of SPARQL that achieves excellent performance by pursuing a RISC-style architecture with streamlined indexing and query processing. The physical design is identical for all RDF-3X databases regardless of their workloads, and completely eliminates the need for index tuning by exhaustive indexes for all permutations of subject-property-object triples and their binary and unary projections. These indexes are highly compressed, and the query processor can aggressively leverage fast merge joins with excellent performance of processor caches. The query optimizer is able to choose
On the Usefulness of Weight-Based Constraints in Frequent Subgraph Mining
Cited by 2 (0 self)
Frequent subgraph mining is an important data-mining technique. In this paper we look at weighted graphs, which are ubiquitous in the real world. The analysis of weights in combination with mining for substructures might yield more precise results. In particular, we study frequent subgraph mining in the presence of weight-based constraints and explain how to integrate them into mining algorithms. While such constraints only yield approximate mining results in most cases, we demonstrate that such results are useful nevertheless and explain this effect. To do so, we both assess the completeness of the approximate result sets, and we carry out application-oriented studies with real-world data-analysis problems: software-defect localization, weighted graph classification and explorative mining in logistics. Our results are that the runtime can improve by a factor of up to 3.5 in defect localization and classification and 7 in explorative mining. At the same time, we obtain an even slightly increased defect-localization precision, stable classification precision and obtain good explorative mining results.
MOSubdue: A Pareto Dominance-based Multiobjective Subdue Algorithm for Frequent Subgraph Mining
Knowledge and Information Systems
Cited by 1 (0 self)
Graph-based data mining approaches have been mainly proposed for the task popularly known as frequent subgraph mining subject to a single user preference, like frequency, size, etc. In this work, we propose to deal with the frequent subgraph mining problem from a multiobjective optimization viewpoint, where a subgraph (or solution) is defined by several user-defined preferences (or objectives), which are conflicting in nature. For example, mined subgraphs with high frequency are often of small size, and vice versa. Use of such objectives in the multiobjective subgraph mining process generates Pareto-optimal subgraphs, where no subgraph is better than another subgraph in all objectives. We have applied a Pareto-dominance approach to evaluate and search subgraphs with regard to both proximity and diversity in the multiobjective sense, which has been incorporated into the framework of the Subdue algorithm for subgraph mining. The method is called Multi-Objective subgraph mining by Subdue (MOSubdue), and has several advantages: i) generation of Pareto-optimal subgraphs in a single run, ii) selection of subgraph seeds from the candidate subgraphs based on all objectives, iii) search in the multiobjective subgraphs lattice space, and iv) capability to deal with different multiobjective frequent subgraph mining tasks by customizing the tackled objectives. The good performance of MOSubdue is shown by performing multiobjective subgraph mining defined by two and three objectives on two real-life datasets.
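The Pareto-optimality notion used in the abstract above can be made concrete with a small filter. This is a minimal sketch of the standard dominance test, not the MOSubdue search itself. It assumes every objective is to be maximized and that each candidate subgraph is summarized as a tuple of objective values, here a hypothetical `(frequency, size)` pair.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(candidates: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep the candidates that no other candidate dominates, in input order."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]
```

With candidates `(10, 2)`, `(8, 3)`, `(5, 5)`, `(4, 4)` and `(10, 1)`, the filter keeps the first three: `(4, 4)` is dominated by `(5, 5)` and `(10, 1)` by `(10, 2)`, which mirrors the frequency-versus-size trade-off the abstract describes.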