Graphs-at-a-time: Query Language and Access Methods for Graph Databases
2008
Cited by 70 (0 self)
With the prevalence of graph data in a variety of domains, there is an increasing need for a language to query and manipulate graphs with heterogeneous attributes and structures. We propose a query language for graph databases that supports arbitrary attributes on nodes, edges, and graphs. In this language, graphs are the basic unit of information and each query manipulates one or more collections of graphs. To allow for flexible compositions of graph structures, we extend the notion of formal languages from strings to the graph domain. We present a graph algebra extended from the relational algebra in which the selection operator is generalized to graph pattern matching and a composition operator is introduced for rewriting matched graphs. Then, we investigate access methods of the selection operator. Pattern matching over large graphs is challenging due to the NP-completeness of subgraph isomorphism. We address this by a combination of techniques: use of neighborhood subgraphs and profiles, joint reduction of the search space, and optimization of the search order. Experimental results on real and synthetic large graphs demonstrate that our graph specific optimizations outperform an SQL-based implementation by orders of magnitude.
Large-Scale Malware Indexing Using Function-Call Graphs
Cited by 55 (0 self)
A major challenge of the anti-virus (AV) industry is how to effectively process the huge influx of malware samples they receive every day. One possible solution to this problem is to quickly determine if a new malware sample is similar to any previously-seen malware program. In this paper, we design, implement and evaluate a malware database management system called SMIT (Symantec Malware Indexing Tree) that can efficiently make such determination based on malware’s function-call graphs, which is a structural representation known to be less susceptible to instruction-level obfuscations commonly employed by malware writers to evade detection of AV software. Because each malware program is represented as a graph, the problem of searching for the most similar malware program in a database to a given malware sample is cast into a nearest-neighbor search problem in a graph database. To speed
Taming Verification Hardness: An Efficient Algorithm for Testing Subgraph Isomorphism
Cited by 50 (9 self)
Graphs are widely used to model complicated data semantics in many applications. In this paper, we aim to develop efficient techniques to retrieve graphs containing a given query graph from a large set of graphs. Because testing subgraph isomorphism is generally NP-hard, most existing techniques follow the filtering-and-verification framework to reduce the cost of exact computation; consequently, various novel feature-based indexes have been developed. While the existing techniques work well for small query graphs, the verification phase becomes a bottleneck as the query graph size increases. Motivated by this, we first propose a novel and efficient algorithm for testing subgraph isomorphism, QuickSI. Second, we develop a new feature-based index technique to accommodate QuickSI in the filtering phase. Our extensive experiments on real and synthetic data demonstrate the efficiency and scalability of the proposed techniques, which significantly improve on the existing techniques.
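The filtering-and-verification framework mentioned in this abstract can be illustrated with a minimal Python sketch. This is not QuickSI or its index: the filter is a simple label-count check and the verifier is a plain backtracking search, both chosen here only for illustration. The names `label_filter`, `is_subgraph`, and `search`, and the graph encoding (a dict of vertex labels plus a list of undirected edges), are assumptions of this sketch.

```python
from collections import Counter

def label_filter(q_labels, g_labels):
    """Cheap necessary condition: each query label occurs at least as often in the data graph."""
    need = Counter(q_labels.values())
    have = Counter(g_labels.values())
    return all(have[lab] >= n for lab, n in need.items())

def build_adj(labels, edges):
    """Adjacency sets for an undirected graph."""
    adj = {v: set() for v in labels}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def is_subgraph(q_labels, q_edges, g_labels, g_edges):
    """Backtracking test for (non-induced) subgraph isomorphism with vertex labels."""
    q_adj = build_adj(q_labels, q_edges)
    g_adj = build_adj(g_labels, g_edges)
    order = list(q_labels)

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        u = order[len(mapping)]
        used = set(mapping.values())
        for v in g_labels:
            if v in used or g_labels[v] != q_labels[u]:
                continue
            # Every already-mapped query neighbor of u must map to a neighbor of v.
            if all(mapping[w] in g_adj[v] for w in q_adj[u] if w in mapping):
                mapping[u] = v
                if extend(mapping):
                    return True
                del mapping[u]
        return False

    return extend({})

def search(query, database):
    """Filter first, then verify only the surviving candidates."""
    q_labels, q_edges = query
    candidates = [name for name, (g_labels, _) in database.items()
                  if label_filter(q_labels, g_labels)]
    return [name for name in candidates
            if is_subgraph(q_labels, q_edges, *database[name])]

database = {
    "g1": ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3), (1, 3)]),
    "g2": ({1: "A", 2: "B", 3: "C"}, [(1, 2)]),   # passes the filter, fails verification
    "g3": ({1: "A", 2: "A"}, [(1, 2)]),           # removed by the filter
}
query = ({"x": "A", "y": "B", "z": "C"}, [("x", "y"), ("y", "z")])
print(search(query, database))
```

The filter never discards a true answer, because a graph that contains the query must contain all of its labels; the expensive backtracking step then runs only on the graphs that survive.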
A novel spectral coding in a large graph database
In Proceedings of the International Conference on Extending Database Technology, 2008
Cited by 34 (2 self)
Retrieving related graphs containing a query graph from a large graph database is a key issue in many graph-based applications, such as drug discovery and structural pattern recognition. Because sub-graph isomorphism is an NP-complete problem [4], we have to employ a filter-and-verification framework to speed up the search: an effective and efficient pruning strategy first filters out as many false positives (graphs that cannot be in the results) as possible, and the remaining candidates are then validated by subgraph isomorphism checking. In this paper, we propose a novel filtering method based on spectral encoding, GCoding. Specifically, we assign a signature to each vertex based on its local structures. Then, we generate a spectral graph code by combining all vertex signatures in a graph. Based on spectral graph codes, we derive a necessary condition for sub-graph isomorphism. We then propose two pruning rules for the sub-graph search problem and prove that they satisfy the no-false-negative requirement (no dismissal in answers). Since graph codes lie in a numerical space, we take advantage of this to conduct efficient filtering over graph codes. Extensive experiments show that GCoding outperforms existing counterpart methods.
On Graph Query Optimization in Large Networks
Cited by 34 (3 self)
The dramatic proliferation of sophisticated networks has resulted in a growing need for supporting effective querying and mining methods over such large-scale graph-structured data. At the core of many advanced network operations lies a common and critical graph query primitive: how to search graph structures efficiently within a large network? Unfortunately, the graph query is hard due to the NP-complete nature of subgraph isomorphism. It becomes even more challenging when the network examined is large and diverse. In this paper, we present a high performance graph indexing mechanism, SPath, to address the graph query problem on large networks. SPath leverages decomposed shortest paths around vertex neighborhood as basic indexing units, which prove to be both effective in graph search space pruning and highly scalable in index construction and deployment. Via SPath, a graph query is processed and optimized beyond the traditional vertex-at-a-time fashion to a more efficient path-at-a-time way: the query is first decomposed to a set of shortest paths, among which a subset of candidates with good selectivity is picked by a query plan optimizer; candidate paths are further joined together to help recover the query graph to finalize the graph query processing. We evaluate SPath with the state-of-the-art GraphQL on both real and synthetic data sets. Our experimental studies demonstrate the effectiveness and scalability of SPath, which proves to be a more practical and efficient indexing method in addressing graph queries on large networks.
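The idea of pruning the search space with shortest-path information around each vertex can be sketched as follows. This is a simplified stand-in, not SPath's actual index or query planner: for each vertex it counts the labels reachable at each shortest-path distance up to a radius, then keeps a network vertex as a candidate for a query vertex only if the cumulative label counts of the network vertex cover those of the query vertex. The function names, the `radius` parameter, and the adjacency-dict encoding are assumptions of this sketch.

```python
from collections import Counter, deque

def neighborhood_signature(adj, labels, source, radius):
    """Label counts at each shortest-path distance 1..radius from source (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    sig = [Counter() for _ in range(radius)]
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sig[dist[v] - 1][labels[v]] += 1
                queue.append(v)
    return sig

def may_match(q_sig, g_sig):
    """Necessary condition: within every distance d, the network vertex sees
    at least as many vertices of each label as the query vertex does."""
    q_cum, g_cum = Counter(), Counter()
    for q_level, g_level in zip(q_sig, g_sig):
        q_cum += q_level
        g_cum += g_level
        if any(g_cum[lab] < n for lab, n in q_cum.items()):
            return False
    return True

def candidates(q_adj, q_labels, g_adj, g_labels, radius=2):
    """Candidate network vertices for each query vertex after signature pruning."""
    result = {}
    for u in q_adj:
        q_sig = neighborhood_signature(q_adj, q_labels, u, radius)
        result[u] = [
            v for v in g_adj
            if g_labels[v] == q_labels[u]
            and may_match(q_sig, neighborhood_signature(g_adj, g_labels, v, radius))
        ]
    return result

# Network: path 1(A)-2(B)-3(C) plus a separate edge 4(A)-5(B).
g_adj = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
g_labels = {1: "A", 2: "B", 3: "C", 4: "A", 5: "B"}
# Query: path a(A)-b(B)-c(C).
q_adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
q_labels = {"a": "A", "b": "B", "c": "C"}
print(candidates(q_adj, q_labels, g_adj, g_labels))
```

Cumulative counts are used because a query vertex at distance d from its anchor may map to a network vertex that is closer than d, but never farther, so the condition cannot discard a true match.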
Comparing Stars: On Approximating Graph Edit Distance
2009
Cited by 29 (0 self)
Graph data have become ubiquitous and manipulating them based on similarity is essential for many applications. Graph edit distance is one of the most widely accepted measures to determine similarities between graphs and has extensive applications in fields such as pattern recognition and computer vision. Unfortunately, the problem of graph edit distance computation is NP-hard in general. Accordingly, in this paper we introduce three novel methods to compute the upper and lower bounds for the edit distance between two graphs in polynomial time. Applying these methods, two algorithms, AppFull and AppSub, are introduced to perform different kinds of graph search on graph databases. Comprehensive experimental studies are conducted on both real and synthetic datasets to examine various aspects of the methods for bounding graph edit distance. Results show that these methods achieve good scalability in terms of both the number of graphs and the size of graphs. The effectiveness of these algorithms also confirms the usefulness of using our bounds in filtering and searching of graphs.
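To show how a polynomial-time lower bound on graph edit distance can be used for filtering, here is a minimal Python sketch. It is not one of the three star-based methods of the paper: it uses a much cruder bound that compares label multisets and edge counts. It assumes unit-cost operations (insert or delete an isolated vertex, relabel a vertex, insert or delete an edge); the function names and the threshold `tau` are assumptions of this sketch.

```python
from collections import Counter

def ged_lower_bound(labels1, edges1, labels2, edges2):
    """Crude lower bound on unit-cost graph edit distance.

    Vertex operations: turning one label multiset into the other needs at least
    max(|V1|, |V2|) minus the size of their multiset intersection.
    Edge operations: at least the difference in edge counts.
    """
    c1, c2 = Counter(labels1.values()), Counter(labels2.values())
    common = sum((c1 & c2).values())
    vertex_ops = max(len(labels1), len(labels2)) - common
    edge_ops = abs(len(edges1) - len(edges2))
    return vertex_ops + edge_ops

def filter_by_threshold(query, database, tau):
    """Keep only graphs whose lower bound does not already exceed tau."""
    q_labels, q_edges = query
    return [name for name, (g_labels, g_edges) in database.items()
            if ged_lower_bound(q_labels, q_edges, g_labels, g_edges) <= tau]

query = ({1: "A", 2: "B"}, [(1, 2)])
database = {
    "same": ({1: "A", 2: "B"}, [(1, 2)]),
    "near": ({1: "A", 2: "C", 3: "B"}, [(1, 2), (2, 3)]),
    "far": ({1: "X", 2: "Y", 3: "Z", 4: "W"}, [(1, 2), (2, 3), (3, 4), (1, 4)]),
}
print(filter_by_threshold(query, database, tau=2))
```

A graph whose lower bound exceeds the threshold cannot be within that edit distance of the query, so it can be discarded without running an exact computation; tighter bounds simply discard more.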
Neighborhood based fast graph search in large networks
In SIGMOD, 2011
Cited by 26 (1 self)
Complex social and information network search becomes important with a variety of applications. In the core of these applications, lies a common and critical problem: Given a labeled network and a query graph, how to efficiently search the query graph in the target network. The presence of noise and the incomplete knowledge about the structure and content of the target network make it unrealistic to find an exact match. Rather, it is more appealing to find the top-k approximate matches. In this paper, we propose a neighborhood-based similarity measure that could avoid costly graph isomorphism and edit distance computation. Under this new measure, we prove that subgraph similarity search is NP hard, while graph similarity match is polynomial. By studying the principles behind this measure, we found an information propagation model that is able to convert a large net-
GADDI: Distance index based subgraph matching in biological networks
In Proceedings of the 12th International Conference on Extending Database Technology (EDBT ’09), 2009
Cited by 25 (2 self)
Currently, a huge amount of biological data can be naturally represented by graphs, e.g., protein interaction networks, gene regulatory networks, etc. The need for indexing large graphs is an urgent research problem of great practical importance. The main challenge is size. Each graph may contain thousands (or more) vertices. Most of the previous work focuses on indexing a set of small or medium sized database graphs (with only tens of vertices) and finding whether a query graph occurs in any of these. In this paper, we are interested in finding all the matches of a query graph in a given large graph of thousands of vertices, which is a very important task in many biological applications. This increases the complexity significantly. We propose a novel distance measurement which reintroduces the idea of frequent substructures in a single large graph. We devise the novel structure distance based approach (GADDI) to efficiently find matches of the query graph. GADDI is further optimized by the use of a dynamic matching scheme to minimize redundant calculations. Last but not least, a number of real and synthetic data sets are used to evaluate the efficiency and scalability of our proposed method.
GBASE: A Scalable and General Graph Management System
Cited by 24 (5 self)
Graphs appear in numerous applications including cyber-security, the Internet, social networks, protein networks, recommendation systems, and many more. Graphs with millions or even billions of nodes and edges are commonplace. How to store such large graphs efficiently? What are the core operations/queries on those graphs? How to answer the graph queries quickly? We propose GBASE, a scalable and general graph management and mining system. The key novelties lie in 1) our storage and compression scheme for a parallel setting and 2) the carefully chosen graph operations and their efficient implementation. We designed and implemented an instance of GBASE using MAPREDUCE/HADOOP. GBASE provides a parallel indexing mechanism for graph mining operations that both saves storage space and accelerates queries. We ran numerous experiments on real graphs, spanning billions of nodes and edges, and we show that our proposed GBASE is indeed fast, scalable and nimble, with significant savings in space and time.
Towards proximity pattern mining in large graphs
In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD ’10), ACM, 2010
Cited by 13 (3 self)
Mining graph patterns in large networks is critical to a variety of applications such as malware detection and biological module discovery. However, frequent subgraphs are often ineffective to capture association existing in these applications, due to the complexity of isomorphism testing and the inelastic pattern definition. In this paper, we introduce proximity pattern which is a significant departure from the traditional concept of frequent subgraphs. Defined as a set of labels that co-occur in neighborhoods, proximity pattern blurs the boundary between itemset and structure. It relaxes the rigid structure constraint of frequent subgraphs, while introducing connectivity to frequent itemsets. Therefore, it can benefit from both: efficient mining in itemsets and structure proximity from graphs. We developed two models to define proximity patterns. The second one, called Normalized Probabilistic Association (NmPA), is able to transform a complex graph mining problem to a simplified probabilistic itemset mining problem, which can be solved efficiently by a modified FP-tree algorithm, called pFP. NmPA and pFP are evaluated on real-life social and intrusion networks. Empirical results show that it not only finds interesting patterns that are ignored by the existing approaches, but also achieves high performance for finding proximity patterns in large-scale graphs.
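The notion of labels that co-occur in neighborhoods can be made concrete with a small Python sketch. This is neither NmPA nor pFP: it only counts, for every pair of labels, how many vertices see both labels in their closed one-hop neighborhood. The function name `neighborhood_label_pairs` and the adjacency-dict encoding are assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations

def neighborhood_label_pairs(adj, labels):
    """Support of each label pair = number of vertices whose closed
    one-hop neighborhood contains both labels."""
    support = Counter()
    for v, nbrs in adj.items():
        seen = {labels[v]} | {labels[w] for w in nbrs}
        for pair in combinations(sorted(seen), 2):
            support[pair] += 1
    return support

# Star graph: center 1 (label "a") with leaves 2 ("b") and 3 ("c").
adj = {1: [2, 3], 2: [1], 3: [1]}
labels = {1: "a", 2: "b", 3: "c"}
support = neighborhood_label_pairs(adj, labels)
print(support)
```

In the star example, labels "b" and "c" are never adjacent, yet they still receive support because both appear in the neighborhood of the center. That is the relaxation the abstract describes: proximity rather than an exact subgraph structure.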