Algorithmics and Applications of Tree and Graph Searching
 In Symposium on Principles of Database Systems
, 2002
Cited by 146 (8 self)
Modern search engines answer keyword-based queries extremely efficiently. The impressive speed is due to clever inverted index structures, caching, a domain-independent knowledge of strings, and thousands of machines. Several research efforts have attempted to generalize keyword search to keytree and keygraph searching, because trees and graphs have many applications in next-generation database systems. This paper surveys both algorithms and applications, giving some emphasis to our own work.
An improved algorithm for matching large graphs
 In: 3rd IAPR-TC15 Workshop on Graph-based Representations in Pattern Recognition, Cuen
, 2001
Cited by 94 (4 self)
In this paper an improved version of a graph matching algorithm is presented, which is able to efficiently solve the graph isomorphism and graph-subgraph isomorphism problems on Attributed Relational Graphs. This version is particularly suited to work with very large graphs, since its memory requirements are considerably smaller than those of other algorithms of the same kind. After a detailed description of the algorithm, an experimental comparison is made against both the previous version (developed by the same authors) and Ullmann’s algorithm.
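For orientation, the graph-subgraph isomorphism problem named in this abstract can be illustrated with a naive backtracking search. This is a minimal sketch and not the algorithm of the paper above: the function name `find_subgraph_mapping`, the adjacency-dict representation, and the degree-based pruning rule are all assumptions made for illustration, and node and edge attributes are ignored.

```python
def find_subgraph_mapping(pattern, target):
    """Return a dict mapping pattern nodes to target nodes, or None.

    Both graphs are undirected and given as {node: set_of_neighbours}.
    Every pattern edge must map onto a target edge (non-induced matching).
    Hypothetical helper for illustration; exponential time in the worst case.
    """
    order = list(pattern)  # fixed order in which pattern nodes are matched

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)  # every pattern node has been placed
        p = order[len(mapping)]
        used = set(mapping.values())
        for t in target:
            if t in used:
                continue  # the mapping must be injective
            # Cheap pruning: t needs at least as many neighbours as p.
            if len(target[t]) < len(pattern[p]):
                continue
            # Every already-mapped neighbour of p must be adjacent to t.
            if all(mapping[q] in target[t] for q in pattern[p] if q in mapping):
                mapping[p] = t
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[p]  # backtrack and try the next candidate
        return None

    return extend({})
```

Calling `find_subgraph_mapping(pattern, target)` with a triangle as the pattern returns a node mapping when the target contains a triangle and `None` when it does not. The whole search state here is one dictionary plus the recursion stack; the sketch makes no attempt to reproduce the memory behaviour or pruning quality reported in the abstract.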
Closure-Tree: An Index Structure for Graph Queries
, 2006
Cited by 92 (1 self)
Graphs have become popular for modeling structured data. As a result, graph queries are becoming common and graph indexing has come to play an essential role in query processing. We introduce the concept of a graph closure, a generalized graph that represents a number of graphs. Our indexing technique, called Closure-tree, organizes graphs hierarchically where each node summarizes its descendants by a graph closure. Closure-tree can efficiently support both subgraph queries and similarity queries. Subgraph queries find graphs that contain a specific subgraph, whereas similarity queries find graphs that are similar to a query graph. For subgraph queries, we propose a technique called pseudo subgraph isomorphism which approximates subgraph isomorphism with high accuracy. For similarity queries, we measure graph similarity through edit distance using heuristic graph mapping methods. We implement two kinds of similarity queries: K-NN query and range query. Our experiments on chemical compounds and synthetic graphs show that for subgraph queries, Closure-tree outperforms existing techniques by up to two orders of magnitude in terms of candidate answer set size and index size. For similarity queries, our experiments validate the quality and efficiency of the presented algorithms.
Fg-index: towards verification-free query processing on graph databases
 in SIGMOD, 2007
Cited by 77 (10 self)
Graphs are prevalently used to model the relationships between objects in various domains. With the increasing usage of graph databases, it has become more and more demanding to efficiently process graph queries. Querying graph databases is costly since it involves subgraph isomorphism testing, which is an NP-complete problem. In recent years, some effective graph indexes have been proposed to first obtain a candidate answer set by filtering part of the false results and then perform verification on each candidate by checking subgraph isomorphism. Query performance is improved since the number of subgraph isomorphism tests is reduced. However, candidate verification is still inevitable, which can be expensive when the size of the candidate answer set is large. In this paper, we propose a novel indexing technique that constructs a nested inverted-index, called FG-index, based on the set of Frequent subGraphs (FGs). Given a graph query that is an FG in the database, FG-index returns the exact set of query answers without performing candidate verification. When the query is an infrequent graph, FG-index produces a candidate answer set which is close to the exact answer set. Since an infrequent graph means the graph occurs in only a small number of graphs in the database, the number of subgraph isomorphism tests is small. To ensure that the index fits into the main memory, we propose a new notion of δ-Tolerance Closed Frequent Graphs (δ-TCFGs), which allows us to flexibly tune the size of the index in a parameterized way. Our extensive experiments verify that query processing using FG-index is orders of magnitude more efficient than using the state-of-the-art graph index.
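The filter-then-verify framework that this abstract describes as the prior approach can be sketched as follows. This is only an illustration under stated assumptions, not FG-index itself: the features are plain edge label pairs rather than frequent subgraphs, verification is a naive backtracking test, and all function names and the `(labels, adjacency)` graph representation are hypothetical.

```python
from collections import defaultdict

def edge_features(labels, adj):
    """Set of unordered label pairs, one per edge of an undirected labelled graph."""
    return {tuple(sorted((labels[u], labels[v]))) for u in adj for v in adj[u]}

def build_index(database):
    """Inverted index: edge feature -> ids of the database graphs containing it."""
    index = defaultdict(set)
    for gid, (labels, adj) in database.items():
        for feat in edge_features(labels, adj):
            index[feat].add(gid)
    return index

def contains(query, graph):
    """Verification: naive backtracking, label-preserving subgraph test."""
    q_labels, q_adj = query
    g_labels, g_adj = graph
    order = list(q_labels)

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        p = order[len(mapping)]
        used = set(mapping.values())
        for t in g_labels:
            if t in used or g_labels[t] != q_labels[p]:
                continue
            # Already-mapped neighbours of p must be adjacent to t.
            if all(mapping[n] in g_adj[t] for n in q_adj[p] if n in mapping):
                mapping[p] = t
                if extend(mapping):
                    return True
                del mapping[p]  # backtrack
        return False

    return extend({})

def query_database(query, database, index):
    """Filter with the index, then verify each surviving candidate."""
    candidates = set(database)
    for feat in edge_features(*query):
        candidates &= index.get(feat, set())
    answers = {gid for gid in candidates if contains(query, database[gid])}
    return candidates, answers
```

With a database holding a C-C-O path, an N-C edge, and an O-C-O path, a query for O-C-O passes the filter for both graphs that contain a C-O edge, and verification then discards the C-C-O path. That gap between the candidate set and the answer set is the verification cost the abstract refers to.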
Anonymizing Social Networks
 VLDB 2008
, 2008
Cited by 73 (2 self)
Advances in technology have made it possible to collect data about individuals and the connections between them, such as email correspondence and friendships. Agencies and researchers who have collected such social network data often have a compelling interest in allowing others to analyze the data. However, in many cases the data describes relationships that are private (e.g., email correspondence) and sharing the data in full can result in unacceptable disclosures. In this paper, we present a framework for assessing the privacy risk of sharing anonymized network data. This includes a model of adversary knowledge, for which we consider several variants and make connections to known graph theoretical results. On several real-world social networks, we show that simple anonymization techniques are inadequate, resulting in substantial breaches of privacy for even modestly informed adversaries. We propose a novel anonymization technique based on perturbing the network and demonstrate empirically that it leads to substantial reduction of the privacy threat. We also analyze the effect that anonymizing the network has on the utility of the data for social network analysis.
GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis
 In the Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’06)
, 2006
Cited by 71 (1 self)
Along with the blossom of open source projects comes the convenience for software plagiarism. A company, if less self-disciplined, may be tempted to plagiarize some open source projects for its own products. Although current plagiarism detection tools appear sufficient for academic use, they are nevertheless short for fighting against serious plagiarists. For example, disguises like statement reordering and code insertion can effectively confuse these tools. In this paper, we develop a new plagiarism detection tool, called GPlag, which detects plagiarism by mining program dependence graphs (PDGs). A PDG is a graphic representation of the data and control dependencies within a procedure. Because PDGs are nearly invariant during plagiarism, GPlag is more effective than state-of-the-art tools for plagiarism detection. In order to make GPlag scalable to large programs, a statistical lossy filter is proposed to prune the plagiarism search space. Experiment study shows that GPlag is both effective and efficient: It detects plagiarism that easily slips over existing tools, and it usually takes a few seconds to find (simulated) plagiarism in programs having thousands of lines of code.
Mobile Robot Localisation and Mapping in Extensive Outdoor Environments
, 2002
Cited by 70 (9 self)
This thesis addresses the issues of scale for practical implementations of simultaneous localisation and mapping (SLAM) in extensive outdoor environments. Building an incremental map while also using it for localisation is of prime importance for mobile robot navigation but, until recently, has been confined to small-scale, mostly indoor, environments. The critical problems for large-scale implementations are as follows. First, data association (finding correspondences between map landmarks and robot sensor measurements) becomes difficult in complex, cluttered environments, especially if the robot location is uncertain. Second, the information required to maintain a consistent map using traditional methods imposes a prohibitive computational burden as the map increases in size. And third, the mathematics for SLAM relies on assumptions of small errors and near-linearity, and these become invalid for larger maps.
Graphgrep: A fast and universal method for querying graphs. In:
 Proc. 16th Int. Conference on Pattern Recognition. Volume
, 2002
Large-Scale Malware Indexing Using Function-Call Graphs
Cited by 55 (0 self)
A major challenge of the antivirus (AV) industry is how to effectively process the huge influx of malware samples they receive every day. One possible solution to this problem is to quickly determine if a new malware sample is similar to any previously-seen malware program. In this paper, we design, implement and evaluate a malware database management system called SMIT (Symantec Malware Indexing Tree) that can efficiently make such determination based on malware’s function-call graphs, which is a structural representation known to be less susceptible to instruction-level obfuscations commonly employed by malware writers to evade detection of AV software. Because each malware program is represented as a graph, the problem of searching for the most similar malware program in a database to a given malware sample is cast into a nearest-neighbor search problem in a graph database. To speed