Comparing Stars: On Approximating Graph Edit Distance
2009
Cited by 29 (0 self)
Graph data have become ubiquitous, and manipulating them based on similarity is essential for many applications. Graph edit distance is one of the most widely accepted measures of similarity between graphs and has extensive applications in fields such as pattern recognition and computer vision. Unfortunately, computing graph edit distance is NP-hard in general. Accordingly, in this paper we introduce three novel methods to compute upper and lower bounds for the edit distance between two graphs in polynomial time. Applying these methods, we introduce two algorithms, AppFull and AppSub, to perform different kinds of graph search on graph databases. Comprehensive experimental studies are conducted on both real and synthetic datasets to examine various aspects of the methods for bounding graph edit distance. The results show that these methods achieve good scalability in terms of both the number of graphs and the size of graphs. The effectiveness of these algorithms also confirms the usefulness of our bounds in filtering and searching graphs.
The pq-Gram Distance between Ordered Labeled Trees
ACM Transactions on Database Systems (TODS), 2010
Cited by 13 (5 self)
When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. Typically the matching must be approximate since the representations in the sources differ. We propose pq-grams to approximately match hierarchical data from autonomous sources and define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the fanout weighted tree edit distance. We prove that the pq-gram distance is a lower bound of the fanout weighted tree edit distance and give a normalization of the pq-gram distance for which the triangle inequality holds. Experiments on synthetic and real world data (residential addresses and XML) confirm the scalability of our approach and show the effectiveness of pq-grams.
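The pq-gram idea can be illustrated with a minimal Python sketch. It assumes the commonly described construction: each pq-gram consists of a stem of p ancestor labels and a base of q consecutive child labels, with dummy labels padding the tree. The `Node` class, the function names, and the defaults p=2, q=3 are illustrative assumptions, and the distance shown is the unnormalized bag symmetric difference rather than the normalized variant the abstract mentions.

```python
from collections import Counter


class Node:
    """Ordered labeled tree node (hypothetical helper type for this sketch)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def pq_profile(root, p=2, q=3, dummy="*"):
    """Return the bag (Counter) of pq-gram label tuples of the tree."""
    profile = Counter()

    def visit(node, stem):
        # Shift the current node's label into the stem of p ancestors.
        stem = stem[1:] + (node.label,)
        base = (dummy,) * q
        if not node.children:
            # A leaf contributes one pq-gram with an all-dummy base.
            profile[stem + base] += 1
            return
        for child in node.children:
            # Slide the window of q siblings one child to the right.
            base = base[1:] + (child.label,)
            profile[stem + base] += 1
            visit(child, stem)
        # Slide the window past the last child with q - 1 trailing dummies.
        for _ in range(q - 1):
            base = base[1:] + (dummy,)
            profile[stem + base] += 1

    visit(root, (dummy,) * p)
    return profile


def pq_gram_distance(t1, t2, p=2, q=3):
    """Size of the bag symmetric difference of the two profiles."""
    p1, p2 = pq_profile(t1, p, q), pq_profile(t2, p, q)
    shared = sum((p1 & p2).values())  # bag intersection
    return sum(p1.values()) + sum(p2.values()) - 2 * shared


t1 = Node("a", [Node("b"), Node("c")])
t2 = Node("a", [Node("b"), Node("d")])
print(pq_gram_distance(t1, t2))  # 8: each profile has 6 pq-grams, 2 are shared
```

Because the profiles are bags of short label tuples, two trees can be compared without aligning their nodes, which is what makes this kind of approximation cheap relative to an exact tree edit distance.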
Approximate joins for data-centric XML
In Proceedings of the International Conference on Data Engineering (ICDE). IEEE Computer Society, 2008
Cited by 10 (4 self)
In data integration applications, a join matches elements that are common to two data sources. Often, however, elements are represented slightly differently in each source, so an approximate join must be used. For XML data, most approximate join strategies are based on some ordered tree matching technique. But in data-centric XML the order is irrelevant: two elements should match even if their subelement order varies. In this paper we give a solution for the approximate join of unordered trees. Our solution is based on windowed pq-grams. We develop an efficient technique to systematically generate windowed pq-grams in a three-step process: sorting the unordered tree, extending the sorted tree with dummy nodes, and computing the windowed pq-grams on the extended tree. The windowed pq-gram distance between two sorted trees approximates the tree edit distance between the respective unordered trees. The approximate join algorithm based on windowed pq-grams is implemented as an equality join on strings, which avoids the costly computation of the distance between every pair of input trees. Our experiments with synthetic and real-world data confirm the analytic results and suggest that our technique is both useful and scalable.
The Paths More Taken: Matching DOM Trees to Search Logs for Accurate Webpage Clustering
Cited by 6 (0 self)
An unsupervised clustering of the webpages on a website is a primary requirement for most wrapper induction and automated data extraction methods. Since page content can vary drastically across pages of one cluster (e.g., all product pages on amazon.com), traditional clustering methods typically use some distance function between the DOM trees representing a pair of webpages. However, without knowing which portions of the DOM tree are “important,” such distance functions might discriminate between similar pages based on trivial features (e.g., differing number of reviews on two product pages), or club together distinct types of pages based on superficial features present in the DOM trees of both (e.g., matching footer/copyright), leading to poor clustering performance.
Hashing Tree-Structured Data: Methods and Applications
Cited by 6 (0 self)
In this article we propose a new hashing framework for tree-structured data. Our method maps an unordered tree into a multiset of simple wedge-shaped structures referred to as pivots. By coupling our pivot multisets with the idea of minwise hashing, we realize a fixed-size signature sketch of the tree-structured datum, yielding an effective mechanism for hashing such data. We discuss several potential pivot structures, study some of their theoretical properties, and discuss their implications for tree edit distance and properties related to perfect hashing. We then empirically demonstrate the efficacy and efficiency of the overall approach on a range of real-world datasets and applications.
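The coupling of a multiset with minwise hashing can be sketched in Python as below. This is a generic MinHash sketch, not the paper's implementation: the representation of pivots as label triples, the function names, the use of SHA-1 with integer seeds, and the signature length of 64 are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def _hash(element, seed):
    """Deterministic 64-bit hash of an element under a given seed."""
    data = f"{seed}:{element!r}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(multiset, num_hashes=64):
    """Return a fixed-size signature for a multiset of hashable items."""
    counts = Counter(multiset)
    # Turn the multiset into a set by tagging each occurrence with its index.
    elements = [(item, k) for item, c in counts.items() for k in range(c)]
    if not elements:
        raise ValueError("cannot sketch an empty multiset")
    # One minimum per seeded hash function.
    return [min(_hash(e, seed) for e in elements) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical pivots, written here as triples of node labels.
pivots_a = [("a", "b", "c"), ("a", "b", "d"), ("a", "b", "c")]
pivots_b = [("a", "b", "c"), ("a", "b", "c"), ("a", "b", "d")]
sig_a = minhash_signature(pivots_a)
sig_b = minhash_signature(pivots_b)
print(estimate_similarity(sig_a, sig_b))  # 1.0: same multiset, different order
```

The signature length stays fixed regardless of tree size, so signatures can be compared or indexed directly; longer signatures give a less noisy similarity estimate at the cost of more hashing.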
What is the IQ of your Data Transformation System?
In CIKM, 2012
An incrementally maintainable index for approximate lookups in hierarchical data
In Proceedings of the 32nd International Conference on Very Large Data Bases, 2006
Cited by 2 (1 self)
Several recent papers argue for approximate lookups in hierarchical data and propose index structures that support approximate searches in large sets of hierarchical data. These index structures must be updated if the underlying data changes. Since the performance of a full index reconstruction is prohibitive, the index must be updated incrementally. We propose a persistent and incrementally maintainable index for approximate lookups in hierarchical data. The index is based on small tree patterns, called pq-grams. It supports efficient updates in response to structure and value changes in hierarchical data and is based on the log of tree edit operations. We prove the correctness of the incremental maintenance for sequences of edit operations. Our algorithms identify a small set of pq-grams that must be updated to maintain the index. The experimental results with synthetic and real data confirm the scalability of our approach.
Approximate structural matching over ordered XML documents
In IDEAS, 2007
Cited by 1 (0 self)
There is an increasing need for an XML query engine that not only searches for exact matches to a query but also returns “query-like” structures. We have designed and developed XFinder, an efficient top-K tree pattern query evaluation system, which reduces the problem of approximate tree structural matching to a simpler problem of subsequence matching. However, since not all subsequences correspond to valid tree structures, it is expensive to enumerate the common subsequences between the XML data and the query and then filter out the invalid ones. XFinder addresses this challenge by detecting and pruning structurally irrelevant subsequence matches as early as possible. Experiments show the efficiency of XFinder on various data and query sets.
Efficient Similarity Search for Tree-Structured Data
2008
Cited by 1 (0 self)
Tree-structured data are becoming ubiquitous, and manipulating them based on similarity is essential for many applications. Although similarity search on textual data has been extensively studied, searching for similar trees is still an open problem due to the high complexity of computing the similarity between trees, especially for large numbers of trees. In this paper, we propose to transform tree-structured data into strings with a one-to-one mapping. We prove that the edit distance of the corresponding strings forms a bound for the similarity measures between trees, including tree edit distance, largest common subtrees, and smallest common super-trees. Based on the theoretical analysis, we can employ any existing algorithm for approximate string search for effective similarity search on trees. Moreover, we embed the bound into a filter-and-refine framework to facilitate similarity search on tree-structured data. The experimental results show that our algorithm achieves high performance and outperforms state-of-the-art methods significantly. Our method is especially suitable for accelerating similarity query processing on large numbers of trees in massive datasets.
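A filter-and-refine search of the kind mentioned above can be sketched in Python as follows. This is a generic illustration on plain strings, not the paper's method: the strings stand in for tree-derived strings, and the length-difference lower bound is a stand-in for the bound the paper proves. The function names and the `refined` counter are assumptions made for the example.

```python
def levenshtein(a, b):
    """Exact string edit distance (the expensive refine step)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def length_lower_bound(a, b):
    """Cheap lower bound: edit distance is at least the length difference."""
    return abs(len(a) - len(b))


def range_search(query, database, threshold,
                 lower_bound=length_lower_bound, exact=levenshtein):
    """Return items within `threshold` of `query`, plus the refine count."""
    results, refined = [], 0
    for item in database:
        # Filter: if even the lower bound exceeds the threshold, prune.
        if lower_bound(query, item) > threshold:
            continue
        # Refine: compute the exact distance only for surviving candidates.
        refined += 1
        if exact(query, item) <= threshold:
            results.append(item)
    return results, refined


database = ["sitting", "kitchen", "k", "mitten", "catastrophe"]
matches, refined = range_search("kitten", database, threshold=2)
print(matches, refined)  # ['kitchen', 'mitten'] 3
```

Pruning is safe only because the filter is a true lower bound of the exact distance; a tighter bound prunes more candidates and leaves fewer exact computations for the refine step.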
XML: Some Papers in a Haystack
Cited by 1 (0 self)
XML has been explored by both research and industry communities. More than 5500 papers were published on different aspects of XML. With so many publications, it is hard for someone to decide where to start. Hence, this paper presents some of the research topics on XML, namely: XML on relational databases, query processing, views, data matching, and schema evolution. It then summarizes some (some!) of the most relevant or traditional papers on those subjects.