Results 1 – 10 of 10
Trie-join: a trie-based method for efficient string similarity joins
 THE VLDB JOURNAL
, 2012
Abstract

Cited by 13 (5 self)
A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) they are inefficient for data sets with short strings (average string length no larger than 30); (2) they involve large indexes; (3) they are expensive to support dynamic updates of data sets. To address these problems, we propose a novel method called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic updates of data sets efficiently. We conducted extensive experiments on four real data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on data sets with short strings.
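The subtrie-pruning idea in this abstract can be illustrated with a small sketch: walk the trie while maintaining one edit-distance DP row per node, and skip an entire subtrie once every cell of its row exceeds the threshold. The node layout and the single-query formulation below are illustrative assumptions, not the paper's trie-join algorithms.

```python
# A minimal sketch of trie traversal with subtrie pruning for an
# edit-distance threshold tau, using the textbook DP recurrence;
# names and structure are illustrative, not the paper's trie-join.

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.word = None     # set on word-terminating nodes

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def search(trie, query, tau):
    """Return (word, distance) for all trie words within distance tau of query."""
    results = []
    first_row = list(range(len(query) + 1))
    for ch, child in trie.children.items():
        _descend(child, ch, query, first_row, tau, results)
    return results

def _descend(node, ch, query, prev_row, tau, results):
    # Extend the edit-distance table by one row for this trie edge.
    row = [prev_row[0] + 1]
    for i in range(1, len(query) + 1):
        cost = 0 if query[i - 1] == ch else 1
        row.append(min(row[i - 1] + 1,            # insertion
                       prev_row[i] + 1,           # deletion
                       prev_row[i - 1] + cost))   # match / substitution
    if node.word is not None and row[-1] <= tau:
        results.append((node.word, row[-1]))
    # Subtrie pruning: if every cell exceeds tau, no extension of this
    # prefix can come within tau of the query, so skip the whole subtrie.
    if min(row) <= tau:
        for c, child in node.children.items():
            _descend(child, c, query, row, tau, results)

trie = build_trie(["bag", "bat", "cat", "cart"])
print(sorted(search(trie, "bats", 1)))   # [('bat', 1)]
```

Pruning is sound because each DP cell lower-bounds the distance of every string extending the current prefix.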
The pq-Gram Distance between Ordered Labeled Trees
 ACM TRANSACTIONS ON DATABASE SYSTEMS (TODS)
, 2010
Abstract

Cited by 13 (5 self)
When integrating data from autonomous sources, exact matches of data items that represent the same real-world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. Typically the matching must be approximate, since the representations in the sources differ. We propose pq-grams to approximately match hierarchical data from autonomous sources and define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the fanout-weighted tree edit distance. We prove that the pq-gram distance is a lower bound of the fanout-weighted tree edit distance and give a normalization of the pq-gram distance for which the triangle inequality holds. Experiments on synthetic and real-world data (residential addresses and XML) confirm the scalability of our approach and show the effectiveness of pq-grams.
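The pq-gram profile construction the abstract describes can be sketched roughly as follows (here with p = 2, q = 3): each pq-gram is a stem of p ancestor labels plus a sliding window of q children, padded with dummy `*` nodes. The `(label, [children])` tree encoding and the `Counter`-based bag arithmetic are assumptions for illustration.

```python
# A rough sketch of pq-gram profiles and a normalized pq-gram distance;
# the tree encoding and normalization choice are my assumptions.
from collections import Counter

def pq_grams(tree, p=2, q=3, star="*"):
    """Return the bag (Counter) of pq-grams of an ordered labeled tree."""
    profile = Counter()
    def visit(node, stem):
        label, children = node
        stem = stem[1:] + (label,)       # shift the current label into the stem
        base = (star,) * q               # sliding window over the children
        if not children:
            profile[stem + base] += 1    # leaf: all-star base
        else:
            for child in children:
                base = base[1:] + (child[0],)
                profile[stem + base] += 1
                visit(child, stem)
            for _ in range(q - 1):       # drain the window with stars
                base = base[1:] + (star,)
                profile[stem + base] += 1
    visit(tree, (star,) * p)
    return profile

def pq_gram_distance(t1, t2, p=2, q=3):
    """Normalized bag distance: 0 for identical profiles, at most 1."""
    a, b = pq_grams(t1, p, q), pq_grams(t2, p, q)
    inter = sum((a & b).values())
    return 1 - 2 * inter / sum((a + b).values())

t = ("a", [("b", []), ("c", [])])
print(pq_gram_distance(t, t))                             # 0.0
print(pq_gram_distance(t, ("a", [("b", [])])))            # 0.6
```

An internal node with k children contributes k + q - 1 pq-grams, so the profile is linear in the tree size.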
Hashing Tree-Structured Data: Methods and Applications
Abstract

Cited by 6 (0 self)
In this article we propose a new hashing framework for tree-structured data. Our method maps an unordered tree into a multiset of simple wedge-shaped structures referred to as pivots. By coupling our pivot multisets with the idea of min-wise hashing, we realize a fixed-size signature sketch of the tree-structured datum, yielding an effective mechanism for hashing such data. We discuss several potential pivot structures, study some of their theoretical properties, and discuss their implications for tree edit distance and for properties related to perfect hashing. We then empirically demonstrate the efficacy and efficiency of the overall approach on a range of real-world datasets and applications.
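The min-wise-hashing step, mapping a multiset (such as a pivot multiset) to a fixed-size signature, might look like the sketch below; the salted-hash family and the multiset encoding are my assumptions, not the paper's construction.

```python
# Illustrative min-wise hashing of a multiset into a fixed-size signature.
import hashlib

def minhash_signature(multiset, k=16):
    """multiset: iterable of (element, count) pairs; returns k hash minima."""
    sig = [float("inf")] * k
    for elem, count in multiset:
        for copy in range(count):                 # keep duplicate copies distinct
            token = f"{elem}#{copy}".encode()
            for i in range(k):                    # one salted hash per signature slot
                h = int.from_bytes(
                    hashlib.blake2b(token, digest_size=8,
                                    salt=i.to_bytes(4, "big")).digest(), "big")
                sig[i] = min(sig[i], h)
    return sig

def estimated_jaccard(sig1, sig2):
    """Fraction of agreeing minima estimates the multisets' Jaccard similarity."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

pivots = [("wedge-1", 2), ("wedge-2", 1)]         # hypothetical pivot multiset
print(estimated_jaccard(minhash_signature(pivots),
                        minhash_signature(pivots)))   # 1.0
```

Equal multisets always produce identical signatures, and each signature slot agrees between two multisets with probability equal to their Jaccard similarity.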
The power of two min-hashes for similarity search among hierarchical data objects
Efficient Top-k Approximate Subtree Matching in Small Memory
Abstract

Cited by 1 (1 self)
We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree within a large document tree, using the canonical tree edit distance as the similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runtime and quadratic space complexity and thus do not scale. Our solution is TASM-postorder, a memory-efficient and scalable TASM algorithm. We prove an upper bound on the maximum subtree size for which the tree edit distance needs to be evaluated. The upper bound depends on the query and is independent of the document size and structure. A core problem is to efficiently prune subtrees that are above this size threshold. We develop an algorithm based on a prefix ring buffer that allows us to prune all subtrees above the threshold in a single postorder scan of the document. The size of the prefix ring buffer is linear in the threshold. As a result, the space complexity of TASM-postorder depends only on k and the query size, and its runtime is linear in the size of the document. Our experimental evaluation on large synthetic and real XML documents confirms our analytic results.
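The size-threshold pruning criterion from this abstract can be illustrated with a toy recursion that keeps only the maximal subtrees of size at most tau. This mimics the criterion only; it is not the paper's single-scan prefix-ring-buffer algorithm and, unlike it, recomputes sizes quadratically in the worst case.

```python
# A toy illustration of size-threshold pruning: keep only maximal
# subtrees whose size is at most tau. Trees are (label, [children])
# tuples -- an encoding assumed here for illustration.

def subtree_size(node):
    label, children = node
    return 1 + sum(subtree_size(c) for c in children)

def prune(node, tau):
    """Yield the maximal subtrees of size <= tau, in document order."""
    label, children = node
    if subtree_size(node) <= tau:
        yield node                      # whole subtree is small enough
    else:
        for child in children:          # recurse only into oversized subtrees
            yield from prune(child, tau)

doc = ("r", [("a", [("b", []), ("c", [])]), ("d", [])])
print([t[0] for t in prune(doc, 3)])    # ['a', 'd']
```

Only the surviving subtrees ever need a tree-edit-distance evaluation against the query.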
Efficient processing of containment queries on nested sets
 In EDBT
, 2013
Abstract

Cited by 1 (0 self)
We study the problem of computing containment queries on sets which can have both atomic and set-valued objects as elements, i.e., nested sets. Containment is a fundamental query pattern with many basic applications. Our study of nested set containment is motivated by the ubiquity of nested data in practice, e.g., in XML and JSON data management, in business and scientific workflow management, and in web analytics. Furthermore, there are to our knowledge no known efficient solutions to computing containment queries on massive collections of nested sets. Our specific contributions in this paper are: (1) we introduce two novel algorithms for efficient evaluation of containment queries on massive collections of nested sets; (2) we study caching and filtering mechanisms to accelerate query processing in the algorithms; (3) we develop extensions to the algorithms to a) compute several related query types and b) accommodate natural variations of the semantics of containment; and (4) we present analytic and empirical analyses which demonstrate that both algorithms are efficient and scalable.
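One natural containment semantics on nested sets (every element of the candidate must be contained in some element of the container, with atoms matched by equality) can be written as a naive recursive predicate. The paper's contribution is efficient evaluation over massive collections and semantic variants; this sketch only shows the basic predicate.

```python
# A naive recursive containment check on nested sets, encoded as
# frozensets of atoms and frozensets; one assumed semantics among
# the variants the paper discusses.

def contained(s1, s2):
    if isinstance(s1, frozenset):
        return (isinstance(s2, frozenset)
                and all(any(contained(x, y) for y in s2) for x in s1))
    return s1 == s2                      # atoms match by equality

a = frozenset({1, frozenset({2, 3})})
b = frozenset({1, 4, frozenset({2, 3, 5})})
print(contained(a, b))   # True
print(contained(b, a))   # False
```

Note this semantics is non-injective (two elements of s1 may match the same element of s2); an injective variant would need a matching step instead of `any`.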
A Survey on Tree Edit Distance Lower Bound Estimation Techniques for Similarity Join on XML Data
Abstract

Cited by 1 (0 self)
When integrating tree-structured data from autonomous and heterogeneous sources, exact joins often fail because the same object may be represented differently. Approximate join techniques are therefore used, in which similar trees are considered to describe the same real-world object. A commonly accepted metric for tree similarity is the tree edit distance. While it yields good results, this metric is computationally expensive and thus of limited benefit for large databases. To make the join process efficient, many previous works adopt filtering-and-refinement mechanisms that provide lower bounds for the tree edit distance in order to avoid unnecessary computations. This work explores several widely accepted filtering-and-refinement methods and combines them into multi-level filters. Experimental results indicate that string-based lower bounds are tighter but more computationally expensive than set-based lower bounds, and that multi-level filters provide the tightest lower bound efficiently.
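Two of the cheap filters such surveys compare, the size bound and a label-bag bound, can be sketched as below. Both are safe lower bounds because a single edit operation changes the tree size by at most one and the label bags' symmetric difference by at most two; the survey's exact bounds and tightness analysis may differ.

```python
# Sketch of two cheap tree-edit-distance lower bounds usable as join
# filters; trees are (label, [children]) tuples, an assumed encoding.
from collections import Counter

def labels(tree, bag=None):
    """Collect the bag (multiset) of node labels."""
    bag = Counter() if bag is None else bag
    label, children = tree
    bag[label] += 1
    for c in children:
        labels(c, bag)
    return bag

def ted_lower_bound(t1, t2):
    b1, b2 = labels(t1), labels(t2)
    size_bound = abs(sum(b1.values()) - sum(b2.values()))
    sym_diff = sum(((b1 - b2) + (b2 - b1)).values())
    bag_bound = (sym_diff + 1) // 2      # each edit fixes at most 2 bag mismatches
    return max(size_bound, bag_bound)

t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("b", [("d", [])])])
print(ted_lower_bound(t1, t2))   # 1 (the true edit distance here is 2)
```

A join candidate pair is discarded whenever the lower bound already exceeds the distance threshold, so the expensive exact computation runs only on survivors.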
Windowed pq-grams for approximate joins of data-centric XML
, 2012
Abstract
In data integration applications, a join matches elements that are common to two data sources. Since elements are represented slightly differently in each source, an approximate join must be used to do the matching. For XML data, most existing approximate join strategies are based on some ordered tree matching technique, such as the tree edit distance. In data-centric XML, however, the sibling order is irrelevant, and two elements should match even if their subelement order varies. Thus, approximate joins for data-centric XML must leverage unordered tree matching techniques. This is computationally hard, since the algorithms cannot rely on a predefined sibling order. In this paper, we give a solution for approximate joins based on unordered tree matching. The core of our solution is windowed pq-grams, which are small subtrees of a specific shape. We develop an efficient technique to generate windowed pq-grams in a three-step process: sort the tree, extend the sorted tree with dummy nodes, and decompose the extended tree into windowed pq-grams. The windowed pq-gram distance between two trees is the number of pq-grams that are in one tree decomposition only. We show that our distance is a pseudo-metric and empirically demonstrate that it effectively approximates the unordered tree edit distance. The approximate join …
Trie-join: a trie-based method for efficient string similarity joins
 The VLDB Journal, DOI 10.1007/s00778-011-0252-8
Abstract
A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) they are inefficient for data sets with short strings (average string length no larger than 30); (2) they involve large indexes; (3) they are expensive to support dynamic updates of data sets. To address these problems, we propose a novel method called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic updates of data sets efficiently. We conducted extensive experiments on four real data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on data sets with short strings.
RWS-Diff: Flexible and Efficient Change Detection in Hierarchical Data
Abstract
The problem of generating a cost-minimal edit script between two trees has many important applications. However, finding such a cost-minimal script is computationally hard, so the only methods that scale are approximate ones. Various approximate solutions have been proposed recently, but most of them still show quadratic or worse runtime complexity in the tree size and thus do not scale well either. The only solutions with log-linear runtime complexity use simple matching algorithms that find corresponding subtrees only as long as these subtrees are equal. Consequently, such solutions are not robust: small changes in the leaves, which occur frequently, can make all subtrees that contain the changed leaves unequal and thus prevent the matching of large portions of the trees. This problem could be avoided by searching for similar instead of equal subtrees, but current similarity approaches are too costly and thus also show quadratic complexity. Hence, no robust log-linear method currently exists. We propose the random walks similarity (RWS) measure, which can be used to find similar subtrees rapidly. We use this measure to build the RWS-Diff algorithm, which computes an approximately cost-minimal edit script in log-linear time while having the robustness of a similarity-based approach. Our evaluation reveals that random walk similarity drastically increases edit script quality and robustness while still maintaining a runtime comparable to simple matching approaches.