Efficient exact edit similarity query processing with the asymmetric signature scheme
In SIGMOD Conference, 2011
Cited by 19 (4 self)
Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold τ. Most existing methods for answering edit similarity queries rely on a signature scheme to generate candidates given the query string. We observe that the number of signatures generated by existing methods is far greater than the lower bound, and this results in high query time and index space complexities. In this paper, we show that the lower bound on the minimum signature size is τ + 1. We then propose asymmetric signature schemes that achieve this lower bound. We develop efficient query processing algorithms based on the new scheme. Several dynamic programming-based candidate pruning methods are also developed to further speed up query processing. We have conducted a comprehensive experimental study involving nine state-of-the-art algorithms. The experimental results clearly demonstrate the efficiency of our methods.
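To make the query definition above concrete, here is a minimal brute-force baseline in Python. It is not the signature-based method of the paper: it simply verifies every database string against Q with a thresholded dynamic program. The function names `within_edit_distance` and `edit_similarity_search` are hypothetical, chosen for illustration.

```python
def within_edit_distance(a: str, b: str, tau: int) -> bool:
    """Return True if the edit (Levenshtein) distance of a and b is <= tau."""
    # Length filter: each edit changes the length by at most one.
    if abs(len(a) - len(b)) > tau:
        return False
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,            # delete ca
                            curr[j - 1] + 1,        # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        # Row minima never decrease, so the final distance cannot drop below this.
        if min(curr) > tau:
            return False
        prev = curr
    return prev[-1] <= tau


def edit_similarity_search(database, query, tau):
    """Brute-force scan: keep every string within edit distance tau of query."""
    return [s for s in database if within_edit_distance(s, query, tau)]
```

A scan like this verifies every string, which is exactly the cost that candidate generation with signatures is meant to avoid.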
Indexing the Earth Mover’s Distance Using Normal Distributions
Cited by 5 (0 self)
Querying uncertain data sets (represented as probability distributions) presents many challenges due to the large amount of data involved and the difficulty of comparing uncertainty between distributions. The Earth Mover’s Distance (EMD) has increasingly been employed to compare uncertain data due to its ability to effectively capture the differences between two distributions. Computing the EMD entails finding a solution to the transportation problem, which is computationally intensive. In this paper, we propose a new lower bound to the EMD and an index structure to significantly improve the performance of EMD-based K-nearest neighbor (K-NN) queries on uncertain databases. The lower bound approximates the EMD on a projection vector. Each distribution is projected onto a vector and approximated by a normal distribution, together with an accompanying error term. We then represent each normal as a point in a Hough-transformed space and use the concept of stochastic dominance to implement an efficient index structure in the transformed space. We show that our method significantly decreases K-NN query time on uncertain databases. The index structure also scales well with database cardinality. It is well suited for heterogeneous data sets, helping to keep EMD-based queries tractable as uncertain data sets become larger and more complex.
MELODYJOIN: Efficient Earth Mover’s Distance Similarity Joins Using MapReduce
Cited by 3 (1 self)
The Earth Mover’s Distance (EMD) similarity join retrieves pairs of records with EMD below a given threshold. It has a number of important applications such as near-duplicate image retrieval and pattern analysis in probabilistic datasets. However, the computational cost of EMD is super-cubic in the number of bins in the histograms used to represent the data objects. Consequently, the EMD similarity join operation is prohibitive for large datasets. This is the first paper that specifically addresses the EMD similarity join, and we propose to use MapReduce to approach this problem. The MapReduce algorithms designed for generic metric distance similarity joins are inefficient for the EMD similarity join because they involve a large number of distance computations and have unbalanced workloads on reducers when dealing with skewed datasets. We propose a novel framework, named MELODYJOIN, which transforms data into the space of EMD lower bounds and performs pruning and partitioning at a low cost because computing these EMD lower bounds has constant complexity. Furthermore, we address two key problems, the limited pruning power and the unbalanced workloads, by enhancing each phase in the MELODYJOIN framework. We conduct extensive experiments on real datasets. The results show that MELODYJOIN outperforms the state-of-the-art technique by an order of magnitude, scales up better on large datasets than the state-of-the-art technique, and scales out well on distributed machines.
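As a small illustration of the join semantics only, the sketch below runs a naive single-machine EMD similarity join in Python. It assumes the special case of one-dimensional histograms over the same ordered bins, equal total mass, and ground distance |i - j| between bins, where the EMD reduces to the sum of absolute differences of the cumulative distributions. It is not MELODYJOIN and uses neither MapReduce nor lower bounds. The names `emd_1d` and `emd_similarity_join` are hypothetical, and the threshold is treated as inclusive.

```python
from itertools import combinations


def emd_1d(p, q):
    """EMD between two 1D histograms on the same ordered bins.

    Assumes equal total mass and ground distance |i - j| between bins i and j.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have equal total mass")
    carried = 0.0  # surplus mass that must still be moved to later bins
    cost = 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi
        cost += abs(carried)  # moving the surplus one bin costs |surplus|
    return cost


def emd_similarity_join(records, threshold):
    """Naive quadratic join: index pairs (i, j), i < j, with EMD <= threshold."""
    return [(i, j)
            for (i, p), (j, q) in combinations(enumerate(records), 2)
            if emd_1d(p, q) <= threshold]
```

The quadratic number of exact distance computations in this baseline is the cost that pruning with cheap lower bounds is intended to reduce.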
Earth Mover’s Distance based Similarity Search at Scale
Cited by 1 (0 self)
Earth Mover’s Distance (EMD), as a similarity measure, has received a lot of attention in the fields of multimedia and probabilistic databases, computer vision, image retrieval, machine learning, etc. EMD on multidimensional histograms provides better distinguishability between the objects approximated by the histograms (e.g., images) than classic measures like Euclidean distance. Despite its usefulness, EMD has a high computational cost; therefore, a number of effective filtering methods have been proposed to reduce the number of histogram pairs for which the exact EMD has to be computed during similarity search. Still, EMD calculations in the refinement step remain the bottleneck of the whole similarity search process. In this paper, we focus on optimizing the refinement phase of EMD-based similarity search by (i) adapting an efficient min-cost flow algorithm (SIA) for EMD computation, (ii) proposing a dynamic distance bound, which can be used to terminate an EMD refinement early, and (iii) proposing a dynamic refinement order for the candidates which, paired with a concurrent EMD refinement strategy, reduces the amount of needless computation. Our proposed techniques are orthogonal to, and can be easily integrated with, the state-of-the-art filtering techniques, reducing the cost of EMD-based similarity queries by orders of magnitude.
Asymmetric Signature Schemes for Efficient Exact Edit Similarity Query Processing
Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold τ. Most existing methods for answering edit similarity queries employ schemes that generate string subsequences as signatures and generate candidates by set overlap queries on query and data signatures. In this paper, we show that for any such signature scheme, the lower bound of the minimum number of signatures is τ + 1, which is lower than what is achieved by existing methods. We then propose several asymmetric signature schemes, i.e., schemes that extract different numbers of signatures for the data and query strings, which achieve this lower bound. A basic asymmetric scheme is first established on the basis of matching q-chunks and q-grams between two strings. Two efficient query processing algorithms (IndexGram and IndexChunk) are developed on top of this scheme. We also propose novel candidate pruning methods to further improve the efficiency. We then generalize the basic scheme by incorporating novel ideas of floating q-chunks, optimal selection of q-chunks, and reducing the number of signatures using global ordering. As a result, the Super and Turbo families of schemes are developed together with their corresponding query processing algorithms. We have conducted a comprehensive experimental study using the six asymmetric algorithms and nine previous state-of-the-art algorithms. The experimental results clearly showcase the efficiency of our methods and demonstrate the space and time characteristics of our proposed algorithms.
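The idea of matching q-chunks of one string against q-grams of the other can be sketched as a simple count filter in Python. This is a simplified illustration under stated assumptions, not the paper's IndexGram or IndexChunk algorithm: only full-length, disjoint chunks of the data string are used, a trailing partial chunk is dropped instead of padded, and positions are ignored. It relies on the fact that each edit operation can touch at most one disjoint chunk, so every untouched chunk still occurs verbatim in the query as one of its q-grams. The names `q_chunks`, `q_grams`, and `may_match` are hypothetical.

```python
def q_chunks(s, q):
    """Disjoint chunks s[0:q], s[q:2q], ...; a trailing partial chunk is dropped."""
    return [s[i:i + q] for i in range(0, len(s) - q + 1, q)]


def q_grams(s, q):
    """Set of all overlapping substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def may_match(data, query, q, tau):
    """Count filter: False means edit distance(data, query) > tau for certain.

    True only means the pair is a candidate and still needs verification.
    """
    chunks = q_chunks(data, q)
    grams = q_grams(query, q)
    matched = sum(1 for c in chunks if c in grams)
    # At most tau chunks can be destroyed by tau edit operations.
    return matched >= len(chunks) - tau
```

The asymmetry is visible in the signature counts: the data string contributes roughly len(data) / q chunks, while the query contributes len(query) - q + 1 overlapping q-grams.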
Similarity Query Processing for Probabilistic Sets
2013
Evaluating similarity between sets is a fundamental task in computer science. However, there are many applications in which elements in a set may be uncertain for various reasons. Existing work on modeling such probabilistic sets and computing their similarities suffers from huge model sizes or significant similarity evaluation cost, and hence is only applicable to small probabilistic sets. In this paper, we propose a simple yet expressive model that supports many applications where one probabilistic set may have thousands of elements. We define two types of similarities between two probabilistic sets using the possible world semantics; they complement each other in capturing the similarity distributions in the cross product of possible worlds. We design efficient dynamic programming-based algorithms to calculate both types of similarities. Novel individual and batch pruning techniques based on upper bounding the similarity values are also proposed. To accommodate extremely large probabilistic sets, we also design sampling-based approximate query processing methods with strong probabilistic guarantees. We have conducted extensive experiments using both synthetic and real datasets, and demonstrated the effectiveness and efficiency of our proposed methods.
Top-k Queries on Temporal Data
The database community has devoted extensive effort to indexing and querying temporal data in the past decades. However, insufficient attention has been paid to temporal ranking queries. More precisely, given any time instance t, the query asks for the top-k objects at time t with respect to some score attribute. Some generic indexing structures based on R-trees do support ranking queries on temporal data, but as they are not tailored for such queries, the performance is far from satisfactory. We present the Seb-tree, a simple indexing scheme that supports temporal ranking queries much more efficiently. The Seb-tree answers a top-k query for any time instance t in the optimal number of I/Os in expectation, namely, $O(\log_B \frac{N}{B} + \frac{k}{B})$ I/Os, where N is the size of the data set and B is the disk block size. The index has near-linear size (for constant and reasonable $k_{\max}$ values, where $k_{\max}$ is the maximum possible value of the query parameter k), can be constructed in near-linear time, and also supports insertions and deletions without affecting its query performance guarantee. Most of all, the Seb-tree is especially appealing in practice due to its simplicity, as it uses the B-tree as the only building block. Extensive experiments on a number of large data sets show that the Seb-tree is more than an order of magnitude faster than the R-tree based indexes for temporal ranking queries.
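The temporal ranking query itself can be illustrated with a brute-force baseline in Python. This sketch is not the index described in the abstract: it scans every object for each query. It also assumes, purely for illustration, that each object's score is a step function given by time-sorted (time, score) samples, with the score at t taken from the latest sample at or before t. The names `score_at` and `top_k_at` are hypothetical.

```python
import heapq
from bisect import bisect_right


def score_at(samples, t):
    """Score at time t for samples sorted by time, or None before the first one.

    Assumption: the score is a step function that holds the latest sample value.
    """
    times = [ts for ts, _ in samples]
    idx = bisect_right(times, t)  # number of samples with time <= t
    return samples[idx - 1][1] if idx else None


def top_k_at(objects, t, k):
    """Names of the k highest-scoring objects at time t, best first (linear scan)."""
    scored = []
    for name, samples in objects.items():
        s = score_at(samples, t)
        if s is not None:
            scored.append((s, name))
    return [name for _, name in heapq.nlargest(k, scored)]
```

Every query here touches all N objects, which is the behaviour a dedicated index with a logarithmic search term plus output-sensitive cost is designed to avoid.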
Efficient Similarity Join Based on Earth Mover’s Distance Using MapReduce
Earth Mover’s Distance (EMD) evaluates the similarity between probability distributions and is known as a robust measure that is more consistent with human similarity perception than traditional similarity functions. EMD-based similarity join retrieves pairs of probability distributions with EMD below a specified threshold, supporting many important applications, such as duplicate image retrieval and sensor pattern recognition. This paper studies the possibility of using MapReduce to improve the scalability of EMD similarity join. While existing MapReduce optimization techniques mainly aim to minimize the communication overhead, such methods are not applicable to our problem, due to the high computational cost of EMD. Utilizing the dual-program mapping technique, we present a new general data partition framework to facilitate effective workload decomposition using MapReduce, ensuring that distributions that are similar in terms of EMD are mapped to the same reduce task for further verification. New optimization strategies are also proposed to balance the workloads among reduce tasks and eliminate a large number of unnecessary EMD evaluations. Our experiments verify the superiority of our proposal in system efficiency, with an advantage of at least one order of magnitude over the state-of-the-art solution, and in system effectiveness, with a real case study of the abused-image phenomenon on the most popular C2C website in China.
Optimal Spatial Dominance: An Effective Search of Nearest Neighbor Candidates
In many domains such as computational geometry and database management, an object may be described by multiple instances (points). The distance (or similarity) between two objects is then captured by the pairwise distances among their instances. In the past, numerous nearest neighbor (NN) functions have been proposed to define the distance between objects with multiple instances and to identify the NN object. Nevertheless, considering that a user may not have a specific NN function in mind, it is desirable to provide her with a set of NN candidates. Ideally, the set of NN candidates must include every object that is the NN for at least one of the NN functions and must exclude every non-promising object. However, no one has studied the problem of NN candidate computation from this perspective. Al