Results 1–10 of 31
Index-driven similarity search in metric spaces
ACM Transactions on Database Systems, 2003
Cited by 192 (8 self)
Abstract
Similarity search is a very important operation in multimedia databases and other database applications involving complex objects; it involves finding objects in a data set S similar to a query object q, based on some similarity measure. In this article, we focus on methods for similarity search that make the general assumption that similarity is represented with a distance metric d. Existing methods for handling similarity search in this setting typically fall into one of two classes. The first directly indexes the objects based on distances (distance-based indexing), while the second is based on mapping to a vector space (mapping-based approach). The main part of this article is dedicated to a survey of distance-based indexing methods, but we also briefly outline how search occurs in mapping-based methods. We also present a general framework for performing search based on distances, and present algorithms for common types of queries that operate on an arbitrary “search hierarchy.” These algorithms can be applied to each of the methods presented, provided a suitable search hierarchy is defined.
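The triangle-inequality pruning at the heart of distance-based indexing can be sketched in a few lines of Python. The single pivot, the 1-D metric, and all names here are illustrative assumptions, not the article's framework:

```python
def range_search(data, q, r, dist, pivot, pivot_dists):
    """Range query with triangle-inequality pruning through one pivot:
    |d(q,p) - d(p,x)| is a lower bound on d(q,x), so x can be
    discarded without ever computing d(q,x)."""
    dq_p = dist(q, pivot)
    results = []
    for x, dp_x in zip(data, pivot_dists):
        if abs(dq_p - dp_x) > r:
            continue  # pruned by the lower bound alone
        if dist(q, x) <= r:
            results.append(x)
    return results

# Toy 1-D metric; pivot distances are precomputed once at index-build time.
dist = lambda a, b: abs(a - b)
data = [1.0, 2.5, 7.0, 9.0]
pivot = 0.0
pivot_dists = [dist(pivot, x) for x in data]
print(range_search(data, 3.0, 1.0, dist, pivot, pivot_dists))  # [2.5]
```

Real distance-based indexes use many pivots arranged in a hierarchy, but the pruning test is the same.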
Scalable Network Distance Browsing in Spatial Databases
, 2008
Cited by 84 (10 self)
Abstract
An algorithm is presented for finding the k nearest neighbors in a spatial network in a best-first manner using network distance. The algorithm is based on precomputing the shortest paths between all possible vertices in the network and then making use of an encoding that takes advantage of the fact that the shortest paths from vertex u to all of the remaining vertices can be decomposed into subsets based on the first edges on the shortest paths to them from u. Thus, in the worst case, the amount of work depends on the number of objects that are examined and the number of links on the shortest paths to them from q, rather than depending on the number of vertices in the network. The amount of storage required to keep track of the subsets is reduced by taking advantage of their spatial coherence, which is captured with the aid of a shortest-path quadtree. In particular, experiments on a number of large road networks as ...
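The best-first expansion by network distance that the algorithm builds on can be sketched with a plain Dijkstra-style loop. This omits the paper's precomputed shortest-path encoding and quadtree; the tiny graph and names are illustrative:

```python
import heapq

def network_knn(adj, objects, q, k):
    """Best-first k-nearest neighbors by network distance: expand vertices
    in increasing distance from q and report the first k object vertices.
    adj: {vertex: [(neighbor, edge_weight), ...]}; objects: vertex set."""
    heap = [(0.0, q)]
    seen = {}
    result = []
    while heap and len(result) < k:
        d, v = heapq.heappop(heap)
        if v in seen:
            continue
        seen[v] = d
        if v in objects:
            result.append((v, d))
        for w, wgt in adj.get(v, []):
            if w not in seen:
                heapq.heappush(heap, (d + wgt, w))
    return result

adj = {'q': [('a', 1), ('b', 4)],
       'a': [('q', 1), ('b', 2)],
       'b': [('q', 4), ('a', 2)]}
print(network_knn(adj, {'a', 'b'}, 'q', 2))  # [('a', 1.0), ('b', 3.0)]
```

The paper's contribution is avoiding this vertex-by-vertex expansion by precomputing and compactly encoding all shortest paths.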
Keyword search in spatial databases: towards searching by document.
In ICDE, 2009
Cited by 44 (2 self)
Abstract
This work addresses a novel spatial keyword query called the m-closest keywords (mCK) query. Given a database of spatial objects, each tuple is associated with some descriptive information represented in the form of keywords. The mCK query aims to find the spatially closest tuples which match m user-specified keywords. Given a set of keywords from a document, the mCK query can be very useful in geotagging the document by comparing the keywords to other geotagged documents in a database. To answer mCK queries efficiently, we introduce a new index called the bR*-tree, which is an extension of the R*-tree. Based on the bR*-tree, we exploit a priori-based search strategies to effectively reduce the search space. We also propose two monotone constraints, namely the distance mutex and keyword mutex, as our a priori properties to facilitate effective pruning. Our performance study demonstrates that our search strategy is indeed efficient in reducing query response time and demonstrates remarkable scalability in terms of the number of query keywords, which is essential for our main application of searching by document.
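For concreteness, a brute-force baseline for the mCK query (minimizing the diameter of one matching tuple per keyword) might look as follows; the bR*-tree and its pruning rules are what make the real algorithm scale, and everything named here is an illustrative assumption:

```python
import math
from itertools import product

def mck_bruteforce(tuples, keywords):
    """Naive m-closest-keywords query: choose one tuple per keyword so that
    the maximum pairwise distance (diameter) of the chosen set is minimal.
    tuples: list of (x, y, {keywords})."""
    candidates = [[t for t in tuples if kw in t[2]] for kw in keywords]
    best, best_diam = None, math.inf
    for combo in product(*candidates):
        diam = max((math.dist(a[:2], b[:2]) for a in combo for b in combo),
                   default=0.0)
        if diam < best_diam:
            best, best_diam = combo, diam
    return best, best_diam

tuples = [(0, 0, {'pizza'}), (1, 0, {'bar'}), (5, 5, {'bar'})]
best, diam = mck_bruteforce(tuples, ['pizza', 'bar'])
print(diam)  # 1.0
```

The combinatorial blow-up of `product` is exactly what the distance-mutex and keyword-mutex constraints prune away.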
Relaxing join and selection queries
In VLDB ’06: Proceedings of the 32nd International Conference on Very Large Data Bases, 2006
Cited by 37 (4 self)
Abstract
Database users can be frustrated by having an empty answer to a query. In this paper, we propose a framework to systematically relax queries involving joins and selections. When considering relaxing a query condition, intuitively one seeks the 'minimal' amount of relaxation that yields an answer. We first characterize the types of answers that we return to relaxed queries. We then propose a lattice-based framework to aid query relaxation. Nodes in the lattice correspond to different ways to relax queries. We characterize the properties of relaxation at each node and present algorithms to compute the corresponding answer. We then discuss how to traverse this lattice in a way that a non-empty query answer is obtained with the minimum amount of query condition relaxation. We implemented this framework and we present the results of a thorough performance evaluation using real and synthetic data. Our results indicate the practical utility of our framework.
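A toy version of the lattice traversal, under the simplifying assumption that "relaxing" a condition means dropping it entirely (the paper relaxes conditions more gradually), could look like:

```python
from itertools import combinations

def relax_query(rows, predicates):
    """Traverse the relaxation lattice level by level: drop the fewest
    predicates that still yields a non-empty answer. Returns the answer
    and the indices of the predicates that had to be relaxed."""
    n = len(predicates)
    for dropped in range(n + 1):  # lattice levels: 0 relaxed, 1 relaxed, ...
        for keep in combinations(range(n), n - dropped):
            answer = [r for r in rows if all(predicates[i](r) for i in keep)]
            if answer:
                return answer, [i for i in range(n) if i not in keep]
    return [], list(range(n))

rows = [{'price': 90, 'rating': 3}, {'price': 150, 'rating': 5}]
preds = [lambda r: r['price'] <= 100, lambda r: r['rating'] >= 4]
ans, relaxed = relax_query(rows, preds)  # no row satisfies both predicates
```

Here the conjunction is empty, so the traversal moves up one lattice level and finds that dropping the rating predicate suffices.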
Algorithms for processing K-closest-pair queries in spatial databases
, 2004
Cited by 17 (5 self)
Abstract
This paper addresses the problem of finding the K closest pairs between two spatial datasets (the so-called K closest pairs query, KCPQ), where each dataset is stored in an R-tree. There are two different techniques for solving this kind of distance-based query. The first technique is the incremental approach, which returns the output elements one by one in ascending order of distance. The second is the non-incremental alternative, which returns the K elements of the result all together at the end of the algorithm. In this paper, based on distance functions between two MBRs in the multidimensional Euclidean space, we propose a pruning heuristic and two updating strategies for minimizing the pruning distance, and use them in the design of three non-incremental branch-and-bound algorithms for KCPQ between spatial objects stored in two R-trees. Two of those approaches are recursive, following a depth-first searching strategy, and one is iterative, obeying a best-first traversal policy. The plane-sweep method and search ordering are used as optimization techniques for improving the naive approaches. In addition, a number of interesting extensions of the KCPQ (K-Self-CPQ, Semi-CPQ, K-FPQ (the K farthest pairs query), etc.) are discussed. An extensive performance study is also presented. This study is based on experiments performed with real datasets. A wide range of values for the basic parameters affecting the performance of the algorithms is examined in order to designate the most efficient algorithm for each setting of parameter values. Finally, an experimental study of the behavior of the proposed KCPQ branch-and-bound algorithms in terms of scalability of the dataset size and the K value is also included.
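As a reference point, the MBR lower bound and a naive non-incremental KCPQ over plain point sets can be sketched as follows (the R-tree machinery and the paper's specific pruning heuristics are omitted; names are illustrative):

```python
import heapq
import math

def mindist(m1, m2):
    """Minimum distance between two MBRs ((xlo, ylo), (xhi, yhi)): the
    lower bound used to prune node pairs in branch-and-bound KCPQ."""
    (ax0, ay0), (ax1, ay1) = m1
    (bx0, by0), (bx1, by1) = m2
    dx = max(bx0 - ax1, ax0 - bx1, 0)
    dy = max(by0 - ay1, ay0 - by1, 0)
    return math.hypot(dx, dy)

def k_closest_pairs(R, S, k):
    """Naive non-incremental KCPQ over two point sets (no R-trees):
    a max-heap of size k keeps the k smallest distances seen so far;
    its root plays the role of the pruning distance."""
    heap = []  # entries (-distance, p, q) -> max-heap via negation
    for p in R:
        for q in S:
            d = math.dist(p, q)
            if len(heap) < k:
                heapq.heappush(heap, (-d, p, q))
            elif -d > heap[0][0]:
                heapq.heapreplace(heap, (-d, p, q))
    return sorted((-nd, p, q) for nd, p, q in heap)
```

The branch-and-bound algorithms in the paper avoid this quadratic scan by descending the two R-trees and discarding any node pair whose `mindist` exceeds the current pruning distance.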
A Bayesian Method for Guessing the Extreme Values in a Data Set
Cited by 14 (2 self)
Abstract
For a large number of data management problems, it would be very useful to be able to obtain a few samples from a data set, and to use the samples to guess the largest (or smallest) value in the entire data set. Min/max online aggregation, top-k query processing, outlier detection, and distance joins are just a few possible applications. This paper details a statistically rigorous, Bayesian approach to attacking this problem. Just as importantly, we demonstrate the utility of our approach by showing how it can be applied to two specific problems that arise in the context of data management.
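As a rough illustration of the task (not the paper's Bayesian machinery), a classical estimator scales the sample maximum up by (n+1)/n, the "German tank" style correction appropriate when the sample is drawn uniformly:

```python
def estimate_max(sample):
    """Classical stand-in for guessing a data set's maximum from a uniform
    sample: the sample maximum underestimates the true maximum, so scale
    it up by (n + 1) / n."""
    n = len(sample)
    return max(sample) * (n + 1) / n

print(estimate_max([3.0, 7.0, 5.0, 9.0]))  # 11.25
```

The paper replaces this single point estimate with a full posterior over the extreme value, which is what makes the guess statistically rigorous for skewed, non-uniform data.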
Distance join queries on spatial networks
In Proc. of the 14th ACM International Symposium on Geographic Information Systems (GIS), 2006
Cited by 10 (4 self)
Abstract
The result of a distance join operation on two sets of objects R, S on a spatial network G is a set P of object pairs <p, q>, p ∈ R, q ∈ S, such that the distance of an object pair <p, q> is the shortest distance from p to q in G. Several variations of the distance join operation, such as unordered, incremental, top-k, and semi-join, impose additional constraints on the distance between the object pairs in P, the ordering of object pairs in P, and the cardinality of P. A distance join algorithm on spatial networks is proposed that works in conjunction with the SILC framework, a new approach to query processing on spatial networks. Experimental results demonstrate up to an order of magnitude speedup when compared with a prominent existing technique.
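A naive top-k distance join, running one Dijkstra per object in R, gives a baseline for what the SILC-based algorithm accelerates (the graph and names here are illustrative assumptions):

```python
import heapq

def shortest_dist(adj, src):
    """Single-source network distances (plain Dijkstra).
    adj: {vertex: [(neighbor, edge_weight), ...]}."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float('inf')):
            continue
        for w, wgt in adj.get(v, []):
            nd = d + wgt
            if nd < dist.get(w, float('inf')):
                dist[w] = nd
                heapq.heappush(heap, (nd, w))
    return dist

def topk_distance_join(adj, R, S, k):
    """Top-k distance join: the k pairs <p, q> with the smallest network
    distance, computed by one Dijkstra per object in R."""
    pairs = []
    for p in R:
        d = shortest_dist(adj, p)
        for q in S:
            if q in d:
                pairs.append((d[q], p, q))
    return sorted(pairs)[:k]

adj = {'a': [('b', 1), ('c', 5)],
       'b': [('a', 1), ('c', 2)],
       'c': [('a', 5), ('b', 2)]}
print(topk_distance_join(adj, ['a'], ['b', 'c'], 1))  # [(1.0, 'a', 'b')]
```

SILC's precomputed path encodings let the real algorithm obtain such distances without repeating the graph traversal per object.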
NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms
In EDBT, 2003
Cited by 8 (3 self)
Abstract
Efficient search for nearest neighbors (NN) is a fundamental problem arising in a large variety of applications of vast practical interest. In this paper we propose a novel technique, called NNH ("Nearest Neighbor Histograms"), which uses specific histogram structures to improve the performance of NN search algorithms. A primary feature of our proposal is that such histogram structures can coexist with a plethora of NN search algorithms without the need to substantially modify them. The main idea behind our proposal is to choose a small number of pivot objects in the space, and precalculate the distances to their nearest neighbors. We provide a complete specification of such histogram structures and show how to make use of the information they provide toward more effective searching. In particular, we show how to construct them, how to decide the number of pivots, how to choose pivot objects, how to incrementally maintain them under dynamic updates, and how to utilize them in conjunction with a variety of NN search algorithms to improve the performance of NN searches. Our intensive experiments show that nearest neighbor histograms can be efficiently constructed and maintained, and when used in conjunction with a variety of algorithms for NN search, they can improve performance dramatically.
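The pivot idea admits a compact sketch: by the triangle inequality, d(q, p) plus the i-th NN distance of a pivot p upper-bounds the i-th NN distance of q, since at least i objects must lie within that radius of q. A minimal illustration (the 1-D metric and all values are assumed, not from the paper):

```python
def nnh_upper_bound(q, pivots, pivot_nn, i, dist):
    """NNH-style upper bound on q's i-th NN distance: for each pivot p,
    d(q, p) + (p's i-th NN distance) is a valid radius containing at
    least i objects; the tightest pivot gives the bound."""
    return min(dist(q, p) + nn[i] for p, nn in zip(pivots, pivot_nn))

dist = lambda a, b: abs(a - b)       # toy 1-D metric
pivots = [0.0, 10.0]
pivot_nn = [[1.0, 2.0], [1.0, 3.0]]  # precomputed NN-distance vectors
print(nnh_upper_bound(9.0, pivots, pivot_nn, 0, dist))  # 2.0
```

An NN algorithm can start its search with this bound as the initial radius instead of infinity, which is how the histograms speed up unmodified search algorithms.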
Efficient Similarity String Joins in Large Data Sets
, 2002
Cited by 4 (0 self)
Abstract
Many emerging database applications require very efficient mechanisms to perform similarity joins. While similarity joins have been extensively explored in the literature, the problem of similarity string joins has received comparatively less attention. The string-join problem can be specified as follows: given two large sets of strings, a distance metric (e.g., edit distance) between strings, and a predefined threshold k, how do we find all the pairs from the two sets such that their distance is no greater than k? Such a problem arises naturally in a variety of applications such as data cleansing and data integration. We propose a novel two-step process for string joins.
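The length filter commonly paired with edit-distance joins illustrates the kind of cheap pruning such a process relies on: if two strings' lengths differ by more than k, their edit distance must exceed k. A sketch of this filter-then-verify pattern (not the paper's specific two-step method):

```python
def edit_distance(s, t):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def string_join(A, B, k):
    """Similarity string join with the length filter: a length gap greater
    than k already implies edit distance > k, so skip the expensive
    verification for those pairs."""
    out = []
    for s in A:
        for t in B:
            if abs(len(s) - len(t)) > k:
                continue  # pruned by the length filter
            if edit_distance(s, t) <= k:
                out.append((s, t))
    return out

print(string_join(["color"], ["colour", "cat"], 1))  # [('color', 'colour')]
```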
Supporting Efficient Record Linkage for Large Data Sets Using Mapping Techniques
Cited by 4 (0 self)
Abstract
This paper describes an efficient approach to record linkage. Given two lists of records, the record-linkage problem consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on domain-specific similarities over individual attributes. The record-linkage problem arises naturally in the context of data cleansing that usually precedes data analysis and mining. Since the scalability issue of record linkage was addressed in [21], the repertoire of database techniques dealing with multidimensional data sets has significantly increased. Specifically, many effective and efficient approaches for distance-preserving transforms and similarity joins have been developed. Based on these advances, we explore a novel approach to record linkage. For each attribute of the records, we first map values to a multidimensional Euclidean space that preserves domain-specific similarity. Many mapping algorithms can be applied, and we use the Fastmap approach [16] as an example. Given the merging rule that defines when two records are similar based on their attribute-level similarities, a set of attributes is chosen along which the merge will proceed. A multidimensional similarity join over the chosen attributes is used to find similar pairs of records. Our extensive experiments using real data sets show that our solution has very good efficiency and recall.
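A single FastMap projection uses only the distance function and two pivot objects (via the law of cosines). A minimal sketch with an assumed 1-D distance, where the projection recovers each coordinate exactly:

```python
def fastmap_axis(objects, dist, a, b):
    """One FastMap projection step: map each object to a coordinate on the
    'axis' through pivot objects a and b, using only pairwise distances
    (cosine-law formula). Repeated with fresh pivots on the residual
    distances, this builds a multidimensional Euclidean embedding."""
    dab = dist(a, b)
    return {x: (dist(a, x) ** 2 + dab ** 2 - dist(b, x) ** 2) / (2 * dab)
            for x in objects}

# With a 1-D metric and the extreme points as pivots, the projection
# reproduces the original coordinates.
coords = fastmap_axis([0, 2, 5, 10], lambda u, v: abs(u - v), 0, 10)
print(coords)  # {0: 0.0, 2: 2.0, 5: 5.0, 10: 10.0}
```

In the record-linkage setting the distance would be a domain-specific string similarity, so the embedded coordinates can feed an ordinary multidimensional similarity join.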