Properties of embedding methods for similarity searching in metric spaces
- PAMI, 2003
Cited by 109 (5 self)
Complex data types, such as images, documents, and DNA sequences, are becoming increasingly important in modern database applications. A typical query in many of these applications seeks to find objects that are similar to some target object, where (dis)similarity is defined by some distance function. Often, the cost of evaluating the distance between two objects is very high. Thus, the number of distance evaluations should be kept at a minimum, while (ideally) maintaining the quality of the result. One way to approach this goal is to embed the data objects in a vector space so that the distances of the embedded objects approximate the actual distances. Thus, queries can be performed (for the most part) on the embedded objects. In this paper, we are especially interested in examining the issue of whether or not the embedding methods will ensure that no relevant objects are left out (i.e., there are no false dismissals and, hence, the correct result is reported). Particular attention is paid to the SparseMap, FastMap, and MetricMap embedding methods. SparseMap is a variant of Lipschitz embeddings, while FastMap and MetricMap are inspired by dimension reduction methods for Euclidean spaces (using KLT or the related PCA and SVD). We show that, in general, none of these embedding methods guarantees that queries on the embedded objects have no false dismissals, while also demonstrating the limited cases in which the guarantee does hold. Moreover, we describe a variant of SparseMap that allows queries with no false dismissals. In addition, we show that with FastMap and MetricMap, the distances of the embedded objects can be much greater than the actual distances. This makes it impossible (or at least impractical) to modify FastMap and MetricMap to guarantee no false dismissals.
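The no-false-dismissals condition in this abstract can be illustrated with a plain Lipschitz embedding. The sketch below is not the paper's SparseMap variant; it is a minimal, generic example. It assumes that each coordinate is the distance from an object to the nearest member of a reference set and that embedded vectors are compared with the Chebyshev (L∞) distance. Under those assumptions the triangle inequality makes the embedded distance a lower bound on the true distance, so filtering a range query in the embedded space cannot drop a true answer. The names `manhattan`, `lipschitz_embed`, `range_candidates`, and the sample data are hypothetical.

```python
from itertools import combinations

def manhattan(p, q):
    """Stand-in for an expensive metric; any true metric could be used."""
    return sum(abs(a - b) for a, b in zip(p, q))

def lipschitz_embed(obj, reference_sets, dist):
    """Coordinate i is the distance from obj to the nearest member of reference set i."""
    return [min(dist(obj, r) for r in ref) for ref in reference_sets]

def chebyshev(u, v):
    """L-infinity distance between two embedded vectors."""
    return max(abs(a - b) for a, b in zip(u, v))

def range_candidates(query, radius, objects, reference_sets, dist):
    """Filter step: keep objects whose embedded distance to the query is within radius."""
    q_vec = lipschitz_embed(query, reference_sets, dist)
    return [o for o in objects
            if chebyshev(q_vec, lipschitz_embed(o, reference_sets, dist)) <= radius]

# Hypothetical sample data.
objects = [(0, 0), (3, 4), (6, 1), (2, 7), (5, 5)]
reference_sets = [[(0, 0)], [(6, 1), (2, 7)], [(3, 4), (5, 5), (0, 0)]]

# Contractiveness: the embedded distance never exceeds the true distance.
contractive = all(
    chebyshev(lipschitz_embed(a, reference_sets, manhattan),
              lipschitz_embed(b, reference_sets, manhattan)) <= manhattan(a, b)
    for a, b in combinations(objects, 2)
)

query, radius = (4, 4), 3
true_answers = [o for o in objects if manhattan(query, o) <= radius]
candidates = range_candidates(query, radius, objects, reference_sets, manhattan)
```

The candidate list may contain false hits, which a refinement step would remove by evaluating the true distance, but every true answer survives the filter.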
Location-based Spatial Queries
- In SIGMOD, 2003
Cited by 108 (12 self)
In this paper we propose an approach that enables mobile clients to determine the validity of previous queries based on their current locations. To make this possible, the server returns, in addition to the query result, a validity region around the client's location within which the result remains the same. We focus on two of the most common spatial query types, namely nearest neighbor and window queries, define the validity region in each case, and propose the corresponding query processing algorithms. In addition, we provide analytical models for estimating the expected size of the validity region. Our techniques can significantly reduce the number of queries issued to the server, while introducing minimal computational and network overhead compared to traditional spatial queries.
Top-k Query Evaluation with Probabilistic Guarantees
- In VLDB, 2004
Cited by 105 (16 self)
Top-k queries based on ranking elements of multidimensional datasets are a fundamental building block for many kinds of information discovery. The best known general-purpose algorithm for evaluating top-k queries is Fagin’s threshold algorithm (TA). Since the user’s goal behind top-k queries is to identify one or a few relevant and novel data items, it is intriguing to use approximate variants of TA to reduce run-time costs. This paper introduces a family of approximate top-k algorithms based on probabilistic arguments. When scanning index lists of the underlying multidimensional data space in descending order of local scores, various forms of convolution and derived bounds are employed to predict when it is safe, with high probability, to drop candidate items and to prune the index scans. The precision and the efficiency of the developed methods are experimentally evaluated based on a large Web corpus and a structured data collection.
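For orientation, the exact threshold algorithm (TA) that this abstract takes as its baseline can be sketched as follows. This is the classic exact TA, not the paper's probabilistic variants. The sketch assumes that every item appears in every index list, that the aggregate score is the sum of local scores, and that random access is a dictionary lookup. The function name and sample lists are hypothetical.

```python
import heapq

def threshold_algorithm(lists, k):
    """Exact TA over index lists given as dicts mapping item -> local score.

    Assumes every item occurs in every list and scores are aggregated by summation.
    Returns the k best (item, total_score) pairs, best first.
    """
    # Sorted access order: each list in descending order of local score.
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    totals = {}  # full scores of all items seen so far
    for depth in range(len(sorted_lists[0])):
        threshold = 0
        for sl in sorted_lists:
            item, score = sl[depth]
            threshold += score  # best total any unseen item could still reach
            if item not in totals:
                # Random access: look up the item's score in every list.
                totals[item] = sum(l[item] for l in lists)
        top = heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
        # Stop once k seen items score at least as high as the threshold.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

# Hypothetical index lists with integer local scores.
index_lists = [
    {"a": 9, "b": 8, "c": 1, "d": 3},
    {"a": 2, "b": 7, "c": 9, "d": 4},
]
result = threshold_algorithm(index_lists, k=2)
```

The approximate variants described in the abstract relax the stopping test by predicting, with high probability, when a candidate can no longer reach the top k; the sketch above keeps only the deterministic test.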
Conceptual partitioning: an efficient method for continuous nearest neighbor monitoring
- In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, 2005
Cited by 104 (18 self)
Given a set of objects P and a query point q, a k nearest neighbor (k-NN) query retrieves the k objects in P that lie closest to q. Even though the problem is well studied for static datasets, the traditional methods do not extend to highly dynamic environments where multiple continuous queries require real-time results, and both objects and queries receive frequent location updates. In this paper we propose conceptual partitioning (CPM), a comprehensive technique for the efficient monitoring of continuous NN queries. CPM achieves low running time by handling location updates only from objects that fall in the vicinity of some query (and ignoring the rest). It can be used with multiple, static or moving queries, and it does not make any assumptions about the object movement patterns. We analyze the performance of CPM and show that it outperforms the current state-of-the-art algorithms for all problem settings. Finally, we extend our framework to aggregate NN (ANN) queries, which monitor the data objects that minimize the aggregate distance with respect to a set of query points (e.g., the objects with the minimum sum of distances to all query points).
SEA-CNN: Scalable processing of continuous k-nearest neighbor queries in spatio-temporal databases
- In ICDE, 2005
Cited by 97 (6 self)
Location-aware environments are characterized by a large number of objects and a large number of continuous queries. Both the objects and continuous queries may change their locations over time. In this paper, we focus on continuous k-nearest neighbor queries (CKNN, for short). We present a new algorithm, termed SEA-CNN, for continuously answering a collection of concurrent CKNN queries. SEA-CNN has two important features: incremental evaluation and shared execution. SEA-CNN achieves both efficiency and scalability in the presence of a set of concurrent queries. Furthermore, SEA-CNN does not make any assumptions about the movement of objects, e.g., the objects' velocities and shapes of trajectories, or about the mutability of the objects and/or the queries, i.e., moving or stationary queries issued on moving or stationary objects. We provide theoretical analysis of SEA-CNN with respect to the execution costs, memory requirements and effects of tunable parameters. Comprehensive experimentation shows that SEA-CNN is highly scalable and is more efficient in terms of both I/O and CPU costs in comparison to other R-tree-based CKNN techniques.
Selecting Stars: The k Most Representative Skyline Operator
- In Proc. of the Int. IEEE Conf. on Data Engineering (ICDE), 2007
Cited by 93 (3 self)
Skyline computation has many applications including multi-criteria decision making. In this paper, we study the problem of selecting k skyline points so that the number of points dominated by at least one of these k skyline points is maximized. We first present an efficient dynamic-programming-based exact algorithm for 2-D space. Then, we show that the problem is NP-hard when the dimensionality is 3 or more and that it can be approximately solved by a polynomial time algorithm with the guaranteed approximation ratio 1 − 1/e. To speed up the computation, an efficient, scalable, index-based randomized algorithm is developed by applying the FM probabilistic counting technique. A comprehensive performance evaluation demonstrates that our randomized technique is very efficient, highly accurate, and scalable.
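The selection problem in this abstract is an instance of maximum coverage: each skyline point covers the points it dominates, and k skyline points are chosen to cover as many points as possible. The abstract does not say which algorithm attains the 1 − 1/e ratio; the sketch below uses the standard greedy heuristic for maximum coverage, which is known to attain that ratio, and should be read as an illustration rather than the paper's method. It assumes smaller values are better in every dimension and uses quadratic-time dominance checks. All names and the sample points are hypothetical.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly better in one.

    Assumes smaller values are better.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Points not dominated by any other point (naive quadratic scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def greedy_representatives(points, k):
    """Pick up to k skyline points, each time adding the one that covers the most new points."""
    sky = skyline(points)
    coverage = {s: {p for p in points if dominates(s, p)} for s in sky}
    chosen, covered = [], set()
    for _ in range(min(k, len(sky))):
        remaining = [s for s in sky if s not in chosen]
        best = max(remaining, key=lambda s: len(coverage[s] - covered))
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered

# Hypothetical 2-D data set.
points = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4), (2, 6), (6, 2), (1, 7)]
chosen, covered = greedy_representatives(points, k=2)
```

On this data the skyline is (1, 5), (2, 2), (5, 1). The greedy pass first takes (2, 2), which dominates four points, and then (1, 5), which adds the one point not yet covered.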
A Road Network Embedding Technique for k-Nearest Neighbor Search in Moving Object Databases
- GeoInformatica, 2002
Cited by 90 (5 self)
A very important class of queries in GIS applications is the class of K-nearest neighbor queries. Most of the current studies on the K-nearest neighbor queries utilize spatial index structures and hence are based on the Euclidean distances between the points. In real-world road networks, however, the shortest distance between two points depends on the actual path connecting the points and cannot be computed accurately using one of the Minkowski metrics. Thus, the Euclidean distance may not properly approximate the real distance. In this paper, we apply an embedding technique to transform a road network to a high dimensional space in order to utilize computationally simple Minkowski metrics for distance measurement. Subsequently, we extend our approach to dynamically transform new points into the embedding space. Finally, we propose an efficient technique that can find the actual shortest path between two points in the original road network using only the embedding space. Our empirical experiments indicate that the Chessboard distance metric (L∞) in the embedding space preserves the ordering of the distances between a point and its neighbors more precisely as compared to the Euclidean distance in the original road network.
Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects
Cited by 88 (16 self)
The conventional Internet is acquiring a geo-spatial dimension. Web documents are being geo-tagged, and geo-referenced objects such as points of interest are being associated with descriptive text documents. The resulting fusion of geo-location and documents enables a new kind of top-k query that takes into account both location proximity and text relevancy. To our knowledge, only naive techniques exist that are capable of computing a general web information retrieval query while also taking location into account. This paper proposes a new indexing framework for location-aware top-k text retrieval. The framework leverages the inverted file for text retrieval and the R-tree for spatial proximity querying. Several indexing approaches are explored within the framework. The framework encompasses algorithms that utilize the proposed indexes for computing the top-k query, thus taking into account both text relevancy and location proximity to prune the search space. Results of empirical studies with an implementation of the framework demonstrate that the paper’s proposal offers scalability and is capable of excellent performance.
Stabbing the sky: Efficient skyline computation over sliding windows
- In ICDE, 2005
Cited by 87 (8 self)
We consider the problem of efficiently computing the skyline against the most recent N elements in a data stream seen so far. Specifically, we study the n-of-N skyline queries; that is, computing the skyline for the most recent n (∀n ≤ N) elements. First, we develop an effective pruning technique to minimize the number of elements to be kept. It can be shown that on average storing only O(log^d N) elements from the most recent N elements is sufficient to support the precise computation of all n-of-N skyline queries in a d-dimension space if the data distribution on each dimension is independent. Second, a novel encoding scheme is proposed, together with efficient update techniques, for the stored elements, so that computing an n-of-N skyline query in a d-dimension space takes O(log N + s) time, which is reduced to O(d log log N + s) if the data distribution is independent, where s is the number of skyline points. Third, a novel trigger-based technique is provided to process continuous n-of-N skyline queries with O(δ) time to update the current result per new data element and O(log s) time to update the trigger list per result change, where δ is the number of element changes from the current result to the new result. Finally, we extend our techniques to computing the skyline against an arbitrary window in the most recent N elements. Besides theoretical performance guarantees, our extensive experiments demonstrate that the new techniques can support on-line skyline query computation over very rapid data streams.
Scalable Network Distance Browsing in Spatial Databases
- 2008
Cited by 84 (10 self)
An algorithm is presented for finding the k nearest neighbors in a spatial network in a best-first manner using network distance. The algorithm is based on precomputing the shortest paths between all possible vertices in the network and then making use of an encoding that takes advantage of the fact that the shortest paths from vertex u to all of the remaining vertices can be decomposed into subsets based on the first edges on the shortest paths to them from u. Thus, in the worst case, the amount of work depends on the number of objects that are examined and the number of links on the shortest paths to them from q, rather than depending on the number of vertices in the network. The amount of storage required to keep track of the subsets is reduced by taking advantage of their spatial coherence which is captured by the aid of a shortest path quadtree. In particular, experiments on a number of large road networks as