Results 1–10 of 14
Efficient processing of k nearest neighbor joins using MapReduce
2012
Abstract
Cited by 31 (2 self)
k nearest neighbor join (kNN join), designed to find k nearest neighbors from a dataset S for every object in another dataset R, is a primitive operation widely adopted by many data mining applications. As a combination of the k nearest neighbor query and the join operation, kNN join is an expensive operation. Given the increasing volume of data, it is difficult to perform a kNN join on a centralized machine efficiently. In this paper, we investigate how to perform kNN join using MapReduce, a well-accepted framework for data-intensive applications over clusters of computers. In brief, the mappers cluster objects into groups; the reducers perform the kNN join on each group of objects separately. We design an effective mapping mechanism that exploits pruning rules for distance filtering, and hence reduces both the shuffling and computational costs. To reduce the shuffling cost, we propose two approximate algorithms to minimize the number of replicas. Extensive experiments on our in-house cluster demonstrate that our proposed methods are efficient, robust and scalable.
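The primitive being distributed here is easy to state exactly. A minimal single-machine sketch (our own illustrative code, not the paper's MapReduce algorithm; the function name and toy data are ours):

```python
import heapq
from math import dist  # Euclidean distance, Python 3.8+

def knn_join(R, S, k):
    """Brute-force kNN join: for every point r in R, find its k
    nearest neighbors in S. This is the centralized baseline that
    the paper's mappers/reducers decompose into per-group work."""
    return {r: heapq.nsmallest(k, S, key=lambda s: dist(r, s)) for r in R}

res = knn_join([(0, 0), (5, 5)], [(1, 0), (4, 4), (9, 9), (0, 2)], 2)
# res[(0, 0)] is [(1, 0), (0, 2)]; res[(5, 5)] is [(4, 4), (9, 9)]
```

In the paper's scheme, mappers first partition R and S into groups with distance-based pruning, so each reducer runs this kind of computation only over a small candidate subset.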
K Nearest Neighbor Queries and kNN-Joins in Large Relational Databases (Almost) for Free
"... Abstract — Finding the ..."
Similarity-aware query processing and optimization
In Proceedings of the International Conference on Very Large Data Bases PhD Workshop, 2009
Abstract
Cited by 5 (2 self)
Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological applications, require or can significantly benefit from the identification and processing of similarities in the data. Even though some work has been done to extend the semantics of some operators, e.g., join and selection, to be aware of data similarities, there has not been much study on the role, interaction, and implementation of similarity-aware operators as first-class database operators. The focus of the thesis work presented in this paper is the proposal and study of several similarity-aware database operators and a systematic analysis of their role as query operators, their interactions, optimizations, and implementation techniques. This paper presents the core research questions that drive our research work and the physical database operators that were studied as part of this thesis work so far, i.e., Similarity Group-by and Similarity Join. We describe multiple optimization techniques for the introduced operators. Specifically, we present: (1) multiple non-trivial equivalence rules that enable similarity query transformations, (2) Eager and Lazy aggregation transformations for Similarity Group-by and Similarity Join to allow pre-aggregation before potentially expensive joins, and (3) techniques to use materialized views to answer similarity-based queries. This paper also presents the main guidelines to implement the presented operators as integral components of a DBMS query engine and some of the key performance evaluation results of this implementation in an open-source DBMS. In addition, we present the way the proposed operators are efficiently exploited to answer more useful business questions in a decision support system.
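As a concrete point of reference, the simplest distance-based similarity join, one flavor of the operator family discussed above, can be written in a few lines (an illustrative sketch; the name `similarity_join` and the within-epsilon semantics are our assumptions, not the thesis's exact operator definition):

```python
from math import dist  # Euclidean distance, Python 3.8+

def similarity_join(R, S, eps):
    """Distance-based similarity join: return every pair (r, s)
    from R x S whose distance is at most eps. Real engines would
    use indexes or equivalence-rule rewrites instead of a nested
    loop; this only pins down the operator's semantics."""
    return [(r, s) for r in R for s in S if dist(r, s) <= eps]

pairs = similarity_join([(0, 0)], [(0.5, 0), (3, 0)], 1)
# pairs is [((0, 0), (0.5, 0))]
```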
Efficient Similarity Join of Large Sets of Moving Object Trajectories
Abstract
Cited by 5 (1 self)
We address the problem of performing efficient similarity joins for large sets of moving object trajectories. Unlike previous approaches, which use a dedicated index in a transformed space, our premise is that in many applications of location-based services the trajectories are already indexed in their native space, in order to facilitate the processing of common spatio-temporal queries, e.g., range, nearest neighbor, etc. We introduce a novel distance measure adapted from the classic Fréchet distance, which can be naturally extended to support lower/upper bounding using the underlying indices of moving object databases in the native space. This, in turn, enables efficient implementation of various trajectory similarity joins. We report on extensive experiments demonstrating that our methodology speeds up trajectory similarity joins by more than 50% on average, while maintaining effectiveness comparable to the well-known approaches for identifying trajectory similarity based on time-series analysis.
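The classic measure the adapted metric starts from, the discrete Fréchet distance, has a compact recursive definition (a sketch of the standard textbook recurrence, not the paper's adapted measure or its index-based bounds):

```python
from functools import lru_cache
from math import dist  # Euclidean distance, Python 3.8+

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two polylines P and Q,
    given as sequences of points: the minimum over all coupled
    traversals of the maximum pointwise distance."""
    @lru_cache(maxsize=None)
    def c(i, j):
        d = dist(P[i], Q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        # advance on P, on Q, or on both, whichever is cheapest
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
    return c(len(P) - 1, len(Q) - 1)

# Two parallel horizontal segments one unit apart:
# discrete_frechet([(0, 0), (1, 0)], [(0, 1), (1, 1)]) == 1.0
```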
Nearest group queries
In SSDBM, 2013
Abstract
Cited by 4 (1 self)
k nearest neighbor (kNN) search is an important problem in a vast number of applications, including clustering, pattern recognition, image retrieval and recommendation systems. It finds the k elements from a data source D that are closest to a given query point q in a metric space. In this paper, we extend the kNN query to retrieve the closest elements from multiple data sources. This new type of query is named the k nearest group (kNG) query; it finds the k groups of elements that are closest to q, with each group containing one object from each data source. The kNG query is useful in many location-based services. To efficiently process kNG queries, we propose a baseline algorithm using the R-tree as well as an improved version using the Hilbert R-tree. We also study a variant of the kNG query, named kNG Join, which is analogous to kNN Join. Given a set of query points Q, kNG Join returns the k nearest groups for each point in Q. Such a query is useful in publish/subscribe systems to find matching items for a collection of subscribers. A comprehensive performance study was conducted on both synthetic and real datasets, and the experimental results show that the Hilbert R-tree achieves significantly better performance than the R-tree in answering both the kNG query and kNG Join.
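The kNG semantics can be made concrete with a brute-force baseline (our own illustrative code; we assume total distance to q as the group-ranking criterion, and the paper's actual algorithms use R-trees and Hilbert R-trees rather than enumeration):

```python
import heapq
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+

def knng(q, sources, k):
    """Brute-force k nearest group query: form every group that
    takes one object per data source, then keep the k groups with
    the smallest total distance to the query point q."""
    groups = product(*sources)  # Cartesian product over the sources
    return heapq.nsmallest(k, groups,
                           key=lambda g: sum(dist(q, o) for o in g))

best = knng((0, 0), [[(1, 0), (5, 0)], [(0, 1), (0, 6)]], 1)
# best is [((1, 0), (0, 1))]: the nearest object from each source
```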
Adaptive k-Nearest-Neighbor Classification Using a Dynamic Number of Nearest Neighbors
Abstract
Cited by 3 (0 self)
Classification based on k-nearest neighbors (kNN classification) is one of the most widely used classification methods. The number k of nearest neighbors used for achieving high accuracy in classification is given in advance and is highly dependent on the data set used. If the data set is large, the sequential or binary search of NNs is inapplicable due to the increased computational costs. Therefore, indexing schemes are frequently used to speed up the classification process. If the required number of nearest neighbors is high, the use of an index may not be adequate to achieve high performance. In this paper, we demonstrate that the execution of the nearest neighbor search algorithm can be interrupted if certain criteria are satisfied. This way, a decision can be made without the computation of all k nearest neighbors of a new object. Three different heuristics are studied for enhancing the nearest neighbor algorithm with an early-break capability. These heuristics aim at: (i) reducing computation and I/O costs as much as possible, and (ii) maintaining classification accuracy at a high level. Experimental results based on real-life data sets illustrate the applicability of the proposed method in achieving better performance than existing methods.
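The early-break idea can be illustrated with a toy rule: stop scanning neighbors once the leading class either reaches a vote margin or can no longer be overtaken (our own simplified criterion, not one of the paper's three heuristics, and the parameter `margin` is ours):

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def early_break_knn(query, data, labels, k, margin):
    """kNN classification that may stop before seeing all k
    neighbors: scan points in order of distance and break as soon
    as the leading class is ahead by `margin` votes, or by more
    votes than remain to be cast."""
    order = sorted(range(len(data)), key=lambda i: dist(query, data[i]))
    votes = Counter()
    for n, i in enumerate(order[:k], start=1):
        votes[labels[i]] += 1
        ranked = votes.most_common(2)
        lead = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0)
        if lead >= margin or lead > k - n:
            break  # decision already settled; skip remaining neighbors
    return votes.most_common(1)[0][0]
```

With `k = 3` and `margin = 2`, two early 'a' votes decide the class without computing the third neighbor, which is the kind of saving the paper's heuristics target.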
Computation and Monitoring of Exclusive Closest Pairs
Abstract
Cited by 3 (2 self)
Given two datasets A and B, their exclusive closest pairs (ECP) join is a one-to-one assignment of objects from the two datasets, such that (i) the closest pair (a, b) in A × B is in the result, and (ii) the remaining pairs are determined by removing the objects a, b from A, B respectively, and recursively searching for the next closest pair. A real application of exclusive closest pairs is the computation of (car, parking slot) assignments. This paper introduces the problem and proposes several solutions that solve it in main memory, exploiting space partitioning. In addition, we define a dynamic version of the problem, where the objective is to continuously monitor the ECP join solution in an environment where the joined datasets change positions and content. Finally, we study an extended form of the query, where objects in one of the two joined sets (e.g., parking slots) have a capacity constraint, allowing them to match with multiple objects from the other set (e.g., cars). We show how our techniques can be extended for this variant and compare them with a previous solution to this problem. Experimental results on a system prototype demonstrate the efficiency and applicability of the proposed algorithms.
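The ECP definition translates directly into a brute-force procedure (illustrative only; the paper's solutions use space partitioning rather than repeated global scans):

```python
from math import dist  # Euclidean distance, Python 3.8+

def ecp_join(A, B):
    """Exclusive closest pairs join, straight from the definition:
    repeatedly output the globally closest pair (a, b), remove a
    from A and b from B, and recurse until one set is exhausted."""
    A, B = list(A), list(B)
    pairs = []
    while A and B:
        a, b = min(((a, b) for a in A for b in B),
                   key=lambda p: dist(p[0], p[1]))
        pairs.append((a, b))
        A.remove(a)
        B.remove(b)
    return pairs

# Cars at (0,0) and (10,10), slots at (1,0) and (10,11):
# each car is matched to its own nearby slot, one-to-one.
```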
Finding the sites with best accessibilities to amenities
In DASFAA, 2011
Abstract
Cited by 2 (1 self)
Finding the most accessible locations has a number of applications. For example, a user may want to find an accommodation that is close to different amenities such as schools, supermarkets, and hospitals. In this paper, we study the problem of finding the most accessible locations among a set of possible sites. The task is converted to a top-k query that returns the k points from a set of sites R with the best accessibilities. Two R-tree based algorithms are proposed to answer the query efficiently. Experimental results show that our proposed algorithms are several times faster than a baseline algorithm on large-scale real datasets under a wide range of parameter settings.
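One plausible accessibility score, the sum of distances from a site to its nearest amenity of each type, makes the top-k query concrete (the scoring function is our assumption; the paper's exact accessibility measure may differ, and its algorithms use R-trees rather than a full scan):

```python
import heapq
from math import dist  # Euclidean distance, Python 3.8+

def top_k_accessible(sites, amenity_sets, k):
    """Return the k candidate sites with the best (lowest) total
    distance to the nearest amenity of each type. amenity_sets is
    a list of point lists, one list per amenity type."""
    def score(site):
        return sum(min(dist(site, a) for a in amenities)
                   for amenities in amenity_sets)
    return heapq.nsmallest(k, sites, key=score)

best = top_k_accessible([(0, 0), (10, 10)], [[(1, 0)], [(0, 1)]], 1)
# best is [(0, 0)]: it is close to both the school and the hospital
```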
Adaptive k-Nearest Neighbor Classification Based on a Dynamic Number of Nearest Neighbors
Abstract
Cited by 1 (0 self)
Classification based on k-nearest neighbors (kNN classification) is one of the most widely used classification methods. The number k of nearest neighbors used for achieving high precision in classification is given in advance and is highly dependent on the data set used. If the data set is large, the sequential or binary search of NNs is inapplicable due to the increased computational costs. Therefore, indexing schemes are frequently used to speed up the classification process. If the required number of nearest neighbors is high, the use of an index may not be adequate to achieve high performance. In this paper, we demonstrate that the execution of the nearest neighbor search algorithm can be interrupted if certain criteria are satisfied. This way, a decision can be made without the computation of all k nearest neighbors of a new object. Three different heuristics are studied for enhancing the nearest neighbor algorithm with an early-break capability. These heuristics aim at: (i) reducing computation and I/O costs as much as possible, and (ii) maintaining classification precision at a high level. Experimental results based on real-life data sets illustrate the applicability of the proposed method in achieving better performance than existing methods.
Finding Data Broadness Via Generalized Nearest Neighbors
Abstract
A data object is broad if it is one of the k-Nearest Neighbors (kNN) of many data objects. We introduce a new database primitive called Generalized Nearest Neighbor (GNN) to express data broadness. We also develop three strategies to answer GNN queries efficiently for large datasets of multidimensional objects. The R*-tree based search algorithm generates candidate pages and ranks them based on their distances. Our first algorithm, Fetch All (FA), fetches as many candidate pages as possible. Our second algorithm, Fetch One (FO), fetches one candidate page at a time. Our third algorithm, Fetch Dynamic (FD), dynamically decides on the number of pages that need to be fetched. We also propose three optimizations, Column Filter, Row Filter and Adaptive Filter, to eliminate pages from each dataset. Column Filter prunes the pages that are guaranteed to be non-broad. Row Filter prunes the pages whose removal does not change the broadness of any data point. Adaptive Filter prunes the search space dynamically along each dimension to eliminate unpromising objects. Our experiments show that FA is the fastest when the buffer size is large and FO is the fastest when the buffer size is small. FD is always either the fastest or very close to the faster of FA and FO. FD is significantly faster than the existing methods adapted to the GNN problem.
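The broadness notion can be computed directly by counting, for each object, how many other objects include it among their k nearest neighbors (a brute-force sketch of the definition; the paper's FA/FO/FD algorithms work on R*-tree pages instead of individual points):

```python
from math import dist  # Euclidean distance, Python 3.8+

def broadness(D, k):
    """For each object in D, count how many other objects have it
    among their k nearest neighbors. High counts mark broad objects;
    assumes points in D are distinct."""
    counts = {o: 0 for o in D}
    for p in D:
        others = [o for o in D if o != p]
        for nn in sorted(others, key=lambda o: dist(p, o))[:k]:
            counts[nn] += 1
    return counts

c = broadness([(0, 0), (1, 0), (5, 0)], 1)
# (1, 0) is the 1-NN of both other points, so c[(1, 0)] == 2
```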