Distance Browsing in Spatial Databases
1999
Cited by 390 (21 self)
Two different techniques of browsing through a collection of spatial objects stored in an R-tree spatial data structure on the basis of their distances from an arbitrary spatial query object are compared. The conventional approach is one that makes use of a k-nearest neighbor algorithm where k is known prior to the invocation of the algorithm. Thus if m > k neighbors are needed, the k-nearest neighbor algorithm needs to be reinvoked for m neighbors, thereby possibly performing some redundant computations. The second approach is incremental in the sense that having obtained the k nearest neighbors, the (k+1)st neighbor can be obtained without having to calculate the k+1 nearest neighbors from scratch. The incremental approach finds use when processing complex queries where one of the conditions involves spatial proximity (e.g., the nearest city to Chicago with population greater than a million), in which case a query engine can make use of a pipelined strategy. A general incremental nearest neighbor algorithm is presented that is applicable to a large class of hierarchical spatial data structures. This algorithm is adapted to the R-tree and its performance is compared to an existing k-nearest neighbor algorithm for R-trees [45]. Experiments show that the incremental nearest neighbor algorithm significantly outperforms the k-nearest neighbor algorithm for distance browsing queries in a spatial database that uses the R-tree as a spatial index. Moreover, the incremental nearest neighbor algorithm also usually outperforms the k-nearest neighbor algorithm when applied to the k-nearest neighbor problem for the R-tree, although the improvement is not nearly as large as for distance browsing queries. In fact, we prove informally that, at any step in its execution, the incremental...
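The incremental idea in the abstract above can be illustrated with a best-first traversal driven by a priority queue. This is a minimal sketch and not the paper's R-tree implementation: the node layout (`("node", box, children)` and `("obj", point, payload)`), the helper names, and the city data are all made up for illustration.

```python
import heapq
import itertools
import math


def mindist(q, box):
    """Smallest Euclidean distance from point q to an axis-aligned box (lo, hi)."""
    lo, hi = box
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))


def incremental_nn(root, q):
    """Yield (distance, payload) pairs in nondecreasing distance from q.

    Hypothetical node layout: ("node", box, children) or ("obj", point, payload).
    """
    tick = itertools.count()  # tie-breaker so the heap never compares nodes
    heap = [(0.0, next(tick), root)]
    while heap:
        d, _, item = heapq.heappop(heap)
        if item[0] == "obj":
            yield d, item[2]  # nothing left in the queue can be closer
            continue
        for child in item[2]:
            key = math.dist(q, child[1]) if child[0] == "obj" else mindist(q, child[1])
            heapq.heappush(heap, (key, next(tick), child))


# Made-up cities: (name, population) payloads at 2-D coordinates.
west = ("node", ((1, 1), (2, 2)), [
    ("obj", (1, 1), ("A", 500_000)),
    ("obj", (2, 2), ("B", 2_000_000)),
])
east = ("node", ((10, 9), (11, 10)), [
    ("obj", (10, 10), ("C", 5_000_000)),
    ("obj", (11, 9), ("D", 100_000)),
])
root = ("node", ((1, 1), (11, 10)), [west, east])

# Pipelined query: stop at the first neighbor that passes the filter.
nearest_big = next(p for _, p in incremental_nn(root, (0, 0)) if p[1] > 1_000_000)
```

Because the generator keeps its queue between calls, asking for one more neighbor costs only the extra work needed to reach it. In this example the filter is satisfied before the `east` node is ever expanded.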
Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds
Journal of Machine Learning Research, 2003

Cited by 385 (10 self)
The problem of dimensionality reduction arises in many fields of information processing, including machine learning, data compression, scientific visualization, pattern recognition, and neural computation.
Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces
1993
Cited by 358 (5 self)
We consider the computational problem of finding nearest neighbors in general metric spaces. Of particular interest are spaces that may not be conveniently embedded or approximated in Euclidian space, or where the dimensionality of a Euclidian representation is very high. Also relevant are high-dimensional Euclidian settings in which the distribution of data is in some sense of lower dimension and embedded in the space. The vp-tree (vantage point tree) is introduced in several forms, together with associated algorithms, as an improved method for these difficult search problems. Tree construction executes in O(n log(n)) time, and search is under certain circumstances and in the limit, O(log(n)) expected time. The theoretical basis for this approach is developed and the results of several experiments are reported. In Euclidian cases, kd-tree performance is compared.
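A vantage point tree of the kind named in the abstract above can be sketched in a few lines. This is a simplified illustration, not any of the paper's variants: the random choice of vantage point, the tuple layout, and the names `build_vp_tree` and `vp_nearest` are assumptions. The only property it relies on is the triangle inequality, so any metric can be plugged in (Manhattan distance here).

```python
import random


def build_vp_tree(points, dist, rng):
    """Return a nested tuple (vantage, mu, inner, outer), or None for no points."""
    if not points:
        return None
    points = list(points)
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return (vantage, 0.0, None, None)
    dists = [dist(vantage, p) for p in points]
    mu = sorted(dists)[len(dists) // 2]  # median distance to the vantage point
    inner = [p for p, d in zip(points, dists) if d < mu]
    outer = [p for p, d in zip(points, dists) if d >= mu]
    return (vantage, mu, build_vp_tree(inner, dist, rng), build_vp_tree(outer, dist, rng))


def vp_nearest(node, q, dist, best=(float("inf"), None)):
    """Return (distance, point) of the nearest stored point to q."""
    if node is None:
        return best
    vantage, mu, inner, outer = node
    d = dist(q, vantage)
    if d < best[0]:
        best = (d, vantage)
    if d < mu:
        best = vp_nearest(inner, q, dist, best)
        if d + best[0] >= mu:  # the query ball reaches past the median shell
            best = vp_nearest(outer, q, dist, best)
    else:
        best = vp_nearest(outer, q, dist, best)
        if d - best[0] < mu:  # the query ball dips inside the median shell
            best = vp_nearest(inner, q, dist, best)
    return best


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


rng = random.Random(0)
data = [(rng.random(), rng.random(), rng.random()) for _ in range(200)]
tree = build_vp_tree(data, manhattan, rng)
```

A call such as `vp_nearest(tree, (0.5, 0.5, 0.5), manhattan)` returns the same distance as a linear scan while skipping subtrees that the two pruning tests rule out.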
Neighbourhood components analysis
Advances in Neural Information Processing Systems 17, 2004
Cited by 346 (9 self)
In this paper we propose a novel method for learning a Mahalanobis distance measure to be used in the KNN classification algorithm. The algorithm directly maximizes a stochastic variant of the leave-one-out KNN score on the training set. It can also learn a low-dimensional linear embedding of labeled data that can be used for data visualization and fast classification. Unlike other methods, our classification model is non-parametric, making no assumptions about the shape of the class distributions or the boundaries between them. The performance of the method is demonstrated on several data sets, both for metric learning and linear dimensionality reduction.
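The "stochastic variant of the leave-one-out KNN score" mentioned above is commonly written as follows. The notation is an assumption, since the abstract defines no symbols: A is the learned linear map (so the Mahalanobis matrix is AᵀA), x_i are the training points, and C_i is the set of points sharing the class of x_i.

```latex
\[
p_{ij} = \frac{\exp\!\left(-\lVert A x_i - A x_j \rVert^2\right)}
              {\sum_{k \neq i} \exp\!\left(-\lVert A x_i - A x_k \rVert^2\right)},
\qquad p_{ii} = 0
\]
\[
p_i = \sum_{j \in C_i} p_{ij},
\qquad
f(A) = \sum_i p_i
\]
```

Here p_ij is the probability that point i picks point j as its neighbor, p_i is the probability that i is classified correctly, and f(A) is the expected number of correctly classified points. Choosing a non-square A with fewer rows than input dimensions yields the low-dimensional embedding.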
Similarity Indexing with the SS-tree
In Proceedings of the 12th International Conference on Data Engineering, 1996
Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching
2000
Cited by 338 (15 self)
Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once, for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters. We present a new technique for clustering these large, high-dimensional datasets. The key idea involves using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets we call canopies. Then clustering is performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practical. Under r...
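The two-stage idea described above (cheap distance to form overlapping canopies, exact distance only within a shared canopy) can be sketched as below. This is an illustrative reading, not the authors' code: the loose and tight thresholds `t1` and `t2`, the function names, and the toy one-dimensional data are assumptions.

```python
from itertools import combinations


def make_canopies(points, cheap_dist, t1, t2):
    """Return overlapping canopies as lists of indices into points (t1 >= t2 >= 0)."""
    if not t1 >= t2 >= 0:
        raise ValueError("expected t1 >= t2 >= 0")
    candidates = list(range(len(points)))  # points still eligible to be centers
    canopies = []
    while candidates:
        center = points[candidates[0]]
        # Loose threshold: everything within t1 joins this canopy.
        canopies.append([i for i, p in enumerate(points) if cheap_dist(center, p) <= t1])
        # Tight threshold: points within t2 can no longer start a canopy.
        candidates = [i for i in candidates if cheap_dist(center, points[i]) > t2]
    return canopies


def candidate_pairs(canopies):
    """Index pairs that share at least one canopy; only these need exact distances."""
    pairs = set()
    for canopy in canopies:
        pairs.update(combinations(sorted(canopy), 2))
    return pairs


values = [0.0, 1.0, 2.0, 4.0, 10.0, 11.0]
canopies = make_canopies(values, lambda a, b: abs(a - b), t1=3.0, t2=1.5)
pairs = candidate_pairs(canopies)
```

On this toy input, 7 of the 15 possible pairs share a canopy, so a subsequent exact clustering step would skip the other 8 exact distance computations.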
Shape Indexing Using Approximate Nearest-Neighbour Search in High-Dimensional Spaces
1997
Cited by 311 (12 self)
Shape indexing is a way of making rapid associations between features detected in an image and object models that could have produced them. When model databases are large, the use of high-dimensional features is critical, due to the improved level of discrimination they can provide. Unfortunately, finding the nearest neighbour to a query point rapidly becomes inefficient as the dimensionality of the feature space increases. Past indexing methods have used hash tables for hypothesis recovery, but only in low-dimensional situations. In this paper, we show that a new variant of the k-d tree search algorithm makes indexing in higher-dimensional spaces practical. This Best Bin First, or BBF, search is an approximate algorithm which finds the nearest neighbour for a large fraction of the queries, and a very close neighbour in the remaining cases. The technique has been integrated into a fully developed recognition system, which is able to detect complex objects in real, cluttered scenes in just a few seconds.
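A search in the spirit of Best Bin First can be sketched as a k-d tree traversal that visits unexplored branches in order of their distance to the query and stops once a budget is spent. This is a simplified sketch under stated assumptions, not the paper's algorithm: the dictionary node layout, the `max_checks` budget (counted in visited nodes and tested between descents), and the function names are all illustrative.

```python
import heapq
import itertools
import math
import random


def build_kd(points, depth=0):
    """Build a k-d tree that stores one point per node."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"point": pts[mid], "axis": axis,
            "left": build_kd(pts[:mid], depth + 1),
            "right": build_kd(pts[mid + 1:], depth + 1)}


def bbf_nearest(root, q, max_checks):
    """Return (point, distance); exact if the budget is never exhausted."""
    best_d, best_p = math.inf, None
    tick = itertools.count()  # tie-breaker so the heap never compares nodes
    heap = [(0.0, next(tick), root)]
    checks = 0
    while heap and checks < max_checks:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_d:
            break  # every queued branch is at least this far away
        while node is not None:  # descend, queueing the branch not taken
            checks += 1
            d = math.dist(q, node["point"])
            if d < best_d:
                best_d, best_p = d, node["point"]
            diff = q[node["axis"]] - node["point"][node["axis"]]
            near, far = ((node["left"], node["right"]) if diff < 0
                         else (node["right"], node["left"]))
            if far is not None:
                # |diff| is a lower bound on the distance to anything in far.
                heapq.heappush(heap, (abs(diff), next(tick), far))
            node = near
    return best_p, best_d


rng = random.Random(0)
data = [tuple(rng.random() for _ in range(5)) for _ in range(300)]
tree = build_kd(data)
```

With a large `max_checks` the result matches a linear scan; with a small budget the function returns the best point seen so far, which may be a close neighbour rather than the nearest one.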
CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling
1999
Cited by 268 (19 self)
Clustering in data mining is a discovery process that groups a set of data such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized. Existing clustering algorithms, such as K-means, PAM, CLARANS, DBSCAN, CURE, and ROCK are designed to find clusters that fit some static models. These algorithms can break down if the choice of parameters in the static model is incorrect with respect to the data set being clustered, or if the model is not adequate to capture the characteristics of clusters. Furthermore, most of these algorithms break down when the data consists of clusters that are of diverse shapes, densities, and sizes. In this paper, we present a novel hierarchical clustering algorithm called CHAMELEON that measures the similarity of two clusters based on a dynamic model. In the clustering process, two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters are high relative to the internal intercon...
Mondrian Multidimensional K-Anonymity
in Proc. 22nd ICDE. IEEE
Cited by 255 (5 self)
K-Anonymity has been proposed as a mechanism for privacy protection in microdata publishing, and numerous recoding “models” have been considered for achieving k-anonymity. This paper proposes a new multidimensional model, which provides an additional degree of flexibility not seen in previous (single-dimensional) approaches. Often this flexibility leads to higher-quality anonymizations, as measured both by general-purpose metrics, as well as more specific notions of query answerability. In this paper, we prove that optimal multidimensional anonymization is NP-hard (like previous k-anonymity models). However, we introduce a simple, scalable, greedy algorithm that produces anonymizations that are a constant-factor approximation of optimal. Experimental results show that this greedy algorithm frequently leads to more desirable anonymizations than two optimal exhaustive-search algorithms for single-dimensional models.
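The abstract does not spell out the greedy algorithm, so the following is only a plausible sketch of a greedy multidimensional partitioning of this kind: recursively split the records at the median of the attribute with the widest range, and stop when no split leaves at least k records on both sides. The function names, the numeric-only records, and the random data are assumptions.

```python
import random


def mondrian(records, k):
    """Greedily partition numeric records into groups of at least k records."""
    def spread(dim):
        column = [r[dim] for r in records]
        return max(column) - min(column)

    # Try the widest attribute first; fall back to narrower ones if the cut fails.
    for dim in sorted(range(len(records[0])), key=spread, reverse=True):
        column = sorted(r[dim] for r in records)
        median = column[len(column) // 2]
        left = [r for r in records if r[dim] < median]
        right = [r for r in records if r[dim] >= median]
        if len(left) >= k and len(right) >= k:  # the cut keeps both sides k-anonymous
            return mondrian(left, k) + mondrian(right, k)
    return [records]  # no allowable cut: publish this group as one region


def generalize(group):
    """Replace a group by its per-attribute (min, max) ranges."""
    return tuple((min(col), max(col)) for col in zip(*group))


rng = random.Random(1)
records = [(rng.randint(18, 90), rng.randint(10000, 99999)) for _ in range(50)]
groups = mondrian(records, k=3)
regions = [generalize(g) for g in groups]
```

Each published region then stands for at least k original records, which is the k-anonymity condition under these assumptions.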