Results 1  10
of
1,346
Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
, 1998
"... The nearest neighbor problem is the following: Given a set of n points P = fp 1 ; : : : ; png in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to a query point q 2 X. We focus on the particularly interesting case of the ddimens ..."
Abstract

Cited by 1017 (40 self)
 Add to MetaCart
The nearest neighbor problem is the following: Given a set of n points P = fp 1 ; : : : ; png in some metric space X, preprocess P so as to efficiently answer queries which require finding the point in P closest to a query point q 2 X. We focus on the particularly interesting case of the ddimensional Euclidean space where X = ! d under some l p norm. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large d, in theory or in practice, they provide little improvement over the bruteforce algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem, which is: Find a point p 2 P that is an fflapproximate nearest neighbor of the query q in that for all p 0 2 P , d(p; q) (1 + ffl)d(p 0 ; q). We present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a trul...
An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions
 ACMSIAM SYMPOSIUM ON DISCRETE ALGORITHMS
, 1994
"... Consider a set S of n data points in real ddimensional space, R d , where distances are measured using any Minkowski metric. In nearest neighbor searching we preprocess S into a data structure, so that given any query point q 2 R d , the closest point of S to q can be reported quickly. Given any po ..."
Abstract

Cited by 983 (32 self)
 Add to MetaCart
Consider a set S of n data points in real ddimensional space, R d , where distances are measured using any Minkowski metric. In nearest neighbor searching we preprocess S into a data structure, so that given any query point q 2 R d , the closest point of S to q can be reported quickly. Given any positive real ffl, a data point p is a (1 + ffl)approximate nearest neighbor of q if its distance from q is within a factor of (1 + ffl) of the distance to the true nearest neighbor. We show that it is possible to preprocess a set of n points in R d in O(dn log n) time and O(dn) space, so that given a query point q 2 R d , and ffl ? 0, a (1 + ffl)approximate nearest neighbor of q can be computed in O(c d;ffl log n) time, where c d;ffl d d1 + 6d=ffle d is a factor depending only on dimension and ffl. In general, we show that given an integer k 1, (1 + ffl)approximations to the k nearest neighbors of q can be computed in additional O(kd log n) time.
Distance metric learning for large margin nearest neighbor classification
 In NIPS
, 2006
"... We show how to learn a Mahanalobis distance metric for knearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the knearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven ..."
Abstract

Cited by 685 (15 self)
 Add to MetaCart
(Show Context)
We show how to learn a Mahanalobis distance metric for knearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the knearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classificationâ€”for example, achieving a test error rate of 1.3 % on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1
Similarity search in high dimensions via hashing
, 1999
"... The nearest or nearneighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over highdimensional data, e.g., image dat ..."
Abstract

Cited by 622 (13 self)
 Add to MetaCart
The nearest or nearneighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over highdimensional data, e.g., image databases, document collections, timeseries databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the \curse of dimensionality. &quot; That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in kd trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than bruteforce linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should su ce for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points
Selection of relevant features and examples in machine learning
 ARTIFICIAL INTELLIGENCE
, 1997
"... In this survey, we review work in machine learning on methods for handling data sets containing large amounts of irrelevant information. We focus on two key issues: the problem of selecting relevant features, and the problem of selecting relevant examples. We describe the advances that have been mad ..."
Abstract

Cited by 590 (2 self)
 Add to MetaCart
In this survey, we review work in machine learning on methods for handling data sets containing large amounts of irrelevant information. We focus on two key issues: the problem of selecting relevant features, and the problem of selecting relevant examples. We describe the advances that have been made on these topics in both empirical and theoretical work in machine learning, and we present a general framework that we use to compare different methods. We close with some challenges for future work in this area.
Using Discriminant Eigenfeatures for Image Retrieval
, 1996
"... This paper describes the automatic selection of features from an image training set using the theories of multidimensional linear discriminant analysis and the associated optimal linear projection. We demonstrate the effectiveness of these Most Discriminating Features for viewbased class retrieval ..."
Abstract

Cited by 504 (15 self)
 Add to MetaCart
(Show Context)
This paper describes the automatic selection of features from an image training set using the theories of multidimensional linear discriminant analysis and the associated optimal linear projection. We demonstrate the effectiveness of these Most Discriminating Features for viewbased class retrieval from a large database of widely varying realworld objects presented as "wellframed" views, and compare it with that of the principal component analysis.
Bursty and Hierarchical Structure in Streams
, 2002
"... A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. Email and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade aw ..."
Abstract

Cited by 385 (2 self)
 Add to MetaCart
A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. Email and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise  that the appearance of a topic in a document stream is signaled by a "burst of activity," with certain features rising sharply in frequency as the topic emerges.
TiMBL: Tilburg Memory Based Learner  version 2.0  Reference Guid
, 1999
"... This document is available from http://ilk.kub.nl/~ilk/papers/ilk9901.ps.gz. All rights reserved Induction of Linguistic Knowledge, Tilburg University. Contents 1 License terms 1 2 Installation 3 3 Changes 4 4 Learning algorithms 6 ..."
Abstract

Cited by 327 (77 self)
 Add to MetaCart
This document is available from http://ilk.kub.nl/~ilk/papers/ilk9901.ps.gz. All rights reserved Induction of Linguistic Knowledge, Tilburg University. Contents 1 License terms 1 2 Installation 3 3 Changes 4 4 Learning algorithms 6
A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features
 Machine Learning
, 1993
"... In the past, nearest neighbor algorithms for learning from examples have worked best in domains in which all features had numeric values. In such domains, the examples can be treated as points and distance metrics can use standard definitions. In symbolic domains, a more sophisticated treatment of t ..."
Abstract

Cited by 305 (3 self)
 Add to MetaCart
(Show Context)
In the past, nearest neighbor algorithms for learning from examples have worked best in domains in which all features had numeric values. In such domains, the examples can be treated as points and distance metrics can use standard definitions. In symbolic domains, a more sophisticated treatment of the feature space is required. We introduce a nearest neighbor algorithm for learning in domains with symbolic features. Our algorithm calculates distance tables that allow it to produce realvalued distances between instances, and attaches weights to the instances to further modify the structure of feature space. We show that this technique produces excellent classification accuracy on three problems that have been studied by machine learning researchers: predicting protein secondary structure, identifying DNA promoter sequences, and pronouncing English text. Direct experimental comparisons with the other learning algorithms show that our nearest neighbor algorithm is comparable or superior ...