Results 1–10 of 10
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
, 2008
Abstract

Cited by 457 (7 self)
In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The problem is of significant interest in a wide variety of areas.
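As a concrete illustration of the preprocess-then-query setup described in this abstract, the sketch below uses random-hyperplane locality-sensitive hashing, one common scheme for approximate nearest neighbor search. It is a minimal, hypothetical example, not the specific construction surveyed in the article.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_index(points, n_bits=16):
    """Hash each point to a bit signature from random hyperplanes;
    group point indices by signature (the 'preprocess' step)."""
    dim = points.shape[1]
    planes = rng.standard_normal((n_bits, dim))
    buckets = {}
    for i, p in enumerate(points):
        sig = tuple((planes @ p) > 0)  # which side of each hyperplane
        buckets.setdefault(sig, []).append(i)
    return planes, buckets

def query(q, points, planes, buckets):
    """Return the index of the closest candidate in the query's bucket;
    fall back to a full scan if the bucket is empty."""
    sig = tuple((planes @ q) > 0)
    candidates = buckets.get(sig, range(len(points)))
    return min(candidates, key=lambda i: np.linalg.norm(points[i] - q))

points = rng.standard_normal((1000, 32))
planes, buckets = build_index(points)
print(query(points[42], points, planes, buckets))  # -> 42 (exact match hashes to its own bucket)
```

Only points colliding in the same bucket are compared, which is what makes the query fast; the price is that the true nearest neighbor can land in a different bucket, hence "approximate".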
Nearest Neighbors in High-Dimensional Spaces
, 2004
Abstract

Cited by 93 (2 self)
In this chapter we consider the following problem: given a set P of points in a high-dimensional space, construct a data structure which, given any query point q, finds the point in P closest to q. This problem, called nearest neighbor search, is of significant importance to several areas of computer science, including pattern recognition, searching in multimedia data, vector compression [GG91], computational statistics [DW82], and data mining. Many of these applications involve data sets which are very large (e.g., a database containing Web documents could contain over one billion documents). Moreover, the dimensionality of the points is usually large as well (e.g., on the order of a few hundred). Therefore, it is crucial to design algorithms which scale well with the database size as well as with the dimension. The nearest-neighbor problem is an example of a large class of proximity problems, which, roughly speaking, are problems whose definitions involve the notion of...
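The problem statement above has a trivial exact solution that costs O(n·d) per query, which is the baseline the data structures in this chapter aim to beat. A minimal sketch:

```python
import numpy as np

def nearest_neighbor(P, q):
    """Index of the point in P closest to q under Euclidean distance.
    Exact but O(n * d) per query: it scans every point."""
    dists = np.linalg.norm(P - q, axis=1)  # distance from q to each point
    return int(np.argmin(dists))

P = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 1.1]])
print(nearest_neighbor(P, np.array([1.0, 1.0])))  # -> 2
```

For a billion documents, a scan per query is infeasible, which motivates the sublinear structures discussed here.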
Efficient Spoken Term Discovery Using Randomized Algorithms
Abstract

Cited by 19 (13 self)
Spoken term discovery is the task of automatically identifying words and phrases in speech data by searching for long repeated acoustic patterns. Initial solutions relied on exhaustive dynamic time warping-based searches across the entire similarity matrix, a method whose scalability is ultimately limited by the O(n²) nature of the search space. Recent strategies have attempted to improve search efficiency by using either unsupervised or mismatched-language acoustic models to reduce the complexity of the feature representation. Taking a completely different approach, this paper investigates the use of randomized algorithms that operate directly on the raw acoustic features to produce sparse approximate similarity matrices in O(n) space and O(n log n) time. We demonstrate that these techniques facilitate spoken term discovery performance capable of outperforming a model-based strategy in the zero-resource setting.
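The core idea of a sparse approximate similarity matrix can be sketched as follows: hash every feature frame with a randomized function, and store similarity entries only for pairs of frames that collide in a bucket, so space stays far below the full O(n²) matrix. This is an illustration of the general technique on hypothetical MFCC-like vectors, not the paper's exact algorithm.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)

def sparse_similarity(frames, n_bits=8):
    """Bucket frames by a random-hyperplane bit signature, then store
    cosine similarity only for pairs that share a bucket."""
    planes = rng.standard_normal((n_bits, frames.shape[1]))
    buckets = defaultdict(list)
    for i, f in enumerate(frames):
        buckets[tuple((planes @ f) > 0)].append(i)
    entries = {}  # (i, j) -> similarity, only for colliding pairs
    for idx in buckets.values():
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                i, j = idx[a], idx[b]
                sim = frames[i] @ frames[j] / (
                    np.linalg.norm(frames[i]) * np.linalg.norm(frames[j]))
                entries[(i, j)] = sim
    return entries

frames = rng.standard_normal((200, 13))  # 13-dim, MFCC-like toy features
entries = sparse_similarity(frames)
print(len(entries))  # far fewer stored pairs than the full 200*199/2 = 19900
```

Similar frames tend to share signatures, so the long repeated patterns the task needs survive in the sparse matrix while most dissimilar pairs are never materialized.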
Connectivity structure of bipartite graphs via the KNC-plot
 In WSDM ’08: Proceedings of the International Conference on Web Search and Web Data Mining
, 2008
Abstract

Cited by 8 (2 self)
In this paper we introduce the k-neighbor connectivity plot, or KNC-plot, as a tool to study the macroscopic connectivity structure of sparse bipartite graphs. Given a bipartite graph G = (U, V, E), we say that two nodes in U are k-neighbors if there exist at least k distinct length-two paths between them; this defines a k-neighborhood graph on U where the edges are given by the k-neighbor relation. For example, in a bipartite graph of users and interests, two users are k-neighbors if they have at least k common interests. The KNC-plot shows the degradation of connectivity of the graph as a function of k. We show that this tool provides an effective and interpretable high-level characterization of the connectivity of a bipartite graph. However, naive algorithms to compute the KNC-plot are inefficient for k > 1. We give an efficient and practical algorithm that runs in subquadratic time O(|E|^(2−1/k)) and is a nontrivial improvement over the obvious quadratic-time algorithms for this problem. We prove significant improvements in this runtime for graphs with power-law degree distributions, and give a different algorithm with near-linear runtime when |V| grows slowly as a function of the size of the graph. We compute the KNC-plot of four large real-world bipartite graphs, and discuss the structural properties of these graphs that emerge. We conclude that the KNC-plot represents a useful and practical tool for macroscopic analysis of large bipartite graphs.
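The k-neighbor relation defined above can be computed directly with the obvious quadratic algorithm; the sketch below does exactly that on a toy users/interests graph, counting k-neighborhood edges for each k (the raw data behind a KNC-plot). The paper's contribution is doing this subquadratically, which this naive version is not.

```python
from collections import defaultdict
from itertools import combinations

def knc_sizes(edges, U, max_k):
    """For k = 1..max_k, count edges of the k-neighborhood graph on U:
    pairs of U-nodes with at least k common neighbors in V."""
    adj = defaultdict(set)  # u -> set of its V-side neighbors
    for u, v in edges:
        adj[u].add(v)
    common = {}
    for a, b in combinations(sorted(U), 2):
        common[(a, b)] = len(adj[a] & adj[b])  # number of length-two paths
    return [sum(1 for c in common.values() if c >= k)
            for k in range(1, max_k + 1)]

# toy bipartite graph of users and interests
edges = [("u1", "jazz"), ("u1", "chess"), ("u2", "jazz"),
         ("u2", "chess"), ("u3", "jazz")]
print(knc_sizes(edges, {"u1", "u2", "u3"}, 2))  # -> [3, 1]
```

Here all three user pairs share at least one interest, but only (u1, u2) share two, so connectivity degrades from 3 edges at k = 1 to 1 edge at k = 2.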
A Comparison of Extended Fingerprint Hashing and Locality Sensitive Hashing for Binary Audio Fingerprints
, 2011
Abstract

Cited by 2 (0 self)
Hash tables have been proposed for the indexing of high-dimensional binary vectors, specifically for the identification of media by fingerprints. In this paper we develop a new model to predict the performance of a hash-based method (Fingerprint Hashing) under varying levels of noise. We show that by the adjustment of two parameters, robustness to a higher level of noise is achieved. We extend Fingerprint Hashing to a multi-table range search (Extended Fingerprint Hashing) and show this approach also increases robustness to noise. We then show the relationship between Extended Fingerprint Hashing and Locality Sensitive Hashing and investigate design choices for dealing with higher noise levels. If index size must be held constant, the Extended Fingerprint Hash is a superior method. We also show that to achieve similar performance at a given level of noise, a Locality Sensitive Hash requires nearly a sixfold increase in index size, which is likely to be impractical for many applications.
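The general multi-table idea behind indexing noisy binary fingerprints can be sketched with bit sampling: each table keys on a different random subset of bit positions, so a query corrupted in a few bits still collides with its source in at least one table with good probability. This is an illustrative sketch of generic multi-table hashing, not the paper's Extended Fingerprint Hash; the table and bit counts are arbitrary choices.

```python
import random

random.seed(0)
N_BITS, N_TABLES, BITS_PER_TABLE = 64, 4, 16

def make_tables(fingerprints):
    """Build N_TABLES hash tables, each keyed on a random bit subset."""
    tables = []
    for _ in range(N_TABLES):
        positions = random.sample(range(N_BITS), BITS_PER_TABLE)
        table = {}
        for i, fp in enumerate(fingerprints):
            key = tuple(fp[p] for p in positions)
            table.setdefault(key, []).append(i)
        tables.append((positions, table))
    return tables

def lookup(tables, query):
    """Union of candidate indices across all tables."""
    hits = set()
    for positions, table in tables:
        hits.update(table.get(tuple(query[p] for p in positions), []))
    return hits

fps = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(100)]
tables = make_tables(fps)
print(7 in lookup(tables, fps[7]))  # -> True: an exact query always collides
noisy = list(fps[7])
noisy[3] ^= 1  # flip one bit; tables not sampling bit 3 still match
print(7 in lookup(tables, noisy))
```

Adding tables buys noise robustness at the cost of index size, which is exactly the trade-off the abstract quantifies.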
A Heterogeneous High Dimensional Approximate Nearest Neighbor Algorithm
Abstract

Cited by 1 (1 self)
We consider the problem of finding high-dimensional approximate nearest neighbors. Suppose there are d independent rare features, each having its own independent statistics. A point x will have x_i = 0 denote the absence of feature i, and x_i = 1 its existence. Let p_{i,jk} be the probability that x_i = j and y_i = k for “near” points x, y, versus (p_{i,00} + p_{i,01})(p_{i,10} + p_{i,11}) for random pairs. Sparsity means that usually x_i = 0: p_{i,01} + p_{i,10} + p_{i,11} = o(1). Distance between points is a variant of the Hamming distance. Dimensional reduction converts the sparse heterogeneous problem into a lower-dimensional full homogeneous problem. However, we will see that the converted problem can be much harder to solve than the original problem. Instead we suggest a direct approach, which works in the dense case too. It consists of T tries. In try t we rearrange the coordinates in increasing order of λ_{t,i}, which satisfy p_{i,00} / ((1 − r_{t,i})(p_{i,00} + p_{i,01}) λ_{t,i} + r_{t,i}) + p_{i,11} / ((1 − r_{t,i})(p_{i,10} + p_{i,11}) λ_{t,i} + r_{t,i}) = 1, where 0 < r_{t,i} < 1 are uniform random numbers, and the p's are the coordinates' statistical parameters. The points are lexicographically ordered, and each is compared to its neighbors in that order. We analyze generalizations of this algorithm, show that it is optimal in some class of algorithms, and estimate the number of tries necessary for success. It is governed by an information-like function, which we call small leaves bucketing forest information. Any doubts whether it is “information” are dispelled by another paper, where bucketing information is defined.
Optimal hash functions for approximate closest pairs on the ncube
Abstract

Cited by 1 (0 self)
One way to find closest pairs in large datasets is to use hash functions [6], [12]. In recent years locality-sensitive hash functions for various metrics have been given: projecting an n-cube onto k bits is a simple hash function that performs well. In this paper we investigate alternatives to projection. For various parameters, hash functions given by complete decoding algorithms for codes work better, and asymptotically random codes perform better than projection.
Fast Approximate Matching in Restriction Site Mapping
, 1995
Abstract
This paper presents a trie data structure for fast approximate matching of DNA fragments in a large-scale restriction mapping project. It analyzes several parameters that affect the performance of the data structure and briefly explores strategies on how it might be used and how to proceed in the initial stages of potential algorithms for mapping. The paper then reports on experimental results and concludes with future directions.
1 Introduction
1.1 Restriction Mapping
A DNA molecule is composed of two intertwined strands consisting of thousands to millions of the bases A, G, C, and T. A physical map of a DNA molecule consists of an ordering of markers or short sequences of base pairs along the molecule. Physical maps help reduce the cost of locating interesting regions (e.g., gene sites) in the genome of an organism by many orders of magnitude. In physical mapping projects, fragments called clones are sampled from one or more DNA molecules and collected in a clone library. The...
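A trie over fragment-length sequences can tolerate measurement error by quantizing lengths into bins before indexing, so nearby lengths follow the same trie edge. The sketch below is a hedged illustration in that spirit; the bin width and the clone data are hypothetical, and the paper's actual structure and matching strategy differ in detail.

```python
BIN = 100  # hypothetical quantization width, in base pairs

def insert(trie, fragments, clone_id):
    """Walk/extend the trie along the quantized fragment lengths."""
    node = trie
    for f in fragments:
        node = node.setdefault(f // BIN, {})
    node.setdefault("ids", []).append(clone_id)

def lookup(trie, fragments):
    """Clones whose quantized fragment sequence matches the query's."""
    node = trie
    for f in fragments:
        key = f // BIN
        if key not in node:
            return []
        node = node[key]
    return node.get("ids", [])

trie = {}
insert(trie, [512, 1310, 240], "cloneA")
insert(trie, [530, 1350, 260], "cloneB")  # same bins as cloneA
insert(trie, [900, 400], "cloneC")
print(lookup(trie, [505, 1333, 231]))  # -> ['cloneA', 'cloneB']
```

Quantization makes the match approximate: clones A and B differ in raw lengths but share a trie path, while a query with the same error profile still retrieves both in time linear in the sequence length.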