Results 1  10
of
30
Nearoptimal hashing algorithms for approximate nearest neighbor in high dimensions
, 2008
"... In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The ..."
Abstract

Cited by 457 (7 self)
 Add to MetaCart
In this article, we give an overview of efficient algorithms for the approximate and exact nearest neighbor problem. The goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. The problem is of significant interest in a wide variety of areas.
Nearest Neighbors In HighDimensional Spaces
, 2004
"... In this chapter we consider the following problem: given a set P of points in a highdimensional space, construct a data structure which given any query point q nds the point in P closest to q. This problem, called nearest neighbor search is of significant importance to several areas of computer sci ..."
Abstract

Cited by 93 (2 self)
 Add to MetaCart
In this chapter we consider the following problem: given a set P of points in a highdimensional space, construct a data structure which given any query point q nds the point in P closest to q. This problem, called nearest neighbor search is of significant importance to several areas of computer science, including pattern recognition, searching in multimedial data, vector compression [GG91], computational statistics [DW82], and data mining. Many of these applications involve data sets which are very large (e.g., a database containing Web documents could contain over one billion documents). Moreover, the dimensionality of the points is usually large as well (e.g., in the order of a few hundred). Therefore, it is crucial to design algorithms which scale well with the database size as well as with the dimension. The nearestneighbor problem is an example of a large class of proximity problems, which, roughly speaking, are problems whose definitions involve the notion of...
Detecting NearDuplicates for Web Crawling
 WWW 2007
, 2007
"... Nearduplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a nea ..."
Abstract

Cited by 92 (0 self)
 Add to MetaCart
Nearduplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a nearduplicate of a previously crawled web page or not. In the course of developing a nearduplicate detection system for a multibillion page repository, we make two research contributions. First, we demonstrate that Charikar’s fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing fbit fingerprints that differ from a given fingerprint in at most k bitpositions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
Dictionary matching and indexing with errors and don’t cares,”
 in Proceedings of the thirtysixth annual ACM symposium on Theory of computing. ACM,
, 2004
"... ..."
A Practical qGram Index for Text Retrieval Allowing Errors
 CLEI Electronic Journal
, 1998
"... We propose an indexing technique for approximate text searching, which is practical and powerful, and especially optimized for natural language text. Unlike other indices of this kind, it is able to retrieve any string that approximately matches the search pattern, not only words. Every text substri ..."
Abstract

Cited by 38 (10 self)
 Add to MetaCart
(Show Context)
We propose an indexing technique for approximate text searching, which is practical and powerful, and especially optimized for natural language text. Unlike other indices of this kind, it is able to retrieve any string that approximately matches the search pattern, not only words. Every text substring of a fixed length q is stored in the index, together with pointers to all the text positions where it appears. The search pattern is partitioned into pieces which are searched in the index, and all their occurrences in the text are verified for a complete match. To reduce space requirements, pointers to blocks instead of exact positions can be used, which increases querying costs. We design an algorithm to optimize the pattern partition into pieces so that the total number of verifications is minimized. This is especially well suited for natural language texts, and allows to know in advance the expected cost of the search and the expected relevance of the query to the user. We show experi...
NONEXPANSIVE HASHING
 COMBINATORICA 18 (1) (1998) 121–132
, 1998
"... In a nonexpansive hashing scheme, similar inputs are stored in memory locations which are close. We develop a nonexpansive hashing scheme wherein any set of size O(R 1−ε) from a large universe may be stored in a memory of size R (any ε>0, and R>R 0(ɛ)), and where retrieval takes O(1) operati ..."
Abstract

Cited by 32 (0 self)
 Add to MetaCart
(Show Context)
In a nonexpansive hashing scheme, similar inputs are stored in memory locations which are close. We develop a nonexpansive hashing scheme wherein any set of size O(R 1−ε) from a large universe may be stored in a memory of size R (any ε>0, and R>R 0(ɛ)), and where retrieval takes O(1) operations. We explain how to use nonexpansive hashing schemes for efficient storage and retrieval of noisy data. A dynamic version of this hashing scheme is presented as well.
Indexing and Dictionary Matching with One Error (Extended Abstract)
, 1999
"... The indexing problem is the one where a text is preprocessed and subsequent queries of the form: "Find all occurrences of pattern P in the text" are answered in time proportional to the length of the query and the number of occurrences. In the dictionary matching problem a set of patterns ..."
Abstract

Cited by 30 (5 self)
 Add to MetaCart
The indexing problem is the one where a text is preprocessed and subsequent queries of the form: "Find all occurrences of pattern P in the text" are answered in time proportional to the length of the query and the number of occurrences. In the dictionary matching problem a set of patterns is preprocessed and subsequent queries of the form: "Find all occurrences of dictionary patterns in text T" are answered in time proportional to the length of the text and the number of occurrences. There exist efficient worstcase solutions for the indexing problem and the dictionary matching problem, but none that find approximate occurrences of the patterns, i.e. where the pattern is within a bound edit (or hamming...
Multiple Approximate String Matching
 In Proc. of WADS'97, LNCS 1272
, 1997
"... We present two new algorithms for online multiple approximate string matching. These are extensions of previous algorithms that search for a single pattern. The singlepattern version of the first one is based on the simulation with bits of a nondeterministic finite automaton built from the patter ..."
Abstract

Cited by 25 (13 self)
 Add to MetaCart
(Show Context)
We present two new algorithms for online multiple approximate string matching. These are extensions of previous algorithms that search for a single pattern. The singlepattern version of the first one is based on the simulation with bits of a nondeterministic finite automaton built from the pattern and using the text as input. To search for multiple patterns, we superimpose their automata, using the result as a filter. The second algorithm partitions the pattern in subpatterns that are searched with no errors, with a fast exact multipattern search algorithm. To handle multiple patterns, we search the subpatterns of all of them together. The average running time achieved is in both cases O(n) for moderate error level, pattern length and number of patterns. They adapt (with higher costs) to the other cases. However, the algorithms differ in speed and thresholds of usefulness. We analyze theoretically when each algorithm should be used, and show experimentally that they are faster ...
Approximate Text Searching
, 1998
"... This thesis focuses on the problem of text retrieval allowing errors, also called "approximate" string matching. The problem is to nd a pattern in a text, where the pattern and the text may have "errors". This problem has received a lot of attention in recent years because of its ..."
Abstract

Cited by 25 (7 self)
 Add to MetaCart
(Show Context)
This thesis focuses on the problem of text retrieval allowing errors, also called "approximate" string matching. The problem is to nd a pattern in a text, where the pattern and the text may have "errors". This problem has received a lot of attention in recent years because of its applications in many areas, such as information retrieval, computational biology and signal processing, to name a few. The aim of this work is the development and analysis of novel algorithms to deal with the problem under various conditions, as well as a better understanding of the problem itself and its statistical behavior. Although our results are valid in many dierent areas, we focus our attention on typical text searching for information retrieval applications. This makes some ranges of values for the parameters of the problem more interesting than others. We have divided this presentation in two parts. The rst one deals with online approximate string matching, i.e. when there is no time or space to preprocess the text. These algorithms are the core of oline algorithms as well. Online searching is the area of the problem where better algorithms existed. We have obtained new bounds for the probability of an approximate match of a pattern in