Probabilistic Proximity Searching Algorithms Based on Compact Partitions
Journal of Discrete Algorithms, 2002
Cited by 16 (7 self)
The main bottleneck of the research in metric space searching is the so-called curse of dimensionality, which makes the task of searching some metric spaces intrinsically difficult, whatever algorithm is used. A recent trend to break this bottleneck resorts to probabilistic algorithms, where it has been shown that one can find 99% of the relevant objects at a fraction of the cost of the exact algorithm. These algorithms are welcome in most applications because resorting to metric space searching already involves a fuzziness in the retrieval requirements.
Efficient Similarity Search for Hierarchical Data in Large Databases
In Proc. of EDBT, 2004
Cited by 15 (3 self)
Structured and semi-structured object representations are becoming increasingly important for modern database applications. Examples of such data are hierarchical structures, including chemical compounds, XML data and image data.
Overcoming the ℓ1 non-embeddability barrier: Algorithms for product metrics
2008
Cited by 14 (8 self)
A common approach for solving computational problems over a difficult metric space is to embed the “hard” metric into L1, which admits efficient algorithms and is thus considered an “easy” metric. This approach has proved successful or partially successful for important spaces such as the edit distance, but it also has inherent limitations: it is provably impossible to go below a certain approximation for some metrics. We propose a new approach, of embedding the difficult space into richer host spaces, namely iterated products of standard spaces like ℓ1 and ℓ∞. We show that this class is rich since it contains useful metric spaces with only a constant distortion, and, at the same time, it is tractable and admits efficient algorithms. Using this approach, we obtain for example the first nearest neighbor data structure with O(log log d) approximation for edit distance in nonrepetitive strings (the Ulam metric). This approximation is exponentially better than the lower bound for embedding into L1. Furthermore, we give constant factor approximation for two other computational problems. Along the way, we answer positively a question posed in [Ajtai, Jayram, Kumar, and Sivakumar, STOC 2002]. One of our algorithms has already found applications for smoothed edit distance over 0-1 strings [Andoni and Krauthgamer, ICALP 2008].
Improving Hybrid MDS with Pivot-Based Searching
2003
Cited by 13 (2 self)
An algorithm is presented for the visualisation of multidimensional abstract data, building on a hybrid model introduced at InfoVis 2002. The most computationally complex stage of the original model involved performing a nearest-neighbour search for every data item. The complexity of this phase has been reduced by treating all high-dimensional relationships as a set of discretised distances to a constant number of randomly selected pivot items. In improving this computational bottleneck, the complexity is reduced from O(N√N) to O(N^(5/4)). As well as documenting this improvement, the paper describes evaluation with a data set of 108000 14-dimensional items; a considerable increase on the size of data previously tested. Results illustrate that the reduction in complexity is reflected in significantly improved run times and that no negative impact is made upon the quality of layout produced.
Effective Proximity Retrieval by Ordering Permutations
2007
Cited by 13 (4 self)
We introduce a new probabilistic proximity search algorithm for range and K-nearest neighbor (K-NN) searching in both coordinate and metric spaces. Although there exist solutions for these problems, they boil down to a linear scan when the space is intrinsically high-dimensional, as is the case in many pattern recognition tasks. This, for example, renders the K-NN approach to classification rather slow in large databases. Our novel idea is to predict closeness between elements according to how they order their distances towards a distinguished set of anchor objects. Each element in the space sorts the anchor objects from closest to farthest to it, and the similarity between orders turns out to be an excellent predictor of the closeness between the corresponding elements. We present extensive experiments comparing our method against state-of-the-art exact and approximate techniques, both in synthetic and real, metric and non-metric databases, measuring both CPU time and distance computations. The experiments demonstrate that our technique almost always improves upon the performance of alternative techniques, in some cases by a wide margin.
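The idea of ordering anchors can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy data, and the choice of the Spearman footrule as the measure of similarity between orders are all assumptions made for illustration.

```python
import math

def permutation(obj, anchors, dist):
    """Anchor indices sorted from closest to farthest from obj (ties by index)."""
    return sorted(range(len(anchors)), key=lambda i: (dist(obj, anchors[i]), i))

def footrule(perm_a, perm_b):
    """Spearman footrule: total displacement of each anchor between two orders.
    Assumed here as the order-similarity measure; other choices are possible."""
    pos_b = {anchor: rank for rank, anchor in enumerate(perm_b)}
    return sum(abs(rank - pos_b[anchor]) for rank, anchor in enumerate(perm_a))

def candidate_order(query, database, anchors, dist):
    """Database indices sorted by how similar their anchor order is to the query's."""
    q_perm = permutation(query, anchors, dist)
    # In an index these permutations would be computed once and stored.
    perms = [permutation(x, anchors, dist) for x in database]
    return sorted(range(len(database)), key=lambda i: (footrule(q_perm, perms[i]), i))

# Hypothetical toy data: 2-D points under Euclidean distance.
anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
database = [(1.0, 2.0), (9.0, 1.0), (1.0, 9.0), (6.0, 4.0)]
query = (0.5, 1.5)

order = candidate_order(query, database, anchors, math.dist)
```

A search would then compare the query against only a prefix of `order` using the real distance, which is where the probabilistic (non-exact) behaviour comes from.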
A content-addressable network for similarity search in metric spaces
In DBISP2P ’05: Proceedings of the 2nd International Workshop on Databases, Information Systems and Peer-to-Peer Computing, 2005
Cited by 12 (3 self)
In this paper we present a scalable and distributed access structure for similarity search in metric spaces. The approach is based on the Content-addressable Network (CAN) paradigm, which provides a Distributed Hash Table (DHT) abstraction over a Cartesian space. We have extended the CAN structure to support storage and retrieval of more generic metric space objects. We use pivots for projecting objects of the metric space into an N-dimensional vector space, and exploit the CAN organization for distributing the objects among computer nodes of the structure. We obtain a Peer-to-Peer network, called the MCAN, which is able to search metric space objects by means of similarity range queries. Experiments conducted on our prototype system confirm full scalability of the approach.
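The pivot projection can be sketched on a single machine as follows. This is only an illustration of the general pivot technique, not the MCAN system: the distribution over CAN nodes is omitted, and the Hamming metric, the word list, and all function names are assumptions. By the triangle inequality, |d(x, p) - d(q, p)| ≤ d(x, q) for every pivot p, so the largest coordinate difference between two projected vectors is a lower bound on the true distance and can be used to discard objects without computing it.

```python
def hamming(a, b):
    """Hamming distance between equal-length strings (a metric)."""
    return sum(ca != cb for ca, cb in zip(a, b))

def project(obj, pivots, dist):
    """Map an object to the vector of its distances to the pivots."""
    return tuple(dist(obj, p) for p in pivots)

def build_index(database, pivots, dist):
    """Precompute the projected vector of every database object."""
    return [(obj, project(obj, pivots, dist)) for obj in database]

def range_query(query, radius, index, pivots, dist):
    """Return objects within radius of query, plus how many needed a real distance."""
    q_vec = project(query, pivots, dist)
    results, evaluated = [], 0
    for obj, vec in index:
        # Largest coordinate gap is a lower bound on dist(query, obj).
        lower_bound = max(abs(a - b) for a, b in zip(q_vec, vec))
        if lower_bound > radius:
            continue  # safely discarded without computing the real distance
        evaluated += 1
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, evaluated

# Hypothetical toy data.
pivots = ["karma", "zebra"]
index = build_index(["karma", "kirby", "carma", "zebra", "karts"], pivots, hamming)
results, evaluated = range_query("harma", 1, index, pivots, hamming)
```

In this toy run two of the five objects are discarded by the lower bound alone; the filter never discards a true answer, so the range query stays exact.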
t-Spanners as a Data Structure for Metric Space Searching
In Proc. 9th International Symposium on String Processing and Information Retrieval, 2002
Cited by 12 (3 self)
A t-spanner, a subgraph that approximates graph distances within a precision factor t, is a well-known concept in graph theory. In this paper we use it in a novel way, namely as a data structure for searching metric spaces. The key idea is to consider the t-spanner as an approximation of the complete graph of distances among the objects, and use it as a compact device to simulate the large matrix of distances required by successful search algorithms like AESA [Vidal 1986]. The t-spanner provides a time-space tradeoff where full AESA is just one extreme. We show that the resulting algorithm is competitive against current approaches, e.g., 1.5 times the time cost of AESA using only 3.21% of its space requirement, in a metric space of strings; and 1.09 times the time cost of AESA using only 3.83% of its space requirement, in a metric space of documents. We also show that t-spanners provide better space-time tradeoffs than classical alternatives such as pivot-based indexes. Furthermore, we show that the concept of t-spanners has potential for large improvements.
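To make the t-spanner property concrete, here is a minimal sketch that builds one with the classical greedy construction and then reads approximate distances off the graph. The greedy method is swapped in for illustration and is not claimed to be the construction used in the paper; the point set, the value of t, and the function names are assumptions.

```python
import heapq
import math
from itertools import combinations

def shortest_path(adj, src, dst):
    """Dijkstra shortest-path length in a weighted undirected graph."""
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > best.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf

def greedy_spanner(points, dist, t):
    """Scan pairs by increasing distance; keep an edge only when the current
    graph distance between its endpoints exceeds t times the real distance."""
    adj = {i: [] for i in range(len(points))}
    pairs = sorted(combinations(range(len(points)), 2),
                   key=lambda p: dist(points[p[0]], points[p[1]]))
    kept = 0
    for u, v in pairs:
        d = dist(points[u], points[v])
        if shortest_path(adj, u, v) > t * d:
            adj[u].append((v, d))
            adj[v].append((u, d))
            kept += 1
    return adj, kept

# Hypothetical toy data: 2-D points under Euclidean distance.
points = [(0, 0), (1, 0), (2, 0), (0, 1), (2, 2), (3, 1)]
t = 1.5
adj, kept = greedy_spanner(points, math.dist, t)
```

For every pair (u, v) the graph distance then satisfies d(u, v) ≤ d_G(u, v) ≤ t · d(u, v), so the stored edges can stand in for the full distance matrix at the cost of a bounded error.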
Probabilistic Proximity Search: Fighting the Curse of Dimensionality in Metric Spaces
Information Processing Letters
Cited by 11 (4 self)
Proximity searches become very difficult on "high dimensional" metric spaces, that is, those whose histogram of distances has a large mean and/or a small variance. This so-called "curse of dimensionality", well known in vector spaces, is also observed in metric spaces. The search complexity grows sharply with the dimension and with the search radius. We present a general probabilistic framework applicable to any search algorithm and whose net effect is to reduce the search radius. The higher the dimension, the more effective the technique. We illustrate empirically its practical performance on a particular class of algorithms, where large improvements in the search time are obtained at the cost of a very small error probability.
Disorder inequality: A combinatorial approach to nearest neighbor search
In WSDM’08
Cited by 11 (4 self)
We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use the triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. In this paper we introduce a special property of the similarity function on a set S that leads to efficient combinatorial algorithms for S. The disorder constant D(S) of a set S is defined to ensure the following inequality: if x is the a’th most similar object to z and y is the b’th most similar object to z, then x is among the D(S) · (a + b) most similar objects to y. Assuming that disorder is small we present the first two known combinatorial algorithms for nearest neighbors whose query time has logarithmic dependence on the size of S. The first one, called Ranwalk, is a randomized zero-error algorithm that always returns the exact nearest neighbor. It uses space quadratic in the input size in preprocessing, but is very efficient in query processing. The second algorithm, called Arwalk, uses near-linear space. It uses random choices in preprocessing, but the query processing is essentially deterministic. For an arbitrary query q, there is only a small probability that the chosen data structure does not support q. Finally, we show that for the Reuters corpus average disorder is indeed quite small and that Ranwalk efficiently computes the nearest neighbor in most cases.
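The disorder inequality can be checked by brute force on a small set. The sketch below computes the smallest D for which the inequality holds over all triples of distinct objects. Several conventions are assumptions not fixed by the abstract: ranks are 1-based, an object is excluded from its own ranking, ties are broken by index, and the similarity function and data are toy choices.

```python
from itertools import permutations

def ranks_from(ref, items, sim):
    """1-based rank of every other item by decreasing similarity to items[ref]."""
    others = [i for i in range(len(items)) if i != ref]
    others.sort(key=lambda i: (-sim(items[ref], items[i]), i))
    return {i: r for r, i in enumerate(others, start=1)}

def disorder_constant(items, sim):
    """Smallest D with rank_y(x) <= D * (rank_z(x) + rank_z(y))
    for all distinct x, y, z (cubic-time brute force)."""
    n = len(items)
    rank = [ranks_from(i, items, sim) for i in range(n)]
    worst = 0.0
    for x, y, z in permutations(range(n), 3):
        worst = max(worst, rank[y][x] / (rank[z][x] + rank[z][y]))
    return worst

def similarity(a, b):
    """Toy similarity: larger when two numbers are closer."""
    return -abs(a - b)

# Hypothetical toy data chosen so that no two similarities tie.
items = [0, 1, 3, 7]
D = disorder_constant(items, similarity)
```

Only comparisons between similarity values are used, never their magnitudes, which matches the combinatorial setting described in the abstract.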
Real-Time Speaker Identification and Verification
Accepted for publication in IEEE Trans. Speech & Audio Processing
Cited by 10 (7 self)
In speaker identification, most of the computation originates from the distance or likelihood computations between the feature vectors of the unknown speaker and the models in the database. The identification time depends on the number of feature vectors, their dimensionality, the complexity of the speaker models and the number of speakers. In this paper, we concentrate on optimizing vector quantization (VQ) based speaker identification. We reduce the number of test vectors by pre-quantizing the test sequence prior to matching, and the number of speakers by pruning out unlikely speakers during the identification process. The best variants are then generalized to Gaussian mixture model (GMM) based modeling. We also apply the algorithms to efficient cohort set search for score normalization in speaker verification. We obtain a speed-up factor of 16:1 in the case of VQ-based modeling with minor degradation in the identification accuracy, and 34:1 in the case of GMM-based modeling. An equal error rate of 7% can be reached in 0.84 seconds on average when the length of the test utterance is 30.4 seconds.

