E. Chávez, G. Navarro, R. Baeza-Yates, and J.L. Marroquín, "Searching in metric spaces," ACM Computing Surveys, vol. 33, no. 3, pp. 273--321, Sept. 2001.
.... to metric spaces as well: the typical feature in high-dimensional spaces with $L_p$ distances is that the probability distribution of distances among elements has a very concentrated histogram (with larger mean as the dimension grows), making the work of any similarity search algorithm more difficult [2, 3]. In the extreme case we have a space where $d(x, x) = 0$ and $\forall y \neq x,\ d(x, y) = 1$, where it is impossible to avoid a single distance evaluation at search time. We say that a general metric space is high dimensional when its histogram of distances is concentrated. There are a number of methods to ....
....The original sa-tree adapts itself to the dimension, but not optimally. 2 The Spatial Approximation Tree. We describe briefly in this section the static sa-tree data structure. For lack of space we do not cover alternative structures for metric space searching; the reader is referred to [3] for details. The sa-tree needs $O(n)$ space, $O(n \log n / \log\log n)$ construction time, and sublinear search time: $O(n^{1-\Theta(1/\log\log n)})$ in high dimensions and $O(n^{\alpha})$ ($0 < \alpha < 1$) in low dimensions. It is experimentally shown to offer better space-time tradeoffs than other data structures ....
E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searching in metric spaces. ACM Computing Surveys, 33(3):273-321, September 2001.
....or hardness for searching a D-dimensional space: higher-dimensional spaces have a probability distribution of distances among elements whose histogram is more concentrated and with larger mean. This makes the work of any similarity search algorithm more difficult (this is discussed, for example, in [33,6,9,14]). In the extreme case we have a space where $d(x, x) = 0$ and $\forall y \neq x,\ d(x, y) = 1$, where the query has to be exhaustively compared against every element in the set. We will extend this idea by saying that a general metric space is harder than another when its histogram of distances is more ....
....construction is discussed in Section 7. We draw our conclusions in Section 8. A partial and less mature earlier version of this work appeared in [22]. 2 Previous Work. Algorithms to search in general metric spaces can be divided into two large areas: pivot-based and clustering algorithms. See [14] for a more complete review. [Fig. 1: A flatter (left) versus a more concentrated (right) histogram. The latter implies harder-to-search metric spaces because the triangle inequality permits discarding fewer elements (the non-grayed area).] Pivot based ....
E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searching in metric spaces. ACM Computing Surveys, 2001. To appear.
....that the search complexity is exponential with the dimension. It is interesting to point out that the concept of dimensionality can be translated into metric spaces as well. The histogram of distances of a high-dimensional vector space has a large mean and normally a small relative variance. In [7] this is used to define the intrinsic dimension of a general metric space as $\rho = \mu^2 / (2\sigma^2)$, where $\mu$ and $\sigma^2$ are the mean and variance of the histogram of distances. Under this definition, a database of random k-dimensional vectors with uniformly distributed coordinates has intrinsic dimension $\rho =$ ....
....the mean and variance of the histogram of distances. Under this definition, a database of random k-dimensional vectors with uniformly distributed coordinates has intrinsic dimension $\rho = \Theta(k)$. Hence, the definition extends naturally that of vector spaces. Analytical lower bounds and experiments in [7] show that all the search algorithms degrade as $\rho$ increases. The problem has received the name of curse of dimensionality. In terms of the histogram, we see two reasons for it. First, if $\rho$ increases because $\sigma^2$ is reduced, then most distances tend to give the same values and hence yield less ....
[Article contains additional citation context not shown here]
E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searching in metric spaces. ACM Computing Surveys, 2001. To appear.
....components of the support. In particular we can restrict ourselves to vector spaces, or to metric spaces that can be embedded into a low-dimensional vector space. It is not necessary to explicitly find the mapping (that can be a tough goal); instead we can postulate and prove a null hypothesis. In [CNBYM01] they prove analytical and experimental bounds for k, depending on both the size of the sample and the dimension of the embedded vector space. The procedure is sound as long as the distribution is regular enough. The bounds proved are not tight, but can be improved using a Monte Carlo procedure. ....
....For massive data sets, main memory indexes are of no use. Instead a secondary memory index has to be used. To the best of our knowledge the searching cost of secondary memory indexes for metric spaces is not well understood, but we can extrapolate the searching cost of main memory indexes. In [CNBYM01, CN01a] they show that the searching cost is sublinear, in the form $O(n^{\alpha})$ with $0 < \alpha < 1$ a constant depending on the intrinsic dimension of the data. The searching cost is measured in terms of the number of distance computations. If the dimension is low (less than 10) and the searching ....
E. Chávez, G. Navarro, R. Baeza-Yates, and J.L. Marroquín. Searching in metric spaces. ACM Computing Surveys, 2001. To appear.
....out that the concept of dimensionality can be translated into metric spaces as well. The typical feature of high-dimensional spaces in vector spaces is that the probability distribution of distances among elements has a very concentrated histogram (with a larger mean as the dimension grows). In [5, 7] this is used as a definition of intrinsic dimensionality for general metric spaces, which we adopt in this paper: Definition 1. The intrinsic dimension of a database in a metric space is $\rho = \mu^2 / (2\sigma^2)$, where $\mu$ and $\sigma^2$ are the mean and variance of its histogram of distances. Under this ....
....distances. Under this definition, a database formed by random k-dimensional vectors where the coordinates are independent and identically distributed has intrinsic dimension $\Theta(k)$ [14]. Hence, the definition extends naturally that of vector spaces. Analytical lower bounds and experiments in [5, 7] show that all the algorithms degrade systematically as the intrinsic dimension $\rho$ of the space increases. The problem is so hard that it has received the name of curse of dimensionality, and it is due to two possible reasons. On one hand, if $\rho$ increases because the variance is reduced, then we ....
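The definition quoted above can be tried numerically. The following is a minimal sketch, not code from any of the cited papers: it estimates $\rho = \mu^2 / (2\sigma^2)$ from a random sample of pairwise distances. The function names, the sampling scheme, and the sample sizes are all assumptions made for illustration.

```python
import math
import random
import statistics


def intrinsic_dimension(points, dist, sample_pairs=20000, seed=0):
    """Estimate rho = mu^2 / (2 * sigma^2) from sampled pairwise distances.

    Sampling pairs (rather than using all of them) is an illustrative
    shortcut, not something prescribed by the quoted definition.
    """
    rng = random.Random(seed)
    n = len(points)
    distances = []
    for _ in range(sample_pairs):
        i, j = rng.sample(range(n), 2)  # two distinct elements
        distances.append(dist(points[i], points[j]))
    mu = statistics.fmean(distances)
    var = statistics.pvariance(distances, mu)
    if var == 0:
        # Fully concentrated histogram, e.g. the extreme space where
        # d(x, y) = 1 for all x != y.
        return math.inf
    return mu * mu / (2.0 * var)


def euclidean(a, b):
    """L2 distance between two coordinate sequences."""
    return math.dist(a, b)


def random_vectors(n, k, seed=0):
    """n vectors with k independent coordinates uniform in [0, 1)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(k)] for _ in range(n)]


if __name__ == "__main__":
    for k in (2, 4, 8, 16):
        vectors = random_vectors(500, k, seed=k)
        rho = intrinsic_dimension(vectors, euclidean, sample_pairs=5000)
        print(f"k = {k:2d}  estimated rho = {rho:.2f}")
```

Under these assumptions the estimate should grow with k, in line with the $\Theta(k)$ behaviour stated in the context above; the exact values depend on the sample.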
[Article contains additional citation context not shown here]
E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searching in metric spaces. ACM Computing Surveys, 2001. To appear. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/survmetric.ps.gz.
....$k)| = k$ and $\forall u \in nn_d(q, k),\ v \in U - nn_d(q, k):\ d(q, u) \le d(q, v)$. In this work we concentrate on range queries for simplicity. Many of the results, however, can be extended to nearest neighbor searching as well, since the corresponding algorithms are normally built over those for range queries [9]. In most applications the distance $d$ is very expensive to compute, and therefore the complexity of a search algorithm is measured in terms of the number of evaluations of $d$. It is clear that either type of query can be answered by examining the entire dictionary U. An indexing algorithm is an ....
....in which the points can be embedded while keeping the distances among them). This is in general the case of real applications, where the data is clustered. 3 Proximity Search Algorithms. Different data structures have been proposed to filter out elements based on the triangular inequality (see [9] for a complete survey). We divide the exposition according to the two main techniques used. Pivot-based algorithms are built on a single general idea: select some elements from U (called pivots) and identify all the other elements with their distances to (some of) the pivots. The methods differ ....
E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searching in metric spaces. Technical Report TR/DCC-99-3, Dept. of Computer Science, Univ. of Chile, 1999. To appear in ACM Computing Surveys. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/survmetric.ps.gz.
....be expensive to compute, the goal is to structure the database so that we perform few distance evaluations. All the existing techniques work by discarding elements using the triangular inequality. We concentrate on range queries in this paper, as the others can be systematically built over these [10]. The set of points of X that are at distance at most r to q is called the query ball, so (q, r) is the intersection of the query ball and U. A particular case of this problem arises when the space is $\mathbb{R}^k$. There are effective methods for this case, such as kd-trees [3] or R-trees [13] ....
....and $\forall y \neq x,\ d(x, y) = 1$, where it is impossible to avoid a single distance evaluation at search time. We say that a general metric space is high dimensional when its histogram of distances is concentrated. We use in this paper a quantitative definition of the intrinsic dimensionality proposed in [10]: Definition: The intrinsic dimensionality of a metric space is defined as $\rho = \mu^2 / (2\sigma^2)$, where $\mu$ and $\sigma^2$ are the mean and variance of its histogram of distances. Under this definition, a random vector space with k coordinates has intrinsic dimension $\Theta(k)$, so the definition extends ....
[Article contains additional citation context not shown here]
E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searching in metric spaces. Technical Report TR/DCC-99-3, Dept. of Computer Science, Univ. of Chile, 1999. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/survmetric.ps.gz.
....to be a prefix of xy, a suffix of yx and a substring of yxz. 2 Metric Spaces. We describe in this section the concepts related to searching in metric spaces. We have concentrated only on the part which is relevant for this paper. There exist recent surveys if more complete information is desired [11]. A metric space is, informally, a set of black-box objects and a distance function defined among them, which satisfies the triangular inequality. The problem of proximity searching in metric spaces consists of indexing the set such that later, given a query, all the elements of the set which are ....
E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searching in metric spaces. ACM Computing Surveys, 2001. To appear. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/survmetric.ps.gz.
....spaces. We introduce a novel technique for speeding up the process. The idea is to index the set in order to satisfy range queries, i.e. given an element of the set q and a tolerance radius r, find all the other elements within distance r to q. There are lots of indexing algorithms with this aim [CNBYM01] and any such index can be used for our purposes. Let us assume that the index answers a range query that retrieves the closest n elements performing n distance computations (0 1). We show analytically that our algorithm needs time O(k 1 n 2 1 ) on average. The value ....
....quickly as the dimension of the space grows, a fact known as the curse of dimensionality. In coordinate spaces with points chosen randomly, the dimension is simply D. In metric spaces, where there are no coordinates, the dimension can be defined using properties of the histogram of distances [CN00, CNBYM01] which extend naturally that of spaces with coordinates. 2 A Generic Solution to the Aknn Problem. We present now our approach in detail. To find all the k nearest neighbors of a set of points we solve n range queries with a radius that ensures finding at least k nearest neighbors for each point. ....
[Article contains additional citation context not shown here]
E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searching in metric spaces. ACM Computing Surveys, 2001. To appear.
....complexity (number of pivots). It is clear that there is an optimum k where the sum of internal plus external complexity is minimized. Figure 5 shows an experiment with random uniformly distributed vectors in $([0, 1]^8, L_2)$ where we have used different numbers of pivots and the optimum is reached for k close to 110. [Figure 5: two plots against the number of pivots (0 to 300), 100,000 elements, dimension 8; curves labeled ext, in ext, and in; the first panel is titled "Radius captures 0.01 of N".] ....
....are discussed in (Uhlmann, 1991b) and are applicable here as well. 4.2. An Example. Consider the FHFQT of Figure 6. Each branch from the root represents a distance to pivot $p_1$. Branches from the second-level nodes refer to the distances to $p_2$, and so on. Given a query $(q, r)_d$, the search algorithm enters, at level $i$ in the tree, only those branches within the interesting interval $d(q, p_i) \pm r$. Consider $r = 2$ and $\{d(q, p_i)\} = \{3, 4, 5, 4\}$: Branches labeled [1, 2, 3, 4] in the first level will be examined and, recursively, all ....
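The interval test in the example above can be worked through in a few lines. This is only a sketch of the arithmetic: it computes the interval $d(q, p_i) \pm r$ for each level and filters a list of branch labels against it. The helper names are hypothetical, and the branch labels passed in the usage lines are made up, since the tree of Figure 6 is not reproduced here.

```python
def level_intervals(query_pivot_distances, r):
    """Return the interval [d(q, p_i) - r, d(q, p_i) + r] for each tree level i.

    The lower end is clamped at 0 because distances are non-negative.
    """
    return [(max(0, d - r), d + r) for d in query_pivot_distances]


def branches_to_enter(branch_labels, d_q_pivot, r):
    """Keep the branch labels (distances to the pivot) inside d(q, p_i) +/- r."""
    low, high = max(0, d_q_pivot - r), d_q_pivot + r
    return [label for label in branch_labels if low <= label <= high]


if __name__ == "__main__":
    # Values from the quoted example: r = 2 and d(q, p_i) = 3, 4, 5, 4.
    print(level_intervals([3, 4, 5, 4], 2))
    # Hypothetical first-level branch labels, for illustration only.
    print(branches_to_enter([1, 2, 3, 4, 6, 7], 3, 2))
```

With the quoted values the first-level interval is [1, 5], so any first-level branch whose label falls outside that range is skipped; which labels actually exist depends on the tree being searched.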
[Article contains additional citation context not shown here]
Chávez, E., G. Navarro, R. Baeza-Yates, and J. Marroquín: 1999c, `Searching in Metric Spaces'. Technical Report TR/DCC-99-3, Dept. of Computer Science, Univ. of Chile. Submitted. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/survmetric.ps.gz.
.... to metric spaces as well: the typical feature in high-dimensional spaces with $L_p$ distances is that the probability distribution of distances among elements has a very concentrated histogram (with larger mean as the dimension grows), making the work of any similarity search algorithm more difficult [5, 10]. In the extreme case we have a space where $d(x, x) = 0$ and $\forall y \neq x,\ d(x, y) = 1$, where it is impossible to avoid a single distance evaluation at search time. We say that a general metric space is high dimensional when its histogram of distances is concentrated. There are a number of methods to ....
....while keeping its good search efficiency. As a related byproduct of this study, we give new algorithmic insights into the behavior of this data structure. 2. Previous Work. Algorithms to search in general metric spaces can be divided into two large areas: pivot-based and clustering algorithms. See [10] for a more complete review. Pivot-based algorithms. The idea is to use a set of k distinguished elements (pivots) $p_1, \ldots, p_k \in S$ and to store, for each database element x, its distance to the k pivots $(d(x, p_1), \ldots, d(x, p_k))$. Given the query q, its distance to the k pivots is computed ($d(q,$ ....
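The pivot-based idea described above can be sketched as follows. This is a generic illustration under stated assumptions, not the algorithm of any particular cited paper: by the triangle inequality, an element x can be discarded without evaluating d(q, x) whenever |d(q, p_i) - d(x, p_i)| > r for some pivot p_i. The function names and the returned evaluation counter are hypothetical.

```python
def build_pivot_table(database, pivots, dist):
    """Precompute d(x, p_i) for every database element x and pivot p_i."""
    return [[dist(x, p) for p in pivots] for x in database]


def range_query(database, pivots, table, dist, q, r):
    """Return (elements within distance r of q, distance evaluations used)."""
    # "Internal" cost: one evaluation per pivot.
    query_to_pivots = [dist(q, p) for p in pivots]
    evaluations = len(pivots)
    results = []
    for x, x_to_pivots in zip(database, table):
        # Triangle inequality: |d(q, p) - d(x, p)| <= d(q, x), so a gap
        # larger than r proves x is outside the query ball.
        if any(abs(dq - dx) > r for dq, dx in zip(query_to_pivots, x_to_pivots)):
            continue
        # "External" cost: x survived the filter and must be compared directly.
        evaluations += 1
        if dist(q, x) <= r:
            results.append(x)
    return results, evaluations


if __name__ == "__main__":
    def absolute_difference(a, b):
        return abs(a - b)

    database = list(range(100))
    pivots = [0, 99]  # arbitrary choice for the illustration
    table = build_pivot_table(database, pivots, absolute_difference)
    found, cost = range_query(database, pivots, table, absolute_difference, 50, 2)
    print(found, cost)
```

More pivots raise the internal cost but usually leave fewer candidates to compare directly, which is the trade-off between internal and external complexity mentioned in one of the contexts above.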
E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searching in metric spaces. ACM Computing Surveys, 2001. To appear.
....$k)| = k$ and $\forall u \in nn_d(q, k),\ v \in U - nn_d(q, k):\ d(q, u) \le d(q, v)$. In this work we concentrate on range queries for simplicity. Many of the results, however, can be extended to nearest neighbor searching as well, since the corresponding algorithms are normally built over those for range queries [13]. Since the operation of leading complexity is computing distances, we will measure the complexity of the searching algorithms in terms of the number of distances computed. 3 Proximity Search Algorithms. Different data structures have been proposed to filter out elements based on the triangular ....
....operation of leading complexity is computing distances, we will measure the complexity of the searching algorithms in terms of the number of distances computed. 3 Proximity Search Algorithms. Different data structures have been proposed to filter out elements based on the triangular inequality (see [13] for a complete survey). We divide the exposition according to the two main techniques used. Pivot-based algorithms are built on a single general idea: select some elements from U (called pivots) and identify all the other elements with their distances to (some of) the pivots. The methods differ ....
Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.: Searching in metric spaces. Technical Report TR/DCC-99-3, Dept. of Computer Science, Univ. of Chile (1999). To appear in ACM Computing Surveys.
No context found.
E. Chávez, G. Navarro, R. Baeza-Yates, and J.L. Marroquín, "Searching in metric spaces," ACM Computing Surveys, vol. 33, no. 3, pp. 273--321, Sept. 2001.
No context found.
E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searching in metric spaces. Technical Report TR/DCC-99-3, Dept. of Computer Science, Univ. of Chile, 1999.
No context found.
Edgar Chávez, Gonzalo Navarro, Ricardo Baeza-Yates, and José L. Marroquín. 2001. Searching in metric spaces. ACM Computing Surveys, 33(3):273--321, September.
No context found.
Edgar Chávez, Gonzalo Navarro, Ricardo A. Baeza-Yates, and José L. Marroquín. Searching in metric spaces. ACM Computing Surveys, 33(3):273--321, 2001.
No context found.
E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín, Searching in metric spaces, ACM Computing Surveys, 33 (2001), pp. 273-322.
No context found.
E. Chávez, G. Navarro, R. Baeza-Yates, and J.L. Marroquín. Searching in metric spaces. ACM Computing Surveys, 33(3):273--321, 2001.
No context found.
Edgar Chávez, Gonzalo Navarro, Ricardo Baeza-Yates, and José Luis Marroquín. Searching in metric spaces. ACM Computing Surveys, 33(3):273--321, September 2001.
No context found.
E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searching in metric spaces. Technical Report TR/DCC-99-3, Dept. of Computer Science, Univ. of Chile, 1999.