25 citations found.
Araujo, M. D., Navarro, G., and Ziviani, N. (1997). Large text searching allowing errors. In Baeza-Yates, R., editor, Proceedings of the 4th South American Workshop on String Processing, pages 2-20, Valparaiso, Chile. Carleton University Press.

This paper is cited in the following contexts:
Efficient Distributed Algorithms to Build Inverted Files - Es

.... doubles in size each time we move up in the computation tree of Figure 3). At the end, the vocabulary is broadcast to all other processors (by processor $p_0$). The size $v$ in English words of the vocabulary (for a text of size $c$) can be computed as $v = Kc^{\beta}$, where $0 < \beta < 1$ and $K$ is a constant [6, 2]. Thus, the time $t_2$ spent at phase 2 can be approximated by $t_2 = \left(\sum_{i=0}^{(\log_2 p)-1} 2^i v\right) w_s (t_n + t_{cs}) + K(pc)^{\beta}(w_s + 4)\,t_n = (p - 1)\,v\,w_s (t_n + t_{cs}) + v\,p^{\beta}(w_s + 4)\,t_n$, where $w_s$ is the average size in bytes of English words and can be taken ....

....) $+ K(pc)^{\beta}(w_s + 4)\,t_n = (p - 1)\,v\,w_s (t_n + t_{cs}) + v\,p^{\beta}(w_s + 4)\,t_n$, where $w_s$ is the average size in bytes of English words and can be taken roughly as 5. Further, to illustrate, the values of $K$ and $\beta$ for the disk 1 of the TREC collection are roughly $K = 4.8$ and $\beta = 0.56$ [2]. The second factor in the expression for $t_2$ accounts for the broadcast of the global vocabulary to all the other processors. The collection size considered in this case is that of the global collection and is given by $p\,c$. Besides the terms of the vocabulary, an integer of size 4 is also ....

M. D. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Ricardo Baeza-Yates, editor, IV South American Workshop on String Processing - WSP97 - International Informatics Series, volume 8, pages 2-20, Valparaíso, Chile, November 1997. Carleton University Press.


Searching the Web: Challenges and Partial Solutions - Baeza-Yates

....states that the vocabulary of a text of $n$ words is of size $V = Kn^{\beta} = O(n^{\beta})$, where $K$ and $\beta$ depend on the particular text (see Figure 2). $K$ is normally between 10 and 100, and $\beta$ is between 0 and 1 (not included). Some recent experiments [8, 12] show that the most common values for $\beta$ are between 0.4 and 0.6. Hence, the vocabulary of a text grows sublinearly with the text size, in a proportion close to its square root. [Figure 1. Document size distribution.] [Figure 2. Size of the vocabulary.] A first inaccuracy appears immediately. Supposedly, the set of ....

....so that the sum of all frequencies is $n$ (see Figure 3). The value of $\theta$ depends on the text. In the most simple formulation, $\theta = 1$, and therefore $H_V(\theta) = O(\log n)$. However, this simplified version is very inexact, and the case $\theta > 1$ (more precisely, between 1.5 and 2.0) fits better the real data [8]. This case is very different, since the distribution is much more skewed, and $H_V(\theta) = O(1)$. [Figure 3. Sorted frequencies of the words.] The fact that the distribution of words is very skewed (that is, there are a few hundreds of words which take up 50% of the text) suggest a concept ....

M. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Proc. WSP'97, pages 2-20. Carleton University Press, 1997.
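
The sublinear vocabulary growth quoted in this context (Heaps' Law, V = K n^β) can be illustrated with a minimal sketch. The function name and the parameter values K = 10 and β = 0.5 are illustrative assumptions chosen inside the ranges quoted above; they are not measurements from any of the cited papers.

```python
def heaps_vocabulary_size(n_words: int, k: float, beta: float) -> float:
    """Estimate the vocabulary size V = K * n**beta predicted by Heaps' Law."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return k * n_words ** beta


# Assumed, illustrative parameters: K = 10, beta = 0.5 (square-root growth).
for n in (10**4, 10**6, 10**8):
    v = heaps_vocabulary_size(n, k=10.0, beta=0.5)
    # The ratio V/n shrinks as the text grows: the vocabulary is sublinear.
    print(f"n = {n:>11,d}   V ~ {v:>9,.0f}   V/n = {v / n:.6f}")
```

With β near 0.5, multiplying the text size by 100 multiplies the estimated vocabulary by only about 10, which is why several of the citing papers treat the vocabulary as a negligible overhead for large texts.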


Indexing Methods for Approximate Text Retrieval (Extended.. - Baeza-Yates, al.

.... [Figure 1: Taxonomy of indexes for approximate string searching (categories where we propose new algorithms are highlighted).] Current word oriented indexes solve the problem by using a classical on line algorithm on the set of words (i.e. the vocabulary), thus obtaining the set of words to retrieve [25, 7, 1]. The rest may proceed without using approximate matching. Since the vocabulary is sublinear in size with respect to the text, they can achieve good performance. These indexes are not capable of retrieving an occurrence that is not a sequence of words. However, this is in many cases exactly what ....

....index space or query time, as shown experimentally in the paper. They also show analytically that it is not possible to have an index of this type which is sublinear in size and search times simultaneously. 3.2 Full Inversion. Araújo, Navarro and Ziviani take the approach of full inversion [1]. For each word, the list of all its occurrences in the text are kept and the text is never accessed. The search on the vocabulary is as before (using [5]), but the second phase of the search changes completely: once the matching words in the vocabulary are identified, all their lists are merged. ....

M. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. Technical report, Dept. of CS, Univ. Federal de Minas Gerais, Brazil, 1996.
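
The two-phase search described in this context (scan the vocabulary for approximately matching words, then merge their occurrence lists without touching the text) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the cited paper: the function names are hypothetical, positions are word offsets rather than byte offsets, and the vocabulary scan uses a plain dynamic-programming edit distance instead of the on-line algorithm the authors refer to.

```python
from collections import defaultdict
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def build_full_inverted_index(text: str) -> Dict[str, List[int]]:
    """Map every distinct word to the sorted list of its word positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos, word in enumerate(text.lower().split()):
        index[word].append(pos)
    return dict(index)


def approximate_search(index: Dict[str, List[int]], pattern: str, k: int) -> List[int]:
    """Phase 1: scan the vocabulary; phase 2: merge the occurrence lists."""
    matching_words = [w for w in index if edit_distance(w, pattern) <= k]
    return sorted(pos for w in matching_words for pos in index[w])


index = build_full_inverted_index("the flower and the flowers near a tower")
print(approximate_search(index, "flower", 1))  # positions of "flower" and "flowers"
```

Only the vocabulary is scanned with the expensive approximate comparison; the answer is then assembled from the stored lists, which is the property the excerpt attributes to full inversion.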


Approximate Text Searching - Badino (1998) (8 citations)

....using an isolated Sun SparcStation 4 with 128 megabytes of RAM running Solaris 2.5.1. The text used was part of a newer TREC collection (TREC-3), where the ZIFF collection has near 700 Mb. The experiments use this collection. More details and experiments are found in the original paper [ANZ97, ANZ98], where the Heaps' and Zipf's Laws are experimentally validated, and other parameters such as index construction time and space are studied. In this section we only present the results which are relevant to our analysis. First, we experimentally validate the fact that the length of the shortest ....

....for each pattern, but from the intersections. For instance, for k = 2 we have a higher cost for j = 2, which comes from intersecting the first pattern with one error and the second pattern with one error (which we call the combination [1,1]). [Footnote 2: The results come from a joint work with Nivio Ziviani and Marcio Drumond Araújo, which is not yet published [ANZ98]. The implementation of the index is part of Drumond's Master's Thesis and not of ours.] [Figure 8.2: Length of the shortest among j lists, for j = 2 to 5.] The other alternatives (where one word matches exactly and the others with two ....

M. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. Journal version of [ANZ97], in preparation, 1998.


Approximate Text Searching - Badino (1998) (8 citations)

....the text into blocks and pointing to the blocks instead of the exact positions ("block addressing indices"). We obtained new indexing and searching techniques, and new analytical results on several of them. This work has been published in [BYNST97, BYN97a, ANZ97, BYN97c, BYN98b, NBY98c, BYN98a] and there are others submitted for publication. Our main results in this regard are the following. • We first consider full inverted indices, in Chapter 8. In this case we prove that for most reasonable queries ....

....which reduce space requirements by dividing the text in blocks and point to the blocks instead of the exact positions ("block addressing indices"). We have obtained new indexing and searching techniques and novel analytical results on some of them. This work has been published in [BYNST97, BYN97a, ANZ97, BYN97c, BYN98b, NBY98c, BYN98a] and there are others submitted. Our main achievements in this regard follow. • We consider first full inverted indices in Chapter 8. In this case, we prove that for most reasonable queries (i.e. those with reasonably high precision) the search time on those ....

[Article contains additional citation context not shown here]

M. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Proc. WSP'97, pages 2-20. Carleton University Press, 1997.


An Efficient Compression Code for Text Databases - Brisaboa, Iglesias, Navarro, .. Self-citation (Navarro)

....typical of natural language texts. It is well known [3] that, in natural language texts, the vocabulary distribution closely follows a generalized Zipf's law [14], that is, $p_i = A/i^{\theta}$ and $N = \infty$, for suitable constants $A$ and $\theta$. In practice $\theta$ is between 1.4 and 1.8 and depends on the text [1, 2], while $A = 1/\sum_{i \ge 1} 1/i^{\theta} = 1/\zeta(\theta)$ makes sure that the distribution adds up to 1. Under this distribution the entropy is [garbled equation]. On the other hand, we have $D_b =$ [garbled equation]. At this ....

.... [Fig. 3. Analytical bounds on the average code length for byte oriented Plain Huffman, Tagged Huffman, and our new method. We assume a Zipf distribution with parameter $\theta$ (which is the x axis).] these lines and from previous results on other large collections [1], which shows that this kind of analysis applies well to large collections only. 6 Conclusions. We have presented a new compression code useful for text databases. The code inherits from previous work, where byte oriented word based Huffman codes were shown to be an excellent choice. To permit ....

M. D. Araujo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In R. Baeza-Yates, editor, Proc. 4th South American Workshop on String Processing (WSP'97), pages 2-20. Carleton University Press, 1997.
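
The generalized Zipf's law quoted in this context, p_i = A / i^θ with A chosen so that the probabilities add up to 1, can be explored numerically. The sketch below is an illustration under an explicit assumption: it truncates the vocabulary to a finite number of words, whereas the excerpt takes an infinite vocabulary. The function names and the chosen vocabulary size are hypothetical.

```python
import math
from typing import List


def zipf_distribution(theta: float, n_terms: int) -> List[float]:
    """Return p_i = A / i**theta for i = 1..n_terms, normalized to sum to 1."""
    weights = [1.0 / i ** theta for i in range(1, n_terms + 1)]
    a = 1.0 / sum(weights)  # normalizing constant A for the truncated vocabulary
    return [a * w for w in weights]


def entropy_bits(probs: List[float]) -> float:
    """Shannon entropy in bits: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


# Assumed vocabulary size of 100,000 words; theta values span the range
# reported in the excerpts (about 1.4 to 2.0).
for theta in (1.4, 1.8, 2.0):
    probs = zipf_distribution(theta, 100_000)
    print(f"theta = {theta}: p_1 = {probs[0]:.3f}, entropy = {entropy_bits(probs):.2f} bits")
```

A larger θ concentrates more probability on the most frequent words, so the entropy drops, which is consistent with the excerpts' remark that the more skewed case behaves very differently from θ = 1.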


A Metric Index for Approximate String Matching - Chávez, Navarro (2002) (1 citation) Self-citation (Navarro)

....n=m) Indexing text for approximate string matching has received attention only recently. Despite some progress in the last decade, the indexing schemes for this problem are still rather immature. There exist some indexing schemes specialized to word wise searching on natural language text [24, 4, 3]. These indexes perform quite well in that case but they cannot be extended to handle the general case. Extremely important applications such as DNA, proteins or oriental languages fall outside this case. The indexes that solve the general problem can be divided in three classes. A first one [19, ....

M. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Proc. WSP'97, pages 2-20. Carleton University Press, 1997.


Adding Compression to Block Addressing Inverted Indexes - Navarro, de Moura.. (2000) (9 citations) Self-citation (Navarro Ziviani)

....It uses the simplest scheme, where the index points to all the positions of all the words in the text. However, its construction time and the space requirement are higher. The occurrences take nearly 60% of the text size. This can be reduced to 35% by omitting the stopwords from the vocabulary (Araújo et al. 1997). Stopwords are articles, prepositions, and other words that carry no meaning and therefore do not appear in or that can be removed from user queries. Stopwords represent 40% to 50% of all the text words. However, 35% of extra space can still be a high space requirement for a large text ....

....Huffman compression method [footnote 1]. An important consideration is the size of the text vocabulary. An empirical law widely accepted in IR is Heaps' Law (Heaps, 1978), which states that the vocabulary of a text of $n$ words is of size $V = O(n^{\beta})$, where $0 < \beta < 1$ depends on the text. As shown in (Araújo et al. 1997), $\beta$ is between 0.4 and 0.6 in practice, so the vocabulary needs space proportional to the square root of the text size. Hence, for large texts the overhead of storing the vocabulary is minimal. Another useful law related to the vocabulary is the Zipf's Law (Zipf, 1949), which states that the ....

[Article contains additional citation context not shown here]

Araújo, M. D., G. Navarro, and N. Ziviani: 1997, `Large text searching allowing errors'. In: R. Baeza-Yates (ed.): Proc. of the 4th South American Workshop on String Processing, Vol. 8. pp. 2-20.


Indexing Text with Approximate q-grams - Navarro, Sutinen, Tanninen, Tarhio (2000) (8 citations) Self-citation (Navarro)

....problems in this area [27, 3]. Despite some progress in the last years, the indexing schemes for this problem are still rather immature. There are two types of indexing mechanisms for approximate string matching, which we call word retrieving and sequence retrieving. Word retrieving indexes [18, 5, 2] are more oriented to natural language text and information retrieval. They can retrieve every word whose edit distance to the pattern word is at most k. Hence, they are not able to recover from an error involving a separator, such as recovering the word flowers from the misspelled text flo ....

M. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Proc. WSP'97, pages 2-20. Carleton University Press, 1997.


Fast and Flexible Word Searching on Compressed Text - de Moura, Navarro.. (2000) (4 citations) Self-citation (Navarro Ziviani)

....compression ratio. However, this is not the case on large texts. Heaps' Law [Heaps 1978], an empirical law widely accepted in information retrieval, establishes that a natural language text of $O(u)$ words has a vocabulary of size $v = O(u^{\beta})$ for $0 < \beta < 1$. Typically, $\beta$ is between 0.4 and 0.6 [Araújo et al. 1997; Moura et al. 1997] and therefore $v$ is close to $O(\sqrt{u})$. Hence, for large texts the overhead of storing the vocabulary is minimal. On the other hand, storing the vocabulary represents an important overhead when the text is small. This is why we chose to compress the vocabulary (that is, the ....

....$\theta = 1$. In this case, $H = O(\log v)$. Although this simplified form is popular because it is simpler to handle mathematically, it does not follow well the real distribution of natural language texts. There is strong evidence that most real texts have in fact a more biased vocabulary. We performed in [Araújo et al. 1997] a thorough set of experiments on the TREC collection, finding out that the $\theta$ values are roughly between 1.5 and 2.0 depending on the text, which gives experimental evidence in favor of the generalized Zipf's Law (i.e. $\theta > 1$). Under this assumption, [footnote 1] The reason why both Ziv-Lempel compressors ....

[Article contains additional citation context not shown here]

Araújo, M. D., Navarro, G., and Ziviani, N. 1997. Large text searching allowing errors. In R. Baeza-Yates Ed., Proc. of the Fourth South American Workshop on String Processing, Volume 8 (1997), pp. 2-20. Carleton University Press International Informatics Series.


Linear Time Sorting of Skewed Distributions - de Moura, Navarro Self-citation (Navarro Ziviani)

....The combination of the algorithm presented in [MK95] with our new sorting algorithm results in a fast linear time method to construct word-based Huffman codes. Experiments with natural language texts show that the value of the constant $\theta$ for natural language texts is between 1.5 and 2.0 [ANZ97]. Further, the least frequent word of a text ($x_n$) has a small number of occurrences that is close to 1 (in almost all natural language texts there are many words with frequency 1 [BYN97]). Therefore, the extra space used by the remainingsort algorithm in this application is K = O( log n) 2 ....

M. D. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In R. Baeza-Yates, editor, Proc. of the Fourth South American Workshop on String Processing, volume 8, pages 2-20. Carleton University Press International Informatics Series, 1997.


Adding Compression to Block Addressing Inverted Indices - Navarro, de Moura.. (2000) (9 citations) Self-citation (Navarro Ziviani)

....simplest scheme, where the index points to all the positions of all the words in the text. However, the construction times and the space requirements are higher in this case. The occurrences take nearly 60% of the text size. This can be reduced to 35% by omitting the stopwords from the vocabulary [1]. Stopwords are articles, prepositions, and other words that carry no meaning and therefore do not appear in or that can be removed from user queries. Stopwords represent 40% to 50% of all the text words. However, 35% of extra space can still be a high space requirement for a large text ....

....compression method. An important consideration is the size of the text vocabulary. An empirical law widely accepted in Information Retrieval is the Heaps' Law [13], which states that the vocabulary of a text of $n$ words is of size $V = O(n^{\beta})$, where $0 < \beta < 1$ depends on the text. As shown in [1], $\beta$ is between 0.4 and 0.6 in practice, so the vocabulary needs in practice space proportional to the square root of the text size. Hence, for large texts the overhead of storing the vocabulary is minimal. Another useful law related to the vocabulary is the Zipf's Law [24], which states that the ....

[Article contains additional citation context not shown here]

M. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Proc. WSP'97, pages 2-20. Carleton University Press, 1997.


Fast Approximate String Matching in a Dictionary - Baeza-Yates, Navarro (1998) Self-citation (Navarro)

....of positions where the word appears in the text). Approximate string matching is solved by first running a classical on line algorithm on the vocabulary (as if it was a text), thus obtaining the set of words to retrieve. The rest depends on the particular index. Full inverted indices such as Igrep [1] simply make the union of the lists of occurrences of all matching words to obtain the final answer. Block oriented indices such as Glimpse and variations on it [19, 5] (which reduce space requirements by making the occurrences point to blocks of text instead of exact positions) must traverse the ....

....is very small compared to the text. For instance, in the 2 Gb TREC collection [14] the vocabulary takes no more than 2 Mb. An empirical law known as Heaps' Law [15] states that the vocabulary for a text of $n$ words grows as $O(n^{\beta})$, where $0 < \beta < 1$. In practice, $\beta$ is between 0.4 and 0.6 [1]. The fastest on line approximate search algorithms run at 1-4 megabytes per second (depending on some parameters of the problem) and therefore they find the answer in the vocabulary in a few seconds. While this is acceptable for single user environments, the search time may be excessive in a ....

M. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Proc. 4th South American Workshop on String Processing, WSP'97, 1997. Valparaíso, Chile. To appear.


Block Addressing Indices for Approximate Text Retrieval - Baeza-Yates, Navarro (1997) (2 citations) Self-citation (Navarro)

....positions in the text are retrieved. Since the vocabulary is very small compared to the text, they achieve acceptable performance. These indices can only retrieve whole words or phrases. However, this is in many cases exactly what is wanted. Examples of these indices are Glimpse [5] and Igrep [1]. Glimpse uses block addressing (i.e. pointing to blocks of text instead of words) to reduce the size of the index, at the expense of more sequential processing at query time. This work is focused on block addressing for word retrieving indices. Not only there exist few indexing schemes, but also ....

....of the index with respect to the text are: 2%-4% for blocks, 10%-15% for files, 25%-30% for words. Note that the last percentage is similar to the overheads of classical inverted lists. To overcome the need of sequentially searching parts of the text, the approach of full inversion is taken in [1]. For each word, the list of all its occurrences in the text are kept. This changes completely the second phase of the search: once the matching words in the vocabulary are identified, all their occurrence lists are merged and the text is never accessed. This makes the approach much more ....

M. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. Technical report, Dept. of CS, Univ. Federal de Minas Gerais, Brazil, 1996.
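
The trade-off described in these contexts, where block addressing stores block numbers instead of exact positions and pays for it with sequential scanning at query time, can be sketched as below. This is a simplified illustration with hypothetical function names, exact word matching only, and blocks measured in words; it is not the Glimpse or Igrep implementation.

```python
from typing import Dict, List, Set


def build_block_index(words: List[str], block_size: int) -> Dict[str, Set[int]]:
    """Map each word to the set of block numbers in which it occurs."""
    index: Dict[str, Set[int]] = {}
    for pos, word in enumerate(words):
        index.setdefault(word, set()).add(pos // block_size)
    return index


def search_block_index(words: List[str], index: Dict[str, Set[int]],
                       block_size: int, target: str) -> List[int]:
    """Find exact positions by sequentially scanning only the candidate blocks."""
    hits: List[int] = []
    for block in sorted(index.get(target, ())):
        start = block * block_size
        end = min(start + block_size, len(words))
        for pos in range(start, end):  # the sequential scan that full inversion avoids
            if words[pos] == target:
                hits.append(pos)
    return hits


words = "a b a c d a b".split()
index = build_block_index(words, block_size=3)
print(search_block_index(words, index, 3, "a"))
```

The index holds at most one entry per word and block, so it is smaller than a list of every occurrence, but the text itself must be available and scanned for each candidate block.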


Compressed Pattern Matching Approximate Compressed .. - de Moura.. Self-citation (Navarro Ziviani)

....Results We analyze the performance of our searching algorithm. The analysis considers a random text, which is very appropriate because the compressed text is mainly random. [footnote 6] For the analysis we consider that: the vocabulary has $v = O(u^{\beta}) \approx O(\sqrt{u})$ words (typically $\beta = 0.4 \ldots 0.6$) [ANZ97, MNZ97], the compressed search patterns are of length $c$ (typically $c = 3 \ldots 4$), the original text has $u$ characters, the compressed text has $n$ characters, $k$ is the number of errors allowed, the pattern has $m$ characters and $j$ different words of length $w_1, \ldots, w_j$ ($\sum_{i=1}^{j} w_i = m$). We first consider ....

M. D. Araújo, G. Navarro and N. Ziviani. Large text searching allowing errors. In R. Baeza-Yates, editor, Proc. of the Fourth South American Workshop on String Processing, Carleton University Press International Informatics Series, v. 8, pages 2-20, 1997.


Adding Compression to Block Addressing Inverted Indexes - Navarro, de Moura.. (2000) (9 citations) Self-citation (Navarro Ziviani)

....It uses the simplest scheme, where the index points to all the positions of all the words in the text. However, its construction time and the space requirement are higher. The occurrences take nearly 60% of the text size. This can be reduced to 35% by omitting the stopwords from the vocabulary (Araújo et al. 1997). Stopwords are articles, prepositions, and other words that carry no meaning and therefore do not appear in or that can be removed from user queries. Stopwords represent 40% to 50% of all the text words. However, 35% of extra space can still be a high space requirement for a large text ....

....Huffman compression method [footnote 1]. An important consideration is the size of the text vocabulary. An empirical law widely accepted in IR is Heaps' Law (Heaps, 1978), which states that the vocabulary of a text of $n$ words is of size $V = O(n^{\beta})$, where $0 < \beta < 1$ depends on the text. As shown in (Araújo et al. 1997), $\beta$ is between 0.4 and 0.6 in practice, so the vocabulary needs space proportional to the square root of the text size. Hence, for large texts the overhead of storing the vocabulary is minimal. Another useful law related to the vocabulary is the Zipf's Law (Zipf, 1949), which states that the ....

[Article contains additional citation context not shown here]

Araujo, M. D., G. Navarro, and N. Ziviani: 1997, `Large text searching allowing errors'. In: R. Baeza-Yates (ed.): Proc. of the 4th South American Workshop on String Processing, Vol. 8. pp. 2-20.


A New Indexing Method for Approximate String Matching - Navarro, Baeza-Yates (1999) (4 citations) Self-citation (Navarro)

....for this problem are still rather immature. There are two types of indexing mechanisms for approximate string matching, which we call word retrieving and sequence retrieving. [Footnote: This work has been supported in part by Fondecyt grant 1-990627 and Fondef grant 96-1064.] Word retrieving indices [22, 6, 2] are more oriented to natural language text and information retrieval. They can retrieve every word whose edit distance to the pattern is at most k. Hence, they are not able to recover from an error involving a separator, such as recovering the word flowers from the misspelled text flo wers ....

M. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Proc. WSP'97, pages 2-20. Carleton University Press, 1997.
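
The limitation stated in this context, that a word retrieving index cannot recover "flowers" from the misspelled text "flo wers" with one error, follows directly from comparing the pattern against individual words. The check below is a small sketch with a hypothetical helper name, using a standard dynamic-programming edit distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


pattern, k = "flowers", 1
text = "flo wers"

# Sequence view: the whole string is one edit (the inserted space) away.
print("sequence distance:", edit_distance(text, pattern))

# Word view: each vocabulary word is compared on its own and none is within k.
for word in text.split():
    d = edit_distance(word, pattern)
    print(f"word {word!r}: distance {d}, retrieved = {d <= k}")
```

Seen as a character sequence the text is a single edit away from the pattern, but neither fragment is within one edit of it, so an index that only compares vocabulary words misses the occurrence.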


Block Addressing Indices for Approximate Text Retrieval - Baeza-Yates, Navarro (1997) (2 citations) Self-citation (Navarro)

....index. Typical figures for the index size with respect to the text are: 2%-4% for blocks (tiny), 10%-15% for files (small), 25%-30% for words (medium). The last percentage is similar to the overheads of classical full inverted files. As an example of full inversion we mention a recent one, Igrep [2], which inherits from Glimpse the ability of (and the technique used for) searching extended patterns and allowing errors. Since every text word is referenced in the list of occurrences, the index poses a fixed overhead over the text size, close to 30%-35% in this case. Since the sequential ....

....extended patterns and to search a complete phrase allowing a given number of errors across the whole phrase. The index is built in a single pass over the text using an in place construction. A detailed analysis of the search times for different types of simple and extended patterns is presented in [2]. The analysis shows that the retrieval costs are sublinear for useful searches (i.e. those with reasonable precision). We cannot finish this section without mentioning compression. Compressing the text and/or the inverted file is an orthogonal technique to reduce the overall space usage. This ....

[Article contains additional citation context not shown here]

M. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Proc. WSP'97, pages 2-20. Carleton University Press, 1997.


Block Addressing Indices for Approximate Text Retrieval - Baeza-Yates, Navarro (1997) (2 citations) Self-citation (Navarro)

....positions in the text are retrieved. Since the vocabulary is very small compared to the text, they achieve acceptable performance. These indices can only retrieve whole words or phrases. However, this is in many cases exactly what is wanted. Examples of these indices are Glimpse [5] and Igrep [1]. Glimpse uses block addressing (i.e. pointing to blocks of text instead of words) to reduce the size of the index, at the expense of more sequential processing at query time. This work is focused on block addressing for word retrieving indices. Not only there exist few indexing schemes, but ....

....of the index with respect to the text are: 2%-4% for blocks, 10%-15% for files, 25%-30% for words. Note that the last percentage is similar to the overheads of classical inverted lists. To overcome the need of sequentially searching parts of the text, the approach of full inversion is taken in Igrep [1]. For each word, the list of all its occurrences in the text are kept. This changes completely the second phase of the search: once the matching words in the vocabulary are identified, all their occurrence lists are merged and the text is never accessed. This makes the approach much more resistant ....

M. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Proc. 4th South American Workshop on String Processing, WSP'97, 1997. Valparaíso, Chile. To appear.


A Practical Index for Text Retrieval Allowing Errors - Baeza-Yates, Navarro (1997) (4 citations) Self-citation (Navarro)

....occurrence that is not a complete word. For instance, if an OCR system has erroneously inserted a space in the middle of a word in the text, it will be not possible to search that word with one error and retrieve it using a word retrieving index. Examples of these indices are Glimpse [18], Igrep [1] and [5]. In the indices of the second kind, the words are disregarded. They also apply if words do not exist in the text, such as in DNA or protein databases. One class of indices for this case is based on building the suffix tree of the text and traversing it instead of the text, to avoid its ....

M. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Proc. 4th South American Workshop on String Processing, WSP'97, 1997. Valparaíso, Chile. To appear.


A Practical q-Gram Index for Text Retrieval Allowing Errors - Navarro, Baeza-Yates Self-citation (Navarro)

....of all different words of the text (the vocabulary) and use an on line algorithm on the vocabulary, thus obtaining the set of words to retrieve. From that point on, the problem does not need to involve approximate matching anymore. Since the vocabulary is sublinear in size with respect to the text [14, 1], they achieve acceptable performance. These indices are not capable, however, of retrieving an occurrence that is not a complete word. For instance, if an OCR system has erroneously inserted a space in the middle of a word in the text, or removed the space between two words, these indices will ....

....is not a complete word. For instance, if an OCR system has erroneously inserted a space in the middle of a word in the text, or removed the space between two words, these indices will not be able to retrieve those words if just one error is allowed. Examples of such indices are Glimpse [21], Igrep [1] and [5]. In the indices of the second kind, the words are disregarded. This makes them suitable not only for natural language text but also in scenarios where there exist no words, such as in DNA or protein databases. This is also useful for text retrieval on some agglutinating languages (e.g. ....

M. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Proc. 4th South American Workshop on String Processing (WSP'97), pages 2-20. Carleton University Press, 1997.


Indexing Compressed Text - de Moura, Navarro, Ziviani (1997) (8 citations) Self-citation (Navarro Ziviani)

....version, and still less than the original text with no index. Moreover, indexing and querying are nearly twice as fast than in the uncompressed version of the index. This scheme can be readily adapted to meet other requirements. For example, it is not difficult to mix it with the approach of [ANZ97] to allow searching for regular expressions, approximate patterns, etc. This is because that approach is mainly based on processing the vocabulary, which is stored in our index. We are currently working on this, as well as on a more complete system capable of compressing and indexing whole ....

M. D. Araújo, G. Navarro and N. Ziviani. Large text searching allowing errors. In R. Baeza-Yates, editor, Proc. of WSP'97.


Fast Searching on Compressed Text Allowing Errors - de Moura, Navarro, al. (1998) Self-citation (Navarro Ziviani)

....$\theta = 1$. In this case, $H = O(\log v)$. Although this simplified form is popular because it is simpler to handle mathematically, it does not follow well the real distribution of natural language texts. There is strong evidence that most real texts have in fact a more biased vocabulary. We performed in [ANZ97] a thorough set of experiments on the TREC collection, finding out that the $\theta$ values are roughly between 1.5 and 2.0 depending on the text, which gives experimental evidence in favor of the generalized Zipf Law (i.e. $\theta > 1$). Under this assumption, $H = O(1)$. We have also tested the distribution ....

....query supported by our system. For complex patterns the preprocessing phase corresponds to a sequential search in the vocabulary to mark all the words that match the pattern. This technique has been already used in block oriented indexing schemes for searching allowing errors in uncompressed texts [MW93, ANZ97]. Since the vocabulary is very small compared to the text size, the sequential search time on the vocabulary is negligible, and there is no other additional cost to allow complex queries. This is very difficult to achieve with online plain text searching, since we take advantage of the knowledge ....

[Article contains additional citation context not shown here]

M. D. Araújo, G. Navarro and N. Ziviani. Large text searching allowing errors. In R. Baeza-Yates, editor, Proc. of the Fourth South American Workshop on String Processing, Carleton University Press International Informatics Series, v. 8, pages 2-20, 1997.


Fast Approximate String Matching in a Dictionary - Baeza-Yates, Navarro (1998) Self-citation (Navarro)

....of positions where the word appears in the text). Approximate string matching is solved by first running a classical online algorithm on the vocabulary (as if it was a text), thus obtaining the set of words to retrieve. The rest depends on the particular index. Full inverted indices such as Igrep [1] simply make the union of the lists of occurrences of all matching words to obtain the final answer. Block-oriented indices such as Glimpse and variations on it [14, 5] reduce space requirements by making the occurrences point to blocks of text instead of exact positions, and must traverse the ....

....is very small compared to the text. For instance, in the 1 Gb TREC collection [11] the vocabulary takes no more than 5 Mb. An empirical law known as Heaps' Law [12] states that the vocabulary for a text of $n$ words grows as $O(n^{\beta})$, where $0 < \beta < 1$. In practice, $\beta$ is between 0.4 and 0.6 [1]. An online algorithm can search such vocabulary in a few seconds. [Figure 1. Approximate searching on an inverted index. The online search on the text may or may not be necessary.] While improving this may not be ....

M. Araújo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Proc. WSP'97, pages 2-20. Carleton University Press, 1997.


The Maximum-Margin Approach to Learning Text Classifiers -.. - Joachims (2000) (17 citations)

No context found.

Araujo, M. D., Navarro, G., and Ziviani, N. (1997). Large text searching allowing errors. In Baeza-Yates, R., editor, Proceedings of the 4th South American Workshop on String Processing, pages 2-20, Valparaiso, Chile. Carleton University Press.

CiteSeer.IST - Copyright Penn State and NEC