G. Navarro. Approximate Text Searching. PhD thesis, Dept. of Computer Science, Univ. of Chile, December 1998. Technical Report TR/DCC-98-14. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/thesis98.ps.gz.
....is the Hamming distance, which does not allow insertions and deletions, i.e. it is the number of nonmatching characters for strings of the same length. The indexed version of the problem allows preprocessing the text to build an index while the online version does not. Good surveys are given in [10, 11]. Filtering is a way to speed up approximate string matching, particularly in the indexed case but also in the online case. A filter is an algorithm that quickly discards large parts of the text based on some filter criterion, leaving the remaining part to be checked with a proper (online) ....
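The Hamming distance described in this context can be illustrated with a short Python sketch. The function names `hamming_distance` and `k_mismatch_positions` are hypothetical and chosen only for illustration; the second function is a naive online scan, not the filtering approach of the citing paper.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of the same length")
    return sum(x != y for x, y in zip(a, b))


def k_mismatch_positions(text: str, pattern: str, k: int) -> list[int]:
    """Start positions where the pattern matches the text with at most k mismatches.

    Naive O(nm) scan: every window is checked, nothing is filtered out.
    """
    m = len(pattern)
    return [
        start
        for start in range(len(text) - m + 1)
        if hamming_distance(text[start:start + m], pattern) <= k
    ]
```

A filter, as defined in the snippet, would aim to skip most of the windows that this naive scan checks one by one.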
G. Navarro. Approximate Text Searching. PhD thesis, Dept. of Computer Science, University of Chile, 1998.
....variant is the Hamming distance, which does not allow insertions and deletions, i.e. it is the number of nonmatching characters for strings of the same length. The indexed version of the problem allows preprocessing the text to build an index while the online version does not. Surveys are given in [15, 16, 18]. Filtering is a way to speed up approximate string matching, particularly in the indexed case but also in the online case. A filter is an algorithm that quickly discards large parts of the text based on some filter criterion, leaving the remaining part to be checked with a proper (online) ....
G. Navarro. Approximate Text Searching. PhD thesis, Dept. of Computer Science, University of Chile, 1998.
....is the Hamming distance, which does not allow insertions and deletions, i.e. it is the number of nonmatching characters for strings of the same length. The indexed version of the problem allows preprocessing the text to build an index while the online version does not. Good surveys are given in [11, 12]. Filtering is a way to speed up approximate string matching, particularly in the indexed case but also in the online case. A filter is an algorithm that quickly discards large parts of the text based on some filter criterion, leaving the remaining part to be checked with a proper (online) ....
G. Navarro. Approximate Text Searching. PhD thesis, Dept. of Computer Science, University of Chile, 1998.
....i.e. the number of mismatching characters. The fastest algorithm in practice for the k differences problem is the bit-parallel dynamic programming algorithm of Myers [9]. It works in time O(nm/w), where w is the word size of the machine. An extensive survey and comparison of algorithms is given in [10]. The k mismatches problem is a simpler problem, but we do not know of any asymptotically faster algorithms for it. As the .... (Footnotes of the citing paper: Supported by the DFG Initiative Bioinformatik grant BIZ 4 1 1. Partially supported by the IST Programme of the EU under contract number IST1999 14186 (ALCOM FT).)
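To make the bit-parallel idea concrete, here is a minimal Python sketch of a Myers-style search for the k differences problem. It follows the commonly taught formulation with vertical and horizontal delta vectors rather than the exact presentation of [9]; the function name `myers_search` is hypothetical. Python integers are unbounded, so the sketch does not show the O(nm/w) word-blocking explicitly; a mask of m bits stands in for the machine word.

```python
def myers_search(pattern: str, text: str, k: int) -> list[int]:
    """End positions (0-based) in text where pattern matches with at most k differences."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1          # keeps all vectors to m bits
    high = 1 << (m - 1)          # bit of the last pattern row

    # peq[c] has bit i set when pattern[i] == c.
    peq: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    pv, mv = mask, 0             # vertical +1 / -1 deltas of the current column
    score = m                    # value of the last row in the current column
    ends = []
    for j, ch in enumerate(text):
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 deltas
        mh = pv & xh                    # horizontal -1 deltas
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shift in a 0 at row 0: a match may start at any text position.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= k:
            ends.append(j)
    return ends
```

For example, searching "survey" in "surgery" with k = 2 reports the end positions 4, 5 and 6, which agrees with the classical dynamic programming matrix for that pair.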
....the text with some filtering method (the filtering phase) and search only those areas using a proper approximate string matching algorithm (the verification phase). A good filtering method is fast and efficient, i.e. leaves only a small area to be verified. A good survey of filtering methods is given in [10]. Among the most popular and studied filtering methods is the q-gram method. A q-gram is a substring of length q. The basic q-gram method works as follows. First, find all matching q-grams between the pattern and the text. That is, find all pairs (i, j) such that the q-gram at position i in the ....
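The first step of the basic q-gram method, finding all pairs (i, j) of matching q-grams, can be sketched in Python as follows. This covers only that first step; the snippet is cut off before the later steps, so they are not reproduced here. The function name `matching_qgrams` is hypothetical.

```python
from collections import defaultdict


def matching_qgrams(pattern: str, text: str, q: int) -> list[tuple[int, int]]:
    """All pairs (i, j) such that pattern[i:i+q] == text[j:j+q]."""
    if q < 1:
        raise ValueError("q must be at least 1")

    # Index every q-gram of the pattern by its starting position.
    index: dict[str, list[int]] = defaultdict(list)
    for i in range(len(pattern) - q + 1):
        index[pattern[i:i + q]].append(i)

    # Slide over the text and look each q-gram up in the pattern index.
    pairs = []
    for j in range(len(text) - q + 1):
        for i in index.get(text[j:j + q], ()):
            pairs.append((i, j))
    return pairs
```

With pattern "abab", text "xabay" and q = 2, the sketch returns (0, 1), (2, 1) and (1, 2): the text q-gram "ab" at position 1 matches two pattern positions, and "ba" at position 2 matches one.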
G. Navarro. Approximate Text Searching. PhD thesis, Dept. of Computer Science, University of Chile, 1998.
....string matching. The problem of searching a database of melodic pitch contours can be considered as a case of simple text retrieval [Downie99]. The field of text retrieval, and the more relevant approximate text retrieval, is well researched. The field is comprehensively explained by Navarro [Navarro98]; only a brief overview and details of information relevant to this work are presented here. Developments in the area of text retrieval allowing errors were first used in applications such as biology and comparing DNA sequences. Researchers wanted to know how similar two sequences were, and what ....
....sliding window method is suitable for queries which are not much greater than the atomic length, it is wholly inappropriate for very long queries, such as an entire MIDI file. A more intelligent way of splitting the query is to use a principle identified by Wu and Manber and discussed by Navarro [Navarro98]. The idea is based on the fact that for a search allowing one error in the query, the error must be either in the first half of the query or the second. An exact search can be performed for both halves of the query and the results combined. As at least one half of the query is correct, the ....
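The splitting principle for one error can be sketched in Python as below. This is an illustrative reading of the idea, not the citing work's implementation: each half is searched exactly with `str.find`, and every hit is then verified with a simple dynamic programming check on the surrounding window. All function names are hypothetical, and the verification step is an assumption, since the snippet is cut off before describing how results are combined.

```python
def find_all(text: str, piece: str) -> list[int]:
    """Start positions of all (possibly overlapping) exact occurrences of piece."""
    positions = []
    start = text.find(piece)
    while start != -1:
        positions.append(start)
        start = text.find(piece, start + 1)
    return positions


def best_match_cost(pattern: str, window: str) -> int:
    """Smallest edit distance between pattern and any substring of window."""
    m = len(pattern)
    col = list(range(m + 1))      # column for the empty text prefix
    best = col[m]
    for ch in window:
        prev_diag = col[0]        # row 0 stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(prev_diag + (pattern[i - 1] != ch), old + 1, col[i - 1] + 1)
            prev_diag = old
        best = min(best, col[m])
    return best


def search_one_error(pattern: str, text: str) -> list[tuple[int, int]]:
    """Text windows (start, end) verified to contain pattern with at most one error."""
    m = len(pattern)
    if m < 2:
        raise ValueError("pattern must have at least two characters")
    half = m // 2
    hits = set()
    # With one error, at least one of the two halves must occur unchanged.
    for offset, piece in ((0, pattern[:half]), (half, pattern[half:])):
        for pos in find_all(text, piece):
            lo = max(0, pos - offset - 1)
            hi = min(len(text), pos - offset + m + 1)
            if best_match_cost(pattern, text[lo:hi]) <= 1:
                hits.add((lo, hi))
    return sorted(hits)
```

For instance, the query "melody" is found in "a melxdy here" because the first half "mel" occurs exactly and the verified window has edit distance 1 to the query.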
Navarro, G., Approximate Text Searching, Ph.D. thesis, Dept. of Computer Science, University of Chile, December 1998.
....is another important area for research and business development. One of the promising new approaches to indexing is the use of metadata, i.e. summaries of Web page content or sites which are placed in the page for the purpose of aiding automatic indexers. 40 Preliminary studies documented in [Navarro 1998] indicate that on the average site 1 in 200 common words and 1 in 3 foreign surnames are misspelled. 41 FAQs, or frequently asked questions, are essays on topics on a wide range of interests, with pointers and references. For an extensive list of FAQs, see: ....
Navarro, G., Approximate Text Searching, Ph.D. Thesis, Dept. Computer Science, Univ. of Chile (1998).
....to multipattern search, as explained next. A special requirement of our application is the need for multipattern search. That is, we are given r patterns P_1 ... P_r and we have to report all their occurrences. Very little work has been done on multipattern search for the k differences problem [14, 4, 16, 5, 17]. In Sections 4 and 5 we adapt two of those approaches to the k insertions problem. The first one obtains a speedup of a (1 )t (where k m) over the basic bit-parallel algorithm of Section 3. This speedup is larger than 1 for e 1. The second one obtains a speedup of w log. m k) but ....
....bits of the simulation do not fit in the computer word we set up as many computer words as needed. Since each one is updated in O(1) time per text character, the total complexity is O(nm log(k)/w). For short patterns (i.e. m log k = O(w)) this is O(n).
4 A Multipattern Filter
As already noted in [4, 5, 17], the ability of bit-parallel algorithms to allow classes of characters can be used to build multipattern filters. Imagine that the pattern is not a sequence of letters but a sequence of classes of letters. A letter a is said to match P at position i if a ∈ P_i, i.e. if it belongs to the ....
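The idea of a pattern made of classes of letters can be illustrated with a simplified Python sketch. It uses exact Shift-And matching rather than the k insertions algorithm of the citing paper, and it assumes all patterns have the same length; both are simplifications made for illustration. Position i of the superimposed pattern accepts the i-th letter of any pattern, and every filter hit is verified against each pattern. The function name `superimposed_search` is hypothetical.

```python
def superimposed_search(patterns: list[str], text: str) -> list[tuple[int, str]]:
    """Exact occurrences (start, pattern) found via a superimposed class pattern."""
    if not patterns:
        return []
    m = len(patterns[0])
    if m == 0 or any(len(p) != m for p in patterns):
        raise ValueError("this sketch needs non-empty patterns of equal length")

    # table[c] has bit i set when letter c belongs to the class at position i.
    table: dict[str, int] = {}
    for p in patterns:
        for i, ch in enumerate(p):
            table[ch] = table.get(ch, 0) | (1 << i)

    accept = 1 << (m - 1)
    state = 0
    found = []
    for j, ch in enumerate(text):
        # Shift-And step: extend every partial match whose next class contains ch.
        state = ((state << 1) | 1) & table.get(ch, 0)
        if state & accept:
            # Filter hit: check each individual pattern in the candidate window.
            start = j - m + 1
            window = text[start:j + 1]
            for p in patterns:
                if window == p:
                    found.append((start, p))
    return found
```

With the patterns "cat" and "dog", the text "a cot and a dog" triggers the filter at "cot" (each letter belongs to the class at its position), but verification discards it and only "dog" at position 12 is reported.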
G. Navarro. Approximate Text Searching. PhD thesis, Dept. of Computer Science, Univ. of Chile, December 1998. Technical Report TR/DCC-98-14. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/thesis98.ps.gz.
....generalize to multipattern search, as explained next. A special requirement of our application is the need for multipattern search. That is, we are given r patterns P_1 ... P_r and we have to report all their occurrences. Little work has been done on multipattern search for the k differences problem [19, 21, 5, 22]. In Sections 5 and 6 we adapt two of those approaches to the k insertions problem. The first one obtains a speedup of = 1 ) (where α = k/m is the error level) over the basic bit-parallel algorithm of Section 4. This speedup is larger than 1 for =e 1. The second one obtains a speedup ....
....bits of the simulation do not fit in the computer word we set up as many computer words as needed. Since each one is updated in O(1) time per text character, the total complexity is O(nm log(k)/w). For short patterns (i.e. m log k = O(w)) this is O(n).
5 A Multipattern Filter
As already noted in [5, 22], the ability of bit-parallel algorithms to allow classes of characters can be used to build multipattern filters. Imagine that the pattern is not a sequence of letters but a sequence of classes of letters. A letter a is said to match P at position i if a ∈ P_i, i.e. if it belongs to the ....
G. Navarro. Approximate Text Searching. PhD thesis, Dept. of Computer Science, Univ. of Chile, December 1998. Technical Report TR/DCC-98-14.
....from 0 to 1; (b) this point does not depend on m asymptotically; and (c) it depends on σ linearly as predicted by the analysis (α = σ/e) − 1) except that the e has been changed to about 1.09. Interestingly, this is similar to the result obtained for the k differences problem in [6, 7] when relating their analytical predictions (α 1 − e/ σ) with the experiments (α = 1 − 1.09/ σ), and shows a consistent behavior of the pessimistic analytical model used in both cases.
3 Experimental Results
We experimentally studied how the probabilistic model of string ....
G. Navarro. Approximate Text Searching. PhD thesis, Dept. of Computer Science, Univ. of Chile, December 1998. Technical Report TR/DCC-98-14. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/thesis98.ps.gz.
....tree) and therefore it can select the best partition at query time. We show analytically that the average search time can be made O(n 2( H ( 1 ) where H_σ(α) is the base-σ entropy function. This is sublinear for α < 1 − e/√σ, where e = 2.718... On the other hand, the results of [7, 25] show that sublinearity cannot be achieved for α > 1 − e/√σ. We implement the index using a suffix array instead of a suffix tree. The suffix array takes only 4 times the text size and multiplies the above search cost only by O(log n). We also use a faster node processing algorithm based on ....
....complexity for α < 1 − c/√σ, as well as formulas for the optimum j to use, which is Θ(m / log_σ n) with a complicated constant. For larger α values the pattern partitioning method gives linear complexity and we need to resort to the traditional suffix tree traversal (j = 1). As shown in [7, 25], it is very unlikely that this limit of 1 − c/√σ can be improved, since there are too many real approximate occurrences in the text. A simplified technique that gives a reasonable result in most cases is to select j = m k) log n, for a complexity of O n 1 log (1= 1 = O ....
G. Navarro. Approximate Text Searching. PhD thesis, Dept. of Computer Science, Univ. of Chile, December 1998. Technical Report TR/DCC-98-14. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/thesis98.ps.gz.
....polynomial in m takes O(mn / log m) time. A special requirement of our application is the need for multipattern search. That is, we are given r patterns P_1 ... P_r and we have to report all their occurrences. Very little work has been done on multipattern search for the k differences problem [13, 4, 15, 5, 16]. In Sections 5 and 6 we adapt two of those approaches to the k insertions problem. The first one obtains a speedup of = 1 ) 1 (where α = k/m) over the basic bit-parallel algorithm of Section 4. This speedup is larger than 1 for =e 1. The second one obtains a speedup of w= log 2 ....
....the superimposition is found for the presence of any of the individual patterns. That is, each time the algorithm finds the superimposed pattern at text position j, we check each of the patterns separately (with the same algorithm) in the text area T j m k 1: j . A similar idea was proposed in [4, 5, 16] for the k differences problem. To avoid re-verification due to overlapping areas, we keep track of the last position verified and the state of the verification algorithm. If a new verification requirement starts before the last verified position, we start the verification from the last verified ....
G. Navarro. Approximate Text Searching. PhD thesis, Dept. of Computer Science, Univ. of Chile, December 1998. Technical Report TR/DCC-98-14. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/thesis98.ps.gz.
....be by far superior to all other implemented proposals, and we show analytically that the average retrieval time can be made O(n 2(α H_σ(α) 1 α) where H_σ(α) is the base-σ entropy function. This is sublinear for α < 1 − e/√σ. This limit on α probably cannot be improved [8, 25]. We finally propose an alternative data structure to reduce the space requirements of the suffix tree, with little time penalty.
2 Combining Suffix Trees and Pattern Partitioning
We present now our alternative proposal. The general idea is to partition the pattern in pieces, search each piece in ....
....for α < 1 − c/√σ, as well as formulas for the optimum j to use, which is Θ(m / log_σ n) with a complicated constant. For larger α values the pattern partitioning method gives linear complexity and we need to resort to the traditional suffix tree traversal (j = 1). As shown in [8, 25], it is very unlikely that this limit of 1 − c/√σ can be improved, since there are too many real approximate occurrences in the text. An interesting fact that is shown in the experiments is that in many cases the optima are out of bounds and hence the best is to put j in the limit of ....
G. Navarro. Approximate Text Searching. PhD thesis, Dept. of Computer Science, Univ. of Chile, December 1998. Technical Report TR/DCC-98-14. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/thesis98.ps.gz.
....the segments (or occurrences) of the text whose edit distance to the pattern is at most k (the number of allowed errors). This problem has a number of other applications in computational biology, signal processing, etc. There exist a number of solutions for the on-line version of this problem [31] (i.e. the pattern can be preprocessed but the text cannot). All these algorithms traverse the whole text sequentially. If the text database is large, even the fastest on-line algorithms are not practical, and preprocessing the text becomes mandatory. This is normally the case in IR. However, the ....
G. Navarro. Approximate Text Searching. PhD thesis, Dept. of Computer Science, Univ. of Chile, December 1998. Technical Report TR/DCC-98-14. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/thesis98.ps.gz.
G. Navarro. Approximate Text Searching. PhD thesis, Dept. of Computer Science, Univ. of Chile, December 1998. Technical Report TR/DCC-98-14. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/thesis98.ps.gz.
G. Navarro. Approximate Text Searching. PhD thesis, University of Chile, Santiago, Chile, 1998.