G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate q-grams. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching (CPM'2000).
....is also lossless but produces more potential matches, which increases the time required for verification. Filters that use q-grams are particularly suitable for indexed string matching since there are efficient data structures for indexing all text q-grams. Indexed q-gram filters are described in [9, 10, 15, 17, 19, 26]. A generalization of the q-gram filter uses gapped q-grams, subsets of q characters of a fixed non-contiguous shape. For example, the 3-grams of shape ##-# in the string ACAGCT are ACG, CAC and AGT. Gapped q-grams have been used in [7, 20, 13]. In [7, 20] the motivation is to increase the ....
....where contiguous q-grams and related methods have been found to be useful. We have taken some first steps in studying these possibilities [6, 11] but a lot more remains to be done. There are several variants of the basic q-gram filter, including sampling (use only a subset of all q-grams) [8, 15, 17, 19, 24, 25, 26, 27], approximate q-grams (allow errors in the q-grams) [8, 15, 19, 25] and multiple shapes [7, 20]. A generic technique for computing thresholds applicable to these and other variants of q-gram filters is described in [11]. This technique is applicable to the Levenshtein distance as well as the ....
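The gapped q-gram example in the excerpt above can be reproduced with a minimal Python sketch. The function name `gapped_qgrams` and the convention of writing the shape as a string (with `#` for a sampled position and any other character for a gap) are assumptions made for illustration; they do not come from the cited papers.

```python
def gapped_qgrams(text, shape):
    """Return the gapped q-grams of `text` for a shape string.

    '#' marks a position that is kept; any other character marks a gap.
    (Hypothetical helper, written only to illustrate the excerpt.)
    """
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    span = len(shape)
    return [
        "".join(text[start + o] for o in offsets)
        for start in range(len(text) - span + 1)
    ]


if __name__ == "__main__":
    # Shape with two kept positions, one gap, one kept position.
    print(gapped_qgrams("ACAGCT", "##-#"))  # ['ACG', 'CAC', 'AGT']
```

A contiguous q-gram is the special case where the shape has no gaps, e.g. `"###"`.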
[Article contains additional citation context not shown here]
Navarro, G., Sutinen, E., Tanninen, J., Tarhio, J.: Indexing Text with Approximate q-Grams, Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, number 1848 in LNCS, Springer, 2000.
....sequences. Another related area concerns the problems of indexing for exact and approximate string searching, which have received considerable attention [27]. Some examples of indexing methods developed for string matching are suffix trees [9], suffix arrays [19], q-grams [10] and q-samples [28]. Also related is the work on episode matching, in which only insertions in the text are permitted [6, 36]. However, the unordered nature of sequence elements and the freedom to represent their order in the sequence makes the techniques developed for string and episode matching inappropriate for ....
G. Navarro, E. Sutinen, J. Tanninen, J. Tarhio, Indexing text with approximate q-grams, Proceedings of the 11th Annual Symposium On Combinatorial Pattern Matching (CPM'2000).
....index structures for strings have not produced viable, full sensitivity search tools, applicable in biology. In this domain the following structures have been tested in the persistent context, with approximate matching. N-grams (q-grams) have been found to be useful where close matches are sought [34, 25, 5, 22, 27], but could not deliver more distant matches [24]. The suffix array [19] was tested with small amounts of DNA under a unit cost model [3] and was found to be superior to the suffix tree. We tested the suffix tree [14] and found it to be potentially useful, but not delivering fast performance, due to its ....
G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing Text with Approximate q-grams. In CPM'2000.
....significantly. The added complexity makes online filters slower, but on indexed filters the effect is negligible. Sampling: A popular way to reduce time requirement and/or index size at the cost of filtration efficiency is to consider only a sample, say every 5th, of the q-grams of the text or the pattern [8, 24, 14, 22, 23, 21, 17, 19]. Multiple shapes: As an opposite to sampling, the number of q-grams can be increased by using gapped q-grams of several different shapes [7, 20]. This improves filtration efficiency but increases time and/or space requirements. Approximate q-grams: Another way to improve filtration efficiency at the cost ....
....the number of q-grams can be increased by using gapped q-grams of several different shapes [7, 20]. This improves filtration efficiency but increases time and/or space requirements. Approximate q-grams: Another way to improve filtration efficiency at the cost of slower filtering is to allow errors in q-grams [14, 8, 22, 19]. Of course, various combinations of these methods are possible. For example, sampling and approximate q-grams have often been used together [14, 8, 22, 19]. Gapped q-grams, in particular, offer a lot of possibilities for combination through the use of (possibly multiple) different shapes. The ....
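The sampling variant mentioned in these excerpts can be illustrated with a short Python sketch: instead of all q-grams of a string, only every `step`-th one is kept. The names `qgrams` and `sampled_qgrams` and the choice to sample starting at position 0 are assumptions for illustration, not details taken from the cited papers.

```python
def qgrams(text, q):
    """All contiguous q-grams of `text`, paired with their start positions."""
    return [(i, text[i:i + q]) for i in range(len(text) - q + 1)]


def sampled_qgrams(text, q, step):
    """Keep only every `step`-th q-gram (a q-sample), starting at position 0.

    Sampling shrinks the index or the filtering work by roughly a factor of
    `step`, at the cost of weaker filtration.
    """
    return qgrams(text, q)[::step]


if __name__ == "__main__":
    print(qgrams("ACAGCT", 3))             # all four 3-grams
    print(sampled_qgrams("ACAGCT", 3, 2))  # every second 3-gram
```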
[Article contains additional citation context not shown here]
G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate q-grams. In Proc. 11th Annual Symposium on Combinatorial Pattern Matching, volume 1848 of LNCS, pages 350-363. Springer, 2000.
....to the symmetry of the edit distance, Lemmas 1-3 can also be applied in the other direction, where we pick substrings from the text and consider how many of them must be contained in the pattern if the substrings are part of an approximate match. This would be reminiscent of the approach used in [5]. 2.2. Overview of Some Filtering Schemes. In this section we briefly overview some of the filtering schemes we have seen in the literature. A more in-depth review on the subject is presented by Navarro in [1] or [2] and an interested reader is strongly encouraged to read the referenced original ....
....condition of Wu and Manber and states that if we choose j non-overlapping substrings from the pattern, then at least one of them must be present with at most $\lfloor k/j \rfloor$ errors in an approximate match. The original paper [4] of Baeza-Yates and Navarro considered on-line searching, but in [23] and [5] this filtering condition has also been used in indexed off-line searching. Both of these papers were based on running a bit-parallel approximate string matching algorithm [4] on a trie data structure composed from the text (a suffix tree or a suffix array in the former, and a trie of a chosen set ....
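The filtering condition described in this excerpt (split the pattern into j non-overlapping pieces; in any occurrence with at most k errors, some piece appears with at most floor(k/j) errors) can be checked on a small example. The sketch below is only an illustration: `split_pattern` and `best_match_errors` are hypothetical helper names, the pieces are chosen to be of near-equal length, and the matching uses a plain dynamic-programming recurrence rather than the bit-parallel algorithm the excerpt mentions.

```python
def split_pattern(pattern, j):
    """Split `pattern` into j non-overlapping pieces of near-equal length."""
    base, extra = divmod(len(pattern), j)
    pieces, pos = [], 0
    for i in range(j):
        size = base + (1 if i < extra else 0)
        pieces.append(pattern[pos:pos + size])
        pos += size
    return pieces


def best_match_errors(piece, text):
    """Smallest edit distance between `piece` and any substring of `text`."""
    prev = [0] * (len(text) + 1)  # an empty piece matches anywhere at cost 0
    for i, pc in enumerate(piece, 1):
        cur = [i]  # matching i piece characters against nothing costs i
        for t, tc in enumerate(text, 1):
            cur.append(min(prev[t - 1] + (pc != tc),  # match or substitute
                           prev[t] + 1,               # skip a piece character
                           cur[t - 1] + 1))           # skip a text character
        prev = cur
    return min(prev)


if __name__ == "__main__":
    pattern, k, j = "approximate", 2, 2
    occurrence = "aproximete"  # two errors: one deletion, one substitution
    errors = [best_match_errors(p, occurrence) for p in split_pattern(pattern, j)]
    print(errors, min(errors) <= k // j)
```

Only text areas where some piece matches within this reduced error budget need to be verified with the full pattern.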
Navarro G, Sutinen E, Tanninen J and Tarhio J. Indexing text with approximate q-grams. In Proceedings of CPM'2000.
No context found.
G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate q-grams. In Proc. 11th Annual Symposium on Combinatorial Pattern Matching (CPM'2000), LNCS 1848, pages 350--363, 2000.
....[Figure 17: A supra-index for our example suffix array.] ....have higher search time and do not scale well for very large texts. This line has been pursued by Sutinen, Tarhio, and others [63, 59], and can cope with both exact and approximate searching. This is particularly attractive for computational biology applications. 4.4 Inverted Indices. When the text is (Western) natural language and users are satisfied with retrieving just words and phrases (not any substring) inverted indices ....
G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate q-grams. In Proc. 11th Annual Symposium on Combinatorial Pattern Matching (CPM'2000).
....row-wise left to right. We need only the previous row in order to compute the current row. The minimum distance is finally min 0 j m C ;j . We present now a method to reduce the preprocessing time to O(rm ) which has been used before in the context of indexed approximate string matching [10]. Instead of running the grams one by one over a pattern P, we form a trie data structure of all the grams. For every trie node whose path from the root spells out the string S, we compute the last row of the C matrix corresponding to searching for S inside P. For this sake we use the ....
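The idea in this excerpt (put all grams in a trie and compute one row of the matrix per trie node, so that grams sharing a prefix share the rows for that prefix) can be sketched in Python as follows. This is a sketch under stated assumptions, not the cited authors' implementation: the trie is a nested dictionary, the key `None` marks the end of a gram, and the names `build_trie`, `next_row` and `gram_errors` are hypothetical.

```python
def build_trie(grams):
    """Nested-dict trie; the key None stores the gram that ends at a node."""
    root = {}
    for g in grams:
        node = root
        for ch in g:
            node = node.setdefault(ch, {})
        node[None] = g
    return root


def next_row(prev, ch, pattern):
    """Extend the string S spelled so far by `ch`; return the new last row.

    Column j holds the fewest errors needed to match S against a substring
    of `pattern` ending at position j.
    """
    cur = [prev[0] + 1]
    for j, pc in enumerate(pattern, 1):
        cur.append(min(prev[j - 1] + (ch != pc), prev[j] + 1, cur[j - 1] + 1))
    return cur


def gram_errors(grams, pattern):
    """Map each gram to its minimum number of errors inside `pattern`."""
    result = {}

    def visit(node, row):
        for ch, child in node.items():
            if ch is None:
                result[child] = min(row)  # minimum over the last row
            else:
                visit(child, next_row(row, ch, pattern))

    # Row for the empty string: it matches at any position with 0 errors.
    visit(build_trie(grams), [0] * (len(pattern) + 1))
    return result


if __name__ == "__main__":
    print(gram_errors(["ACG", "ACT", "GGT"], "TACGA"))
```

Here the rows for the shared prefix `AC` of `ACG` and `ACT` are computed once, which is where the saving over processing the grams one by one comes from.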
G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate q-grams. In Proc. 11th Combinatorial Pattern Matching (CPM 2000), LNCS 1848, pages 350-363, 2000.
.... in Pattern [10] Jokinen Suffix Tree Ukkonen 91 [18] Shi 96 [23] Ukkonen 93 [5] Cobbs 95 Suffix Array [7] Gonnet 88 [16] Navarro Baeza Yates 99 [10] Jokinen Q grams n a Ukkonen 91 [15] Navarro Myers 90 [13] 9] Holsti Baeza Yates 97 Sutinen 94 Q samples n a [20] Sutinen n a [17] Navarro n a Tarhio 96 et al. 2000 Table 1: Taxonomy of indexes for approximate text searching. A n a means that the data structure is unsuitable to implement that search approach because not enough information is maintained. 2 Basic Concepts 2.1 Suffix Trees Suffix trees [1] are widely ....
....be chosen at steps of $h = \lfloor (m - k - q + 1)/j \rfloor$. By Lemma 2, one of the q-samples must appear in the pattern with $\lfloor k/j \rfloor$ errors at most. Moreover, if every q-sample $i$ appears in the pattern block $Q_i = P_{hi-k \ldots hi+q-1+k}$ with $k_i$ errors, then it must hold $\sum k_i \le k$. This method [20, 17] searches every block $Q_i$ in the index of q-samples using backtracking, so as to find the least number of errors to match each text q-sample inside $Q_i$, using a slight modification to the algorithm of Section 3.2. If a zone of consecutive samples is found whose errors add up at most $k$, the ....
[Article contains additional citation context not shown here]
G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate q-grams. In Proc. 11th Ann. Symp. on Combinatorial Pattern Matching (CPM'2000).
.... and Myers can be considered as a third class of algorithms, based on reducing the problem to approximate search of pattern pieces (an intermediate between the two extremes of pure suffix tree searching and partitioning into exact searching of pattern pieces). A very recent work in this line is [31], although they show no analysis and their comparison against previous work shows that our search times are superior (albeit they need less space). 3 Basics. We present in this section the basic algorithms on which our approach builds. ....
....the idea is to partition the pattern in less than $k + 1$ pieces, so one cannot guarantee that there are pieces free of errors. However, one can reduce the number of errors that may appear in at least one of the pieces. There exist filtration approaches based on different interpretations for A and B [24, 31]. The one we use in this paper corresponds to P = A, x i = and B = $T'$, where $T'$ is an occurrence of P in T. The pattern P is split in j pieces and these are searched allowing $\lfloor k/j \rfloor$ errors in the text. Only the text areas surrounding those occurrences can contain a complete occurrence of ....
G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate q-grams. In Proc. 11th Annual Symposium on Combinatorial Pattern Matching (CPM'2000), Montreal, Canada, 2000. To appear.
No context found.
G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate q-grams. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching (CPM'2000).
No context found.
G. Navarro, E. Sutinen, J. Tanninen, J. Tarhio. Indexing text with approximate q-grams. In: CPM2000, Lecture Notes in Computer Science, vol. 1848. Springer, Berlin Heidelberg New York, 2000, pp. 350--365
No context found.
G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate q-grams. In Proc. on CPM, number 1848, pages 350--363, 2000.