52 citations found.
E. Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191--211, Jan. 1992.

This paper is cited in the following contexts:

Using q-grams in a DBMS for Approximate String Processing - Gravano, Ipeirotis.. (2001)

....proposed techniques can be used for a variety of other distance metrics as well. 2.2 Q-grams: A Foundation for Approximate String Processing. Below, we briefly review the notion of positional q-grams from the literature, and we give the intuition behind their use for approximate string matching [7, 6, 4]. Given a string σ, its positional q-grams are obtained by sliding a window of length q over the characters of σ. Since q-grams at the beginning and the end of the string can have fewer than q characters from σ, we introduce new characters # and $ not in Σ, and conceptually extend ....

....of all the \(|\sigma| + q - 1\) pairs constructed from all q-grams of σ. The intuition behind the use of q-grams as a foundation for approximate string processing is that when two strings σ1 and σ2 are within a small edit distance of each other, they share a large number of q-grams in common [6, 4]. Consider the following example. The positional q-grams of length q=3 for string john smith are {(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn ), (6,n s), (7, sm), (8,smi), (9,mit), (10,ith), (11,th$), (12,h$$)}. Similarly, the positional q-grams of length q=3 for john a smith, which is at an edit ....

Esko Ukkonen. Approximate string matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191--211, 1992.
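
The positional q-gram construction described in the excerpt above can be sketched in a few lines of Python. This is a minimal illustration, not the citing authors' implementation: the function name `positional_qgrams` and the default padding characters are assumptions made for this example (the end-padding symbol `$` in particular is a choice made here).

```python
def positional_qgrams(s, q, pad_start="#", pad_end="$"):
    """Return (position, q-gram) pairs for s, with positions starting at 1.

    The string is conceptually extended with q - 1 padding characters on
    each side, so every original character appears in exactly q windows.
    The padding characters are illustrative defaults (an assumption).
    """
    extended = pad_start * (q - 1) + s + pad_end * (q - 1)
    return [(i + 1, extended[i:i + q]) for i in range(len(extended) - q + 1)]


grams = positional_qgrams("john smith", 3)
for position, gram in grams:
    print(position, repr(gram))
```

For a string of length 10 and q = 3 this yields 10 + 3 - 1 = 12 pairs, which agrees with the count of pairs listed in the excerpt.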


One-Gapped q-Gram Filters for Levenshtein Distance - Burkhardt, Kärkkäinen (2002)

....algorithm. A filter is lossless if it never discards an actual match; we consider only lossless filters. The ability of a filter to reduce the text area is called its (filtration) efficiency. Many filters are based on q-grams, substrings of length q. The q-gram similarity (defined as a distance in [13]) of two strings is the number of q-grams shared by the strings. The q-gram filter is based on the q-gram lemma: Lemma 1 ([6]) Let P and S be strings with (Levenshtein or Hamming) distance k. Then the q-gram similarity of P and S is at least \(t = |P| - q + 1 - kq\). Supported by the DFG Initiative ....

....the threshold and gives the minimum number of q-grams that an approximate match must share with the pattern, which is used as the filter criterium. There are actually many possible ways to count the number of shared q-grams offering different tradeoffs between speed and filtration efficiency (see, e.g. [6, 13, 5, 1]). However, in all cases the value of the threshold is the one given by the lemma. A generalization of the q-gram filter uses gapped q-grams, subsets of q characters of a fixed non-contiguous shape. For example, the 3-grams of shape ##-# in the string ACAGCT are ACG, CAC and AGT. In [2] we showed ....

E. Ukkonen. Approximate string matching with q-grams and maximal matches. Theor. Comput. Sci., 92(1):191-212, 1992.
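
The gapped q-gram example in the excerpt above (3-grams over ACAGCT taken from two adjacent positions, one skipped position, and one further position) can be reproduced with a short Python sketch. The function name `gapped_qgrams` and the convention of writing the shape as a string in which `#` marks a sampled position and any other character marks a gap are assumptions made for this illustration.

```python
def gapped_qgrams(text, shape):
    """Extract gapped q-grams of the given shape from text.

    shape is a string such as "##-#": '#' marks a position that is kept,
    any other character marks a position that is skipped (hypothetical
    notation chosen for this sketch).
    """
    offsets = [i for i, mark in enumerate(shape) if mark == "#"]
    span = len(shape)  # number of text positions one q-gram covers
    return [
        "".join(text[start + offset] for offset in offsets)
        for start in range(len(text) - span + 1)
    ]


print(gapped_qgrams("ACAGCT", "##-#"))
```

With a contiguous shape such as `###` the same function returns ordinary 3-grams, so the gapped variant is a strict generalization.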


Using q-grams in a DBMS for Approximate String Processing - Gravano, Ipeirotis.. (2001)

....proposed techniques can be used for a variety of other distance metrics as well. 2.2 Q-grams: A Foundation for Approximate String Processing. Below, we briefly review the notion of positional q-grams from the literature, and we give the intuition behind their use for approximate string matching [7, 6, 4]. Given a string σ, its positional q-grams are obtained by sliding a window of length q over the characters of σ. Since q-grams at the beginning and the end of the string can have fewer than q characters from σ, we introduce new characters # and $ not in Σ, and conceptually extend the string ....

....σ is the set of all the \(|\sigma| + q - 1\) pairs constructed from all q-grams of σ. The intuition behind the use of q-grams as a foundation for approximate string processing is that when two strings σ1 and σ2 are within a small edit distance of each other, they share a large number of q-grams in common [6, 4]. Consider the following example. The positional q-grams of length q=3 for string john smith are {(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn ), (6,n s), (7, sm), (8,smi), (9,mit), (10,ith), (11,th$), (12,h$$)}. Similarly, the positional q-grams of length q=3 for john a smith, which is at an edit distance ....

Esko Ukkonen. Approximate string matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191--211, 1992.


Better Filtering with Gapped q-Grams - Burkhardt, Kärkkäinen (2001)   (11 citations)

....phases. A filter is lossless if it never discards an actual match; we consider only lossless filters. The ability of a filter to reduce the text area is called its filtration efficiency. Many filters are based on q-grams, substrings of length q. The q-gram similarity (defined as a distance in [28]) of two strings is the number of q-grams shared by the strings. The q-gram filter is based on the q-gram lemma: Lemma 1.1 (The q-gram lemma [10]). Let P and S be strings with (Levenshtein or Hamming) distance k. Then the q-gram similarity of P and S is at least \(t = |P| - q + 1 - kq\). As an ....

Ukkonen, E.: Approximate string matching with q-grams and maximal matches, Theoretical Computer Science, 92(1), 1992, 191--212.


The Suffix Sequoia Index for Approximate String Matching - Hunt (2003)   (32 citations)

....index structures for strings have not produced viable, full sensitivity search tools, applicable in biology. In this domain the following structures have been tested in the persistent context, with approximate matching. N-grams (q-grams) have been found to be useful where close matches are sought [34, 25, 5, 22, 27], but could not deliver more distant matches [24]. The suffix array [19] was tested with small amounts of DNA under a unit cost model [3] and was found to be superior to the suffix tree. We tested the suffix tree [14] and found it to be potentially useful, but not delivering fast performance, due to its ....

E. Ukkonen. Approximate string matching with q-grams and maximal matches. Theor. Comput. Sci., 92(1):191-212, 1992.


CLUSEQ: Efficient and Effective Sequence Clustering - Yang, Wang (2003)   (1 citation)

....25, 29]. A direct extension of this method to generic symbol sequences is to use short segments of fixed length q (generated using a sliding window through each sequence) as the set of words in the similarity measure. This method is also referred to as the q-gram based method in the literature [8, 22, 26, 27]. While the q-gram based approach enables significant segments (i.e. keywords, phrases, q-grams) to be identified and used to measure the similarity between sequences regardless of their relative positions in different sequences, valuable information may be lost as a result of ignoring sequential ....

E. Ukkonen. Approximate string matching with q-grams and maximal matches. Theor. Comput. Sci., vol 92(1), pp. 191-202, 1992.


Computing the Threshold for q-Gram Filters - Kärkkäinen (2002)   (1 citation)

....to reduce the text area is called its (filtration) efficiency. Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT). Many filters are based on q-grams, substrings of length q. The q-gram similarity (defined as a distance in [25]) of two strings is the number of q-grams shared by the strings. The q-gram filter is based on the q-gram lemma: Lemma 1 ([12]) Let P and S be strings with (Levenshtein or Hamming) distance k. Then the q-gram similarity of P and S is at least \(t = |P| - q + 1 - kq\). The value t in the lemma is called ....

E. Ukkonen. Approximate string matching with q-grams and maximal matches. Theor. Comput. Sci., 92(1):191-212, 1992.


One-gapped q-Gram Filters for Levenshtein Distance - Burkhardt, Kärkkäinen (2002)

....by the DFG Initiative Bioinformatik grant BIZ 4 1 1. Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT). Many filters are based on q-grams, substrings of length q. The q-gram similarity (defined as a distance in [14]) of two strings is the number of q-grams shared by the strings. The q-gram filter is based on the q-gram lemma: Lemma 1 ([7]) Let P and S be strings with (Levenshtein or Hamming) distance k. Then the q-gram similarity of P and S is at least \(t = |P| - q + 1 - kq\). The value t in the lemma is called ....

....the threshold and gives the minimum number of q-grams that an approximate match must share with the pattern, which is used as the filter criterium. There are actually many possible ways to count the number of shared q-grams offering different tradeoffs between speed and filtration efficiency (see, e.g. [7, 14, 6, 2]). However, in all cases the value of the threshold is the one given by the lemma. A generalization of the q-gram filter uses gapped q-grams, subsets of q characters of a fixed non-contiguous shape. For example, the 3-grams of shape ##-# in the string ACAGCT are ACG, CAC and AGT. In [3] we showed ....

E. Ukkonen. Approximate string matching with q-grams and maximal matches. Theor. Comput. Sci., 92(1):191-212, 1992.


Better Filtering with Gapped q-Grams - Burkhardt, Kärkkäinen (2001)   (11 citations)

....(see Lemmas 2 and 4). For example, strings ACAGCTTA and ACACCTTA have Hamming and Levenshtein distance 1 and have \(8 - 3(1 + 1) + 1 = 3\) common 3-grams: ACA, CTT and TTA. The above description of the q-gram method leaves many details open. Different realizations of the method are described in [6, 16, 5, 2]. There are also many variations, e.g. not using all q-grams [15, 14]. The q-gram method is particularly suitable for indexed string matching. An index of all text q-grams is simple to implement using table lookup, hashing or a trie. This makes the q-gram method very fast unless the number of ....

E. Ukkonen. Approximate string matching with q-grams and maximal matches. Theor. Comput. Sci., 92(1):191-212, 1992.
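
The worked example in the excerpt above (ACAGCTTA versus ACACCTTA with q = 3 and k = 1) can be checked with a small Python sketch of the q-gram lemma. The helper names `qgrams`, `qgram_similarity` and `qgram_threshold` are hypothetical, and counting shared q-grams with multiplicity is only one of the several counting schemes the citing papers mention.

```python
from collections import Counter


def qgrams(s, q):
    """All contiguous substrings of length q, in order of occurrence."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_similarity(p, s, q):
    """Number of q-grams shared by p and s, counted with multiplicity."""
    shared = Counter(qgrams(p, q)) & Counter(qgrams(s, q))
    return sum(shared.values())


def qgram_threshold(pattern_length, q, k):
    """Lower bound t = |P| - q + 1 - kq from the q-gram lemma."""
    return pattern_length - q + 1 - k * q


p, s = "ACAGCTTA", "ACACCTTA"
print(qgram_similarity(p, s, 3), qgram_threshold(len(p), 3, 1))
```

In this instance the similarity meets the threshold exactly, so a lossless filter with k = 1 would keep the candidate for verification.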


Matching Techniques for Large Music Databases - Uitdenbogerd, Zobel (1999)   (1 citation)

....of the first column of Table 1. The third column shows the frequencies of these 4-grams in the contour for part of Old MacDonald Had a Farm: ssdusdusdsddussdusdusdsd. The score is the sum of these frequencies, in this case 6. The other version of n-gram counting is based on the Ukkonen measure [26]: \(\sum_{v \in \Sigma^n} |G(x)[v] - G(y)[v]|\), where \(\Sigma^n\) is the set of possible n-grams, x and y represent the strings being compared and G(x)[v] is the number of occurrences of the n-gram v in string x. In the above example, the Ukkonen measure results in a score of 15, which is the sum of the differences between ....

E. Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92:191--211, 1992.


Similarity Searching in the CORDIS Text Database - Petrakis, Tzeras (2001)

....substitutes t) and edit(cordis, codris) = 1 (transposition of rd). 2.1.2 n-Grams. An n-gram of a string s is a substring of s of length n. A simple measure of the similarity between strings s and t counts the number of n-grams that s and t have in common. The Ukkonen version of n-gram distance [10] takes string lengths into account: \(\text{n-gram-distance}(s, t) = \sum_{g \in G_s \cup G_t} |s[g] - t[g]|\) (1), where G_s, G_t are the sets of n-grams in s, t respectively and s[g], t[g] denote the number of occurrences of n-gram g in strings s, t respectively. For example, if n = 2, G_cordis = {co, or, rd, di, ....

E. Ukkonen. Approximate String Matching With q-grams and Maximal Matches. Theoretical Computer Science, 92:191--211, 1992.
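
The n-gram distance quoted in the excerpt above, a sum of absolute differences between n-gram occurrence counts, can be sketched in Python as follows. The names `ngram_profile` and `ngram_distance` are hypothetical, and the cordis/codris input is reused here only as a convenient test string; the resulting value is computed by this sketch and is not a figure reported by the citing paper.

```python
from collections import Counter


def ngram_profile(s, n):
    """Occurrence count of every n-gram (substring of length n) in s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def ngram_distance(s, t, n):
    """Sum over all n-grams g of |count of g in s - count of g in t|."""
    profile_s, profile_t = ngram_profile(s, n), ngram_profile(t, n)
    # A Counter returns 0 for missing keys, so the union of keys suffices.
    return sum(
        abs(profile_s[g] - profile_t[g])
        for g in profile_s.keys() | profile_t.keys()
    )


print(ngram_distance("cordis", "codris", 2))
```

Each profile is built in a single pass over its string, which is in line with the linear-time computation several of the citing papers attribute to this distance.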


Fast String Correction with Levenshtein-Automata - Schulz, Mihov (2002)

....If necessary, appropriate statistical data can be used for refinement of ranking. Similarity between two words can be measured in several ways. Most useful are (dis)similarity measures based on variants of the Levenshtein distance [Lev66, WF74, WBR95, SKS96, OL97] or on n-gram distances [AFW83, Ukk92, KST92, KST94]. In this paper, we take the Levenshtein distance as a basis. The standard algorithm for computing the Levenshtein distance between two words by Wagner and Fischer [WF74] uses a dynamic programming scheme that leads to quadratic time complexity. Even with more sophisticated ....

E. Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92:191-211, 1992.


A 3-way Merging Algorithm for Synchronizing Ordered Trees - the.. - Lindholm (2001)   (2 citations)

....In this section we will define two measures of node similarity: content similarity and child list similarity. These similarity measures are used by the matching algorithm for choosing a best match, when no exact matches exist. The basis for both distances is the q-gram string distance defined in [Ukk92]. The q-gram distance was used instead of the more commonly used edit distance due to running time considerations. The q-gram distance can be computed in O(n) [Ukk92], whereas the commonly used edit distance algorithm by [Mye86] requires O(nd), where d is the edit distance. As several of the ....

....for choosing a best match, when no exact matches exist. The basis for both distances is the q-gram string distance defined in [Ukk92]. The q-gram distance was used instead of the more commonly used edit distance due to running time considerations. The q-gram distance can be computed in O(n) [Ukk92], whereas the commonly used edit distance algorithm by [Mye86] requires O(nd), where d is the edit distance. As several of the comparisons are made between totally unrelated strings, d is often close to n, yielding an average complexity close to \(O(n^2)\) for calculating the edit distance using ....

[Article contains additional citation context not shown here]

Ukkonen E. "Approximate string matching with q-grams and maximal matches." Theoretical Computer Science, vol. 92 no. 1, 1992, pp. 191--211


New and Faster Filters for Multiple Approximate String Matching - Baeza-Yates, Navarro

....of j subpatterns of length m/j with k/j errors. Only the text areas surrounding occurrences of pieces must be checked for complete matches. An important particular case of Lemma 1 arises when one considers \(j = k + 1\), since in this case some pattern piece appears unaltered (zero errors). Lemma 2: [32] If there are \(i \le j\) such that \(ed(T_{i..j}, P) \le k\), then \(T_{j-m+1..j}\) includes at least \(m - k\) characters of P. Proof: Suppose the opposite. If \(j - i < m\), then we observe that there are less than \(m - k\) characters of P in \(T_{i..j}\). Hence, more than k characters must be deleted from ....

....in both cases. Note that in case of repeated characters in the pattern, they must be counted as different occurrences. For example, if we search aaaa with one error in the text, the last four letters of each occurrence must include at least three a's. Lemma 2 (a simplification of that in [32]) says essentially that we can design a filter for approximate searching based on finding enough characters of the pattern in a text window (without regarding their ordering). For instance, the pattern survey cannot appear with one error in the text window surger because there are not five ....

[Article contains additional citation context not shown here]

E. Ukkonen. Approximate string matching with q-grams and maximal matches. Theoretical Computer Science, 1:191--211, 1992.
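
The character-counting filter that the excerpt above derives from Lemma 2 can be illustrated with a short Python sketch. It only tests a single window of text against the pattern; the function names `shared_characters` and `window_may_match` are hypothetical, and a real filter would slide the window over the text and verify surviving candidates with an exact algorithm.

```python
from collections import Counter


def shared_characters(pattern, window):
    """Count pattern characters found in window, ignoring their order.

    Repeated characters are counted as separate occurrences, which the
    multiset intersection of the two Counters takes care of.
    """
    return sum((Counter(pattern) & Counter(window)).values())


def window_may_match(pattern, window, k):
    """Necessary condition: the window holds at least m - k pattern characters."""
    return shared_characters(pattern, window) >= len(pattern) - k


print(shared_characters("survey", "surger"))     # characters s, u, r, e
print(window_may_match("survey", "surger", 1))   # needs at least 5 of 6
```

The check is a necessary condition only: a window that passes may still fail to contain an approximate occurrence, which is why the candidate areas must be verified afterwards.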


Approximate String Joins in a Database (Almost) for Free - Gravano, Ipeirotis.. (2001)

....of single characters needed to transform the first string into the second. 2.2 Q-grams: A Foundation for Approximate String Processing. Below, we briefly review the notion of positional q-grams from the literature, and we give the intuition behind their use for approximate string matching [16, 15, 13]. Given a string σ, its positional q-grams are obtained by sliding a window of length q over the characters of σ. Since q-grams at the beginning and the end of the string can have fewer than q characters from σ, we introduce new characters # and $ not in Σ, and conceptually extend the ....

....set of all the \(|\sigma| + q - 1\) pairs constructed from all q-grams of σ. The intuition behind the use of q-grams as a foundation for approximate string processing is that when two strings σ1 and σ2 are within a small edit distance of each other, they share a large number of q-grams in common [15, 13]. The following example illustrates this observation. Example 2.1 [Positional q-gram] The positional q-grams of length q=3 for string john smith are {(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn ), (6,n s), (7, sm), (8,smi), (9,mit), (10,ith), (11,th$), (12,h$$)}. Similarly, the positional q ....

E. Ukkonen. Approximate string matching with q-grams and maximal matches. In Theoretical Computer Science (TCS), 92(1):191--211, 1992.


SST: An algorithm for searching sequence databases in time .. - Eldar Giladi Michael (2000)   (8 citations)

....tree structured index for nearest neighbor windows in the database. Only windows which are at L1 distance less than a threshold T are returned. However, SST is not guaranteed to return all nearest neighbors at distance T, as we shall see in our computations. Our work is most closely related to [1, 3, 4]. 3 Computational results. We illustrate the performance of SST by applying it to detecting overlapping fragments in shotgun sequence assembly. We fragment a 1.5 megabase sequence of genomic DNA several times using a Poisson process with = 300 nucleotides. From the pool of fragments we generate ....

Ukkonen, E., Approximate string-matching with q-grams and maximal matches, Theoretical Computer Science, 92(1):191--211, 1992.


Fast String Correction with Levenshtein-Automata - Schulz, Mihov (2002)

....first checked if the word is in the dictionary. In the negative case, the words of the dictionary that are most similar to W are good suggestions for correction. Similarity between two words can be measured in several ways. Most popular are (dis)similarity measures based on n-gram distances [AFW83, Ukk92, KST92, KST94] or on variants of the Levenshtein distance [Lev66, WF74, WBR95, SKS96, OL97]. In this paper, we take the Levenshtein distance as a basis. The standard algorithm for computing the Levenshtein distance between two words by Wagner and Fischer [WF74] uses a dynamic programming ....

....Even with more sophisticated algorithms (cf. [Ukk85, Mye99]) it is not realistic to compute the Levenshtein distance between the input word W and each of the words in the dictionary, already for dictionaries of a modest size. Even if there are linear algorithms for computing n-gram distances [Ukk92], the situation is basically the same for these metrics. Several solutions have been proposed to overcome this problem. In most approaches, the correction is divided in two phases. In a first step, a filtering technique is used to select a small number of dictionary words that is guaranteed to ....

E. Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92:191-211, 1992.


The Expected Number of Missing Words in a Random Text - Rahmann, Rivals (2000)

....where the exact computation is required. Key words: average case analysis, approximate pattern matching, q-gram filtration, monkey test, autocorrelation. 1 Introduction. The number of words of length q (q-grams) is an important statistic of a text. It serves to search for a pattern in a text [9, 10], to measure distances between texts, to estimate the entropy of their source, or to construct monkey tests for pseudorandom number generators (PRNG) [6, 7]. The literature does not report any exact systematic statistical study of the number of missing q-grams in a random text, nor of the number ....

....checked with a \(O(n^2)\) dynamic programming algorithm. Their performance relies on the fact that a match is not expected to occur very often. Several filtration strategies are based on the condition that an approximate match and P should share a sufficient number of q-grams (among many others see [5, 10, 9]). The average running time depends on the NCW. The ability to compute its expectation should allow to analyze their average complexity and to determine in practice under which conditions which algorithm performs best. Those algorithms are extensively used with static or dynamic texts, especially ....

[Article contains additional citation context not shown here]

E. Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191--211, Jan. 1992.


Combinatorial Pattern Matching in Musical Sequences - Perttu (2000)   Self-citation (Ukkonen)

....up to some threshold distance. In MIR, the applicability of the Levenshtein distance, longest common subsequence, and two n-gram measures [37] are compared by Uitdenbogerd and Zobel [34]; the Levenshtein distance is found to be superior to the other methods. In the thesis, our focus is on bit parallel approximate string matching algorithms with no index structure. Approximate string matching is the topic of Chapter 4. In the above .... [Footnote 1: Dynamic programming is also a generic algorithmic technique; we use the term to refer to the use of dynamic programming in approximate string matching.]

E. Ukkonen. Approximate string matching with q-grams and maximal matches. Theoretical Computer Science, 92:


Approximate Matching of Hierarchical Data Using pq-Grams - Augsten, Böhlen, Gamper (2005)

No context found.

E. Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191--211, Jan. 1992.


Using q-grams in a DBMS for Approximate String Processing - Gravano, Ipeirotis.. (2001)

No context found.

Esko Ukkonen. Approximate string matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191--211, 1992.


A Similarity-Based Approach And Evaluation Methodology For.. - Kondrak, Dorr (2003)

No context found.

Esko Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92:191--211, 1992.


Database indexing for large DNA and protein sequence.. - Hunt, Atkinson, Irving (2002)

No context found.

E. Ukkonen. Approximate string matching with q-grams and maximal matches. Theoret Comput Sci 92(1):191--212, 1992


Finding Approximate Matches in Large Lexicons - Zobel, Dart (1995)   (7 citations)

No context found.

E. Ukkonen, `Approximate string-matching with q-grams and maximal matches', Theoretical Computer Science, 92, 191--211 (1992).

