R. A. Baeza-Yates and G. H. Gonnet, A new approach to text searching, CACM, Vol 35, (1992), pp. 74-82.
....to 2^m - 1 when it is viewed as a decimal integer. Example. For Σ = {1, ..., 9} let us consider p = 3,4,6,2, t = 3,4,6,2,8,2,4,5,7,1 and δ = 1. In the preprocessing table, DT[a] denotes the positions i where |a - p_i| ≤ δ. For example, DT[3] = 1011 because |3 - p_i| ≤ 1 for i = 1, 2, 4.

 i  p_i  DT[1] DT[2] DT[3] DT[4] DT[5] DT[6] DT[7] DT[8] DT[9]
 4   2     1     1     1     0     0     0     0     0     0
 3   6     0     0     0     0     1     1     1     0     0
 2   4     0     0     1     1     1     0     0     0     0
 1   3     0     1     1     1     0     0     0     0     0

Table 1. The table DT for pattern p = 2, 6, 4, 3 and alphabet Σ = {1, ..., 9}.

The table below evaluates DState_j using the relation (3.1). For ....
....1 1 0 0
 2   4     0     0     1     1     1     0     0     0     0
 1   3     0     1     1     1     0     0     0     0     0

Table 1. The table DT for pattern p = 2, 6, 4, 3 and alphabet Σ = {1, ..., 9}.

The table below evaluates DState_j using the relation (3.1). For example, DState_4 = (LeftShift(DState_3) OR 1) AND DT[t_4] = (LeftShift(0100) OR 1) AND DT[2] = (1000 OR 1) AND 1001 = 1001 AND 1001 = 1001, which implies that there is a match starting at position 1 of t, since the 4th bit of DState_4 is 1.

 j                              1    2    3    4    5    6    7    8    9    10
 t_j                            3    4    6    2    8    2    4    5    7    1
 LeftShift(DState_{j-1}) OR 1   0001 0011 0111 1001 0011 0001 0011 0111 1101 1001
 DT[t_j]                        1011 ....
R. A. Baeza-Yates and G. H. Gonnet, A new approach to text searching, CACM, Vol 35, (1992), pp. 74-82.
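The worked example in the two excerpts above can be checked mechanically. The following Python sketch rebuilds DT for p = 3,4,6,2 with a tolerance of 1 and applies the update DState_j = (LeftShift(DState_{j-1}) OR 1) AND DT[t_j]. The function names `build_dt` and `delta_search` are illustrative assumptions and do not come from the cited paper. The sketch also assumes that bit i (counted from the right, starting at 1) corresponds to pattern position i, as the DT[3] = 1011 example suggests.

```python
def build_dt(pattern, alphabet, delta):
    """Bit i-1 of DT[a] is set when |a - p_i| <= delta (p_i is 1-based)."""
    dt = {}
    for a in alphabet:
        mask = 0
        for i, p_i in enumerate(pattern):
            if abs(a - p_i) <= delta:
                mask |= 1 << i
        dt[a] = mask
    return dt


def delta_search(pattern, text, alphabet, delta):
    """Return (1-based match start positions, [DState_1, ..., DState_n])."""
    m = len(pattern)
    dt = build_dt(pattern, alphabet, delta)
    full = (1 << m) - 1      # keep only the m low-order bits after the shift
    high = 1 << (m - 1)      # m-th bit set: the whole pattern matched
    state, states, matches = 0, [], []
    for j, t_j in enumerate(text, start=1):
        state = (((state << 1) | 1) & full) & dt.get(t_j, 0)
        states.append(state)
        if state & high:
            matches.append(j - m + 1)
    return matches, states


p = [3, 4, 6, 2]
t = [3, 4, 6, 2, 8, 2, 4, 5, 7, 1]
matches, states = delta_search(p, t, range(1, 10), delta=1)
print(matches)                     # 1-based start positions of matches in t
print(format(states[3], "04b"))    # DState_4 as a 4-bit string
```

With these inputs the sketch reports the match starting at position 1 that the excerpt derives from DState_4 = 1001.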
....have been designed, i.e. input- or output-sensitive algorithms. In this paper we present a new way of computing an LCS of two strings by using bit-vector operations, which is really fast in practice. The idea of using the bits of the computer word has been used extensively in recent years. In [4], Baeza-Yates and Gonnet presented an O(nm/w) algorithm for the exact matching case and an O(nm log k / w) algorithm for the k-mismatches problem, where w is the number of bits in a machine word, n the length of the text and m the length of the pattern. Wu and Manber in [16] showed an O(nkm/w) ....
R. A. Baeza-Yates and G. H. Gonnet, A new approach to text searching, Comm. Assoc. Comput. Mach., 35, 74-82, (1992).
....[4] In these fields, the huge amount of data to be processed (sometimes billions of characters) calls for algorithms that are better than linear. One way to accelerate the computations is to exploit the parallelism of vector operations, especially bit-vector operations. For example, in [2] and [5] bit vectors are used to code the set of states of a nondeterministic automaton. In this paper, as in [M99], we want to accelerate computations done with deterministic automata, and we use vectors to represent sequences of events or sequences of states. Given a deterministic finite ....
R. A. Baeza-Yates and G. H. Gonnet, A New Approach to Text Searching, Communications of the ACM, 35, (1992) 74-82.
....m of visited states. Since executing one transition is usually considered to be a constant-time operation, the output sequence can be obtained in O(m) time. One way to accelerate the computations is to exploit the parallelism of vector operations, especially bit-vector operations. For example, in [2] and [5] bit vectors are used to code the set of states of a nondeterministic automaton. Another approach, developed in [9], uses bit vectors to code both the input and output sequence, and computes the output with a bounded number of bitwise operations on the input. This work prompts the ....
R. A. Baeza-Yates and G. H. Gonnet, A New Approach to Text Searching, Communications of the ACM, 35, (1992), 74-82.
....string matching method with their heuristic may be applied to the problem at the cost of losing information. Holub et al. [6] presented an algorithm based on the so-called bit-parallel algorithmic technique. They started by using the well-known shift-or algorithm of Baeza-Yates and Gonnet [1] to find multi-templates (several templates combined in a single query) within monophonic datasets, and arrived at an algorithm capable of finding multi-templates within polyphonic datasets in O(n 0 q) time 1 with a preprocessing phase taking O(rm j 0 j) time. Here q, r and j 0 j denote ....
R. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Communications of the ACM, 35(10):74--82, 1992.
....independent of any parameter other than the length of the sequences. The idea of packing bits in a computer word to speed up algorithms has been used extensively in the last few years. One of the simplest and best-known algorithms is the Shift-And algorithm, originally by Baeza-Yates and Gonnet [3] and subsequently modified by Wu and Manber [19], which solves the exact pattern matching problem in O(nm/w), where n and m are the lengths of the two input strings and w is the number of bits in a machine word (normally 32 or 64). Navarro and Raffinot [14] obtained a fast exact matching algorithm ....
R. A. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Commun. ACM, 35(10):74--82, 1992.
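As a rough illustration of the Shift-And idea mentioned in the excerpt above, here is a minimal Python sketch for exact matching. The function name `shift_and` and the 0-based reporting of positions are assumptions made for this example. Python integers are unbounded, so the sketch does not model the machine-word limit w behind the O(nm/w) bound; it only shows the per-character update.

```python
def shift_and(pattern, text):
    """Return the 0-based start positions of every occurrence of pattern."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit i set when pattern[i] == c
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)    # bit m-1 set: all m characters matched
    state, hits = 0, []
    for j, c in enumerate(text):
        # extend every partial match by one character and start a new one
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits


print(shift_and("abab", "abababcabab"))
```

Overlapping occurrences are reported because the state keeps every partial match alive in parallel.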
....we have also the choice of using it and matching P_{1..i-1} with the best suffix of T_{1..j-1} (the cost is C_{i-1}). This algorithm is O(mn) time and O(m) space. 3. A Bit-parallel Simulation. Bit-parallelism is a technique of common use in string matching [2], first proposed in [1, 3]. The technique consists in taking advantage of the intrinsic parallelism of the bit operations inside a computer word. By cleverly using this fact, the number of operations that an algorithm performs can be cut down by a factor of at most w, where w is the number of bits in the computer word. ....
R. Baeza-Yates and G. Gonnet. A new approach to text searching. Comm. of the ACM, 35(10):74--82, Oct. 1992.
....previous ones. These algorithms make plenty use of bit-parallelism, that consists in using the intrinsic parallelism of the bit manipulations inside computer words to perform many operations in parallel. Competitive algorithms have been obtained using bit-parallelism for exact string matching [2, 22], approximate string matching [2, 22, 23, 3, 14] and REs matching [12, 21, 17]. Although these algorithms generally work well only on patterns of moderate length, they are simpler, more flexible (e.g. they can easily handle classes of characters) and have very low memory requirements. We ....
....plenty use of bit-parallelism, that consists in using the intrinsic parallelism of the bit manipulations inside computer words to perform many operations in parallel. Competitive algorithms have been obtained using bit-parallelism for exact string matching [2, 22], approximate string matching [2, 22, 23, 3, 14], and REs matching [12, 21, 17]. Although these algorithms generally work well only on patterns of moderate length, they are simpler, more flexible (e.g. they can easily handle classes of characters) and have very low memory requirements. We performed two different types of experiments, comparing ....
[Article contains additional citation context not shown here]
R. Baeza-Yates and G. Gonnet. A new approach to text searching. CACM, 35(10):74-82, October 1992.
....A computer word is used to represent the active (1) and inactive (0) states of the NFA. If the states are properly arranged and the Thompson construction [10] is used, then all the arrows carry 1's from bit positions i to i + 1, except for the ε-transitions. Then, a generalization of Shift-Or [2] (the canonical bit-parallel algorithm for exact string matching) is presented, where for each text character two steps are performed. First, a forward step moves all the 1's that can move from a state to the next one. This is achieved by precomputing a table B : Σ → 2^{O(m)} such that the ....
R. Baeza-Yates and G. Gonnet. A new approach to text searching. CACM, 35(10):74--82, October 1992.
....In this paper we present a new algorithm which is very fast in practice, reasonably simple to implement, and supports a large number of variations of the approximate string matching problem. The algorithm is based on a numeric scheme for exact string matching developed by Baeza-Yates and Gonnet [BG89] (see also [BG91]). The algorithm can handle several variations of measures and most of the common types of queries, including arbitrary regular expressions. In our experiments, the algorithm was at least twice as fast as other algorithms we tested (which are not as flexible) and for many cases ....
....describe a tool called agrep for approximate string matching based on the algorithm. Agrep is available through anonymous ftp from cs.arizona.edu. 2. The Algorithm. We first describe the case of exact string matching. The algorithm for this case is identical with that of Baeza-Yates and Gonnet [BG89]. We then show how to extend the algorithm to search with errors. We then describe how to speed up the search with errors by using an exact search most of the time. 2.1. Exact Matching. Let R be a bit array of size m (the size of the pattern). We denote by R_j the value of the array R after the j ....
[Article contains additional citation context not shown here]
Baeza-Yates R. A., and G. H. Gonnet, "A new approach to text searching," Proceedings of the 12th Annual ACM-SIGIR conference on Information Retrieval, Cambridge, MA (June 1989), pp. 168-175.
....data, retrieval, information. 5.1 Full text scanning. Single term: Knuth, Morris and Pratt [KMP77], Boyer and Moore [BM77] and improvements [Sun90]. Multiple terms: Aho and Corasick [AC75]. Randomized algorithm: fingerprints [KR87]. Approximate match: agrep [WM92], [BYG92]. No space overhead, but slow. 5.2 Inversion. But: space overhead; expensive insertions. STAIRS [IBM]; MEDLARS ....
Ricardo Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Comm. of ACM (CACM), 35(10):74--82, October 1992.
....previous ones. These algorithms make plenty use of bit-parallelism, that consists in using the intrinsic parallelism of the bit manipulations inside computer words to perform many operations in parallel. Competitive algorithms have been obtained using bit-parallelism for exact string matching [2, 22], approximate string matching [2, 22, 23, 3, 14] and REs matching [12, 21, 17]. Although these algorithms generally work well only on patterns of moderate length, they are simpler, more flexible (e.g. they can easily handle classes of characters) and have very low memory requirements. We ....
....plenty use of bit-parallelism, that consists in using the intrinsic parallelism of the bit manipulations inside computer words to perform many operations in parallel. Competitive algorithms have been obtained using bit-parallelism for exact string matching [2, 22], approximate string matching [2, 22, 23, 3, 14], and REs matching [12, 21, 17]. Although these algorithms generally work well only on patterns of moderate length, they are simpler, more flexible (e.g. they can easily handle classes of characters) and have very low memory requirements. We performed two different types of experiments, ....
[Article contains additional citation context not shown here]
R. Baeza-Yates and G. Gonnet. A new approach to text searching. CACM, 35(10):74--82, October 1992.
....text character matches P_i then we have also the choice of using it and matching P_{1..i-1} with the best suffix of T_{1..j-1} (the cost is C_{i-1}). This algorithm is O(mn) time and O(m) space. 3 Bit-parallelism. Bit-parallelism is a technique of common use in string matching [2], first proposed in [1, 3]. The technique consists in taking advantage of the intrinsic parallelism of the bit operations inside a computer word. By cleverly using this fact, the number of operations that an algorithm performs can be cut down by a factor of at most w, where w is the number of bits in the computer word. ....
R. Baeza-Yates and G. Gonnet. A new approach to text searching. Comm. of the ACM, 35(10):74--82, October 1992.
....string matching is done with the Horspool algorithm [11], a variant of the Boyer-Moore family [6]. The speed of the Boyer-Moore string matching algorithms comes from their ability to skip (i.e. not inspect) some text characters. Agrep deals with more complex patterns using a variant of Shift-Or [2], an algorithm exploiting bit-parallelism (a concept that we explain later) to simulate nondeterministic automata (NFA) efficiently. Shift-Or, however, cannot skip text characters. Multipattern searching is treated with bit-parallelism or with a different algorithm depending on the case. As a ....
....( Dr. Prof. Mr. #) Knuth, which matches with Knuth preceded by a sequence of titles. 3 Pattern Matching Algorithms. We explain in this section the basic string and regular expression searching algorithms our software builds on. 3.1 Bit Parallelism and the Shift-Or Algorithm. In [2], a new approach to text searching was proposed. It is based on bit-parallelism [1]. This technique consists in taking advantage of the intrinsic parallelism of the bit operations inside a computer word. By cleverly using this fact, the number of operations that an algorithm performs can be cut ....
[Article contains additional citation context not shown here]
R. Baeza-Yates and G. Gonnet. A new approach to text searching. Communications of the ACM, 35(10):74--82, 1992.
....In the preprocessing table, DT[a] denotes the positions i where |a - p_i| ≤ δ. For example, DT[3] = 1011 because |3 - p_i| ≤ 1 for i = 1, 2, 4. The table below evaluates DState_j using the relation (3.1). For example, DState_4 = (LeftShift(DState_3) OR 1) AND DT[t_4] = (LeftShift(0100) OR 1) AND DT[2] = (1000 OR 1) AND 1001 = 1001 AND 1001 = 1001.

 i  p_i  DT[1] DT[2] DT[3] DT[4] DT[5] DT[6] DT[7] DT[8] DT[9]
 4   2     1     1     1     0     0     0     0     0     0
 3   6     0     0     0     0     1     1     1     0     0
 2   4     0     0     1     1     1     0     0     0     0
 1   3     0     1     1     1     0     0     0     0     0

Table 1. The table DT for pattern p = 2, 6, 4, 3 and alphabet Σ = {1, ..., 9}.

 j ....
....For example, DT[3] = 1011 because |3 - p_i| ≤ 1 for i = 1, 2, 4. The table below evaluates DState_j using the relation (3.1). For example, DState_4 = (LeftShift(DState_3) OR 1) AND DT[t_4] = (LeftShift(0100) OR 1) AND DT[2] = (1000 OR 1) AND 1001 = 1001 AND 1001 = 1001.

 i  p_i  DT[1] DT[2] DT[3] DT[4] DT[5] DT[6] DT[7] DT[8] DT[9]
 4   2     1     1     1     0     0     0     0     0     0
 3   6     0     0     0     0     1     1     1     0     0
 2   4     0     0     1     1     1     0     0     0     0
 1   3     0     1     1     1     0     0     0     0     0

Table 1. The table DT for pattern p = 2, 6, 4, 3 and alphabet Σ = {1, ..., 9}.

 j                              1    2    3    4    5    6    7    8    9    10
 t_j                            3    4    6    2    8    2    4    5    7    1
 LeftShift(DState_{j-1}) ....
R. A. Baeza-Yates and G. H. Gonnet, A new approach to text searching, CACM, Vol 35, (1992), pp. 74-82.
....previous ones. These algorithms make plenty use of bit-parallelism, that consists in using the intrinsic parallelism of the bit manipulations inside computer words to perform many operations in parallel. Competitive algorithms have been obtained using bit-parallelism for exact string matching [2, 26], approximate string matching [2, 26, 27, 3, 17] and REs matching [15, 25, 20]. Although these algorithms generally work well only on patterns of moderate length, they are simpler, more flexible (e.g. they can easily handle classes of characters) and have very low memory requirements. We ....
....plenty use of bit-parallelism, that consists in using the intrinsic parallelism of the bit manipulations inside computer words to perform many operations in parallel. Competitive algorithms have been obtained using bit-parallelism for exact string matching [2, 26], approximate string matching [2, 26, 27, 3, 17], and REs matching [15, 25, 20]. Although these algorithms generally work well only on patterns of moderate length, they are simpler, more flexible (e.g. they can easily handle classes of characters) and have very low memory requirements. We performed two different types of time experiments, ....
[Article contains additional citation context not shown here]
R. Baeza-Yates and G. Gonnet. A new approach to text searching. CACM, 35(10):74-82, October 1992.
....models of machines accounting for bitwise boolean operations and shifts with those of conventional machines, such as Turing machines, RAMs etc. [TRL92, BG95]. Going back to the first motivation of [PRS74], concrete applications of this technique to varieties of string matching problems began with [BYG92, WM92]: they are known as bit-parallelism or shift-OR. We follow this path with our problem, which is close to the problems treated in [BYG92, WM92, BYN96], although it is different from these problems. In what follows, we use a refinement of the RAM model, which is a more realistic model of ....
....[TRL92, BG95]. Going back to the first motivation of [PRS74], concrete applications of this technique to varieties of string matching problems began with [BYG92, WM92]: they are known as bit-parallelism or shift-OR. We follow this path with our problem, which is close to the problems treated in [BYG92, WM92, BYN96], although it is different from these problems. In what follows, we use a refinement of the RAM model, which is a more realistic model of computation. Moreover, we encode A in such a way that (i) each state of A can be stored in a single memory location, and (ii) only the most basic ....
R. Baeza-Yates, G. Gonnet, A new approach to text searching, Communications of the ACM, Vol 35 (1992), 74--82.
....complexity of several matching problems and presented algorithms for strings with classes, which are sets of characters, but as he himself noted, even though the worst-case running times were better than quadratic, the algorithms were not practical. Baeza-Yates and Gonnet's shift-or algorithm [BG92] can handle all of the above options and it is very practical, but only for short patterns. The approximate string matching problem is a generalization of the exact string matching problem in that now we are looking for all substrings in the text that are similar to the pattern. There are many ....
Baeza-Yates R. A., and G. H. Gonnet, "A new approach to text searching," Communications of the ACM 35 (October 1992), pp. 74-82.
....by taking into account separators that contain a period ( Those leaves of the Huffman tree will have a bit mask composed of zeros and therefore no phrase occurrence will contain them. The remaining problem is how to implement this automaton efficiently. The algorithm of choice is Shift-Or (Baeza-Yates and Gonnet, 1992), which is able to simulate an automaton of up to w states (where w is the length in bits of the computer word) performing a constant number of operations per text character. In our case, it means that we can solve phrases of up to 32 or 64 words, depending on the machine, extremely fast. Longer ....
Baeza-Yates, R. and G. Gonnet: 1992, `A new approach to text searching'. Communications of the ACM 35(10), 74--82.
....tree is reached its bit mask is sent to the automaton. An active state i - 1 will activate the state i only if the i-th bit of the mask is active. Therefore, the automaton makes one transition per word of the text. This automaton can be implemented efficiently using the Shift-Or algorithm [1]. This algorithm is able to simulate an automaton of up to w + 1 states (where w is the length in bits of the computer word) performing just two operations per text character. This means that it can search phrases of up to 32 or 64 words, depending on the machine. This is more than enough for ....
R. Baeza-Yates and G. Gonnet. A new approach to text searching. Comm. of the ACM, 35(10):74--82, October 1992.
....surger because there are not five letters of the pattern in the text. However, the filter cannot discard the possibility that the pattern appears in the text window yevrus. 2.2 Bit Parallelism. Bit-parallelism is a technique of common use in string matching [3]. It was first proposed in [2, 4]. The technique consists in taking advantage of the intrinsic parallelism of the bit operations inside a computer word. By cleverly using this fact, the number of operations that an algorithm performs can be cut down by a factor of at most w, where w is the number of bits in the computer word. ....
R. Baeza-Yates and G. Gonnet. A new approach to text searching. CACM, 35(10):74--82, October 1992.
....bit set in 1. Any conventional pattern matching algorithm can be used for exact searching and a multi-pattern matching algorithm is used for searching allowing errors, as explained later on. The second algorithm searches on a plain Huffman code and is based on a word-oriented Shift-Or algorithm [Baeza-Yates and Gonnet 1992]. In this case the compression obtained is better than with tagged Huffman code because the search algorithm does not need any special marks on the compressed text. The third algorithm is a combination of the previous ones, where the ....
....to state i + 1 whenever the i-th word of the pattern is recognized. Notice that this automaton depends only on the number of words in the phrase query. After reaching a leaf we return to the root of the tree and proceed in the compressed text. The automaton is simulated with the Shift-Or algorithm [Baeza-Yates and Gonnet 1992]. We perform one transition in the automaton for each text word. The Shift-Or algorithm simulates efficiently the nondeterministic automaton using only two operations per transition. In a 32-bit architecture it can search a phrase of up to 32 elements using a single computer word as the bit mask. ....
Baeza-Yates, R. and Gonnet, G. 1992. A new approach to text searching. Communications of the ACM 35, 10, 74--82.
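The word-oriented use of Shift-Or described in the excerpts above can be sketched as follows. This is only an illustration under simplifying assumptions: the text is a plain list of words rather than Huffman-compressed data, each word's bit mask is looked up in a dictionary instead of being stored at a Huffman-tree leaf, and the name `phrase_shift_or` is hypothetical. In Shift-Or a 0 bit marks an active state, so one shift and one OR perform the transition. The extra AND only truncates Python's unbounded integers to m bits, which a fixed-width machine word does implicitly.

```python
def phrase_shift_or(phrase, words):
    """Return 0-based indices of the text words where the phrase starts."""
    m = len(phrase)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # mask[w] has bit i cleared when the i-th phrase word equals w
    mask = {}
    for i, w in enumerate(phrase):
        mask[w] = mask.get(w, all_ones) & ~(1 << i)
    last = 1 << (m - 1)
    state = all_ones             # all bits 1: no state is active yet
    hits = []
    for j, w in enumerate(words):
        # one transition per text word: a shift and an OR
        state = ((state << 1) | mask.get(w, all_ones)) & all_ones
        if (state & last) == 0:  # bit m-1 is 0: the whole phrase matched
            hits.append(j - m + 1)
    return hits


text = "a new approach to text searching is a new approach".split()
print(phrase_shift_or(["new", "approach"], text))
```

A word that does not occur in the phrase gets the all-ones mask, which deactivates every state, matching the excerpts' remark that a mask of zeros (in their 1-is-active convention) rules out any occurrence.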
....by taking into account separators that contain a period ( Those leaves of the Huffman tree will have a bit mask composed of zeros and therefore no phrase occurrence will contain them. The remaining problem is how to implement this automaton efficiently. The algorithm of choice is Shift-Or [3], which is able to simulate an automaton of up to w states (where w is the length in bits of the computer word) performing a constant number of operations per text character. In our case, it means that we can solve phrases of up to 32 or 64 words, depending on the machine, extremely fast. Longer ....
R. Baeza-Yates and G. Gonnet. A new approach to text searching. Comm. of the ACM, 35(10):74--82, October 1992.
....error ratios. The longer the patterns, the smaller the error ratios for which our algorithm is the best. In the other cases, [6] is the best, except for k very close to m, where automaton partitioning becomes O(mn / log n) and outperforms the others. As in the shift-or algorithm for exact matching [4], we can specify a set of characters at each position of the pattern instead of a single one (e.g. to search for "text" in case-insensitive mode, we search for {t,T}{e,E}{x,X}{t,T}). In fact, we can represent any limited expression, as defined in [26]. This is achieved by modifying the t table, ....
R. Baeza-Yates and G. Gonnet. A new approach to text searching. CACM, 35(10):74--82, October 1992.
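The class-of-characters extension mentioned in the excerpt above only changes how the per-character table is built. The sketch below illustrates this for exact matching, written in the Shift-And form (1 = active) for readability; it is not the approximate-matching algorithm of the citing paper. The name `shift_and_classes` and the list-of-sets pattern representation are assumptions made for the example.

```python
def shift_and_classes(classes, text):
    """classes: one set of allowed characters per pattern position.

    Returns the 0-based start positions of all matches in text.
    """
    m = len(classes)
    if m == 0:
        return []
    # table[c] has bit i set when character c is allowed at position i
    table = {}
    for i, allowed in enumerate(classes):
        for c in allowed:
            table[c] = table.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)
    state, hits = 0, []
    for j, c in enumerate(text):
        state = ((state << 1) | 1) & table.get(c, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits


# {t,T}{e,E}{x,X}{t,T}: case-insensitive search for "text"
case_insensitive = [{ch.lower(), ch.upper()} for ch in "text"]
print(shift_and_classes(case_insensitive, "Plain TEXT and text and TeXt"))
```

The scanning loop is unchanged from the single-character case; only the table construction differs, which is why the extension costs nothing at search time.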
No context found.
R. Baeza-Yates, G.H. Gonnet. A new approach to text searching. Communications of the ACM, 35(10): 74--82, 1992.