  Fast and flexible string matching by combining bit-parallelism and suffix automata [28 citations — 11 self]

by Gonzalo Navarro
ACM Journal of Experimental Algorithmics (JEA)
http://www-igm.univ-mlv.fr/~raffinot/ftp/jacm98.ps.gz

Abstract:

Several string matching algorithms exist; the most famous are Knuth-Morris-Pratt (KMP), Boyer-Moore (BM), and some variations on BM, such as Horspool and Sunday. Most of these algorithms rely on different kinds of automata to speed up the search, which were traditionally made deterministic. Since 1990, two new approaches have been studied separately. The first simulates the automata in their nondeterministic form by using bits and exploiting the intrinsic parallelism inside the computer word, e.g. Shift-Or. Those algorithms have been extended to handle classes of characters and errors in the pattern and/or in the text; their drawback is their inability to skip characters. The second uses "suffix automata" to design new optimal string matching algorithms, e.g. BDM and Turbo BDM. In this paper we merge both approaches to obtain a new algorithm, called BNDM, which uses a nondeterministic suffix automaton simulated using bit-parallelism. This algorithm is 20%-25% faster than BDM and uses less memory, is 2-3 times faster than Shift-Or, is 10%-40% faster than all the BM family, and is very simple to implement. The algorithm becomes the fastest in all cases, except for extremely short or extremely long patterns (e.g. on English text it is the fastest for patterns between 2 and 110 characters). Moreover, the algorithm inherits all the flexibility of the bit-parallel paradigm: we show that all the extensions devised for Shift-Or to handle classes of characters, multiple patterns and even errors can be sped up with the technique to skip characters. We obtain faster, very competitive algorithms for all these extensions. In particular, ours is by far the fastest technique to deal with classes of characters. As a theoretical development related to this, we introduce a new automaton to recognize suffixes of patterns with classes of characters. To the best of our knowledge, this automaton has not been studied before.
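
The idea of simulating a nondeterministic suffix automaton with bit-parallelism can be illustrated with a minimal Python sketch of exact-match BNDM. This is not the authors' code: the function name `bndm_search` and all variable names are chosen here for illustration, and the sketch follows the commonly described form of the algorithm (scan each text window right to left, keep one bit per pattern position, remember the longest pattern prefix seen to decide the shift). Python integers are unbounded, so the pattern-length limit imposed by the machine word in a real implementation does not appear here.

```python
def bndm_search(pattern, text):
    """Return the start indices of all occurrences of pattern in text.

    Illustrative sketch of BNDM; names and structure are assumptions,
    not taken from the paper.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        raise ValueError("pattern must be non-empty")

    # masks[c] has bit (m - 1 - i) set when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))

    full = (1 << m) - 1   # all m automaton states active
    high = 1 << (m - 1)   # set when a prefix of the pattern is recognized

    matches = []
    pos = 0
    while pos <= n - m:
        j = m         # number of window characters still unread
        last = m      # shift to apply after this window
        state = full
        while state:
            # Read the window backwards; keep only pattern factors
            # that still match the suffix of the window read so far.
            state &= masks.get(text[pos + j - 1], 0)
            j -= 1
            if state & high:
                if j > 0:
                    # A pattern prefix starts at window offset j:
                    # the next window may begin there.
                    last = j
                else:
                    # The whole window matched the pattern.
                    matches.append(pos)
            state = (state << 1) & full
        pos += last   # skip characters, unlike Shift-Or
    return matches
```

For example, `bndm_search("abc", "xxabcabcx")` visits three windows and returns `[2, 5]`; the first window is abandoned after reading a single character, which is the character-skipping behaviour that the abstract notes plain Shift-Or lacks.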

Citations

447 Fast pattern matching in strings – Knuth, Morris, et al. - 1977
377 A fast string searching algorithm – Boyer, Moore - 1977
307 Text Algorithms – Crochemore, Rytter - 1994
236 Fast text searching allowing errors – Wu, Manber - 1992
187 A guided tour to approximate string matching – Navarro
165 A new approach to Text searching – Baeza-Yates, Gonnet - 1992
101 A fast bit-vector algorithm for approximate string matching based on dynamic programming – Myers - 1998
100 Agrep - a fast approximate pattern-matching tool – Wu, Manber - 1992
93 String-matching and other products – Fischer, Paterson - 1974
82 Transducers and Repetitions – Crochemore - 1986
82 A very fast substring search algorithm – Sunday - 1990
74 Generalized String Matching – Abrahamson - 1987
72 Speeding up two string matching algorithms – Crochemore, Czumaj, et al. - 1994
67 Faster approximate string matching – Baeza-Yates, Navarro - 1999
60 Practical fast searching in strings – Horspool - 1980
44 Text retrieval: Theory and practice – Baeza-Yates - 1992
42 A subquadratic algorithm for approximate limited expression matching – Wu, Manber, et al. - 1996
38 The complexity of pattern matching for a random string – Yao - 1979
33 A Comparison of Approximate String Matching Algorithms – Jokinen, Tarhio, et al. - 1996
29 Fast and practical approximate pattern matching – Baeza-Yates, Perleberg - 1992
26 Approximate Text Searching – Navarro - 1998
24 A bit-parallel approach to suffix automata: Fast extended string matching – Navarro, Raffinot - 1998
23 NR-grep: a fast and flexible pattern matching tool – Navarro - 2001
21 Very fast and simple approximate string matching – Navarro, Baeza-Yates - 1999
21 Approximate Boyer-Moore string matching – Tarhio, Ukkonen - 1993
21 Faster Approximate String Matching, Algorithmica – Baeza-Yates, Navarro - 1999
18 Fast string matching with mismatches – Baeza-Yates, Gonnet
14 Fast regular expression search – Navarro, Raffinot - 1999
13 Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching – Navarro, Raffinot
11 Average Sizes of Suffix Trees and DAWGs – Blumer, Ehrenfeucht, et al. - 1989
9 Efficient string matching with don’t care patterns – Pinter - 1985
7 A partial deterministic automaton for approximate string matching – Navarro - 1997
6 Boyer-Moore Strategy to Efficient Approximate String Matching – El-Mabrouk, Crochemore - 1996
6 On the multi backward dawg matching algorithm (MultiBDM) – Raffinot - 1997
4 Fast practical multi-pattern matching. Rapport 93-3, Institut Gaspard Monge, Université de Marne la Vallée – Crochemore, Czumaj, et al. - 1993
2 Asymptotic estimation of the average number of terminal states in dawgs – Raffinot - 1997
2 Recherches de mot. Thèse de doctorat, Université d'Orléans – Lecroq - 1992
1 Recherches de mot. Ph.D. thesis, Université d'Orléans – Lecroq - 1992