Abstract:
Several string matching algorithms exist; the most famous are Knuth-Morris-Pratt (KMP), Boyer-Moore (BM), and variations on BM such as Horspool and Sunday. Most of these algorithms rely on different kinds of automata to speed up the search, and these automata were traditionally made deterministic. Since 1990, two new approaches have been studied separately. The first simulates the automata in their nondeterministic form by using bits and exploiting the intrinsic parallelism inside the computer word, e.g. Shift-Or. These algorithms have been extended to handle classes of characters and errors in the pattern and/or in the text; their drawback is their inability to skip characters. The second uses "suffix automata" to design new optimal string matching algorithms, e.g. BDM and Turbo BDM. In this paper we merge both approaches to obtain a new algorithm, called BNDM, which uses a nondeterministic suffix automaton simulated using bit-parallelism. This algorithm is 20%-25% faster than BDM and uses less memory, is 2-3 times faster than Shift-Or, is 10%-40% faster than all the BM family, and is very simple to implement. It is the fastest algorithm in all cases except for extremely short or extremely long patterns (e.g. on English text it is the fastest for patterns between 2 and 110 characters). Moreover, the algorithm inherits all the flexibility of the bit-parallel paradigm: we show that all the extensions devised for Shift-Or to handle classes of characters, multiple patterns, and even errors can be sped up with the technique to skip characters. We obtain faster, very competitive algorithms for all these extensions. In particular, ours is by far the fastest technique to deal with classes of characters. As a related theoretical development, we introduce a new automaton to recognize suffixes of patterns with classes of characters. To the best of our knowledge, this automaton has not been studied before.
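To make the idea of a bit-parallel nondeterministic suffix automaton concrete, the following is a minimal Python sketch of a BNDM-style search for a plain pattern. It is an illustration written under assumptions, not the paper's own code: the function name `bndm_search`, the dictionary-based table `B`, and the use of Python's unbounded integers (masked to `m` bits to imitate a computer word) are all choices made here. Each window of the text is scanned backwards; the bit vector `D` tracks which pattern factors are still compatible with the characters read, and `last` remembers the longest pattern prefix seen, which determines how far the window can be shifted.

```python
def bndm_search(pattern, text):
    """Return the start positions of all occurrences of pattern in text.

    Illustrative BNDM-style sketch: names and data layout are assumptions.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    mask = (1 << m) - 1      # keeps D within m bits, like a machine word
    high = 1 << (m - 1)      # highest bit: a pattern prefix has been recognized

    # B[c] has bit (m - 1 - i) set when pattern[i] == c.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))

    matches = []
    pos = 0
    while pos <= n - m:
        j = m                # number of window characters not yet read
        last = m             # shift to apply after this window
        D = mask             # every factor is initially possible
        while D != 0:
            # Read the window backwards and update the automaton state.
            D &= B.get(text[pos + j - 1], 0)
            j -= 1
            if D & high:
                if j > 0:
                    last = j             # a proper prefix starts at pos + j
                else:
                    matches.append(pos)  # the whole window matched
            D = (D << 1) & mask
        pos += last
    return matches


print(bndm_search("abra", "abracadabra"))
```

In this sketch, a window whose last character does not occur in the pattern empties `D` after a single step, so the window advances by the full pattern length without reading the remaining characters. That is the character-skipping behaviour the abstract contrasts with Shift-Or.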