Results 1  10
of
51
A Guided Tour to Approximate String Matching
 ACM COMPUTING SURVEYS
, 1999
"... We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining t ..."
Abstract

Cited by 598 (36 self)
 Add to MetaCart
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
A New Approach to Text Searching
"... We introduce a family of simple and fast algorithms for solving the classical string matching problem, string matching with classes of symbols, don't care symbols and complement symbols, and multiple patterns. In addition we solve the same problems allowing up to k mismatches. Among the feature ..."
Abstract

Cited by 293 (15 self)
 Add to MetaCart
We introduce a family of simple and fast algorithms for solving the classical string matching problem, string matching with classes of symbols, don't care symbols and complement symbols, and multiple patterns. In addition we solve the same problems allowing up to k mismatches. Among the features of these algorithms are that they don't need to buffer the input, they are real time algorithms (for constant size patterns), and they are suitable to be implemented in hardware. 1 Introduction String searching is a very important component of many problems, including text editing, bibliographic retrieval, and symbol manipulation. Recent surveys of string searching can be found in [17, 4]. The string matching problem consists of finding all occurrences of a pattern of length m in a text of length n. We generalize the problem allowing "don't care" symbols, the complement of a symbol, and any finite class of symbols. We solve this problem for one or more patterns, with or without mismatches. Fo...
Faster Approximate String Matching
 Algorithmica
, 1999
"... We present a new algorithm for online approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = \Omega\Gamma137 n) bits, ..."
Abstract

Cited by 79 (23 self)
 Add to MetaCart
We present a new algorithm for online approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = \Omega\Gamma137 n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e. whenever mk = O(log n)), where m is the pattern length and k ! m the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m = O(log n). Longer patterns can be processed by partitioning the automaton into many machine words, at O(mk=w n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps and others, at essentially the same search cost. We then explore other novel techniques t...
Fast and Flexible String Matching by Combining Bitparallelism and Suffix Automata
 ACM JOURNAL OF EXPERIMENTAL ALGORITHMICS (JEA
, 1998
"... ... In this paper we merge bitparallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bitparallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as ShiftOr. It inher ..."
Abstract

Cited by 74 (10 self)
 Add to MetaCart
... In this paper we merge bitparallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bitparallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as ShiftOr. It inherits from ShiftOr the ability to handle flexible patterns and from BDM the ability to skip characters. BNDM is 30%40% faster than BDM and up to 7 times faster than ShiftOr. When compared to the fastest existing algorithms on exact patterns (which belong to the BM family), BNDM is from 20% slower to 3 times faster, depending on the alphabet size. With respect to flexible pattern searching, BNDM is by far the fastest technique to deal with classes of characters and is competitive to search allowing errors. In particular, BNDM seems very adequate for computational biology applications, since it is the fastest algorithm to search on DNA sequences and flexible searching is an important problem in that
Indexing Methods for Approximate String Matching
 IEEE Data Engineering Bulletin
, 2000
"... Indexing for approximate text searching is a novel problem receiving much attention because of its applications in signal processing, computational biology and text retrieval, to name a few. We classify most indexing methods in a taxonomy that helps understand their essential features. We show that ..."
Abstract

Cited by 66 (10 self)
 Add to MetaCart
(Show Context)
Indexing for approximate text searching is a novel problem receiving much attention because of its applications in signal processing, computational biology and text retrieval, to name a few. We classify most indexing methods in a taxonomy that helps understand their essential features. We show that the existing methods, rather than completely different as they are regarded, form a range of solutions whose optimum is usually somewhere in between.
A Hybrid Indexing Method for Approximate String Matching
"... We present a new indexing method for the approximate string matching problem. The method is based on a suffix array combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the average retrieval time is Ç Ò � ÐÓ � Ò,forsome�� that depends on the error fraction t ..."
Abstract

Cited by 63 (10 self)
 Add to MetaCart
(Show Context)
We present a new indexing method for the approximate string matching problem. The method is based on a suffix array combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the average retrieval time is Ç Ò � ÐÓ � Ò,forsome�� that depends on the error fraction tolerated « and the alphabet size �. Itisshownthat �� for approximately « � � � Ô �,where � � � � ����. Thespace required is four times the text size, which is quite moderate for this problem. We experimentally show that this index can outperform by far all the existing alternatives for indexed approximate searching. These are also the first experiments that compare the different existing schemes.
A General Practical Approach to Pattern Matching over ZivLempel Compressed Text
, 1998
"... . We address the problem of string matching on ZivLempel compressed text. The goal is to search a pattern in a text without uncompressing it. This is a highly relevant issue to keep compressed text databases where efficient searching is still possible. We develop a general technique for string matc ..."
Abstract

Cited by 60 (10 self)
 Add to MetaCart
. We address the problem of string matching on ZivLempel compressed text. The goal is to search a pattern in a text without uncompressing it. This is a highly relevant issue to keep compressed text databases where efficient searching is still possible. We develop a general technique for string matching when the text comes as a sequence of blocks. This abstracts the essential features of ZivLempel compression. We then apply the scheme to each particular type of compression. We present the first algorithm to find all the matches of a pattern in a text compressed using LZ77. When we apply our scheme to LZ78, we obtain a much more efficient search algorithm, which is faster than uncompressing the text and then searching on it. Finally, we propose a new hybrid compression scheme which is between LZ77 and LZ78, being in practice as good to compress as LZ77 and as fast to search in as LZ78. 1 Introduction String matching is one of the most pervasive problems in computer science, with appli...
A Faster Algorithm for Approximate String Matching
 Algorithmica
, 1996
"... . We present a new algorithm for online approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length O(log n), being n the maxi ..."
Abstract

Cited by 59 (27 self)
 Add to MetaCart
(Show Context)
. We present a new algorithm for online approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length O(log n), being n the maximum size of the text. The running time achieved is O(n) for small patterns (i.e. of length m = O( p log n)), independently of the maximum number of errors allowed, k. This algorithm is then used to design two general algorithms. One of them partitions the problem into subproblems, while the other partitions the automaton into subautomata. These algorithms are combined to obtain a hybrid algorithm which on average is O(n) for moderate k=m ratios, O( p mk= log n n) for medium ratios, and O((m \Gamma k)kn= log n) for large ratios. We show experimentally that this hybrid algorithm is faster than previous ones for moderate size of patterns and error ratios, which is the case in text search...
A Bitparallel Approach to Suffix Automata: Fast Extended String Matching
, 1998
"... . We present a new algorithm for string matching. The algorithm, called BNDM, is the bitparallel simulation of a known (but recent) algorithm called BDM. BDM skips characters using a "suffix automaton " which is made deterministic in the preprocessing. BNDM, instead, simulates the nondete ..."
Abstract

Cited by 53 (5 self)
 Add to MetaCart
. We present a new algorithm for string matching. The algorithm, called BNDM, is the bitparallel simulation of a known (but recent) algorithm called BDM. BDM skips characters using a "suffix automaton " which is made deterministic in the preprocessing. BNDM, instead, simulates the nondeterministic version using bitparallelism. This algorithm is 20%25% faster than BDM, 23 times faster than other bitparallel algorithms, and 10%40% faster than all the BoyerMoore family. This makes it the fastest algorithm in all cases except for very short or very long patterns (e.g. on English text it is the fastest between 5 and 110 characters). Moreover, the algorithm is very simple, allowing to easily implement other variants of BDM which are extremely complex in their original formulation. We show that, as other bitparallel algorithms, BNDM can be extended to handle classes of characters in the pattern and in the text, multiple patterns and to allow errors in the pattern or in the text, combin...
NRgrep: A Fast and Flexible Pattern Matching Tool
 Software Practice and Experience (SPE
, 2000
"... We present nrgrep ("nondeterministic reverse grep"), a new pattern matching tool designed for efficient search of complex patterns. Unlike previous tools of the grep family, such as agrep and Gnu grep, nrgrep is based on a single and uniform concept: the bitparallel simulation of a non ..."
Abstract

Cited by 46 (7 self)
 Add to MetaCart
(Show Context)
We present nrgrep ("nondeterministic reverse grep"), a new pattern matching tool designed for efficient search of complex patterns. Unlike previous tools of the grep family, such as agrep and Gnu grep, nrgrep is based on a single and uniform concept: the bitparallel simulation of a nondeterministic suffix automaton. As a result, nrgrep can find from simple patterns to regular expressions, exactly or allowing errors in the matches, with an efficiency that degrades smoothly as the complexity of the searched pattern increases. Another concept fully integrated into nrgrep and that contributes to this smoothness is the selection of adequate subpatterns for fast scanning, which is also absent in many current tools. We show that the efficiency of nrgrep is similar to that of the fastest existing string matching tools for the simplest patterns, and by far unpaired for more complex patterns.