A Guided Tour to Approximate String Matching
ACM Computing Surveys, 1999
Cited by 598 (36 self)
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming an increasingly relevant issue for many fast-growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices in each case. We conclude with some future work directions and open problems.
On-Line Construction of Suffix Trees
1995
Cited by 437 (2 self)
An on-line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string. The new algorithm has the desirable property of processing the string symbol by symbol from left to right. It always has the suffix tree for the scanned part of the string ready. The method is developed as a linear-time version of a very simple algorithm for (quadratic size) suffix tries. Regardless of its quadratic worst case, this latter algorithm can be a good practical method when the string is not too long. Another variation of this method is shown to give in a natural way the well-known algorithms for constructing suffix automata (DAWGs).
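The "very simple algorithm for (quadratic size) suffix tries" mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's linear-time suffix tree construction: it assumes a nested-dict trie and keeps an explicit list of active nodes where a real implementation would use suffix links. The names build_suffix_trie, contains, and count_nodes are hypothetical.

```python
def build_suffix_trie(text):
    """Build a suffix trie on-line, one symbol at a time (quadratic time and size).

    Assumption: nodes are plain nested dicts; this is an illustrative sketch,
    not the linear-time suffix tree algorithm.
    """
    root = {}
    # Nodes that spell every suffix of the scanned prefix (the root is the empty suffix).
    active = [root]
    for symbol in text:
        # Extend every suffix of the scanned prefix by the new symbol.
        active = [node.setdefault(symbol, {}) for node in active]
        # The empty suffix is always a suffix of the new prefix.
        active.append(root)
        # At this point the trie for the scanned part of the text is complete.
    return root


def contains(trie, pattern):
    """Return True if pattern is a substring of the indexed text."""
    node = trie
    for symbol in pattern:
        if symbol not in node:
            return False
        node = node[symbol]
    return True


def count_nodes(trie):
    """Count trie nodes, including the root."""
    return 1 + sum(count_nodes(child) for child in trie.values())
```

For example, after build_suffix_trie("banana"), contains(trie, "nan") holds and contains(trie, "nab") does not. The node count grows with the number of distinct substrings, which is why the abstract recommends this approach only when the string is not too long.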
Fast and Practical Approximate String Matching
In Combinatorial Pattern Matching, Third Annual Symposium, 1992
Cited by 68 (0 self)
We present new algorithms for approximate string matching based on simple, but efficient, ideas. First, we present an algorithm for string matching with mismatches based on arithmetical operations that runs in linear worst-case time for most practical cases. This is a new approach to string searching. Second, we present an algorithm for string matching with errors based on partitioning the pattern that requires linear expected time for typical inputs.
1 Introduction
Approximate string matching is one of the main problems in combinatorial pattern matching. Recently, several new approaches emphasizing the expected search time and practicality have appeared [3, 4, 27, 32, 31, 17], in contrast to older results, most of which are only of theoretical interest. Here, we continue this trend by presenting two new simple and efficient algorithms for approximate string matching. First, we present an algorithm for string matching with k mismatches. This problem consists of finding all instances o...
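To make the k mismatches setting concrete, here is a brute-force baseline sketch. It is not the arithmetic-based algorithm the abstract announces. It assumes the usual formulation, in which a position matches when the pattern aligned there differs from the text in at most k characters, and the function name k_mismatch_positions is hypothetical.

```python
def k_mismatch_positions(text, pattern, k):
    """Return 0-based start positions where pattern aligns with at most k mismatches.

    Brute-force O(len(text) * len(pattern)) baseline, for illustration only.
    """
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for offset in range(m):
            if text[start + offset] != pattern[offset]:
                mismatches += 1
                if mismatches > k:
                    break  # too many mismatches at this alignment
        else:
            # The inner loop finished without break: at most k mismatches.
            hits.append(start)
    return hits
```

With text "abcabdabc" and pattern "abc", k = 0 reports the exact occurrences and k = 1 additionally accepts the alignment over "abd".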
A Comparison of Approximate String Matching Algorithms
1991
Cited by 47 (1 self)
An experimental comparison of the running time of approximate string matching algorithms for the k differences problem is presented. Given a pattern string, a text string, and an integer k, the task is to find all approximate occurrences of the pattern in the text with at most k differences (insertions, deletions, changes). We consider seven algorithms based on different approaches including dynamic programming, Boyer-Moore string matching, suffix automata, and the distribution of characters. It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.
KEY WORDS: String matching, Edit distance, k differences problem
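The dynamic programming approach to the k differences problem can be sketched as follows. This is the classical column-wise recurrence in which a match may start at any text position. It is a minimal illustration and not necessarily any of the seven implementations compared in the paper; the function name k_differences_end_positions is hypothetical.

```python
def k_differences_end_positions(text, pattern, k):
    """Return 0-based text positions where an occurrence with <= k differences ends.

    column[i] holds the minimum edit distance between pattern[:i] and some
    substring of the text ending at the current position.
    """
    m = len(pattern)
    column = list(range(m + 1))  # before any text: pattern[:i] costs i deletions
    ends = []
    for position, symbol in enumerate(text):
        previous_diagonal = column[0]  # value for (i - 1, previous position)
        # column[0] stays 0 because an occurrence may start anywhere in the text.
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == symbol else 1
            current = min(
                previous_diagonal + cost,  # match or change
                column[i] + 1,             # extra symbol in the text
                column[i - 1] + 1,         # pattern symbol missing from the text
            )
            previous_diagonal = column[i]
            column[i] = current
        if column[m] <= k:
            ends.append(position)
    return ends
```

The sketch runs in time proportional to len(text) * len(pattern) and keeps only one column in memory. For instance, searching "abcdefg" for "bxd" with k = 1 reports the single end position 3, corresponding to the substring "bcd".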
Suffix Cactus: A Cross between Suffix Tree and Suffix Array
1995
Cited by 36 (2 self)
The suffix cactus is a new alternative to the suffix tree and the suffix array as an index of large static texts. Its size and its performance in searches lie between those of the suffix tree and the suffix array. Structurally, the suffix cactus can be seen either as a compact variation of the suffix tree or as an augmented suffix array.
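For orientation, the plain suffix array that the suffix cactus is compared against can be sketched in a few lines. This is not the suffix cactus itself: it assumes a naive construction by sorting suffixes and a binary search over pattern-length prefixes, and the names build_suffix_array and find_occurrences are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction by sorting; fine for short texts, too slow for large ones.
    """
    return sorted(range(len(text)), key=lambda start: text[start:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of pattern in text via binary search."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])
```

The array stores one integer per text position, which is the compact end of the space and search-time trade-off the abstract describes.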
On Compact Directed Acyclic Word Graphs
Structures in Logic and Computer Science, 1997
Cited by 17 (2 self)
The Directed Acyclic Word Graph (DAWG) is a space-efficient data structure to treat and analyze repetitions in a text, especially in DNA genomic sequences. Here, we consider the Compact Directed Acyclic Word Graph of a word. We give the first direct algorithm to construct it. It runs in time linear in the length of the string on a fixed alphabet. Our implementation requires half the memory space used by DAWGs.
On-line construction of compact directed acyclic word graphs
In Proc. 12th Annual Symposium on Combinatorial Pattern Matching (CPM’01), volume 2089 of Lecture Notes in Computer Science, 2001
Cited by 16 (8 self)
Many different index structures, providing efficient solutions to problems related to pattern matching, have been introduced so far. Examples of these structures are suffix trees and directed acyclic word graphs (DAWGs), which can be efficiently constructed in linear time and space. Compact directed acyclic word graphs (CDAWGs) are an index structure preserving some features of both suffix trees and DAWGs, and require less space than both of them. An algorithm which directly constructs CDAWGs in linear time and space was first introduced by Crochemore and Vérin, based on McCreight's algorithm for constructing suffix trees. In this work, we present a novel on-line linear-time algorithm that builds the CDAWG for a single string as well as for a set of strings, inspired by Ukkonen's on-line algorithm for constructing suffix trees.
Pattern Discovery from Biosequences
2002
Cited by 13 (1 self)
In this thesis we have developed novel methods for analyzing biological data: the primary sequences of DNA and proteins, microarray-based gene expression data, and other functional genomics data. The main contribution is the development of the pattern discovery algorithm SPEXS, accompanied by several practical applications to real biological problems. To perform these biological studies, which integrate different types of biological data, we have developed a comprehensive web-based biological data analysis environment, Expression Profiler (http://ep.ebi.ac.uk/).
Similarity Search in Time Series Data Sets
1997
Cited by 11 (0 self)
Similarity search on time-series data sets is of growing importance in data mining. With the increasing amount of time-series data in many applications, from financial to scientific, it is important to study methods of retrieving similarity patterns efficiently and in a user-friendly way for business decision making. The thesis proposes methods for the efficient retrieval of all objects in a time-series database with a shape similar to a search template. The search template can be either a shape or a sequence of data. Two search modules, subsequence search and whole sequence search, are designed and implemented. We study a set of linear transformations that can be used as the basis for similarity queries on time-series data, and design an innovative representation technique which abstracts the notion of shape so that the user can interactively query and answer multi-level similarity patterns. The wavelet analysis technique and the OLAP technique used in knowledge discovery and data warehou...