Results 1 - 10
of
111
Suffix arrays: A new method for on-line string searches
, 1991
"... A new and conceptually simple data structure, called a suffix array, for on-line string searches is intro-duced in this paper. Constructing and querying suffix arrays is reduced to a sort and search paradigm that employs novel algorithms. The main advantage of suffix arrays over suffix trees is that ..."
Abstract
-
Cited by 835 (0 self)
- Add to MetaCart
A new and conceptually simple data structure, called a suffix array, for on-line string searches is intro-duced in this paper. Constructing and querying suffix arrays is reduced to a sort and search paradigm that employs novel algorithms. The main advantage of suffix arrays over suffix trees is that, in practice, they use three to five times less space. From a complexity standpoint, suffix arrays permit on-line string searches of the type, "Is W a substring of A?" to be answered in time O(P + log N), where P is the length of W and N is the length of A, which is competitive with (and in some cases slightly better than) suffix trees. The only drawback is that in those instances where the underlying alphabet is finite and small, suffix trees can be constructed in O(N) time in the worst case, versus O(N log N) time for suffix arrays. However, we give an augmented algorithm that, regardless of the alphabet size, constructs suffix arrays in O(N) expected time, albeit with lesser space efficiency. We believe that suffix arrays will prove to be better in practice than suffix trees for many applications.
A Guided Tour to Approximate String Matching
- ACM COMPUTING SURVEYS
, 1999
"... We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining t ..."
Abstract
-
Cited by 598 (36 self)
- Add to MetaCart
We survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices according to each case. We conclude with some future work directions and open problems.
On-Line Construction of Suffix Trees
, 1995
"... An on-line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string. The new algorithm has the desirable property of processing the string symbol by symbol from left to right. It has always the suffix tree for the scanned part of the strin ..."
Abstract
-
Cited by 437 (2 self)
- Add to MetaCart
An on-line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string. The new algorithm has the desirable property of processing the string symbol by symbol from left to right. It has always the suffix tree for the scanned part of the string ready. The method is developed as a linear-time version of a very simple algorithm for (quadratic size) suffix tries. Regardless of its quadratic worst-case this latter algorithm can be a good practical method when the string is not too long. Another variation of this method is shown to give in a natural way the well-known algorithms for constructing suffix automata (DAWGs).
A survey of sequence alignment algorithms for nextgeneration sequencing
- Brief. Bioinform
, 2010
"... Abstract Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge to the analysis of this data is sequence alignment, whereby sequence reads must be compared to a reference. A wide variety of alignment algorithms and software have been subsequently develop ..."
Abstract
-
Cited by 186 (2 self)
- Add to MetaCart
(Show Context)
Abstract Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge to the analysis of this data is sequence alignment, whereby sequence reads must be compared to a reference. A wide variety of alignment algorithms and software have been subsequently developed over the past two years. In this article, we will systematically review the current development of these algorithms and introduce their practical applications on different types of experimental data. We come to the conclusion that short-read alignment is no longer the bottleneck of data analyses. We also consider future development of alignment algorithms with respect to emerging long sequence reads and the prospect of cloud computing.
Reducing the Space Requirement of Suffix Trees
- Software – Practice and Experience
, 1999
"... We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average ..."
Abstract
-
Cited by 145 (12 self)
- Add to MetaCart
(Show Context)
We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different type. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained. Copyright © 1999 John Wiley & Sons, Ltd. KEY WORDS: data structures; suffix trees; implementation techniques; space reduction
Speeding Up Two String-Matching Algorithms
- ALGORITHMICA
, 1994
"... We show how to speed up two string-matching algorithms: the Boyer-Moore algorithm (BM algorithm), and its version called here the reverse factor algorithm (RF algorithm). The RF algorithm is based on factor graphs for the reverse of the pattern.The main feature of both algorithms is that they scan ..."
Abstract
-
Cited by 106 (16 self)
- Add to MetaCart
We show how to speed up two string-matching algorithms: the Boyer-Moore algorithm (BM algorithm), and its version called here the reverse factor algorithm (RF algorithm). The RF algorithm is based on factor graphs for the reverse of the pattern.The main feature of both algorithms is that they scan the text right-to-left from the supposed right position of the pattern. The BM algorithm goes as far as the scanned segment (factor) is a suffix of the pattern. The RF algorithm scans while the segment is a factor of the pattern. Both algorithms make a shift of the pattern, forget the history, and start again. The RF algorithm usually makes bigger shifts than BM, but is quadratic in the worst case. We show that it is enough to remember the last matched segment (represented by two pointers to the text) to speed up the RF algorithm considerably (to make a linear number of inspections of text symbols, with small coefficient), and to speed up the BM algorithm (to make at most 2.n comparisons). Only a constant additional memory is needed for the search phase. We give alternative versions of an accelerated RF algorithm: the first one is based on combinatorial properties of primitive words, and the other two use the power of suffix trees extensively. The paper demonstrates the techniques to transform algorithms, and also shows interesting new applications of data structures representing all subwords of the pattern in compact form.
Transducers and repetitions
- Theoretical Computer Science
, 1986
"... Abstract. The factor transducer of a word associates to each of its factors (or subwc~rds) their first occurrence. Optimal bounds on the size of minimal factor transducers together with an algorithm for building them are given. Analogue results and a simple algorithm are given for the case of subseq ..."
Abstract
-
Cited by 94 (18 self)
- Add to MetaCart
(Show Context)
Abstract. The factor transducer of a word associates to each of its factors (or subwc~rds) their first occurrence. Optimal bounds on the size of minimal factor transducers together with an algorithm for building them are given. Analogue results and a simple algorithm are given for the case of subsequential suffix transducers. Algorithms are applied to repetition searching in words. Rl~sum~. Le transducteur des facteurs d'un mot associe a chacun de ses facteurs leur premiere occurrence. On donne des bornes optimales sur la taille du transducteur minimal d'un mot ainsi qu'un algorithme pour sa construction. On donne des r6sultats analogues et un algorithme simple dans le cas du transducteur sous-s~luentiel des suffixes d'un mot. On donne une application la d6tection de r6p6titions dans les mots. Contents
From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Constructions
- ALGORITHMICA
, 1997
"... We review the linear time suffix tree constructions by Weiner, McCreight, and Ukkonen. We use the terminology of the most recent algorithm, Ukkonen's online construction, to explain its historic predecessors. This reveals relationships much closer than one would expect, since the three algorith ..."
Abstract
-
Cited by 86 (7 self)
- Add to MetaCart
(Show Context)
We review the linear time suffix tree constructions by Weiner, McCreight, and Ukkonen. We use the terminology of the most recent algorithm, Ukkonen's online construction, to explain its historic predecessors. This reveals relationships much closer than one would expect, since the three algorithms are based on rather different intuitive ideas. Moreover, it completely explains the dierences between these algorithms in terms of simplicity, efficiency, and implementation complexity.