Results 11  20
of
207
Fullycompressed suffix trees
 IN: PACS 2000. LNCS
, 2000
"... Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the nlog ..."
Abstract

Cited by 31 (20 self)
 Add to MetaCart
Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the nlog 2 σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but still unsatisfactory when σ is small as in DNA sequences. In this paper we introduce the first compressed suffix tree representation that breaks this linearspace barrier. Our representation requires sublinear extra space and supports a large set of navigational operations in logarithmic time. An essential ingredient of our representation is the lowest common ancestor (LCA) query. We reveal important connections between LCA queries and suffix tree navigation.
Faster EntropyBounded Compressed Suffix Trees
, 2009
"... Suffix trees are among the most important data structures in stringology, with a number of applications in flourishing areas like bioinformatics. Their main problem is space usage, which has triggered much research striving for compressed representations that are still functional. A smaller suffix t ..."
Abstract

Cited by 31 (15 self)
 Add to MetaCart
Suffix trees are among the most important data structures in stringology, with a number of applications in flourishing areas like bioinformatics. Their main problem is space usage, which has triggered much research striving for compressed representations that are still functional. A smaller suffix tree representation could fit in a faster memory, outweighing by far the theoretical slowdown brought by the space reduction. We present a novel compressed suffix tree, which is the first achieving at the same time sublogarithmic complexity for the operations, and space usage that asymptotically goes to zero as the entropy of the text does. The main ideas in our development are compressing the longest common prefix information, totally getting rid of the suffix tree topology, and expressing all the suffix tree operations using range minimum queries and a novel primitive called next/previous smaller value in a sequence. Our solutions to those operations are of independent interest.
Cognitive Engineering, in
 User Centered System Design, Lawrence Erlbaum Associates, Inc
, 1986
"... www.elsevier.com/locate/procedia ..."
(Show Context)
The Vmatch large scale sequence analysis software. [http://www.vmatch.de
"... This is the website for Vmatch, a versatile software tool for efficiently solving large scale sequence matching tasks. Vmatch subsumes the software tool REPuter, but is much more general, with a very flexible user interface, and improved space and time requirements. Features of Vmatch The Vmatchma ..."
Abstract

Cited by 20 (0 self)
 Add to MetaCart
(Show Context)
This is the website for Vmatch, a versatile software tool for efficiently solving large scale sequence matching tasks. Vmatch subsumes the software tool REPuter, but is much more general, with a very flexible user interface, and improved space and time requirements. Features of Vmatch The Vmatchmanual gives many examples on how to use Vmatch. Here are the program’s most important features. Persistent index Usually, in a large scale matching problem, extensive portions of the sequences under consideration are static, i.e. they do not change much over time. Therefore it makes sense to preprocess this static data to extract information from it and to store this in a structured manner, allowing efficient searches. Vmatch does exactly this: it preprocesses a set of sequences into an index structure. This is stored as a collection of several files constituting the persistent index. The index efficiently represents all substrings of the preprocessed sequences and, unlike many other sequence comparison tools, allows matching tasks to be solved in time, independent of the size of the index. Different matching tasks require different parts of the index, but only the required
Practical Compressed Suffix Trees
"... The suffix tree is an extremely important data structure for stringology, with a wealth of applications in bioinformatics. Classical implementations require much space, which renders them useless for large problems. Recent research has yielded two implementations offering widely different spacetime ..."
Abstract

Cited by 20 (5 self)
 Add to MetaCart
(Show Context)
The suffix tree is an extremely important data structure for stringology, with a wealth of applications in bioinformatics. Classical implementations require much space, which renders them useless for large problems. Recent research has yielded two implementations offering widely different spacetime tradeoffs. However, each of them has practicality problems regarding either space or time requirements. In this paper we implement a recent theoretical proposal and show it yields an extremely interesting structure that lies in between, offering both practical times and affordable space. The implementation of the theoretical proposal is by no means trivial and involves significant algorithm engineering.
Incorporating sequence quality data into alignment improves DNA read mapping
 NUCLEIC ACIDS RES
, 2010
"... New DNA sequencing technologies have achieved breakthroughs in throughput, at the expense of higher error rates. The primary way of interpreting biological sequences is via alignment, but standard alignment methods assume the sequences are accurate. Here, we describe how to incorporate the perbase ..."
Abstract

Cited by 20 (7 self)
 Add to MetaCart
(Show Context)
New DNA sequencing technologies have achieved breakthroughs in throughput, at the expense of higher error rates. The primary way of interpreting biological sequences is via alignment, but standard alignment methods assume the sequences are accurate. Here, we describe how to incorporate the perbase error probabilities reported by seq uencers into alignment. Unlike existing tools for DNA read mapping, our method models both sequencer errors and real sequence differences. This approach consistently improves mapping accuracy, even when the rate of real sequence dif ference is only 0.2%. Furthermore, when mapping Drosophila melanogaster reads to the Drosophila simulans genome, it increased the amount of correctly mapped reads from 49 to 66%. This approach enables more effective use of DNA reads from organisms that lack reference genomes, are extinct or are highly polymorphic.
Fast and space efficient string kernels using suffix arrays
 In Proceedings, 23rd ICMP
, 2006
"... String kernels which compare the set of all common substrings between two given strings have recently been proposed by Vishwanathan & Smola (2004). Surprisingly, these kernels can be computed in linear time and linear space using annotated suffix trees. Even though, in theory, the suffix tree ba ..."
Abstract

Cited by 18 (2 self)
 Add to MetaCart
(Show Context)
String kernels which compare the set of all common substrings between two given strings have recently been proposed by Vishwanathan & Smola (2004). Surprisingly, these kernels can be computed in linear time and linear space using annotated suffix trees. Even though, in theory, the suffix tree based algorithm requires O(n) space for an n length string, in practice at least 40n bytes are required – 20n bytes for storing the suffix tree, and an additional 20n bytes for the annotation. This large memory requirement coupled with poor locality of memory access, inherent due to the use of suffix trees, means that the performance of the suffix tree based algorithm deteriorates on large strings. In this paper, we describe a new linear time yet space efficient and scalable algorithm for computing string kernels, based on suffix arrays. Our algorithm is a) faster and easier to implement, b) on the average requires only 19n bytes of storage, and c) exhibits strong locality of memory access. We show that our algorithm can be extended to perform linear time prediction on a test string, and present experiments to validate our claims. 1.
Permuted longestcommonprefix array
 In Proc. 20th CPM, LNCS 5577
, 2009
"... Abstract. The longestcommonprefix (LCP) array is an adjunct to the suffix array that allows many string processing problems to be solved in optimal time and space. Its construction is a bottleneck in practice, taking almost as long as suffix array construction. In this paper, we describe algorithm ..."
Abstract

Cited by 17 (2 self)
 Add to MetaCart
(Show Context)
Abstract. The longestcommonprefix (LCP) array is an adjunct to the suffix array that allows many string processing problems to be solved in optimal time and space. Its construction is a bottleneck in practice, taking almost as long as suffix array construction. In this paper, we describe algorithms for constructing the permuted LCP (PLCP) array in which the values appear in position order rather than lexicographical order. Using the PLCP array, we can either construct or simulate the LCP array. We obtain a family of algorithms including the fastest known LCP construction algorithm and some extremely space efficient algorithms. We also prove a new combinatorial property of the LCP values. 1
On Space Efficient Two Dimensional Range Minimum Data Structures
"... Abstract. The two dimensional range minimum query problem is to preprocess a static two dimensional m by n array A of size N = m · n, such that subsequent queries, asking for the position of the minimum element in a rectangular range within A, can be answered efficiently. We study the tradeoff betw ..."
Abstract

Cited by 17 (4 self)
 Add to MetaCart
Abstract. The two dimensional range minimum query problem is to preprocess a static two dimensional m by n array A of size N = m · n, such that subsequent queries, asking for the position of the minimum element in a rectangular range within A, can be answered efficiently. We study the tradeoff between the space and query time of the problem. We show that every algorithm enabled to access A during the query and using O(N/c) bits additional space requires Ω(c) query time, for any c where 1 ≤ c ≤ N. This lower bound holds for any dimension. In particular, for the one dimensional version of the problem, the lower bound is tight up to a constant factor. In two dimensions, we complement the lower bound with an indexing data structure of size O(N/c) bits additional space which can be preprocessed in O(N) time and achieves O(c log 2 c) query time. For c = O(1), this is the first O(1) query time algorithm using optimal O(N) bits additional space. For the case where queries can not probe A, we give a data structure of size O(N · min{m, log n}) bits with O(1) query time, assuming m ≤ n. This leaves a gap to the lower bound of Ω(N log m) bits for this version of the problem. 1