Results 1 - 10
of
124
A new succinct representation of RMQ-information and improvements in the enhanced suffix array
- PROC. ESCAPE. LNCS
, 2007
"... The Range-Minimum-Query-Problem is to preprocess an array of length n in O(n) time such that all subsequent queries asking for the position of a minimal element between two specified indices can be obtained in constant time. This problem was first solved by Berkman and Vishkin [1], and Sadakane [2] ..."
Abstract
-
Cited by 31 (13 self)
- Add to MetaCart
The Range-Minimum-Query-Problem is to preprocess an array of length n in O(n) time such that all subsequent queries asking for the position of a minimal element between two specified indices can be obtained in constant time. This problem was first solved by Berkman and Vishkin [1], and Sadakane [2] gave the first succinct data structure that uses 4n+o(n) bits of additional space. In practice, this method has several drawbacks: it needs O(nlog n) bits of intermediate space when constructing the data structure, and it builds on previous results on succinct data structures. We overcome these problems by giving the first algorithm that never uses more than 2n + o(n) bits, and does not rely on rank- and select-queries or other succinct data structures. We stress the importance of this result by simplifying and reducing the space consumption of the Enhanced Suffix Array [3], while retaining its capability of simulating top-down-traversals of the suffix tree, used, e.g., to locate all occ positions of a pattern p in a text in optimal O(|p | + occ) time (assuming constant alphabet size). We further prove a lower bound of 2n − o(n) bits, which makes our algorithm asymptotically optimal.
A taxonomy of suffix array construction algorithms
- ACM Computing Surveys
, 2007
"... In 1990, Manber and Myers proposed suffix arrays as a space-saving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abunda ..."
Abstract
-
Cited by 30 (10 self)
- Add to MetaCart
In 1990, Manber and Myers proposed suffix arrays as a space-saving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abundance. This survey paper attempts to provide simple high-level descriptions of these numerous algorithms that highlight both their distinctive features and their commonalities, while avoiding as much as possible the complexities of implementation details. New hybrid algorithms are also described. We provide comparisons of the algorithms ’ worst-case time complexity and use of additional space, together with results of recent experimental test runs on many of their implementations.
Rank and select revisited and extended
- Workshop on SpaceConscious Algorithms, University of
, 2006
"... The deep connection between the Burrows-Wheeler transform (BWT) and the socalled rank and select data structures for symbol sequences is the basis of most successful approaches to compressed text indexing. Rank of a symbol at a given position equals the number of times the symbol appears in the corr ..."
Abstract
-
Cited by 29 (17 self)
- Add to MetaCart
The deep connection between the Burrows-Wheeler transform (BWT) and the socalled rank and select data structures for symbol sequences is the basis of most successful approaches to compressed text indexing. Rank of a symbol at a given position equals the number of times the symbol appears in the corresponding prefix of the sequence. Select is the inverse, retrieving the positions of the symbol occurrences. It has been shown that improvements to rank/select algorithms, in combination with the BWT, turn into improved compressed text indexes. This paper is devoted to alternative implementations and extensions of rank and select data structures. First, we show that one can use gap encoding techniques to obtain constant time rank and select queries in essentially the same space as what is achieved by the best current direct solution (and sometimes less). Second, we extend symbol rank and select to substring rank and select, giving several space/time tradeoffs for the problem. An application of these queries is in position-restricted substring searching, where one can specify the range in the text where the search is restricted to, and only occurrences residing in that range are to be reported. In addition, arbitrary occurrences are reported in text position order. Several byproducts of our results display connections with searchable partial sums, Chazelle’s two-dimensional data structures, and Grossi et al.’s wavelet trees.
A simple storage scheme for strings achieving entropy bounds
, 2007
"... We propose a storage scheme for a string S[1, n], drawn from an alphabet Σ, that requires space close to the k-th order empirical entropy of S, and allows to retrieve any ℓ-long substring of S in optimal O(1 + ..."
Abstract
-
Cited by 29 (4 self)
- Add to MetaCart
We propose a storage scheme for a string S[1, n], drawn from an alphabet Σ, that requires space close to the k-th order empirical entropy of S, and allows to retrieve any ℓ-long substring of S in optimal O(1 +
Practical rank/select queries over arbitrary sequences
- In Proc. 15th SPIRE, LNCS 5280
, 2008
"... Abstract. We present a practical study on the compact representation of sequences supporting rank, select, and access queries. While there are several theoretical solutions to the problem, only a few have been tried out, and there is little idea on how the others would perform, especially in the cas ..."
Abstract
-
Cited by 25 (20 self)
- Add to MetaCart
Abstract. We present a practical study on the compact representation of sequences supporting rank, select, and access queries. While there are several theoretical solutions to the problem, only a few have been tried out, and there is little idea on how the others would perform, especially in the case of sequences with very large alphabets. We first present a new practical implementation of the compressed representation for bit sequences proposed by Raman, Raman, and Rao [SODA 2002], that is competitive with the existing ones when the sequences are not too compressible. It also has nice local compression properties, and we show that this makes it an excellent tool for compressed text indexing in combination with the Burrows-Wheeler transform. This shows the practicality of a recent theoretical proposal [Mäkinen and Navarro, SPIRE 2007], achieving spaces never seen before. Second, for general sequences, we tune wavelet trees for the case of very large alphabets, by removing their pointer information. We show that this gives an excellent solution for representing a sequence within zero-order entropy space, in cases where the large alphabet poses a serious challenge to typical encoding methods. We also present the first implementation of Golynski et al.’s representation [SODA 2006], which offers another interesting time/space trade-off. 1
Space-efficient algorithms for document retrieval
- IN PROC. CPM, VOLUME 4580 OF LNCS
, 2007
"... We study the Document Listing problem, where a collection D of documents d1,..., dk of total length � di = n is to be pre-i processed, so that one can later efficiently list all the ndoc documents containing a given query pattern P of length m as a substring. Muthukrishnan (SODA 2002) gave an opti ..."
Abstract
-
Cited by 22 (1 self)
- Add to MetaCart
We study the Document Listing problem, where a collection D of documents d1,..., dk of total length � di = n is to be pre-i processed, so that one can later efficiently list all the ndoc documents containing a given query pattern P of length m as a substring. Muthukrishnan (SODA 2002) gave an optimal solution to the problem; with O(n) time preprocessing, one can answer the queries in O(m + ndoc) time. In this paper, we improve the space-requirement of the Muthukrishnan’s solution from O(nlog n) bits to |CSA | + 2n + nlog k(1 + o(1)) bits, where |CSA | ≤ nlog |Σ|(1 + o(1)) is the size of any suitable compressed suffix array (CSA), and Σ is the underlying alphabet of documents. The time requirement depends on the CSA used, but we can obtain e.g. the optimal O(m+ndoc) time when |Σ|, k = O(polylog(n)). For general |Σ|, k the time requirement becomes O(m log |Σ | + ndoc log k). Sadakane (ISAAC
Fast generation of result snippets in web search
- In Kraaij et al
"... The presentation of query biased document snippets as part of results pages presented by search engines has become an expectation of search engine users. In this paper we explore the algorithms and data structures required as part of a search engine to allow efficient generation of query biased snip ..."
Abstract
-
Cited by 21 (4 self)
- Add to MetaCart
The presentation of query biased document snippets as part of results pages presented by search engines has become an expectation of search engine users. In this paper we explore the algorithms and data structures required as part of a search engine to allow efficient generation of query biased snippets. We begin by proposing and analysing a document compression method that reduces snippet generation time by 58 % over a baseline using the zlib compression library. These experiments reveal that finding documents on secondary storage dominates the total cost of generating snippets, and so caching documents in RAM is essential for a fast snippet generation process. Using simulation, we examine snippet generation performance for different size RAM caches. Finally we propose and analyse document reordering and compaction, revealing a scheme that increases the number of document cache hits with only a marginal affect on snippet quality. This scheme effectively doubles the number of documents that can fit in a fixed size cache.
Fully-compressed suffix trees
- IN: PACS 2000. LNCS
, 2000
"... Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the nlog ..."
Abstract
-
Cited by 17 (12 self)
- Add to MetaCart
Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the nlog 2 σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but still unsatisfactory when σ is small as in DNA sequences. In this paper we introduce the first compressed suffix tree representation that breaks this linear-space barrier. Our representation requires sublinear extra space and supports a large set of navigational operations in logarithmic time. An essential ingredient of our representation is the lowest common ancestor (LCA) query. We reveal important connections between LCA queries and suffix tree navigation.
Position-restricted substring searching
- OF LECTURE NOTES IN COMPUTER SCIENCE
, 2006
"... A full-text index is a data structure built over a text string T[1, n]. The most basic functionality provided is (a) counting how many times a pattern string P[1, m] appears in T and (b) locating all those occ positions. There exist several indexes that solve (a) in O(m) time and (b) in O(occ) tim ..."
Abstract
-
Cited by 17 (2 self)
- Add to MetaCart
A full-text index is a data structure built over a text string T[1, n]. The most basic functionality provided is (a) counting how many times a pattern string P[1, m] appears in T and (b) locating all those occ positions. There exist several indexes that solve (a) in O(m) time and (b) in O(occ) time. In this paper we propose two new queries, (c) counting how many times P[1, m] appears in T[l, r] and (d) locating all those occl,r positions. These can be solved using (a) and (b) but this requires O(occ) time. We present two solutions to (c) and (d) in this paper. The first is an index that requires O(n log n) bits of space and answers (c) in O(m + log n) time and (d) in O(log n) time per occurrence (that is, O(occl,r log n) time overall). A variant of the first solution answers (c) in O(m + log log n) time and (d) in constant time per occurrence, but requires O(nlog 1+ǫ n) bits of space for any constant ǫ> 0. The second solution requires O(nm log σ) bits of space, solving (c) in O(m⌈log σ/log log n⌉) time and (d) in O(m⌈log σ/log log n⌉) time per

