Results 1-10 of 31
Colored Range Queries and Document Retrieval
"... Colored range queries are a wellstudied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important onedimensional colored range queries — colore ..."
Abstract

Cited by 31 (18 self)
 Add to MetaCart
(Show Context)
Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important one-dimensional colored range queries (colored range listing, colored range top-k queries and colored range counting) and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the high-order entropies of the library of documents. We then show how (approximate) colored top-k queries can be reduced to (approximate) range-mode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.
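The chaining idea behind colored range listing can be illustrated with a short sketch (a naive linear scan rather than the compressed RMQ-based structures the paper describes; all names here are ours):

```python
def colored_range_listing(colors, l, r):
    """List each distinct color in colors[l..r] exactly once.

    Chaining idea: position i in [l, r] is the first occurrence of
    its color inside the range iff the previous occurrence of that
    color lies strictly before l.  Compressed structures answer the
    test C[i] < l with range-minimum queries; here we just scan.
    """
    # C[i] = index of the previous occurrence of colors[i], or -1
    prev = {}
    C = []
    for i, c in enumerate(colors):
        C.append(prev.get(c, -1))
        prev[c] = i
    return [colors[i] for i in range(l, r + 1) if C[i] < l]
```

The array C is the only precomputed component; efficient data structures replace the final scan with output-sensitive queries over C.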
On Compressing and Indexing Repetitive Sequences
, 2011
"... We introduce LZEnd, a new member of the LempelZiv family of text compressors, which achieves compression ratios close to those of LZ77 but performs much faster at extracting arbitrary text substrings. We then build the first selfindex based on LZ77 (or LZEnd) compression, which in addition to te ..."
Abstract

Cited by 25 (6 self)
 Add to MetaCart
We introduce LZ-End, a new member of the Lempel-Ziv family of text compressors, which achieves compression ratios close to those of LZ77 but performs much faster at extracting arbitrary text substrings. We then build the first self-index based on LZ77 (or LZ-End) compression, which in addition to text extraction offers fast indexed searches on the compressed text. This self-index is particularly effective for representing highly repetitive sequence collections, which arise for example when storing versioned documents, software repositories, periodic publications, and biological sequence databases.
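As background for the abstract above, a quadratic-time sketch of the greedy LZ77 parse (illustrative only; real compressors find the longest previous match with suffix-based structures in near-linear time, and the function name is ours):

```python
def lz77_parse(s):
    """Greedy LZ77 parse of s, as a list of phrases.

    Each phrase is (source, length, next_char): the longest prefix of
    the remaining suffix that also occurs starting at an earlier
    position (self-overlap allowed), plus one fresh character.  The
    match is capped so that a next character always exists.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):
            l = 0
            while i + l < n - 1 and s[j + l] == s[i + l]:
                l += 1
            if l > best_len:
                best_len, best_src = l, j
        phrases.append((best_src, best_len, s[i + best_len]))
        i += best_len + 1
    return phrases
```

The number of phrases this produces is the quantity z that appears in the bounds of several papers in this list.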
Self-Indexed Grammar-Based Compression
, 2001
"... Selfindexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current selfindexes are unable of fully exploiting the redundancy of highly repetitive text collections that arise in several appl ..."
Abstract

Cited by 19 (7 self)
 Add to MetaCart
Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable to fully exploit the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploiting such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammar. If an SLP of n rules represents a text T[1, u], then an SLP-compressed representation of T requires 2n log₂ n bits. For that same SLP, our self-index takes O(n log n) + n log₂ u bits. It extracts any text substring of length m in time O((m + h) log n), and finds the occ occurrences of a pattern string of length m in time O((m(m + h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log₂ n (1 + o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; (iii) a generalization of our self-index to grammar ...
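The O((m + h) log n) extraction cost above comes from descending the grammar tree once per extracted character. A minimal sketch of that descent, assuming a toy SLP encoding of our own (the real self-index augments this with succinct data structures):

```python
def expansion_lengths(rules):
    """rules maps each symbol to either a terminal character (a
    one-character string) or a pair (left, right) of symbols.
    Returns |exp(X)|, the expansion length, for every symbol X."""
    memo = {}
    def length(x):
        if x not in memo:
            rhs = rules[x]
            memo[x] = 1 if isinstance(rhs, str) else length(rhs[0]) + length(rhs[1])
        return memo[x]
    for x in rules:
        length(x)
    return memo

def extract(rules, lens, x, i):
    """Return character i (0-based) of the string derived from x,
    descending one grammar-tree level per step: O(h) per character."""
    while not isinstance(rules[x], str):
        left, right = rules[x]
        if i < lens[left]:
            x = left
        else:
            i -= lens[left]
            x = right
    return rules[x]
```

Storing the precomputed expansion lengths is part of the space the index pays for beyond the bare grammar.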
Improved grammar-based compressed indexes
 In Proc. 19th SPIRE, LNCS 7608
, 2012
"... Abstract. We introduce the first grammarcompressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text T [1..u] that is represented by a (contextfree) grammar of n (terminal and nonterminal) symbols and size N (meas ..."
Abstract

Cited by 15 (6 self)
 Add to MetaCart
(Show Context)
We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text T[1..u] that is represented by a (context-free) grammar of n (terminal and nonterminal) symbols and size N (measured as the sum of the lengths of the right-hand sides of the rules), a basic grammar-based representation of T takes N lg n bits of space. Our representation requires 2N lg n + N lg u + ε n lg n + o(N lg n) bits of space, for any 0 < ε ≤ 1. It can find the positions of the occ occurrences of a pattern of length m in T in O((m²/ε) lg(lg u / lg n) + (m + occ) lg n) time, and extract any substring of length ℓ of T in time O(ℓ + h lg(N/h)), where h is the height of the grammar tree.
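A small sketch clarifying the three parameters n, N and h that appear in these bounds, under a toy grammar encoding of our own:

```python
def grammar_stats(rules, start):
    """rules maps each nonterminal to its right-hand side, a list of
    symbols; any symbol without a rule is a terminal.

    Returns (n, N, h): the total number of terminal and nonterminal
    symbols, the grammar size (sum of right-hand-side lengths), and
    the height of the parse tree rooted at `start`."""
    terminals = {s for rhs in rules.values() for s in rhs if s not in rules}
    n = len(rules) + len(terminals)
    N = sum(len(rhs) for rhs in rules.values())
    memo = {}
    def height(x):
        if x not in rules:       # terminal: a leaf of the parse tree
            return 0
        if x not in memo:
            memo[x] = 1 + max(height(s) for s in rules[x])
        return memo[x]
    return n, N, height(start)
```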
Faster approximate pattern matching in compressed repetitive texts
 In Proceedings of the 22nd International Symposium on Algorithms and Computation (ISAAC)
, 2011
"... Motivated by the imminent growth of massive, highly redundant genomic databases we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching to the underlying string(s). Bille et al. (2011) recently showed how, gi ..."
Abstract

Cited by 13 (5 self)
 Add to MetaCart
Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching on the underlying string(s). Bille et al. (2011) recently showed how, given a straight-line program with r rules for a string s of length n, we can build an O(r)-word data structure that allows us to extract any substring s[i..j] in O(log n + j − i) time. They also showed how, given a pattern p of length m and an edit distance k ≤ m, their data structure supports finding all occ approximate matches to p in s in O(r(min(mk, k^4 + m) + log n) + occ) time. Rytter (2003) and Charikar et al. (2005) showed that r is always at least the number z of phrases in the LZ77 parse of s, and gave algorithms for building straight-line programs with O(z log n) rules. In this paper we give a simple O(z log n)-word data structure that takes the same time for substring extraction but only O(z(min(mk, k^4 + m)) + occ) time for approximate pattern matching.
Algorithmics on SLP-compressed strings: a survey
 Groups Complex. Cryptol.
, 2012
"... Abstract Results on algorithmic problems on strings that are given in a compressed form via straightline programs are surveyed. A straightline program is a contextfree grammar that generates exactly one string. In this way, exponential compression rates can be achieved. Among others, we study pat ..."
Abstract

Cited by 10 (3 self)
 Add to MetaCart
(Show Context)
Results on algorithmic problems on strings that are given in compressed form via straight-line programs are surveyed. A straight-line program is a context-free grammar that generates exactly one string. In this way, exponential compression rates can be achieved. Among others, we study pattern matching for compressed strings, membership problems for compressed strings in various kinds of formal languages, and the problem of querying compressed strings. Applications in combinatorial group theory and computational topology, and to the solution of word equations, are discussed as well. Finally, extensions to compressed trees and pictures are considered.
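The exponential compression rate mentioned in the survey is easy to see on a concrete family of SLPs: k+1 rules derive a string of length 2^k. A hypothetical sketch (rule names and encoding are ours):

```python
def doubling_slp(k):
    """A straight-line program with k+1 rules whose single derived
    string is 'a' * 2**k: each rule doubles the previous expansion."""
    rules = {'X0': 'a'}
    for i in range(1, k + 1):
        rules[f'X{i}'] = (f'X{i-1}', f'X{i-1}')
    return rules

def expand(rules, x):
    """Decompress by fully expanding the grammar.  The output is
    exponentially larger than the grammar itself, which is exactly
    why compressed-string algorithms avoid doing this."""
    rhs = rules[x]
    if isinstance(rhs, str):
        return rhs
    return expand(rules, rhs[0]) + expand(rules, rhs[1])
```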
Indexing Highly Repetitive Collections
"... Abstract. The need to index and search huge highly repetitive sequence collections is rapidly arising in various fields, including computational biology, software repositories, versioned collections, and others. In this short survey we briefly describe the progress made along three research lines to ..."
Abstract

Cited by 9 (3 self)
 Add to MetaCart
(Show Context)
The need to index and search huge, highly repetitive sequence collections is rapidly arising in various fields, including computational biology, software repositories, versioned collections, and others. In this short survey we briefly describe the progress made along three research lines to address the problem: compressed suffix arrays, grammar-compressed indexes, and Lempel-Ziv compressed indexes.
XML Compression via DAGs
, 2013
"... Unranked trees can be represented using their minimal dag (directed acyclic graph). For XML this achieves high compression ratios due to their repetitive mark up. Unranked trees are often represented through first child/next sibling (fcns) encoded binary trees. We study the difference in size ( = nu ..."
Abstract

Cited by 6 (3 self)
 Add to MetaCart
Unranked trees can be represented using their minimal dag (directed acyclic graph). For XML this achieves high compression ratios due to its repetitive markup. Unranked trees are often represented through first-child/next-sibling (fcns) encoded binary trees. We study the difference in size (= number of edges) between the minimal dag and the minimal dag of the fcns-encoded binary tree. One main finding is that the size of the dag of the binary tree can never be smaller than the square root of the size of the minimal dag, and that there are examples that match this bound. We introduce a new combined structure, the hybrid dag, which is guaranteed to be smaller than (or equal in size to) both dags. Interestingly, we find through experiments that last-child/previous-sibling encodings are much better for XML compression via dags than fcns encodings. This is because optional elements are more likely to appear towards the end of child sequences.
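The two constructions compared above can be sketched in a few lines: minimal-dag construction by interning identical subtrees, and the fcns encoding of an unranked forest. This is an illustration under our own tree encoding, and the dag size here is counted in nodes rather than edges:

```python
def minimal_dag(tree):
    """Minimal dag of an unranked tree, built by structural interning:
    identical subtrees share one node.  A tree is (label, [children]).
    Returns (number of distinct dag nodes, id of the root node)."""
    nodes = {}                      # canonical key -> node id
    def intern(t):
        label, children = t
        key = (label, tuple(intern(c) for c in children))
        if key not in nodes:
            nodes[key] = len(nodes)
        return nodes[key]
    root = intern(tree)
    return len(nodes), root

def fcns(forest):
    """First-child/next-sibling encoding: an unranked forest becomes a
    binary tree (label, first_child, next_sibling), or None if empty."""
    if not forest:
        return None
    (label, children), rest = forest[0], forest[1:]
    return (label, fcns(children), fcns(rest))
```

Applying minimal_dag to the fcns-encoded tree (after adapting it to binary nodes) is what the paper's size comparison is about.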
Fast and tiny structural self-indexes for XML
 CoRR
"... XML document markup is highly repetitive and therefore well compressible using dictionarybased methods such as DAGs or grammars. In the context of selectivity estimation, grammarcompressed trees were used before as synopsis for structural XPath queries. Here a fullyfledged index over such gramm ..."
Abstract

Cited by 5 (5 self)
 Add to MetaCart
(Show Context)
XML document markup is highly repetitive and therefore well compressible using dictionary-based methods such as DAGs or grammars. In the context of selectivity estimation, grammar-compressed trees were used before as synopses for structural XPath queries. Here a fully-fledged index over such grammars is presented. The index allows arbitrary tree algorithms to be executed with a slowdown comparable to the space improvement. More interestingly, certain algorithms execute much faster over the index (because no decompression occurs). For example, for structural XPath count queries, evaluating over the index is faster than previous XPath implementations, often by two orders of magnitude. The index also serializes XML results (including texts) faster than previous systems, by a factor of about 2–3. This is due to efficient copy handling of grammar repetitions, and because materialization is avoided entirely. In order to compare with twig-join implementations, we implemented a materializer that writes out the preorder numbers of result nodes, and show its competitiveness.