Results 1-10 of 22
Random Access to Grammar-Compressed Strings
, 2011
Cited by 31 (3 self)
Let S be a string of length N compressed into a context-free grammar S of size n. We present two representations of S achieving O(log N) random access time, and either O(n · α_k(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, α_k(n) is the inverse of the k-th row of Ackermann's function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query and additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k^4 + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy-paths in grammars.
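To make the random access query concrete, here is a minimal Python sketch of the naive baseline: walk down the grammar from the start symbol, using precomputed expansion lengths to choose the left or right child. This is not the representation from the abstract; it takes time proportional to the grammar's height rather than O(log N). The toy grammar `RULES` and the helper names are hypothetical, chosen only for illustration.

```python
# Hypothetical toy grammar (a straight-line program): every nonterminal maps
# either to a single character or to a pair of nonterminals.
RULES = {
    "A": "a",
    "B": "b",
    "C": ("A", "B"),  # expands to "ab"
    "D": ("C", "A"),  # expands to "aba"
    "E": ("D", "C"),  # expands to "abaab"
    "F": ("E", "D"),  # expands to "abaababa"
}


def expansion_lengths(rules):
    """Return a dict giving the length of the string each symbol expands to."""
    lengths = {}

    def length(sym):
        if sym not in lengths:
            rhs = rules[sym]
            if isinstance(rhs, str):
                lengths[sym] = 1
            else:
                lengths[sym] = length(rhs[0]) + length(rhs[1])
        return lengths[sym]

    for sym in rules:
        length(sym)
    return lengths


def access(rules, lengths, start, i):
    """Return character i (0-based) of the expansion of `start`.

    Naive descent: the cost is the depth of the path followed in the
    grammar, which can be far larger than log N for unbalanced grammars.
    """
    if not 0 <= i < lengths[start]:
        raise IndexError(i)
    sym = start
    while not isinstance(rules[sym], str):
        left, right = rules[sym]
        if i < lengths[left]:
            sym = left           # position falls inside the left child
        else:
            i -= lengths[left]   # skip the left child's expansion
            sym = right
    return rules[sym]
```

Calling `access(RULES, expansion_lengths(RULES), "F", i)` for each position reads the text without ever materialising it; the structures in the abstract aim to bound the cost of that descent by O(log N) regardless of the grammar's shape.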
New lower and upper bounds for representing sequences
 CoRR
Cited by 21 (13 self)
Sequence representations supporting queries access, select and rank are at the core of many data structures. There is a considerable gap between different upper bounds, and the few lower bounds, known for such representations, and how they interact with the space used. In this article we prove a strong lower bound for rank, which holds for rather permissive assumptions on the space used, and give matching upper bounds that require only a compressed representation of the sequence. Within this compressed space, operations access and select can be solved within almost-constant time.
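As a reference for what the three queries compute, the following Python sketch gives their plain, uncompressed semantics. It assumes the usual 1-based convention (rank counts occurrences in the prefix of length i, select returns the position of the j-th occurrence); the abstract does not spell the convention out. These linear-time functions only define the answers and say nothing about how the compressed structures reach them.

```python
def access(seq, i):
    """Return the symbol at position i (1-based)."""
    return seq[i - 1]


def rank(seq, c, i):
    """Return the number of occurrences of c in seq[1..i]."""
    return seq[:i].count(c)


def select(seq, c, j):
    """Return the 1-based position of the j-th occurrence of c in seq."""
    seen = 0
    for pos, sym in enumerate(seq, start=1):
        if sym == c:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("fewer than j occurrences of c")
```

The two counting queries are inverses in the sense that `rank(seq, c, select(seq, c, j)) == j` whenever the j-th occurrence exists.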
J.I.: Succinct representations of dynamic strings
 CoRR
, 2010
Cited by 15 (3 self)
The rank and select operations over a string of length n from an alphabet of size σ have been used widely in the design of succinct data structures. In many applications, the string itself must be maintained dynamically, allowing characters of the string to be inserted and deleted. Under the word RAM model with word size w = Ω(lg n), we design a succinct representation of dynamic strings using nH0 + o(n) · lg σ + O(w) bits to support rank, select, insert and delete in O((lg n / lg lg n)(lg σ / lg lg n + 1)) time. When the alphabet size is small, i.e. when σ = O(polylog(n)), including the case in which the string is a bit vector, these operations are supported in O(lg n / lg lg n) time. Our data structures are more efficient than previous results on the same problem, and we have applied them to improve results on the design and construction of space-efficient text indexes.
Efficient Fully-Compressed Sequence Representations
, 2010
Cited by 15 (11 self)
We present a data structure that stores a sequence s[1..n] over alphabet [1..σ] in nH0(s) + o(n)(H0(s)+1) bits, where H0(s) is the zero-order entropy of s. This structure supports the queries access, rank and select, which are fundamental building blocks for many other compressed data structures, in worst-case time O(lg lg σ) and average time O(lg H0(s)). The worst-case complexity matches the best previous results, yet these had been achieved with data structures using nH0(s) + o(n lg σ) bits. On highly compressible sequences the o(n lg σ) bits of the redundancy may be significant compared to the nH0(s) bits that encode the data. Our representation, instead, compresses the redundancy as well. Moreover, our average-case complexity is unprecedented. Our technique is based on partitioning the alphabet into characters of similar frequency. The subsequence corresponding to each group can then be encoded using fast uncompressed representations without harming the overall compression ratios, even in the redundancy. The result also improves upon the best current compressed representations of several other data structures. For example, we achieve (i) compressed redundancy, retaining the best time complexities, for the smallest existing full-text self-indexes; (ii) compressed permutations π with times for π() and π^{-1}() improved to log-logarithmic; and (iii) the first compressed representation of dynamic collections of disjoint sets. We also point out various applications to inverted indexes, suffix arrays, binary relations, and data compressors. Our structure is practical on large alphabets. Our experiments show that, as predicted by theory, it dominates the space/time tradeoff map of all the sequence representations, both in synthetic and application scenarios.
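The space bound is stated relative to nH0(s), so a small worked computation may help. The Python sketch below evaluates the standard empirical zero-order entropy, H0(s) = Σ_c (n_c / n) · lg(n / n_c), where n_c is the number of occurrences of symbol c. The function name is hypothetical, and the code only computes the quantity; it does not implement the data structure.

```python
import math
from collections import Counter


def zero_order_entropy(s):
    """Empirical zero-order entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    # Each distinct symbol contributes (n_c / n) * lg(n / n_c).
    return sum((cnt / n) * math.log2(n / cnt) for cnt in Counter(s).values())
```

For instance, a sequence over four equally frequent symbols has H0 = 2 bits per symbol, while a sequence consisting of one repeated symbol has H0 = 0. In that second case nH0(s) is zero and any o(n lg σ) redundancy term dominates the space, which is the situation the abstract's compressed redundancy addresses.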
Optimal Dynamic Sequence Representations
Cited by 14 (7 self)
We describe a data structure that supports access, rank and select queries, as well as symbol insertions and deletions, on a string S[1, n] over alphabet [1..σ] in time O(lg n / lg lg n), which is optimal. The time is worst-case for the queries and amortized for the updates. This complexity is better than the best previous ones by a Θ(1 + lg σ / lg lg n) factor. Our structure uses nH0(S) + O(n + σ(lg σ + lg^{1+ε} n)) bits, where H0(S) is the zero-order entropy of S and 0 < ε < 1 is any constant. This space redundancy over nH0(S) is also better, almost always, than that of the best previous dynamic structures, o(n lg σ) + O(σ(lg σ + lg n)). We can also handle general alphabets in optimal time, which has been an open problem in dynamic sequence representations.
Compressed self-indices supporting conjunctive queries on document collections
 in: Proc. 17th SPIRE, 2010
Cited by 12 (1 self)
We prove that a document collection, represented as a unique sequence T of n terms over a vocabulary Σ, can be represented in nH0(T) + o(n)(H0(T) + 1) bits of space, such that a conjunctive query t1 ∧ · · · ∧ tk can be answered in O(kδ log log |Σ|) adaptive time, where δ is the instance difficulty of the query, as defined by Barbay and Kenyon in their SODA'02 paper, and H0(T) is the empirical entropy of order 0 of T. As a comparison, using an inverted index plus the adaptive intersection algorithm by Barbay and Kenyon takes O(kδ log(n_M / δ)), where n_m and n_M are the lengths of the shortest and longest occurrence lists, respectively, among those of the query terms. Thus, we can replace an inverted index by a more space-efficient in-memory encoding, outperforming the query performance of inverted indices when the ratio n_M / δ is ω(log |Σ|).
Fast compressed tries through path decompositions
 CoRR
, 2014
Cited by 10 (3 self)
Tries are popular data structures for storing a set of strings, where common prefixes are represented by common root-to-node paths. More than 50 years of usage have produced many variants and implementations to overcome some of their limitations. We explore new succinct representations of path-decomposed tries and experimentally evaluate the corresponding reduction in space usage and memory latency, comparing with the state of the art. We study the following applications: compressed string dictionary and monotone minimal perfect hash for strings. In compressed string dictionary, we obtain data structures that outperform other state-of-the-art compressed dictionaries in space efficiency while obtaining predictable query times that are competitive with data structures preferred by the practitioners. On real-world datasets, our compressed tries obtain the smallest space (except for one case) and have the fastest lookup times, whereas access times are within 20% slower than the best-known solutions. In monotone minimal perfect hash for strings, our compressed tries perform several times faster than other trie-based monotone perfect hash functions while occupying nearly the same space. On real-world datasets, our tries are approximately 2 to 5 times faster than previous solutions, with a space occupancy less than 10% larger.
Compressing IP Forwarding Tables: Towards Entropy Bounds and Beyond
Cited by 9 (0 self)
Lately, there has been an upsurge of interest in compressed data structures, aiming to pack ever larger quantities of information into constrained memory without sacrificing the efficiency of standard operations, like random access, search, or update. The main goal of this paper is to demonstrate how data compression can benefit the networking community, by showing how to squeeze the IP Forwarding Information Base (FIB), the giant table consulted by IP routers to make forwarding decisions, into information-theoretical entropy bounds, with essentially zero cost on longest prefix match and FIB update. First, we adopt the state-of-the-art in compressed data structures, yielding a static entropy-compressed FIB representation with asymptotically optimal lookup. Then, we redesign the venerable prefix tree, used commonly for IP lookup for at least 20 years in IP routers, to also admit entropy bounds and support lookup in optimal time and update in nearly optimal time. Evaluations on a Linux kernel prototype indicate that our compressors encode a FIB comprising more than 440K prefixes to just about 100–400 KBytes of memory, with a threefold increase in lookup throughput and no penalty on FIB updates.
The wavelet trie: maintaining an indexed sequence of strings in compressed space
 in Proc. 31st ACM Symposium on Principles of Database Systems (PODS), 2012
Cited by 6 (0 self)
An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. In these applications each string can appear several times and the order of the strings in the sequence is relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast memory. We introduce and study the problem of compressed indexed sequence of strings, representing indexed sequences of strings in nearly-optimal compressed space, both in the static and dynamic settings, while preserving provably good performance for the supported operations. We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet Trie smoothly adapts to a sequence of strings that changes over time. It improves on the state-of-the-art compressed data structures by supporting a dynamic alphabet (i.e. the set of distinct strings) and prefix queries, both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space occupancy to close to the entropy of the sequence.
Succinct data structures for path queries
 in: Proceedings of the 20th Annual European Symposium on Algorithms
, 2012
Cited by 6 (4 self)
Consider a tree T on n nodes, each having a weight drawn from [1..σ]. In this paper, we design succinct data structures to encode T using nH(W_T) + o(n lg σ) bits of space, such that we can support path counting queries in O(lg σ / lg lg n + 1) time, path reporting queries in O((occ + 1)(lg σ / lg lg n + 1)) time, and path median and path selection queries in O(lg σ / lg lg σ) time, where H(W_T) is the entropy of the multiset of the weights of the nodes in T. Our results not only improve the best known linear space data structures [15], but also match the lower bounds for these path queries [18, 19, 16] when σ = Ω(n/polylog(n)).