Results 1  10
of
21
Wavelet Trees for All
"... The wavelet tree is a versatile data structure that serves a number of purposes, from string processing to geometry. It can be regarded as a device that represents a sequence, a reordering, or a grid of points. In addition, its space adapts to various entropy measures of the data it encodes, enabli ..."
Abstract

Cited by 32 (12 self)
 Add to MetaCart
(Show Context)
The wavelet tree is a versatile data structure that serves a number of purposes, from string processing to geometry. It can be regarded as a device that represents a sequence, a reordering, or a grid of points. In addition, its space adapts to various entropy measures of the data it encodes, enabling compressed representations. New competitive solutions to a number of problems, based on wavelet trees, are appearing every year. In this survey we give an overview of wavelet trees and the surprising number of applications in which we have found them useful: basic and weighted point grids, sets of rectangles, strings, permutations, binary relations, graphs, inverted indexes, document retrieval indexes, fulltext indexes, XML indexes, and general numeric sequences.
Colored Range Queries and Document Retrieval
"... Colored range queries are a wellstudied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important onedimensional colored range queries — colore ..."
Abstract

Cited by 31 (18 self)
 Add to MetaCart
Colored range queries are a wellstudied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important onedimensional colored range queries — colored range listing, colored range topk queries and colored range counting — and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the highorder entropies of the library of documents. We then show how (approximate) colored topk queries can be reduced to (approximate) rangemode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.
Alphabetindependent compressed text indexing
 In ESA
, 2011
"... Abstract. Selfindexes can represent a text in asymptotically optimal space under the kth order entropy model, give access to text substrings, and support indexed pattern searches. Their time complexities are not optimal, however: they always depend on the alphabet size. In this paper we achieve, f ..."
Abstract

Cited by 25 (17 self)
 Add to MetaCart
Abstract. Selfindexes can represent a text in asymptotically optimal space under the kth order entropy model, give access to text substrings, and support indexed pattern searches. Their time complexities are not optimal, however: they always depend on the alphabet size. In this paper we achieve, for the first time, full alphabetindependence in the time complexities of selfindexes, while retaining space optimality. We obtain also some relevant byproducts on compressed suffix trees. 1
Efficient FullyCompressed Sequence Representations
, 2010
"... We present a data structure that stores a sequence s[1..n] over alphabet [1..σ] in nH0(s) + o(n)(H0(s)+1) bits, where H0(s) is the zeroorder entropy of s. This structure supports the queries access, rank and select, which are fundamental building blocks for many other compressed data structures, in ..."
Abstract

Cited by 15 (11 self)
 Add to MetaCart
(Show Context)
We present a data structure that stores a sequence s[1..n] over alphabet [1..σ] in nH0(s) + o(n)(H0(s)+1) bits, where H0(s) is the zeroorder entropy of s. This structure supports the queries access, rank and select, which are fundamental building blocks for many other compressed data structures, in worstcase time O (lg lg σ) and average time O (lg H0(s)). The worstcase complexity matches the best previous results, yet these had been achieved with data structures using nH0(s) + o(n lg σ) bits. On highly compressible sequences the o(n lg σ) bits of the redundancy may be significant compared to the the nH0(s) bits that encode the data. Our representation, instead, compresses the redundancy as well. Moreover, our averagecase complexity is unprecedented. Our technique is based on partitioning the alphabet into characters of similar frequency. The subsequence corresponding to each group can then be encoded using fast uncompressed representations without harming the overall compression ratios, even in the redundancy. The result also improves upon the best current compressed representations of several other data structures. For example, we achieve (i) compressed redundancy, retaining the best time complexities, for the smallest existing fulltext selfindexes; (ii) compressed permutations π with times for π() and π −1 () improved to loglogarithmic; and (iii) the first compressed representation of dynamic collections of disjoint sets. We also point out various applications to inverted indexes, suffix arrays, binary relations, and data compressors. Our structure is practical on large alphabets. Our experiments show that, as predicted by theory, it dominates the space/time tradeoff map of all the sequence representations, both in synthetic and application scenarios.
Spaces, trees and colors: The algorithmic landscape of document retrieval on sequences
 CoRR
"... Document retrieval is one of the best established information retrieval activities since the sixties, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to “natural language” text coll ..."
Abstract

Cited by 14 (9 self)
 Add to MetaCart
Document retrieval is one of the best established information retrieval activities since the sixties, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to “natural language” text collections, where inverted indexes are the preferred solution. As successful as this paradigm has been, it fails to properly handle various East Asian languages and other scenarios where the “natural language ” assumptions do not hold. In this survey we cover the recent research in extending the document retrieval techniques to a broader class of sequence collections, which has applications in bioinformatics, data and Web mining, chemoinformatics, software engineering, multimedia information retrieval, and many other fields. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and other areas.
Optimal Dynamic Sequence Representations
"... We describe a data structure that supports access, rank and select queries, as well as symbol insertions and deletions, on a string S[1, n] over alphabet [1..σ] in time O(lg n / lg lg n), which is optimal. The time is worstcase for the queries and amortized for the updates. This complexity is bette ..."
Abstract

Cited by 14 (7 self)
 Add to MetaCart
(Show Context)
We describe a data structure that supports access, rank and select queries, as well as symbol insertions and deletions, on a string S[1, n] over alphabet [1..σ] in time O(lg n / lg lg n), which is optimal. The time is worstcase for the queries and amortized for the updates. This complexity is better than the best previous ones by a Θ(1 + lg σ / lg lg n) factor. Our structure uses nH0(S) + O(n + σ(lg σ + lg 1+ε n)) bits, where H0(S) is the zeroorder entropy of S and 0 < ε < 1 is any constant. This space redundancy over nH0(S) is also better, almost always, than that of the best previous dynamic structures, o(n lg σ)+O(σ(lg σ+lg n)). We can also handle general alphabets in optimal time, which has been an open problem in dynamic sequence representations.
Better Space Bounds for Parameterized Range Majority and Minority
"... Abstract. Karpinski and Nekrich (2008) introduced the problem of parameterized range majority, which asks to preprocess a string of length n such that, given the endpoints of a range, one can quickly find all the distinct elements whose relative frequencies in that range are more than a threshold τ. ..."
Abstract

Cited by 5 (3 self)
 Add to MetaCart
(Show Context)
Abstract. Karpinski and Nekrich (2008) introduced the problem of parameterized range majority, which asks to preprocess a string of length n such that, given the endpoints of a range, one can quickly find all the distinct elements whose relative frequencies in that range are more than a threshold τ. Subsequent authors have reduced their time and space bounds such that, when τ is given at preprocessing time, we need either O(n lg(1/τ)) space and optimal O(1/τ) query time or linear space and O((1/τ) lg lg σ) query time, where σ is the alphabet size. In this paper we give the first linearspace solution with optimal O(1/τ) query time. For the case when τ is given at query time, we significantly improve previous bounds, achieving either O(n lg lg σ) space and optimal O(1/τ) query time or compressed space and O ( (1/τ) lg lg(1/τ) query time. Along the lg lg n way, we consider the complementary problem of parameterized range minority that was recently introduced by Chan et al. (2012), who achieved linear space and O(1/τ) query time even for variable τ. We improve their solution to use either nearly optimally compressed space with no slowdown, or optimally compressed space with nearly no slowdown. Some of our intermediate results, such as densitysensitive query time for onedimensional range counting, may be of independent interest. 1
Grammar Compressed Sequences with Rank/Select Support?
"... Abstract. Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. In several recent applications, the need to represent highly repetitive sequences arises, where statistical co ..."
Abstract

Cited by 5 (3 self)
 Add to MetaCart
(Show Context)
Abstract. Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. In several recent applications, the need to represent highly repetitive sequences arises, where statistical compression is ineffective. We introduce grammarbased representations for repetitive sequences, which use up to 10 % of the space needed by representations based on statistical compression, and support direct access and rank/select operations within tens of microseconds. 1
Topk document retrieval in compact space and nearoptimal time
 In Proc. 24th Annual International Symposium on Algorithms and Computation
"... Abstract. Let D={d1, d2,...dD} be a given set of D string documents of total length n. Our task is to index D such that the k most relevant documents for an online query pattern P of length p can be retrieved efficiently. There exist linear space data structures of O(n) words for answering such quer ..."
Abstract

Cited by 4 (4 self)
 Add to MetaCart
(Show Context)
Abstract. Let D={d1, d2,...dD} be a given set of D string documents of total length n. Our task is to index D such that the k most relevant documents for an online query pattern P of length p can be retrieved efficiently. There exist linear space data structures of O(n) words for answering such queries in optimal O(p + k) time. In this paper, we describe a compact index of size CSA  + n lg D + o(n lg D) bits with near optimal time, O(p+k lg ∗ n), for the basic relevance metric termfrequency, where CSA  is the size (in bits) of a compressed fulltext index of D, and lg ∗ n is the iterated logarithm of n. 1 Introduction and Related Work Topk document retrieval is the problem of preprocessing a text collection so that, given a search pattern P [1..p] and a threshold k, we retrieve the k documents most “relevant ” to P, for some definition of relevance. This is the basic problem of search engines and forms the core of the Information Retrieval (IR) field [5].
Linearspace data structures for range frequency queries on arrays and trees
 In Proc. MFCS, volume 8087 of LNCS
, 2013
"... Abstract. We present O(n)space data structures to support various range frequency queries on a given array A[0: n − 1] or tree T with n nodes. Given a query consisting of an arbitrary pair of preorder rank indices (i, j), our data structures return a least frequent element, mode, or αminority of ..."
Abstract

Cited by 4 (3 self)
 Add to MetaCart
Abstract. We present O(n)space data structures to support various range frequency queries on a given array A[0: n − 1] or tree T with n nodes. Given a query consisting of an arbitrary pair of preorder rank indices (i, j), our data structures return a least frequent element, mode, or αminority of the multiset of elements in the unique path with endpoints at indices i and j in A or T. We describe a data structure that supports range least frequent element queries on arrays in O( n/w) time, improving the Θ( n) worstcase time required by the data structure of Chan et al. (SWAT 2012), where w ∈ Ω(logn) is the word size in bits. We describe a data structure that supports range mode queries on trees in O(log log n n/w) time, improving the Θ( n logn) worstcase time required by the data structure of Krizanc et al. (ISAAC 2003). Finally, we describe a data structure that supports range αminority queries on trees in O(α−1 log log n) time, where α ∈ [0, 1] can be specified at query time. 1