Results 1-10 of 28
Practical rank/select queries over arbitrary sequences
 In Proc. 15th SPIRE, LNCS 5280
, 2008
Abstract

Cited by 49 (26 self)
Abstract. We present a practical study on the compact representation of sequences supporting rank, select, and access queries. While there are several theoretical solutions to the problem, only a few have been tried out, and there is little idea on how the others would perform, especially in the case of sequences with very large alphabets. We first present a new practical implementation of the compressed representation for bit sequences proposed by Raman, Raman, and Rao [SODA 2002], which is competitive with the existing ones when the sequences are not too compressible. It also has nice local compression properties, and we show that this makes it an excellent tool for compressed text indexing in combination with the Burrows-Wheeler transform. This shows the practicality of a recent theoretical proposal [Mäkinen and Navarro, SPIRE 2007], achieving spaces never seen before. Second, for general sequences, we tune wavelet trees for the case of very large alphabets, by removing their pointer information. We show that this gives an excellent solution for representing a sequence within zero-order entropy space, in cases where the large alphabet poses a serious challenge to typical encoding methods. We also present the first implementation of Golynski et al.'s representation [SODA 2006], which offers another interesting time/space tradeoff.
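The three queries named in this abstract have simple semantics that a naive Python sketch can make concrete. This baseline is linear time per query and uncompressed; structures such as the Raman-Raman-Rao bitmap answer the same queries in constant time within compressed space. The class name is illustrative, not from the paper.

```python
# Naive reference semantics of access/rank/select on a bit sequence.
# Compressed structures (e.g. RRR) answer these in O(1); this sketch
# scans in O(n) and exists only to pin down what the queries mean.

class NaiveBitSequence:
    def __init__(self, bits):
        self.bits = list(bits)  # sequence of 0/1 values

    def access(self, i):
        """Return the bit at position i (0-based)."""
        return self.bits[i]

    def rank(self, i, b=1):
        """Number of occurrences of bit b in positions 0..i."""
        return sum(1 for x in self.bits[:i + 1] if x == b)

    def select(self, j, b=1):
        """Position of the j-th occurrence (1-based) of bit b, or -1."""
        count = 0
        for pos, x in enumerate(self.bits):
            if x == b:
                count += 1
                if count == j:
                    return pos
        return -1

s = NaiveBitSequence([1, 0, 1, 1, 0, 0, 1])
assert s.access(2) == 1   # the bit at position 2
assert s.rank(3) == 3     # three 1s among positions 0..3
assert s.select(2) == 2   # the second 1 sits at position 2
```

Rank and select are inverses of a sort: `rank(select(j)) == j` whenever the j-th occurrence exists, which is why the pair suffices as a building block for the larger structures the paper studies.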
Compact Rich-Functional Binary Relation Representations
Abstract

Cited by 18 (13 self)
Binary relations are an important abstraction arising in a number of data representation problems. Each existing data structure specializes in the few basic operations required by one single application, and takes only limited advantage of the inherent redundancy of binary relations. We show how to support more general operations efficiently, while taking better advantage of some forms of redundancy in practical instances. As a basis for a more general discussion on binary relation data structures, we list the operations of potential interest for practical applications, and give reductions between operations. We identify a set of operations that yield the support of all others. As a first contribution to the discussion, we present two data structures for binary relations, each of which achieves a distinct tradeoff between the space used to store and index the relation, the set of operations supported in sublinear time, and the time in which those operations are supported. The experimental performance of our data structures shows that they not only offer good time complexities to carry out many operations, but also take advantage of regularities that arise in practical instances in order to reduce space usage.
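To make the kind of operations under discussion concrete, here is a naive Python baseline that stores the relation as a plain set of (object, label) pairs. The paper's structures answer such queries in sublinear time within compact space; this sketch is linear per query, and the operation names are illustrative, not the paper's.

```python
# A binary relation between objects and labels, stored as a set of
# pairs. The queries below are representative of the operations a
# binary-relation data structure may need to support; this baseline
# scans the whole relation for each one.

class BinaryRelation:
    def __init__(self, pairs):
        self.pairs = set(pairs)  # elements are (object, label)

    def related(self, obj, lab):
        """Is obj related to lab?"""
        return (obj, lab) in self.pairs

    def labels_of(self, obj):
        """All labels related to obj, in order."""
        return sorted(l for o, l in self.pairs if o == obj)

    def objects_of(self, lab):
        """All objects related to lab, in order (reverse access)."""
        return sorted(o for o, l in self.pairs if l == lab)

    def rank(self, obj, lab):
        """How many labels <= lab are related to obj (rank-style query)."""
        return sum(1 for o, l in self.pairs if o == obj and l <= lab)

r = BinaryRelation([(1, 'a'), (1, 'c'), (2, 'a'), (3, 'b')])
assert r.labels_of(1) == ['a', 'c']
assert r.objects_of('a') == [1, 2]
assert r.rank(1, 'b') == 1   # only 'a' <= 'b' is related to object 1
```

Note how `labels_of` and `objects_of` are symmetric: supporting both directions efficiently, rather than only the one direction a single application needs, is exactly the generality the abstract argues for.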
Extended Compact Web Graph Representations
Abstract

Cited by 18 (14 self)
Abstract. Many relevant Web mining tasks translate into classical algorithms on the Web graph. Compact Web graph representations allow running these tasks on larger graphs within main memory. These representations at least provide fast navigation (to the neighbors of a node), yet more sophisticated operations are desirable for several Web analyses. We present a compact Web graph representation that, in addition, supports reverse navigation (to the nodes pointing to the given one). The standard approach to achieve this is to represent the graph and its transpose, which basically doubles the space requirement. Our structure, instead, represents the adjacency list using a compact sequence representation that allows finding the positions where a given node v is mentioned, and answers reverse navigation using that primitive. This is combined with a previous proposal based on grammar compression of the adjacency list. The combination yields interesting algorithmic problems. As a result, we achieve the smallest graph representation reported in the literature.
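The reverse-navigation idea can be sketched as follows: concatenate all adjacency lists into one sequence, record where each node's list begins, and recover the reverse neighbors of v by locating every position where v is mentioned and mapping it back to the owning list. A compact sequence representation supports that location primitive directly (via select-style queries); this illustrative Python version simply scans.

```python
# Sketch of reverse navigation over concatenated adjacency lists.
# Forward neighbors are a slice of the sequence; reverse neighbors of v
# are the owners of the positions where v occurs. Real structures find
# those positions with rank/select instead of a linear scan.
import bisect

class Graph:
    def __init__(self, adj):
        # adj: dict mapping each node to its list of out-neighbors
        self.nodes = sorted(adj)
        self.offsets = []   # start of each node's list in self.seq
        self.seq = []       # all adjacency lists, concatenated
        for u in self.nodes:
            self.offsets.append(len(self.seq))
            self.seq.extend(adj[u])

    def neighbors(self, u):
        i = self.nodes.index(u)
        end = self.offsets[i + 1] if i + 1 < len(self.offsets) else len(self.seq)
        return self.seq[self.offsets[i]:end]

    def reverse_neighbors(self, v):
        result = []
        for pos, w in enumerate(self.seq):   # naive "find mentions of v"
            if w == v:
                # map the position back to the node whose list contains it
                i = bisect.bisect_right(self.offsets, pos) - 1
                result.append(self.nodes[i])
        return result

g = Graph({0: [1, 2], 1: [2], 2: [0]})
assert g.neighbors(0) == [1, 2]
assert g.reverse_neighbors(2) == [0, 1]   # both 0 and 1 point to 2
```

Storing one sequence with this primitive, instead of the graph plus its transpose, is what avoids doubling the space.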
Linear-time compression of bounded-genus graphs into information-theoretically optimal number of bits
 In: 13th Symposium on Discrete Algorithms (SODA)
, 2002
Abstract

Cited by 16 (1 self)
1 Introduction. This extended abstract summarizes a new result for the graph compression problem, addressing how to compress a graph G into a binary string Z with the requirement that Z can be decoded to recover G. Graph compression finds important applications in 3D model compression in computer graphics [12, 17-20] and compact routing tables in computer networks [7]. For brevity, let a π-graph stand for a graph with property π. The information-theoretically optimal number of bits required to represent an n-node π-graph is ⌈log2 Nπ(n)⌉, where Nπ(n) is the number of distinct n-node π-graphs. Although determining or approximating the closed forms of Nπ(n) for nontrivial classes of π is challenging, we provide a linear-time methodology for graph compression schemes that are information-theoretically optimal with respect to continuous superadditive functions (abbreviated as optimal for the rest of the extended abstract). Specifically, if π satisfies certain properties, then we can compress any n-node m-edge π-graph G into a binary string Z such that G and Z can be computed from each other in O(m + n) time, and that the bit count of Z is at most β(n) + o(β(n)) for any continuous superadditive function β(n) with log2 Nπ(n) ≤ β(n) + o(β(n)). Our methodology is applicable to general classes of graphs; this extended abstract focuses on graphs with sublinear genus. For example, if the input n-node π-graph G is equipped with an embedding on its genus surface, which is a reasonable assumption for graphs arising from 3D model compression, then our methodology is applicable to any π satisfying the following statements:
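The information-theoretic bound discussed above is easy to evaluate for classes whose counting function is known. A small illustration (function names are mine): for general labeled undirected n-node graphs there are 2^(n(n-1)/2) members, so the optimal size is exactly n(n-1)/2 bits; restricted classes such as bounded-genus graphs have far smaller counts, and hence far smaller optimal sizes.

```python
# Evaluating the lower bound ceil(log2 N(n)) for a graph class with
# N(n) members: any encoding that distinguishes all of them needs at
# least that many bits for some graph.
import math

def optimal_bits_general(n):
    """Bound for all labeled undirected n-node graphs: N = 2^(n(n-1)/2),
    so ceil(log2 N) is exactly n(n-1)/2 (one bit per potential edge)."""
    return n * (n - 1) // 2

def optimal_bits(count):
    """ceil(log2 count) for an arbitrary class with `count` members."""
    return math.ceil(math.log2(count))

assert optimal_bits_general(4) == 6    # 6 potential edges on 4 nodes
assert optimal_bits(1000) == 10        # 2^9 < 1000 <= 2^10
```

The paper's contribution is matching such bounds to within lower-order terms in linear time for classes like bounded-genus graphs, where the count Nπ(n) has no simple closed form.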
Efficient Fully-Compressed Sequence Representations
, 2010
Abstract

Cited by 15 (11 self)
We present a data structure that stores a sequence s[1..n] over alphabet [1..σ] in nH0(s) + o(n)(H0(s)+1) bits, where H0(s) is the zero-order entropy of s. This structure supports the queries access, rank and select, which are fundamental building blocks for many other compressed data structures, in worst-case time O(lg lg σ) and average time O(lg H0(s)). The worst-case complexity matches the best previous results, yet these had been achieved with data structures using nH0(s) + o(n lg σ) bits. On highly compressible sequences the o(n lg σ) bits of the redundancy may be significant compared to the nH0(s) bits that encode the data. Our representation, instead, compresses the redundancy as well. Moreover, our average-case complexity is unprecedented. Our technique is based on partitioning the alphabet into characters of similar frequency. The subsequence corresponding to each group can then be encoded using fast uncompressed representations without harming the overall compression ratios, even in the redundancy. The result also improves upon the best current compressed representations of several other data structures. For example, we achieve (i) compressed redundancy, retaining the best time complexities, for the smallest existing full-text self-indexes; (ii) compressed permutations π with times for π() and π⁻¹() improved to loglogarithmic; and (iii) the first compressed representation of dynamic collections of disjoint sets. We also point out various applications to inverted indexes, suffix arrays, binary relations, and data compressors. Our structure is practical on large alphabets. Our experiments show that, as predicted by theory, it dominates the space/time tradeoff map of all the sequence representations, both in synthetic and application scenarios.
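The alphabet-partitioning step can be illustrated with a toy grouping rule (not the paper's exact scheme): assign each character to a class given by the integer part of the negative base-2 log of its relative frequency, so every class holds characters of roughly similar frequency.

```python
# Toy alphabet partitioning by frequency class. Characters in the same
# class have relative frequencies within a factor of 2 of each other,
# so each class's subsequence can use a fast uncompressed encoding
# without ruining the zero-order entropy bound. Illustrative only.
import math
from collections import Counter

def partition_alphabet(s):
    n = len(s)
    freq = Counter(s)
    groups = {}
    for c, f in freq.items():
        # class k means relative frequency in (2^-(k+1), 2^-k]
        klass = int(math.floor(-math.log2(f / n)))
        groups.setdefault(klass, []).append(c)
    return {k: sorted(v) for k, v in groups.items()}

groups = partition_alphabet("abracadabra")
# 'a' occurs 5/11 times -> class 1; 'b','r' 2/11 -> class 2;
# 'c','d' 1/11 -> class 3
assert groups[1] == ['a']
assert groups[2] == ['b', 'r']
assert groups[3] == ['c', 'd']
```

Within a class, all characters cost roughly the same number of bits under a zero-order code, which is why a plain fixed-width encoding of each class's subsequence stays close to nH0(s) overall.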
On Compressing the Textual Web
Abstract

Cited by 12 (2 self)
Nowadays we know how to effectively compress most basic components of any modern search engine, such as the graphs arising from the Web structure and/or its usage, the posting lists, and the dictionary of terms. But we are not aware of any study which has deeply addressed the issue of compressing the raw Web pages. Many Web applications use simple compression algorithms (e.g., gzip, or word-based Move-to-Front or Huffman coders) and conclude that, even compressed, raw data take more space than inverted lists. In this paper we investigate two typical scenarios of use of data compression for large Web collections. In the first scenario, the compressed pages are stored on disk and we only need to support the fast scanning of large parts of the compressed collection (such as for map-reduce paradigms). In the second scenario, we consider the fast access to individual pages of the compressed collection that is distributed among the RAMs of many PCs (such as for search engines and miners). For the first scenario, we provide a thorough experimental comparison among state-of-the-art compressors, thus indicating pros and cons of the available solutions. For the second scenario, we compare compressed-storage solutions with the new technology of compressed self-indexes [45]. Our results show that Web pages are more compressible than expected and, consequently, that some common beliefs in this area should be reconsidered. Our results are novel for the large spectrum of tested approaches and the size of datasets, and provide a threefold contribution: a nontrivial baseline for designing new compressed-storage solutions, a guide for software developers faced with Web-page storage, and a natural complement to the recent figures on inverted-list compression achieved by [57, 58].
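One common way to support the second scenario (fast access to individual pages) is block-wise compression: compress pages in small groups so a single page can be recovered by decompressing only its block. The paper compares existing compressors and self-indexes; this sketch, with names of my choosing, only illustrates the storage-side trade-off between block size (compression ratio) and access latency.

```python
# Block-wise compressed storage of a page collection. Larger blocks
# share more context and compress better; smaller blocks decompress
# faster per individual access. Assumes pages contain no NUL byte,
# which is used here as a separator.
import zlib

def compress_collection(pages, block_size=2):
    blocks, index = [], []   # index[i] = (block number, offset in block)
    for start in range(0, len(pages), block_size):
        block = pages[start:start + block_size]
        for off in range(len(block)):
            index.append((len(blocks), off))
        blocks.append(zlib.compress(b'\x00'.join(block)))
    return blocks, index

def access_page(blocks, index, i):
    """Recover page i by decompressing only its block."""
    b, off = index[i]
    return zlib.decompress(blocks[b]).split(b'\x00')[off]

pages = [b"<html>one</html>", b"<html>two</html>", b"<html>three</html>"]
blocks, index = compress_collection(pages)
assert access_page(blocks, index, 2) == b"<html>three</html>"
```

Compressed self-indexes, the alternative the paper evaluates, go further: they support access without any fixed block granularity and add substring search on top.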
Fast compressed tries through path decompositions
 In CoRR
, 2014
Abstract

Cited by 10 (3 self)
Tries are popular data structures for storing a set of strings, where common prefixes are represented by common root-to-node paths. More than 50 years of usage have produced many variants and implementations to overcome some of their limitations. We explore new succinct representations of path-decomposed tries and experimentally evaluate the corresponding reduction in space usage and memory latency, comparing with the state of the art. We study the following applications: compressed string dictionary and monotone minimal perfect hash for strings. In compressed string dictionary, we obtain data structures that outperform other state-of-the-art compressed dictionaries in space efficiency while obtaining predictable query times that are competitive with data structures preferred by the practitioners. On real-world datasets, our compressed tries obtain the smallest space (except for one case) and have the fastest lookup times, whereas access times are within 20% slower than the best-known solutions. In monotone minimal perfect hash for strings, our compressed tries perform several times faster than other trie-based monotone perfect hash functions while occupying nearly the same space. On real-world datasets, our tries are approximately 2 to 5 times faster than previous solutions, with a space occupancy less than 10% larger.
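The prefix sharing the abstract refers to is easy to see in a plain (uncompressed) pointer-based trie; the succinct path-decomposed representations studied in the paper keep this functionality while replacing the pointers with compact encodings. A minimal sketch:

```python
# A plain trie over a string set: common prefixes share a single
# root-to-node path, which is the redundancy that succinct and
# path-decomposed representations exploit. Nested dicts play the role
# of child pointers; '$' marks end of word.

class Trie:
    def __init__(self, words=()):
        self.root = {}
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True  # end-of-word marker

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return '$' in node

t = Trie(["tree", "trie", "try"])
assert t.contains("trie")
assert not t.contains("tr")   # "tr" is a shared path, not a stored word
```

Here "tree", "trie", and "try" share the path t-r, stored once; a path decomposition would further collapse unary chains like t-r into single edges before encoding succinctly.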
A Compressed Text Index on Secondary Memory
Abstract

Cited by 9 (1 self)
Abstract. We introduce a practical disk-based compressed text index that, when the text is compressible, takes much less space than the suffix array. It provides very good I/O times for searching, which in particular improve when the text is compressible. In this aspect our index is unique, as compressed indexes have been slower than their classical counterparts on secondary memory. We analyze our index and show experimentally that it is extremely competitive on compressible texts. 1 Introduction and Related Work. Compressed full-text self-indexing [22] is a recent trend that builds on the discovery that traditional text indexes like suffix trees and suffix arrays can be compacted to take space proportional to the compressed text size, and moreover be able to reproduce any text context. Therefore self-indexes replace the text,
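The classical structure being compressed here is easy to sketch: a suffix array lists the starting positions of all suffixes of the text in lexicographic order, so every occurrence of a pattern can be found by binary search over suffix prefixes. A compressed self-index keeps this functionality within space close to the compressed text; this naive Python version builds everything explicitly.

```python
# A plain suffix array with pattern search. Construction and the
# prefix list below are naive (real indexes build in O(n) and never
# materialize the suffixes); this sketch only shows the search logic.
import bisect

def suffix_array(text):
    """Starting positions of all suffixes, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def locate(text, sa, pattern):
    """All occurrence positions of pattern, via binary search on sa."""
    m = len(pattern)
    # Comparing only the first m characters keeps the order consistent
    # with the full lexicographic order of the suffixes.
    prefixes = [text[i:i + m] for i in sa]
    lo = bisect.bisect_left(prefixes, pattern)
    hi = bisect.bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])

sa = suffix_array("banana")
assert sa == [5, 3, 1, 0, 4, 2]
assert locate("banana", sa, "ana") == [1, 3]
assert locate("banana", sa, "z") == []
```

Because the matching suffixes form one contiguous run in the sorted order, two binary searches bound all occurrences at once, which is the property both the classical and the compressed index exploit.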