Results 1 - 7 of 7
The engineering of a compression boosting library: Theory vs practice in BWT compression
, 2006
An algorithmic framework for compression and text indexing
Abstract

Cited by 5 (0 self)
We present a unified algorithmic framework to obtain nearly optimal space bounds for text compression and compressed text indexing, apart from lower-order terms. For a text T of n symbols drawn from an alphabet Σ, our bounds are stated in terms of the hth-order empirical entropy of the text, H_h. In particular, we provide a tight analysis of the Burrows-Wheeler transform (bwt), establishing a bound of nH_h + M(T,Σ,h) bits, where M(T,Σ,h) denotes the asymptotic number of bits required to store the empirical statistical model for contexts of order h appearing in T. Using the same framework, we also obtain an implementation of the compressed suffix array (csa) which achieves nH_h + M(T,Σ,h) + O(n lg lg n / lg_|Σ| n) bits of space while still retaining competitive full-text indexing functionality. The novelty of the proposed framework lies in its use of the finite set model instead of the empirical probability model (as in previous work), giving us new insight into the design and analysis of our algorithms. For example, we show that our analysis gives improved bounds, since M(T,Σ,h) ≤ min{g′_h lg(n/g′_h + 1), H*_h n + lg n + g″_h}, where g′_h = O(|Σ|^(h+1)) and g″_h = O(|Σ|^(h+1) lg |Σ|^(h+1)) do not depend on the text length n, while H*_h ≥ H_h is the modified hth-order empirical entropy of T. Moreover, we show a strong relationship between a compressed full-text index and the succinct dictionary problem. We also examine the importance of lower-order terms, as these can dwarf any savings achieved by high-order entropy. We report further results and tradeoffs on high-order entropy-compressed text indexes in the paper.
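The Burrows-Wheeler transform analyzed in this abstract can be illustrated with a naive rotation-sort sketch (illustration only; practical implementations compute the transform via suffix arrays, and the `bwt` name and `$` sentinel here are our own conventions, not the paper's code):

```python
def bwt(s: str) -> str:
    """Naive Burrows-Wheeler transform: append a sentinel '$'
    (assumed lexicographically smallest), sort all rotations of
    the string, and return the last column of the sorted table."""
    s += "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(bwt("banana"))  # -> annb$aa  (equal symbols cluster by context)
```

The output is a permutation of the input plus the sentinel, which is why the nH_h bound above is stated in terms of the text's own empirical entropy.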
Move-to-Front, Distance Coding, and Inversion Frequencies Revisited
, 2007
Abstract

Cited by 5 (1 self)
Move-to-Front, Distance Coding and Inversion Frequencies are three somewhat related techniques used to process the output of the Burrows-Wheeler Transform. In this paper we analyze these techniques from the point of view of how effective they are at compressing low-entropy strings, that is, strings which have many regularities and are therefore highly compressible. This is a non-trivial task, since many compressors have non-constant overheads that become non-negligible when the input string is highly compressible. Because of the properties of the Burrows-Wheeler transform, an algorithm that is locally optimal compresses low-entropy strings effectively. Informally, local optimality implies that an algorithm is able to effectively compress an arbitrary partition of the input string. We show that in their original formulation neither Move-to-Front, nor Distance Coding, nor Inversion Frequencies is locally optimal. Then, we describe simple variants of the above algorithms which are locally optimal. To achieve local optimality with Move-to-Front it suffices to combine it with Run-Length Encoding. To achieve local optimality with Distance Coding and Inversion Frequencies we use a novel "escape and re-enter" strategy.
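The Move-to-Front plus Run-Length Encoding combination mentioned in the abstract can be sketched as follows (a hedged illustration, not the paper's variant; the function names and the pair-based run representation are our own):

```python
def mtf_encode(data, alphabet):
    """Move-to-Front: emit each symbol's position in a dynamic table,
    then move that symbol to the front. Runs of equal symbols (as BWT
    output tends to produce) become runs of zeros."""
    table = list(alphabet)
    out = []
    for c in data:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def rle(seq):
    """Run-Length Encoding as (value, run length) pairs; pairing RLE
    with MTF is the local-optimality fix discussed in the abstract."""
    runs = []
    for x in seq:
        if runs and runs[-1][0] == x:
            runs[-1] = (x, runs[-1][1] + 1)
        else:
            runs.append((x, 1))
    return runs

codes = mtf_encode("annb$aa", sorted(set("annb$aa")))
print(codes)       # -> [1, 3, 0, 3, 3, 3, 0]
print(rle(codes))  # -> [(1, 1), (3, 1), (0, 1), (3, 3), (0, 1)]
```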
Post BWT Stages of the . . .
Abstract
The lossless Burrows-Wheeler compression algorithm has received considerable attention over recent years for both its simplicity and effectiveness. It is based on a permutation of the input sequence, the Burrows-Wheeler transformation, which groups symbols with a similar context close together. In the original version, this permutation was followed by a Move-To-Front transformation and a final entropy coding stage. Later versions used different algorithms placed after the Burrows-Wheeler transformation, since the following stages have a significant influence on the compression rate. This article describes different algorithms and improvements for these post-BWT stages, including a new context-based approach. Results for compression rates are presented together with compression and decompression times on the Calgary corpus, the Canterbury corpus, the large Canterbury corpus and the Lukas 2D 16-bit medical image corpus.
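The invertibility that makes the Burrows-Wheeler permutation usable as a compression front end can be demonstrated with a naive reconstruction sketch (quadratic time, illustration only; production decoders use the LF-mapping instead, and the `$` sentinel convention is our own assumption):

```python
def ibwt(last: str) -> str:
    """Invert the BWT by repeatedly prepending the last column to the
    sorted table of partial rotations. After n rounds the table holds
    all full rotations; the row ending in '$' is the original text."""
    n = len(last)
    table = [""] * n
    for _ in range(n):
        table = sorted(last[i] + table[i] for i in range(n))
    row = next(r for r in table if r.endswith("$"))
    return row[:-1]

print(ibwt("annb$aa"))  # -> banana
```

Because the transform is a pure permutation, all the compression gain has to come from the post-BWT stages the article surveys.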
Indexing sequences of IEEE 754 double-precision numbers
Abstract
In recent decades, much attention has been paid to the development of succinct data structures to store and/or index text, biological collections, source code, etc. Their success was in most cases due to handling data with a relatively small alphabet and to exploiting a rather skewed distribution (text) or simply the repetitiveness within the source data (source-code repositories, biological sequences of similar individuals). In this work, we face the problem of dealing with collections of floating-point data, which typically have a large alphabet (a real number hardly ever repeats twice) and a less biased distribution. We present two solutions to store and index such collections. The first one is based on the well-known inverted index. It consumes space around the size of the original collection, providing appealing search times. The second one uses a wavelet tree, which, at the expense of slower search times, obtains slightly better space consumption.
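The wavelet-tree alternative described above can be sketched with a minimal pointer-based tree over integer symbols (an illustrative toy, not the paper's engineered structure; practical versions store succinct bitvectors with constant-time rank rather than plain Python lists):

```python
class WaveletTree:
    """Minimal pointer-based wavelet tree. Each node stores one bit per
    symbol (0 = symbol goes left, 1 = goes right) and splits the
    alphabet range [lo, hi] at its midpoint."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        if lo == hi or not seq:
            self.bits = None          # leaf: a single symbol (or empty)
            return
        mid = (lo + hi) // 2
        self.bits = [1 if x > mid else 0 for x in seq]
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def rank(self, c, i):
        """Count occurrences of symbol c in seq[:i]."""
        if self.bits is None:
            return i
        mid = (self.lo + self.hi) // 2
        ones = sum(self.bits[:i])     # a succinct bitvector makes this O(1)
        if c <= mid:
            return self.left.rank(c, i - ones)
        return self.right.rank(c, ones)

wt = WaveletTree([3, 1, 4, 1, 5, 9, 2, 6])
print(wt.rank(1, 4))  # -> 2  (two 1s among the first four symbols)
```

For the large alphabets of floating-point data, the tree depth grows with the bit width of the symbols, which is the space/time trade-off against the inverted index that the abstract alludes to.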