Results 1 - 7 of 7
The engineering of a compression boosting library: Theory vs practice in BWT compression
, 2006
An algorithmic framework for compression and text indexing
Abstract

Cited by 5 (0 self)
We present a unified algorithmic framework to obtain nearly optimal space bounds for text compression and compressed text indexing, apart from lower-order terms. For a text T of n symbols drawn from an alphabet Σ, our bounds are stated in terms of the hth-order empirical entropy of the text, H_h. In particular, we provide a tight analysis of the Burrows-Wheeler transform (bwt), establishing a bound of nH_h + M(T,Σ,h) bits, where M(T,Σ,h) denotes the asymptotic number of bits required to store the empirical statistical model for contexts of order h appearing in T. Using the same framework, we also obtain an implementation of the compressed suffix array (csa) which achieves nH_h + M(T,Σ,h) + O(n lg lg n / lg_|Σ| n) bits of space while still retaining competitive full-text indexing functionality. The novelty of the proposed framework lies in its use of the finite set model instead of the empirical probability model (as in previous work), giving us new insight into the design and analysis of our algorithms. For example, we show that our analysis gives improved bounds, since M(T,Σ,h) ≤ min{g′_h lg(n/g′_h + 1), H*_h n + lg n + g″_h}, where g′_h = O(|Σ|^(h+1)) and g″_h = O(|Σ|^(h+1) lg |Σ|^(h+1)) do not depend on the text length n, while H*_h ≥ H_h is the modified hth-order empirical entropy of T. Moreover, we show a strong relationship between a compressed full-text index and the succinct dictionary problem. We also examine the importance of lower-order terms, as these can dwarf any savings achieved by high-order entropy. We report further results and tradeoffs on high-order entropy-compressed text indexes in the paper.
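The Burrows-Wheeler transform analyzed in this abstract can be illustrated with a naive rotation-sort sketch (illustration only; practical implementations compute the transform via suffix arrays, and the `bwt` name and `$` sentinel here are our own conventions, not the paper's code):

```python
def bwt(s: str) -> str:
    """Naive Burrows-Wheeler transform: append a sentinel '$'
    (assumed lexicographically smallest), sort all rotations of
    the string, and return the last column of the sorted table."""
    s += "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(bwt("banana"))  # -> annb$aa  (equal symbols cluster by context)
```

The output is a permutation of the input plus the sentinel, which is why the nH_h bound above is stated in terms of the text's own empirical entropy.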
Move-to-Front, Distance Coding, and Inversion Frequencies Revisited
, 2007
Abstract

Cited by 5 (1 self)
Move-to-Front, Distance Coding and Inversion Frequencies are three somewhat related techniques used to process the output of the Burrows-Wheeler Transform. In this paper we analyze these techniques from the point of view of how effective they are at compressing low-entropy strings, that is, strings which have many regularities and are therefore highly compressible. This is a non-trivial task, since many compressors have non-constant overheads that become non-negligible when the input string is highly compressible. Because of the properties of the Burrows-Wheeler transform, an algorithm that is locally optimal compresses low-entropy strings effectively. Informally, local optimality implies that an algorithm is able to effectively compress an arbitrary partition of the input string. We show that in their original formulation neither Move-to-Front, nor Distance Coding, nor Inversion Frequencies is locally optimal. Then, we describe simple variants of the above algorithms which are locally optimal. To achieve local optimality with Move-to-Front it suffices to combine it with Run-Length Encoding. To achieve local optimality with Distance Coding and Inversion Frequencies we use a novel "escape and re-enter" strategy.
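The Move-to-Front plus Run-Length Encoding combination mentioned in the abstract can be sketched as follows (a hedged illustration, not the paper's variant; the function names and the pair-based run representation are our own):

```python
def mtf_encode(data, alphabet):
    """Move-to-Front: emit each symbol's position in a dynamic table,
    then move that symbol to the front. Runs of equal symbols (as BWT
    output tends to produce) become runs of zeros."""
    table = list(alphabet)
    out = []
    for c in data:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def rle(seq):
    """Run-Length Encoding as (value, run length) pairs; pairing RLE
    with MTF is the local-optimality fix discussed in the abstract."""
    runs = []
    for x in seq:
        if runs and runs[-1][0] == x:
            runs[-1] = (x, runs[-1][1] + 1)
        else:
            runs.append((x, 1))
    return runs

codes = mtf_encode("annb$aa", sorted(set("annb$aa")))
print(codes)       # -> [1, 3, 0, 3, 3, 3, 0]
print(rle(codes))  # -> [(1, 1), (3, 1), (0, 1), (3, 3), (0, 1)]
```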
Post BWT Stages of the . . .
Abstract
The lossless Burrows-Wheeler compression algorithm has received considerable attention over recent years for both its simplicity and effectiveness. It is based on a permutation of the input sequence, the Burrows-Wheeler transformation, which groups symbols with a similar context close together. In the original version, this permutation was followed by a Move-To-Front transformation and a final entropy coding stage. Later versions used different algorithms placed after the Burrows-Wheeler transformation, since the following stages have a significant influence on the compression rate. This article describes different algorithms and improvements for these post-BWT stages, including a new context-based approach. Results for compression rates are presented together with compression and decompression times on the Calgary corpus, the Canterbury corpus, the large Canterbury corpus and the Lukas 2D 16-bit medical image corpus.
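The invertibility that makes the Burrows-Wheeler permutation usable as a compression front end can be demonstrated with a naive reconstruction sketch (quadratic time, illustration only; production decoders use the LF-mapping instead, and the `$` sentinel convention is our own assumption):

```python
def ibwt(last: str) -> str:
    """Invert the BWT by repeatedly prepending the last column to the
    sorted table of partial rotations. After n rounds the table holds
    all full rotations; the row ending in '$' is the original text."""
    n = len(last)
    table = [""] * n
    for _ in range(n):
        table = sorted(last[i] + table[i] for i in range(n))
    row = next(r for r in table if r.endswith("$"))
    return row[:-1]

print(ibwt("annb$aa"))  # -> banana
```

Because the transform is a pure permutation, all the compression gain has to come from the post-BWT stages the article surveys.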
Indexing sequences of IEEE 754 double-precision numbers
Abstract
In recent decades, much attention has been paid to the development of succinct data structures to store and/or index text, biological collections, source code, etc. Their success was in most cases due to handling data with a relatively small alphabet and to exploiting a rather skewed distribution (text) or simply the repetitiveness within the source data (source-code repositories, biological sequences of similar individuals). In this work, we face the problem of dealing with collections of floating-point data, which typically have a large alphabet (a real number hardly ever repeats twice) and a less biased distribution. We present two solutions to store and index such collections. The first one is based on the well-known inverted index. It consumes space around the size of the original collection, providing appealing search times. The second one uses a wavelet tree, which, at the expense of slower search times, obtains slightly better space consumption.
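The wavelet-tree alternative described above can be sketched with a minimal pointer-based tree over integer symbols (an illustrative toy, not the paper's engineered structure; practical versions store succinct bitvectors with constant-time rank rather than plain Python lists):

```python
class WaveletTree:
    """Minimal pointer-based wavelet tree. Each node stores one bit per
    symbol (0 = symbol goes left, 1 = goes right) and splits the
    alphabet range [lo, hi] at its midpoint."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        if lo == hi or not seq:
            self.bits = None          # leaf: a single symbol (or empty)
            return
        mid = (lo + hi) // 2
        self.bits = [1 if x > mid else 0 for x in seq]
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def rank(self, c, i):
        """Count occurrences of symbol c in seq[:i]."""
        if self.bits is None:
            return i
        mid = (self.lo + self.hi) // 2
        ones = sum(self.bits[:i])     # a succinct bitvector makes this O(1)
        if c <= mid:
            return self.left.rank(c, i - ones)
        return self.right.rank(c, ones)

wt = WaveletTree([3, 1, 4, 1, 5, 9, 2, 6])
print(wt.rank(1, 4))  # -> 2  (two 1s among the first four symbols)
```

For the large alphabets of floating-point data, the tree depth grows with the bit width of the symbols, which is the space/time trade-off against the inverted index that the abstract alludes to.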