
## Skeleton Trees for the Efficient Decoding of Huffman Encoded Texts (1997)

Venue: Information Retrieval

Citations: 14 (7 self)

### Citations

1492 | A universal algorithm for sequential data compression
- Ziv, Lempel
- 1977
Citation Context ... machine with given resources, effectively increasing the size of the database that can still be handled efficiently. Most of the popular compression methods are based on the works of Lempel and Ziv [29, 30], but these are adaptive methods which are not always suitable for IR applications. In the context of full-text retrieval, a large number of small passages is accessed simultaneously, e.g., when produ... |

1336 |
A method for the construction of minimum-redundancy codes
- Huffman
- 1952
Citation Context ... smaller blocks, which would cost us compression efficiency. In both cases, the advantage of using adaptive methods, which often yield better compression than static ones, may be lost. Huffman coding [14] is still one of the best known and most popular static data compression methods. While for certain applications, such as data transmission over a communication channel, both coding and decoding ought... |

973 | Managing Gigabytes: Compressing and Indexing Documents and Images
- Witten, Moffat, et al.
- 1999
Citation Context ... 1. Introduction. The importance and usefulness of Data Compression for Information Retrieval (IR) Systems is today well-established, and many authors have commented on it [1, 17, 31, 27]. Large full-text IR Systems are indeed voracious consumers of storage space relative to the size of the raw textual database, because not only the text has to be kept, but also various auxiliary file... |

933 | Compression of individual sequences via variable-rate coding. IEEE Trans. Info. Theory 24(5): 530–536
- Ziv, Lempel
- 1978
Citation Context ... machine with given resources, effectively increasing the size of the database that can still be handled efficiently. Most of the popular compression methods are based on the works of Lempel and Ziv [29, 30], but these are adaptive methods which are not always suitable for IR applications. In the context of full-text retrieval, a large number of small passages is accessed simultaneously, e.g., when produ... |

181 |
The Psycho-Biology of Language
- Zipf
- 1935
Citation Context ...e weights p_i = 1/(i·H_n), for 1 ≤ i ≤ n, where H_n = Σ_{j=1}^{n} (1/j) is the n-th harmonic number. This law is believed to govern the distribution of the most common words in a large natural language text [28]. A canonical code can be represented by the string ⟨n_1, n_2, …, n_k⟩, called a source, where k denotes, here and below, the length of the longest codeword (the depth of the tree), and n_i is... |
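The Zipf weights quoted in this excerpt can be sketched numerically; computing with exact fractions makes it easy to verify that they form a probability distribution. The function name is illustrative, not from the paper:

```python
from fractions import Fraction

# Sketch of the Zipf weights p_i = 1/(i * H_n) from the excerpt above,
# where H_n = sum_{j=1}^{n} 1/j is the n-th harmonic number.
def zipf_weights(n):
    h_n = sum(Fraction(1, j) for j in range(1, n + 1))
    return [Fraction(1, i) / h_n for i in range(1, n + 1)]

# For n = 4, H_4 = 25/12, so p_1 = 1/H_4 = 12/25, and the weights sum to 1.
print(zipf_weights(4))
```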

169 |
Information Retrieval - Computational and Theoretical Aspects
- Heaps
- 1978
Citation Context ...d Aramaic texts (15 million words) written over the past ten centuries [7]. The first set of alphabets consists of the bigrams in the three languages (the source for English for this distribution was [13]); for the next set, the elements to be encoded are the different words, which yields very large "alphabets"; and the final set contains the distribution of trigrams in French. For completeness, the Z... |

101 | Data Compression - Lelewer, Hirschberg |

91 | Adding compression to a full-text retrieval system. Softw
- ZOBEL, MOFFAT
- 1995
Citation Context ... 1. Introduction. The importance and usefulness of Data Compression for Information Retrieval (IR) Systems is today well-established, and many authors have commented on it [1, 17, 31, 27]. Large full-text IR Systems are indeed voracious consumers of storage space relative to the size of the raw textual database, because not only the text has to be kept, but also various auxiliary file... |

89 | The Art of Computer Programming. Vol. I : Fundamental Algorithms - Knuth - 1968 |

59 |
Variable-length binary encodings
- Gilbert, Moore
- 1959
Citation Context ...ained as follows: the i-th codeword consists of the first ℓ_i bits immediately to the right of the "binary point" in the infinite binary expansion of Σ_{j=1}^{i−1} 2^{−ℓ_j}, for i = 1, …, n [12]. Many properties of canonical codes are mentioned in [15, 3]. The following will be used as a running example in this paper. Consider the probability distribution implied by Zipf's law, defined by th... |
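The codeword rule excerpted above (the i-th codeword is the first ℓ_i bits of the binary expansion of Σ_{j<i} 2^{−ℓ_j}) can be sketched with integer arithmetic; the function name is illustrative, and the lengths are assumed to be given in non-decreasing order, as in a canonical tree:

```python
# Sketch of canonical codeword assignment from sorted codeword lengths,
# following the Gilbert-Moore rule quoted above. Shifting the running
# value left re-aligns the partial sum sum_{j<i} 2^{-l_j} to l_i bits.
def canonical_codewords(lengths):
    codes = []
    code = 0        # partial sum, scaled so its low bit has weight 2^{-prev_len}
    prev_len = 0
    for length in lengths:
        code <<= length - prev_len          # rescale to the new length
        codes.append(format(code, "0{}b".format(length)))
        code += 1                           # add 2^{-length} to the sum
        prev_len = length
    return codes

# Lengths (1, 2, 3, 3) yield the prefix-free codewords 0, 10, 110, 111.
print(canonical_codewords([1, 2, 3, 3]))
```

A side effect of this rule, useful for fast table-driven decoding, is that codewords of equal length are consecutive binary integers.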

51 | Self-synchronizing Huffman codes - Ferguson, Rabinowitz - 1984 |

50 |
On the implementation of minimum redundancy prefix codes
- Moffat, Turpin
- 1997
Citation Context ... i in the Huffman tree. A tree is called canonical if, when scanning its leaves from left to right, they appear in non-decreasing order of their depth (or equivalently, in non-increasing order, as in [22]). The idea is that Huffman's algorithm is only used to generate the lengths {ℓ_i} of the codewords, rather than the codewords themselves; the latter are easily obtained as follows: the i-th codeword... |

48 |
Generating a canonical prefix encoding
- Schwartz, Kallick
- 1964
Citation Context ...ed via Huffman's algorithm. One may thus choose one of the trees that has some additional properties. The preferred choice for many applications is the canonical tree, defined by Schwartz and Kallick [25], and recommended by many others (see, e.g., [15, 27]). Denote by (p_1, …, p_n) the given probability distribution, where we assume that p_1 ≥ p_2 ≥ … ≥ p_n, and let ℓ_i be the le... |

33 | In-situ generation of compressed inverted files - Moffat, Bell - 1995 |

32 | Fast searching on compressed text allowing errors
- Moura, Navarro, et al.
- 1998
Citation Context ... Another possibility to avoid accessing individual bits is by using 256-ary instead of the optimal binary Huffman codes. This obviously reduces the compression efficiency, but de Moura et al. [6] report that the degradation is not significant. In the next section, we recall the necessary definitions of canonical Huffman trees as they are used below. Section 3 presents the new suggested data s... |

32 | Efficient decoding of prefix codes
- Hirschberg, Lelewer
- 1990
Citation Context ...e building the system, whereas decompression is needed during the processing of every query and directly affects response time. There is thus a special interest in fast decoding techniques (see e.g., [15]). The data structures needed for the decoding of a Huffman encoded file (a Huffman tree or lookup table) are generally considered negligible overhead relative to large texts. However, not all texts a... |

29 | Huffman codes and self-information - Katona, Nemetz - 1976 |

22 |
A systematic approach to compressing a full-text retrieval system
- Bookstein, Klein, et al.
- 1992
Citation Context ...fe distributions were used. The data for French was collected from the Trésor de la Langue Française, a database of 680 MB of French language texts (115 million words) of the 17th–20th centuries [4]; for English, the source is 500 MB (87 million words) of the Wall Street Journal [24]; and for Hebrew, a part of the Responsa Retrieval Project, 100 MB of Hebrew and Aramaic texts (15 million words... |

22 | Text compression for dynamic document databases
- Moffat, Zobel, et al.
- 1997
Citation Context ...angue Française, a database of 680 MB of French language texts (115 million words) of the 17th–20th centuries [4]; for English, the source is 500 MB (87 million words) of the Wall Street Journal [24]; and for Hebrew, a part of the Responsa Retrieval Project, 100 MB of Hebrew and Aramaic texts (15 million words) written over the past ten centuries [7]. The first set of alphabets consists of the b... |

18 |
All about the Responsa Retrieval Project you always wanted to know but were afraid to ask, Expanded Summary
- Fraenkel
- 1976
Citation Context ...87 million words) of the Wall Street Journal [24]; and for Hebrew, a part of the Responsa Retrieval Project, 100 MB of Hebrew and Aramaic texts (15 million words) written over the past ten centuries [7]. The first set of alphabets consists of the bigrams in the three languages (the source for English for this distribution was [13]); for the next set, the elements to be encoded are the different word... |

17 |
Space-efficient construction of optimal prefix codes
- Moffat, Turpin, et al.
- 1995
Citation Context ... the "alphabet" to be encoded is not necessarily small, and may, e.g., consist of all the different words in the text, so that Huffman trees with thousands and even millions of nodes are not uncommon [23]. We try, in this paper, to reduce the necessary internal memory space by devising efficient ways to encode these trees. In addition, the new suggested data structure also allows a speed-up of the dec... |

16 | Bounding the depth of search trees
- Fraenkel, Klein
- 1993
Citation Context ...e original Huffman tree would be deeper, it is sometimes convenient to impose an upper limit of B = O(log n) on the depth, which often implies only a negligible loss in compression efficiency [10]. In any case, given a logarithmic bound on the depth, the size of the sk-tree is about log n (log n − log log n). 3.4 Time complexity. When decoding is based on a standard Huffman tree, the avera... |

15 |
Fast decoding of Huffman codes
- Sieminski
- 1988
Citation Context ...method based on large tables constructed in a pre-processing stage is suggested in [5], with the help of which the entire decoding process can be performed using only byte-oriented commands (see also [26]). However, the internal memory required for the storage of these tables may be very large. Another possibility to avoid accessing individual bits is by using 256-ary instead of the optimal bi... |

13 |
An application of informational divergence to Huffman codes
- Longo, Galasso
- 1982
Citation Context ...the probabilities are integral powers of 1/2. There cannot be too great a difference between the actual probability distribution and this dyadic one, since they both yield the same Huffman tree (see [20] for bounds on the "distance" between such distributions). Given this model, eqn. (2) becomes Σ_{i ∈ {leaves in sk-tree}} d_i 2^{−d_i}. A similar sum, but taken over all the leaves of the original... |

10 |
T.: Novel compression of sparse bit-strings
- Fraenkel, Klein
- 1985
Citation Context ...his is true for many real-life distributions, and in particular for all the examples below. On the other hand, the distribution of one of the alphabets used for compressing a set of sparse bitmaps in [8] is ⟨1, 0, 0, 1, 7, 0, 1, 28, 0, 46, 59, 114⟩. All the techniques suggested herein can be easily adapted to the general case using a vector succ(i), giving for each codeword length i, the next larger... |
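The succ(i) vector described in this excerpt can be sketched as follows; the paper only describes the vector, so this helper and its name are hypothetical:

```python
# Sketch of the succ(i) vector from the excerpt above: for a source
# <n_1, ..., n_k> that may contain zero entries, succ(i) is the next
# codeword length greater than i that actually occurs in the source
# (None if no larger length occurs). A single right-to-left scan suffices.
def succ_vector(source):
    succ = {}
    next_used = None
    for i in range(len(source), 0, -1):   # scan lengths k ... 1
        succ[i] = next_used
        if source[i - 1] > 0:             # length i occurs in the source
            next_used = i
    return succ

# For the sparse-bitmap source cited above, the used length after 1 is 4,
# and the used length after 5 is 7 (lengths 6 has n_6 = 0).
src = [1, 0, 0, 1, 7, 0, 1, 28, 0, 46, 59, 114]
print(succ_vector(src)[1], succ_vector(src)[5])
```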

10 |
Deerwester S., Storing Text Retrieval Systems on CD-ROM
- Klein, Bookstein
- 1989
Citation Context ... 1. Introduction. The importance and usefulness of Data Compression for Information Retrieval (IR) Systems is today well-established, and many authors have commented on it [1, 17, 31, 27]. Large full-text IR Systems are indeed voracious consumers of storage space relative to the size of the raw textual database, because not only the text has to be kept, but also various auxiliary file... |

9 | Compression, Information Theory and Grammars: A Unified Approach
- Bookstein, Klein
- 1990
Citation Context ...uffman tree or lookup table) are generally considered negligible overhead relative to large texts. However, not all texts are large, and if Huffman coding is applied in connection with a Markov model [2], the required Huffman forest may become itself a storage problem. Moreover, the "alphabet" to be encoded is not necessarily small, and may, e.g., consist of all the different words in the text, so th... |

5 |
Efficient Variants of Huffman Codes
- Choueka, Klein, et al.
- 1985
Citation Context ...isons. The manipulation of individual bits is indeed the main cause for the slow decoding of Huffman encoded text. A method based on large tables constructed in a pre-processing stage is suggested in [5], with the help of which the entire decoding process can be performed using only byte-oriented commands (see also [26]). However, the internal memory required for the storage of these tables may be ve... |

4 |
Is Huffman coding dead?, Computing 50
- Bookstein, Klein
- 1993
Citation Context ...ℓ_i bits immediately to the right of the "binary point" in the infinite binary expansion of Σ_{j=1}^{i−1} 2^{−ℓ_j}, for i = 1, …, n [12]. Many properties of canonical codes are mentioned in [15, 3]. The following will be used as a running example in this paper. Consider the probability distribution implied by Zipf's law, defined by the weights p_i = 1/(i·H_n), for 1 ≤ i ≤ n, where H_n = Σ_{j=1}^{n} (1... |