Zobel J and Moffat A (1995) Adding compression to a full-text retrieval system. Software Practice and Experience, 25(8):891--903.
....experimental results section, no significant decrease of the compression ratio is experienced by using bytes instead of bits. On the other hand, decompression of byte Huffman code is faster than decompression of binary Huffman code. All techniques for efficient encoding and decoding mentioned in [20] can easily be extended to this case. 4. Compression and Decompression Performance. For the experimental results we used literary texts from the TREC collection [11]. We have chosen the following texts: ap: Newswire (1989); doe: short abstracts from DOE publications; fr: Federal Register ....
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software Practice and Experience, 25(8):891--903, 1995.
....will improve. Note also that if the cost of decompressing z bytes exceeds the cost of reading the u - z extra bytes in an uncompressed inverted list, then execution performance will deteriorate. Compression techniques for inverted lists have received a fair amount of attention in the literature [90, 57, 51, 2, 6, 95]. We do not claim anything novel with respect to compression. Rather, for completeness we merely describe how it fits into an overall optimization strategy and give a necessary condition for providing benefit with respect to execution performance. Note that compression clearly has other desirable ....
Zobel, J. and Moffat, A. Adding compression to a full-text retrieval system. In Proc. 15th Australian Computer Science Conf., pages 1077--1089, Hobart, Australia, Jan. 1992.
....is the canonical tree, defined by Schwartz and Kallick [Schwartz and Kallick 1964]. The Huffman tree of Figure 1 is a canonical tree. It allows more efficiency at decoding time with less memory requirement. Many properties of the canonical codes are mentioned in [Hirschberg and Lelewer 1990; Zobel and Moffat 1995; Witten et al. 1999]. 3.1 Byte Oriented Huffman Code. The original method proposed by Huffman [Huffman 1952] is mostly used as a binary code. That is, each symbol of the input stream is coded as a sequence of bits. In this work the Huffman codeword assigned to each text word is a sequence of ....
....bit of each byte is used as follows: the first byte of each codeword has its highest bit set to 1, while the other bytes have their highest bit set to 0. This is useful for direct searching on the compressed text, as explained later. All the techniques for efficient encoding and decoding mentioned in [Zobel and Moffat 1995] can easily be extended to our case. As we show later in the experimental results section, no significant degradation of the compression ratio is experienced by using bytes instead of bits. On the other hand, decompression of byte Huffman code is faster than decompression of binary Huffman code. In ....
[Article contains additional citation context not shown here]
Zobel, J. and Moffat, A. 1995. Adding compression to a full-text retrieval system. Software Practice and Experience 25, 8, 891--903.
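The flag-bit convention described in the excerpt above (the first byte of each codeword has its highest bit set, continuation bytes do not) can be illustrated with a minimal Python sketch. The function name `split_codewords` and the sample bytes are hypothetical and are not taken from the cited papers; the sketch only shows how codeword boundaries could be recovered from the flag bit.

```python
from typing import List


def split_codewords(data: bytes) -> List[bytes]:
    """Split a byte stream into codewords, treating a set high bit as a start marker."""
    codewords = []
    current = bytearray()
    for b in data:
        if b & 0x80:
            # High bit set: this byte begins a new codeword.
            if current:
                codewords.append(bytes(current))
            current = bytearray([b])
        else:
            # High bit clear: continuation byte of the current codeword.
            if not current:
                raise ValueError("stream starts with a continuation byte")
            current.append(b)
    if current:
        codewords.append(bytes(current))
    return codewords
```

Because every codeword start is marked, a scan can begin at any byte offset and resynchronize at the next byte with the high bit set, which is presumably what makes direct searching on the compressed text practical.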
....the 8th Annual Symposium on Combinatorial Pattern Matching (CPM 97) and appeared in its proceedings, pp. 65-75. 1. Introduction. The importance and usefulness of Data Compression for Information Retrieval (IR) Systems is today well established, and many authors have commented on it [1, 17, 31, 27]. Large full text IR Systems are indeed voracious consumers of storage space relative to the size of the raw textual database, because not only the text has to be kept, but also various auxiliary files like dictionaries and concordances, which are usually adjoined to the system to make the ....
Zobel J., Moffat A., Adding compression to a full-text retrieval system, Software --- Practice & Experience, 25 (1995) 891--903.
....when the alphabet symbols are words and the source to be compressed is a natural language text. This coding scheme, known as word based Huffman [BSTW86], has important applications in information retrieval systems, where it is used to reduce the storage costs and to improve the search performance [ZM95, MNZBY98b, MNZBY98a]. In fact, the Huffman code construction represents only a small portion of the overall compression times. However, we are investigating alternative schemes to allow editing in compressed text where the Huffman code is rebuilt periodically. Contributions to reduce the Huffman coding construction ....
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software Practice and Experience, 25(8):891--903, 1995.
....input list of n weights with the corresponding codeword lengths in O(n) time. In addition, the worst case compression loss introduced by BRCI codes with respect to unrestricted Huffman codes is proved to be negligible for all practical values of both L and n. 1 Introduction Zobel and Moffat [18] have proposed an innovative compression scheme for full text retrieval purposes of large document databases. Their compression scheme substantially reduces the consumed space and improves the query response time. This last effect is due to the reduction of the disk to memory transfer times of ....
....retrieval purposes of large document databases. Their compression scheme substantially reduces the consumed space and improves the query response time. This last effect is due to the reduction of the disk to memory transfer times of blocks. For this kind of application, the authors recommend [18] the utilization of a semi static word based compression model. Although adaptive models usually provide better compression rates, the semi static models are more suitable for full text retrieval purposes. In semi static models the decoding process can start at intermediate points of the ....
[Article contains additional citation context not shown here]
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software---Practice and Experience, 25(8):891--903, Aug. 1995.
....space. Finally, since our two algorithms have the advantage of not writing at the input buffer during the code calculation, we discuss some applications where this feature is very useful. 1 Introduction. Minimum redundancy coding plays an important role in data compression applications [Gai93, ZM95]. Methods for calculating a set of minimum redundancy codewords that correspond to a set of input symbol weights are of great interest. For an input list of n symbol weights, the well known Huffman's algorithm generates a set of codewords in O(n log n) time and O(n) extra space [Huf52]. These ....
Justin Zobel and Alistair Moffat. Adding compression to a full-text retrieval system. Software---Practice and Experience, 25(8):891--903, August 1995.
....in O(n) time using O(min{l_1^2, n}) additional space, where l_1 is the length of the longest codeword. Keywords: Prefix Codes, Huffman Trees, Data Compression. 1 Introduction. Minimum redundancy coding plays an important role in data compression applications [Gai93, ZM95]. Methods for calculating a set of minimum redundancy codewords that correspond to a set of input symbol weights are of great interest. For an input list of n symbol weights, the well known Huffman's algorithm generates a set of codewords in O(n log n) time and O(n) extra space [Huf52]. Leeuwen ....
Justin Zobel and Alistair Moffat. Adding compression to a full-text retrieval system. Software---Practice and Experience, 25(8):891--903, August 1995.
....variation of this problem, that is, for a fixed $L \ge \lceil \log_2 n \rceil$, we must minimize $\sum_{i=1}^{n} w_i l_i$ constrained to $l_i \le L$ for $i = 1, \ldots, n$. We also assume that the weights $w_1, \ldots, w_n$ are given in sorted order. It is worth mentioning that Zobel and Moffat [26] showed that fast and space economical algorithms for calculating length restricted prefix codes can be very useful if a word based model is used to compress very large text collections. In this case, each distinct word in the source text is treated as one symbol, which often produces very large ....
....is a worst case for the method described in section 2.2. The other six lists were extracted from two text collections, referred to as Reuters and Gutenberg in table 1. For these collections, we used two parsing methods. The first one is the word model adopted by the Huffword compression algorithm [26] and generates two lists of weights. It packs any sequence of alphanumeric ASCII codes into a single word symbol that corresponds to one of the weights in the first list. Moreover, each sequence of non alphanumeric ASCII codes is considered a non word symbol, and corresponds to one of the weights ....
Justin Zobel and Alistair Moffat. Adding compression to a full-text retrieval system. Software---Practice and Experience, 25(8):891--903, August 1995.
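The word and non-word parsing described in the excerpt above can be sketched as follows. This is only an illustration under stated assumptions: the function name `parse_words_nonwords` is hypothetical, and "alphanumeric ASCII" is taken here to mean the characters A-Z, a-z, and 0-9. It produces the two frequency lists (words and non-words) that would serve as the weight lists for code construction.

```python
import re
from collections import Counter

# Assumption: a word is a maximal run of ASCII letters or digits,
# and a non-word is a maximal run of anything else.
_TOKEN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]+")
_ALNUM = re.compile(r"[A-Za-z0-9]")


def parse_words_nonwords(text):
    """Return (word_counts, nonword_counts) for the given text."""
    words, nonwords = Counter(), Counter()
    for token in _TOKEN.findall(text):
        if _ALNUM.match(token):
            words[token] += 1
        else:
            nonwords[token] += 1
    return words, nonwords
```

For example, `parse_words_nonwords("to be, or not to be")` counts the word `to` twice and treats the separator `", "` as a single non-word symbol.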
....byte Huffman code is faster than decompression of binary Huffman code. In practice byte processing is much faster than bit processing because bit shifts and masking operations are not necessary at decoding time or at searching time. All techniques for efficient encoding and decoding mentioned in [ZM95] can easily be extended to this case. 4 Compression and Decompression Performance. For the experimental results we used literary texts from the TREC collection [Har95]. We have chosen the following texts: ap: Newswire (1989); doe: short abstracts from DOE publications; fr: Federal Register ....
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software Practice and Experience, 25(8): 891--903, 1995.
....mesh representation to be present in the memory. Finally, there exists a large body of research involving compression domain processing of different multimedia data types. This includes direct manipulation of compressed image [SS95, SR93], compressed video [AHC93, CM95], compressed text [ZM95, VC97], and compressed volume [LGK97, C 97] data. 5.2 Triangle Mesh Editing Operations. In this subsection, we describe the set of editing operations that the proposed Compression Domain Mesh Editing (CDE) scheme supports. We believe they form a representative set of operations supported ....
J. Zobel and A. Moffat. Adding Compression to a Full-Text Retrieval System. Software - Practice and Experience, 25(8), 1995.
....E. In the second case above where the decompression is more expensive than the read, ROIO will always suffer, regardless of the contribution of E, since it would have been cheaper to read in an uncompressed version. Compression techniques for inverted lists have received a fair amount of attention [58, 36, 30, 1, 5, 63]. I do not claim anything novel with respect to compression. Rather, for completeness I merely describe how it fits into my optimization strategy and give a necessary condition for providing benefit with respect to execution performance. I also note that compression clearly has other desirable ....
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. In Proc. 15th Australian Comp. Sci. Conf., pages 1077--1089, Hobart, Australia, Jan. 1992.
....databases does not always succeed. Because of this, many textual databases schemes do not compress the text, while compression is left to data that is not to be queried. Approaches to combine text compression and indexing techniques using inverted lists have recently received some attention [MB95, WBN92, ZM95]. However, work on combining compression techniques and suffix arrays has not been pursued. Suffix arrays [MM90] or Pat arrays [Gon87, GBYS92] are indexing structures that achieve space and time complexity similar to inverted lists. Their main drawback is their costly construction and maintenance ....
....Models and Text Compression Assumptions that are normally made when designing a general compression scheme are not valid for textual databases. For example, the need of direct access to parts of the text immediately rules out adaptive models, which are pervasive in modern compression schemes [ZM95]. Adaptive models start with no information about the text and progressively learn about its statistical distribution as the compression process goes on. They are one pass and store no additional information apart from the compressed data. In the long term, they converge to the true statistical ....
[Article contains additional citation context not shown here]
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software Practice and Experience, 25(8): 891-903, 1995.
....used to augment the RAM, hopefully without sacrificing essential functionality or performance. Applicability of Compression Techniques. Increasing the amount of data it is possible to process with a given amount of memory is particularly attractive when using a RAM based approach. Zobel and Moffat [15] describe the use of a semi static, Huffman derived compression scheme which has many desirable properties for this application. It reduces space required by a large factor (greater than 3 in a quoted example) yet, unlike adaptive schemes, allows decompression to begin at intermediate points in ....
J. Zobel and A. Moffat `Adding Compression To A Full-Text Retrieval System', in Proceedings of the Fifteenth Australian Computer Science Conference, Hobart, Australia, pp 1077-1089 (Jan 1992).
....quite large. The preferred choice for most applications is the canonical tree, defined by Schwartz and Kallick [SK64]. The Huffman tree of Figure 1 is a canonical tree. It allows more efficiency at decoding time with less memory requirement. Many properties of the canonical codes are mentioned in [HL90, ZM95]. 3.1 Byte Oriented Huffman Code. The original method proposed by Huffman [Huf52] is mostly used as a binary code. In our work the Huffman code assigned to each text word is a sequence of whole bytes and the Huffman tree has degree 256 instead of 2. All techniques for efficient encoding and ....
....Huffman Code The original method proposed by Huffman [Huf52] is mostly used as a binary code. In our work the Huffman code assigned to each text word is a sequence of whole bytes and the Huffman tree has degree 256 instead of 2. All techniques for efficient encoding and decoding mentioned in [ZM95] can easily be extended to our case. As we show later in the experimental results section no significant degradation of the compression ratio is experienced by using bytes instead of bits. On the other hand, decompression of byte Huffman code is faster than decompression of binary Huffman code. In ....
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software Practice and Experience, 25(8): 891--903, 1995.
....canonical Huffman coding, the tree is not stored and decompression is much faster than traditional implementations [10, 21]. Semi static compression has been successfully integrated into text information retrieval systems, resulting in savings in both space requirements and query evaluation costs [1, 18, 20, 21, 22]. The compression techniques used are relatively simple (Huffman coding for text, and integer coding techniques [20] for indexes) but the savings are dramatic. Index compression in particular is widely used in commercial systems from search engines such as Google to content managers such as ....
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software Practice and Experience, 25(8):891--903, 1995.
....CITRI, Department of Computer Science, RMIT, 723 Swanston St, Carlton, Victoria 3053; jz@kbs.citri.edu.au. In the last two years we have developed techniques that allow the text to be stored compressed and provide fast access to document collections via compressed indexes [2, 15, 16]; create indexes with an in memory inversion technique [11]; and permit economical ranking of large document collections [12, 17]. These techniques rely on the large main memories of modern machines to store the vocabulary of the document collection, allowing rapid access to index information and ....
....each compressed document starts. Because the compression is performed with a static model, documents can be decompressed individually, given only the address of the compressed document. For further details of the scheme used the reader is referred to the description in [15]. The decompression process is very fast. Huffman codes are particularly amenable to a table based look up and copy implementation, and each such decoding step costs only a handful of shift and test operations per bit of input and generates several output bytes. Furthermore, the particular ....
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. In Proc. Australasian Computer Science Conf., pages 1077-1089, Hobart, Australia, January 1992.
....as part of the data, is essential to the efficient resolution of queries. For document databases, compression schemes can allow retrieval of stored text to be faster than when uncompressed, since the computational cost of decompression can be offset by reductions in disk seeking and transfer costs [1]. In this paper we explore whether similar gains are available for numeric data. We have implemented several integer coding schemes and evaluated them on collections derived from large indexes and scientific data sets. For comparison, we contrast our results with the theoretical space requirement, ....
....length overall. Adaptive schemes (where the model evolves as the data is processed) are currently favoured for general-purpose compression [5, 6] and are the basis of utilities such as compress. However, because databases are divided into small records that must be independently decompressible [1], adaptive techniques are generally not effective. Moreover, the requirement of atomic decompression precludes the application of vertical compression techniques, such as the READ compression commonly used in images [7], that take advantage of differences between adjacent records. For text, for ....
[Article contains additional citation context not shown here]
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software---Practice and Experience, 25(8):891--903, 1995.
....saves space, but can reduce query evaluation costs. Some years ago, Zobel and Moffat observed experimentally that, for sequential retrieval of data, compression could lead to a net penalty in retrieval time, but for random access patterns typical of text database systems compression allowed savings [13]. However, because of the changes in hardware since those experiments, we would expect that compression would today always lead to savings in retrieval time, and have observed better throughput due to compression in the context of index processing [10] and genomic retrieval [9]. For standard ....
....is shown in the first line of Table 1. A word model yields much better compression, at the cost of requiring more memory. The performance of the word model is shown in the second line of Table 1 (and in Figure 1). Word models are seen as providing good compression efficiency for text databases [12, 13], and Huffman coding provides compression within 1 of the optimum because the probabilities are relatively small. Other compression schemes, such as predictive modelling, can yield somewhat better compression, but are inherently slow, rely on large models [1], and are at their best when adaptive ....
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software---Practice and Experience, 25(8):891--903, 1995.
....secondary storage media means that it is now possible for an application to execute faster with embedded compression than without. That is, with a careful choice of methods, data can now often be decoded within the extra time that would, in the absence of compression, be spent fetching more of it [1]. The range of disk doubling software tools available for the PC market, and their widespread popularity, is further testament to this fact. And as machines become more and more cache reliant the same effects will be observed in main memory too the only place that data or program code is ....
Zobel, J. and Moffat, A. (1995) Adding compression to a full-text retrieval system. Software---Practice and Experience, 25 891--903.
....for each record. These effects are sufficiently great that the savings include ample time for accessed records to be decoded. That is, compression (with an appropriate choice of mechanism, of course) can actually reduce the elapsed time needed to access documents out of a large collection (Zobel & Moffat, 1995). Collection Extension. One operation that is required in a document database that is not germane to most other compression applications is the need to allow for the growth of the collection as new documents are appended. In the semi static word based model described above new documents can be ....
Zobel, J., & Moffat, A. (1995). Adding compression to a full-text retrieval system. Software---Practice and Experience, 25(8):891--903.
....documents or easily segmented parts. It is necessary that these divisions can be decompressed independently. In general, adaptive techniques are not effective for database applications, since they code data as a function of both the preceding symbols and the initial probability distribution [23]. Similarly, an adaptive code, such as arithmetic coding, is not practical for database compression as it is too slow [2]. An adaptive code would have to encode each record individually to maintain independent decompressibility, thereby restricting the compression gains. To allow atomic ....
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software Practice and Experience, 25(8):891--903, 1995.
....costs are improved by better using communication channel bandwidths. It has also been shown that improving data transfer, in exchange for decompression time spent by the CPU, can result in significant decreases in the overall time for retrieving data, by reducing the transfer costs from disk [22]. Irrespective of the method and approach used to compress a sequence of data, compression involves two different activities: modelling and coding [16]. The modelling process constructs a model for the data that represents distinct symbols. It also provides an estimate of the likelihood of a ....
....model makes good use of specific properties of the data while at the same time remains static during decoding to allow independent decompression. The disadvantage of a semi static model is that two passes of the data are required and the model parameters need to be stored with the compressed data [22]. The Huffman compression algorithm is a typical example of this. The next section discusses coding, that is, using a model to produce a compressed representation of the data. 2.3 Coding Coding is the second activity of the compression, a process that involves the mapping of symbols to different ....
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software--Practice and Experience, 25:891--903, August 1995.
....data. One aim of compression is to reduce storage requirements [5]. For text databases, however, compression schemes can allow retrieval of data to be faster than with uncompressed data, since the computational cost of decompression can be offset by reductions in disk seeking and transfer costs [6, 11]. Popular compression algorithms, such as gzip and compress, significantly reduce the storage space required by general purpose database Proceedings of the 1998 Computer Science Postgraduate Students Conference, Royal Melbourne Institute of Technology, Melbourne, Australia, December 8, 1998. ....
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software--- Practice and Experience, Volume 25, Number 8, pages 891--903, August 1995.
....we had relevance judgments [17]. The effect of applying these techniques to the TREC collection is described in Section 6. 4 Text compression. Using a word based model of the text, the space required to store the documents comprising a database can be reduced to less than 30% of the original size [2, 10, 14, 15]. Each word occurrence in the text is replaced by a canonical Huffman code [7], the length of which is dependent on the frequency of the word, and the intervening non words are similarly coded against a vocabulary of non words. Thus, by alternately decoding a word, then a non word, and so on ....
....the address at which each compressed document starts. Because the compression is performed with a static model, documents can be decompressed individually, given only the address of the compressed document. For further details of the scheme used the reader is referred to the description in [15]. The decompression process is very fast. The canonical Huffman code used is particularly amenable to a table based look up and copy implementation, and each such decoding step generates several output bytes. As a result, the compression regime has only limited impact on retrieval time. ....
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. In Australian Computer Science Conf., pages 1077--1089, Hobart, Australia, January 1992.
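The excerpts above refer to canonical Huffman codes without showing how codewords follow from code lengths. A minimal sketch of one common convention is given below: symbols are ordered by code length and then by symbol, and codewords are assigned as consecutive binary values. The function name `canonical_codes` and the sample lengths are hypothetical, and the cited work may order or number its codewords differently.

```python
def canonical_codes(lengths):
    """Assign canonical prefix codewords given a dict of symbol -> code length in bits.

    Assumes the lengths come from a valid prefix code, such as the output
    of Huffman's algorithm.
    """
    codes = {}
    code = 0
    prev_len = 0
    # Shorter codewords first; ties broken by symbol so the result is deterministic.
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)   # widen the counter when the length grows
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes
```

Because the codewords are determined entirely by the lengths, a decoder needs only the length of each symbol's code rather than an explicit tree, which is consistent with the lower memory requirement the excerpts mention.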
No context found.
Zobel J and Moffat A (1995) Adding compression to a full-text retrieval system. Software Practice and Experience, 25(8):891--903.