Results 1 - 10 of 133
Compressing XML with Multiplexed Hierarchical PPM Models
- In Data Compression Conference, 2001
"... this paper, we will describe alternative approaches to XML compression that illustrate other tradeos between speed and eectiveness. We describe experiments using several text compressors and XMILL to compress a variety of XML documents. Using these as a benchmark, we describe our two main results: a ..."
Abstract
-
Cited by 105 (3 self)
- Add to MetaCart
(Show Context)
In this paper, we will describe alternative approaches to XML compression that illustrate other tradeoffs between speed and effectiveness. We describe experiments using several text compressors and XMILL to compress a variety of XML documents. Using these as a benchmark, we describe our two main results: an online binary encoding for XML called Encoded SAX (ESAX) that compresses better and faster than existing methods; and an online, adaptive, XML-conscious encoding based on Prediction by Partial Match (PPM) [5] called Multiplexed Hierarchical Modeling (MHM) that compresses up to 35% better than any existing method but is fairly slow. First, of course, we need to describe XML in more detail.
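
The core idea behind multiplexed modeling is to route each syntactic class of the document (element names, attribute values, character data) to its own adaptive model. A minimal sketch, assuming an order-1 adaptive character model and a 256-symbol alphabet; the class and function names are illustrative, not the paper's MHM implementation:

    import math
    from collections import defaultdict

    class AdaptiveCharModel:
        """Order-1 adaptive character model with add-one smoothing."""
        def __init__(self):
            self.counts = defaultdict(lambda: defaultdict(int))
            self.totals = defaultdict(int)

        def cost_and_update(self, context, ch):
            # Bits needed to code `ch` after `context`, then update counts.
            c = self.counts[context][ch] + 1      # add-one smoothing
            t = self.totals[context] + 256        # assumed 256-symbol alphabet
            bits = -math.log2(c / t)
            self.counts[context][ch] += 1
            self.totals[context] += 1
            return bits

    def multiplexed_cost(tokens):
        """tokens: (syntactic_class, text) pairs, e.g. ('tag', 'book').
        Each syntactic class is routed to its own model, so statistics for
        tag names never dilute the model for character data."""
        models = defaultdict(AdaptiveCharModel)
        total_bits = 0.0
        for klass, text in tokens:
            model, context = models[klass], ''
            for ch in text:
                total_bits += model.cost_and_update(context, ch)
                context = ch
        return total_bits

    # Tag names and text content are coded by separate adaptive models.
    toks = [('tag', 'book'), ('text', 'Data Compression'), ('tag', 'book')]
    print(f"total cost: {multiplexed_cost(toks):.1f} bits")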
On prediction using variable order Markov models
- Journal of Artificial Intelligence Research, 2004
"... This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Cont ..."
Abstract
-
Cited by 105 (1 self)
- Add to MetaCart
(Show Context)
This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a “decomposed” CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
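
The family of predictors evaluated here can be approximated in a short sketch: a back-off Markov model that trains counts for every context length up to a maximum order, predicts from the longest matching context, and is scored by average log-loss. This is a simplification (PPM's escape mechanism is replaced by add-one smoothing and plain back-off), and all names are illustrative:

    import math
    from collections import defaultdict

    class BackoffMarkov:
        """Variable-order Markov predictor: trains counts for every context
        length up to `max_order` and predicts from the longest context with
        data. A simplification of PPM-style escape mechanisms."""
        def __init__(self, max_order=3):
            self.max_order = max_order
            self.counts = defaultdict(lambda: defaultdict(int))

        def train(self, seq):
            for i, ch in enumerate(seq):
                for k in range(self.max_order + 1):
                    if i >= k:
                        self.counts[seq[i - k:i]][ch] += 1

        def prob(self, context, ch, alphabet_size):
            # Back off to the longest context with counts; add-one smoothing.
            for k in range(min(self.max_order, len(context)), -1, -1):
                ctx = context[len(context) - k:]
                if ctx in self.counts:
                    dist = self.counts[ctx]
                    return (dist[ch] + 1) / (sum(dist.values()) + alphabet_size)
            return 1.0 / alphabet_size

    def average_log_loss(model, seq, alphabet_size):
        """Average log-loss in bits per symbol, the paper's comparison metric."""
        return -sum(math.log2(model.prob(seq[:i], ch, alphabet_size))
                    for i, ch in enumerate(seq)) / len(seq)

    train_seq, test_seq = "abracadabra" * 20, "abracadabra"
    m = BackoffMarkov(max_order=3)
    m.train(train_seq)
    print(f"{average_log_loss(m, test_seq, len(set(train_seq))):.3f} bits/symbol")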
Spam filtering using statistical data compression models
- Journal of Machine Learning Research, 2006
"... Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task call ..."
Abstract
-
Cited by 72 (12 self)
- Add to MetaCart
(Show Context)
Spam filtering poses a special problem in text categorization, whose defining characteristic is that filters face an active adversary that constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. By modeling messages as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We evaluate the filtering performance of two different compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
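
The compression-based classification step can be imitated with an off-the-shelf compressor: append the message to each class's training text and assign it to the class whose compressed size grows least. In the sketch below, zlib stands in for the adaptive DMC and PPM models the paper actually evaluates, and the toy corpora are invented for illustration:

    import zlib

    def compressed_size(data: bytes) -> int:
        return len(zlib.compress(data, 9))

    def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
        """Minimum cross-entropy classification, approximated: assign the
        message to the class whose corpus it extends with the fewest extra
        compressed bytes. zlib stands in for the adaptive DMC/PPM models."""
        msg = message.encode()

        def increment(corpus: str) -> int:
            base = corpus.encode()
            return compressed_size(base + msg) - compressed_size(base)

        return "spam" if increment(spam_corpus) < increment(ham_corpus) else "ham"

    # Toy corpora; a real filter trains on thousands of labeled messages.
    spam = "win cash now free offer click here winner prize " * 50
    ham = "meeting notes attached please review before thursday " * 50
    print(classify("claim your free prize now", spam, ham))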
Models of English text
- 1997
"... The problem of constructing models of English text is considered. A number of applications of such models including cryptology, spelling correction and speech recognition are reviewed. The best current models of English text have been the result of research into compression. Not only is this an impo ..."
Abstract
-
Cited by 55 (8 self)
- Add to MetaCart
The problem of constructing models of English text is considered. A number of applications of such models including cryptology, spelling correction and speech recognition are reviewed. The best current models of English text have been the result of research into compression. Not only is this an important application of such models, but the amount of compression provides a measure of how well such models perform. Three main classes of models are considered: character based models, word based models, and models which use auxiliary information in the form of parts of speech. These models are compared in terms of their memory usage and compression.
Universal Lossless Source Coding With the Burrows Wheeler Transform
- IEEE Transactions on Information Theory, 2002
"... The Burrows Wheeler Transform (BWT) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes ar ..."
Abstract
-
Cited by 44 (4 self)
- Add to MetaCart
The Burrows Wheeler Transform (BWT) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: 1) statistical characterizations of the BWT output on both finite strings and sequences of length n, 2) a variety of very simple new techniques for BWT-based lossless source coding, and 3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
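
The pipeline the paper analyzes is easy to state concretely: a BWT followed by a code that exploits the transform's clustering of symbols with similar contexts, classically move-to-front ahead of an entropy coder. A naive sketch; real coders build the transform with suffix arrays rather than sorting explicit rotations:

    def bwt(s: str) -> tuple[str, int]:
        """Naive Burrows-Wheeler Transform: sort all rotations and take the
        last column. O(n^2 log n); practical coders use suffix arrays."""
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations), rotations.index(s)

    def move_to_front(s: str) -> list[int]:
        """Move-to-front coding: maps recently seen symbols to small
        integers, exploiting the run-heavy structure of BWT output."""
        table = sorted(set(s))
        out = []
        for ch in s:
            i = table.index(ch)
            out.append(i)
            table.insert(0, table.pop(i))
        return out

    # BWT groups symbols that share a context, so the MTF output is skewed
    # toward small values, which a final entropy coder compresses well.
    last_column, index = bwt("banana")
    print(last_column, index, move_to_front(last_column))  # nnbaaa 3 [2, 0, 2, 2, 0, 0]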
Extended Application of Suffix Trees to Data Compression
- In Data Compression Conference, 1996
"... A practical scheme for maintaining an index for a sliding window in optimal time and space, by use of a suffix tree, is presented. The index supports location of the longest matching substring in time proportional to the length of the match. The total time for build and update operations is proporti ..."
Abstract
-
Cited by 44 (4 self)
- Add to MetaCart
(Show Context)
A practical scheme for maintaining an index for a sliding window in optimal time and space, by use of a suffix tree, is presented. The index supports location of the longest matching substring in time proportional to the length of the match. The total time for build and update operations is proportional to the size of the input. The algorithm, which is simple and straightforward, is presented in detail. The most prominent lossless data compression scheme, when considering compression performance, is prediction by partial matching with unbounded context lengths (PPM*). However, previously presented algorithms are hardly practical, considering their extensive use of computational resources. We show that our scheme can be applied to PPM*-style compression, obtaining an algorithm that runs in linear time, and in space bounded by an arbitrarily chosen window size. Application to Ziv-Lempel '77 compression methods is straightforward and the resulting algorithm runs in linear time.
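
The query at the heart of the scheme is simple to state: given the current window, find the longest prefix of the lookahead that occurs in it. The naive stand-in below shows only the interface; the paper's contribution is a sliding-window suffix tree that answers the same query in time proportional to the match length, with linear total build and update time:

    def longest_match(window: str, lookahead: str) -> tuple[int, int]:
        """Longest prefix of `lookahead` occurring in `window`, returned as
        (offset, length). This naive scan costs O(|window| * match length);
        the sliding-window suffix tree answers the same query in time
        proportional to the match length alone."""
        best_off, best_len = 0, 0
        for i in range(len(window)):
            length = 0
            while (length < len(lookahead) and i + length < len(window)
                   and window[i + length] == lookahead[length]):
                length += 1
            if length > best_len:
                best_off, best_len = i, length
        return best_off, best_len

    # LZ77-style use: replace the lookahead with an (offset, length) reference.
    print(longest_match("abracadabra", "abrace"))  # (0, 5)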
Boosting textual compression in optimal linear time
- Journal of the ACM, 2005
"... Abstract. We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression Extended abstracts related to this article appeared in Proceedings of CPM 2001 and Proceedings of ACM-SIAM SOD ..."
Abstract
-
Cited by 42 (19 self)
- Add to MetaCart
We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with better compression performance. (Extended abstracts related to this article appeared in the Proceedings of CPM 2001 and of ACM-SIAM SODA 2004, and were combined due to their strong relatedness and complementarity.)
Evaluating Next-Cell Predictors with Extensive Wi-Fi Mobility Data
- IEEE Transactions on Mobile Computing, 2004
"... Location is an important feature for many applications, and wireless networks can better serve their clients by anticipating client mobility. As a result, many location predictors have been proposed in the literature, though few have been evaluated with empirical evidence. This paper reports on th ..."
Abstract
-
Cited by 41 (4 self)
- Add to MetaCart
Location is an important feature for many applications, and wireless networks can better serve their clients by anticipating client mobility. As a result, many location predictors have been proposed in the literature, though few have been evaluated with empirical evidence. This paper reports on the results of the first extensive empirical evaluation of location predictors, using a two-year trace of the mobility patterns of over 6,000 users on Dartmouth's campus-wide Wi-Fi wireless network.
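
A representative member of the Markov predictor family evaluated in such studies can be sketched briefly: predict the access point most often seen after the last two cells in the user's history. The class and trace below are illustrative, not the paper's code or data:

    from collections import Counter, defaultdict

    class NextCellPredictor:
        """Order-2 Markov predictor over access-point symbols: predict the
        AP most often observed after the last two cells in the history."""
        def __init__(self):
            self.table = defaultdict(Counter)

        def update(self, history, next_ap):
            if len(history) >= 2:
                self.table[tuple(history[-2:])][next_ap] += 1

        def predict(self, history):
            if len(history) >= 2:
                dist = self.table[tuple(history[-2:])]
                if dist:
                    return dist.most_common(1)[0][0]
            return None   # no prediction; fallback policies vary by scheme

    # Online trace-driven evaluation: predict, score, then learn the event.
    trace = ["AP1", "AP2", "AP3", "AP1", "AP2", "AP3", "AP1", "AP2"]
    predictor, hits, total = NextCellPredictor(), 0, 0
    for i, ap in enumerate(trace):
        guess = predictor.predict(trace[:i])
        if guess is not None:
            total += 1
            hits += guess == ap
        predictor.update(trace[:i], ap)
    print(f"accuracy: {hits}/{total}")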
The Entropy of English Using PPM-Based Models
- In Data Compression Conference, 1996
"... this paper is to show that the difference between the best machine models and human models is smaller than might be indicated by these results. This follows from a number of observations: firsfly, the original human experiments used only 27 character English (letters plus space) against full 128 cha ..."
Abstract
-
Cited by 41 (6 self)
- Add to MetaCart
(Show Context)
The aim of this paper is to show that the difference between the best machine models and human models is smaller than might be indicated by these results. This follows from a number of observations: firstly, the original human experiments used only 27-character English (letters plus space) against full 128-character ASCII text for most computer experiments; secondly, using large amounts of priming text substantially improves PPM's performance; and thirdly, the PPM algorithm can be modified to perform better for English text. The result of this is machine performance down to 1.46 bpc.
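
The effect of priming on a character model's bits-per-character (bpc) estimate can be demonstrated with a toy stand-in for PPM, assuming a fixed order-2 context and add-one smoothing (PPM itself blends several orders via escape probabilities):

    import math
    from collections import defaultdict

    def bits_per_char(priming: str, test: str, order: int = 2) -> float:
        """Estimate bits per character (bpc) of `test` under an adaptive
        order-`order` character model, optionally primed on `priming`.
        A toy stand-in for PPM showing why priming lowers the estimate."""
        counts = defaultdict(lambda: defaultdict(int))
        alphabet = set(priming) | set(test)

        def run(seq, score):
            bits = 0.0
            for i, ch in enumerate(seq):
                ctx = seq[max(0, i - order):i]
                dist = counts[ctx]
                if score:
                    p = (dist[ch] + 1) / (sum(dist.values()) + len(alphabet))
                    bits -= math.log2(p)
                dist[ch] += 1
            return bits

        run(priming, score=False)      # prime the model without scoring
        return run(test, score=True) / len(test)

    text = "the quick brown fox jumps over the lazy dog " * 5
    print(f"unprimed: {bits_per_char('', text):.2f} bpc")
    print(f"primed:   {bits_per_char(text, text):.2f} bpc")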
Text Classification and Segmentation Using Minimum Cross-Entropy
- 2000
"... Several methods for classifying and segmenting text are described. These are based on ranking text sequences by their cross-entropy calculated using a fixed order character-based Markov model adapted from the PPM text compression algorithm. Experimental results show that the methods are a signi cant ..."
Abstract
-
Cited by 30 (0 self)
- Add to MetaCart
Several methods for classifying and segmenting text are described. These are based on ranking text sequences by their cross-entropy, calculated using a fixed-order character-based Markov model adapted from the PPM text compression algorithm. Experimental results show that the methods are a significant improvement over previously used methods in a number of areas. For example, text can be classified with a very high degree of accuracy by authorship, language, dialect and genre. Highly accurate text segmentation is also possible: the accuracy of the PPM-based Chinese word segmenter is close to 99% on Chinese news text; similarly, a PPM-based method of segmenting text by language achieves an accuracy of over 99%.
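
The ranking criterion is cross-entropy, H(T | M) = -(1/n) * sum_i log2 P(c_i | context_i), computed under a fixed-order character model trained for each class; the text is assigned to the model that minimizes it. A sketch with invented toy corpora (real uses train on far larger samples):

    import math
    from collections import defaultdict

    def train(text: str, order: int = 2):
        """Fixed-order character Markov model, as adapted from PPM."""
        counts = defaultdict(lambda: defaultdict(int))
        for i, ch in enumerate(text):
            counts[text[max(0, i - order):i]][ch] += 1
        return counts

    def cross_entropy(model, text: str, order: int = 2) -> float:
        """H(text | model) in bits per character; the text is assigned to
        whichever class model minimizes this quantity."""
        bits = 0.0
        for i, ch in enumerate(text):
            dist = model[text[max(0, i - order):i]]
            p = (dist[ch] + 1) / (sum(dist.values()) + 256)  # add-one smoothing
            bits -= math.log2(p)
        return bits / len(text)

    # Toy language identification; real systems train on much larger corpora.
    en = train("the cat sat on the mat and the dog lay by the door " * 20)
    de = train("der hund liegt bei der tuer und die katze sitzt dort " * 20)
    sample = "the dog sat by the door"
    scores = {"english": cross_entropy(en, sample), "german": cross_entropy(de, sample)}
    print(min(scores, key=scores.get), scores)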