Results 1-10 of 134
Compressing XML with Multiplexed Hierarchical PPM Models
In Data Compression Conference, 2001
Abstract

Cited by 105 (3 self)
In this paper, we will describe alternative approaches to XML compression that illustrate other tradeoffs between speed and effectiveness. We describe experiments using several text compressors and XMILL to compress a variety of XML documents. Using these as a benchmark, we describe our two main results: an online binary encoding for XML called Encoded SAX (ESAX) that compresses better and faster than existing methods; and an online, adaptive, XML-conscious encoding based on Prediction by Partial Match (PPM) [5] called Multiplexed Hierarchical Modeling (MHM) that compresses up to 35% better than any existing method but is fairly slow. First, of course, we need to describe XML in more detail.
On prediction using variable order Markov models
Journal of Artificial Intelligence Research, 2004
Abstract

Cited by 105 (1 self)
This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real-life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a “decomposed” CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
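The average log-loss mentioned in this abstract can be illustrated with a minimal sketch. It is not one of the six algorithms the paper evaluates: it assumes a fixed-order Markov predictor with add-one (Laplace) smoothing, and the names `train_counts` and `average_log_loss` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_counts(sequence, order):
    """Count which symbol follows each context of up to `order` symbols."""
    counts = defaultdict(Counter)
    for i, symbol in enumerate(sequence):
        context = sequence[max(0, i - order):i]
        counts[context][symbol] += 1
    return counts


def average_log_loss(counts, sequence, order, alphabet_size):
    """Average log-loss in bits per symbol of the predictor on `sequence`.

    Assumption: add-one smoothing replaces the escape or weighting
    mechanisms that PPM and CTW use for unseen symbols.
    """
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        context = sequence[max(0, i - order):i]
        seen = counts.get(context, Counter())
        probability = (seen[symbol] + 1) / (sum(seen.values()) + alphabet_size)
        total_bits -= math.log2(probability)
    return total_bits / len(sequence)


counts = train_counts("abababab", order=1)
print(average_log_loss(counts, "abab", order=1, alphabet_size=2))
print(average_log_loss(counts, "aabb", order=1, alphabet_size=2))
```

A lower value means the test sequence was better predicted; here the alternating test string scores lower than the one that breaks the training pattern.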
Spam filtering using statistical data compression models
Journal of Machine Learning Research, 2006
Abstract

Cited by 72 (12 self)
Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. By modeling messages as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We evaluate the filtering performance of two different compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
Models of English text
1997
Abstract

Cited by 55 (8 self)
The problem of constructing models of English text is considered. A number of applications of such models including cryptology, spelling correction and speech recognition are reviewed. The best current models of English text have been the result of research into compression. Not only is this an important application of such models but the amount of compression provides a measure of how well such models perform. Three main classes of models are considered: character-based models, word-based models, and models which use auxiliary information in the form of parts of speech. These models are compared in terms of their memory usage and compression.
Extended Application of Suffix Trees to Data Compression
In Data Compression Conference, 1996
Abstract

Cited by 44 (4 self)
A practical scheme for maintaining an index for a sliding window in optimal time and space, by use of a suffix tree, is presented. The index supports location of the longest matching substring in time proportional to the length of the match. The total time for build and update operations is proportional to the size of the input. The algorithm, which is simple and straightforward, is presented in detail. The most prominent lossless data compression scheme, when considering compression performance, is prediction by partial matching with unbounded context lengths (PPM*). However, previously presented algorithms are hardly practical, considering their extensive use of computational resources. We show that our scheme can be applied to PPM*-style compression, obtaining an algorithm that runs in linear time, and in space bounded by an arbitrarily chosen window size. Application to Ziv-Lempel '77 compression methods is straightforward and the resulting algorithm runs in linear time.
Universal Lossless Source Coding With the Burrows Wheeler Transform
IEEE Transactions on Information Theory, 2002
Abstract

Cited by 44 (4 self)
The Burrows Wheeler Transform (BWT) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: 1) statistical characterizations of the BWT output on both finite strings and sequences of length , 2) a variety of very simple new techniques for BWT-based lossless source coding, and 3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
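To make the "reversible sequence transformation" concrete, here is a naive sketch of the BWT and its inverse. It sorts all rotations explicitly, so it is far slower than the suffix-based constructions used in practice, and it is not any of the coding schemes analyzed in the paper. The `$` end-of-string sentinel and the function names are assumptions made for illustration.

```python
def bwt(text, sentinel="$"):
    """Return the last column of the sorted rotations of text + sentinel."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    padded = text + sentinel
    rotations = sorted(padded[i:] + padded[:i] for i in range(len(padded)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Rebuild the input by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    # The original string is the only row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


transformed = bwt("banana")
print(transformed)               # annb$aa
print(inverse_bwt(transformed))  # banana
```

The output groups identical characters together (`nn`, `aa`), which is the ordering that the subsequent source code is meant to exploit.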
Boosting textual compression in optimal linear time
Journal of the ACM, 2005
Abstract

Cited by 42 (19 self)
Abstract. We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression ... Extended abstracts related to this article appeared in Proceedings of CPM 2001 and Proceedings of ACM-SIAM SODA 2004, and were combined due to their strong relatedness and complementarity. The work of P. Ferragina was partially supported by the Italian MIUR projects “Algorithms for the Next ...
The Entropy of English Using PPM-Based Models
In Data Compression Conference, 1996
Abstract

Cited by 41 (6 self)
... this paper is to show that the difference between the best machine models and human models is smaller than might be indicated by these results. This follows from a number of observations: firstly, the original human experiments used only 27-character English (letters plus space) against full 128-character ASCII text for most computer experiments; secondly, using large amounts of priming text substantially improves PPM's performance; and thirdly, the PPM algorithm can be modified to perform better for English text. The result of this is machine performance down to 1.46 bpc.
Evaluating Next-Cell Predictors with Extensive Wi-Fi Mobility Data
IEEE Transactions on Mobile Computing, 2004
Abstract

Cited by 41 (4 self)
Location is an important feature for many applications, and wireless networks can better serve their clients by anticipating client mobility. As a result, many location predictors have been proposed in the literature, though few have been evaluated with empirical evidence. This paper reports on the results of the first extensive empirical evaluation of location predictors, using a two-year trace of the mobility patterns of over 6,000 users on Dartmouth's campus-wide Wi-Fi wireless network.
Text Classification and Segmentation Using Minimum Cross-Entropy
2000
Abstract

Cited by 30 (0 self)
Several methods for classifying and segmenting text are described. These are based on ranking text sequences by their cross-entropy calculated using a fixed-order character-based Markov model adapted from the PPM text compression algorithm. Experimental results show that the methods are a significant improvement over previously used methods in a number of areas. For example, text can be classified with a very high degree of accuracy by authorship, language, dialect and genre. Highly accurate text segmentation is also possible: the accuracy of the PPM-based Chinese word segmenter is close to 99% on Chinese news text; similarly, a PPM-based method of segmenting text by language achieves an accuracy of over 99%.
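The ranking idea in this abstract can be sketched as follows: train one character model per class, then assign a text to the class whose model gives it the lowest cross-entropy. This is only an illustration under stated assumptions, not the paper's method: add-one smoothing over an assumed 256-symbol alphabet stands in for PPM's escape mechanism, the training strings are toy data, and `CharMarkovModel` and `classify` are hypothetical names.

```python
import math
from collections import Counter, defaultdict


class CharMarkovModel:
    """Fixed-order character model with add-one smoothing (an assumption)."""

    def __init__(self, training_text, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        self.counts = defaultdict(Counter)
        for i, ch in enumerate(training_text):
            context = training_text[max(0, i - order):i]
            self.counts[context][ch] += 1

    def cross_entropy(self, text):
        """Average number of bits per character needed to encode `text`."""
        bits = 0.0
        for i, ch in enumerate(text):
            context = text[max(0, i - self.order):i]
            seen = self.counts.get(context, Counter())
            probability = (seen[ch] + 1) / (sum(seen.values()) + self.alphabet_size)
            bits -= math.log2(probability)
        return bits / len(text)


def classify(text, models):
    """Return the label whose model yields the minimum cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(text))


# Toy training data, chosen only to make the example self-contained.
models = {
    "english": CharMarkovModel("the quick brown fox jumps over the lazy dog and the cat"),
    "german": CharMarkovModel("der schnelle braune fuchs springt ueber den faulen hund und die katze"),
}
print(classify("the dog and the fox", models))
```

With realistic training corpora the same `classify` call could be used for authorship, dialect or genre labels, since only the per-class training text changes.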