The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length
Machine Learning, 1996
Cited by 148 (15 self)
We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions generated by general probabilistic automata, we prove that the algorithm we present can efficiently learn distributions generated by PSAs. In particular, we show that for any target PSA, the KL-divergence between the distribution generated by the target and the distribution generated by the hypothesis that the learning algorithm outputs can be made small with high confidence in polynomial time and sample complexity. The learning algorithm is motivated by applications in human-machine interaction. Here we present two applications of the algorithm. In the first one we apply the algorithm in order to construct a model of the English language, and use this model to correct corrupted text. In the second ...
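The idea of a variable memory length can be illustrated with a small sketch. This is not the PSA learning algorithm from the paper; it is a simplified illustration that counts next-symbol frequencies for every context up to a fixed length and predicts from the longest context seen often enough. The names `train_contexts`, `predict_next`, `max_order`, and `min_count` are assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_contexts(text, max_order=3):
    """Count next-symbol frequencies for every context of length 0..max_order."""
    counts = defaultdict(Counter)
    for i, symbol in enumerate(text):
        for k in range(max_order + 1):
            if i - k < 0:
                break  # not enough history for a context of length k
            counts[text[i - k:i]][symbol] += 1
    return counts


def predict_next(counts, history, max_order=3, min_count=2):
    """Predict from the longest suffix of `history` seen at least `min_count` times.

    Returns (context_used, predicted_symbol). The memory length therefore
    varies with the history instead of being fixed in advance.
    """
    for k in range(min(max_order, len(history)), -1, -1):
        context = history[len(history) - k:]
        seen = counts.get(context)
        if seen and sum(seen.values()) >= min_count:
            return context, seen.most_common(1)[0][0]
    return "", None  # nothing observed often enough, not even the empty context


counts = train_contexts("abracadabra", max_order=2)
print(predict_next(counts, "ab", max_order=2))  # uses the full context "ab"
print(predict_next(counts, "ca", max_order=2))  # falls back to the shorter context "a"
```

In this toy run the context "ca" was observed only once, so the predictor shortens its memory to "a", which is the behaviour a variable-length model is meant to capture.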
The Context-Tree Weighting Method: Basic Properties
IEEE Trans. Inform. Theory, 1995
Cited by 120 (10 self)
We describe a sequential universal data compression procedure for binary tree sources that performs the "double mixture." Using a context tree, this method weights in an efficient recursive way the coding distributions corresponding to all bounded memory tree sources, and achieves a desirable coding distribution for tree sources with an unknown model and unknown parameters. Computational and storage complexity of the proposed procedure are both linear in the source sequence length. We derive a natural upper bound on the cumulative redundancy of our method for individual sequences. The three terms in this bound can be identified as coding, parameter, and model redundancy. The bound holds for all source sequence lengths, not only for asymptotically large lengths. The analysis that leads to this bound is based on standard techniques and turns out to be extremely simple. Our upper bound on the redundancy shows that the proposed context-tree weighting procedure is optimal in the sense that it achieves the Rissanen (1984) lower bound.
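The recursive weighting described in this abstract can be sketched in a few lines. The following is a minimal batch version for binary sequences, assuming the commonly used Krichevsky-Trofimov (KT) estimator at each node and equal weights of 1/2; it recomputes everything from counts rather than updating sequentially, so it does not have the linear complexity claimed for the actual procedure. The function names and the convention that `past` supplies the initial context are assumptions of this example.

```python
from fractions import Fraction


def kt_probability(zeros, ones):
    """KT block probability of any binary sequence with the given symbol counts."""
    p = Fraction(1)
    a = b = 0
    for _ in range(zeros):
        p *= Fraction(2 * a + 1, 2 * (a + b + 1))  # P(next = 0) = (a + 1/2) / (a + b + 1)
        a += 1
    for _ in range(ones):
        p *= Fraction(2 * b + 1, 2 * (a + b + 1))  # P(next = 1) = (b + 1/2) / (a + b + 1)
        b += 1
    return p


def ctw_probability(past, bits, depth):
    """Weighted probability of `bits`, given at least `depth` preceding bits in `past`."""
    assert len(past) >= depth
    seq = list(past[-depth:]) + list(bits) if depth else list(bits)
    counts = {}  # context as a tuple, most recent bit first -> [zeros, ones]
    for t in range(depth, len(seq)):
        recent_first = tuple(reversed(seq[t - depth:t]))
        for d in range(depth + 1):
            counts.setdefault(recent_first[:d], [0, 0])[seq[t]] += 1

    def weighted(context):
        zeros, ones = counts.get(context, (0, 0))
        estimate = kt_probability(zeros, ones)
        if len(context) == depth:
            return estimate  # leaf: KT estimate only
        children = weighted(context + (0,)) * weighted(context + (1,))
        return Fraction(1, 2) * estimate + Fraction(1, 2) * children

    return weighted(())


print(ctw_probability([0], [0, 1], depth=1))
```

Exact fractions are used so that the result can be checked by hand; a practical coder would work with logarithms or fixed-precision arithmetic instead.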
LeZi-Update: An Information-Theoretic Approach to Track Mobile Users in PCS Networks
1999
Cited by 94 (11 self)
The complexity of the mobility tracking problem in a cellular environment has been characterized under an information-theoretic framework. Shannon’s entropy measure is identified as a basis for comparing user mobility models. By building and maintaining a dictionary of individual user’s path updates (as opposed to the widely used location updates), the proposed adaptive on-line algorithm can learn subscribers’ profiles. This technique evolves out of the concepts of lossless compression. The compressibility of the variable-to-fixed length encoding of the acclaimed Lempel-Ziv family of algorithms reduces the update cost, whereas their built-in predictive power can be effectively used to reduce paging cost.
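The dictionary of path updates mentioned here builds on Lempel-Ziv incremental parsing. As a rough sketch of that parsing step only (not of the full update and paging scheme), the following splits a movement history into LZ78-style phrases. Representing each cell as a single character and the name `lz78_parse` are assumptions of this example.

```python
def lz78_parse(symbols):
    """Split a sequence into LZ78 phrases.

    Each phrase is the shortest prefix of the remaining input that is not
    yet in the dictionary. Returns (phrases, leftover), where `leftover`
    is a trailing fragment that matched an existing phrase and is still open.
    """
    dictionary = set()
    phrases = []
    current = ""
    for s in symbols:
        current += s
        if current not in dictionary:
            dictionary.add(current)
            phrases.append(current)
            current = ""
    return phrases, current


# Hypothetical movement history over two cells "a" and "b".
phrases, leftover = lz78_parse("abababab")
print(phrases, leftover)
```

Under this view, an update would be reported once per completed phrase rather than once per cell crossing, which is where the reduction in update cost comes from.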
Data Compression
ACM Computing Surveys, 1987
Cited by 81 (3 self)
This paper surveys a variety of data compression methods spanning almost forty years of research, from the work of Shannon, Fano and Huffman in the late 40's to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important application in the areas of file storage and distributed systems. Concepts from information theory, as they relate to the goals and evaluation of data compression methods, are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported and possibilities for future research are suggested. INTRODUCTION Data compression is often referred to as coding, where coding is a very general term encompassing any special representation of data which satisfies a given need. Information theory is defined to be the study of eff...
Variable Length Markov Chains
Annals of Statistics, 1999
Cited by 66 (5 self)
We study estimation in the class of stationary variable length Markov chains (VLMC) on a finite space. The processes in this class are still Markovian of higher order, but with memory of variable length yielding a much bigger and structurally richer class of models than ordinary higher order Markov chains. From a more algorithmic view, the VLMC model class has attracted interest in information theory and machine learning but statistical properties have not been explored very much. Provided that good estimation is available, an additional structural richness of the model class enhances predictive power by finding a better trade-off between model bias and variance and allows better structural description which can be of specific interest. The latter is exemplified with some DNA data. A version of the tree-structured context algorithm, proposed by Rissanen (1983) in an information theoretical set-up, is shown to have new good asymptotic properties for estimation in the class of VLMC's, even when the underlying model increases in dimensionality: consistent estimation of minimal state spaces and mixing properties of fitted models are given. We also propose a new bootstrap scheme based on fitted VLMC's. We show its validity for quite general stationary categorical time series and for a broad range of statistical procedures. AMS 1991 subject classifications. Primary 62M05; secondary 60J10, 62G09, 62M10, 94A15 Key words and phrases. Bootstrap, categorical time series, central limit theorem, context algorithm, data compression, finite-memory sources, FSMX model, Kullback-Leibler distance, model selection, tree model. Short title: Variable Length Markov Chain 1 Research supported in part by the Swiss National Science Foundation. Part of the work has been done while visiting th...
Universal Portfolios with Side Information
IEEE Transactions on Information Theory, 1996
Cited by 65 (1 self)
We present a sequential investment algorithm, the μ-weighted universal portfolio with side-information, which achieves, to first order in the exponent, the same wealth as the best side-information dependent investment strategy (the best state-constant rebalanced portfolio) determined in hindsight from observed market and side-information outcomes. This is an individual sequence result which shows that the difference between the exponential growth rates of wealth of the best state-constant rebalanced portfolio and the universal portfolio with side-information is uniformly less than (d/(2n)) log(n + 1) + (k/n) log 2 for every stock market and side-information sequence and for all time n. Here d = k(m - 1) is the number of degrees of freedom in the state-constant rebalanced portfolio with k states of side-information and m stocks. The proof of this result establishes a close connection between universal investment and universal data compression. Keywords: Universal investment, univ...
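To get a feel for how quickly this bound shrinks, it can be evaluated numerically. The sketch below computes (d/(2n)) log(n + 1) + (k/n) log 2 with d = k(m - 1); the use of the natural logarithm and the function name `growth_rate_gap_bound` are assumptions of this example, since the abstract does not state the base.

```python
import math


def growth_rate_gap_bound(n, k, m):
    """Upper bound on the gap in exponential growth rate after n periods.

    n: number of investment periods
    k: number of side-information states
    m: number of stocks
    Assumes natural logarithms (the abstract does not state the base).
    """
    d = k * (m - 1)  # degrees of freedom of the state-constant rebalanced portfolio
    return d / (2 * n) * math.log(n + 1) + k / n * math.log(2)


# Hypothetical setting: 2 side-information states, 3 stocks.
for n in (100, 1000, 10000):
    print(n, growth_rate_gap_bound(n, k=2, m=3))
```

The bound decays roughly like (d/2) log(n)/n, so the per-period penalty for not knowing the best portfolio in advance vanishes as n grows.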
Predicting Nearly as Well as the Best Pruning of a Decision Tree
Machine Learning, 1995
Cited by 64 (5 self)
Many algorithms for inferring a decision tree from data involve a two-phase process: First, a very large decision tree is grown which typically ends up "over-fitting" the data. To reduce over-fitting, in the second phase, the tree is pruned using one of a number of available methods. The final tree is then output and used for classification on test data. In this paper, we suggest an alternative approach to the pruning phase. Using a given unpruned decision tree, we present a new method of making predictions on test data, and we prove that our algorithm's performance will not be "much worse" (in a precise technical sense) than the predictions made by the best reasonably small pruning of the given decision tree. Thus, our procedure is guaranteed to be competitive (in terms of the quality of its predictions) with any pruning algorithm. We prove that our procedure is very efficient and highly robust. Our method can be viewed as a synthesis of two previously studied techniques. First, we ...
The Context Tree Weighting Method: Basic Properties
IEEE Transactions on Information Theory, 1995
Cited by 62 (1 self)
We describe a sequential universal data compression procedure for binary tree sources that performs the "double mixture". Using a context tree, this method weights in an efficient recursive way the coding distributions corresponding to all bounded memory tree sources, and achieves a desirable coding distribution for tree sources with an unknown model and unknown parameters. Computational and storage complexity of the proposed procedure are both linear in the source sequence length. We derive a natural upper bound on the cumulative redundancy of our method for individual sequences. The three terms in this bound can be identified as coding, parameter and model redundancy. The bound holds for all source sequence lengths, not only for asymptotically large lengths. The analysis that leads to this bound is based on standard techniques and turns out to be extremely simple. Our upper bound on the redundancy shows that the proposed context tree weighting procedure is optimal in the sense that i...
Lossless Compression of Continuous-tone Images via Context Selection, Quantization, and Modeling
1996
Cited by 59 (3 self)
Context modeling is an extensively studied paradigm for lossless compression of continuous-tone images. However, without careful algorithm design, high-order Markovian modeling of continuous-tone images is too expensive in both computational time and space to be practical. Furthermore, the exponential growth of the number of modeling states in the order of a Markov model can quickly lead to the problem of context dilution; that is, an image may not have enough samples for good estimates of conditional probabilities associated with the modeling states. In this paper new techniques for context modeling of DPCM errors are introduced that can exploit context-dependent DPCM error structures to the benefit of compression. New algorithmic techniques of forming and quantizing modeling contexts are also developed to alleviate the problem of context dilution and reduce both time and space complexities. By innovative formation, quantization, and use of modeling contexts, the proposed lossless i...
Hypothesis Selection and Testing by the MDL Principle
The Computer Journal, 1998
Cited by 47 (2 self)
ses where the variance is known or taken as a parameter. 1. INTRODUCTION Although the term `hypothesis' in statistics is synonymous with that of a probability `model' as an explanation of data, hypothesis testing is not quite the same problem as model selection. This is because usually a particular hypothesis, called the `null hypothesis', has already been selected as a favorite model and it will be abandoned in favor of another model only when it clearly fails to explain the currently available data. In model selection, by contrast, all the models considered are regarded on the same footing and the objective is simply to pick the one that best explains the data. For the Bayesians certain models may be favored in terms of a prior probability, but in the minimum description length (MDL) approach to be outlined below, prior knowledge of any kind is to be used in selecting the tentative models, which in the end, unlike in the Bayesians' case, can and will be fitted to data

