Results 1-3 of 3
The information bottleneck method, 1999
Abstract

Cited by 536 (35 self)
We define the relevant information in a signal x ∈ X as being the information that this signal provides about another signal y ∈ Y. Examples include the information that face images provide about the names of the people portrayed, or the information that speech sounds provide about the words spoken. Understanding the signal x requires more than just predicting y; it also requires specifying which features of X play a role in the prediction. We formalize this problem as that of finding a short code for X that preserves the maximum information about Y. That is, we squeeze the information that X provides about Y through a ‘bottleneck’ formed by a limited set of codewords X̃. This constrained optimization problem can be seen as a generalization of rate distortion theory in which the distortion measure d(x, x̃) emerges from the joint statistics of X and Y. This approach yields an exact set of self-consistent equations for the coding rules X → X̃ and X̃ → Y. Solutions to these equations can be found by a convergent re-estimation method that generalizes the Blahut–Arimoto algorithm. Our variational principle provides a surprisingly rich framework for discussing a variety of problems in signal processing and learning, as will be described in detail elsewhere.
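The re-estimation scheme mentioned in this abstract can be illustrated with a minimal NumPy sketch. It assumes a discrete joint distribution p(x, y) in which every x has nonzero probability, and alternates the commonly stated self-consistent updates p(x̃|x) ∝ p(x̃) exp(−β KL[p(y|x) ‖ p(y|x̃)]), p(x̃) = Σ_x p(x) p(x̃|x), and p(y|x̃) = Σ_x p(y|x) p(x̃|x) p(x) / p(x̃). The function names, the trade-off parameter `beta`, the random initialization, and the fixed iteration count are illustrative choices, not details taken from the listing.

```python
import numpy as np

def mutual_information(p_ab):
    """Mutual information (in nats) of a discrete joint distribution."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float((p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])).sum())

def _decoder(p_x, p_y_given_x, q_t_given_x, eps):
    """Codeword marginal q(t) and decoder q(y|t) for a given encoder."""
    q_t = p_x @ q_t_given_x
    q_y_given_t = (q_t_given_x * p_x[:, None]).T @ p_y_given_x
    q_y_given_t /= (q_t[:, None] + eps)
    return q_t, q_y_given_t

def information_bottleneck(p_xy, n_codewords, beta, n_iter=200, seed=0):
    """Alternate the self-consistent updates for a soft encoder q(t|x).

    Assumes every row of p_xy has positive total probability.
    """
    eps = 1e-12  # guards the logarithms and divisions
    rng = np.random.default_rng(seed)
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]

    # Random soft assignment of each x to the codewords.
    q_t_given_x = rng.random((p_x.size, n_codewords))
    q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        q_t, q_y_given_t = _decoder(p_x, p_y_given_x, q_t_given_x, eps)
        # KL[p(y|x) || q(y|t)] for every pair (x, t).
        log_ratio = (np.log(p_y_given_x[:, None, :] + eps)
                     - np.log(q_y_given_t[None, :, :] + eps))
        kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # q(t|x) proportional to q(t) * exp(-beta * KL), normalized over t.
        logits = np.log(q_t + eps)[None, :] - beta * kl
        logits -= logits.max(axis=1, keepdims=True)
        q_t_given_x = np.exp(logits)
        q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)

    q_t, q_y_given_t = _decoder(p_x, p_y_given_x, q_t_given_x, eps)
    return q_t_given_x, q_t, q_y_given_t

# Toy joint distribution: x in {0, 1} favors y = 0, x in {2, 3} favors y = 1.
p_xy = np.array([[0.225, 0.025],
                 [0.200, 0.050],
                 [0.050, 0.200],
                 [0.025, 0.225]])
encoder, q_t, q_y_given_t = information_bottleneck(p_xy, n_codewords=2, beta=20.0)
relevant_info_kept = mutual_information(encoder.T @ p_xy)  # I(X̃; Y)
```

Small values of `beta` favor compression and large values favor preserving information about Y; whatever the setting, the information the code retains about Y cannot exceed I(X; Y).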
Predictability, Complexity, and Learning, 2001
Abstract

Cited by 47 (2 self)
We define predictive information Ipred(T) as the mutual information between the past and the future of a time series. Three qualitatively different behaviors are found in the limit of large observation times T: Ipred(T) can remain finite, grow logarithmically, or grow as a fractional power law. If the time series allows us to learn a model with a finite number of parameters, then Ipred(T) grows logarithmically with a coefficient that counts the dimensionality of the model space. In contrast, power-law growth is associated, for example, with the learning of infinite-parameter (or nonparametric) models such as continuous functions with smoothness constraints. There are connections between the predictive information and measures of complexity that have been defined both in learning theory and in the analysis of physical systems through statistical mechanics and dynamical systems theory. Furthermore, in the same way that entropy provides the unique measure of available information consistent with some simple and plausible conditions, we argue that the divergent part of Ipred(T) provides the unique measure for the complexity of dynamics underlying a time series. Finally, we discuss how these ideas may be useful in problems in physics, statistics, and biology.
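The first of the three regimes, in which Ipred(T) remains finite, can be illustrated with a small sketch. It assumes a stationary first-order Markov chain with a unique stationary distribution; for such a chain the past influences the future only through the most recent symbol, so the mutual information between past and future reduces to the mutual information between two consecutive symbols, whatever the window length. The function name and the example transition matrices are hypothetical and not taken from the paper.

```python
import numpy as np

def markov_predictive_information(transition):
    """Past-future mutual information (nats) of a stationary Markov chain.

    `transition[i, j]` is the probability of moving from state i to state j.
    Assumes the chain has a unique stationary distribution.
    """
    P = np.asarray(transition, dtype=float)
    # Stationary distribution: left eigenvector of P with eigenvalue 1.
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    pi = pi / pi.sum()
    # Joint distribution of consecutive symbols; both marginals equal pi.
    joint = pi[:, None] * P
    indep = pi[:, None] * pi[None, :]
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / indep[mask])).sum())

sticky = markov_predictive_information([[0.9, 0.1], [0.1, 0.9]])
memoryless = markov_predictive_information([[0.5, 0.5], [0.5, 0.5]])
```

A chain that tends to repeat its last state carries a positive but bounded amount of predictive information, while independent draws carry none. The logarithmic and power-law regimes arise when parameters must be learned from the data, which this fixed-model sketch does not cover.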
Information theory and learning: a physical approach, 2000
Abstract

Cited by 5 (3 self)
We try to establish a unified information-theoretic approach to learning and to explore some of its applications. First, we define predictive information as the mutual information between the past and the future of a time series, discuss its behavior as a function of the length of the series, and explain how other quantities of interest studied previously in learning theory, as well as in dynamical systems and statistical mechanics, emerge from this universally definable concept. We then prove that predictive information provides the unique measure for the complexity of dynamics underlying the time series and show that there are classes of models characterized by power-law growth of the predictive information that are qualitatively more complex than any of the systems that have been investigated before. Further, we investigate numerically the learning of a nonparametric probability density, which is an example of a problem with power-law complexity, and show that the proper Bayesian formulation of this problem provides for the ‘Occam’ factors that punish overly complex models and thus allow one to learn not only a solution within a specific model class, but also the class itself using the data