Kernel methods for measuring independence
- Journal of Machine Learning Research
, 2005
Cited by 25 (13 self)
We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis.
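A minimal sketch of a kernel dependence score in the spirit of the constrained covariance described above. The abstract gives no formula, so the estimator used here (the square root of the largest eigenvalue of the product of centred Gram matrices, divided by the sample size), the Gaussian kernel, and the names `gaussian_gram` and `empirical_coco` are all assumptions made for illustration.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gaussian-kernel Gram matrix for samples x of shape (n,) or (n, d)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def empirical_coco(x, y, sigma=1.0):
    """Assumed empirical constrained covariance: sqrt(lambda_max(K~ L~)) / n."""
    n = len(x)
    centre = np.eye(n) - np.ones((n, n)) / n           # centring matrix H
    k = centre @ gaussian_gram(x, sigma) @ centre      # K~ = H K H
    l = centre @ gaussian_gram(y, sigma) @ centre      # L~ = H L H
    # Eigenvalues of a product of PSD matrices are real and non-negative
    # up to rounding, so take real parts and clip at zero.
    top = max(np.linalg.eigvals(k @ l).real.max(), 0.0)
    return float(np.sqrt(top) / n)

rng = np.random.default_rng(0)
x = rng.normal(size=150)
score_dependent = empirical_coco(x, x + 0.1 * rng.normal(size=150))
score_independent = empirical_coco(x, rng.normal(size=150))
```

With these samples the dependent pair should score noticeably higher than the independent pair. The kernel width `sigma` is a free choice here and would need tuning in practice.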
Divergence estimation of continuous distributions based on data-dependent partitions
- IEEE Transactions on Information Theory
, 2005
Cited by 20 (3 self)
We present a universal estimator of the divergence D(P‖Q) for two arbitrary continuous distributions P and Q satisfying certain regularity conditions. This algorithm, which observes independent and identically distributed (i.i.d.) samples from both P and Q, is based on the estimation of the Radon–Nikodym derivative dP/dQ via a data-dependent partition of the observation space. Strong convergence of this estimator is proved with an empirically equivalent segmentation of the space. This basic estimator is further improved by adaptive partitioning schemes and by bias correction. The application of the algorithms to data with memory is also investigated. In the simulations, we compare our estimators with the direct plug-in estimator and estimators based on other partitioning approaches. Experimental results show that our methods achieve the best convergence performance in most of the tested cases. Index Terms: Bias correction, data-dependent partition, divergence, Radon–Nikodym derivative, stationary and ergodic data, universal estimation of information measures.
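The partition idea can be sketched in one dimension. This is an illustrative simplification, not the paper's algorithm: it assumes scalar samples, places cell boundaries at empirical quantiles of the second sample so each cell holds roughly the same number of its points, and omits the adaptive partitioning and bias correction mentioned in the abstract. The function name `partition_divergence` and the choice of 100 cells are hypothetical.

```python
import numpy as np

def partition_divergence(p_samples, q_samples, n_bins):
    """Plug-in divergence estimate (nats) on a Q-quantile partition.

    Assumes continuous scalar data, so every cell contains some Q samples.
    """
    p = np.asarray(p_samples, dtype=float)
    q = np.asarray(q_samples, dtype=float)
    # Interior cell boundaries at empirical quantiles of the Q sample;
    # the two outer cells extend to -inf and +inf.
    edges = np.quantile(q, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    p_hat = np.bincount(np.searchsorted(edges, p), minlength=n_bins) / p.size
    q_hat = np.bincount(np.searchsorted(edges, q), minlength=n_bins) / q.size
    occupied = p_hat > 0                    # 0 * log 0 is taken as 0
    return float(np.sum(p_hat[occupied] * np.log(p_hat[occupied] / q_hat[occupied])))

rng = np.random.default_rng(0)
q_draws = rng.normal(0.0, 1.0, 20000)
same = partition_divergence(rng.normal(0.0, 1.0, 20000), q_draws, 100)
shifted = partition_divergence(rng.normal(1.0, 1.0, 20000), q_draws, 100)
```

For two unit-variance Gaussians whose means differ by 1, the true divergence is 0.5 nats, so `shifted` should land near that value while `same` stays close to zero.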
Sketching and Streaming Entropy via Approximation Theory
Cited by 12 (0 self)
We conclude a sequence of work by giving near-optimal sketching and streaming algorithms for estimating Shannon entropy in the most general streaming model, with arbitrary insertions and deletions. This improves on prior results that obtain suboptimal space bounds in the general model, and near-optimal bounds in the insertion-only model without sketching. Our high-level approach is simple: we give algorithms to estimate Rényi and Tsallis entropy, and use them to extrapolate an estimate of Shannon entropy. The accuracy of our estimates is proven using approximation theory arguments and extremal properties of Chebyshev polynomials, a technique which may be useful for other problems. Our work also yields the best-known and near-optimal additive approximations for entropy, and hence also for conditional entropy and mutual information.
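The extrapolation step rests on a standard fact: both Rényi and Tsallis entropy of order α converge to Shannon entropy as α approaches 1. The sketch below only illustrates that limit on an explicitly known distribution. It is not a streaming or sketching algorithm and does not reproduce the Chebyshev-based argument. The function names are chosen for illustration.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in nats."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha != 1, in nats."""
    p = p[p > 0]
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

def tsallis_entropy(p, alpha):
    """Tsallis entropy of order alpha != 1."""
    p = p[p > 0]
    return float((1.0 - np.sum(p ** alpha)) / (alpha - 1.0))

dist = np.array([0.5, 0.25, 0.25])
h = shannon_entropy(dist)
h_renyi = renyi_entropy(dist, 1.001)
h_tsallis = tsallis_entropy(dist, 1.001)
```

Evaluating at orders close to 1 already gives values near `h`. A streaming method would instead estimate the power sums approximately and then extrapolate.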
Information dynamics and emergent computation in recurrent circuits of spiking neurons
- Advances in Neural Information Processing Systems 16
, 2004
Cited by 8 (2 self)
We employ an efficient method using Bayesian and linear classifiers for analyzing the dynamics of information in high-dimensional states of generic cortical microcircuit models. It is shown that such recurrent circuits of spiking neurons have an inherent capability to carry out rapid computations on complex spike patterns, merging information contained in the order of spike arrival with previously acquired context information.
A Large-Deviation Analysis for the Maximum Likelihood Learning of Tree Structures
, 2009
Cited by 8 (7 self)
The problem of maximum-likelihood learning of the Markov tree structure of an unknown distribution from samples is considered when the distribution is Markov on a tree. Large-deviation analysis of the error in estimation of the set of edges of the tree is considered. Necessary and sufficient conditions are provided to ensure that this error probability decays exponentially. These conditions are based on the mutual information between each pair of variables being distinct from that of other pairs. The rate of error decay, which is the error exponent, is derived using the large-deviation principle. For a discrete distribution, the error exponent is approximated using Euclidean information theory, and is given by a ratio, interpreted as the signal-to-noise ratio (SNR) for learning. Extensions to the Gaussian case are also considered.
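Maximum-likelihood tree learning for discrete data is classically carried out by the Chow-Liu procedure: a maximum-weight spanning tree over empirical pairwise mutual information. The abstract does not name the procedure, so treating it as the learner is an assumption here. The sketch uses a plug-in mutual information estimate and Kruskal's algorithm, and the names `empirical_mi` and `chow_liu_edges` are hypothetical.

```python
import itertools
import numpy as np

def empirical_mi(x, y):
    """Plug-in mutual information (nats) of two non-negative integer arrays."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1.0)           # joint histogram
    joint /= joint.sum()
    product = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / product[nz])))

def chow_liu_edges(data):
    """Edges of a maximum-weight spanning tree over pairwise empirical MI."""
    n_vars = data.shape[1]
    weighted = sorted(
        ((empirical_mi(data[:, i], data[:, j]), i, j)
         for i, j in itertools.combinations(range(n_vars), 2)),
        reverse=True,
    )
    parent = list(range(n_vars))             # union-find forest for Kruskal

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    edges = set()
    for _, i, j in weighted:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:                 # edge joins two components
            parent[root_i] = root_j
            edges.add((i, j))
    return edges

# Toy Markov chain X0 -> X1 -> X2: each link flips the bit with probability 0.1.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 5000)
x1 = x0 ^ (rng.random(5000) < 0.1)
x2 = x1 ^ (rng.random(5000) < 0.1)
learned = chow_liu_edges(np.column_stack([x0, x1, x2]))
```

The wrong edge (0, 2) is chosen only if its empirical mutual information overtakes that of a true edge, which is the kind of crossover event the error analysis above is concerned with.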
Generalization in Clustering with Unobserved Features
- In NIPS
, 2005
Cited by 7 (5 self)
We argue that when objects are characterized by many attributes, clustering them on the basis of a relatively small random subset of these attributes can capture information on the unobserved attributes as well.
Sensor adaptation and development in robots by entropy maximization of sensory data
- In Proceedings of the 6th IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA-2005)
Cited by 6 (4 self)
Abstract — A method is presented for adapting the sensors of a robot to the statistical structure of its current environment. This enables the robot to compress incoming sensory information and to find informational relationships between sensors. The method is applied to creating sensoritopic maps of the informational relationships of the sensors of a developing robot, where the informational distance between sensors is computed using information theory and adaptive binning. The adaptive binning method constantly estimates the probability distribution of the latest inputs to maximize the entropy in each individual sensor, while conserving the correlations between different sensors. Results from simulations and robotic experiments with visual sensors show how adaptive binning of the sensory data helps the system to discover structure not found by ordinary binning. This enables the developing perceptual system of the robot to be more adapted to the particular embodiment of the robot and the environment. Index Terms — Ontogenetic robotics, sensory systems, entropy maximization
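A rough illustration of why adaptive binning raises per-sensor entropy, assuming the simplest possible variant: bin edges placed at empirical quantiles of a window of recent readings, so each bin is used about equally often. The paper's actual update rule is not given in the abstract, and the function names and the exponential test signal are assumptions.

```python
import numpy as np

def entropy_bits(labels, n_bins):
    """Shannon entropy (bits) of discrete bin labels."""
    counts = np.bincount(labels, minlength=n_bins)
    probs = counts[counts > 0] / labels.size
    return float(-np.sum(probs * np.log2(probs)))

def uniform_bins(values, n_bins):
    """Ordinary binning: equal-width bins over the observed range."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)[1:-1]
    return np.searchsorted(edges, values)

def adaptive_bins(values, n_bins):
    """Adaptive binning: equal-frequency bins from empirical quantiles."""
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.searchsorted(edges, values)

rng = np.random.default_rng(0)
readings = rng.exponential(scale=1.0, size=4000)   # skewed sensor signal
h_uniform = entropy_bits(uniform_bins(readings, 8), 8)
h_adaptive = entropy_bits(adaptive_bins(readings, 8), 8)
```

With 8 bins the maximum attainable entropy is 3 bits. Equal-frequency bins come close to it on skewed data, whereas equal-width bins leave most readings in the first few bins.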
Multi-Classification by Categorical Features via Clustering
Cited by 6 (6 self)
We derive a generalization bound for multiclassification schemes based on grid clustering in categorical parameter product spaces. Grid clustering partitions the parameter space in the form of a Cartesian product of partitions for each of the parameters. The derived bound provides a means to evaluate clustering solutions in terms of the generalization power of a built-on classifier. For classification based on a single feature the bound serves to find a globally optimal classification rule. Comparison of the generalization power of individual features can then be used for feature ranking. Our experiments show that in this role the bound is much more precise than mutual information or normalized correlation indices.
An information-theoretic approach to network monitoring and measurement
- In Proc. of IMC
, 2005
Cited by 5 (0 self)
Network engineers and operators are faced with a number of challenges that arise in the context of network monitoring and measurement. These include: i) how much information is included in measurement traces and by how much can we compress those traces? ii) how much information is captured by different monitoring paradigms and tools ranging from full packet header captures to flow-level captures (such as with NetFlow) to packet and byte counts (such as with SNMP)? and iii) how much joint information is included in traces collected at different points and can we take advantage of this joint information? In this paper we develop a network model and an information-theoretic framework within which to address these questions. We use the model and the framework to first determine the benefits of compressing traces captured at a single monitoring point, and we outline approaches to achieve those benefits. We next consider the benefits of joint coding, or equivalently of joint compression of traces captured at different monitoring points. Finally, we examine the difference in information content when measurements are made at either the flow level or the packet/byte count level. In all of these cases, the effect of temporal and spatial correlation on the answers to the above questions is examined. Both our model and its predictions are validated against measurements taken from a large operational network.

