Results 1–10 of 13
Information-Maximization Clustering based on Squared-Loss Mutual Information
Abstract

Cited by 7 (5 self)
Information-maximization clustering learns a probabilistic classifier in an unsupervised manner so that mutual information between feature vectors and cluster assignments is maximized. A notable advantage of this approach is that it only involves continuous optimization of model parameters, which is substantially simpler than discrete optimization of cluster assignments. However, existing methods still involve nonconvex optimization problems, and therefore finding a good local optimal solution is not straightforward in practice. In this paper, we propose an alternative information-maximization clustering method based on a squared-loss variant of mutual information. This novel approach gives a clustering solution analytically in a computationally efficient way via kernel eigenvalue decomposition. Furthermore, we provide a practical model selection procedure that allows us to objectively optimize tuning parameters included in the kernel function. Through experiments, we demonstrate the usefulness of the proposed approach.
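The analytic solution described above relies on kernel eigenvalue decomposition. As a rough illustration of that general idea (not the paper's exact SMIC procedure), one can take the top eigenvectors of a Gaussian kernel matrix as continuous cluster indicators and round them to assignments; the function name, the kernel width `sigma`, and the argmax rounding step are all illustrative assumptions:

```python
import numpy as np

def kernel_eig_clustering(X, n_clusters, sigma=1.0):
    # Gaussian kernel matrix from pairwise squared distances.
    sq = np.sum(X**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma**2))
    # Top eigenvectors of K serve as continuous cluster indicators.
    w, V = np.linalg.eigh(K)                      # ascending eigenvalues
    V = V[:, np.argsort(w)[::-1][:n_clusters]]    # keep the n_clusters largest
    # Assign each point to the indicator with the largest magnitude
    # (a crude rounding step; the actual method post-processes differently).
    return np.argmax(np.abs(V), axis=1)
```

On well-separated data this recovers the groups; the method in the abstract additionally selects the kernel parameters objectively via its model selection procedure.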
Direct Divergence Approximation between Probability Distributions and Its Applications in Machine Learning
Abstract

Cited by 7 (7 self)
Approximating a divergence between two probability distributions from their samples is a fundamental challenge in statistics, information theory, and machine learning. A divergence approximator can be used for various purposes such as two-sample homogeneity testing, change-point detection, and class-balance estimation. Furthermore, an approximator of a divergence between the joint distribution and the product of marginals can be used for independence testing, which has a wide range of applications including feature selection and extraction, clustering, object matching, independent component analysis, and causal direction estimation. In this paper, we review recent advances in divergence approximation. Our emphasis is that directly approximating the divergence without estimating probability distributions is more sensible than a naive two-step approach of first estimating probability distributions and then approximating the divergence. Furthermore, despite the overwhelming popularity of the Kullback-Leibler divergence as a divergence measure, we argue that alternatives such as the Pearson divergence, the relative Pearson divergence, and the L2-distance are more useful in practice because of their computationally efficient approximability, high numerical stability, and superior robustness against outliers.
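To make the direct-approximation idea concrete: the Pearson divergence PE(p||q) = (1/2) ∫ q(x)(p(x)/q(x) - 1)^2 dx can be estimated by fitting the density ratio r(x) = p(x)/q(x) directly with a least-squares objective, without estimating p or q separately, since PE = (1/2) E_p[r] - 1/2. The following sketch uses Gaussian basis functions centered at the p-samples and a fixed regularizer; the function name, `sigma`, and `lam` are illustrative assumptions (in practice such parameters are chosen by cross-validation):

```python
import numpy as np

def pearson_divergence(Xp, Xq, sigma=1.0, lam=0.1):
    # Density-ratio model r(x) = sum_j alpha_j * exp(-||x - c_j||^2 / (2 sigma^2)).
    def phi(X, C):
        d2 = np.sum(X**2, 1)[:, None] + np.sum(C**2, 1)[None, :] - 2 * X @ C.T
        return np.exp(-d2 / (2 * sigma**2))
    C = Xp                       # basis centers at the samples from p
    Pp, Pq = phi(Xp, C), phi(Xq, C)
    H = Pq.T @ Pq / len(Xq)      # empirical E_q[phi phi^T]
    h = Pp.mean(axis=0)          # empirical E_p[phi]
    # Regularized least-squares fit of the ratio coefficients.
    alpha = np.linalg.solve(H + lam * np.eye(len(C)), h)
    r_p = Pp @ alpha             # estimated ratio p/q at the p-samples
    return 0.5 * r_p.mean() - 0.5
```

The estimate is near zero when the two sample sets come from the same distribution and grows as they separate, which is what makes it usable for two-sample testing.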
Squared-Loss Mutual Information Regularization: A Novel Information-Theoretic Approach to Semi-Supervised Learning
Abstract

Cited by 3 (1 self)
We propose squared-loss mutual information regularization (SMIR) for multi-class probabilistic classification, following the information maximization principle. SMIR is convex under mild conditions and thus improves on the nonconvexity of mutual information regularization. It offers the following four abilities to semi-supervised algorithms: an analytical solution, out-of-sample classification, multi-class classification, and probabilistic output. Furthermore, novel generalization error bounds are derived. Experiments show that SMIR compares favorably with state-of-the-art methods.
Maximum volume clustering: A new discriminative clustering approach
Journal of Machine Learning Research, 2013
Abstract

Cited by 3 (3 self)
Editor: Ulrike von Luxburg. The large volume principle proposed by Vladimir Vapnik, which advocates that hypotheses lying in an equivalence class with a larger volume are preferable, is a useful alternative to the large margin principle. In this paper, we introduce a new discriminative clustering model based on the large volume principle called maximum volume clustering (MVC), and then propose two approximation schemes to solve this MVC model: a soft-label MVC method using sequential quadratic programming and a hard-label MVC method using semidefinite programming. The proposed MVC is theoretically advantageous for three reasons. First, the optimization involved in hard-label MVC is convex, and under mild conditions, the optimization involved in soft-label MVC is akin to a convex one in terms of the resulting clusters. Second, the soft-label MVC method pos ...
Latent Domains Modeling for Visual Domain Adaptation
Abstract
To improve robustness to significant mismatches between the source domain and the target domain arising from changes such as illumination, pose, and image quality, domain adaptation is increasingly popular in computer vision. However, most methods assume that the source data come from a single domain, or that multi-domain datasets provide a domain label for each training instance. In practice, most datasets are mixtures of multiple latent domains, and it is difficult to manually provide the domain label of each data point. In this paper, we propose a model that automatically discovers latent domains in visual datasets. We first assume that the visual images are sampled from multiple manifolds, each of which represents a different domain and is represented by a different subspace. Using the neighborhood structure estimated from images belonging to the same category, we approximate the local linear invariant subspace for each image based on its local structure, eliminating the category-specific elements of the feature. Based on the effectiveness of this representation, we then propose a squared-loss mutual information based clustering model with a category distribution prior in each domain to infer the domain assignment for images. In experiments, we test our approach on two common image datasets; the results show that our method outperforms existing state-of-the-art methods and also demonstrate the superiority of multiple latent domain discovery.
On a Theory of Nonparametric Pairwise Similarity for Clustering: Connecting Clustering to Classification
Abstract
Pairwise clustering methods partition the data space into clusters by the pairwise similarity between data points. The success of pairwise clustering largely depends on the pairwise similarity function defined over the data points, where kernel similarity is broadly used. In this paper, we present a novel pairwise clustering framework by bridging the gap between clustering and multi-class classification. This pairwise clustering framework learns an unsupervised nonparametric classifier from each data partition, and searches for the optimal partition of the data by minimizing the generalization error of the learned classifiers associated with the data partitions. We consider two nonparametric classifiers in this framework, i.e., the nearest neighbor classifier and the plug-in classifier. Modeling the underlying data distribution by nonparametric kernel density estimation, the generalization error bounds for both unsupervised nonparametric classifiers are the sum of nonparametric pairwise similarity terms between the data points for the purpose of clustering. Under the uniform distribution, the nonparametric similarity terms induced by both unsupervised classifiers exhibit a well-known form of kernel similarity. We also prove that the generalization error bound for the unsupervised plug-in classifier is asymptotically equal to the weighted volume of the cluster boundary [1] for Low Density Separation, a widely used criterion for semi-supervised learning and clustering. Based on the derived nonparametric pairwise similarity using the plug-in classifier, we propose a new nonparametric exemplar-based clustering method with enhanced discriminative capability, whose superiority is evidenced by the experimental results.
IEICE Transactions on Information and Systems, vol.E95-D, no.10, pp.2564–2567, 2012.
On Kernel Parameter Selection in Hilbert-Schmidt Independence Criterion
Abstract
The Hilbert-Schmidt independence criterion (HSIC) is a kernel-based statistical independence measure that can be computed very efficiently. However, it requires us to determine the kernel parameters heuristically because no objective model selection method is available. Least-squares mutual information (LSMI) is another statistical independence measure that is based on direct density-ratio estimation. Although LSMI is computationally more expensive than HSIC, LSMI is equipped with cross-validation, and thus the kernel parameter can be determined objectively. In this paper, we show that HSIC can actually be regarded as an approximation to LSMI, which allows us to utilize cross-validation of LSMI for determining kernel parameters in HSIC. Consequently, both computational efficiency and cross-validation can be achieved.
Keywords: Hilbert-Schmidt independence criterion, least-squares mutual information, cross-validation, Gaussian kernel
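For reference, the empirical HSIC has a simple closed form, tr(KHLH)/(n-1)^2, where K and L are kernel matrices computed on the two variables and H = I - (1/n)11^T is the centering matrix. Below is a minimal sketch with Gaussian kernels; the function names and the default widths are assumptions, and the abstract's point is precisely that these widths otherwise lack an objective selection rule:

```python
import numpy as np

def gaussian_kernel(X, sigma):
    # Pairwise squared distances, then Gaussian (RBF) kernel matrix.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    # Empirical HSIC: trace(K H L H) / (n - 1)^2,
    # where H = I - (1/n) 11^T centers the kernel matrices.
    n = X.shape[0]
    K = gaussian_kernel(X, sigma_x)
    L = gaussian_kernel(Y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

Larger values indicate stronger dependence between the two samples; the paper's contribution is to choose `sigma_x` and `sigma_y` via LSMI's cross-validation rather than heuristically.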
Direct Approximation of Divergences between Probability Distributions
Abstract
Approximating a divergence between two probability distributions from their samples is a fundamental challenge in the statistics, information theory, and machine learning communities, because a divergence estimator can be used for various purposes such as two-sample homogeneity testing, change-point detection, and class-balance estimation. Furthermore, an approximator of a divergence between the joint distribution and the product of marginals can be used for independence testing, which has a wide range of applications including feature selection and extraction, clustering, object matching, independent component analysis, and causality learning. In this article, we review recent advances in direct divergence approximation that follow the general inference principle advocated by Vladimir Vapnik: one should not solve a more general problem as an intermediate step. More specifically, direct divergence approximation avoids separately estimating two probability distributions when approximating a divergence. We cover direct approximators of the Kullback-Leibler (KL) divergence, the Pearson (PE) divergence, the relative PE (rPE) divergence, and the L2-distance. Despite the overwhelming popularity of the KL divergence, we argue that the latter approximators are more useful in practice due to their computational efficiency, high numerical stability, and superior robustness against outliers.