Results 1–10 of 42
Sufficient Dimension Reduction via Squared-loss Mutual Information Estimation
Abstract

Cited by 35 (30 self)
The goal of sufficient dimension reduction in supervised learning is to find the low-dimensional subspace of input features that is 'sufficient' for predicting output values. In this paper, we propose a novel sufficient dimension reduction method using a squared-loss variant of mutual information as a dependency measure. We utilize an analytic approximator of squared-loss mutual information based on density-ratio estimation, which is shown to possess suitable convergence properties. We then develop a natural gradient algorithm for sufficient subspace search. Numerical experiments show that the proposed method compares favorably with existing dimension reduction approaches.
Relative Density-Ratio Estimation for Robust Distribution Comparison
Abstract

Cited by 27 (18 self)
Divergence estimators based on direct approximation of density-ratios, without going through separate approximation of the numerator and denominator densities, have been successfully applied to machine learning tasks that involve distribution comparison, such as outlier detection, transfer learning, and two-sample homogeneity testing. However, since density-ratio functions often possess high fluctuation, divergence estimation is still a challenging task in practice. In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. Since relative density-ratios are always smoother than the corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach.
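The smoothness claim is easy to see from the definition: the alpha-relative density-ratio r_alpha(x) = p(x) / (alpha p(x) + (1 - alpha) q(x)) is bounded above by 1/alpha whenever alpha > 0, whereas the ordinary ratio p/q (alpha = 0) can blow up in the tails. A small numerical check with two Gaussians (the specific densities are only for illustration):

```python
import numpy as np

def gauss_pdf(t, mu, s):
    return np.exp(-(t - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

def relative_ratio(t, p, q, alpha):
    """alpha-relative density ratio r_alpha = p / (alpha*p + (1-alpha)*q);
    bounded above by 1/alpha for alpha > 0, hence smoother to estimate."""
    return p(t) / (alpha * p(t) + (1 - alpha) * q(t))

t = np.linspace(-5, 5, 1001)
p = lambda u: gauss_pdf(u, 0.0, 1.0)
q = lambda u: gauss_pdf(u, 0.5, 1.0)
ordinary = relative_ratio(t, p, q, alpha=0.0)   # plain p/q, unbounded in the tail
smoothed = relative_ratio(t, p, q, alpha=0.5)   # capped at 1/alpha = 2
```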
Dimensionality Reduction for Density-Ratio Estimation in High-dimensional Spaces
 NEURAL NETWORKS, VOL. 23, NO. 1, PP. 44–59, 2010
Abstract

Cited by 24 (18 self)
The ratio of two probability density functions is becoming a quantity of interest in the machine learning and data mining communities, since it can be used for various data processing tasks such as non-stationarity adaptation, outlier detection, and feature selection. Recently, several methods have been developed for directly estimating the density ratio without going through density estimation and were shown to work well in various practical problems. However, these methods still perform rather poorly when the dimensionality of the data domain is high. In this paper, we propose to incorporate a dimensionality reduction scheme into a density-ratio estimation procedure and experimentally show that the estimation accuracy in high-dimensional cases can be improved.
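The direct estimation this line of work builds on can be sketched compactly: fit a kernel model to the ratio by matching moments of the numerator and denominator samples, with no density estimated along the way. The hyperparameters `sigma` and `lam` below are illustrative; in practice they are chosen by cross-validation.

```python
import numpy as np

def ulsif(x_nu, x_de, sigma=0.5, lam=1e-2, n_basis=100):
    """Direct least-squares density-ratio estimation (uLSIF-style sketch).
    Returns w with w(x) ~= p_nu(x) / p_de(x); neither density is estimated."""
    c = x_nu[:n_basis]                                   # kernel centres
    K = lambda a: np.exp(-(a[:, None] - c[None, :]) ** 2 / (2 * sigma ** 2))
    H = K(x_de).T @ K(x_de) / len(x_de)                  # denominator moment
    h = K(x_nu).mean(axis=0)                             # numerator moment
    theta = np.linalg.solve(H + lam * np.eye(len(c)), h)
    return lambda a: K(a) @ theta
```

For a narrow numerator distribution inside a wider denominator, the fitted ratio should peak near the centre and decay in the tails.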
Direct Density-Ratio Estimation with Dimensionality Reduction via Least-squares Hetero-distributional Subspace Search
 NEURAL NETWORKS, VOL. 24, NO. 2, PP. 183–198, 2011
Abstract

Cited by 23 (15 self)
Methods for directly estimating the ratio of two probability density functions have been actively explored recently, since they can be used for various data processing tasks such as non-stationarity adaptation, outlier detection, and feature selection. In this paper, we develop a new method which incorporates dimensionality reduction into a direct density-ratio estimation procedure. Our key idea is to find a low-dimensional subspace in which the densities are significantly different and perform density-ratio estimation only in this subspace. The proposed method, D³LHSS (Direct Density-ratio estimation with Dimensionality reduction via Least-squares Hetero-distributional Subspace Search), is shown to overcome the limitation of baseline methods.
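To make the "project first, estimate the ratio in the subspace" idea concrete, here is the crudest possible stand-in for the subspace search: taking the normalized difference of sample means as a 1-D hetero-distributional direction. D³LHSS instead searches the subspace by maximizing a least-squares divergence estimate; this sketch only illustrates the projection step that precedes density-ratio estimation.

```python
import numpy as np

def hetero_direction(X_nu, X_de):
    """Crude 1-D 'hetero-distributional' direction: the normalised difference
    of the two sample means. A stand-in for the least-squares subspace search
    of D3LHSS, useful only when the distributions differ mainly in mean."""
    d = X_nu.mean(axis=0) - X_de.mean(axis=0)
    return d / np.linalg.norm(d)

# density-ratio estimation would then run on the 1-D projections X @ d
```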
Computational Complexity of Kernel-Based Density-Ratio Estimation: A Condition Number Analysis
 MACHINE LEARNING, VOL. 90, NO. 3, PP. 431–460, 2013
Abstract

Cited by 13 (11 self)
In this study, the computational properties of a kernel-based least-squares density-ratio estimator are investigated from the viewpoint of condition numbers. The condition number of the Hessian matrix of the loss function is closely related to the convergence rate of optimization and to numerical stability. Using smoothed analysis techniques, we theoretically demonstrate that the kernel least-squares method has a smaller condition number than other M-estimators, implying that it has desirable computational properties. In addition, an alternative formulation of the kernel least-squares estimator that possesses an even smaller condition number is presented. The validity of the theoretical analysis is verified through numerical experiments.
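For the regularized least-squares loss, the Hessian is the empirical second-moment matrix of the kernel basis plus λI, so its condition number is (σ_max + λ)/(σ_min + λ) and shrinks as the regularization grows. A small numerical illustration of that relationship (the kernel width and λ values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)   # Gaussian kernel design matrix

# Hessian of the regularised least-squares loss: H_hat + lam * I.
conds = [np.linalg.cond(K.T @ K / len(x) + lam * np.eye(len(x)))
         for lam in (1e-6, 1e-3, 1e-1)]
# larger regularisation -> smaller condition number -> faster, stabler optimisation
```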
Dependence minimizing regression with model selection for nonlinear causal inference under non-Gaussian noise
 Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI-2010), 2010
Abstract

Cited by 13 (10 self)
The discovery of nonlinear causal relationships under additive non-Gaussian noise models has attracted considerable attention recently because of their high flexibility. In this paper, we propose a novel causal inference algorithm called least-squares independence regression (LSIR). LSIR learns the additive noise model through minimization of an estimator of the squared-loss mutual information between inputs and residuals. A notable advantage of LSIR over existing approaches is that tuning parameters such as the kernel width and the regularization parameter can be naturally optimized by cross-validation, allowing us to avoid overfitting in a data-dependent fashion. Through experiments with real-world datasets, we show that LSIR compares favorably with a state-of-the-art causal inference method.
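The setting can be sketched in a few lines: generate y = f(x) + e with non-Gaussian noise, fit f, and check that the residual carries no dependence on the input. The sketch below uses an ordinary polynomial fit and crude correlation-based checks purely for illustration; LSIR instead minimizes an SMI estimate over the model parameters, with kernel width and regularization tuned by cross-validation.

```python
import numpy as np

# Additive-noise model with non-Gaussian (uniform) noise, as LSIR assumes.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
e = rng.uniform(-0.3, 0.3, 500)          # non-Gaussian noise
y = x + x ** 3 + e

coef = np.polyfit(x, y, deg=3)           # illustrative stand-in for the fitted f
r = y - np.polyval(coef, x)              # residuals

# crude independence checks (LSIR uses a squared-loss MI estimator instead);
# for a correctly specified model the residual is independent of the input
lin = abs(np.corrcoef(x, r)[0, 1])
sq = abs(np.corrcoef(x ** 2, r ** 2)[0, 1])
```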
On Information-Maximization Clustering: Tuning Parameter Selection and Analytic Solution
Abstract

Cited by 13 (8 self)
Information-maximization clustering learns a probabilistic classifier in an unsupervised manner so that the mutual information between feature vectors and cluster assignments is maximized. A notable advantage of this approach is that it involves only continuous optimization of model parameters, which is substantially easier to solve than discrete optimization of cluster assignments. However, existing methods still involve non-convex optimization problems, and therefore finding a good local optimum is not straightforward in practice. In this paper, we propose an alternative information-maximization clustering method based on a squared-loss variant of mutual information. This novel approach gives a clustering solution analytically, in a computationally efficient way, via kernel eigenvalue decomposition. Furthermore, we provide a practical model selection procedure that allows us to objectively optimize tuning parameters included in the kernel function. Through experiments, we demonstrate the usefulness of the proposed approach.
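The "analytic solution via kernel eigenvalue decomposition" can be sketched spectrally: build a kernel matrix over the data, take its top eigenvectors, and read cluster assignments off them. The argmax-of-magnitude assignment below is a simplification that works for well-separated clusters; the paper's SMIC method derives the assignment rule and the tuning-parameter selection more carefully.

```python
import numpy as np

def smi_clustering_sketch(X, n_clusters, sigma=1.0):
    """Spectral sketch of SMI-maximisation clustering: the analytic solution
    is built from top eigenvectors of a kernel matrix; here each point is
    assigned to the eigenvector carrying its largest-magnitude component."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / (2 * sigma ** 2))
    _, vecs = np.linalg.eigh(K)                  # eigenvalues in ascending order
    V = vecs[:, -n_clusters:]                    # top n_clusters eigenvectors
    return np.argmax(np.abs(V), axis=1)
```

For two compact, well-separated blobs, each top eigenvector localizes on one blob, so the assignment recovers the partition.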
High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso
, 2013
Abstract

Cited by 9 (2 self)
The goal of supervised feature selection is to find a subset of input features that are responsible for predicting output values. The least absolute shrinkage and selection operator (Lasso) allows computationally efficient feature selection based on linear dependency between input features and output values. In this paper, we consider a feature-wise kernelized Lasso for capturing nonlinear input-output dependency. We first show that, with particular choices of kernel functions, non-redundant features with strong statistical dependence on output values can be found in terms of kernel-based independence measures such as the Hilbert-Schmidt independence criterion (HSIC). We then show that the globally optimal solution can be efficiently computed; this makes the approach scalable to high-dimensional problems. The effectiveness of the proposed method is demonstrated through feature selection experiments for classification and regression with thousands of features.
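The building block of the feature-wise approach is a per-feature kernel dependence score. The sketch below computes a biased empirical HSIC between one candidate feature and the output, which can detect nonlinear dependencies that a linear correlation misses; the full HSIC Lasso additionally combines such terms under an l1 penalty so that mutually redundant features are not selected together (not shown here).

```python
import numpy as np

def hsic_score(u, v, sigma=1.0):
    """Biased empirical HSIC with Gaussian kernels: a kernel-based dependence
    score between one candidate feature u and the output v."""
    n = len(u)
    C = np.eye(n) - np.ones((n, n)) / n              # centring matrix
    K = np.exp(-(u[:, None] - u[None, :]) ** 2 / (2 * sigma ** 2))
    L = np.exp(-(v[:, None] - v[None, :]) ** 2 / (2 * sigma ** 2))
    return np.trace(C @ K @ C @ L) / (n - 1) ** 2
```

A feature with a nonlinear effect on the output should outscore an irrelevant one.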
Feature selection for reinforcement learning: Evaluating implicit state-reward dependency via conditional mutual information
 In ECML/PKDD, 2010
Abstract

Cited by 8 (2 self)
Model-free reinforcement learning (RL) is a machine learning approach to decision making in unknown environments. However, real-world RL tasks often involve high-dimensional state spaces, in which standard RL methods do not perform well. In this paper, we propose a new feature selection framework for coping with high dimensionality. Our proposed framework adopts the conditional mutual information between return and state-feature sequences as a feature selection criterion, allowing the evaluation of implicit state-reward dependency. The conditional mutual information is approximated by a least-squares method, which results in a computationally efficient feature selection procedure. The usefulness of the proposed method is demonstrated on grid-world navigation problems.