A kernel method for the two-sample problem
In NIPS, 2006
"... We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test ..."
Abstract

Cited by 117 (27 self)
We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in O(m^2) time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.
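As a concrete illustration of the quadratic-time statistic, here is a minimal NumPy sketch of the biased squared-MMD estimate with a Gaussian kernel; the function names and the fixed bandwidth are illustrative choices, not the authors' code.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of X and Y."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased estimate of ||mean_emb(X) - mean_emb(Y)||^2 in the RKHS, O(m^2)."""
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))   # sample from P
Y = rng.normal(0.5, 1.0, size=(200, 2))   # sample from Q (shifted mean)
print(mmd2_biased(X, Y))                  # larger when P and Q differ
```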
Domain Adaptation via Transfer Component Analysis
"... Domain adaptation solves a learning problem in a target domain by utilizing the training data in a different but related source domain. Intuitively, discovering a good feature representation across domains is crucial. In this paper, we propose to find such a representation through a new learning met ..."
Abstract

Cited by 102 (18 self)
Domain adaptation solves a learning problem in a target domain by utilizing the training data in a different but related source domain. Intuitively, discovering a good feature representation across domains is crucial. In this paper, we propose to find such a representation through a new learning method, transfer component analysis (TCA), for domain adaptation. TCA tries to learn some transfer components across domains in a Reproducing Kernel Hilbert Space (RKHS) using Maximum Mean Discrepancy (MMD). In the subspace spanned by these transfer components, data distributions in different domains are close to each other. As a result, with the new representations in this subspace, we can apply standard machine learning methods to train classifiers or regression models in the source domain for use in the target domain. The main contribution of our work is a novel feature representation in which to perform domain adaptation, via a new parametric kernel using feature extraction methods, which can dramatically minimize the distance between domain distributions by projecting data onto the learned transfer components. Furthermore, our approach can handle large datasets and naturally leads to out-of-sample generalization. The effectiveness and efficiency of our approach are verified by experiments on two real-world applications: cross-domain indoor WiFi localization and cross-domain text classification.
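A compressed sketch of the eigenproblem behind TCA, under my reading of the abstract: shrink the MMD between the projected domains while preserving variance. The matrix conventions (MMD matrix L, centering matrix H, trade-off mu) and the function name are assumptions for illustration, not the paper's reference implementation.

```python
import numpy as np
from scipy.linalg import eigh

def tca_components(K, n_src, n_tgt, mu=1.0, m=5):
    """Illustrative TCA: top-m transfer components from a kernel matrix K
    over the pooled source+target samples (source rows first)."""
    n = n_src + n_tgt
    # MMD matrix L: the quadratic form tr(KL) is the squared MMD of the means
    e = np.concatenate([np.full(n_src, 1.0 / n_src), np.full(n_tgt, -1.0 / n_tgt)])
    L = np.outer(e, e)
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    A = K @ L @ K + mu * np.eye(n)                  # distance term + regularizer
    B = K @ H @ K                                   # variance term
    vals, vecs = eigh(B, A)                         # solves B v = lambda A v
    W = vecs[:, -m:]                                # leading m eigenvectors
    return K @ W                                    # embedded samples, shape (n, m)
```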
Hilbert Space Embeddings and Metrics on Probability Measures
"... A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseu ..."
Abstract

Cited by 72 (36 self)
A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γ_k, indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γ_k. First, we consider the question of determining the conditions on the kernel k for which γ_k is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on R^d, then it is characteristic if and only if the support of its Fourier transform is the entire R^d.
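For reference, the squared pseudometric has a standard closed form in terms of kernel expectations; this population expression is well known and not specific to this paper:

```latex
\gamma_k^2(P,Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}^2
  = \mathbb{E}_{x,x' \sim P}\,k(x,x')
  + \mathbb{E}_{y,y' \sim Q}\,k(y,y')
  - 2\,\mathbb{E}_{x \sim P,\,y \sim Q}\,k(x,y).
```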
Injective Hilbert space embeddings of probability measures
In COLT, 2008
"... A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). The emb ..."
Abstract

Cited by 54 (31 self)
A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). The embedding function has been proven to be injective when the reproducing kernel is universal. In this case, the embedding induces a metric on the space of probability distributions defined on compact metric spaces. In the present work, we consider more broadly the problem of specifying characteristic kernels, defined as kernels for which the RKHS embedding of probability measures is injective. In particular, characteristic kernels can include non-universal kernels. We restrict ourselves to translation-invariant kernels on Euclidean space, and define the associated metric on probability measures in terms of the Fourier spectrum of the kernel and the characteristic functions of these measures. The support of the kernel spectrum is important in determining whether a kernel is characteristic: in particular, the embedding is injective if and only if the kernel spectrum has the entire domain as its support. Characteristic kernels may nonetheless have difficulty in distinguishing certain distributions on the basis of finite samples, again due to the interaction of the kernel spectrum and the characteristic functions of the measures.
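The Fourier-support criterion can be made concrete via Bochner's theorem; the following summary is my paraphrase of the abstract's claim for bounded continuous translation-invariant kernels on R^d:

```latex
k(x,y) = \int_{\mathbb{R}^d} e^{-i\langle x-y,\,\omega\rangle}\, d\Lambda(\omega)
\qquad\text{with}\qquad
k \text{ characteristic} \iff \operatorname{supp}(\Lambda) = \mathbb{R}^d.
```

For example, the Gaussian kernel's spectral measure Λ is itself Gaussian, with support all of R^d, so it is characteristic; a kernel whose spectrum is compactly supported (e.g., a sinc-type kernel) is not.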
Semi-Supervised Novelty Detection
2010
"... A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level se ..."
Abstract

Cited by 31 (3 self)
A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classification. Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study. We also show that SSND provides distribution-free, learning-theoretic solutions to two well-known problems in hypothesis testing. First, our results provide a general solution to the two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard p-value approach to multiple testing under the so-called random effects model. Unlike standard rejection regions based on thresholded p-values, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions.
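A minimal sketch of the reduction described above, assuming logistic regression as the base binary classifier (the framework admits any consistent classifier); the helper name, the 50/50 nominal split, and the quantile calibration rule are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ssnd_detector(X_nominal, X_unlabeled, alpha=0.05):
    """Reduce semi-supervised novelty detection to binary classification:
    nominal sample (label 0) vs. contaminated unlabeled sample (label 1).
    The score threshold is calibrated on held-out nominal data so that the
    false positive rate is roughly alpha."""
    n = len(X_nominal) // 2
    X_train = np.vstack([X_nominal[:n], X_unlabeled])
    y_train = np.concatenate([np.zeros(n), np.ones(len(X_unlabeled))])
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # Calibrate: the (1 - alpha) quantile of scores on held-out nominals
    scores = clf.predict_proba(X_nominal[n:])[:, 1]
    tau = np.quantile(scores, 1 - alpha)
    return lambda X: clf.predict_proba(X)[:, 1] > tau   # True = flagged novel
```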
Testing for Homogeneity with Kernel Fisher Discriminant Analysis
"... We propose to investigate test statistics for testing homogeneity based on kernel Fisher discriminant analysis. Asymptotic null distributions under null hypothesis are derived, and consistency against fixed alternatives is assessed. Finally, experimental evidence of the performance of the proposed a ..."
Abstract

Cited by 29 (14 self)
We propose to investigate test statistics for testing homogeneity based on kernel Fisher discriminant analysis. The asymptotic distributions of the statistics under the null hypothesis are derived, and consistency against fixed alternatives is assessed. Finally, experimental evidence of the performance of the proposed approach on both artificial and real datasets is provided.
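To my recollection of the kernel Fisher discriminant literature, the test statistic in question is a regularized discriminant ratio of roughly the following form (notation mine; treat it as a sketch rather than the paper's exact definition):

```latex
\mathrm{KFDR}_n(\gamma)
  = \frac{n_1 n_2}{n}
    \left\langle \hat{\mu}_2 - \hat{\mu}_1,\;
    \big(\hat{\Sigma}_W + \gamma I\big)^{-1}
    \big(\hat{\mu}_2 - \hat{\mu}_1\big) \right\rangle_{\mathcal{H}},
```

where the μ̂_i are the empirical mean embeddings of the two samples, Σ̂_W is the pooled within-class covariance operator, and γ > 0 is a regularization parameter.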
Kernel change-point analysis
In Proc. Neural Info. Proc. Systems, 2008
"... We introduce a kernelbased method for changepoint analysis within a sequence of temporal observations. Changepoint analysis of an unlabelled sample of observations consists in, first, testing whether a change in the distribution occurs within the sample, and second, if a change occurs, estimating ..."
Abstract

Cited by 29 (1 self)
We introduce a kernel-based method for change-point analysis within a sequence of temporal observations. Change-point analysis of an unlabelled sample of observations consists in, first, testing whether a change in the distribution occurs within the sample, and second, if a change occurs, estimating the change-point instant after which the distribution of the observations switches from one distribution to a different one. We propose a test statistic based upon the maximum kernel Fisher discriminant ratio as a measure of homogeneity between segments. We derive its limiting distribution under the null hypothesis (no change occurs), and establish consistency under the alternative hypothesis (a change occurs). This allows us to build a statistical hypothesis testing procedure for testing the presence of a change-point, with a prescribed false-alarm probability and a detection probability tending to one in the large-sample setting. If a change actually occurs, the test statistic also yields an estimator of the change-point location. Promising experimental results in temporal segmentation of mental tasks from BCI data and pop song indexation are presented.
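A schematic of the scan procedure, using the biased squared MMD between the two candidate segments as a simple stand-in for the paper's kernel Fisher discriminant ratio; everything here (names, bandwidth, the min_seg guard) is illustrative.

```python
import numpy as np

def rbf(X, sigma=1.0):
    """Gaussian kernel matrix over a sample X (rows are observations)."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

def scan_change_point(X, sigma=1.0, min_seg=10):
    """Return the split maximizing a between-segment discrepancy (here a
    biased squared MMD, standing in for the kernel Fisher discriminant
    ratio); compare the max to a permutation threshold for the test."""
    K = rbf(X, sigma)
    n = len(X)
    best_t, best_stat = None, -np.inf
    for t in range(min_seg, n - min_seg):
        Kxx, Kyy, Kxy = K[:t, :t], K[t:, t:], K[:t, t:]
        stat = Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t, best_stat
```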
Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
"... Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over man ..."
Abstract

Cited by 27 (11 self)
Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence, and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that the MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation-invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance, the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure when a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example.
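The third result suggests a direct construction: take the supremum of the MMD over a kernel family. A minimal sketch over a finite grid of Gaussian bandwidths (the grid itself is an arbitrary illustrative choice):

```python
import numpy as np

def mmd2(X, Y, sigma):
    """Biased squared MMD under a Gaussian kernel with bandwidth sigma."""
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def sup_mmd2(X, Y, sigmas=(0.1, 0.5, 1.0, 2.0, 5.0)):
    """Generalized MMD: the supremum of per-kernel MMDs over the family."""
    return max(mmd2(X, Y, s) for s in sigmas)
```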
Equivalence of distance-based and RKHS-based statistics in hypothesis testing
Ann. Statist., 2013
"... We provide a unifying framework linking two classes of statistics used in twosample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, maximum mean discrepancies (MMD), that is, distances between embeddings of dist ..."
Abstract

Cited by 21 (5 self)
We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, maximum mean discrepancies (MMD), that is, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. In the case where the energy distance is computed with a semimetric of negative type, a positive definite kernel, termed the distance kernel, may be defined such that the MMD corresponds exactly to the energy distance. Conversely, for any positive definite kernel, we can interpret the MMD as an energy distance with respect to some negative-type semimetric. This equivalence readily extends to distance covariance using kernels on the product space. We determine the class of probability distributions for which the test statistics are consistent against all alternatives. Finally, we investigate the performance of the family of distance kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels, and that other choices from this family can yield more powerful tests.
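The claimed equivalence is easy to check numerically. With the Euclidean distance as the semimetric and base point z0 = 0, the induced distance kernel is k(x, y) = (|x| + |y| - |x - y|)/2, and the squared MMD comes out to exactly half the energy distance. A sketch (the sample data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (300, 2))
Y = rng.normal(1, 1, (300, 2))

def pdist(A, B):
    """All-pairs Euclidean distances between rows of A and rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

# Energy distance: 2 E|x-y| - E|x-x'| - E|y-y'|
energy = 2 * pdist(X, Y).mean() - pdist(X, X).mean() - pdist(Y, Y).mean()

# Distance-induced kernel with base point z0 = 0
def dk(A, B):
    return (np.linalg.norm(A, axis=1)[:, None]
            + np.linalg.norm(B, axis=1)[None, :] - pdist(A, B)) / 2

mmd2 = dk(X, X).mean() + dk(Y, Y).mean() - 2 * dk(X, Y).mean()
print(energy, 2 * mmd2)   # the two agree: MMD^2 = energy / 2
```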
Extracting discriminative concepts for domain adaptation in text mining
In KDD, 2009
"... One common predictive modeling challenge occurs in text mining problems is that the training data and the operational (testing) data are drawn from different underlying distributions. This poses a great difficulty for many statistical learning methods. However, when the distribution in the source do ..."
Abstract

Cited by 21 (3 self)
One common predictive modeling challenge in text mining is that the training data and the operational (testing) data are drawn from different underlying distributions. This poses a great difficulty for many statistical learning methods. However, when the distributions in the source domain and the target domain are not identical but related, there may exist a shared concept space that preserves the relation. Consequently, a good feature representation can encode this concept space and minimize the distribution gap. To formalize this intuition, we propose a domain adaptation method that parameterizes this concept space by a linear transformation under which we explicitly minimize the distribution difference between the source domain, with sufficient labeled data, and target domains, with only unlabeled data, while at the same time minimizing the empirical loss on the labeled data in the source domain. Another characteristic of our method is its ability to consider multiple classes and their interactions simultaneously. We have conducted extensive experiments on two common text mining problems, namely information extraction and document classification, to demonstrate the effectiveness of our proposed method.
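Schematically, and in my own notation rather than the paper's, the abstract describes a joint objective over a linear map W of roughly this shape:

```latex
\min_{W}\;
\sum_{i \in \mathcal{S}} \ell\big(y_i, f(W x_i)\big)
\;+\; \lambda\, D\big(W X_{\mathcal{S}},\, W X_{\mathcal{T}}\big)
\;+\; \eta\, \Omega(W),
```

where the first term is the empirical loss on labeled source data, D measures the distribution gap between the projected source and target samples, and Ω(W) is a regularizer; λ and η are trade-off weights. All symbols here are schematic placeholders.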