Results 1–10 of 23
Multi-Manifold Semi-Supervised Learning
Abstract

Cited by 147 (9 self)
We study semi-supervised learning when the data consists of multiple intersecting manifolds. We give a finite sample analysis to quantify the potential gain of using unlabeled data in this multi-manifold setting. We then propose a semi-supervised learning algorithm that separates different manifolds into decision sets, and performs supervised learning within each set. Our algorithm involves a novel application of Hellinger distance and size-constrained spectral clustering. Experiments demonstrate the benefit of our multi-manifold semi-supervised learning approach.
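The abstract mentions a novel application of Hellinger distance. As background, the distance itself between two discrete density estimates is straightforward; this is a generic sketch of the metric, not the paper's clustering pipeline:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability vectors:
    H(p, q) = sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)), in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Identical distributions have distance 0; disjoint supports have distance 1.
print(hellinger([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(hellinger([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

In a multi-manifold setting such a distance could compare locally estimated densities around two points, but the exact construction used in the paper is more involved.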
Statistical Analysis of Semi-Supervised Regression
Abstract

Cited by 42 (1 self)
Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been based largely on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning.
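As context for the graph-Laplacian regularization the abstract refers to, here is a minimal toy version on an assumed three-node chain graph with two labeled endpoints; it illustrates the estimator family, not the paper's analysis:

```python
import numpy as np

# Graph-Laplacian-regularized regression on a 3-node chain:
# minimize ||S f - y||^2 + gamma * f^T L f over f in R^3,
# where S selects the labeled coordinates.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # chain adjacency 0 - 1 - 2
L = np.diag(W.sum(axis=1)) - W           # combinatorial graph Laplacian

labeled = np.array([0, 2])               # endpoints carry labels
y = np.array([0.0, 1.0])
S = np.zeros((len(labeled), 3))
S[np.arange(len(labeled)), labeled] = 1.0
gamma = 0.1

# Closed-form minimizer of the quadratic objective.
f = np.linalg.solve(S.T @ S + gamma * L, S.T @ y)
print(f)  # the unlabeled middle node is smoothed to the midpoint 0.5
```

The Laplacian penalty forces the fitted values to vary smoothly along graph edges, which is exactly the mechanism whose minimax rates the paper examines.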
Unlabeled data: Now it helps, now it doesn’t
Abstract

Cited by 36 (2 self)
Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundance of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing SSL gains provide only partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. In this paper, we attempt to bridge the gap between the practice and theory of semi-supervised learning. We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.
Semi-Supervised Novelty Detection
, 2010
Abstract

Cited by 31 (3 self)
A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classification. Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study. We also show that SSND provides distribution-free, learning-theoretic solutions to two well-known problems in hypothesis testing. First, our results provide a general solution to the two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard p-value approach to multiple testing under the so-called random effects model. Unlike standard rejection regions based on thresholded p-values, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions.
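The false-positive-rate calibration described above can be illustrated with a deliberately simplified 1-D sketch, where the raw value stands in for a learned class-probability score and the Neyman-Pearson-style constraint is met by thresholding at a quantile of the nominal sample (illustrative only; the paper's detectors are learned classifiers):

```python
import numpy as np

rng = np.random.default_rng(0)
nominal = rng.normal(0.0, 1.0, size=2000)                 # labeled nominal sample
unlabeled = np.concatenate([rng.normal(0.0, 1.0, 1500),   # contaminated sample:
                            rng.normal(4.0, 1.0, 500)])   # 25% novelties at mean 4

# Calibrate the detection threshold on the nominal sample so the empirical
# false positive rate is (approximately) at most alpha.
alpha = 0.05
threshold = np.quantile(nominal, 1 - alpha)

detected = unlabeled > threshold
fpr = np.mean(nominal > threshold)
print(fpr, detected.mean())
```

With the novel component well separated from the nominal one, most of the contaminating points fall above the calibrated threshold while the nominal false positive rate stays near alpha.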
Low-Noise Density Clustering
Abstract

Cited by 18 (7 self)
We study density-based clustering under low-noise conditions. Our framework allows for sharply defined clusters such as clusters on lower-dimensional manifolds. We show that accurate clustering is possible even in high dimensions. We propose two data-based methods for choosing the bandwidth and we study the stability properties of density clusters. We show that a simple graph-based algorithm known as the “friends-of-friends” algorithm successfully approximates the high-density clusters.
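The friends-of-friends idea is simple enough to sketch: link every pair of points within a linking length h and take connected components as clusters. A minimal union-find version (an illustration of the basic graph construction, not the paper's analyzed estimator):

```python
import numpy as np

def friends_of_friends(X, h):
    """Cluster by linking every pair of points within distance h and
    returning connected-component labels (union-find on the h-graph)."""
    n = len(X)
    parent = list(range(n))

    def find(i):
        # Path-halving find.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(X[i] - X[j]) <= h:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    labels = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [labels[r] for r in roots]

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
print(friends_of_friends(X, 0.3))  # [0, 0, 0, 1, 1]
```

The linking length plays the role of the bandwidth whose data-based choice the abstract refers to.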
Optimal rates for plug-in estimators of density level sets
Abstract

Cited by 14 (0 self)
In the context of density level set estimation, we study the convergence of general plug-in methods under two main assumptions on the density for a given level λ. More precisely, it is assumed that the density (i) is smooth in a neighborhood of λ and (ii) has γ-exponent at level λ. Condition (i) ensures that the density can be estimated at a standard nonparametric rate, and condition (ii) is similar to Tsybakov’s margin assumption, which is stated for the classification framework. Under these assumptions, we derive optimal rates of convergence for plug-in estimators. Explicit convergence rates are given for plug-in estimators based on kernel density estimators when the underlying measure is the Lebesgue measure. Lower bounds proving optimality of the rates in a minimax sense when the density is Hölder smooth are also provided.
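A plug-in level set estimator simply substitutes a density estimate into the set {x : p(x) ≥ λ}. A 1-D sketch with a Gaussian kernel density estimate (generic illustration; the bandwidth and grid here are arbitrary choices, not the paper's):

```python
import numpy as np

def kde(x_eval, sample, h):
    """Gaussian kernel density estimate evaluated at the points x_eval."""
    diffs = (x_eval[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * diffs ** 2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1.0, size=5000)   # true density: standard normal
grid = np.linspace(-4, 4, 801)
lam = 0.2                                  # target level lambda

# Plug-in estimate of the level set {x : p(x) >= lambda}.
level_set = grid[kde(grid, sample, h=0.3) >= lam]
print(level_set.min(), level_set.max())
```

For the standard normal, {x : p(x) ≥ 0.2} is an interval of roughly ±1.2, so the printed endpoints should land near that; the paper's rates quantify how fast such estimates converge.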
Latent Space Domain Transfer between High Dimensional Overlapping Distributions
Abstract

Cited by 5 (2 self)
Transferring knowledge from one domain to another is challenging for a number of reasons. Since both the conditional and marginal distributions of the training data and test data are non-identical, a model trained in one domain usually has low accuracy when directly applied to a different domain. For many applications with large feature sets, such as text documents, sequence data, medical data, or image data of different resolutions, two domains usually do not contain exactly the same features, thus introducing large numbers of “missing values” when considered over the union of features from both domains. In other words, their marginal distributions are at most overlapping. At the same time, these problems are usually high dimensional, with, e.g., several thousand features. Thus, the combination of high …
Finite sample analysis of semi-supervised learning
, 2008
Abstract

Cited by 2 (2 self)
Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundance of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing SSL gains provide only partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. In this paper, we attempt to bridge the gap between the practice and theory of semi-supervised learning. We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.
Efficient Semi-Supervised and Active Learning of Disjunctions
, 2013
Abstract

Cited by 2 (2 self)
We provide efficient algorithms for learning disjunctions in the semi-supervised setting under a natural regularity assumption introduced by Balcan & Blum (2005). We prove bounds on the sample complexity of our algorithms under a mild restriction on the data distribution. We also give an active learning algorithm with improved sample complexity and extend all our algorithms to the random classification noise setting.
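For contrast with the semi-supervised and active algorithms of the paper, the classic supervised elimination algorithm for monotone disjunctions fits in a few lines (a textbook baseline, not the paper's method): keep every variable that is never set in a negative example.

```python
def learn_disjunction(examples):
    """Elimination algorithm for monotone disjunctions over boolean vectors.
    examples: list of (x, label) pairs, x a tuple of 0/1, label in {0, 1}.
    Returns the set of variable indices kept in the hypothesis disjunction."""
    n = len(examples[0][0])
    relevant = set(range(n))
    for x, label in examples:
        if label == 0:
            # A negative example rules out every variable it sets.
            relevant -= {i for i in range(n) if x[i]}
    return relevant

# Target concept: x0 OR x2 over 4 variables.
data = [((1, 0, 0, 0), 1), ((0, 0, 1, 1), 1),
        ((0, 1, 0, 0), 0), ((0, 0, 0, 1), 0)]
print(learn_disjunction(data))  # {0, 2}
```

The semi-supervised twist studied in the paper is precisely about reducing how many such labeled examples are needed by exploiting the unlabeled distribution.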
Consistency and rates for clustering with DBSCAN
 Journal of Machine Learning Research, Proceedings Track, 22:1090–1098
, 2012
Abstract

Cited by 1 (0 self)
We propose a simple and efficient modification of the popular DBSCAN clustering algorithm. This modification is able to detect the most interesting vertical threshold level in an automated, data-driven way. We establish both consistency and optimal learning rates for this modification.
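The density-clustering mechanism underlying DBSCAN can be sketched as connected components of core points, where a core point has at least min_pts neighbors within radius eps (a minimal illustration of plain DBSCAN; the authors' modification, which chooses the threshold level automatically, is not reproduced here):

```python
import numpy as np

def dbscan_core_components(X, eps, min_pts):
    """Minimal DBSCAN-style clustering: core points have >= min_pts neighbors
    within eps (counting themselves); clusters are connected components of the
    eps-graph restricted to core points. Non-core points get label -1."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = D <= eps
    core = neighbors.sum(axis=1) >= min_pts

    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Flood-fill one connected component of core points.
        stack = [i]
        labels[i] = cluster
        while stack:
            j = stack.pop()
            for k in np.nonzero(neighbors[j] & core)[0]:
                if labels[k] == -1:
                    labels[k] = cluster
                    stack.append(k)
        cluster += 1
    return labels

X = np.array([[0.0], [0.1], [0.2], [3.0], [3.1], [3.2], [10.0]])
print(dbscan_core_components(X, eps=0.25, min_pts=2))
```

Here the two dense groups form separate clusters and the isolated point is marked as noise; eps plays the role of the threshold level that the paper's modification selects in a data-driven way.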