Learning bounds for domain adaptation
- In Advances in Neural Information Processing Systems, 2008
Cited by 41 (6 self)
Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to a different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization.
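As a rough illustration of the objective this abstract describes, the following sketch computes a convex combination of source and target empirical risk from per-example losses. The function name, the parameter `alpha`, and the choice to put `alpha` on the target term are assumptions made for this example; the abstract does not specify them.

```python
def combined_empirical_risk(source_losses, target_losses, alpha):
    """Convex combination of target and source empirical risk.

    Hypothetical helper: `alpha` weights the target risk and (1 - alpha)
    weights the source risk. This weighting convention is an assumption.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not source_losses or not target_losses:
        raise ValueError("both loss lists must be non-empty")
    # Empirical risk is the average per-example loss on each sample.
    source_risk = sum(source_losses) / len(source_losses)
    target_risk = sum(target_losses) / len(target_losses)
    return alpha * target_risk + (1.0 - alpha) * source_risk


# Many source examples, few target examples (0-1 losses of one hypothesis).
source = [0, 0, 0, 1]
target = [1, 1]
risk = combined_empirical_risk(source, target, alpha=0.5)
```

Setting `alpha` to 1 recovers plain empirical risk on the target sample alone, and setting it to 0 ignores the target sample entirely; intermediate values trade the two off.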
On a theory of learning with similarity functions
- In International Conference on Machine Learning, 2006
Cited by 24 (8 self)
Kernel functions have become an extremely popular tool in machine learning, with an attractive theory as well. This theory views a kernel as implicitly mapping data points into a possibly very high dimensional space, and describes a kernel function as being good for a given learning problem if data is separable by a large margin in that implicit space. However, while quite elegant, this theory does not necessarily correspond to the intuition of a good kernel as a good measure of similarity, and the underlying margin in the implicit space usually is not apparent in “natural” representations of the data. Therefore, it may be difficult for a domain expert to use the theory to help design an appropriate kernel for the learning task at hand. Moreover, the requirement of positive semi-definiteness may rule out the most natural pairwise similarity functions for the given problem domain. In this work we develop an alternative, more general theory of learning with similarity functions (i.e., sufficient conditions for a similarity function to allow one to learn well) that does not require reference to implicit spaces, and does not require the function to be positive semi-definite (or even symmetric). Instead, our theory talks in terms of more direct properties of how the function behaves as a similarity measure. Our results also generalize the standard theory in the sense that any good kernel function under the usual definition can be shown to also be a good similarity function under our definition (though with some loss in the parameters). In this way, we provide the first steps towards a theory of kernels and more general similarity functions that describes the effectiveness of a given function in terms of natural similarity-based properties.
Explicit learning curves for transduction and application to clustering and compression algorithms
- Journal of Artificial Intelligence Research, 2004
Cited by 20 (3 self)
Inductive learning is based on inferring a general rule from a finite data set and using it to label new data. In transduction one attempts to solve the problem of using a labeled training set to label a set of unlabeled points, which are given to the learner prior to learning. Although transduction seems at the outset to be an easier task than induction, there have not been many provably useful algorithms for transduction. Moreover, the precise relation between induction and transduction has not yet been determined. The main theoretical developments related to transduction were presented by Vapnik more than twenty years ago. One of Vapnik’s basic results is a rather tight error bound for transductive classification based on an exact computation of the hypergeometric tail. While tight, this bound is given implicitly via a computational routine. Our first contribution is a somewhat looser but explicit characterization of a slightly extended PAC-Bayesian version of Vapnik’s transductive bound. This characterization is obtained using concentration inequalities for the tail of sums of random variables obtained by sampling without replacement. We then derive error bounds for compression schemes such as (transductive) support vector machines and for transduction algorithms based on clustering. The main observation used for deriving these new error bounds and algorithms is that the unlabeled test points, which in the transductive setting are known in advance, can be used to construct useful data-dependent prior distributions over the hypothesis space.
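For readers unfamiliar with the quantity mentioned in this abstract, the sketch below computes an exact hypergeometric upper tail, i.e. the probability of seeing at least `k` errors when sampling without replacement. It only illustrates the kind of computation involved and is not the routine behind Vapnik's bound or the paper's explicit bound; the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, errors, draws, k):
    """P(X >= k), where X counts the error points among `draws` points
    sampled without replacement from `population` points, of which
    `errors` are misclassified. Names are illustrative assumptions."""
    if not (0 <= errors <= population and 0 <= draws <= population):
        raise ValueError("need 0 <= errors, draws <= population")
    total = comb(population, draws)
    highest = min(errors, draws)
    # Sum the exact probability mass over all counts from k upward.
    mass = sum(
        comb(errors, i) * comb(population - errors, draws - i)
        for i in range(max(k, 0), highest + 1)
    )
    return mass / total


# 10 points in total, 3 of them errors; probability that a sample of 2
# points contains at least one error.
p = hypergeom_upper_tail(population=10, errors=3, draws=2, k=1)
```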
On Bayesian bounds
- In Proceedings of the 23rd International Conference on Machine Learning, 2006
Cited by 11 (1 self)
We show that several important Bayesian bounds studied in machine learning, in both the batch and the online setting, arise by an application of a simple compression lemma. In particular, we derive (i) PAC-Bayesian bounds in the batch setting, (ii) Bayesian log-loss bounds and (iii) Bayesian bounded-loss bounds in the online setting using the compression lemma. Although every setting has different semantics for prior, posterior and loss, we show that the core bound argument is the same. The paper simplifies our understanding of several important and apparently disparate results, and brings to light a powerful tool for developing similar arguments for other methods.
Smoothness, low noise and fast rates
- In NIPS, 2010
Cited by 6 (3 self)
We establish an excess risk bound of $\tilde{O}\left(H R_n^2 + \sqrt{H L^*}\, R_n\right)$ for ERM with an $H$-smooth loss function and a hypothesis class with Rademacher complexity $R_n$, where $L^*$ is the best risk achievable by the hypothesis class. For typical hypothesis classes where $R_n = \sqrt{R/n}$, this translates to a learning rate of $\tilde{O}(RH/n)$ in the separable ($L^* = 0$) case and $\tilde{O}\left(RH/n + \sqrt{L^* RH/n}\right)$ more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective.
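The step from the general bound to the stated learning rates is a direct substitution. Taking the bound to be $\tilde{O}(H R_n^2 + \sqrt{H L^*}\, R_n)$, which is a reading of the abstract rather than a restatement of the paper's theorem, it can be worked out as follows:

```latex
\begin{align*}
R_n = \sqrt{R/n}
  \quad\Longrightarrow\quad
  H R_n^2 &= \frac{RH}{n}, \\
  \sqrt{H L^*}\, R_n &= \sqrt{H L^*}\,\sqrt{\frac{R}{n}}
                      = \sqrt{\frac{L^* R H}{n}}, \\
  \tilde{O}\!\left(H R_n^2 + \sqrt{H L^*}\, R_n\right)
          &= \tilde{O}\!\left(\frac{RH}{n} + \sqrt{\frac{L^* R H}{n}}\right).
\end{align*}
```

In the separable case $L^* = 0$ the square-root term vanishes, leaving the fast rate $\tilde{O}(RH/n)$.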
Learning to Classify with Missing and Corrupted Features
Cited by 5 (2 self)
After a classifier is trained using a machine learning algorithm and put to use in a real world system, it often faces noise which did not appear in the training data. Particularly, some subset of features may be missing or may become corrupted. We present two novel machine learning techniques that are robust to this type of classification-time noise. First, we solve an approximation to the learning problem using linear programming. We analyze the tightness of our approximation and prove statistical risk bounds for this approach. Second, we define the online-learning variant of our problem, address this variant using a modified Perceptron, and obtain a statistical learning algorithm using an online-to-batch technique. We conclude with a set of experiments that demonstrate the effectiveness of our algorithms.
Error bounds for transductive learning via compression and clustering
- NIPS, 2004
Cited by 5 (1 self)
This paper is concerned with transductive learning. Although transduction appears to be an easier task than induction, there have not been many provably useful algorithms and bounds for transduction. We present explicit error bounds for transduction and derive a general technique for devising bounds within this setting. The technique is applied to derive error bounds for compression schemes such as (transductive) SVMs and for transduction algorithms based on clustering.
1 Introduction and Related Work
In contrast to inductive learning, in the transductive setting the learner is given both the training and test sets prior to learning. The goal of the learner is to infer (or “transduce”) the labels of the test points. The transduction setting was introduced by Vapnik [1, 2] who proposed basic bounds and an algorithm for this setting. Clearly, inferring the labels of
Improved Guarantees for Learning via Similarity Functions
Cited by 4 (2 self)
We continue the investigation of natural conditions for a similarity function to allow learning, without requiring the similarity function to be a valid kernel, or referring to an implicit high-dimensional space. We provide a new notion of a “good similarity function” that builds upon the previous definition of Balcan and Blum (2006) but improves on it in two important ways. First, as with the previous definition, any large-margin kernel is also a good similarity function in our sense, but the translation now results in a much milder increase in the labeled sample complexity. Second, we prove that for distribution-specific PAC learning, our new notion is strictly more powerful than the traditional notion of a large-margin kernel. In particular, we show that for any hypothesis class C there exists a similarity function under our definition allowing learning with O(log |C|) labeled examples. However, in a lower bound which may be of independent interest, we show that for any class C of pairwise uncorrelated functions, there is no kernel with margin γ ≥ 8/√|C| for all f ∈ C, even if one allows average hinge-loss as large as 0.5. Thus, the sample complexity for learning such classes with SVMs is Ω(|C|). This extends work of Ben-David et al. (2003) and Forster and Simon (2006) who give hardness results with comparable margin bounds, but at much lower error rates. Our new notion of similarity relies upon L1 regularized learning, and our separation result is related to a separation result between what is learnable with L1 vs. L2 regularization.
PAC-Bayes risk bounds for sample-compressed Gibbs classifiers
- Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), 2005
Cited by 3 (3 self)
We extend the PAC-Bayes theorem to the sample-compression setting where each classifier is represented by two independent sources of information: a compression set which consists of a small subset of the training data, and a message string of the additional information needed to obtain a classifier. The new bound is obtained by using a prior over a data-independent set of objects where each object gives a classifier only when the training data is provided. The new PAC-Bayes theorem states that a Gibbs classifier defined on a posterior over sample-compressed classifiers can have a smaller risk bound than any such (deterministic) sample-compressed classifier.
Large margin methods for structured classification: Exponentiated Gradient algorithms and PAC-Bayesian generalization bounds
- NIPS Conference, 2004
Cited by 3 (0 self)
We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of both Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential family (Gibbs distribution) representation of structured objects. The algorithm is efficient, even in cases where the number of labels y is exponential in size, provided that certain expectations under Gibbs distributions can be calculated efficiently. The optimization method we use for structured labels relies on a more general result, specifically the application of exponentiated gradient (EG) updates [4, 5] to quadratic programs (QPs). We describe a new method for solving QPs based on these techniques, and give bounds on its rate of convergence. In addition to their application to the structured-labels task, the EG updates lead to simple algorithms for optimizing "conventional" binary or multiclass SVM problems. Finally, we give a new generalization bound for structured classification, using PAC-Bayesian methods for the analysis of large margin classifiers.
1 Introduction
Structured classification is the problem of predicting y from x in the case where y has meaningful internal structure. For example x might be a word string and y a sequence of part of speech labels, or x might be a Markov random field and y a labeling of x, or x might be a word string and y a parse of x. In these examples the number of possible labels

