Results 1–10 of 91
Fast learning rates for plug-in classifiers
Ann. Statist., 2007
Cited by 58 (4 self)
Abstract:
It has been recently shown that, under the margin (or low-noise) assumption, there exist classifiers attaining fast rates of convergence of the excess Bayes risk, that is, rates faster than n^{-1/2}. The work on this subject has suggested the following two conjectures: (i) the best achievable fast rate is of order n^{-1}, and (ii) plug-in classifiers generally converge more slowly than classifiers based on empirical risk minimization. We show that both conjectures are incorrect. In particular, we construct plug-in classifiers that achieve not only fast but also super-fast rates, that is, rates faster than n^{-1}. We establish minimax lower bounds showing that the obtained rates cannot be improved.
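As a toy illustration of the plug-in principle this abstract refers to, the sketch below estimates the regression function eta(x) = P(Y = 1 | X = x) and thresholds it at 1/2, mimicking the Bayes rule. All data and function names are hypothetical, and a simple k-NN average stands in for the nonparametric estimators the paper actually analyzes:

```python
import numpy as np

def eta_hat(x, X_train, y_train, k):
    """k-NN estimate of the regression function eta(x) = P(Y=1 | X=x):
    average the labels of the k training points closest to x."""
    nearest = np.argsort(np.abs(X_train - x))[:k]
    return y_train[nearest].mean()

def plugin_classify(x, X_train, y_train, k=5):
    """Plug-in rule: predict 1 iff the estimated regression function
    is at least 1/2 (mimicking the Bayes classifier)."""
    return int(eta_hat(x, X_train, y_train, k) >= 0.5)

# Toy 1-d data where the Bayes rule is "predict 1 iff x >= 0.5".
X_train = np.linspace(0.0, 1.0, 101)
y_train = (X_train >= 0.5).astype(int)

print(plugin_classify(0.9, X_train, y_train))  # -> 1
print(plugin_classify(0.1, X_train, y_train))  # -> 0
```

The rates discussed in the paper concern how fast the excess risk of such a rule decays as the sample size grows, depending on the smoothness of eta and the margin condition.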
An RKHS for multi-view learning and manifold co-regularization
In Proc. of ICML’08, 2008
Cited by 42 (4 self)
Abstract:
Inspired by co-training, many multi-view semi-supervised kernel methods implement the following idea: find a function in each of multiple Reproducing Kernel Hilbert Spaces (RKHSs) such that (a) the chosen functions make similar predictions on unlabeled examples, and (b) the average prediction given by the chosen functions performs well on labeled examples. In this paper, we construct a single RKHS with a data-dependent “co-regularization” norm that reduces these approaches to standard supervised learning. The reproducing kernel for this RKHS can be explicitly derived and plugged into any kernel method, greatly extending the theoretical and algorithmic scope of co-regularization. In particular, with this development, the Rademacher complexity bound for co-regularization given in (Rosenberg & Bartlett, 2007) follows easily from well-known results. Furthermore, more refined bounds given by localized Rademacher complexity can also be easily applied. We propose a co-regularization-based algorithmic alternative to manifold regularization (Belkin et al., 2006; Sindhwani et al., 2005a) that leads to major empirical improvements on semi-supervised tasks. Unlike the recently proposed transductive approach of (Yu et al., 2008), our RKHS formulation is truly semi-supervised and naturally extends to unseen test data.
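A minimal sketch of the co-regularization idea described in (a) and (b) above, using linear predictors as a stand-in for general RKHS functions rather than the paper's explicit co-regularization kernel; all names and data here are hypothetical. Since every term is quadratic, the joint minimizer solves one linear system:

```python
import numpy as np

def coreg_fit(X1l, X2l, y, X1u, X2u, gamma=0.1, mu=1.0):
    """Two-view co-regularized least squares with linear predictors
    f_v(x) = x @ w_v (a stand-in for general RKHS functions):
      minimize  ||(f1 + f2)/2 - y||^2          on labeled data
              + gamma * (||w1||^2 + ||w2||^2)  (ridge penalties)
              + mu * ||f1 - f2||^2             on unlabeled data.
    The objective is quadratic, so the minimizer solves a linear system.
    """
    d1, d2 = X1l.shape[1], X2l.shape[1]
    A = np.hstack([X1l, X2l]) / 2.0   # average prediction on labeled points
    B = np.hstack([X1u, -X2u])        # view disagreement on unlabeled points
    H = A.T @ A + gamma * np.eye(d1 + d2) + mu * (B.T @ B)
    w = np.linalg.solve(H, A.T @ y)
    return w[:d1], w[d1:]

rng = np.random.default_rng(0)
# Toy data: both views are noisy copies of a shared 3-d signal.
z_l, z_u = rng.normal(size=(10, 3)), rng.normal(size=(50, 3))
X1l = z_l + 0.1 * rng.normal(size=z_l.shape)
X2l = z_l + 0.1 * rng.normal(size=z_l.shape)
X1u = z_u + 0.1 * rng.normal(size=z_u.shape)
X2u = z_u + 0.1 * rng.normal(size=z_u.shape)
y = z_l @ np.array([1.0, -1.0, 0.5])

w1, w2 = coreg_fit(X1l, X2l, y, X1u, X2u)
# The co-regularizer pulls the two views' predictions together on unlabeled data.
print(np.abs(X1u @ w1 - X2u @ w2).mean())
```

Raising mu can only shrink the disagreement term at the optimum, which is the mechanism the co-regularization norm of the paper encodes intrinsically.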
RANKING AND EMPIRICAL MINIMIZATION OF U-STATISTICS
2008
Cited by 34 (2 self)
Abstract:
The problem of ranking/ordering instances, instead of simply classifying them, has recently gained much attention in machine learning. In this paper we formulate the ranking problem in a rigorous statistical framework. The goal is to learn a ranking rule for deciding, between two instances, which one is “better,” with minimum ranking risk. Since the natural estimates of the risk take the form of a U-statistic, results from the theory of U-processes are required for investigating the consistency of empirical risk minimizers. We establish, in particular, a tail inequality for degenerate U-processes, and apply it to show that fast rates of convergence may be achieved under specific noise assumptions, just as in classification. Convex risk minimization methods are also studied.
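The "natural estimate of the risk" mentioned above is an average over pairs of observations, which is what makes it a U-statistic of order two. A small sketch, with hypothetical data, of this pairwise empirical ranking risk for a scoring rule:

```python
import numpy as np
from itertools import combinations

def empirical_ranking_risk(scores, y):
    """U-statistic estimate of the ranking risk of a scoring rule:
    the fraction of pairs (i, j) with y_i != y_j that the scores
    order incorrectly (ties counted as half an error)."""
    errs, pairs = 0.0, 0
    for i, j in combinations(range(len(y)), 2):
        if y[i] == y[j]:
            continue                      # only mixed-label pairs matter
        pairs += 1
        better = i if y[i] > y[j] else j  # instance that should rank higher
        worse = j if better == i else i
        if scores[better] < scores[worse]:
            errs += 1.0
        elif scores[better] == scores[worse]:
            errs += 0.5
    return errs / pairs

y = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.6, 0.4, 0.9])   # one positive is outscored by a negative
print(empirical_ranking_risk(scores, y))  # -> 0.25
```

For bipartite labels this quantity equals one minus the empirical AUC; because each observation appears in many pairs, the summands are dependent, which is why U-process (rather than i.i.d.) tools are needed to analyze its minimizers.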
Functional classification in Hilbert spaces
 IEEE Transactions on Information Theory
Cited by 32 (1 self)
Abstract: Let X be a random variable taking values in a separable Hilbert space X, with label Y ∈ {0, 1}. We establish universal weak consistency of a nearest neighbor-type classifier based on n independent copies (Xi, Yi) of the pair (X, Y), extending the classical result of Stone [1] to infinite-dimensional Hilbert spaces. Under a mild condition on the distribution of X, we also prove strong consistency. We reduce the infinite dimension of X by considering only the first d coefficients of a Fourier series expansion of each Xi, and then we perform k-nearest neighbor classification in R^d. Both the dimension and the number of neighbors are automatically selected from the data using a simple data-splitting device. An application of this technique to a signal discrimination problem involving speech recordings is presented. Index Terms: classification, Fourier expansion, nearest neighbor rule, universal consistency.
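The pipeline in this abstract (truncate each curve to its first d Fourier coefficients, run k-NN in R^d, and pick d and k by data splitting) can be sketched as follows. The data, grids, and helper names are hypothetical, and a discrete FFT stands in for the paper's Fourier series expansion:

```python
import numpy as np

def fourier_features(curves, d):
    """Keep the first d Fourier coefficients of each sampled curve
    (real and imaginary parts), reducing each function to R^(2d)."""
    coefs = np.fft.rfft(curves, axis=1)[:, :d]
    return np.hstack([coefs.real, coefs.imag])

def knn_predict(Xtr, ytr, Xte, k):
    """Plain k-nearest-neighbor majority vote in Euclidean space."""
    dists = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    idx = np.argsort(dists, axis=1)[:, :k]
    return (ytr[idx].mean(axis=1) >= 0.5).astype(int)

def fit_by_splitting(curves, y, d_grid, k_grid, n_train):
    """Select the truncation level d and the number of neighbors k
    on a held-out split, a simple data-splitting device."""
    best = None
    for d in d_grid:
        feats = fourier_features(curves, d)
        Xtr, Xva = feats[:n_train], feats[n_train:]
        ytr, yva = y[:n_train], y[n_train:]
        for k in k_grid:
            err = np.mean(knn_predict(Xtr, ytr, Xva, k) != yva)
            if best is None or err < best[0]:
                best = (err, d, k)
    return best

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 64)
n = 80
labels = rng.integers(0, 2, n)
# Class-1 curves carry an extra low-frequency component plus noise.
curves = (np.sin(2 * np.pi * t)[None, :] * labels[:, None]
          + 0.3 * rng.normal(size=(n, 64)))
err, d, k = fit_by_splitting(curves, labels, d_grid=[2, 4, 8],
                             k_grid=[1, 3, 5], n_train=50)
print(err, d, k)
```

The discriminating signal lives at a single low frequency here, so small d already suffices; the selection step is what lets the method adapt when that is not known in advance.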
Ranking the best instances
 Journal of Machine Learning Research
Cited by 26 (11 self)
Abstract:
We formulate a local form of the bipartite ranking problem where the goal is to focus on the best instances. We propose a methodology based on the construction of real-valued scoring functions. We study empirical risk minimization of dedicated statistics which involve empirical quantiles of the scores. We first state the problem of finding the best instances, which can be cast as a classification problem with a mass constraint. Next, we develop special performance measures for the local ranking problem which extend the Area Under the ROC Curve (AUC) criterion, and describe the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stagewise manner, where the best instances would first be tentatively identified and a standard AUC criterion then applied. Finally, we state preliminary statistical results for the local ranking problem.
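One simplified way to see how an empirical score quantile localizes a ranking criterion to the best instances is sketched below. This is a hypothetical illustration in the spirit of the abstract, not the paper's exact performance measure: restrict attention to the top fraction u of scores and compute the AUC there.

```python
import numpy as np

def local_auc(scores, y, u):
    """Restrict attention to the fraction u of instances with the highest
    scores (above the empirical (1-u)-quantile) and compute the probability
    that a positive from this region outscores a negative from it."""
    thresh = np.quantile(scores, 1 - u)
    keep = scores >= thresh
    s, t = scores[keep], y[keep]
    pos, neg = s[t == 1], s[t == 0]
    if len(pos) == 0 or len(neg) == 0:
        return None          # the local criterion is undefined on this sample
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

y = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.6, 0.4, 0.9, 0.7, 0.5])
full = local_auc(scores, y, u=1.0)    # global AUC over all instances
top = local_auc(scores, y, u=0.5)     # ranking quality among the top half
print(full, top)
```

A scoring function can have a good global AUC yet order the top-scored instances badly, which is exactly the failure mode that motivates a dedicated local criterion.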
Generalization error bounds in semi-supervised classification under the cluster assumption
2007
Simultaneous adaptation to the margin and to complexity in classification
2005
Cited by 20 (6 self)
Abstract:
We consider the problem of adaptation to the margin and to complexity in binary classification. We suggest a learning method with a numerically easy aggregation step. Adaptivity both to the margin and to complexity in classification usually involves empirical risk minimization or Rademacher complexities, which lead to numerical difficulties. On the other hand, there exist classifiers that are easy to compute and converge at fast rates but are not adaptive. By combining these classifiers with our aggregation procedure, we obtain numerically realizable adaptive classifiers that converge at fast rates.
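A common way to make such an aggregation step "numerically easy" is exponential weighting over a finite family, with the weights computed from errors on a held-out split. The sketch below illustrates that idea with hypothetical classifiers and data; the paper's exact procedure and tuning may differ:

```python
import numpy as np

def exp_weights_aggregate(preds, y, temperature):
    """Exponential-weighting aggregation over a finite family of
    classifiers: weight each classifier by exp(-n * err / T), where err
    is its error rate on a held-out split, then normalize."""
    errs = np.mean(preds != y[None, :], axis=1)
    w = np.exp(-len(y) * errs / temperature)
    return w / w.sum()

def aggregate_predict(preds_new, w):
    """Weighted majority vote of the family on new points."""
    return (w @ preds_new >= 0.5).astype(int)

# Hypothetical family: three fixed classifiers evaluated on a split.
y_split = np.array([1, 0, 1, 1, 0, 1, 0, 0])
preds = np.array([
    [1, 0, 1, 1, 0, 1, 0, 0],   # perfect on this split
    [1, 0, 1, 1, 0, 1, 0, 1],   # one mistake
    [0, 1, 0, 0, 1, 0, 1, 1],   # always wrong
])
w = exp_weights_aggregate(preds, y_split, temperature=1.0)
print(np.round(w, 3))
```

No optimization is required: the weights are a closed-form function of held-out error counts, which is what makes aggregation attractive compared with empirical risk minimization over the combined class.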
Overlaying classifiers: A practical approach for optimal ranking
Adv. Neural Inf. Process. Syst., 2009
Cited by 15 (7 self)
Abstract:
The ROC curve is one of the most widely used visual tools for evaluating the performance of scoring functions with regard to their capacity to discriminate between two populations. The goal of this paper is to propose a statistical learning method for constructing a scoring function with a nearly optimal ROC curve. In this bipartite setup, the target is known to be the regression function up to an increasing transform, and solving the optimization problem boils down to recovering the collection of level sets of the latter, which we interpret here as a continuum of nested classification problems. We propose a discretization approach, consisting of building a finite sequence of N classifiers by constrained empirical risk minimization and then constructing a piecewise-constant scoring function sN(x) by overlaying the resulting classifiers. Given the functional nature of the ROC criterion, the accuracy of the ranking induced by sN(x) can be conceived in a variety of ways, depending on the distance chosen for measuring closeness to the optimal curve in ROC space. By relating the ROC curve of the resulting scoring function to piecewise-linear approximants of the optimal ROC curve, we establish the consistency of the method as well as rate bounds controlling its generalization ability in sup-norm. Finally, we also highlight the fact that, as a by-product, the proposed algorithm provides an accurate estimate of the optimal ROC curve.
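The overlaying step itself is simple to state: if the N classifiers have (ideally nested) acceptance regions, score a point by how many classifiers accept it, which yields a piecewise-constant scoring function. A minimal sketch with hypothetical threshold classifiers on 1-d inputs:

```python
import numpy as np

def overlay_score(classifier_outputs):
    """Piecewise-constant scoring function obtained by overlaying N
    binary classifiers: the score of x is the number of classifiers
    that accept it, i.e. how deep x sits in the (ideally nested)
    sequence of acceptance regions."""
    return np.sum(classifier_outputs, axis=0)

# Hypothetical N = 3 classifiers with nested acceptance regions
# {x >= t} for decreasing thresholds t.
x = np.array([0.05, 0.3, 0.55, 0.8])
thresholds = [0.75, 0.5, 0.25]
outputs = np.array([(x >= t).astype(int) for t in thresholds])
print(overlay_score(outputs))  # -> [0 1 2 3]
```

In the paper each classifier targets one level set of the regression function via constrained empirical risk minimization; the overlay then recovers the ordering those level sets induce, and refining the discretization (larger N) tightens the piecewise-linear approximation of the optimal ROC curve.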
On the Equivalence between Herding and Conditional Gradient Algorithms
2012
Cited by 14 (3 self)
Abstract:
We show that the herding procedure of Welling (2009b) takes exactly the form of a standard convex optimization algorithm, namely a conditional gradient algorithm minimizing a quadratic moment discrepancy. This link enables us to invoke convergence results from convex optimization and to consider faster alternatives for the task of approximating integrals in a reproducing kernel Hilbert space. We study the behavior of the different variants through numerical simulations. The experiments indicate that while we can improve over herding on the task of approximating integrals, the original herding algorithm tends to approach the maximum-entropy distribution more often, shedding more light on the learning bias behind herding.
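The equivalence can be sketched concretely for kernel herding over a finite candidate set: each herding iteration is a conditional gradient (Frank-Wolfe) step with step size 1/(t+1) on the squared moment discrepancy between the target mean embedding and the embedding of the selected points. The kernel, grid, and target below are hypothetical choices for illustration:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian kernel matrix between two 1-d point sets."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def kernel_herding(candidates, mu_weights, T):
    """Kernel herding over a finite candidate set, written as conditional
    gradient with step 1/(t+1) on ||mu - sum_i w_i phi(x_i)||^2 in the RKHS.
    Here mu is the embedding of a target distribution supported on the same
    candidates, so <mu, phi(x_j)> = (K @ mu_weights)[j]."""
    K = rbf(candidates, candidates)
    target = K @ mu_weights
    w = np.zeros(len(candidates))
    picks = []
    for t in range(T):
        # Linear minimization step: the simplex vertex with the most
        # negative gradient component, i.e. the greedy herding pick.
        j = int(np.argmax(target - K @ w))
        picks.append(j)
        step = 1.0 / (t + 1)
        w = (1 - step) * w
        w[j] += step
    return picks, w

# Target: uniform distribution on a grid; herding spreads its picks out.
grid = np.linspace(-2.0, 2.0, 41)
mu_w = np.full(41, 1.0 / 41)
picks, w = kernel_herding(grid, mu_w, T=10)
K = rbf(grid, grid)
gap = np.sqrt(max(w @ K @ w - 2 * w @ K @ mu_w + mu_w @ K @ mu_w, 0.0))
print(len(set(picks)), gap)
```

With the 1/(t+1) step the final weights are uniform over the selected points, recovering herding's equal-weight sample; the "faster alternatives" studied in the paper correspond to other conditional-gradient step-size rules, such as line search.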