Results 11-20 of 89
On the Equivalence between Herding and Conditional Gradient Algorithms
2012
Cited by 14 (3 self)
We show that the herding procedure of Welling (2009b) takes exactly the form of a standard convex optimization algorithm, namely a conditional gradient algorithm minimizing a quadratic moment discrepancy. This link enables us to invoke convergence results from convex optimization and to consider faster alternatives for the task of approximating integrals in a reproducing kernel Hilbert space. We study the behavior of the different variants through numerical simulations. The experiments indicate that while we can improve over herding on the task of approximating integrals, the original herding algorithm more often tends to approach the maximum entropy distribution, shedding more light on the learning bias behind herding.
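The link described in this abstract can be illustrated with a minimal sketch. It assumes a finite set of candidate points with explicit feature vectors (the rows of a hypothetical matrix `Phi`) and a target moment vector `mu`; the paper itself works in a reproducing kernel Hilbert space, so this is only a finite-dimensional illustration, not the authors' implementation. The loop runs a conditional gradient (Frank-Wolfe) iteration on J(g) = 0.5 * ||g - mu||^2 with step size 1/(t+1) and tracks the herding weight vector alongside it.

```python
import numpy as np

def herding_as_frank_wolfe(Phi, mu, n_steps):
    """Select rows of Phi by Frank-Wolfe on J(g) = 0.5 * ||g - mu||^2 with step 1/(t+1)."""
    g = mu.copy()            # starting point is irrelevant: the first step has rho = 1
    w = np.zeros_like(mu)    # herding weight vector, tracked only to show the link
    chosen = []
    for t in range(n_steps):
        # Linear minimization oracle: the candidate most aligned with mu - g
        # (the negative gradient of J at g).
        i = int(np.argmax(Phi @ (mu - g)))
        rho = 1.0 / (t + 1)
        g = (1.0 - rho) * g + rho * Phi[i]   # g is the running mean of the selected rows
        w = w + mu - Phi[i]                  # herding update; w equals (t + 1) * (mu - g)
        chosen.append(i)
    return np.array(chosen), g, w

rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 5))   # hypothetical features of 200 candidate points
mu = Phi.mean(axis=0)             # target moments (here: the empirical mean)
idx, g, w = herding_as_frank_wolfe(Phi, mu, n_steps=50)
print(np.linalg.norm(g - mu))     # moment discrepancy after 50 selections
```

Because `w` is a positive multiple of `mu - g` after the first step, choosing the next point by maximizing `Phi @ w` (the herding rule) or `Phi @ (mu - g)` (the conditional gradient rule) picks the same candidate.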
Nonparametric bandits with covariates
In COLT, 2010
Cited by 13 (1 self)
We consider a bandit problem which involves sequential sampling from two populations (arms). Each arm produces a noisy reward realization which depends on an observable random covariate. The goal is to maximize cumulative expected reward. We derive general lower bounds on the performance of any admissible policy, and develop an algorithm whose performance achieves the order of said lower bound up to logarithmic terms. This is done by decomposing the global problem into suitably “localized” bandit problems. Proofs blend ideas from nonparametric statistics and traditional methods used in the bandit literature.
VC Theory of Large Margin Multi-Category Classifiers
Cited by 12 (4 self)
In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vector-valued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a naïve extension of the results devoted to the computation of dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψ-dimensions thanks to generalizations of Sauer’s lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multi-class SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines.
Optimal rates of aggregation in classification under low noise assumption
2007
Cited by 10 (0 self)
In the same spirit as Tsybakov, we define the optimality of an aggregation procedure in the problem of classification. Using an aggregate with exponential weights, we obtain an optimal rate of convex aggregation for the hinge risk under the margin assumption. Moreover, we obtain an optimal rate of model selection aggregation under the margin assumption for the excess Bayes risk.
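An aggregate with exponential weights, as mentioned in this abstract, can be sketched as follows. This is one common form of the idea and not necessarily the exact procedure analysed in the paper: each candidate classifier receives a weight that decays exponentially in its empirical hinge risk, and the aggregate is the resulting convex combination. The function name, the `temperature` parameter, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def exponential_weights_aggregate(preds, y, temperature=1.0):
    """preds: (M, n) predictions in [-1, 1]; y: (n,) labels in {-1, +1}."""
    n = y.shape[0]
    # Empirical hinge risk of each of the M candidate classifiers.
    hinge_risk = np.maximum(0.0, 1.0 - preds * y).mean(axis=1)
    logits = -n * hinge_risk / temperature
    logits -= logits.max()          # shift for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()        # convex weights: nonnegative, summing to 1
    return weights, weights @ preds # weights and the aggregated predictions

y = np.array([1, -1, 1, 1])
preds = np.array([
    [1, -1, 1, 1],     # agrees with every label
    [1, -1, 1, -1],    # one mistake
    [-1, 1, -1, -1],   # always wrong
], dtype=float)
weights, aggregate = exponential_weights_aggregate(preds, y)
print(weights, np.sign(aggregate))
```

On this toy data the classifier with the lowest hinge risk dominates the mixture, so the sign of the aggregate matches the labels.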
Learning Noisy Linear Classifiers via Adaptive and Selective Sampling
Cited by 10 (3 self)
We introduce efficient margin-based algorithms for selective sampling and filtering in binary classification tasks. Experiments on real-world textual data reveal that our algorithms perform significantly better than popular and similarly efficient competitors. Using the so-called Mammen-Tsybakov low noise condition to parametrize the instance distribution, and assuming linear label noise, we show bounds on the convergence rate to the Bayes risk of a weaker adaptive variant of our selective sampler. Our analysis reveals that, excluding logarithmic factors, the average risk of this adaptive sampler converges to the Bayes risk at rate $N^{-(1+\alpha)(2+\alpha)/(2(3+\alpha))}$, where $N$ denotes the number of queried labels and $\alpha > 0$ is the exponent in the low noise condition. For all $\alpha > \sqrt{3} - 1 \approx 0.73$ this convergence rate is asymptotically faster than the rate $N^{-(1+\alpha)/(2+\alpha)}$ achieved by the fully supervised version of the base selective sampler, which queries all labels. Moreover, for $\alpha \to \infty$ (hard margin condition) the gap between the semi- and fully-supervised rates becomes exponential.
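The threshold on α quoted in this abstract follows from comparing the two rate exponents directly. The short derivation below is a check added for the reader, not part of the abstract; it uses only α > 0, so every factor that is cancelled or cross-multiplied is positive.

```latex
\begin{align*}
\frac{(1+\alpha)(2+\alpha)}{2(3+\alpha)} > \frac{1+\alpha}{2+\alpha}
  &\iff (2+\alpha)^2 > 2(3+\alpha) \\
  &\iff \alpha^2 + 2\alpha - 2 > 0 \\
  &\iff \alpha > \sqrt{3} - 1 \approx 0.73 .
\end{align*}
```

A larger exponent means a faster decay in $N$, so the adaptive sampler's rate is the faster one exactly when $\alpha > \sqrt{3} - 1$.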
On the Theory of Learning with Privileged Information
2010
Cited by 9 (0 self)
In the Learning Using Privileged Information (LUPI) paradigm, along with the standard training data in the decision space, a teacher supplies a learner with the privileged information in the correcting space. The goal of the learner is to find a classifier with a low generalization error in the decision space. We consider an empirical risk minimization algorithm, called Privileged ERM, that takes into account the privileged information in order to find a good function in the decision space. We outline the conditions on the correcting space that, if satisfied, allow Privileged ERM to have a much faster learning rate in the decision space than regular empirical risk minimization.
Combining PAC-Bayesian and Generic Chaining Bounds
2007
Cited by 9 (1 self)
There exist many different generalization error bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. Our goal is, first, to underline the links between these bounds, and second, to combine the different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester (1998), which is interesting for randomized predictions, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand (see Talagrand, 1996), in a way that also takes into account the variance of the combined functions. We also show how this connects to Rademacher-based bounds.
Neyman-Pearson Classification, Convexity and Stochastic Constraints
Journal of Machine Learning Research
Cited by 7 (2 self)
Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that simultaneously satisfies the following two properties with high probability: (i) its ϕ-type I error is below a prespecified level and (ii) its ϕ-type II error is close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences for chance-constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error.
A primal-dual convergence analysis of boosting
2011
Cited by 7 (3 self)
Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing:
• Weak learnability aids the whole loss family: for any ε > 0, O(ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum;
• The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O(ln(1/ε));
• Arbitrary instances may be decomposed into the above two, granting rate O(1/ε), with a matching lower bound provided for the logistic loss.