Results 11–20 of 38
VC Theory of Large Margin Multi-Category Classifiers
"... In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binaryvalued functions, the computation of dichotomies with realvalued functions, and the computation of polytomies with functions taking ..."
Abstract

Cited by 13 (4 self)
In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vector-valued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a naïve extension of the results devoted to the computation of dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψ-dimensions thanks to generalizations of Sauer’s lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multi-class SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines.
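The margin notion for polytomies can be made concrete: for a vector-valued function f and labeled example (x, y), the margin is f_y(x) − max_{k≠y} f_k(x). A minimal sketch (the function name and scores are illustrative, not taken from the paper):

```python
import numpy as np

def multiclass_margin(scores: np.ndarray, y: int) -> float:
    """Margin of a vector-valued classifier on one example:
    score of the true category minus the best competing score.
    A positive margin means a correct, confident prediction."""
    competing = np.delete(scores, y)
    return float(scores[y] - competing.max())

# Three-category scores f(x) = (f_0(x), f_1(x), f_2(x))
scores = np.array([2.0, 0.5, -1.0])
print(multiclass_margin(scores, y=0))  # 1.5  (correct with margin 1.5)
print(multiclass_margin(scores, y=2))  # -3.0 (misclassified)
```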
LEARNING KERNEL-BASED HALFSPACES WITH THE 0-1 LOSS
, 2011
"... We describe and analyze a new algorithm for agnostically learning kernelbased halfspaces with respect to the 01 loss function. Unlike most of the previous formulations, which rely on surrogate convex loss functions (e.g., hingeloss in support vector machines (SVMs) and logloss in logistic regr ..."
Abstract

Cited by 12 (3 self)
We describe and analyze a new algorithm for agnostically learning kernel-based halfspaces with respect to the 0-1 loss function. Unlike most of the previous formulations, which rely on surrogate convex loss functions (e.g., hinge-loss in support vector machines (SVMs) and log-loss in logistic regression), we provide finite time/sample guarantees with respect to the more natural 0-1 loss function. The proposed algorithm can learn kernel-based halfspaces in worst-case time poly(exp(L log(L/ɛ))), for any distribution, where L is a Lipschitz constant (which can be thought of as the reciprocal of the margin), and the learned classifier is worse than the optimal halfspace by at most ɛ. We also prove a hardness result, showing that under a certain cryptographic assumption, no algorithm can learn kernel-based halfspaces in time polynomial in L.
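The gap between the 0-1 loss and the convex surrogates named above can be sketched directly; here the margins mean y·f(x), and the sample values are illustrative:

```python
import numpy as np

def zero_one_loss(margins: np.ndarray) -> float:
    """0-1 loss: fraction of sign errors; non-convex in the predictor."""
    return float(np.mean(margins <= 0))

def hinge_loss(margins: np.ndarray) -> float:
    """Convex surrogate used by SVMs; upper-bounds the 0-1 loss."""
    return float(np.mean(np.maximum(0.0, 1.0 - margins)))

# Margins y_i * f(x_i) for a few examples
m = np.array([2.0, 0.3, -0.5])
print(zero_one_loss(m))  # 1/3 of examples misclassified
print(hinge_loss(m))     # (0 + 0.7 + 1.5) / 3
```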
A Better Variance Control for PAC-Bayesian Classification
 LABORATOIRE DE PROBABILITÉS ET MODÈLES ALÉATOIRES, UNIVERSITÉS PARIS 6 AND PARIS 7
, 2004
"... The common method to understand and improve classification rules is to prove bounds on the generalization error. Here we provide localized databased PACbounds for the di#erence between the risk of any two randomized estimators. We derive from these bounds two types of algorithms: the first one use ..."
Abstract

Cited by 12 (6 self)
The common method to understand and improve classification rules is to prove bounds on the generalization error. Here we provide localized data-based PAC-bounds for the difference between the risk of any two randomized estimators. We derive from these bounds two types of algorithms: the first one uses combinatorial techniques and is related to compression schemes, whereas the second one involves Gibbs estimators. We also recover some of the results of the Vapnik-Chervonenkis theory and improve them by taking into account the variance term measured by the pseudo-distance (f1, f2) ↦ E[(f1(X) − f2(X))²]. Finally, we present different ways of localizing the results in order to improve the bounds and make them less dependent on the choice of the prior. For some classes of functions (such as VC-classes), this leads to gaining a logarithmic factor without using the chaining technique (see [1] for more details).
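A minimal empirical sketch of such a variance term, assuming the pseudo-distance takes the quadratic form E[(f1(X) − f2(X))²] (this form is an assumption here; see the paper for the exact definition):

```python
import numpy as np

def empirical_pseudo_distance(f1_vals, f2_vals) -> float:
    """Empirical analogue of the variance term E[(f1(X) - f2(X))^2],
    computed from the two classifiers' outputs on a sample."""
    d = np.asarray(f1_vals, dtype=float) - np.asarray(f2_vals, dtype=float)
    return float(np.mean(d ** 2))

# For {-1,+1}-valued classifiers this equals 4 * (disagreement rate)
f1 = np.array([1, 1, -1, 1])
f2 = np.array([1, -1, -1, -1])
print(empirical_pseudo_distance(f1, f2))  # 2.0 = 4 * (2/4)
```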
Tight Sample Complexity of Large-Margin Learning
"... We obtain a tight distributionspecific characterization of the sample complexity of largemargin classification with L2 regularization: We introduce the γadapteddimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distributionspecific upper and lo ..."
Abstract

Cited by 11 (5 self)
We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: we introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions.
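As a rough illustration, a spectral quantity of this flavor can be computed from covariance eigenvalues. The cutoff rule below (smallest k whose spectral tail is at most γ²k) is an assumption for illustration only, not the paper's exact definition:

```python
import numpy as np

def gamma_adapted_dimension(eigenvalues, gamma: float) -> int:
    """Smallest k such that the tail eigenvalue mass (i > k) is at most
    gamma^2 * k.  NOTE: this spectral form is an illustrative assumption;
    consult the paper for the precise definition."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    for k in range(1, len(lam) + 1):
        if lam[k:].sum() <= gamma ** 2 * k:
            return k
    return len(lam)

spectrum = np.array([4.0, 1.0, 0.25, 0.25, 0.1])
print(gamma_adapted_dimension(spectrum, gamma=1.0))  # 2
```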
Local Complexities for Empirical Risk Minimization
 In Proceedings of the 17th Annual Conference on Learning Theory (COLT
, 2004
"... Abstract. We present sharp bounds on the risk of the empirical minimization algorithm under mild assumptions on the class. We introduce the notion of isomorphic coordinate projections and show that this leads to a sharper error bound than the best previously known. The quantity which governs this bo ..."
Abstract

Cited by 9 (2 self)
Abstract. We present sharp bounds on the risk of the empirical minimization algorithm under mild assumptions on the class. We introduce the notion of isomorphic coordinate projections and show that this leads to a sharper error bound than the best previously known. The quantity which governs this bound on the empirical minimizer is the largest fixed point of the function ξn(r) = E sup{|Ef − Enf| : f ∈ F, Ef = r}. We prove that this is the best estimate one can obtain using “structural results”, and that it is possible to estimate the error rate from data. We then prove that the bound on the empirical minimization algorithm can be improved further by a direct analysis, and that the correct error rate is the maximizer of ξ′n(r) − r, where ξ′n(r) = E sup{Ef − Enf : f ∈ F, Ef = r}.
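Locating the largest fixed point of such a function can be sketched numerically. The toy ξ below stands in for the empirical process in the abstract, and the code assumes ξ(r) − r changes sign exactly once from positive to negative:

```python
import math

def largest_fixed_point(xi, lo=0.0, hi=1.0, tol=1e-8) -> float:
    """Largest r in [lo, hi] with xi(r) = r, found by bisection.
    Assumes xi is continuous and xi(r) - r changes sign once, + to -."""
    if xi(hi) >= hi:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if xi(mid) >= mid:   # still above the diagonal: move right
            lo = mid
        else:                # below the diagonal: move left
            hi = mid
    return 0.5 * (lo + hi)

# Toy example: xi(r) = sqrt(r)/2 has fixed points 0 and 1/4
r_star = largest_fixed_point(lambda r: math.sqrt(r) / 2)
print(round(r_star, 6))  # 0.25
```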
Learning kernel-based halfspaces with the zero-one loss
"... We describe and analyze a new algorithm for agnostically learning kernelbased halfspaces with respect to the zeroone loss function. Unlike most previous formulations which rely on surrogate convex loss functions (e.g. hingeloss in SVM and logloss in logistic regression), we provide finite time/s ..."
Abstract

Cited by 9 (2 self)
We describe and analyze a new algorithm for agnostically learning kernel-based halfspaces with respect to the zero-one loss function. Unlike most previous formulations, which rely on surrogate convex loss functions (e.g., hinge-loss in SVM and log-loss in logistic regression), we provide finite time/sample guarantees with respect to the more natural zero-one loss function. The proposed algorithm can learn kernel-based halfspaces in worst-case time poly(exp(L log(L/ɛ))), for any distribution, where L is a Lipschitz constant (which can be thought of as the reciprocal of the margin), and the learned classifier is worse than the optimal halfspace by at most ɛ. We also prove a hardness result, showing that under a certain cryptographic assumption, no algorithm can learn kernel-based halfspaces in time polynomial in L.
Large margin multi-category discriminant models and scale-sensitive Ψ-dimensions
, 2006
"... ..."
Data-dependent generalization error bounds for (noisy) classification: a PAC-Bayesian approach
, 2004
"... The common method to understand and improve classification rules is to prove bounds on the generalization error. Here we provide localized databased PACbounds for the difference between the risk of any two randomized estimators. We derive from these bounds two types of algorithms: the first one us ..."
Abstract

Cited by 7 (4 self)
The common method to understand and improve classification rules is to prove bounds on the generalization error. Here we provide localized data-based PAC-bounds for the difference between the risk of any two randomized estimators. We derive from these bounds two types of algorithms: the first one uses combinatorial techniques and is related to compression schemes, whereas the second one involves Gibbs estimators. We also recover some of the results of the Vapnik-Chervonenkis theory and improve them by taking into account the variance term measured by the pseudo-distance (f1, f2) ↦ E[(f1(X) − f2(X))²]. Finally, we present different ways of localizing the results in order to improve the bounds and make them less dependent on the choice of the prior. For some classes of functions (such as VC-classes), this leads to gaining a log N factor without using the chaining technique (see [1] for more details).
Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
"... Given α, ɛ, we study the time complexity required to improperly learn a halfspace with misclassification error rate of at most (1 + α) L ∗ γ + ɛ, where L ∗ γ is the optimal γmargin error rate. For α = 1/γ, polynomial time and sample complexity is achievable using the hingeloss. For α = 0, ShalevS ..."
Abstract

Cited by 6 (2 self)
Given α, ɛ, we study the time complexity required to improperly learn a halfspace with misclassification error rate of at most (1 + α)L∗γ + ɛ, where L∗γ is the optimal γ-margin error rate. For α = 1/γ, polynomial time and sample complexity is achievable using the hinge-loss. For α = 0, Shalev-Shwartz et al. [2011] showed that poly(1/γ) time is impossible, while learning is possible in time exp(Õ(1/γ)). An immediate question, which this paper tackles, is what is achievable if α ∈ (0, 1/γ). We derive positive results interpolating between the polynomial time for α = 1/γ and the exponential time for α = 0. In particular, we show that there are cases in which α = o(1/γ) but the problem is still solvable in polynomial time. Our results naturally extend to the adversarial online learning model and to the PAC learning with malicious noise model.
Finite dimensional projection for classification and statistical learning, in
 IEEE Transactions on Information Theory
, 2008
"... A new method for the binary classification problem is studied. It relies on empirical minimization of the hinge loss over an increasing sequence of finitedimensional spaces. A suitable dimension is picked by minimizing the regularized loss, where the regularization term is proportional to the dimen ..."
Abstract

Cited by 4 (0 self)
A new method for the binary classification problem is studied. It relies on empirical minimization of the hinge loss over an increasing sequence of finite-dimensional spaces. A suitable dimension is picked by minimizing the regularized loss, where the regularization term is proportional to the dimension. An oracle-type inequality is established, which ensures adequate convergence properties of the method. We suggest selecting the considered sequence of subspaces by applying kernel principal components analysis. In this case the asymptotic convergence rate of the method can be better than what is known for the Support Vector Machine. Exemplary experiments are presented on benchmark datasets where the practical results of the method are comparable to the SVM. 1 Introduction. 1.1 The classification framework. In this paper, we consider the framework of supervised binary classification. Let (X, Y) denote a random variable with values in X × {−1, +1} and probability distribution P. The marginal
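The dimension-selection step can be sketched as follows; the empirical hinge-loss values and the penalty constant are illustrative placeholders, not from the paper:

```python
def select_dimension(empirical_losses, penalty: float) -> int:
    """Given the empirical hinge loss of the minimizer in each nested
    space F_1 ⊂ F_2 ⊂ ..., return the dimension d that minimizes the
    regularized criterion loss_d + penalty * d."""
    scores = [loss + penalty * (d + 1)
              for d, loss in enumerate(empirical_losses)]
    return scores.index(min(scores)) + 1

# Losses typically decrease with dimension; the penalty curbs overfitting
losses = [0.40, 0.25, 0.20, 0.19, 0.185]
print(select_dimension(losses, penalty=0.03))  # 3
```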