
## On the Generalisation of Soft Margin Algorithms (2000)

Venue: IEEE Transactions on Information Theory

Citations: 12 (5 self)

### Citations

3596 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context ...which contains the support of the input probability distribution. This bound directly motivates the optimisation of the 2-norm of the slack variables originally proposed for SVMs by Cortes and Vapnik [11] (see Section VI for details). The results are generalised to non-linear function classes using a characterisation of their capacity at scale γ, known as the fat-shattering dimension fat(γ). In this case...

1827 | A training algorithm for optimal margin classifiers
- Boser, Guyon, et al.
- 1992
Citation Context ...practice have pointed to the concept of the margin of a classifier as being central to the success of a new generation of learning algorithms. This is explicitly true of Support Vector Machines (SVMs) [9], [12], which in their simplest form implement maximal margin hyperplanes in a high dimensional feature space, but has also been shown to be the case for boosting algorithms such as Adaboost [24]. Inc...

1437 | Pattern recognition and neural networks
- Ripley
- 1996
Citation Context ...o applying the hard margin algorithm after adding λI to the covariance matrix. This technique is well known in classical statistics, where it is sometimes called the "shrinkage method" (see Ripley [22]). In the context of regression with squared loss it is better known as Ridge Regression (see [23] for an exposition of dual Ridge Regression), and in this case leads to a form of weight decay. It is ...

1297 | Solutions of Ill-Posed Problems
- Tikhonov, Arsenin
- 1977
Citation Context ... is better known as Ridge Regression (see [23] for an exposition of dual Ridge Regression), and in this case leads to a form of weight decay. It is a regularization technique in the sense of Tikhonov [28]. Another way to describe it is that it reduces the number of effective free parameters, as measured by the trace of K. Note finally that from an algorithmic point of view these kernels still give a pos...

1270 | An Introduction to Support Vector Machines
- Cristianini, Shawe-Taylor
- 2000
Citation Context ...s depending on the margin of a classifier are a relatively recent development. They provide an explanation of the performance of state-of-the-art learning systems such as Support Vector Machines (SVM) [12] and Adaboost [24]. The difficulty with these bounds has been either their lack of robustness or their looseness. The question of whether the generalisation of a classifier can be more tightly bounded in ...

952 | Estimation of Dependences Based on Empirical Data - Vapnik - 1982 |

884 | Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics - Schapire, Freund, et al. - 1998 |

512 | Large margin classification using the perceptron algorithm - Freund, Schapire - 1999 |

511 | Knowledge-based analysis of microarray gene expression data by using support vector machines
- Brown, Grundy, et al.
- 2000
Citation Context ...g algorithm developed for linear classifiers. Though the algorithm is not new, the analysis has already given further insights for SVMs that have been used to tune their application to Microarray data [10]. The analysis has also placed the optimisation of the quadratic loss used in the back-propagation algorithm on a firm footing, though in this case no polynomial time algorithm is known. The paper has, ...

276 | Structural risk minimization over data-dependent hierarchies
- Shawe-Taylor, Bartlett, et al.
- 1998
Citation Context ...s also been shown to be the case for boosting algorithms such as Adaboost [24]. Increasing the margin has been shown to implement a capacity control through data-dependent structural risk minimisation [25], hence overcoming the apparent difficulties of using high dimensional feature spaces. In the case of SVMs a further computational simplification is derived by never explicitly computing the feature vector...

238 | Scale-sensitive dimensions, uniform convergence, and learnability
- Alon, Ben-David, et al.
- 1997
Citation Context ...ill consider the covers to be chosen from the set of all functions with the same domain as F and range the reals. We now quote a lemma from [25] which follows immediately from a result of Alon et al. [1]. Corollary VII.2: [25] Let F be a class of functions X → [a, b] and P a distribution over X. Choose 0 < γ ≤ 1 and let d = fat_F(γ/4). Then sup_{x∈X^m} N(γ, F, x) ≤ 2 (4m(b−a)²/γ²)^{d log₂(2em(b−a)/(...

237 | Robust linear programming discrimination of two linearlyinseparable sets. Optimization Methods and Software
- Bennett, Mangasarian
- 1992
Citation Context ...rmulation is related to the kernel parameter λ, namely C = 1/(2λ). October 30, 2001 DRAFT 15 Note that this approach to handling non-separability goes back to Smith [27], with Bennett and Mangasarian [6] giving essentially the same formulation as Cortes and Vapnik [11], but with a different optimisation of the function class. The expression also shows how moving to the soft margin ensures separability...
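The kernel modification this context refers to can be sketched numerically. Assuming a primal objective of the form ½‖w‖² + C Σᵢ ξᵢ² (an assumption on my part; the exact constant in the extract above is garbled), the 2-norm soft margin dual equals the hard margin dual with the Gram matrix K replaced by K + (1/(2C))I:

```python
import numpy as np

# Sketch of the 2-norm soft margin as a kernel modification: only the
# diagonal of the Gram matrix changes. The constant 1/(2C) assumes the
# primal objective (1/2)||w||^2 + C * sum(xi_i^2); the paper's exact
# constant cannot be recovered from the garbled extract.

def soft_margin_gram(K, C):
    """Gram matrix of the equivalent hard margin problem."""
    return K + np.eye(K.shape[0]) / (2.0 * C)

K = np.array([[1.0, 0.5], [0.5, 1.0]])
K_soft = soft_margin_gram(K, C=2.0)
# Each diagonal entry gains 1/(2C) = 0.25; off-diagonals are untouched.
```

Training a hard margin SVM on `K_soft` then yields the 2-norm soft margin solution for the original kernel, which is the equivalence the context describes.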

235 | Support vector regression machines
- Drucker, Burges, et al.
- 1997
Citation Context ... quantity is the amount by which f exceeds the error margin γ on the point (x, y) or 0 if f is within γ of the target value. Hence, this is the ε-insensitive loss measure considered by Drucker et al. [13] with ε = γ. Let g_f ∈ L_f(X) be the function g_f = Σ_{(x,y)∈S} ((x, y), f, γ) x. 1 We are grateful to an anonymous referee for pointing out this natural generalisation. Pro...
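The loss this context describes is a one-liner: the amount by which the prediction misses the target beyond a tolerance, else zero. A minimal sketch; the names `f_x`, `y` and `eps` are illustrative, not the paper's notation:

```python
def eps_insensitive_loss(f_x, y, eps):
    """Amount by which |f_x - y| exceeds the error margin eps, or 0 if the
    prediction is within eps of the target. This is the epsilon-insensitive
    regression loss attributed to Drucker et al. [13] in the context above."""
    return max(0.0, abs(f_x - y) - eps)
```

Predictions inside the tube of width `eps` around the target incur no loss at all, which is what makes the measure robust to small deviations.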

207 | The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network - Bartlett - 1998 |

161 | Ridge regression learning algorithm in dual variables
- Saunders, Gammerman, et al.
- 1998
Citation Context ...s well known in classical statistics, where it is sometimes called the "shrinkage method" (see Ripley [22]). In the context of regression with squared loss it is better known as Ridge Regression (see [23] for an exposition of dual Ridge Regression), and in this case leads to a form of weight decay. It is a regularization technique in the sense of Tikhonov [28]. Another way to describe it is that it r...
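Dual Ridge Regression, mentioned in the context above, works entirely with the Gram matrix: fit by solving (K + λI)α = y, predict with f(x) = Σᵢ αᵢ k(xᵢ, x). A minimal sketch with a linear kernel; the variable names are illustrative rather than the paper's notation:

```python
import numpy as np

# Dual Ridge Regression sketch: regularisation is the ridge lam * I
# added to the Gram matrix, the "shrinkage" described in the context.

def dual_ridge_fit(K, y, lam):
    """Solve (K + lam * I) alpha = y for the dual coefficients."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def dual_ridge_predict(alpha, k_vec):
    """k_vec[i] = k(x_i, x) for the test point x."""
    return alpha @ k_vec

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 2.0])      # noiseless data on the line y = x
K = X @ X.T                         # linear kernel Gram matrix
alpha = dual_ridge_fit(K, y, lam=1e-6)
pred = dual_ridge_predict(alpha, (X @ np.array([1.5])).ravel())
# With a tiny ridge, pred recovers y = x at the test point 1.5.
```

Larger `lam` shrinks the solution toward zero, trading training fit for stability, which is exactly the weight-decay behaviour the context mentions.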

142 | Generalization Performance of Support Vector Machines and other Pattern Classifiers
- Bartlett
- 1999
Citation Context ...btained by Shawe-Taylor et al. [25]. Gurvits [16] generalised this to infinite dimensional Banach spaces. We will quote an improved version of this bound for inner product spaces which is contained in [3] (slightly adapted here for an arbitrary bound on the linear operators). Theorem III.6: [3] Consider an inner product space and the class of linear functions L of norm less than or equal to B restrict...

99 | Robust trainability of single neurons - Hoffgen, Simon, et al. - 1995 |

98 | Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics
- Schapire, Freund, et al.
- 1998
Citation Context ... margin of a classifier are a relatively recent development. They provide an explanation of the performance of state-of-the-art learning systems such as Support Vector Machines (SVM) [12] and Adaboost [24]. The difficulty with these bounds has been either their lack of robustness or their looseness. The question of whether the generalisation of a classifier can be more tightly bounded in terms of a robust ...

78 | Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators
- Williamson, Smola, et al.
- 2001
Citation Context ... as tight as the results on which they depend. There has been a significant tightening of the covering number bounds for linear classifiers taking into account the structure of the training data itself [31], [15], [26] and all of these results could be combined with the techniques described here to give equivalent soft margin bounds. Acknowledgements The research was supported in part by the European Co...

60 | Boosting algorithms as gradient descent in function space
- Mason, Baxter, et al.
- 1999
Citation Context ... develop a soft margin boosting algorithm [7]. Standard boosting has been shown to perform gradient descent in function space optimising the negative exponential of the margins of the training points [20]. The exponential function applies something close to a hard margin penalty to individual margin errors and hence can suffer from overfitting if the training data is noisy and difficult to separate with th...
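The exponential margin criterion referred to here is easy to illustrate: with loss Σᵢ exp(−yᵢ f(xᵢ)), a single badly misclassified point dominates the objective, which is the near-hard-margin behaviour the context describes. A sketch with made-up margin values:

```python
import numpy as np

# Boosting's functional-gradient-descent objective as described in the
# context: the sum of negative exponentials of the margins y_i * f(x_i).
# The margin values below are invented for illustration.

def exp_margin_loss(margins):
    """margins[i] = y_i * f(x_i); loss = sum_i exp(-margins[i])."""
    return np.exp(-np.asarray(margins)).sum()

clean = exp_margin_loss([2.0, 2.0, 2.0])     # all points well classified
noisy = exp_margin_loss([2.0, 2.0, -2.0])    # one badly misclassified point
# The one negative margin contributes exp(2) ~ 7.39, swamping the rest.
```

This is why, as the context notes, the exponential loss can overfit noisy data: the optimiser concentrates on forcing a few hard points to have positive margin.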

35 | Robust trainability of single neurons
- Hoffgen, Simon, et al.
Citation Context ...em with this approach is that there are no efficient algorithms for even obtaining a fixed ratio between the number of misclassified training points and the true minimum for linear classifiers unless P = NP [18], [2]. Hence, in SVM practice, so-called soft margin versions of the algorithms are used, that attempt to achieve a (heuristic) compromise between large margin and accuracy. The question whether it is...

34 | A column generation algorithm for boosting
- Bennett, Demiriz, et al.
- 2000
Citation Context ... prediction, with a statistical analysis of Ridge Regression and Gaussian Processes as a special case. The analysis presented in the paper has also led to new boosting algorithms described elsewhere [7]. Keywords Margin, Generalisation, Soft margin, pac learning, statistical learning, support vector machines, ridge regression, neural networks I. Introduction Both theory and ...

33 | A note on a scale-sensitive dimension of linear bounded functionals in Banach spaces
- Gurvits
Citation Context ...idering bounds on the fat-shattering dimension. The first bound on the fat-shattering dimension of bounded linear functions in a finite dimensional space was obtained by Shawe-Taylor et al. [25]. Gurvits [16] generalised this to infinite dimensional Banach spaces. We will quote an improved version of this bound for inner product spaces which is contained in [3] (slightly adapted here for an arbitrary bound...

33 | Robust ensemble learning
- Rätsch, Schölkopf, et al.
- 2000
Citation Context ... errors and hence can suffer from overfitting if the training data is noisy and difficult to separate with the available weak learners. Some heuristic algorithms have been derived for soft margin boosting [21], but Bennett et al. [8] show how optimising a soft margin bound derived using the techniques of this paper reduces to solving a linear programme via column generation techni...

29 | The hardness of approximate optima in lattices, codes and linear equations
- Arora, Babai, et al.
- 1993
Citation Context ...h this approach is that there are no efficient algorithms for even obtaining a fixed ratio between the number of misclassified training points and the true minimum for linear classifiers unless P = NP [18], [2]. Hence, in SVM practice, so-called soft margin versions of the algorithms are used, that attempt to achieve a (heuristic) compromise between large margin and accuracy. The question whether it is poss...

27 | Pattern Classifier Design by Linear Programming - Smith - 1968 |

24 | Enlarging the margins in perceptron decision trees
- Bennett, Cristianini, et al.
- 2000
Citation Context ...‖ξ‖₂²/γ²)/|S|). The fat-shattering dimension has been estimated for many function classes including single hidden layer neural networks [17], general neural networks [4] and perceptron decision trees [5]. An important feature of the fat-shattering dimension for these classes is that it does not depend on the number of parameters (for example weights in a neural network), but rather on their sizes. Th...

24 | Approximation and learning of convex superpositions
- Gurvits, Koiran
- 1997
Citation Context ...ned has the form (see Theorem VII.11) Õ((fat(γ/16) + ‖ξ‖₂²/γ²)/|S|). The fat-shattering dimension has been estimated for many function classes including single hidden layer neural networks [17], general neural networks [4] and perceptron decision trees [5]. An important feature of the fat-shattering dimension for these classes is that it does not depend on the number of parameters (for exam...
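The form of the bound quoted above, Õ((fat(γ/16) + ‖ξ‖₂²/γ²)/|S|), is simple arithmetic once the suppressed log factors are ignored; the inputs below are invented purely for illustration:

```python
# Numeric illustration of the soft margin bound's shape (log factors
# dropped): larger sample |S| tightens it, a larger 2-norm of the margin
# slack vector xi loosens it. All values here are made up.

def soft_margin_bound_form(fat_dim, slack_norm_sq, gamma, m):
    """(fat(gamma/16) + ||xi||_2^2 / gamma^2) / |S|, without log factors."""
    return (fat_dim + slack_norm_sq / gamma ** 2) / m

small_slack = soft_margin_bound_form(50, 1.0, 0.5, 1000)
large_slack = soft_margin_bound_form(50, 25.0, 0.5, 1000)
# A larger slack 2-norm yields a weaker (larger) bound at the same margin.
```

This makes visible the trade-off the paper analyses: the margin γ can be kept large while the slack term absorbs the points that violate it.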

22 | Covering numbers for support vector machines
- Guo, Bartlett, et al.
- 2002
Citation Context ...ght as the results on which they depend. There has been a significant tightening of the covering number bounds for linear classifiers taking into account the structure of the training data itself [31], [15], [26] and all of these results could be combined with the techniques described here to give equivalent soft margin bounds. Acknowledgements The research was supported in part by the European Commissi...

21 | The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network
- Bartlett
- 1998
Citation Context ...). This raises concern that they are very brittle in the sense that a single training point can have a very significant influence on the bound, possibly rendering the training set inseparable. Bartlett [4] extended the analysis to the case where a number of points closest to the boundary are treated as errors and the minimal margin of the remaining points is used. The bound obtained has the disadvantag...

21 | Large Margin Classification Using the Perceptron Algorithm
- Freund, Schapire
- 1999
Citation Context ...rucial use of a special loss function, that is equivalent to the slack variables used in optimization theory and is related to the hinge loss. Our analysis was motivated by work of Freund and Schapire [14], though their technique was originally introduced by Klasner and Simon [19]. Furthermore, for neural networks the criterion derived corresponds exactly to that optimised by the back-propagation algor...

18 | From Noise-Free to NoiseTolerant and from On-line to Batch
- Klasner, Simon
- 1995
Citation Context ...les used in optimization theory and is related to the hinge loss. Our analysis was motivated by work of Freund and Schapire [14], though their technique was originally introduced by Klasner and Simon [19]. Furthermore, for neural networks the criterion derived corresponds exactly to that optimised by the back-propagation algorithm using weight decay, further clarifying why this algorithm appears to gen...

18 | Generalization performance of classifiers in terms of observed covering numbers - Shawe-Taylor, Williamson - 1999 |

11 | Uniform convergence of frequencies of occurrence of events to their probabilities, Dokl.
- Vapnik, Chervonenkis
Citation Context ...ss F of real-valued functions the class sign(F) is the set of derived classification functions. We first consider classical learning analysis which has been shown to be characterised by the VC dimension [30]. Definition III.1: Let H be a set of binary valued functions. We say that a set of points X is shattered by H if for all binary vectors b indexed by X, there is a function f_b ∈ H realising b on X. T...

5 | Generalization performance of classifiers in terms of observed covering numbers
- Shawe-Taylor, Williamson
- 1999
Citation Context ...evaluation map x̃_F : F → ℝ^m, defined by x̃_F : f ↦ (f(x₁), …, f(x_m)), is a compact subset of ℝ^m. Note that this definition differs slightly from that introduced in [26]. The current definition is more general, but at the same time simplifies the proof of the required properties. Lemma VII.4: Let F be a sturdy class of functions. Then for each N ∈ ℕ and any fixed sequenc...

2 | Pattern classifier design by linear programming
- Smith
- 1968
Citation Context ...he trade-off parameter C in their formulation is related to the kernel parameter λ, namely C = 1/(2λ). Note that this approach to handling non-separability goes back to Smith [27], with Bennett and Mangasarian [6] giving essentially the same formulation as Cortes and Vapnik [11], but with a different optimisation of the function class. The expression also shows how moving to the soft margin ensures separability...