| P. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. In Advances in Neural Information Processing Systems 9, pages 134--140, 1997. |
.... Keywords: Supervised Learning, Cross-Validation Estimate, Holdout Estimate, Sample Complexity, Error Estimation, PAC Learning. 1 Introduction. Recently, within probabilistic models of machine learning, attention has focused on the use of real-valued functions for binary classification (as in [4, 7, 8, 24], for instance). It has been shown that in many cases, one can obtain fairly accurate estimates of a classifier's error when the classifier is a real-valued function which achieves the correct classification of a training sample with a large margin (a notion to be made precise in what follows) ....
.... by f is simply $\operatorname{sgn}(f(x))$, where $\operatorname{sgn}(a) = 1$ if $a \ge 0$ and $\operatorname{sgn}(a) = -1$ if $a < 0$, and the error of f (with respect to P) is $\mathrm{er}_P(f) = P\{(x, y) \in X \times \{-1, 1\} : \operatorname{sgn}(f(x)) \ne y\}$. The use of real-valued functions for binary classification has been considered in [4, 7, 8, 24] within versions of the PAC model of learning and versions of agnostic PAC learning (see Kearns et al. [19] and Haussler [14]), and it has been shown that there are advantages in considering the values of the real function during training, rather than merely its sign. In particular, as suggested ....
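The error defined in the excerpt above is a probability under the unknown distribution P; on a finite labelled sample it can only be estimated. The following is a minimal sketch of that empirical estimate, assuming the sign convention sgn(a) = 1 for a >= 0 and -1 otherwise. The function names, the example function `f`, and the sample values are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence, Tuple


def sgn(a: float) -> int:
    """Assumed sign convention: +1 for a >= 0, -1 otherwise."""
    return 1 if a >= 0 else -1


def empirical_error(
    f: Callable[[float], float],
    sample: Sequence[Tuple[float, int]],
) -> float:
    """Fraction of labelled points (x, y) for which sgn(f(x)) != y."""
    if not sample:
        raise ValueError("sample must be non-empty")
    mistakes = sum(1 for x, y in sample if sgn(f(x)) != y)
    return mistakes / len(sample)


# Hypothetical real-valued function and labelled sample with y in {-1, +1}.
f = lambda x: 2.0 * x - 1.0
sample = [(0.0, -1), (0.25, -1), (0.5, 1), (1.0, 1), (0.75, -1)]

error = empirical_error(f, sample)
print(error)
```

Only the sign of f(x) enters this estimate; the magnitude of f(x) is ignored.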
P. Bartlett. For valid generalisation, the size of the weights is more important than the size of the network. In Advances in Neural Information Processing Systems, 9. Morgan Kaufmann, 1996.
.... 8.5 Proof of the main theorems. 9 Conclusions and further work. 1 Introduction. Recently, within probabilistic models of machine learning, attention has focused on the use of real-valued functions for binary classification (as in [22, 8, 9, 4], for instance). It has been shown that in many cases, one can obtain fairly accurate estimates of a classifier's error when the classifier is a real-valued function which achieves the correct classification of a training sample with a large margin (a notion to be made precise in what follows) ....
.... the resulting binary classification of x by f is simply $\operatorname{sgn}(f(x))$, where $\operatorname{sgn}(a) = 1$ if $a \ge 0$ and $\operatorname{sgn}(a) = -1$ if $a < 0$, and the error of f (with respect to P) is $\mathrm{er}_P(f) = P\{(x, y) \in X \times \{-1, 1\} : \operatorname{sgn}(f(x)) \ne y\}$. The use of real-valued functions for binary classification has been considered in [22, 8, 9, 4] within versions of the PAC model of learning and versions of agnostic PAC learning (see Kearns et al. [17] and Haussler [14]), and it has been shown that there are advantages in considering the values of the real function during training, rather than merely its sign. In particular, as suggested ....
P. Bartlett. For valid generalisation, the size of the weights is more important than the size of the network. In Advances in Neural Information Processing Systems, 9. Morgan Kaufmann, 1996.
....is dependent on the probability distribution and on the classifier considered. Vapnik has developed a geometric definition of the margin. He considers hyperplane separators ($y = wx + b$) and defines the margin as the distance of an element $(x_0, y_0) \in \mathbb{R}^n \times \{-1, 1\}$ to the hyperplane. Bartlett [2] considers real bounded functions. An element $(x, y) \in X \times \{-1, 1\}$ is correctly classified with a margin $\gamma$ by a real function $f$ if $y f(x) \ge \gamma$. Algorithms (SVM [7], boosting [4]) have been developed in order to reduce the upper bound on the generalization error. This kind of bounds ....
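The margin condition quoted above, y f(x) >= gamma, can be checked directly on a labelled sample. The sketch below counts the fraction of points that fail it. The helper names, the example function `f`, and the sample are hypothetical and not taken from any of the cited papers.

```python
from typing import Callable, Sequence, Tuple


def has_margin(f: Callable[[float], float], x: float, y: int, gamma: float) -> bool:
    """True if (x, y) is correctly classified by f with margin gamma: y * f(x) >= gamma."""
    return y * f(x) >= gamma


def margin_error(
    f: Callable[[float], float],
    sample: Sequence[Tuple[float, int]],
    gamma: float,
) -> float:
    """Fraction of sample points NOT classified correctly with margin gamma."""
    if not sample:
        raise ValueError("sample must be non-empty")
    failures = sum(1 for x, y in sample if not has_margin(f, x, y, gamma))
    return failures / len(sample)


# Hypothetical real-valued function and labelled sample with y in {-1, +1}.
f = lambda x: 2.0 * x - 1.0
sample = [(0.0, -1), (0.25, -1), (0.5, 1), (1.0, 1), (0.75, -1)]

err_half = margin_error(f, sample, gamma=0.5)
err_one = margin_error(f, sample, gamma=1.0)
print(err_half, err_one)
```

Raising gamma can only increase this fraction, since every point that meets a larger margin also meets a smaller one.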
P.L. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. In Neural Information Processing Systems 9, 1997.
....and with any classification method. Furthermore, our algorithm was developed in a theoretical frame so that its generalization performance can be expected by using theoretical results. However, these results needed very large training sample size to be practical. By adapting the Bartlett results [1], we plan to improve our bounds. It would also be interesting to extend these results to the multiclass case. The tests we performed on classical benchmarks are encouraging. It would be useful to consider even more datasets and to compare our algorithm with other methods. ....
P.L. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. In Neural Information Processing Systems 9, 1997.
....is induced by the training data, and can be considerably smaller than the size of the tree. These results build on recent theoretical results that give misclassification probability bounds for thresholded real-valued functions, including support vector machines, sigmoid networks, and boosting (see [1, 8, 9]), that do not depend on the size of the classifier. We extend these results to decision trees by considering a decision tree as a thresholded convex combination of the leaf functions (the boolean functions that specify, for a given leaf, which patterns reach that leaf). We can then apply the ....
P.L. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. In Neural Information Processing Systems 9, pages 134-140. Morgan Kaufmann, San Mateo, CA, 1997.
....to lists or trees with nodes in a finite alphabet [19]. In general one could restrict the absolute values of the weights and inputs and consider the fat shattering dimension instead of the pseudodimension. This turns out to be useful when dealing with the SVM or ensembles of networks, for example [4, 13, 27]. Unfortunately, even if the activation function coincides with the identity function, a lower bound Omega Gammaun ln t) can be found for the fat shattering dimension and restricted weights and inputs [15]. For the sigmoidal activation a lower bound Omega Gamma t) can be found for the fat ....
P. L. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2), 1998.
....of voting methods, applicable, for instance, to bagging, boosting, arcing [5] and ECOC [7]. We prove rigorous upper bounds on the generalization error of such methods in terms of a measure of performance of the combined hypothesis on the training set. A similar result was presented by Bartlett [1] in a different context. Our bounds also depend on the number of training examples and the complexity of the base hypotheses, but do not depend explicitly on the number of base hypotheses. Besides explaining the mysterious shape of the observed learning curves, our analysis may be helpful in ....
....classifiers [6] and with Boser and Guyon [3] on optimal margin classifiers. In Section 6, we discuss the relation between our work and Vapnik's in greater detail. Shawe-Taylor et al. [17] gave bounds on the generalization error of these classifiers in terms of the margins, and Bartlett [1] used related techniques to give a similar bound for neural networks with small weights. Since voting classifiers are a special case of these neural networks, an immediate consequence of Bartlett's result is a bound on the generalization error of a voting classifier in terms of the fraction of ....
Peter L. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. In Advances in Neural Information Processing Systems 9, 1997.
....is induced by the training data, and can be considerably smaller than the size of the tree. These results build on recent theoretical results that give misclassification probability bounds for thresholded real-valued functions, including support vector machines, sigmoid networks, and boosting (see [1, 8, 9]), that do not depend on the size of the classifier. We extend these results to decision trees by considering a decision tree as a thresholded convex combination of the leaf functions (the boolean functions that specify, for a given leaf, which patterns reach that leaf). We can then apply the ....
P. L. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. In Neural Information Processing Systems 9, pages 134--140. Morgan Kaufmann, San Mateo, CA, 1997.
....even increases the variance while reducing the overall generalization error. In this paper, we present an alternative theoretical analysis of voting methods, applicable, for instance, to bagging, boosting, arcing [5] and ECOC [7]. Our approach is based on a similar result presented by Bartlett [1] in a different context. We prove rigorous upper bounds on the generalization error of voting methods in terms of a measure of performance of the combined hypothesis on the training set. Our bounds also depend on the number of training examples and the complexity of the base hypotheses, but do ....
....classifiers [6] and with Boser and Guyon [3] on optimal margin classifiers. In Section 6, we discuss the relation between our work and Vapnik's in greater detail. Shawe-Taylor et al. [16] gave bounds on the generalization error of these classifiers in terms of the margins, and Bartlett [1] used related techniques to give a similar bound for neural networks with small weights. Since voting classifiers are a special case of these neural networks, an immediate consequence of Bartlett's result is a bound on the generalization error of a voting classifier in terms of the fraction of ....
Peter L. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. In Advances in Neural Information Processing Systems 9, 1997.
....is unbounded. The result gives theoretical support for the use of weight decay and early stopping (see, for example, [21]), two heuristic techniques that encourage gradient descent algorithms to produce networks with small weights. Some of the results in this paper were presented at NIPS 96 [5]. 1.1 Outline of the paper. The next section gives estimates of the misclassification probability in terms of the proportion of distinctly correct examples and the fat shattering dimension. Section 3 gives some extensions to this result. Results in that section show that it is not necessary to ....
P. L. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. To appear in Neural Information Processing Systems 9, 1997.
P. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. In Advances in Neural Information Processing Systems 9, pages 134--140, 1997.
P. L. Bartlett, "For Valid Generalization, the Size of the Weights is More Important Than the Size of the Network," in Advances in Neural Information Processing Systems 9, M.C. Mozer, M.I. Jordan, and T. Petsche (eds.), Cambridge, MA: The MIT Press, pp. 134-140, 1997.