Shawe-Taylor, J., Bartlett, P., Williamson, R., and Anthony, M. (1996). Structural risk minimization over data-dependent hierarchies. Technical Report NC-TR-96-053, NeuroCOLT.
....have been made to define the penalties in a data-dependent way to achieve this goal; see, for example, Bartlett, Boucheron, and Lugosi [2], Koltchinskii [11], Koltchinskii and Panchenko [13], Lozano [15], Lugosi and Nobel [17], Massart [19], and Shawe-Taylor, Bartlett, Williamson, and Anthony [22]. For example, in [11] and [2] random complexity penalties based on Rademacher averages were proposed and investigated. Rademacher averages are defined as $\hat{R} F_k = E[\,\ldots\, \sigma_i I\{f(X_i) \neq Y_i\} \mid D_n]$ where $\sigma_1, \ldots, \sigma_n$ are i.i.d. symmetric $\{-1, 1\}$-valued random variables, independent ....
J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44:1926.
....allows us to treat normal vectors w which are not normalized, if the margin is normalized to 1. According to [15] this is called the canonical form of the separation hyperplane. The hyperplane with largest margin is then obtained by minimizing 2 for a margin which equals 1. It has been shown [14, 13, 12] that the generalization error of a linear classifier, eq. 1) can be bounded from above with probability 1 # by the bound B, a #, #) log 2 EN 2 a , 2L log 2 4 L a # # , 3) provided that the training classification error is zero and f(x) is bounded by a # ....
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44:1926.
....1 (8) Yi equals 1 ( 1) if document di is in class ( The constraints (8) require that all training examples are classified correctly. We can use the lemma from above to draw conclusions about the VCdim of the structure element that the separating hyperplane comes from. A bound similar to (2) [Shawe Taylor et al. 1996] gives us a bound on the true error of this hyperplane on our classification task. Since the optimization problem from above is difficult to handle numerically, Lagrange multipliers are used to translate the problem into an equivalent quadratic optimization problem [Vapnik, 1995] 1 (9) ....
Shawe-Taylor, J., Bartlett, P., Williamson, R., and Anthony, M. (1996). Structural risk minimization over data-dependent hierarchies. Technical Report NC-TR-96-053, NeuroCOLT.
....the expected misclassification error (learning curve) converges at a rate of O(1=n) as long as W ( j ) and sup kxk1 are reasonably bounded. It is also not difficult to obtain interesting PAC style bounds by using the covering number result for entropy regularization in [16] and ideas in [14]. Although the PAC analysis would imply a slightly suboptimal learning curve of O(log n=n) for linearly separable problems, the bound itself provides a probability confidence and can be generalized to non separable problems. We state below an example for non separable problems, which justifies the ....
....confidence and can be generalized to non separable problems. We state below an example for non separable problems, which justifies the entropy regularization. The bound itself is a direct consequence of Theorem 2. 2 and a covering number result with entropy regularization in [16] Note that as in [14], the square root can be removed if k fl = 0; fl can also be made data dependent. Theorem 3.2 If the data is infinity norm bounded as kxk1 b, then consider the family Gamma of hyperplanes w such that kwk 1 a and j w j ln( c. Denote by err(w) the misclassification error of w with the ....
J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory, 44(5):1926.
....; f 0 ( x ) where the x are chosen independently and identically distributed according to P j C and m will be defined below. Denote the weight vector of the SVM by w , the bias by b 1 , and the corresponding function by f 1 . Because of Observation 1, we find j w j. Due to [18](5.5) P j C (f x j f 1 ( x) 6= f 0 ( x)g) c log c log(32 m) log 8em 0:5 with confidence 0:5 where c = 577 j w maxfR; jb 1 jg and m denotes the number of points used for training. Since j w j can be uniformly bounded by j w j and jb 1 j = jy j )j 1 ....
....0 ( x) g( x)j =4 for all x 2 C. Denote the weight vector and bias of f 0 by w and b 0 , respectively. Since is uniformly continuous on the compact set C, we find 0 such that j ( x) y)j = 8 j w j) Note that this bound on the generalization ability of SVMs is formulated in [18](5.5) only for finite-dimensional real vector spaces since it is based on Vapnik's bound on the so-called fat-shattering dimension of large margin classifiers [21]. However, Gurvits provides the same bound on the fat-shattering dimension in the case of Hilbert spaces in [7]; hence the above ....
J. Shawe-Taylor, P. L. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data dependent hierarchies. Technical report, NeuroCOLT, 1996.
....architectures which guarantees that the transition function is a contraction with a given fixed parameter. Note that these learning results can easily be extended to arbitrary contractive transition functions with not priorly known constant through the luckiness framework of machine learning [30]. The size of the weights or the parameter of the contractive transition function, respectively, offers a hierarchy of nested function classes with increasing complexity. The parameter of the contraction controls the structural risk in learning contractive recurrent architectures. ....
J. Shawe-Taylor, P. L. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.
....by theory. Until recently, however, the success of the SVM remained somewhat obscure because in PAC/VC theory the structuring of the hypothesis space must be independent of the training sample, in contrast to the data dependence of the canonical hyperplane. As a consequence, Shawe-Taylor et al. [8] developed the luckiness framework, where luckiness refers to a complexity measure that is a function of both hypothesis and training sample. First bounds on the generalisation error in a PAC-Bayesian spirit were obtained by Shawe-Taylor et al. [9] for single hypotheses. Recently, David ....
....sufficient. Since all classifiers 3 are indistinguishable in terms of number of errors committed on the given training sample we introduce the concept of the margin of a classifier S (D fhi A u F jl E F oW (5) The following theorem due to Shawe Taylor et al. [8] bounds the generalisation errors of all classifiers 1 3 in terms of their margins . Theorem 1: For all probability measures utmv l t 6 , for any , with probability at least KJ over the random draw of the training sample Y for any consistent classifier 3 ....
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony, "Structural risk minimization over data-dependent hierarchies," IEEE Transactions on Information Theory, vol. 44, no. 5, pp. 1926.
....Siemens Corporate Research, Inc., AT&T, Digital Equipment Corporation, Central Research Institute of Electrical Power Industry, and Honda. 1 Introduction Deriving bounds on the generalization performance of kernel classifiers has been an important theoretical topic of research in recent years [4, 8, 9, 10, 12]. We present new bounds on the generalization performance of a family of kernel classifiers with margin, from which Support Vector Machines (SVM) can be derived. The bounds use the V_γ dimension of a class of loss functions, where the SVM one belongs to, and functions of the margin distribution of ....
....with h any continuous monotone function. 3 Discussion In recent years there has been significant work on bounding the generalization performance of classifiers using scale-sensitive dimensions of real-valued functions out of which indicator functions can be generated through thresholding (see [4, 9, 8], [3] and references therein). This is unlike the standard statistical learning theory approach where classification is typically studied using the theory of indicator functions (binary-valued functions) and their VC dimension [10]. The work presented in this paper is similar in spirit with that ....
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 1998.
....ffn(f) f margin(f(Xi) ff where IA denotes the indicator of an event A. It is well known that in many cases L(f) may be upper bounded by the margin error (f) plus a quantity that typically decreases with increasing γ; see, e.g., Bartlett [6], Anthony and Bartlett [2], Shawe-Taylor et al. [15]. In particular, covering numbers and fat-shattering dimensions of at scale γ have been used to obtain useful bounds on the probability of error of classifiers. In this paper we develop improved, data-dependent bounds. We show that the empirical versions of the fat-shattering dimensions and ....
....= max m: r shatters a subsequence of length m of x . Note that for X = Xl, Xn) fat,x ( is a random quantity whose value depends on the data. The (worst case) fat shattering dimension fat: n( sup was used by Kearns and Schapire [11] Alon et al. 1] Shawe Taylor et al. [15], and Bartlett [6] to derive useful bounds. In particular, Anthony and Bartlett [2] show that if d = fat: n( 8) then for any 0 5 1 2, with probability at least 1 5, all f r satisfies Z(f) n(f) 2.829(dlog2(3)ln(128n) 2.829 ln. 2) Throughout this paper log b denotes the logarithm to the ....
J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44:1926-1940, 1998.
....bounds on the generalization error cannot exist in these situations. In order to take the specific distribution into account we modify two approaches from the literature which guarantee learnability even for infinite VC dimension but an adequate stratification of the function class instead [2, 27]. These approaches are only formulated for binary valued function classes and consider the generalization error of an algorithm with zero empirical error. We generalize the situation to function classes and arbitrary error such that it applies to folding networks and standard learning algorithms ....
....a priori in order to get bounds on the number of examples which are sufficient for valid generalization. This holds even if the maximum input height of the trees in a concrete training set is restricted and therefore it is not very likely for larger trees to occur. Here the luckiness framework [27] turns out to be useful. It allows us to substitute the prior bounds on the probability of high trees by posterior bounds on the maximum input height. Since we want to get bounds for the UCED property we generalize the approach of [27] to function classes in the following way: Assume that F is a ....
J. Shawe-Taylor, P. L. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.
....directly: restricted transition functions or restricted input distributions, respectively. Afterwards, we derive posterior bounds on the generalization capability which depend on the concrete training set. For this purpose we generalize the luckiness framework to the general agnostic setting [10, 12, 16]. Unlike in [9] we obtain results which cover long term prediction and allow the restriction to representative parts of data. 2 The Learning Scenario A single layer FNN computes a function f : R ; x 7 (A x ) where A 2 R n m , 2 R , and denotes the componentwise application of ....
....neither prior information about the input distribution is available nor the weights fulfill appropriate restrictions. A different approach allows to derive posterior, data dependent bounds on the generalization ability. For this purpose we use a modification of the luckiness framework proposed in [16]. The key idea is very simple: Generalization does not hold for the general setting. However, in lucky situations, restricted maximum input length of sequences, for example, very good generalization bounds can be obtained. Hence the situation is stratified according to the concrete setting. Some ....
J. Shawe-Taylor, P. L. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.
....to depend on the specific setting instead. For this purpose we use some modification of the so called luckiness framework. The following is a generalization of the approach in [6] to the model free setting with a more convenient notation of the somewhat mysterious smoothness condition in [13] of the luckiness function. The key idea is very simple: Generalization does not hold for the general setting. However, in specific situations, if the maximum input height is restricted, for example, very good generalization bounds can be obtained. Hence the situation is stratified according to ....
.... ) trees are higher than t. Hoeffding s inequality yields the upper bound 2e 2m which is at most for as stated in the theorem. 2 Hence we obtain posterior bounds which allow an arbitrary underlying regularity. Moreover, we may drop a fraction if measuring the maximum height compared to [6, 13]. It would be possible to drop a fixed fraction which does not depend on m as well, which would yield an error not converging to 0 for m 1, but to some fixed constant depending on the fraction. In practice this would prove useful since the fraction of high trees converges to the probability of ....
J. Shawe-Taylor, P. L. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.
....training data are short sequences. Then we are lucky and valid generalization is very likely. Assumed a huge amount of long sequences can be found. Then we would demand more examples for proper generalization. This argumentation can be formalized within the so-called luckiness framework [36]. This framework allows to stratify a learning setting according to a concrete so-called luckiness function. The luckiness function measures important quantities which are related to the expected generalization ability. We expect better generalization if we are lucky, in unlucky situations the ....
J. Shawe-Taylor, P. L. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.
J. Shawe-Taylor, P. Bartlett, R. Williamson, M. Anthony, Structural Risk Minimization over Data-Dependent Hierarchies, NeuroCOLT Technical Report NC-TR-96-053, 1996. (ftp://ftp.dcs.rhbnc.ac.uk/pub/neurocolt/tech reports).
J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926--1940, 1998.
J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926--1940, 1998.
....mathematical flavors of margin bound dependent upon the weights w i of the vote and the features x i that the vote is taken over. w i and i ( l 2 =l 2 bounds) i w i and max i x i ( l 1 =l 1 bounds) The results here are of the l 2 =l 2 form. We improve on Shawe-Taylor et al. [12] and Bartlett [1] by a log(m) sample complexity factor and much tighter constants (1000 or unstated versus 9 or 18 as suggested by Section 2.2). In addition, the bound here covers margin errors without weakening the error-free case. Herbrich and Graepel [3] moved significantly towards the ....
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998.
....University Canberra 0200 Australia Bob.Williamson anu.edu.au Abstract In contrast to standard statistical learning theory, which studies uniform bounds on the expected error, we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [8] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the ....
....given hypothesis space. In addition, our new model of learning allows the exploitation of the fact that we serendipitously observe a training sample which is easy to learn by a given learning algorithm. In that sense, our framework is a descendant of the luckiness framework of Shawe-Taylor et al. [8]. In the present case, the luckiness is a function of a given learning algorithm and a given training sample and characterises the diversity of the algorithm's solutions. The notion of luckiness allows us to study given learning algorithms from many different perspectives. For example, the maximum ....
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926.
Shawe-Taylor, J., Bartlett, P., Williamson, R., and Anthony, M. (1996). Structural risk minimization over data-dependent hierarchies. Technical Report NC-TR-96-053, NeuroCOLT.
J. Shawe-Taylor, P. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Trans. Information Theory, 44, 1926--1940.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926--1940, 1998.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, M. Anthony, "Structural Risk Minimization over Data-Dependent Hierarchies", IEEE Transactions on Information Theory, Volume 44(5), pp. 1926--1940, 1998.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926--1940, 1998.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies, IEEE Trans. Inf. Theory 44 (1998.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory, 44:1926--1940, 1998.
Shawe-Taylor, J., Bartlett, P., Williamson, R.C., Anthony, M.: Structural Risk Minimization over Data-dependent Hierarchies. Technical Report NC-TR-1996-053, Royal Holloway, University of London, 1996.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, M. Anthony, "Structural Risk Minimization over Data-Dependent Hierarchies", IEEE Trans. on Information Theory, 44(5):1926-1940, 1998.
Shawe-Taylor J., Bartlett P. L., Williamson R. C., Anthony M., (1998), Structural risk minimization over data-dependent hierarchies, IEEE Trans. Inf. Theory, Vol. 44, pp.1926-1940.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies, IEEE Trans. Inform. Theory 44 (1998.
J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural Risk Minimization over Data-Dependent Hierarchies. Technical Report NC-TR-96-053, NeuroCOLT, 1996.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory, 44:1926.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. Technical Report NC-TR-96-053, Department of Computer Science, Royal Holloway, University of London, Egham, UK, 1996.
J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural Risk Minimization over Data-Dependent Hierarchies. Technical Report NC-TR-96-053, NeuroCOLT, 1996.
J. Shawe-Taylor, P. L. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1998.