H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8), 1992.
....k. (3.22) We shall see in a while how Mercer's condition reduces to a requirement on k that is very easy to check. 3.4 Eigenvalue Expansion. The general aim of this part of the thesis is to establish an analysis of the above minimisation problem (3.8) by means of statistical mechanics [64, 76, 55]. The starting point for this is, as usual, the partition function (3.23). The integral is over the free variables, i.e. the student weights; the Hamiltonian is just the objective function from (3.8), and the inequality constraints are implemented via the $\Theta$ functions. ....
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Phys. Rev. A, 45(8):6056--6091, 1992.
....who show that the generalization error undergoes a second-order phase transition related to symmetry breaking and asymptotically follows an inverse power law as the sample size increases. They, however, consider hard-boundary gating functions and only a small number of experts (e.g. m = 2). As Seung, Sompolinsky and Tishby (1992) pointed out, the two different approaches, VC theory that employs inequalities and bounds versus statistical physics that uses approximations, provide complementary perspectives on the study of generalization. Acknowledgments: The author is grateful to Martin A. Tanner for suggesting this ....
Seung, H. S., Sompolinsky, H. and Tishby, N. 1992. Statistical mechanics of learning from examples. Physical Review A. 45, 6056-6091.
....neighborhood of the empirical minimizer define the version space (see also [4]). Averaging over this neighborhood yields a structure with risk equivalent to the expected risk obtained by random sampling from this set of hypotheses. There also exists a tight methodological relationship to [7] and [4], where learning curves for the learning of two-class classifiers are derived using techniques from statistical mechanics. 2 The Empirical Risk Approximation Principle. The data samples $Z = \{z_r \in \ldots;\ 1 \le r \le l\}$ which have to be analyzed by the unsupervised learning algorithm are elements ....
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056-6091, April 1992.
....3 Intermezzo: other approaches. 3.1 The Langevin approach. In this section we will point out the difference between the intrinsic noise due to the random presentation of training patterns and the artificial noise in studies on the generalization capabilities of neural networks (see e.g. [57, 64]). In the latter case, the noise is added to the deterministic equation (18), i.e. the weights evolve according to the Langevin equation $\frac{dw(t)}{dt} = -\nabla E(w(t)) + \sqrt{2T}\,\xi(t)$ (23), where $\xi(t)$ is white noise obeying $\langle \xi_i(t)\,\xi_j(t') \rangle = \delta_{ij}\,\delta(t - t')$. The Langevin ....
....$\ldots(w)\,P(w, t)\} + T\,\nabla^2 P(w, t)$. The equilibrium distribution is [compare with equation (9)] $P_s(w) = \frac{1}{Z} \exp\left[-\frac{E(w)}{T}\right]$ (24), with Z a normalization constant. The existence of this Gibbs distribution raises the idea of putting learning in the framework of statistical mechanics [45, 57, 64]. In these studies, the Langevin equation (23) is more an excuse to arrive at the Gibbs distribution (24) than an attempt to study the dynamics of learning processes in artificial neural networks. The equilibrium distribution of the master equation for on-line learning processes is not a simple ....
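The link between the Langevin equation (23) and the Gibbs distribution (24) quoted in this excerpt can be checked numerically. The following is a minimal sketch, not taken from the cited paper: it assumes a toy one-dimensional energy E(w) = w^2 / 2 (chosen only for illustration) and integrates the Langevin equation with a simple Euler-Maruyama scheme. For this energy the Gibbs distribution is Gaussian with variance T, so the empirical variance of the trajectory should come out close to T, up to discretisation and sampling error. The function name and all parameter values are hypothetical.

```python
import math
import random


def simulate_langevin(temperature, dt=0.05, n_steps=200_000, burn_in=2_000, seed=0):
    """Euler-Maruyama integration of dw/dt = -E'(w) + sqrt(2T) * xi(t).

    Assumption: E(w) = w**2 / 2 (a toy energy chosen for illustration),
    so the Gibbs distribution exp(-E/T)/Z is Gaussian with variance T.
    Returns the empirical variance of w after the burn-in period.
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * temperature * dt)
    w = 0.0
    total = 0.0
    total_sq = 0.0
    count = 0
    for step in range(n_steps):
        grad = w  # dE/dw for E(w) = w**2 / 2
        w += -grad * dt + noise_scale * rng.gauss(0.0, 1.0)
        if step >= burn_in:
            total += w
            total_sq += w * w
            count += 1
    mean = total / count
    return total_sq / count - mean * mean


if __name__ == "__main__":
    T = 0.5
    print(f"empirical variance: {simulate_langevin(T):.3f}, Gibbs prediction: {T}")
```

A smaller step size dt reduces the discretisation bias of the scheme at the cost of a longer run for the same simulated time.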
H. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45:6056--6091, 1992.
....= 0 (i.e. origin-centered halfspaces) under the uniform distribution on the unit sphere in $R^n$. 6.1 PREVIOUS WORK The problem of learning an unknown origin-centered halfspace in $R^n$ given access to examples drawn uniformly from the unit sphere has been the subject of considerable research [7, 8, 16, 22, 28, 34, 39]. Long [28] proved that any algorithm which learns an origin-centered halfspace to accuracy $\epsilon$ under the uniform distribution must use at least $\Omega(n/\epsilon)$ examples. Long also showed [29] that by applying Vaidya's linear programming algorithm [40] it is possible to learn to accuracy ....
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056-6091, 1992.
....0 h, and q at least one h agreeing with q for which T(h) 0. This last requirement ensures that the generalizer is defined for all q. The Gibbs generalizer can be viewed as a zero temperature limit of the scenarios analyzed in the statistical mechanics supervised learning framework (see [Seung et al. 1991a, 1991b, Tishby et al. 1989] III) The generalization error function is a mapping from (f, h, q X , q Y ) to R. It measures how good h is as a guess for f. One rather popular choice is the i.i.d. error function: Er(f, h, q) S xX p(x) 1 d(f(x) h(x) where p( is the same distribution ....
Seung H., et al. (1991). Statistical mechanics of learning from examples I, II. Submitted.
....attracted much attention recently [1, 4, 6, 7, 13]. This was motivated by both theoretical and practical reasons. First, because the number of possible states in the weight space of a binary network is finite, its properties may differ drastically from those of a network with continuous weights [4, 12]. Second, the hardware realization of binary networks may prove simpler. The generalization ability of neural networks with binary weights has been studied extensively using the statistical mechanics approach [4, 7, 12]. Although this approach has yielded some impressive results, it has its ....
....properties may differ drastically from those of a network with continuous weights [4, 12]. Second, the hardware realization of binary networks may prove simpler. The generalization ability of neural networks with binary weights has been studied extensively using the statistical mechanics approach [4, 7, 12]. Although this approach has yielded some impressive results, it has its shortcomings. In particular, it neglects the computational aspect of the learning process by assuming a stochastic training algorithm, similar to a finite Monte Carlo process, that leads at long times to a Gibbs distribution ....
[Article contains additional citation context not shown here]
Seung H. S., Sompolinsky H., and Tishby N., "Statistical Mechanics of Learning from Examples", Phys. Rev. A, Vol. 45, (1992), 6056--6091.
....of machine learning. Many of the notions that have been used in the definition of the PAC model, as well as later studies, emerged from different (although related) research fields, such as Pattern Recognition [43], Inductive Inference [7], Information Theory [102, 33] and Statistical Mechanics [105, 67]. A notable contribution was made by the work of Vapnik [116], who addressed mainly questions related to the sample complexity of learning algorithms. Since then, the computational learning theory literature has made extensive use of many of the notions and results proposed by Vapnik, although he ....
....the hypothesis h. Several papers in the field of computational learning theory have studied sequential classification problems in which $\{\pm 1\}$-labeled instances (examples) are given online, one at a time, and for each new instance, the learning system must predict the label before it sees it (cf. [66, 105]). Such systems adapt online, learning to make better predictions as they see more examples. If n is the total number of examples, then the performance of these online learning systems, as a function of n, has been measured both by the total number of mistakes (incorrect predictions) they make ....
[Article contains additional citation context not shown here]
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45:6056-6091, 1992.
....risk obtained by a random sampling over this ball. From a Bayesian point of view this is similar to averaging over a posterior distribution, where a uniform distribution over the hypothesis space is used as prior. In addition there is a tight methodological relationship to the papers [10] and [11] where learning curves for the learning of two class classifiers are derived using techniques from statistical mechanics. Especially in [11] the notion of an optimal temperature with respect to the generalization error is introduced. These works present asymptotic results in the sense, that the ....
....where a uniform distribution over the hypothesis space is used as prior. In addition there is a tight methodological relationship to the papers [10] and [11] where learning curves for the learning of two class classifiers are derived using techniques from statistical mechanics. Especially in [11] the notion of an optimal temperature with respect to the generalization error is introduced. These works present asymptotic results in the sense, that the learning behavior is studied in the limit of an infinite number of data samples l and an infinite number of the hypotheses, where the ratio ....
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056--6091, April 1992.
....there is a value of ff below which training with zero error is possible (even in the unrealizable problem) The entropy of the system at this value of ff vanishes; thus we term this value ff ZE . ii) In the realizable case L s L t there is a sharp transition to perfect generalization at ff ZE [SST92][MF92] iii) For the unrealizable problem above ff ZE , there is a freezing of the system at a nonzero temperature T ZE (ff) i.e. the entropy of the system vanishes for all T T ZE (ff) The training error of the network above ff ZE is nonzero. iv) N log ff bits are sufficient to perfectly ....
Seung S., H. Sompolinsky and N. Tishby. Statistical mechanics of learning from examples. Phys. Rev. A45: 6056-6091.
.... is spherically symmetric (for instance, the uniform density on the unit ball in N ) and the target function is a function in H s with all s nonzero weights equal to 1, then it can be shown that the approximation rate function ffl g (d) is ffl g (d) 1= cos Gamma1 ( p d=N) for d s [6], and of course ffl g (d) 0 for d s. This problem provides a nice contrast to the intervals problem, since here the behavior of the approximation rate for small d is concave down: as long as d s, an incremental increase in d yields more approximative power for large d than it does for small d ....
....Note that the best such bound may depend in a complicated way on all of the elements of the problem: f, D and the structure. Indeed, much of the recent work on the statistical physics theory of learning curves has documented the wide variety of behaviors that such deviations may assume [6, 3]. However, for many natural problems it is both convenient and accurate to rely on a universal estimation rate bound provided by the powerful theory of uniform convergence: namely, for any f, D and any structure, the function $\rho(d, m) = \sqrt{(d/m) \log(m/d)}$ is an estimation rate bound [9]. ....
[Article contains additional citation context not shown here]
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review, A45:6056-- 6091, 1992.
....effective number of accessible states may be much smaller than the size of the state space. These considerations do not apply to MDPs with undiscounted rewards. Our analysis employs a particular limiting method, the so-called thermodynamic limit, developed in the statistical physics literature [9, 12]. For MDPs, this is the combined limit in which the allowed exploration time, T, and the size of the state space, N, grow to infinity at a fixed rate: $T \to \infty$, $N \to \infty$, $T/N = \alpha$ (finite). Ref. [5] gives a rigorous treatment of this method from the viewpoint of computational learning theory. Though ....
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A 45: 6056--6091, 1992.
....those frameworks: PAC ([Blumer et al. 1987, Blumer et al. 1989, Haussler 1994ab, Valiant 1984, Dietterich 1990, COLT, Rivest 1989, Natarajan 1991, Anthony and Biggs 1992]), the statistical physics of supervised learning (SP [Hertz et al. 1991, Opper and Haussler 1991ab, Schwartz et al. 1990, Seung et al. 1991, Tishby et al. 1989, Tishby et al. 1994, Van der Broech and Kawai 1991, Wolpert 1994e]), Bayesian supervised learning ([Berger 1985, Buntine and Weigend 1991, Buntine 1990a, Duda and Hart 1973, Haussler et al. 1994, Loredo 1990, Neal 1994, Wolpert 1994bc, 1993, MacKay 1991, Wolpert and Strauss ....
....there is a good deal of current work concerned with modifying and extending the other three frameworks. For example, a lot of work has been done extending SP to the case of noise, non-zero temperature generalizers, and/or assumed correspondences between the generalizer and the prior. (See Seung et al. 1991, Tishby et al. 1989, Tishby et al. 1994.) Often this is all done with f and h parameterized as neural nets. Sometimes the distributions involving f and d are referred to as the "teacher", and P(h | d) as the "student". An example of an extension of PAC is the Probably Approximately Bayes ....
Seung H., et al. (1991). Statistical mechanics of learning from examples I, II. Physical Review A, 45, p.
....D we consider is the uniform distribution on the unit sphere (or any other radially symmetric distribution). Despite the voluminous literature on learning perceptrons in general (see the work of Minsky and Papert [17] for a partial bibliography) and with respect to this distribution in particular [23, 2, 6], no efficient noise-tolerant learning algorithm has been given previously. Here we give a very simple and efficient algorithm for learning from statistical queries (and thus an algorithm tolerating noise). The sketch of the main ideas is as follows: for any vector $v \in R^n$, the error of v ....
H.S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056--6091, April 1992.
....is of little interest. The more relevant measure is the error rate of the system in the field, where it would be used in practice. This performance is estimated by measuring the accuracy on a set of samples disjoint from the training set, called the test set. Much theoretical and experimental work [Seung et al. 1992, Vapnik et al. 1994, Cortes et al. 1994] has shown that the gap between the expected error rate on the test set $E_{test}$ and the error rate on the training set $E_{train}$ decreases with the number of training samples approximately as $E_{test} - E_{train} = k\,(h/P)^{\alpha}$ (1), where P is the number of ....
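The power-law relation (1) in this excerpt becomes a straight line in log-log coordinates, which gives a simple way to estimate its constants from measured gaps. The sketch below is an illustration only and does not come from the citing paper: the function name `fit_power_law`, the value of h, the sample sizes, and the synthetic data are all made up for the example.

```python
import math


def fit_power_law(ratios, gaps):
    """Least-squares fit of gap = k * ratio**alpha in log-log space.

    ratios: values of h / P; gaps: measured E_test - E_train (all positive).
    Returns (k, alpha).
    """
    xs = [math.log(r) for r in ratios]
    ys = [math.log(g) for g in gaps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx  # slope of the log-log line
    k = math.exp(mean_y - alpha * mean_x)  # intercept mapped back from log space
    return k, alpha


if __name__ == "__main__":
    # Synthetic data generated from k = 2.0, alpha = 0.75 (made-up values).
    h = 50.0
    sizes = [100, 200, 400, 800, 1600]
    ratios = [h / p for p in sizes]
    gaps = [2.0 * r ** 0.75 for r in ratios]
    print(fit_power_law(ratios, gaps))
```

With noisy measurements the fit in log space weights small gaps more heavily than a direct nonlinear fit would, which may or may not be desirable.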
Seung, S., Sompolinsky, H., and Tishby, N. (1992). Statistical mechanics of learning from examples. Physical Review A, 45:6056--6091.
....Here we present the first analytic study of generalization in the mixture of experts from the statistical physics perspective. Statistical mechanics formulations [5] have been utilized to investigate generalization in the learning of feedforward neural networks with various architectures [6, 7, 8, 9], together with VC theory [10, 11]. We expect that the statistical mechanics approach can also be effectively used to evaluate more advanced architectures, including mixture models. In this paper we study generalization in the mixture of experts [1] and its variant with a two-level hierarchy [2] ....
....the sake of simplicity, we consider a network with one gating network and two experts. Each expert produces its output j as a generalized linear function of the N dimensional input x: j = f(W j Delta x) j = 1; 2; 1) where W j is a weight vector of the j th expert with spherical constraint[6]. We consider a transfer function f(x) sgn(x) which produces binary outputs. The principle of divide and conquer is implemented by assigning each expert to a subspace of the input space with different local rules. A gating network partitions the input space and assigns each expert a weighting ....
[Article contains additional citation context not shown here]
H. S. Seung, H. Sompolinsky, and N. Tishby, Statistical mechanics of learning from examples, Phys. Rev. A 45, 6056 (1992).
....that is, algorithms that choose their hypotheses according to a posterior distribution, obtained from a prior that is modified by the sample data and a temperature parameter. Such algorithms are frequently studied in the simulated annealing and statistical physics literature on learning [16, 5]. DEFINITION 4.5 We say that a randomized algorithm A using hypothesis space H is a Bayesian algorithm if there exists a prior P over H and a temperature $T \ge 0$ such that for any sample $S_m$ and any $h \in H$, $\Pr_r[A(S_m, r) = h] = \frac{1}{Z} P(h) \exp\left(-\frac{1}{T} \sum_i I(h(x_i) \ne y_i)\right)$ ....
....overestimates $\epsilon(A(S_m)) = 1/2$ by 1/2, and for half of the sample it underestimates the error by 1/2. (Theorem 5.4) 6 Extensions and Open Problems It is worth mentioning explicitly that in the many situations when uniform convergence bounds better than $VC(d, m, \delta)$ can be obtained [16, 6], our resulting bounds for leave-one-out will be correspondingly better as well. In the full paper we will also detail the generalizations of our results for other loss functions, and give results for k-fold cross-validation as well. There are a number of interesting open problems, both theoretical ....
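The Bayesian (Gibbs) algorithm of Definition 4.5 in this excerpt can be made concrete for a finite hypothesis space. The following is a minimal sketch under stated assumptions, not the citing paper's implementation: the hypothesis class (one-dimensional threshold functions), the helper names `gibbs_posterior`, `sample_hypothesis` and `make_threshold`, and the toy sample are all hypothetical, and the code requires T > 0 so that the division by T is defined.

```python
import math
import random


def gibbs_posterior(hypotheses, prior, sample, temperature):
    """Return Pr[h] = P(h) * exp(-errors(h) / T) / Z for each hypothesis.

    hypotheses: list of callables x -> label; prior: list of P(h);
    sample: list of (x, y) pairs; temperature: T > 0 (assumed here).
    """
    weights = []
    for h, p in zip(hypotheses, prior):
        errors = sum(1 for x, y in sample if h(x) != y)  # sum_i I(h(x_i) != y_i)
        weights.append(p * math.exp(-errors / temperature))
    z = sum(weights)  # normalisation constant Z
    return [w / z for w in weights]


def sample_hypothesis(hypotheses, posterior, rng):
    """Draw one hypothesis at random according to the posterior."""
    return rng.choices(hypotheses, weights=posterior, k=1)[0]


def make_threshold(theta):
    """Toy hypothesis: predict +1 when x >= theta, otherwise -1."""
    return lambda x: 1 if x >= theta else -1


if __name__ == "__main__":
    thetas = [0.0, 0.25, 0.5, 0.75]
    hyps = [make_threshold(t) for t in thetas]
    uniform_prior = [1.0 / len(hyps)] * len(hyps)
    data = [(0.1, -1), (0.3, -1), (0.6, 1), (0.9, 1)]
    post = gibbs_posterior(hyps, uniform_prior, data, temperature=0.5)
    chosen = sample_hypothesis(hyps, post, random.Random(0))
    print(post, chosen(0.7))
```

Lowering the temperature concentrates the posterior on hypotheses with the fewest training errors, while a large temperature leaves it close to the prior.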
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review, A45:6056--6091, 1992.
....theoretical description of learning in neural networks predicts a generalization error inversely proportional to some power of the number of learning samples. The most common approaches are probably approximately correct (PAC) learning [Baum 1989], [Hausler 1989] and the statistical mechanical (SM) framework [Seung 1992]. Alternative approaches also exist, based on the Bayesian paradigm [Lansen 1989], Vapnik's method of structural risk minimization [Vapnik 1982], or information-theoretic model estimation techniques such as the minimum description length criterion [Rissanen 1986]. In fact, these methods are very closely ....
Seung, H. S., Sompolinsky, H., Tishby, N. "Statistical mechanics of learning from examples", Physical Review A, 1992, vol.45, no.8, pp. 6056-6091.
....is of little interest. The more relevant measure is the error rate of the system in the field, where it would be used in practice. This performance is estimated by measuring the accuracy on a set of samples disjoint from the training set, called the test set. Much theoretical and experimental work [3], [4], [5] has shown that the gap between the expected error rate on the test set $E_{test}$ and the error rate on the training set $E_{train}$ decreases with the number of training samples approximately as $E_{test} - E_{train} = k\,(h/P)^{\alpha}$ (1), where P is the number of training samples, h is a measure ....
S. Seung, H. Sompolinsky, and N. Tishby, "Statistical mechanics of learning from examples," Physical Review A, vol. 45, pp. 6056--6091, 1992.
....that is simple and efficient, and performs essentially as well as the min-max strategy. Actually P is a family of algorithms that is related to the algorithm studied by Vovk [Vov90b] and the Bayesian, Gibbs and weighted majority methods studied by a number of authors [LW94, LLW91, HKS94, STS90, SST92, HBar, HW94], as well as the method developed by Feder, Merhav and Gutman [FMG92]. We show that P performs quite well in the sense defined above so that, for example, given any finite set E of weather forecasting experts, P is guaranteed not to perform much worse than the best expert in E, no ....
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056--6091, 1992.
....n 1 ;h) Gamma 1 n log Z dPH (h)e Gammafi n(f;h) 15) The bound (15) is general in that it makes no assumptions about the nature of the space H. In fact, the expression appearing on the right hand side of eq. 15) is just the high temperature free energy as derived previously in (Seung et al. 1992). The point to note in the present case, however, is that here it is found to be an upper bound for all fi, while the usual interpretation treats the right hand side of eq. 15) as an approximation to the average stochastic complexity, valid for small fi. It is also useful at this stage to ....
....bound. This bound has often been used as a quick way to obtain useful qualitative results. As we show here however, the annealed approximation may lead to totally inadequate results in the case of learning unrealizable rules. This point has been observed by several authors (see for example Seung et al. 1992, and Meir Fontanari, 1992) but seems to have been ignored by many other workers. The lower bound is easily derived using the convexity of the logarithm function and Jensen s bound: ED Gamma 1 n log Z dPH (h)e Gammafi (y n 1 ;x n 1 ;h) Gamma 1 n log Z dPH (h)ED e ....
[Article contains additional citation context not shown here]
Seung, H.S., Sompolinsky, H. & Tishby N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A 45: 6056-6091.
....fully defines the learning problems of interest. It then provides a short introduction to the areas of statistical physics relevant to this paper. We do not provide a full account of the statistical physics relevant to the analysis of learning curves; such an account can be found in Seung et al. [11], Watkin et al. [2], Hertz et al. [12] and Landau and Lifshitz [13]. 2.1 General Notation We consider feedforward networks having p real-valued inputs, denoted by the vector $x \in R^p$ (vectors are assumed to be column vectors throughout this paper). Input vectors are assumed to be generated ....
....only the parameters in $\alpha$ are adaptable, the networks of interest operate by mapping input vectors of dimension p to a new space of dimension k, and mapping the resulting vectors to an output using a linear network having parameter vector $\alpha$. Many results are available for such linear networks [2, 11]; however, such results usually apply to networks having Gaussian-distributed inputs, whereas in this case such inputs are clearly non-Gaussian, as the outputs of the basis functions of equation (1) are strictly positive. Some of the later results presented in this paper can therefore be regarded ....
[Article contains additional citation context not shown here]
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056--6091, April 1992.
....Convergence: sample independent, worst case, large sample bounds, grew out of pattern recognition and computer science, late 1980s on (see [40, 41, 28, 98, 43]; large sample so ignore priors). Statistical Physics: adapting mathematical techniques from statistical physics, late 1980s on (see [86, 87]; assume truth is known so no prior needed). Stochastic Complexity, Minimum Description Length (MDL) etc.: techniques from coding theory applied to learning, late 1970s on ([81, 99, 59]; rename priors to be code lengths and claim they are objective). Bayesian Statistics: applies pure probability ....
H.S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45:6056--6091, 1992.
....in this paper, which was introduced in [Wolpert 1992], is an extension of conventional Bayesian analysis. This formalism doesn't restrict itself to a certain kind of generalizer (as do the various versions of the statistical mechanics machine learning formalism, for example [Tishby et al. 1989, Seung et al. 1991]), nor does it restrict itself to finding worst-case bounds, where one assumes one knows very little about the generalizer (as does PAC, for example [Blumer et al. 1987, Blumer et al. 1989, Valiant 1984]). It does not need to assume that one only knows the size of the training set, and not the ....
....X which contribute substantially to P(h f m) for all h which contribute substantially to P(E f m) see equation (2) Schwartz et al. call this approximation self averaging and claim that one would expect it to hold whenever n m. Self averaging is similar to the annealed approximation [Seung et al. 1991] which, loosely speaking, consists of replacing k( with the L X average of k( Given self averaging and (3) P(h f, m) T(h) S x X d(h(x) f(x) p(x) m = 20 T(h) 1 Er(f, h) m . Substituting this result into equation (2) collecting all constants into an overall ....
Seung H., et al. (1991). Statistical mechanics of learning from examples I, II. Submitted.
....arising from such a simple scheme have been investigated. The one layered perceptron is a popular platform for these studies: it is complicated enough to arrive at interesting results, yet simple enough to be described analytically. Excellent reviews of the main achievements can be found in [1, 2]. Instead of simply storing questions and answers, the student tries to realize a particular input output relation which corresponds to some concept hiding in the training examples. Her generalization ability, i.e. her performance on questions she has never seen before, measures her success. ....
....to some concept hiding in the training examples. Her generalization ability, i.e. her performance on questions she has never seen before, measures her success. Generalization properties have been studied for all kinds of different learning recipes, both deterministic [3] and stochastic [1], but mostly under ideal conditions, i.e. with completely reliable teachers and sometimes even students that only ask intelligent questions ("learning with queries" [4]). From a practical point of view, it is worth spending some effort on less ideal situations. Examples include learning ....
[Article contains additional citation context not shown here]
H. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45:6056--6091, 1992.
....curves, such as for example the occurrence of discontinuous jumps in learning curves, which cannot be predicted from VC theory alone. These results were derived by adapting to the problem of learning ideas that arise in the context of statistical mechanics. In recent years many other results (Seung et al. 1992, Watkin et al. 1993, Opper & Kinzel, 1996), bounds or approximations, rigorous or not, have been obtained in the learning theory of neural networks by applying a host of methods originating in the study of disordered materials. These methods permit looking at the properties of large networks, ....
....therefore the integral (5) depends only on the scalars $\rho = B \cdot J / (BJ)$, x, y, B and J. As x and y are sums of independent random variables ($\sum_i S_i J_i / J$ and $\sum_i S_i B_i / B$, respectively), in the TL a straightforward application of the central limit theorem leads to (see e.g. Opper et al. 1990, Seung et al. 1992, Watkin et al. 1993) $e_G(\rho) = \int dx\,dy\,P_C(x, y)\,\Theta(-xy) = \frac{1}{\pi} \arccos \rho$ (6). Figure 1. Simple representation of weight vectors in the hypersphere. The teacher and the student disagree when the input vector S is ....
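The arccos formula in equation (6) of this excerpt can be checked with a short Monte Carlo experiment. The sketch below is an illustration under stated assumptions rather than the citing paper's method: it takes the large-system picture at face value and samples the two fields x and y directly as unit-variance Gaussians with correlation rho, then counts how often they have opposite signs. The function name `disagreement_rate` and the sample sizes are made up.

```python
import math
import random


def disagreement_rate(rho, n_samples=100_000, seed=0):
    """Estimate Pr[x * y < 0] for unit Gaussians x, y with correlation rho.

    x plays the role of the student field and y the teacher field; in the
    thermodynamic limit their joint distribution is assumed to be Gaussian.
    """
    rng = random.Random(seed)
    c = math.sqrt(1.0 - rho * rho)
    disagreements = 0
    for _ in range(n_samples):
        g1 = rng.gauss(0.0, 1.0)
        g2 = rng.gauss(0.0, 1.0)
        x = g1
        y = rho * g1 + c * g2  # correlation of x and y equals rho
        if x * y < 0:  # teacher and student give opposite labels
            disagreements += 1
    return disagreements / n_samples


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        predicted = math.acos(rho) / math.pi
        print(f"rho={rho}: simulated {disagreement_rate(rho):.4f}, arccos(rho)/pi {predicted:.4f}")
```

The estimate carries ordinary sampling noise, so agreement with arccos(rho)/pi is only expected to within a few standard errors for the chosen number of samples.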
Seung, H.S., Sompolinsky, H. & Tishby, N. (1992). Statistical mechanics of learning from examples.
....that is simple and efficient, and performs essentially as well as the min-max strategy. Actually P is a family of algorithms that is related to the algorithm studied by Vovk [Vov90] and the Bayesian, Gibbs and weighted majority methods studied by a number of authors [LW94, LLW95, HKS94, STS90, SST92, HB92, HW95], as well as the method developed by Feder, Merhav and Gutman [FMG92]. We show that P performs quite well in the sense defined above so that, for example, given any finite set E of weather forecasting experts, P is guaranteed not to perform much worse than the best expert in E, no ....
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056--6091, 1992.
....ways. First, and perhaps most importantly, $\epsilon_M((1-\gamma)m)$ may be considerably larger than $\epsilon_M(m)$. This could either be due to properties of the underlying learning algorithm L, or due to inherent phase transitions (sudden decreases) in the optimal information-theoretic learning curve [9, 3]; thus, in an extreme case it could be that the generalization error that can be achieved within some class $F_d$ by training on m examples is close to 0, but that the optimal generalization error that can be achieved in $F_d$ by training on a slightly smaller sample is near 1/2. This is ....
....problem instance (fF d g; f; D;L) We believe that giving similarly general bounds for any penalty based algorithm would be extremely difficult, if not impossible. The reason for this belief arises from the diversity of learning curve behavior documented by the statistical mechanics approach [9, 3], among other sources. In the same way that there is no universal learning curve behavior, there is no universal behavior for the relationship between the functions ffl(d) and ffl(d) the relationship between these quantities may depend critically on the target function and the input ....
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review, A45:6056--6091, 1992.
....research was supported by the ESPRIT Working Group RAND II and CNRS GDR AMI y This research was supported by CNRS GDR AMI Random Structures Algorithms Vol. 1996) c fl 1996 John Wiley Sons, Inc. CCC 1063 8539 94 030117 13 2 S. BOUCHERON AND D. GARDY Statistical Physics techniques [19] have portrayed a variety of behaviors. Though those investigations were concerned with rather simple systems like perceptrons, they had to resort to advanced methods like replica calculus that still require foundational elaboration. Other analyses [12] used approximations to provide rigorous ....
H.S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056--6091, April 1992.
....estimation in MDPs and given a PAC learning algorithm for finding near optimal policies. In this paper we propose an alternative framework for studying MDPs, based on ideas from statistical mechanics. Our approach draws on previous work in statistical mechanics and computational learning theory [5, 9, 12]. We have not made an effort to be rigorous, relying instead on numerical simulations to check the soundness of our methods. The main contributions of this paper are the following: to view the expected returns as defining an energy landscape over policy space, to analyze this landscape with tools ....
....from the optimal policy, as well as upper and lower bounds on this loss. In the last part of the section, we discuss the problem of learning optimal policies from empirical estimates of the expected return. We relate our findings for the entropy to the well known limit of high temperature learning [9]. Numerical evidence is presented to support the theoretical results. Finally, section 4 presents our conclusions and ideas for future work. The appendix contains technical details of the calculations that appear in section 3. 2 Markov Decision Processes This section presents a brief review of ....
[Article contains additional citation context not shown here]
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A 45: 6056--6091, 1992.
....Introduction Several recent papers in the area of computational learning theory have studied sequential classification problems in which $\{\pm 1\}$-labeled instances (examples) are given on-line, one at a time, and for each new instance, the learning system must predict the label before it sees it [HLW90, Lit89, LW89, Vov90b, HKS91, OH91a, SST92, MF92]. Such systems adapt on-line, learning to make better predictions as they see more examples. If n is the total number of examples, then the performance of these on-line learning systems, as a function of n, has been measured both by the total number of mistakes (incorrect predictions) they make ....
....n, has been measured both by the total number of mistakes (incorrect predictions) they make during learning, and by the probability of a mistake on the nth example alone. The latter function is often called a learning curve (see also [HKLW91]). Sequential regression problems have also been studied [Daw84, Dawa, Dawb, Vov90a, Vov92, Yam91, Yam92, Ama92, AFS92, SST92, MF92]. In this case, instead of predicting either 1 or $-1$, the learning system outputs a probability distribution, predicting that the label will be 1 with a certain probability, and $-1$ with one minus that probability. When there is some noise or uncertainty in the labeling process, an ....
[Article contains additional citation context not shown here]
H.S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056--6091, April 15, 1992.
.... that is, algorithms that choose their hypotheses according to a posterior distribution, obtained from a prior that is modified by the sample data and a temperature parameter. Such algorithms are frequently studied in the simulated annealing and statistical physics literature on learning [SST92, GG84]. DEFINITION 4.5 We say that a randomized algorithm $A$ using hypothesis space $H$ is a Bayesian algorithm if there exists a prior $P$ over $H$ and a temperature $T \ge 0$ such that for any sample $S_m$ and any $h \in H$, $\Pr_r[A(S_m, r) = h] = \frac{1}{Z} P(h) \exp\left(-\frac{1}{T} \sum_i I(h(x_i) \ne y_i)\right)$ ....
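The posterior in this definition can be illustrated with a minimal Python sketch for a finite hypothesis space and a strictly positive temperature. The function names `gibbs_posterior` and `draw_hypothesis`, and the representation of hypotheses as Python callables, are assumptions made for illustration only; they do not come from any of the cited papers.

```python
import math
import random


def gibbs_posterior(hypotheses, prior, sample, temperature):
    """Return P(h) * exp(-errors(h) / T) / Z for each hypothesis h.

    hypotheses: list of callables x -> label (hypothetical representation)
    prior: list of prior probabilities P(h), same order as hypotheses
    sample: list of (x, y) pairs
    temperature: T > 0 (the T = 0 limit is not handled in this sketch)
    """
    if temperature <= 0:
        raise ValueError("this sketch requires a strictly positive temperature")
    weights = []
    for h, p in zip(hypotheses, prior):
        # Count training mistakes: sum_i I(h(x_i) != y_i)
        errors = sum(1 for x, y in sample if h(x) != y)
        weights.append(p * math.exp(-errors / temperature))
    z = sum(weights)  # normalizing constant Z
    return [w / z for w in weights]


def draw_hypothesis(hypotheses, prior, sample, temperature, rng):
    """Sample one hypothesis according to the posterior above."""
    probs = gibbs_posterior(hypotheses, prior, sample, temperature)
    return rng.choices(hypotheses, weights=probs, k=1)[0]
```

Lowering the temperature concentrates the distribution on hypotheses with the fewest training mistakes, while a large temperature leaves it close to the prior.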
....overestimates $\epsilon(A(S_m)) = 1/2$ by $1/2$, and for half of the sample it underestimates the error by $1/2$. (Theorem 5.4) 6 Extensions and Open Problems It is worth mentioning explicitly that in the many situations when uniform convergence bounds better than $VC(d, m, \delta)$ can be obtained [SST92, HKST96], our resulting bounds for leave-one-out will be correspondingly better as well. There are a number of interesting open problems, both theoretical and experimental. On the experimental side, it would be interesting to determine the typical dependence of the leave-one-out estimate s ....
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review, A45:6056--6091, 1992.
No context found.
H.S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056--6091, April 1992.
.... Lyuu and Rivin (1991; 1992). Here we are actually giving a bound on the entire learning curve, and the behavior of our bound is very similar in shape to learning curves obtained in both simulations and non-rigorous replica calculations from statistical physics (Engel & Fink, 1993; Gyorgyi, 1990; Seung et al., 1992; Sompolinsky et al., 1990). 6 In figure 11, we graph the difference of the entropy and energy curves shown in figure 3, that is, we plot $s(\epsilon) + \alpha \log(1 - \epsilon)$ for the three values of $\alpha$. This plot is simply another way of visualizing the entropy-energy competition. The zero crossings of the ....
....3.5. Large $\alpha$ asymptotics of scaled learning curves Our formalism can be used to give a classification of the large $\alpha$ asymptotics of scaled learning curves, 7 thus completing a classification program that has been suggested by several researchers (Amari et al., 1992; Schwartz et al., 1990; Seung et al., 1992). From Eq. (32) and Lemma 9, the weaker form $u(\epsilon) = (\epsilon - \epsilon_{\min})^2 / (2 v(\epsilon))$ (54) Figure 21. Phase diagram showing line of first order transitions beginning at # = 1.448 for # min (# ) 0 and ....
[Article contains additional citation context not shown here]
Seung, H.S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Physical Review, A45:6056--6091.
....the same. Probably a full RSB calculation is necessary to obtain the correct capacity. Only the learning of randomly labeled examples has been analyzed here. When the examples are drawn from a target function, the issue of generalization to examples not seen during training is of great importance [9, 18]. This issue can also be addressed with a statistical mechanics of VC entropy, as will be discussed elsewhere. Acknowledgments This work was supported by Bell Laboratories, the Deutsche Forschungsgemeinschaft and a travel grant by the Studienstiftung des deutschen Volkes. ....
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Phys. Rev., A45:6056--6091, 1992.
....different. The theoretical revisions of the VC theory mentioned above cannot explain such behavior, because they conservatively modify only with the constant factors of the same power laws. In this paper, we show that ideas from statistical mechanics (namely, the annealed approximation [20, 1, 21] and the thermodynamic limit [21]) can be used as the basis of a mathematically precise and rigorous theory of learning curves. 3 Speaking coarsely, there are two main ideas be 2 By a power law, we mean the functional form $(a/m)^b$, where $a, b > 0$ are constants. 3 Aside to the ....
....of the VC theory mentioned above cannot explain such behavior, because they conservatively modify only with the constant factors of the same power laws. In this paper, we show that ideas from statistical mechanics (namely, the annealed approximation [20, 1, 21] and the thermodynamic limit [21]) can be used as the basis of a mathematically precise and rigorous theory of learning curves. 3 Speaking coarsely, there are two main ideas be 2 By a power law, we mean the functional form $(a/m)^b$, where $a, b > 0$ are constants. 3 Aside to the statistical physicist: the annealed ....
[Article contains additional citation context not shown here]
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review, A45:6056--6091, 1992.
.... to the case of boolean inputs by Baum, Lyuu and Rivin [2, 19]. Here we are actually giving a bound on the entire learning curve, and the behavior of our bound is very similar in shape to learning curves obtained in both simulations and non-rigorous replica calculations from statistical physics [14, 28, 25, 9]. 6 In Figure 11, we graph the difference of the entropy and energy curves shown in Figure 3, that is, we plot $s(\epsilon) + \alpha \log(1 - \epsilon)$ for the three values of $\alpha$. This plot is simply another way of visualizing the entropy-energy competition. The zero crossings of the graphs in Figure 11 ....
....much weaker some of the bounds are than others. 3.5 Large $\alpha$ asymptotics of scaled learning curves Our formalism can be used to give a classification of the large $\alpha$ asymptotics of scaled learning curves, 7 thus completing a classification program that has been suggested by several researchers [24, 25, 1]. From Equation (32) and Lemma 9, the weaker form $u(\epsilon) = (\epsilon - \epsilon_{\min})^2 / (2 v(\epsilon))$ (54) is derived as a permissible energy bound in the Appendix in Section A.2. The entropy-energy competition then takes the form $s(\Delta\epsilon) = \alpha u(\Delta\epsilon) = \alpha (\Delta\epsilon)^2 / (2 v(\Delta\epsilon))$ (55) where ....
[Article contains additional citation context not shown here]
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review, A45:6056--6091, 1992.
....paper, the only training algorithm that we consider is the (zero temperature) Gibbs algorithm, which selects a weight vector at random from the version space, the set of all weight vectors that are consistent with the training set. This will enable us to use techniques from statistical mechanics [SST92]. After training 2k students on the same training set, the query by committee algorithm selects an input that is classified as positive by half of the committee, and negative by the other half. By maximizing disagreement among the committee, the information gain of the query can be made high. In ....
.... relation $\epsilon_g = \pi^{-1} \cos^{-1} q$. (40) In the large $\alpha$ limit, this leads to the inverse power law behavior $\epsilon_g(\alpha) \approx 0.625 / \alpha$. (41) The large $\alpha$ asymptotics can also be obtained by examining the scaling of the entropy with the generalization error, similar to the arguments in [SST92] using the microcanonical high T limit for a general classification of learning curves. As $q \to 1$, the first term of (36) dominates, so that the entropy has a simple logarithmic dependence on the generalization error $s \approx \log \epsilon_g$, (42) where we have used the asymptotic result ffl g ....
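The relation between the generalization error and the overlap $q$ quoted in this snippet can be checked numerically. The sketch below assumes the standard perceptron setting, which the snippet itself does not spell out: teacher and student are unit weight vectors with overlap $q$, inputs are isotropic Gaussian, and the error is the probability that the two disagree on the sign. The function names are hypothetical.

```python
import math
import random


def generalization_error(q):
    """Disagreement probability arccos(q) / pi for overlap q in [-1, 1]."""
    return math.acos(q) / math.pi


def simulated_disagreement(q, trials, rng):
    """Monte Carlo estimate of the teacher/student disagreement rate.

    Only the plane spanned by the two weight vectors matters, so the
    teacher is taken as (1, 0) and the student as (q, sqrt(1 - q^2)).
    """
    s = math.sqrt(1.0 - q * q)
    disagreements = 0
    for _ in range(trials):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        teacher_positive = x1 >= 0.0
        student_positive = q * x1 + s * x2 >= 0.0
        if teacher_positive != student_positive:
            disagreements += 1
    return disagreements / trials
```

As $q$ approaches 1 the angle between the two vectors shrinks to zero and so does the error, which is the regime the large $\alpha$ asymptotics in the snippet refer to.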
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Phys. Rev., A45:6056--6091, 1992.
No context found.
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8), 1992.
No context found.
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review, A45:6056--6091, 1992.
No context found.
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8), 1992.
No context found.
H S Seung, H Sompolinsky, and N Tishby. Statistical-mechanics of learning from examples. Phys. Rev. A, 45:6056--6091, 1992.
No context found.
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review, A45:6056--6091, 1992.
No context found.
H.S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056--6091, April 1992.
No context found.
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review, A45:6056--6091, 1992.
No context found.
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review, A45:6056--6091, 1992.
No context found.
H. S. Seung, H. Sompolinsky and N. Tishby 1992, "Statistical mechanics of learning from examples," Phys. Rev. A 45, 6056--6091.
No context found.
Seung H.S., Sompolinsky H. and Tishby N. "statistical mechanics of learning from examples". Physical Review A, 45:6056, 1992.
No context found.
H.S. Seung, H. Sompolinsky and N. Tishby (1991). `Statistical mechanics of learning from examples', preprint.
No context found.
H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45:6056--6091, 1992.