24 citations found.
G Gyorgyi and N Tishby. Statistical theory of learning a rule. In W Theumann and R Koberle, editors, Neural Networks and Spin Glasses, pages 3--36. World Scientific, Singapore, 1990. Abbreviated version published as [26].

This paper is cited in the following contexts:
Statistical Mechanics of Neural Networks: Enhancement by.. - Dietrich

....R_c can also be expressed in the form α_c = ... (3.83). Because of ε_g = (1/π) arccos R, this shows a remarkably simple formula for the minimal generalisation error the linear perceptron student can achieve: ε_{g,c} = 1/α_c (3.84). This relation already appears in a slightly different context in [33]. Numerical solution of (3.79) and (3.80) shows that for increasing gap, more and more inputs can be stored (fig. 3.10, left). After the discussions in the preceding section, this is no longer surprising: A large gap will just blur the detailed structure of the separation surface and the ....

G. Gyorgyi and N. Tishby. Statistical theory of learning a rule. In W. K. Theumann and R. Koeberle, editors, Neural Networks and Spin Glasses. World Scientific, Singapore, 1990.


Understanding stepwise generalization of Support Vector.. - Risau-Gusman, Gordon (2000)

.... at random a student in version space, i.e. a vector w that classifies correctly the training set D, with a probability proportional to P_0(w). In the case of an isotropic pattern distribution, which corresponds to = 1 in (1), the properties of cost function (2) have been extensively studied [5]. The case of patterns drawn from two gaussian clusters in which the symmetry axis of the clusters is the same [6] and different [7] from the teacher's axis, have recently been addressed. Here we consider the problem where, instead of having a single direction along which the patterns distribution ....

....shown for comparison. The inset shows the first step of learning and its plateau (see text). R c 1 R c = 2 R u 1 R u (20), where Dt = dt e^(−t²/2)/√(2π) and H(x) = ∫_x^∞ Dt. If 2 = 1, we recover the equations corresponding to Gibbs learning of isotropic pattern distributions [5]. The order parameters are represented as a function of on figure 1, for a particular choice of n_c and . R_u grows much faster than R_c, meaning that it is easier to learn the components of the uncompressed space. As a result, R (and therefore the generalization error ε_g) presents a ....

[Article contains additional citation context not shown here]

G. Gyorgyi and N. Tishby (1990) Statistical Theory of Learning a Rule. In Neural Networks and Spin Glasses (W. K. Theumann and R. Koberle, World Scientific), 3-36.


On Weak Learning - Helmbold, Warmuth (1995)   (6 citations)

....0 ) 1 − g( 0 ; and g( 1 2 . Although the function g used by a volume prediction algorithm may be simple, computing the volume of a sample may not be computationally feasible. Here we consider three volume prediction algorithms. Algorithm Gibbs P (Gibbs Algorithm) is well known [HKS91, HO91, GT90, HS90, STS90] and can be viewed as predicting with a randomly chosen consistent concept from the class where the consistent concepts are weighted according to the prior P. Algorithm G P is a special case of the aggregating strategy introduced by Vovk [Vov90] and was used as the basis for a polynomial weak ....

G. Gyorgyi and N. Tishby. Statistical theory of learning a rule. In W. K. Theumann and R. Koeberle, editors, Neural Networks and Spin Glasses. World Scientific, 1990.


Calculation of the Learning Curve of Bayes Optimal.. - Opper, Haussler (1991)   (14 citations)

....exceptional progress has been made in recent years in applying the methods of statistical mechanics to the analysis of the process of learning from random examples, as exemplified in the learning algorithms used to train neural networks. Recent work [DSW87], [HLW88], [BH89], [VJP89], [LTS89], [GT90], [HS90], [STS90], [OKKN90] has focused on quantifying what is known in the neural net literature as the generalization performance of learning algorithms. This is the probability that the learning algorithm will correctly predict the classification of a new random instance, after it has seen a ....

....on novel instances by selecting a hypothesis, represented by couplings or synaptic weights of a neural network, that performs well on the training examples. A canonical algorithm of this type, which we call the Gibbs algorithm, was studied from a statistical mechanics perspective in [GT90, HS90, STS90] and in a more abstract setting in [LW89] (as the randomized weighted majority algorithm) and [HKS91]. For noise free training examples, the extreme zero temperature version of this algorithm simply chooses a hypothesis at random from among those that are consistent with all the ....

[Article contains additional citation context not shown here]

G. Gyorgyi and N. Tishby. Statistical theory of learning a rule. In W. K. Theumann and R. Koeberle, editors, Neural Networks and Spin Glasses. World Scientific, 1990.


Knowledge Acquisition in Statistical Learning Theory - Fine (1999)

....learning. Subsequently, Decatur and Gennaro [39] proved that if the different noise rates are known (or at least an upper bound is given) then there exist efficient PAC learning algorithms for simple classes such as monomials and k-DNF. Based on ideas from statistical mechanics, Györgyi and Tishby [61] introduced another type of attribute noise, which seems suitable to model real life situations when the training data is the result of some physical experiment for which noise may tend to be stronger in boundary areas. Restricting their attention to the problem of learning Ising perceptrons ....

....since it can examine the whole sample and then remove the most informative examples, replacing them by less useful and even misleading examples, whereas in the malicious noise model the adversary cannot choose which examples to change. Driven by a motivation similar to Györgyi and Tishby [61], this model is aimed at capturing situations where the noise may tend to be stronger in boundary areas. Another situation is the Agnostic Learning setting (cf. [76]) in which the concept class is unknown and thus the learner needs to minimize the empirical error while using hypotheses from a ....

G. Gyorgyi and N. Tishby. A statistical theory of learning a rule. In W. K. Theumann and R. Koberle, editors, Neural Networks and Spin Glasses, pages 3-36. World Scientific, Singapore, 1990.


Part 1: Overview of the Probably Approximately Correct (PAC).. - Haussler (1995)

....work needs to be done on sensitivity analysis, and on simplifying the calculations so that larger problems can be analysed. Some success in tackling the difficult computations involved in certain Bayesian approaches to learning theory has been obtained by using the tools from statistical physics [32, 121, 47, 117, 97]. This work, and the other distribution specific learning work, provides an increasingly important counterpart to PAC theory. Another variant of the PAC model designed to address these issues is the probability of mistake model explored in [57], [56] and [97]. This model is designed ....

....a notion of average case big L risk to be minimized. The former goal is known as minimax optimality, and has been used in the PAC model. The latter is the Bayesian notion of optimality [20, 67] and has been used in several approaches to learning in neural nets based on statistical mechanics [32, 121, 47, 117, 97, 98]. Unfortunately this last question has no clear cut answer, and leads us directly into a longstanding unresolved debate in statistics (see e.g. [74] and following discussion). Since we have set out to generalize the PAC model, and since our results are best illustrated in the minimax setting, we ....

G. Gyorgyi and N. Tishby. Statistical theory of learning a rule. In W. K. Theumann and R. Koeberle, editors, Neural Networks and Spin Glasses. World Scientific, 1990.


Bounds on the Sample Complexity of Bayesian Learning.. - Haussler, Kearns.. (1992)   (69 citations)

.... (VC) dimension [34,4]. In contrast, the average case sample complexity of learning in neural networks has recently been investigated from a standpoint that is essentially Bayesian, and is strongly influenced by ideas and tools from statistical physics, as well as by information theory [10,31,15,29,24]. While each of these theories has its own distinct strengths and drawbacks, there is little understanding of what relationships hold between them. In this paper, we study an average case or Bayesian model of learning with two primary goals. First, we are interested in ultimately developing a ....

....predicts that f(xm 1 ) f (xm 1 ). Thus, the Gibbs algorithm simply chooses a hypothesis randomly (according to P) from F among those that are consistent with the labels seen so far. The Gibbs algorithm is the zero temperature limit of the learning algorithm studied in several recent papers [10,31,15,29]. It is important to note that both the Bayes and Gibbs algorithms are quite different from the well known maximum a posteriori algorithm, which chooses the hypothesis f that maximizes the posterior probability P_m[f]. While this algorithm maximizes the probability of exactly identifying the ....

[Article contains additional citation context not shown here]

G. Gyorgyi and N. Tishby. Statistical theory of learning a rule. In W. K. Theumann and R. Koeberle, editors, Neural Networks and Spin Glasses. World Scientific, 1990.


On-Line Learning In The Committee Machine - Copelli, Caticha (1995)   (1 citation)

....simulations. Section 5 contains some concluding remarks. 2 The Generalization Error and the Learning Dynamics. Since we are interested in the problem of generalization or rule extraction within the framework of supervised learning, we build a learning set with the help of a teacher network [2], [3], which for simplicity we take to have the same architecture as the student net. The K nonoverlapping committee we deal with is a set of K independent boolean perceptrons or branches with N/K input units each. The notation we use is such that every N dimensional vector V can be thought of as K ....

Gyorgyi, G., Tishby, N., "Statistical theory of learning a rule" in "Neural Networks and spin glasses", Theumann, W., Koberle, R., Eds. World Scientific, 1989


Efficient Adaptive Learning For Classification Tasks With Binary.. - Moreno (1998)   (1 citation)

....minimum of (4) have been studied theoretically with methods of statistical mechanics (Gordon & Grempel, 1995). It was shown that in the limit T → 0, the minimum of E corresponds to the weights that minimize the number of training errors. If the training set is LS, these weights are not unique (Gyorgyi & Tishby, 1990). In that case, there is an optimal learning temperature such that the weights minimizing E at that temperature endow the perceptron with a generalization error numerically indistinguishable from the optimal (bayesian) value. The algorithm Minimerror (Gordon & Berchier, 1993; Raffin & Gordon, 1995) ....

Gyorgyi, G., & Tishby, N. 1990. Statistical theory of learning a rule. In: Theumann, W.K., & Koeberle, R. (eds), Neural networks and spin glasses. Singapore: World Scientific.


Learning From Queries for Maximum Information Gain in.. - Peter Sollich (1995)   (12 citations)

.... function of α = p/N, the number of training examples per weight, which we denote simply by s(α). The calculation can then be split into two parts: First, the function s(α) is obtained from a calculation of the teacher space entropy using the replica method, generalizing the results of Gyorgyi and Tishby (1990). The average generalization .... Footnote 1: More precisely, what is minimized is the value of the entropy after a new training example (x, y) is added, averaged over the distribution of the unknown new training output y given the new training input x and the existing training set; see Sollich (1994) ....

....the effects of (approximate) MTSE queries in teacher space. For large α values, the teacher space entropy decreases linearly with α, with gradient c ≈ 0.44, whereas the entropy for random examples, also shown for comparison, decreases much more slowly (asymptotically like −ln α, see (Gyorgyi and Tishby, 1990)). The linear α dependence of the entropy for queries corresponds to an average reduction of the version space volume with each new training example by a factor of exp(−c) ≈ 0.64, which is reasonably close to the factor 1/2 for proper bisection of the version space. This justifies our ....

[Article contains additional citation context not shown here]

G Gyorgyi and N Tishby (1990). Statistical theory of learning a rule. In W Theumann and R Koberle, editors, Neural Networks and Spin Glasses, pages 3--36. Singapore, World Scientific.


Learning boolean functions safe from local minima with a.. - Raffin, Virot (1995)

....of the simulations. We consider a Perceptron with N input neurons, all connected to a single output neuron with the weight vector w. Neuron states belong to {−1, 1}. By this choice of neuron states, we can directly use theoretical results about learning and generalizing with a Perceptron [7, 11]. We use such a Perceptron to learn sets of P patterns with KLR. For each learning set Gamma, the input patterns are randomly chosen. The expected outputs are either randomly generated or with a Reference Perceptron. A Reference Perceptron is used to build linearly separable learning sets. It is ....

....N = 50 and N = 100. We turn now to analyze the quality of the extrapolation that KLR provides. Let us introduce the generalization error ε_g that depends on α = P/N. It is the probability that a Perceptron gives the wrong output to a pattern not belonging to the learning set. Gyorgyi et al. [11] computed ε_g(α) in the thermodynamic limit, N → ∞, for the Perceptrons that perform an error free solution on linearly separable learning sets. We also find in [17] the generalization error of Bayes algorithm, which gives the lowest bound to the generalization error of a Perceptron learning ....

G. Gyorgyi and N. Tishby. Statistical theory of learning a rule. In Neural Networks and Spin Glasses, pages 3--36, 1990.


How Well do Bayes Methods Work for On-Line Prediction of.. - Haussler, Barron (1992)

....and regression problems on the outcome space Y = {±1}. Many of our techniques should generalize easily to other kinds of outcome spaces, as well as other decision spaces and loss functions. Some of these techniques may also help in analyzing other learning methods, such as the Gibbs method [GT90, HKS91, OH91b, OH91a, SST92]. However, a major problem remaining is to develop equally simple and general techniques to obtain lower bounds on the risk, so that we can see how tight these upper bounds are. Very tight upper and lower bounds on the risk of Bayes methods under log loss are available for the case when Θ is ....

G. Gyorgyi and N. Tishby. Statistical theory of learning a rule. In W. K. Theumann and R. Koeberle, editors, Neural Networks and Spin Glasses. World Scientific, 1990.


Query by Committee - Seung, Opper, Sompolinsky (1992)   (79 citations)

....of ε_g rather than log ε_g. The coefficient of P in the resulting exponential is given by log⟨ ⟩ rather than ⟨log ⟩. 4.1 Random inputs. When all inputs are chosen at random from the distribution (34), the replica method can be used to calculate the entropy of the posterior distribution [GT90]. The calculation is exact in the thermodynamic limit, where P, N → ∞ with α = P/N constant. The entropy per weight s ≡ S/N is then s(α) = (1/2) log(1 − q) + (1/2) q + 2α ∫ Dx H(γx) log H(γx) (36), where γ ≡ √(q/(1 − q)) (37), Dx ≡ dx e^(−x²/2)/√(2π) (38), H(y) ≡ ....

G. Gyorgyi and N. Tishby. Statistical theory of learning a rule. In W. K. Theumann and R. Koberle, editors, Neural Networks and Spin Glasses, pages 3--36, Singapore, 1990. World Scientific.


On the equivalence of Two Layered Perceptrons with Binary Neurons - Marcelo Blatt

....of threshold logic [Dertouzos, 1964; Lewis II and C.L. Coates, 1967]. In the case of a finite number of inputs N, there is no simple expression for ε_g. Explicit classes of equivalence up to N = 6 are presented by Dertouzos (1964). In the large N limit, i.e. N ≫ 1, the well known result (Gyorgyi and Tishby 1990, Opper et al. 1990, Seung et al. 1992) for the generalization error between two irreducible perceptrons whose weights are W_1 and W_2 (with norm √N) is: ε_g(W_1, W_2) = (1/π) arccos(W_1 · W_2 / N) + O(1/N). In this limit the generalization error ....

G. Gyorgyi and N. Tishby (1990), "Statistical Theory of Learning a Rule", in Neural Networks and Spin Glasses, W.K. Theumann and R. Koeberle editors (Singapore: World Scientific).


Bounds on the Sample Complexity of Bayesian Learning Using.. - Haussler (1994)   (69 citations)

.... (VC) dimension [34, 4]. In contrast, the average case sample complexity of learning in neural networks has recently been investigated from a standpoint that is essentially Bayesian, and is strongly influenced by ideas and tools from statistical physics, as well as by information theory [10, 31, 15, 29, 24]. While each of these theories has its own distinct strengths and drawbacks, there is little understanding of what relationships hold between them. In this paper, we study an average case or Bayesian model of learning with two primary goals. First, we are interested in ultimately developing a ....

....predicts that f(xm 1 ) f(xm 1 ). Thus, the Gibbs algorithm simply chooses a hypothesis randomly (according to P) from F among those that are consistent with the labels seen so far. The Gibbs algorithm is the zero-temperature limit of the learning algorithm studied in several recent papers [10, 31, 15, 29]. It is important to note that both the Bayes and Gibbs algorithms are quite different from the well known maximum a posteriori algorithm, which chooses the hypothesis f that maximizes the posterior probability P_m[f]. While this algorithm maximizes the probability of exactly identifying the ....

[Article contains additional citation context not shown here]

G. Gyorgyi and N. Tishby. Statistical theory of learning a rule. In W. K. Theumann and R. Koeberle, editors, Neural Networks and Spin Glasses. World Scientific, 1990.


A Learning Rule Safe From Local Minima for a Generalized.. - Raffin, Virot (1996)

....sets are learned. KLR is tested on a Perceptron with N real input neurons, all connected to one binary output neuron. Each learning set is composed of P random input patterns with uniform probabilities of apparition. The expected outputs are either randomly generated or with a Reference Perceptron [11]. A Reference Perceptron, with random weights, constructs linearly separable learning sets. Each learning algorithm consists of alternating between minimizing K 0 by conjugate gradient [17] and increasing Diam 2 by 200 with an initial diameter Diam 2 = 1. Note that for the sake of clarity all ....

....the diameter (Figure 1). We also learn sets with random outputs for N = 100 and P = 300. These values ensure that the probability of being separable is almost zero [7]. Their non separability is effectively detected by KLR (Figure 1). We turn now to the analysis of the generalization error ε_g [11]. It is the probability that a Perceptron gives the wrong output to a pattern not belonging to the learning set. In the thermodynamic limit, N → ∞, Opper and Haussler have computed the generalization error of Bayes algorithm [16] versus α (α = P/N). It gives the lowest bound to the ....

[Article contains additional citation context not shown here]

G. Gyorgyi and N. Tishby. Statistical theory of learning a rule. In Neural Networks and Spin Glasses, pages 3--36, 1990.


Statistical Mechanics of Learning From Examples II.. - Seung, Sompolinsky..   Self-citation (Tishby)

....In this Part we present the replica theory of problem. Application of the replica method to study learning a classification task in a single layer network was suggested by Gardner and Derrida [8] and further pursued by del Giudice et al. [9], Hansel and Sompolinsky [10] and Gyorgyi and Tishby [11, 12]. Other studies, directly related to the present work, have recently been published [13-18]. In the present work we study the replica theory of learning both realizable and unrealizable rules. In the case of realizable rules, there exists a set of synaptic weights that make the trained network ....

....by numerical simulations of the model at T = 0. We have found that the system converges rapidly to R = 1 from almost all initial conditions for α 1.0 (5.36). C. Boolean Output with Continuous Weights. The boolean perceptron with continuous weights has been previously studied in detail [11]. We present below a few of the results for completeness. Since the a priori measure d(W) is the same as in the linear continuous model of Section V.C above, G_0 is again given by (5.5). For a boolean output, G_r equals G_r = −2 ∫_0^∞ Dy ∫_{−∞}^{∞} Dt ln[ e^(−β) (1 − e ....

[Article contains additional citation context not shown here]

G. Gyorgyi and N. Tishby. Statistical theory of learning a rule. In W.K. Theumann and R. Koberle, editors, Neural Networks and Spin Glasses, pages 3--36, 1990.


Statistical Mechanics of Learning From Examples - I.. - Seung, Sompolinsky..   Self-citation (Tishby)

....Patarnello [19] and by Denker et al. [20] and further elaborated by Tishby et al. [21], [22]. Studies of learning a classification task in a perceptron can be found in Hansel and Sompolinsky [23] and del Giudice et al. [24] using spin glass techniques. Gardner and Derrida [25] and Gyorgyi and Tishby [26, 27] have used these methods for studying learning of a perceptron rule. Related models have been studied in Refs. [28], [29]. However the extent of applicability of results gained from these specific toy models to more general circumstances has remained unknown. Recently an interesting attempt to ....

....rules occur in two basic situations. In the first, the data available for training are corrupted with noise, making it impossible for the network to reproduce the data exactly, even with a large training set. This case has been considered by several authors. Particularly relevant works are Refs. [26] and [39] which show that even with noisy data the underlying target rule itself can be reproduced exactly in the limit. We will not address this case explicitly. A second situation, which we do consider, is when the network architecture is restricted in a manner that does not allow an exact ....

[Article contains additional citation context not shown here]

G. Gyorgyi and N. Tishby. Statistical theory of learning a rule. In W.K. Theumann and R. Koberle, editors, Neural Networks and Spin Glasses, pages 3--36, 1990.


Rigorous Learning Curve Bounds from Statistical Mechanics - Haussler, Kearns, Seung.. (1996)   (40 citations)  Self-citation (Tishby)

No context found.

Gyorgyi, G., & Tishby, N. (1990). Statistical theory of learning a rule. In W. K. Theumann & R. Koeberle (Eds.), Neural Networks and Spin Glasses, World Scientific.


Rigorous Learning Curve Bounds from Statistical Mechanics - Haussler (1996)   (40 citations)  Self-citation (Tishby)

....example ⟨x, y⟩ from D_N, y = f(x; sgn(w · (x ) (46). The distribution of inputs x is Gaussian, with unit variance on each component. The distribution of noise is also Gaussian, with variance γ² − 1 on each component. A similar problem was examined by Gyorgyi and Tishby [15]. In this case, one can show that ε_gen(w) = (1/π) cos⁻¹(R/γ) (47), ε_min(γ) = ε_gen(w*) = (1/π) cos⁻¹(1/γ) (48), ε_gen(w, w*) = (1/π) cos⁻¹ R (49), where R = w · w*/N. The entropy function takes the form s_γ(ε) H( 1 − cos ε / cos ....

G. Gyorgyi and N. Tishby. Statistical theory of learning a rule. In W. K. Theumann and R. Koeberle, editors, Neural Networks and Spin Glasses. World Scientific, 1990.


Learning Unrealizable Tasks From Minimum Entropy - Sollich (1995)

No context found.

G Gyorgyi and N Tishby. Statistical theory of learning a rule. In W Theumann and R Koberle, editors, Neural Networks and Spin Glasses, pages 3--36. World Scientific, Singapore, 1990. Abbreviated version published as [26].


A Learning Rule Safe From Local Minima for a Generalized.. - Raffin, Virot

No context found.

G. Gyorgyi and N. Tishby. Statistical theory of learning a rule. In Neural Networks and Spin Glasses, pages 3--36, 1990.


On the Stochastic Complexity of Learning Realizable and.. - Meir, Merhav (1995)   (6 citations)

No context found.

Gyorgyi, G. & Tishby N. (1990). Statistical theory of learning a rule, in W.K. Theumann and R. Koberle, editors, Neural Networks and Spin Glasses.


Annealed Theories of Learning - Seung (1995)   (8 citations)

No context found.

G. Gyorgyi and N. Tishby. Statistical theory of learning a rule. In W. K. Theumann and R. Koberle, editors, Neural Networks and Spin Glasses, pages 3--36, Singapore, 1990. World Scientific.

CiteSeer.IST - Copyright Penn State and NEC