E Levin, N Tishby, and S A Solla. A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE, 78:1568--1574, 1990.
....introduces some concepts drawn from statistical physics, in particular the Gibbs distribution, which models a system in thermodynamical equilibrium at a given temperature. Such a statistical framework has been introduced into many fields. One can cite combinatorial optimization [9], machine learning [14, 10], or image processing [6]. However, relaxation methods (without memory) have prevailed in the field of combinatorial optimization. In the remainder of the paper we focus on the application of learning or adaptive algorithms to combinatorial optimization. We use the Gibbs distribution as a reference ....
....will be more likely than high cost states. In that sense optimization can be seen as a learning task. We need some sort of distance between the Gibbs distribution and the target distribution to evaluate the quality of approximation. For this purpose, we use the Kullback-Leibler (KL) divergence [10], which is well suited to exponential functions. The KL divergence between $p$ and $p_T$ is defined by $D(p, p_T) = -\sum_{x \in S} p(x) \log \frac{p_T(x)}{p(x)}$. It has, in particular, the following properties: 1. $D(p, p_T) \ge 0$; 2. $D(p, p_T) = 0$ if and only if $p = p_T$. Replacing ....
E. Levin, N. Tishby, and S.A. Solla. A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE, 78(10):1568-1574, October 1990.
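The KL divergence to a Gibbs distribution described in the excerpt above can be illustrated numerically. The following is a minimal sketch, assuming a small finite state space; the cost values, the temperature, and the function names `gibbs_distribution` and `kl_divergence` are hypothetical choices made for illustration and do not come from the cited papers.

```python
import math

def gibbs_distribution(costs, temperature):
    """Return p_T(x) proportional to exp(-cost(x) / T) over a finite state set."""
    # Subtract the minimum cost before exponentiating for numerical stability.
    c_min = min(costs)
    weights = [math.exp(-(c - c_min) / temperature) for c in costs]
    z = sum(weights)  # partition function, up to the constant shift above
    return [w / z for w in weights]

def kl_divergence(p, q):
    """D(p, q) = sum_x p(x) log(p(x) / q(x)); terms with p(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

costs = [0.0, 1.0, 2.0, 5.0]   # hypothetical costs of four states
p_T = gibbs_distribution(costs, temperature=1.0)
uniform = [0.25] * 4           # a crude candidate distribution p

print(kl_divergence(uniform, p_T))  # strictly positive, since uniform != p_T
print(kl_divergence(p_T, p_T))      # zero, matching property 2
```

Low-cost states receive the largest probability under `p_T`, and the divergence vanishes only when the candidate distribution equals the Gibbs distribution.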
....is identical to finding the maximum likelihood parameters $w_{ML}$. Thus an interpretation has been given to backpropagation's energy functions $E_D$ and $E_W$, and to the parameters $\alpha$ and $\beta$. This framework offers some partial enhancements for backprop methods: The work of Levin et al. [12] makes it possible to predict the average generalisation ability of neural networks trained on one of a defined class of problems. However, it is not clear whether this will lead to a practical technique for choosing between alternative network architectures for real data sets. Le Cun et al. ....
E. Levin, N. Tishby and S. Solla (1989). A statistical approach to learning and generalization in layered neural networks, COLT '89: 2nd workshop on computational learning theory, 245--260.
....may be used as a measure of the performance of the estimated MSE-OLC on future observations, and indicates whether or not the estimated MSE-OLC is robust. This testing strategy is straightforward and is similar to the strategy often adopted in testing the generalization ability of a (single) NN [27, 66, 76, 106]. Likewise, in the literature on combining forecasts, many advocate the use of out-of-sample testing of the combination [24, 25, 32, 42, 55]. This test may be integrated in a more comprehensive framework that attempts to correct for harmful collinearity problems arising from the data, in ....
E. Levin, N. Tishby, and S. A. Solla. A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE, 78(10):1568--1574, Oct. 1990.
....ff(w) P (w; t)g T r 2 P (w; t) The equilibrium distribution is [compare with equation (9)] $P_s(w) = \frac{1}{Z} \exp\left[-\frac{E(w)}{T}\right]$, (24) with $Z$ a normalization constant. The existence of this Gibbs distribution raises the idea to put learning in the framework of statistical mechanics [45, 57, 64]. In these studies, the Langevin equation (23) is more an excuse to arrive at the Gibbs distribution (24) than an attempt to study the dynamics of learning processes in artificial neural networks. The equilibrium distribution of the master equation for on-line learning processes is not a simple ....
E. Levin, N. Tishby, and S. Solla. A statistical approach to learning and generalization in layered neural networks. Proceedings IEEE, 78:1568--1574, 1990.
....represented by the pdf $(x, y) = \gamma(y|x)\,\beta(x)$: the conditional pdf $\gamma(y|x)$ is the probabilistic relationship between the input and output spaces defined by S, and the pdf $\beta(x)$ describes the statistical behaviour in the input space. The neural model can be described in probabilistic terms as well [1, 2, 3]: the neural network is regarded as a parametric distribution $p(x, y|\ ) = p(y|x;\ )\,\beta(x)$ approximating $(x, y) = \gamma(y|x)\,\beta(x)$. Training a neural network is usually considered as the process of minimising a cost function $D(\gamma(y|x), p(y|x;\ ))$ between the true ....
....3 Merging Prior and Data Information. On the basis of the above considerations we can now approach the problem of transferring the independent information available in both the parameters and data spaces to the parameters space. To obtain this result, following [3], we consider the parametric distribution $p(y|x;\ )$ describing the model network: if the structure of $p(x, y|\ )$ is constrained to be of the generalised Gaussian type [6], it infers on the events $(x, y|\ )$ probabilities whose values depend on the parameters vector ; a global function P (x (n) ....
[Article contains additional citation context not shown here]
S.A. Solla N. Tishby, E. Levin. A statistical approach to learning and generalization in layered neural networks. Proc. IEEE, 78, n.10:1568--1574, October 1990.
....simulations of [24] assume p i ( i ) as independent Bernoulli laws (resp. Gaussian) with average i (resp. average and variance given by i components). The discrete optimization problem in E is then transformed into a continuous minimization problem of the Kullback-Leibler (KL) divergence [101] between $p$ and the Gibbs distribution $p_T$, which charges the high fitness states of E for low temperatures $T$. It can be shown that this minimization can be achieved with a gradient in the space of the free energy of the system F ( ) [24]. Yet, as for W(t) above (9), computing the free ....
E. Levin, N. Tishby, and S.A. Solla. A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE, 78(10):1568--1574, October 1990.
....A test set, consisting of patterns previously unseen by the classifier, is then used to determine the classification performance. This ability to meaningfully respond to novel patterns, or generalize, is an important aspect of a classifier system and, in essence, the true gauge of performance [1, 2]. Given infinite training data, consistent classifiers approximate the Bayesian decision boundaries to arbitrary precision, therefore providing similar generalizations [3]. However, often only a limited portion of the pattern space is available or observable [4, 5]. Given a finite and noisy data ....
E. Levin, N. Tishby, and S. A. Solla. A statistical approach to learning and generalization in layered neural networks. Proc. IEEE, 78(10):1568--74, Oct 1990.
....are provided to illustrate the benefits and pitfalls of reducing the correlation among classifiers, especially when the training data is in limited supply. 1 Introduction. A classifier's ability to meaningfully respond to novel patterns, or generalize, is perhaps its most important property (Levin et al. 1990; Wolpert, 1990). In general, however, the generalization is not unique, and different classifiers provide different generalizations by realizing different decision boundaries (Ghosh and Tumer, 1994). For example, when classification is performed using a multilayered, feed-forward artificial neural ....
Levin, E., Tishby, N., and Solla, S. A. (1990). A statistical approach to learning and generalization in layered neural networks. Proc. IEEE, 78(10):1568--74.
....A test set, consisting of patterns not previously seen by the classifier, is then used to determine the classification performance. This ability to meaningfully respond to novel patterns, or generalize, is an important aspect of a classifier system and, in essence, the true gauge of performance [26, 48]. Given infinite training data, consistent classifiers approximate the Bayesian decision boundaries to arbitrary precision, therefore providing similar generalizations [14]. However, often only a limited portion of the pattern space is available or observable [11, 12]. Given a finite and noisy ....
E. Levin, N. Tishby, and S. A. Solla. A statistical approach to learning and generalization in layered neural networks. Proc. IEEE, 78(10):1568--74, Oct 1990.
....A test set, consisting of patterns not previously seen by the classifier, is then used to determine the classification performance. This ability to meaningfully respond to novel patterns, or generalize, is an important aspect of a classifier system and, in essence, the true gauge of performance [1, 2]. Given infinite training data, consistent classifiers approximate the Bayesian decision boundaries to arbitrary precision, therefore providing similar generalizations [3]. However, often only a limited portion of the pattern space is available or observable [4, 5]. Given a finite and noisy data ....
E. Levin, N. Tishby, and S. A. Solla. A statistical approach to learning and generalization in layered neural networks. Proc. IEEE, 78(10):1568--74, Oct 1990.
....GD88] exceptional progress has been made in recent years in applying the methods of statistical mechanics to the analysis of the process of learning from random examples, as exemplified in the learning algorithms used to train neural networks. Recent work [DSW 87], [HLW88], [BH89], [VJP89], [LTS89], [GT90], [HS90], [STS90], [OKKN90] has focused on quantifying what is known in the neural net literature as the generalization performance of learning algorithms. This is the probability that the learning algorithm will correctly predict the classification of a new random instance, after it has seen ....
.... )Z wrong m : Thus, since the Gibbs algorithm chooses its hypothesis at random according to the posterior density d m , it makes a mistake in predicting $\sigma_{m+1}$ with probability $\frac{Z^{\mathrm{wrong}}_m}{Z_m} = \frac{1}{1 - e^{-\beta}} \left(1 - \frac{Z_{m+1}}{Z_m}\right)$. (3) A similar formulation has been obtained in [LTS89] and [LW89]. The average generalization error of the Gibbs algorithm, when the target vector w is chosen at random by d( w ) and the noise sequence j m 1 is generated randomly with noise rate , but the first $m+1$ instances $x^{m+1} = (x_1, \ldots, x_{m+1})$ are fixed, is thus given ....
E. Levin, N. Tishby, and S. Solla. A statistical approach to learning and generalization in neural networks. In R. Rivest, editor, Proc. 2nd Workshop on Computational Learning Theory. Morgan Kaufmann, 1989.
.... looked at the performance of Bayes method for this task, as measured by the total number of mistakes for the classification problem, and by the total log loss (or information gain) for the regression problem. (Footnote 1: This rule is the zero temperature limit of the more general algorithm studied for example at [110, 83, 105].) Their results were given by comparing the performance of Bayes method to the performance of a hypothetical omniscient scientist who is able to use extra information about the labeling process that would not be available in the standard learning protocol. For example, if each label ....
E. Levin, N. Tishby, and S. A. Solla. A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE, 78(10):1568-1574, October 1990.
.... taken over the training set $D_m$ (in the calculations presented in this work we take $E$ to represent the number of misclassified examples [GD88]). Within the statistical physics framework, learning takes place through a modification of the probability distribution on weight space due to incoming data [LTS90]. The posterior distribution, which can easily be derived from the maximum entropy principle, gives rise to the Gibbs distribution $P(w \mid D_m; h) = \frac{e^{-\beta E(w; D_m)}}{Z_m(D_m)}$ (2) where the partition function $Z_m$ is given by $Z_m(D_m) = \int d(w)\, e^{-\beta E(w; D_m)}$ (3) We use the ....
.... posterior distribution of weights, $P(w \mid D_m)$, we find $P(y_{m+1} \mid D_m, x_{m+1}) = \int d(w)\, P(y_{m+1} \mid x_{m+1}, w)\, P(w \mid D_m) = \frac{Z_{m+1}(D_m, x_{m+1}, y_{m+1})}{z\, Z_m(D_m)}$ (4) where we have used $P(y \mid x, w) = \exp(-\beta e(y \mid x, w))/z$, with $e(y \mid w, x)$ representing the single pattern error [LTS90]. The denominator $z$ normalizes the probability so that $\sum_y P(y \mid x, w) = 1$. For the error function used in this work, $e(y \mid x, w) = \Theta(-y f_w(x))$, we have $z = 1 + e^{-\beta}$. Now, the probability distribution $P(y_{m+1} \mid D_m, x_{m+1})$ allows us to encode the random variable $y_{m+1}$ with ....
Levin E., N. Tishby and S. Solla. A statistical approach to learning and generalization in layered neural networks. Proc. IEEE, 78:1568-1574.
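The predictive probability in the excerpt above, written as a ratio of partition functions before and after adding one example, can be checked on a toy problem. The following is a minimal sketch, assuming a finite set of threshold classifiers with a uniform prior in place of the integral over weight space; the classifier, the data, and all function names are hypothetical and only illustrate the identity.

```python
import math

def classify(threshold, x):
    """Hypothetical one-parameter classifier: +1 if x >= threshold, else -1."""
    return 1 if x >= threshold else -1

def partition_function(thresholds, data, beta):
    """Z = sum over hypotheses of exp(-beta * number of misclassified examples)."""
    total = 0.0
    for t in thresholds:
        errors = sum(1 for x, y in data if classify(t, x) != y)
        total += math.exp(-beta * errors)
    return total

def predictive_probability(thresholds, data, x_new, y_new, beta):
    """P(y_new | data, x_new) = Z_{m+1} / (z * Z_m), with z = 1 + exp(-beta)."""
    z = 1.0 + math.exp(-beta)
    z_m = partition_function(thresholds, data, beta)
    z_m_plus_1 = partition_function(thresholds, data + [(x_new, y_new)], beta)
    return z_m_plus_1 / (z * z_m)

thresholds = [-1.0, -0.5, 0.0, 0.5, 1.0]    # finite hypothesis set, uniform prior
data = [(-0.8, -1), (-0.2, -1), (0.7, 1)]   # made-up training examples
beta = 2.0
p_plus = predictive_probability(thresholds, data, 0.9, 1, beta)
p_minus = predictive_probability(thresholds, data, 0.9, -1, beta)
print(p_plus, p_minus, p_plus + p_minus)
```

Because every hypothesis misclassifies exactly one of the two possible labels for the new input, the two predictive probabilities sum to one, which is the role of the normalizer z = 1 + exp(-beta).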
....$\frac{1}{2} \sum_p (y_p - f(x_p; w))^2$ is the sum squared training error and $Z_D = (2\pi/\beta)^{P/2}$. This form resembles a Gibbs distribution over student space: it also corresponds to imposing the constraint that minimisation of the training error is equivalent to maximising the likelihood of the data (Levin et al. 1989). This distribution can be realized practically by employing the Langevin training algorithm, which is simply the gradient descent algorithm with an appropriate noise term added to the weights at each update (Rognvaldsson, 1994). Furthermore, it has been shown that gradient descent, considered as ....
....networks, where often in practice only locally optimal solutions are found. 6 E AE= Z X P d P xP(x p ) Z j P d P j P(j p ) Z X dx P(x) Gamma f(x; w 0 ) Gamma f(x; w) Delta 2 (9) An alternative measure of generalisation performance is a quantity known as prediction error (Levin et al. 1989), $E_P = -\log P(y \mid x, D)$, which is derived from the probability of the network correctly predicting a data point drawn from a known probability distribution. Prediction error is closely linked to both the free energy F and the evidence. 4 Calculation of Generalisation Error. The calculation of ....
Levin, E., Tishby, N. and Solla, S.A. (1989). A statistical approach to learning and generalisation in layered neural networks. In Colt '89: 2nd Workshop on Computational Learning Theory, pages 245--260.
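The Langevin training algorithm mentioned in the excerpts above (gradient descent with a noise term added to the weights at each update) can be sketched as follows. This is a minimal illustration, assuming a single scalar weight, a toy quadratic energy, and a noise scale of sqrt(2 * learning_rate * temperature), which is one common discretization and not necessarily the form used in the cited work. All names are hypothetical.

```python
import math
import random

def langevin_train(grad, w0, learning_rate, temperature, n_steps, rng):
    """Gradient descent with Gaussian noise added to the weight at every update."""
    w = w0
    trajectory = []
    # Assumed noise scale; with temperature 0 this reduces to plain gradient descent.
    noise_scale = math.sqrt(2.0 * learning_rate * temperature)
    for _ in range(n_steps):
        w = w - learning_rate * grad(w) + noise_scale * rng.gauss(0.0, 1.0)
        trajectory.append(w)
    return trajectory

def grad_energy(w):
    # Gradient of the toy energy E(w) = 0.5 * (w - 3)^2, minimized at w = 3.
    return w - 3.0

rng = random.Random(0)
noisy = langevin_train(grad_energy, 0.0, 0.1, 0.5, 20000, rng)
noiseless = langevin_train(grad_energy, 0.0, 0.1, 0.0, 200, rng)

samples = noisy[2000:]  # discard an initial burn-in period
mean_w = sum(samples) / len(samples)
print(mean_w, noiseless[-1])
```

With noise, the weight keeps fluctuating around the energy minimum instead of settling on it, so the trajectory samples an approximately Gibbs-like distribution rather than returning a single point estimate.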
....network outputs is that it eliminates the costly search for the ideal set of network structure parameters, providing good generalization performance with an array of networks which can possibly be suboptimal individually. More detailed presentations of the generalization issue can be found in [40, 61, 98, 163, 164]. 2.3 Statistical Background. Since neural networks share many aspects of their learning abilities with statistical methods, this section reviews certain results from estimation theory [1, 26, 94, 114, 154, 159]. Furthermore, statistical estimation theory provides a solid framework for ....
E. Levin, N. Tishby, and S. A. Solla, A statistical approach to learning and generalization in layered neural networks, Proc. IEEE, 78 (1990), pp. 1568--74.
....set. For example, when training a neural network classifier, different initial weights, learning rates, momentum terms, and architectures (e.g. number of hidden layers and hidden units, connections, single vs. distributed output encoding, etc.) affect how the classifier performs on novel examples [19]. For this reason, choosing a single classifier is not optimal. Even choosing the single best classifier among several classifiers trained using the same training examples is suboptimal because potentially valuable information may be wasted. These observations lead to the idea of generating ....
E. Levin, N. Tishby, and S. A. Solla. A statistical approach to learning and generalization in layered neural networks. Proc. IEEE, 78(10):1568--74, Oct 1990.
....in addition to the input-output training examples, training a complicated neural network with many interconnected hidden neurons could be reduced to training a sequence of individual neurons. To establish probability models, Gaussian models have been established in [71], [69] based on previous work [67], [72], [104]. Through some tedious algebraic manipulations, the following algorithm has been derived [71]: • (1) p = 0 and initialize the weights at both layers ($W^{(1)}$ and $w^{(2)}$) randomly. • (2) E step: Compute the expected hidden targets for the hidden units $z_n$ as follows. $z_{j,n}$ = h ....
E. Levin, N. Tishby, and S.A. Solla. A statistical approach to learning and generalization in layered neural network. Proceedings of the IEEE, 78:1568--1574, 1990.
....may be used as a measure of the performance of the estimated MSE-OLC on future observations, and indicates whether or not the estimated MSE-OLC is robust. This testing strategy is straightforward and is similar to the strategy often adopted in testing the generalization ability of a (single) NN [15, 36, 40, 54]. Likewise, in the literature on combining forecasts, many advocate the use of out-of-sample testing of the combination [18, 30]. By construction, the estimated MSE-OLC results in the smallest MSE on K, compared to the best NN among the component NNs, and to the simple average of the corresponding ....
E. Levin, N. Tishby, and S. A. Solla. A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE, 78(10):1568--1574, Oct. 1990.
....[9, 23]. Neural networks are not magical. They do require that the set of examples used for training should come from the same (possibly unknown) distribution as the set used for testing the networks, in order to provide valid generalization and good performance on classifying unknown signals [4, 16]. Also, the number of training examples should be adequate and comparable to the number of effective parameters in the neural network, for valid results [20, 22]. In this context, it is noted that cross-validation techniques can partially counter the effects of small training set size [20]. This ....
E. Levin, N. Tishby, and S. A. Solla. A statistical approach to learning and generalization in layered neural networks. Proc. IEEE, 78(10):1568--74, Oct 1990.
.... not in its scope and results, to the Bayesian information theoretic approach, recently applied also to continuous networks [18, 17]. A SM approach to learning from examples was first proposed by Carnevali and Patarnello [19] and by Denker et al. [20] and further elaborated by Tishby et al. [21], [22]. Studies of learning a classification task in a perceptron can be found in Hansel and Sompolinsky [23] and del Giudice et al. [24] using spin glass techniques. Gardner and Derrida [25] and Gyorgyi and Tishby [26, 27] have used these methods for studying learning of a perceptron rule. Related models ....
E. Levin, N. Tishby, and S.A. Solla. A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE---Special Issue on Neural Networks, 1990.
....The theoretical revisions of the VC theory mentioned above cannot explain such behavior, because they conservatively modify only with the constant factors of the same power laws. In this paper, we show that ideas from statistical mechanics (namely, the annealed approximation (Amari et al. 1992; Levin et al. 1989; Schwartz et al. 1990; Sompolinsky et al. 1991) and the thermodynamic limit (Sompolinsky et al. 1991)) can be used as the basis of a mathematically precise and rigorous theory of learning curves. This theory will be distribution-specific, but will not attempt to force a power law form on ....
Levin, E., Tishby, N., & Solla, S. (1989). A statistical approach to learning and generalization in neural networks.
No context found.
E Levin, N Tishby, and S A Solla. A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE, 78:1568--1574, 1990.
No context found.
E. Levin, N. Tishby, and S. A. Solla. A statistical approach to learning and generalization in layered neural networks. Proc. IEEE, 78(10):1568--74, Oct 1990.
No context found.
Levin Esther, Naftali Tishby, and Sara A. Solla. A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE, pages 1568 -- 1572, October 1990.
No context found.
E. Levin, N. Tishby, and S. A. Solla. A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE, 78(10):1568--1574, October 1990.