N. Tishby and E. Levin. Consistent inference of probabilities in layered networks: Predictions and generalization. In International Joint Conference on Neural Networks, volume 2, pages 403--409, 1989.
....this probability to grow as f(x; w) learns from B. Because this localization theory is targeted at applications such as pattern recognition where the number of pairs in A is finite, the concept of generalization has been extended for approximators dealing with an infinite number of elements in A [45]. 2.5 Controlling a Network's Degrees of Freedom by Controlling its Size In general, the number of adjustable weights corresponds to a network's degrees of freedom. The more weights in a network, the larger the class of functions it can implement. Degrees of freedom can depict the ....
N. Tishby and E. Levin. Consistent inference of probabilities in layered networks: Predictions and generalization. In International Joint Conference on Neural Networks, volume 2, pages 403--409, 1989.
....although simple parametric methods (e.g. Gaussian approximations in the large ) are likely to fail. 4 However, such a probabilistic recognition framework can be cast in terms of powerful connectionist learning and classification algorithms in a rigorous and structured manner [12] [54] [80], leading to improved classifier performance. The classifier used for the experiments reported here has been implemented with a connectionist architecture, trained and selected using the Bayesian methods for classification networks proposed by MacKay [55] our experimental results give indications ....
N. Tishby, E. Levin, and S. Solla. Consistent inference of probabilities in layered networks. In Proc. Internat. Joint Conf. on Neural Networks, Washington, 1989.
....trade off between error fitting and complexity reduction. This fitness function has an elegant probabilistic interpretation for the learning process: according to the Bayesian framework, minimizing F is identical to finding the most probable network with architecture A and weights W (Sorkin 1983; Tishby et al. 1989). To see this, let us define the following. Let D be the data set for the function $\gamma : X \to Y$, i.e. $D = \{(x_i, y_i) \mid x_i \in X,\ y_i \in Y,\ y_i = \gamma(x_i),\ i = 1, \ldots, N\}$. Then a model M of the function $\gamma$ is an assignment to each possible pair (x, y) of a number $P(y \mid x)$ representing the ....
N. Tishby, E. Levin, and S. A. Solla (1989). Consistent inference of probabilities in layered networks: predictions and generalization. Proc. Int. Joint Conf. Neural Networks, Vol. II, 403--409. IEEE.
....of weight space that is consistent with the training set, shrinks with each informative sample. In our case of vanishing final training error and if we assume the prior distribution on the weight space to be flat, then the posterior distribution is uniform on the version space and vanishes outside [48]. This means that the distribution $P_m(w)$ on the weight space can be related to the volume $V_m$ of the version space $W_m$: $P_m(w) = 1/V_m$ if $w \in W_m$, 0 otherwise (4). In terms of the volume of version space, the information gain from the answer to the query $x_{m+1}$ is then, cf. Eq. (2), $I_{m+1} =$ ....
N. Tishby, E. Levin, and S. Solla. Consistent inference of probabilities in layered networks: Predictions and generalization. In Proceedings of the International Joint Conference on Neural Networks, volume 2, pages 403-409, New York, June 1989. IEEE Press.
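The version-space picture in the excerpt above can be illustrated on a toy problem. The sketch below is only an illustration under stated assumptions: the weight space is replaced by a finite set of integer thresholds (a hypothetical stand-in), the "volume" is the number of surviving thresholds, and the information gain is taken to be the negative log of the volume ratio, which the truncated excerpt does not actually show.

```python
import math


def predict(threshold, x):
    """Toy classifier (assumption): label 1 when x reaches the threshold."""
    return 1 if x >= threshold else 0


def version_space(thresholds, examples):
    """Keep only the thresholds that label every training example correctly."""
    return [t for t in thresholds
            if all(predict(t, x) == y for x, y in examples)]


def uniform_posterior(thresholds, examples):
    """Flat prior + zero training error: uniform on the version space, 0 outside."""
    survivors = version_space(thresholds, examples)
    if not survivors:
        raise ValueError("no hypothesis is consistent with the examples")
    p = 1.0 / len(survivors)
    return {t: (p if t in survivors else 0.0) for t in thresholds}


def information_gain(thresholds, examples, new_example):
    """Assumed form: -log(V_after / V_before), with counts standing in for volumes."""
    v_before = len(version_space(thresholds, examples))
    v_after = len(version_space(thresholds, examples + [new_example]))
    if v_before == 0 or v_after == 0:
        raise ValueError("version space is empty")
    return -math.log(v_after / v_before)


if __name__ == "__main__":
    thresholds = list(range(11))
    seen = [(7, 1)]
    print(uniform_posterior(thresholds, seen))
    print(information_gain(thresholds, seen, (2, 0)))
```

A query whose answer leaves the version space unchanged yields zero gain under this definition, which matches the excerpt's remark that the space shrinks only with informative samples.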
....we should also mention approaches to the problem that have originated from the field of statistical mechanics. When viewed as collections of simple interacting units, neural networks closely resemble large scale atomic physical systems, and their dynamics can, indeed, be studied in a similar way (Tishby et al. 1989). Drawing on (Gardner and Derrida, 1989) who used mean field analysis to study optimal storage capacity of Hopfield networks, a calculation of the generalization ability as a function of the size of the randomly chosen training set, this independently of the learning algorithm used, was possible ....
....network NETtalk (Sejnowski and Rosenberg, 1987) for the same task. 2.4.4 Empirical studies Apart from experimental studies confirming theoretical results similar to those obtained by Schwartz et al. and reported in (Denker et al. 1987) (Schwartz et al. 1990) (Samalam and Schwartz, 1989) (Tishby et al. 1989), little empirical research has been done on generalization. 25 Ahmad (1988) as mentioned earlier, has studied how the perceptron learns the majority function and has derived a number of results. He has shown that in his experiments, the fraction of misclassified instances decreases ....
N. Tishby, E. Levin, and S. Solla. Consistent inference of probabilities in layered networks: prediction and generalization. In Proceedings of the International Joint Conference on Neural Networks, Washington, D.C, pages 403--409. New York: IEEE, 1989.
.... of the non parametric regression models that connectionist architectures are able to learn seems particularly attractive, as is the rigorous yet simple way in which the above outlined probabilistic recognition framework can be cast in terms of connectionist training and recognition algorithms [7, 14, 24]. Connectionist architectures allow one to naturally represent in their inputs the uncertainty in shape reconstruction, by regarding shape parameters as sampled from a prior probability distribution depending on the estimated uncertainty, while the sufficiency of the shape recovery data from the ....
N. Tishby, E. Levin, and S. Solla. Consistent inference of probabilities in layered networks. In Proc. International Joint Conference on Neural Networks, Washington, 1989.
....h agreeing with q for which T(h) 0. This last requirement ensures that the generalizer is defined for all q. The Gibbs generalizer can be viewed as a zero temperature limit of the scenarios analyzed in the statistical mechanics supervised learning framework (see [Seung et al. 1991a, 1991b, Tishby et al. 1989]). III) The generalization error function is a mapping from $(f, h, q_X, q_Y)$ to R. It measures how good h is as a guess for f. One rather popular choice is the i.i.d. error function: $Er(f, h, q) = \sum_{x \in X} p(x)\,[1 - \delta(f(x), h(x))]$, where $p(\cdot)$ is the same distribution used to define the ....
Tishby N., et al. (1989). Consistent inference of probabilities in layered networks: predictions and generalization. In International Joint Conference on Neural Networks, Vol. II, 403-409, IEEE, New York.
.... looked at the performance of Bayes method for this task, as measured by the total number of mistakes for the classification problem, and by the total log loss (or information gain) for the regression 1 This rule is the zero temperature limit of the more general algorithm studied for example at [110, 83, 105] 48 problem. Their results were given by comparing the performance of Bayes method to the performance of a hypothetical omniscient scientist who is able to use extra information about the labeling process that would not be available in the standard learning protocol. For example, if each label ....
N. Tishby, E. Levin, and S. Solla. Consistent inference of probabilities in layered networks: Predictions and generalizations. In Proceedings of the International Joint Conference on Neural Networks, volume 2, pages 403-409. IEEE, New York, 1989.
.... has suggested a general argument regarding the uniqueness of the above mapping given certain restrictions on the parametric form of V(X; W). The probability distribution constructions in (13) and (14) can also be motivated using a Markov random field framework (Besag, 1974; Marroquin, 1985; Tishby, Levin, Solla, 1989). To illustrate how p can be constructed, substitution of (11) into (14) with X = (R; S) and the assumption that the sample space is a real vector space with dimensionality equal to the dimension of R yields the following belief function for a multi layer neural network: $p(R \mid S; W)$ ....
Tishby, N., Levin, E., & Solla, S. (1989). Consistent inference of probabilities in layered networks: predictions and generalization. In Proceedings of the International Joint Conference on Neural Networks (Vol. 2, pp. 403--409). IEEE Press.
....mandatory, although simple parametric methods (e.g. Gaussian approximations in the large ) are likely to fail. 3 However, such a probabilistic recognition framework can be cast in terms of more powerful connectionist learning and classification algorithms in a rigorous and structured manner [3, 7, 14], leading to improved classifier performance. The classifier used for the experiments reported in this paper has been implemented with a connectionist architecture, trained and selected using the Bayesian methods for classification networks proposed by MacKay [8] our experimental results give ....
N. Tishby, E. Levin, and S. Solla. Consistent inference of probabilities in layered networks. In Proc. Internat. Joint Conf. on Neural Networks, Washington, 1989.
....work needs to be done on sensitivity analysis, and on simplifying the calculations so that larger problems can be analysed. Some success in tackling the difficult computations involved in certain Bayesian approaches to learning theory has been obtained by using the tools from statistical physics [32, 121, 47, 117, 97]. This work, and the other distribution specific learning work, provides an increasingly important counterpart to PAC theory 1 . Another variant of the PAC model designed to address these issues is the probability of mistake model explored in [57], [56] and [97]. This model is designed ....
....a notion of average case big L risk to be minimized. The former goal is known as minimax optimality, and has been used in the PAC model. The latter is the Bayesian notion of optimality [20, 67] and has been used in several approaches to learning in neural nets based on statistical mechanics [32, 121, 47, 117, 97, 98]. Unfortunately this last question has no clear cut answer, and leads us directly into a longstanding unresolved debate in statistics (see e.g. [74] and following discussion). Since we have set out to generalize the PAC model, and since our results are best illustrated in the minimax setting, we ....
N. Tishby, E. Levin, and S. Solla. Consistent inference of probabilities in layered networks: predictions and generalizations. In IJCNN International Joint Conference on Neural Networks, volume II, pages 403--409. IEEE, 1989.
....of networks, each with the same architecture but different parameter values. If you wish, imagine one network (one element of the ensemble) located at (and named for) each point W in parameter space. To each of these networks (after looking at m training points) we assign a number $\rho_m(W)$ (Tishby, Levin and Solla, 1989). We use this number as follows: when it comes time to test the performance of our learning system, we choose a network from the ensemble according to the probability density 1 $P_m \equiv P_m(W) = \rho_m(W) / \int \rho_m(W')\,dW'$ (2). We average over this probability to get the risk ($\equiv$ expected ....
....Notation: we write $A \equiv B$ to indicate that A and B are synonymous; we write $C := D$ to indicate that C is hereby defined to be equal to D. and testing) data to be independent and identically distributed (iid). Since the loss is additive and probabilities should be multiplicative, we are motivated (Tishby, Levin and Solla, 1989) to choose $\rho_m(W) = \rho_0(W) \exp[-\beta E_m(W)]$ (4), where $\beta \equiv 1/T$ is a measure of our confidence in the training data; T measures our tolerance of error. Note that in the limit $T \to 0$ the exponential is strongly peaked around the minimum loss (maximum likelihood) point(s) in weight space ....
Tishby, N., Levin, E., and Solla, S. A. (1989). Consistent Inference of Probabilities in Layered Networks: Predictions and Generalization. In Proceedings of the International Joint Conference on Neural Networks, Washington DC.
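The exponential weighting of the ensemble quoted in the two excerpts above can be sketched numerically. This is a minimal illustration, assuming a finite list of candidate weight settings so that the normalizing integral becomes a sum; the function name `gibbs_posterior` and the example losses are hypothetical, not taken from the cited papers.

```python
import math


def gibbs_posterior(prior, losses, temperature):
    """Normalized weights proportional to prior * exp(-loss / temperature).

    Assumptions: a finite set of candidate networks (sum instead of integral),
    temperature > 0, and a strictly positive prior.
    """
    beta = 1.0 / temperature
    # Shifting by the minimum loss avoids underflow and cancels on normalization.
    min_loss = min(losses)
    unnormalized = [p * math.exp(-beta * (e - min_loss))
                    for p, e in zip(prior, losses)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


if __name__ == "__main__":
    prior = [0.25, 0.25, 0.25, 0.25]
    losses = [0.0, 1.0, 2.0, 3.0]  # training loss of each candidate network
    for temperature in (100.0, 1.0, 0.01):
        print(temperature, gibbs_posterior(prior, losses, temperature))
```

At high temperature the weights stay close to the prior, while at low temperature nearly all of the mass moves to the minimum-loss candidate, which is the peaking behaviour the excerpt describes for the small-T limit.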
....based upon the foundation of mathematical programming. Mathematical models of generalization [Vapnik, 1995, Wolpert, 1995] have grown from several diverse fields: for instance, statistics [Geman et al. 1992, Wahba, 1990] computational complexity [Valiant, 1984] and statistical physics [Tishby et al. 1989]. We believe that the methods and tools of mathematical optimization are an effective way to explore the theory of generalization, which in turn will lead to more effective applications of supervised learning. To illustrate the idea of tolerant training, Figure 1 visually displays its effect on a ....
Tishby, N., Solla, S., and Levin, E. (1989). Consistent inference of probabilities in layered networks: Predictions and generalization. In IJCNN International Joint Conference on Neural Networks, volume II, pages 403--409, New York. IEEE.
....each possible pair (x, y) of a number $P(y \mid x)$ representing the hypothetical probability of y given x. That is, a network with specified architecture A and weights W is viewed as a model $M = \{A, W\}$ predicting the outputs y as a function of input x in accordance with the probability distribution [35]: $P(y \mid x; W, A) = \exp(-\beta E(y \mid x; W, A)) / Z(\beta)$ (10), where $\beta$ is a positive constant which determines the sensitivity of the probability to the error value and $Z(\beta) = \int \exp(-\beta E(y \mid x; W, A))\,dy$ (11) is a normalizing constant. Under the assumption of ....
N. Tishby, E. Levin, and S. A. Solla, "Consistent Inference of Probabilities in Layered Networks: Predictions and Generalization," in Proceedings of the International Joint Conference on Neural Networks (IJCNN89) , Vol. II, 403--409 (IEEE, 1989).
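The excerpt above breaks off just as it states an assumption. One plausible continuation, offered only as a sketch, is the standard step from the per-example distribution (10) to the likelihood of a whole data set. It assumes that the N pairs are independent given W and A, and that Z depends on beta alone, as the excerpt's notation suggests.

```latex
% Assumptions: the N pairs (x_i, y_i) are independent given (W, A),
% and Z(\beta) does not depend on x, W or A.
\begin{align*}
P(D \mid W, A) &= \prod_{i=1}^{N} P(y_i \mid x_i; W, A)
  = \frac{\exp\!\big(-\beta \sum_{i=1}^{N} E(y_i \mid x_i; W, A)\big)}{Z(\beta)^{N}}, \\
-\log P(D \mid W, A) &= \beta \sum_{i=1}^{N} E(y_i \mid x_i; W, A) + N \log Z(\beta).
\end{align*}
```

Under these assumptions and for fixed beta, the second term is constant in W, so maximizing the likelihood over W amounts to minimizing the summed error.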
....for SE in Figure 1 suggests a growth that is consistent with the bounds with m 1= 2 . The diagram for CR rather suggests a functional dependence of the form m log 1= that has previously already been theoretically predicted and experimentally verified for some artificial datasets ([TLS89], [SSSD90], [CT92], [HKST94]). Altogether we believe that the possibility to compute with the help of our new algorithms T1 and T2 optimal hypothesis $H \in \mathcal{H}$ for arbitrary (even larger) datasets opens a new chapter in the experimental investigation of learning curves for real world data. 5 ....
N. Tishby, E. Levin, S. A. Solla, Consistent inference of probabilities in layered networks: Predictions and generalizations, Proc. of IJCNN 1989, Vol. II, 403 - 409.
.... (VC) dimension [34,4] In contrast, the average case sample complexity of learning in neural networks has recently been investigated from a standpoint that is essentially Bayesian 1 , and is strongly influenced by ideas and tools from statistical physics, as well as by information theory [10,31,15,29,24]. While each of these theories has its own distinct strengths and drawbacks, there is little understanding of what relationships hold between them. In this paper, we study an average case or Bayesian model of learning with two primary goals. First, we are interested in ultimately developing a ....
....algorithms in terms of an important random variable known as the volume ratio. In Section 5 we prove that the probabilities of mistake for our two learning algorithms can be bounded above and below by simple functions of the expected information gain. As in the paper of Tishby, Levin and Solla [31], we upper bound the probability of mistake by the information gain. We also provide an information theoretic lower bound on the probability of mistake, which can be viewed as a special case of Fano's inequality [9,14]. Together these bounds provide a general characterization of learning curve ....
N. Tishby, E. Levin, and S. Solla. Consistent inference of probabilities in layered networks: predictions and generalizations. In IJCNN International Joint Conference on Neural Networks, volume II, pages 403--409. IEEE, 1989.
....no two pairs in q with the same input values x i but different output values y i . See [11] for more details. 48 2. One example of such a sleight of hand is confusing the prior distribution of input output functions in the real world with the prior distribution of feedforward neural nets (as in [17] and any attempts to apply studies like [18] to the real world). Another sleight of hand is limiting the space of allowed target functions in some reasonable way. Yet another is allowing questions to run over the training set. After all, the only non trivial issue in the noise free case, the ....
Tishby, N., Levin, E., and Solla, S. (1989). Consistent inference of probabilities in layered networks: predictions and generalization. Proceedings of the international Joint Conference on Neural Networks, Washington, D.C., pp. II 403-409, IEEE.
....a situation there is an a priori measure 0 on Theta, that is reasonable to choose non atomic and having support on the whole space Theta. Supervised learning will be interpreted as a modification of this measure 0 in such a way that it will become concentrated on smaller and smaller sets [101]. To be more specific, we stick to the deterministic multilayered perceptron with binary neurones and constant transfer function f = sgn. For a given realisation , we denote, as usual, by F the mapping implemented by the network. We identify in the sequel any mapping g : X 0 XL with its ....
....reasons, this (thermodynamic) formalism allows a thorough understanding of the learning procedure in terms of information theory. To be more specific and to avoid unnecessary complications, assume that all the neurones are binary, all layers are finite, and, moreover, the parameter set is discrete [95, 101]. Now the finiteness of the sensor and motor layers together with the binary nature of the neurones implies that the set of all possible mappings M = ff : X 0 XL g is discrete and finite. The a priori measure 0 on Theta induces a measure on M, denoted by the same symbol, by 0 (f) Z 0 ....
N Tishby, E Levin, S Solla, Consistent inference of probabilities in layered networks: predictions and generalisation, IEEE Neural Net., 2, 403--410 (1989).
....power of the network. This principle increases the probability of correct generalization because it results in a specialized network architecture that has a reduced entropy (Denker et al. 1987; Patarnello and Carnevali, 1987; Tishby, Levin and Solla, 1989; Le Cun, 1989). On the other hand, some effort must be devoted to designing appropriate constraints into the architecture. [Figure 1: Examples of original zipcodes from the testing set.] 2 ZIPCODE RECOGNITION The handwritten digit recognition application was chosen because it is a relatively simple machine vision task: the input consists of black or white ....
Tishby, N., Levin, E., and Solla, S. A. (1989). Consistent Inference of Probabilities in Layered Networks: Predictions and Generalization. In Proceedings of the International Joint Conference on Neural Networks, Washington DC.
....al. 1991] This scenario is perhaps the simplest possible supervised learning scenario. It is identical to the noisefree Gibbs learning scenario studied recently in [Haussler et al. 1991] and can also be viewed as the zero temperature limit of the statistical mechanics work of Tishby et al. [Tishby et al. 1989]. It consists of examining the noise free performance of generalizers of the following type: exclude all hypothesis functions not consistent with the training set, and guess randomly amongst the rest. The central result of the conventional analysis of exhaustive learning [Schwartz et al. 1990] is ....
....The formalism used in this paper, which was introduced in [Wolpert 1992] is an extension of conventional Bayesian analysis. This formalism doesn't restrict itself to a certain kind of generalizer (as do the various versions of the statistical mechanics machine learning formalism, for example [Tishby et al. 1989, Seung et al. 1991]) nor does it restrict itself to finding worst case 3 bounds, where one assumes one knows very little about the generalizer (as does PAC, for example [Blumer et al. 1987, Blumer et al. 1989, Valiant 1984]). It does not need to assume that one only knows the size of the ....
Tishby et alia (1989). Consistent inference of probabilities in layered networks: Predictions and generalization. In IJCNN International Joint Conference on Neural Networks, Vol. II, 403-409, IEEE, New York.
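The noise-free generalizer described in the first excerpt above ("exclude all hypothesis functions not consistent with the training set, and guess randomly amongst the rest") can be sketched by brute force on a tiny input space. The choices below are illustrative assumptions: inputs are all 2-bit vectors, the hypothesis class is every Boolean function on them, and the test error is averaged uniformly over inputs.

```python
import itertools
import random

# Assumption: a tiny input space so every Boolean function can be enumerated.
INPUTS = list(itertools.product([0, 1], repeat=2))


def all_hypotheses():
    """Every Boolean function on INPUTS, stored as a dict: input -> label."""
    return [dict(zip(INPUTS, labels))
            for labels in itertools.product([0, 1], repeat=len(INPUTS))]


def consistent(hypotheses, training_set):
    """Drop every hypothesis that mislabels at least one training pair."""
    return [h for h in hypotheses
            if all(h[x] == y for x, y in training_set)]


def gibbs_guess(hypotheses, training_set, rng):
    """Pick uniformly at random among the consistent hypotheses.

    With a noise-free training set drawn from a target in the class,
    the consistent set is never empty.
    """
    return rng.choice(consistent(hypotheses, training_set))


def iid_error(target, hypothesis):
    """Fraction of inputs where the guess disagrees with the target (uniform p)."""
    return sum(target[x] != hypothesis[x] for x in INPUTS) / len(INPUTS)


if __name__ == "__main__":
    target = {x: x[0] ^ x[1] for x in INPUTS}  # XOR as a hypothetical target
    train = [((0, 0), 0), ((1, 1), 0)]
    survivors = consistent(all_hypotheses(), train)
    guess = gibbs_guess(all_hypotheses(), train, random.Random(0))
    print(len(survivors), iid_error(target, guess))
```

Averaging `iid_error` over all survivors gives the expected error of the random guess, which is the quantity such analyses track as the training set grows.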
....density function. Like the approaches presented in [FI93, MKS94] it partitions a real valued high dimensional input space into hypercubes. The output nodes, however, represent conditional densities, which are estimated using a frequentist approach [CB90] This is related to results reported in [TLS89, Mac92, Mit97] which show that under appropriate assumptions, artificial neural networks approximate conditional probability density functions. The mathematical approach for integrating information is adopted from the statistical literature [CB90, Pea88] The approach presented in this paper ....
N. Tishby, E. Levin, and S. A. Solla. Consistent inference of probabilities in layered networks: predictions and generalizations. In Proceedings of the First International Joint Conference on Neural Networks, Washington, DC, San Diego, 1989. IEEE TAB Neural Network Committee.
N. Tishby, E. Levin, and S. Solla. Consistent inference of probabilities in layered networks: Predictions and generalization. In IJCNN International Joint Conference on Neural Networks, volume 2, pages 403--409. IEEE, 1989.
.... though not in its scope and results, to the Bayesian information theoretic approach, recently applied also to continuous networks [18, 17]. A SM approach to learning from examples was first proposed by Carnevali and Patarnello [19] and by Denker et al. [20] and further elaborated by Tishby et al. [21], [22]. Studies of learning a classification task in a perceptron can be found in Hansel and Sompolinsky [23] and del Giudice et al. [24] using spin glass techniques. Gardner and Derrida [25] and Gyorgyi and Tishby [26, 27] have used these methods for studying learning of a perceptron rule. Related ....
....expected rate of improvement of the generalization with an increasing number of examples, denoted by the generalization curve. The PAC theory bounds the generalization curve by an inverse power law. Such a gradual improvement has also been observed in computer experiments of supervised learning [21, 31, 32]. In other cases, however, one observes a rather sharp improvement when a critical number of examples is reached [20, 21, 30]. These seemingly conflicting behaviors have analogies in psychological studies of animal learning. The dichotomy between gradual and sudden learning lay at the heart of the ....
N. Tishby, E. Levin, and S. Solla. Consistent inference of probabilities in layered networks: Predictions and generalization. In IJCNN International Joint Conference on Neural Networks, volume 2, pages 403--409. IEEE, 1989.
N. Tishby, E. Levin, and S.A. Solla. Consistent inference of probabilities in layered networks: predictions and generalization. In Proc. IJCNN, Washington, 1989.