63 citations found.
G. E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory, pages 5--13, Santa Cruz, California, July 1993. ACM Press, New York.

This paper is cited in the following contexts:

Bayesian Inference for Reliable Biomedical Signal Processing - Sykacek (2000)   (1 citation)

....provides an estimate for the model evidence in (3.3). Ensemble learning. Another method for approximating posterior distributions that will be used in a subsequent chapter of this thesis is a technique called ensemble learning or variational approximation of the posterior. As introduced by [HvC93], this technique approximates the posterior over by a parameterized ensemble Q( The optimal approximating ensemble of the posterior is found by minimizing the variational free energy, which is a well-established technique in statistical physics (see e.g. [Fey72]) F ( d Q( log( ....

G. E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proc. 6th Annu. Workshop on Computational Learning Theory, pages 5-13, New York, NY, 1993. ACM Press.


Bayesian model selection for Support Vector machines, Gaussian.. - Seeger   (11 citations)

....(y|D; almost everywhere with respect to the distribution P . Thus, F is an upper bound on -log P(D| ) and changing ( P ; to decrease F enlarges the evidence or decreases the divergence between the posterior and its approximation, both being favourable. This idea has been introduced in [3] as ensemble learning and has been successfully applied to MLPs [1]. The latter work also introduced the model class we use here, namely the class of Gaussians with mean and factor analyzed This is the random effects model with improper prior of [13] p.19, and works by placing a flat ....

.... random effects model with improper prior of [13] p.19, and works by placing a flat improper prior on the bias parameter. We average different discriminants (given by y) over the ensemble P . covariance $\Sigma = D + \sum_{j=1}^{M} c_j c_j'$, D diagonal with positive elements. Hinton and van Camp [3] used diagonal covariances which would be M = 0 in our setting. By choosing a small M, we are able to track the most important correlations between the components in the posterior using O(Mn) parameters to represent P. Having agreed on , the criterion F and its gradients with respect to and ....

G. E. Hinton and D. Van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Conference on COLT 6, pages 5-13, 1993.


Building Blocks For Hierarchical Latent Variable Models - Valpola, Raiko, Karhunen (2001)

....combined. Three important issues arise within this design: 1) the need for a cost function which can be used for learning the model structure, 2) a learning method which avoids over fitting and 3) the requirement of roughly linear computational complexity for scalability. Ensemble learning [1] has proven to satisfy these requirements. Ensemble learning and related variational methods have been successfully applied to various extensions of linear Gaussian factor analysis. The extensions have included mixtures of Gaussian distributions for source signals [2] nonlinear units [3, 4] and ....

G. E. Hinton and D. van Camp, "Keeping neural networks simple by minimizing the description length of the weights," in Proc. COLT'93, pp. 5--13, 1993.


Dynamical Factor Analysis Of Rhythmic.. - Särelä, Valpola.. (2001)

....shortcoming regarding MEG applications, where the signal-to-noise ratio can be extremely poor. In this paper we introduce a generative dynamical algorithm for noisy measurements. This algorithm is dynamical factor analysis (DFA) and it exploits a Bayesian treatment called ensemble learning [5, 6]. Ensemble learning provides a general framework to learn generative models from a given data set and it can be used for model selection, e.g. in the determination of the most .... [Footnote: This work is partially funded by EU BLISS project. RV is funded by EU (Marie Curie Fellowship HPMF CT 2000 00813).]

.... , 0 : of the states is specified similarly using the function instead of the linear mapping H . All the parameters of the model have hierarchical Gaussian priors. For example the noise parameters O of different components of the data share a common prior [13]. Ensemble learning [5, 6] is a recently developed method for fitting a parametric approximation to the exact posterior density function , 0 12 . The true posterior is approximated by a density , with a simple factorial form. The misfit of the approximation is measured by Kullback-Leibler divergence between ....

G. Hinton and D. van Camp, "Keeping neural networks simple by minimizing the description length of the weights," in Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory, Santa Cruz, CA, USA, 1993, pp. 5--13.


Graphical Models and Variational Methods - Ghahramani, Beal (2001)   (4 citations)

....[17, 22] The overfitting problem is avoided simply because no parameter in the pure Bayesian approach is actually fit to the data. Having more parameters imparts an advantage in terms of the ability to model the data, but this is offset by the cost of having to code that parameter under the prior [14]. Along with the prior over parameters, a Bayesian approach to learning starts with some prior knowledge or assumptions about the model structure: the set of arcs in the Bayesian network. This initial knowledge is represented in the form of a prior probability distribution over model ....

G.E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Sixth ACM Conference on Computational Learning Theory, Santa Cruz, 1993.


Propagation Algorithms for Variational Bayesian Learning - Ghahramani, Beal (2001)   (85 citations)

.... free distributions, Q x (x) and Q ( From (1) we can see that this maximisation is equivalent to minimising the KL divergence between Q x (x)Q ( and the joint posterior over hidden states and parameters P (x; jy; M). This approach was first proposed for one hidden layer neural networks [6] under the restriction that Q ( is Gaussian. It has since been extended to models with hidden variables and the restrictions on Q ( and Q x (x) have been removed in certain models to allow arbitrary distributions [11, 8, 3, 1, 5]. Free form optimisation with respect to the distributions Q ....

G.E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Sixth ACM Conference on Computational Learning Theory, Santa Cruz, 1993.


Classification and Regression using Mixtures of Experts - Waterhouse (1997)   (7 citations)

....also to control the model complexity. The approach used is Bayesian in flavour and owes much to the evidence framework of MacKay [133] and the variational free energy view of the EM algorithm of Neal and Hinton [154]. I use ensemble learning, a technique originally proposed by Hinton and van Camp [87] and subsequently extended by MacKay [136], to motivate an alternating minimisation procedure that combines the standard EM training algorithm with re-estimation of hyperparameters of priors on gate and expert parameters. The outline of this chapter is as follows. I separate the chapter into two ....

....to find integrals numerically [153]. Bishop [14] gives a good review and tutorial on this area. In this chapter I use an approach to Bayesian inference which is motivated by the evidence framework of MacKay [133]. Other motivations include the ensemble learning method of Hinton and van Camp [87] and MacKay [136], the EM viewpoint of Neal and Hinton [154] and the mean field theory method of Saul and Jordan [209]. Whilst it is acknowledged that many Bayesian methods could be used for the mixtures of experts, I attempt in this chapter to generalise the maximum likelihood method described in ....

[Article contains additional citation context not shown here]

Hinton, G. E. and van Camp, D. [1993], Keeping neural networks simple by minimizing the description length of the weights, in `Proceedings of the 6th Annual conference on Computational Learning Theory', ACM Press, New York, NY, pp. 5--13.


Improving Cox survival analysis with a neural-Bayesian.. - Bakker, Heskes, Neijt.. (2001)

....number of network parameters. This not only takes a lot of computation time, it also introduces approximation errors. Furthermore, it is difficult to determine when enough samples have been drawn. As an alternative we propose a form of ensemble learning. This term was coined by Hinton and van Camp [11] and has been applied to multi-layered perceptrons and Radial Basis Function networks in [12, 13]. We approximate the posterior by minimising the Kullback-Leibler (KL) divergence between the exact posterior given by Bayes' formula and an approximating analytical distribution, varying only the ....

G. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th Annual Workshop on Computational Learning Theory, pages 5-13, New York, 1993.


Estimating the number of layers in a distribution.. - Thomas El-Maraghi..

....methods of model selection have been proposed in the computer vision literature (see [3] for a survey). Here, we will investigate two techniques, which we shall show to be closely related. They are the Bayesian evidence framework [2][11] and the minimum description length principle (MDL) [2][6][15]. The primary criteria for evaluating these techniques will be how well they estimate the number of layers in a distribution. Before continuing, it should be noted that it is not sufficient to simply select the model that yields the highest likelihood, because it is always possible to increase ....

G. E. Hinton and D. van Camp, Keeping neural networks simple by minimizing the description length of the weights, Proceedings of COLT-93, 1993.


Approximate Algorithms for Neural-Bayesian Approaches. - Heskes, Bakker, Kappen

....samples and except for singular cases, converge to the exact probability distribution. Furthermore, they are easy to implement for many probability distributions. 3.2 Variational approach. The variational approach offers an alternative. It has been introduced under the term ensemble learning in [6] and has been applied to learning in multilayered perceptrons and radial basis function networks in [7,8]. Recall that we are interested in the joint posterior P (W; jD) of model parameters and hyperparameters. Knowing that we cannot describe it in an analytical form, the best we can do is to ....

G. Hinton, D. van Camp, Keeping neural networks simple by minimizing the description length of the weights, in: Proceedings of the 6th Annual Workshop on Computational Learning Theory, ACM Press, New York, 1993, pp. 5-13.


Missing Values In Nonlinear Factor Analysis - Raiko, Valpola (2001)

....to their posterior probabilities. This approach, known as Bayesian learning, optimally solves the tradeoff between under- and overfitting. In practice, exact treatment of the posterior pdfs of the models is impossible. Therefore, some suitable approximation method must be used. Ensemble learning [4, 1, 7, 9], which is one type of variational learning, is a method for parametric approximation of posterior pdfs. The basic idea in ensemble learning is to minimise the misfit between the posterior pdf and its parametric approximation. Let P ( jx) denote the exact posterior pdf and Q( its parametric ....

G.E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the COLT'93, pages 5-13, Santa Cruz, California, USA, July 26-28, 1993.


Independent Variable Group Analysis - Lagus, Alhoniemi, Valpola (2001)   (1 citation)

....i = 1; C (described by their means and a common variance) and indices of the winners w(t) for each data vector x(t) t = 1; N . For finding ln p(x|H) we use the variational EM algorithm with = f ; wg as missing observations. In the E phase, an upper bound of the cost is minimized [2, 5]. The rest of the parameters are included in H , i.e. we use ML estimates for the following: c, the hyperparameter governing the prior probability for a codebook vector to be a winner; 2 x , the diagonal elements of the common covariance matrix; and and 2 , the hyperparameters ....

G. E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the COLT'93, pp. 5-13, Santa Cruz, California, USA, July 26-28, 1993.


Variational Learning for Multi-Layer Networks of Linear.. - Lawrence (2001)

....that any attempt to find the optimal parameters in such a network gains no useful information from the gradient: the gradient of the error surface as computed by the generalised delta rule is zero at almost all points. One approach to learning in these networks has been to use noisy weights (Hinton and van Camp, 1993). This gives a probability of a threshold unit being active which is a smooth function of the weights. However, the approach leads to some restrictions on the structure of the network, in particular the approach, without further approximation, only applies to regression networks containing only ....

Hinton, G. E. and D. van Camp (1993). Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pp. 5-13.


Propagation Algorithms for Variational Bayesian Learning - Ghahramani, Beal (2000)   (85 citations)

.... free distributions, Q x (x) and Q ( From (1) we can see that this maximisation is equivalent to minimising the KL divergence between Q x (x)Q ( and the joint posterior over hidden states and parameters P (x; jy; M). This approach was first proposed for one hidden layer neural networks [6] under the restriction that Q ( is Gaussian. It has since been extended to models with hidden variables and the restrictions on Q ( and Q x (x) have been removed in certain models to allow arbitrary distributions [11, 8, 3, 1, 5]. Free form optimisation with respect to the distributions Q ....

G.E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Sixth ACM Conference on Computational Learning Theory, Santa Cruz, 1993.


Varieties of Helmholtz Machine - Dayan, Hinton (1996)   (9 citations)

.... 2 i ) p 2oe 2 i p i 1 oe 2 i (y i Gamma P j w ki z k )z j sigmoid For the deterministic machine (Dayan et al., 1995) when the z j are also Bernoulli, we calculated log p(y i ) w ji for the sigmoid under the assumption that P j w ji z j is approximately Gaussian and used a table (Hinton & van Camp, 1993) for the effect of composing a normal distribution and the sigmoid. softmax This is for the case in which i is an n-valued unit, with input weights for the kth value of w k ji , and the derivative of the log probability is with respect to w k ji . noisy-or For the noisy-or (Pearl, 1988; ....

Hinton, GE & van Camp, D (1993). Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, 5-13. Santa Cruz, CA.


Fuzzy System Identification By Generating and.. - Krause, Krone, Slawinski

....Valid Experiments http://www.cs.utoronto.ca/~delve/index.html 5 combination depth is the number of linguistic expressions in the premise mlp mdl vh Minimum description length (mdl) based training of a multilayer perceptron (mlp; feedforward neural network) with a single layer of hidden units [6]. Table 3 shows the average absolute error on the validation data of the different methods. Table 3: Comparison of different methods: mlp mdl vh = 0.2979, gp map 1 = 0.2996, mars3.6 bag 1 = 0.3281, Fuzzy ROSA = 0.329, knn cv 1 = 0.3327. 5 Conclusion In this paper we presented how a fuzzy approach can be ....

G. E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Sixth Annual Conference on Computational Learning Theory, pages 5--13, 1993.


Bayesian Learning and the Fokker-Planck Machine - Verrelst, Suykens.. (1998)

....can for example be found by a local optimisation algorithm where the covariance matrix of the Gaussian is derived from the curvature of the posterior at the mode. This method was studied in depth by MacKay [15] and Bishop [4]. The more general approach was introduced by Hinton and Van Camp in [11]. They discussed the idea of approximating p(w|D) by an ensemble Q(w; a probability density parameterised by ) and by optimising the quality of this approximation. A classically used measure for the distance between two densities Q(w; and p(w|D) is the Kullback-Leibler divergence (or ....

Hinton G.E., van Camp D., "Keeping neural networks simple by minimizing the description length of the weights," Proc. of COLT-93, pp. 5-13, 1993.


Bayesian Parameter Estimation Via Variational Methods - Jaakkola, Jordan (1999)   (13 citations)

....19 squares or IRLS). The advantage of the variational approach is that it guarantees monotone improvement in likelihood. We present the derivation of this algorithm in Appendix C. Finally, for an alternative perspective on the application of variational methods to Bayesian inference, see Hinton and van Camp (1993) and MacKay (1997). These authors have developed a variational method known as ensemble learning, which can be viewed as a mean field approximation to the marginal likelihood. ....

G. Hinton and D. van Camp (1993). Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th Annual Workshop on Computational Learning Theory. New York: ACM Press.


Nonlinear Dynamical Factor Analysis - Giannakopoulos, Valpola

....algorithm is more efficient than the fully nonlinear version. 2. Ensemble learning. In ensemble learning, a simple, computationally tractable factorial approximation is fitted to the true posterior probability by minimising their Kullback-Leibler information. The idea was first published in [2]. Introductory treatments of ensemble learning can be found in [3-5]. Before going into further detail, we shall briefly outline why we have not used some of the more traditional approaches. MacKay and Gibbs have tried sampling for Bayesian learning of a nonlinear factor analysis model [6] ....

....(MLP) network with tanh nonlinearities on one hidden layer. In other words, the observations are assumed to be generated as $x(t) = f(s(t)) + n(t) = B \tanh(A s(t) + a) + b + n(t)$ (7). We use the notation where scalar functions operate on each element of the vector individually, that is, tanh[1 2] T = tanh 1 tanh 2] T . It has been shown that MLP networks have the universal approximation property [10] if there are enough hidden neurons 2 . This means that any nonlinearity can be modelled by the MLP network. In practice, of course, there are nonlinearities which are very difficult ....

G. E. Hinton and D. van Camp, "Keeping neural networks simple by minimizing the description length of the weights," in Proceedings of the COLT'93, (Santa Cruz, California), pp. 5--13, 1993.


Propagation Algorithms for Variational Bayesian Learning - Ghahramani, Beal (2000)   (85 citations)

.... two free distributions, Q x (x) and Q ( From (1) we can see that this maximisation is equivalent to minimising the KL divergence between Qx (x)Q ( and the joint posterior over hidden states and parameters P (x; jy; M). This approach was first proposed for one hidden layer neural networks [5] under the restriction that Q ( is Gaussian. It has since been extended to models with hidden variables and the restrictions on Q ( and Q x (x) have been removed in certain models to allow arbitrary distributions [10, 7, 2, 1, 4]. Free form optimisation with respect to the distributions Q ....

G.E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Sixth ACM Conference on Computational Learning Theory, Santa Cruz, 1993.


Graphical Models and Variational Methods - Ghahramani, Beal (2000)   (4 citations)

....[17, 22] The overfitting problem is avoided simply because no parameter in the pure Bayesian approach is actually fit to the data. Having more parameters imparts an advantage in terms of the ability to model the data, but this is offset by the cost of having to code that parameter under the prior [14]. Along with the prior over parameters, a Bayesian approach to learning starts with some prior knowledge or assumptions about the model structure: the set of arcs in the Bayesian network. This initial knowledge 1 One of the defining properties of Bayesian networks is that the joint probability of ....

G.E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Sixth ACM Conference on Computational Learning Theory, Santa Cruz, 1993.


The Fokker-Planck Machine as Deterministic Ensemble.. - Verrelst, Suykens.. (1998)

....by a local optimization algorithm where the covariance matrix of the Gaussian is derived from the curvature of the posterior at the mode. This method was studied in depth by MacKay [6] and Bishop [4]. The more general approach, named ensemble learning, was introduced by Hinton and Van Camp in [5]. They discussed the idea of approximating p(w|D) by an ensemble Q(w; a probability density parameterized by ) and by optimizing the quality of this approximation through minimization of the Kullback-Leibler divergence between the two densities K(p(w|D) Q(w; R R d Q(w; log ....

....Kullback-Leibler divergence between the two densities K(p(w|D) Q(w; R R d Q(w; log Q(w; p(w|D) dw = R R d Q(w; log Q(w; dw - R R d Q(w; log p(w|D)dw = S( R R d Q(w; U(w)dw; 1) with S( the entropy of the approximating ensemble. Hinton and van Camp [5] considered a single Gaussian approximation with diagonal covariance matrix (i.e. a separable Gaussian). Barber and Bishop [3] discuss the extension to single Gaussians with full covariance matrices. The major reason for using only a single Gaussian is the problem of calculating the entropy S( ....

Hinton G.E., van Camp D., "Keeping neural networks simple by minimizing the description length of the weights," Proc. of COLT-93, pp. 5-13, 1993.


Survival Analysis: A Neural-Bayesian Approach - Bakker, Kappen, Heskes

....Introducing sensible priors on the parameters, we adopt a Bayesian approach. The resulting posterior can be approximated either by drawing samples, using for example Hybrid Markov Chain Monte Carlo (HMCMC) sampling, or a form of ensemble learning. This last term was coined by Hinton and van Camp [2] and has been applied to MLPs in [3]. In practice, medical experts do not work with probability distributions over model parameters directly: they rather use the concept of p-values. The Bayesian This research was supported by the Technology Foundation STW, applied science division of NWO and ....

G. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th Annual Workshop on Computational Learning Theory, pages 5--13, New York, 1993.


Ensemble Learning - Lappalainen, Miskin (2000)   (24 citations)

....could contain many peaks, but when there is lots of data, most of the probability mass is typically contained in a few peaks of the posterior distribution. Model selection means using only the most massive peaks and discarding the remaining models. 4 Ensemble learning. Ensemble learning [1] is a recently introduced method for parametric approximation of the posterior distributions where Kullback-Leibler information [2], [5] is used to measure the misfit between the actual posterior distribution and its approximation. Let us denote the observed variables by x and the unknown ....

....in the maximum likelihood solution. If y is encoded in too high accuracy, however, the first part of the code will be very long. If y is encoded in too low accuracy, the first part will be short but deviations from the optimal y will increase the second part of the code. The bits back argument [1] overcomes the problem by using infinitesimally small dy but picking the y from a distribution q(y) and encoding a secondary message in the choice of y. Since the distribution q(y) is not needed for decoding x, both the sender and the receiver can run the same algorithm for determining q(y) from ....

Geoffrey E. Hinton and Drew van Camp: `Keeping neural networks simple by minimizing the description length of the weights'. In: Proceedings of the COLT'93, (Santa Cruz, California, 1993) pp. 5-13


Fuzzy System Identification By Generating and.. - Krause, Krone, Slawinski

....Valid Experiments http://www.cs.utoronto.ca/~delve/index.html 5 combination depth is the number of linguistic expressions in the premise mlp mdl vh Minimum description length (mdl) based training of a multilayer perceptron (mlp; feedforward neural network) with a single layer of hidden units [6]. Table 3 shows the average absolute error on the validation data of the different methods. Table 3: Comparison of different methods: mlp mdl vh = 0.2979, gp map 1 = 0.2996, mars3.6 bag 1 = 0.3281, Fuzzy ROSA = 0.329, knn cv 1 = 0.3327. 5 Conclusion In this paper we presented how a fuzzy approach can be ....

G. E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Sixth Annual Conference on Computational Learning Theory, pages 5-13, 1993.


A Variational Bayesian Framework for Graphical Models - Attias (2000)   (116 citations)

....the computation of the Hessian, which may become quite intensive. In this paper I present Variational Bayes (VB), a practical framework for Bayesian computations in graphical models. VB draws together variational ideas from intractable latent variables models [8] and from Bayesian inference [4,5,9], which, in turn, draw on the work of [6]. This framework facilitates analytical calculations of posterior distributions over the hidden variables, parameters and structures. The posteriors fall out of a free form optimization procedure which naturally incorporates conjugate priors, and emerge in ....

Hinton, G.E. & Van Camp, D. (1993). Keeping neural networks simple by minimizing the description length of the weights. Proc. 6th COLT, 5-13.


Bayesian model selection for Support Vector machines, Gaussian.. - Seeger   (11 citations)

....almost everywhere with respect to the distribution P . Thus, F is an upper bound on -log P(D| ) and changing ( P ; to decrease F enlarges the evidence or decreases the divergence between the posterior and its approximation, both being favourable. This idea has been introduced in [3] as ensemble learning 4 and has been successfully applied to MLPs [1]. The latter work also introduced the model class Gamma we use here, namely the class of Gaussians with mean and factor analyzed covariance $\Sigma = D + \sum_{j=1}^{M} c_j c_j'$, D diagonal with positive elements 5 . Hinton ....

....works by placing a flat improper prior on the bias parameter. 4 We average different discriminants (given by y) over the ensemble P . 5 Although there is no danger of overfitting, the use of full covariances would render the optimization more difficult, time and memory consuming. van Camp [3] used diagonal covariances which would be M = 0 in our setting. By choosing a small M , we are able to track the most important correlations between the components in the posterior using O(Mn) parameters to represent P . Having agreed on Gamma, the criterion F and its gradients with respect ....

Geoffrey E. Hinton and D. Van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th annual conference on computational learning theory, pages 5--13, 1993.


Nonlinear Source Separation Using Ensemble Learning .. - Lappalainen.. (2000)

....probability mass, though they often show a high but very narrow peak in their posterior pdfs corresponding to the overfitted parameters. In practice, exact treatment of the posterior pdfs of the models is impossible. Therefore, some suitable approximation method must be used. Ensemble learning [7, 2, 13], also known as variational learning, is a recently developed method for parametric approximation of posterior pdfs where the search takes into account the probability mass of the models. Therefore, it does not suffer from overfitting. The basic idea in ensemble learning is to minimize the misfit ....

G. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory, pages 5-13, Santa Cruz, Calif., 1993.


Ensemble Learning - Lappalainen, Miskin (2000)   (24 citations)

....could contain many peaks, but when there is lots of data, most of the probability mass is typically contained in a few peaks of the posterior distribution. Model selection means using only the most massive peaks and discarding the remaining models. 4 Ensemble learning. Ensemble learning [1] is a recently introduced method for parametric approximation of the posterior distributions where Kullback-Leibler information [2], [5] is used to measure the misfit between the actual posterior distribution and its approximation. Let us denote the observed variables by x and the unknown ....

....in the maximum likelihood solution. If y is encoded in too high accuracy, however, the first part of the code will be very long. If y is encoded in too low accuracy, the first part will be short but deviations from the optimal y will increase the second part of the code. The bits back argument, [1], overcomes the problem by using infinitesimally small dy but picking the y from a distribution q(y) and encoding a secondary message in the choice of y. Since the distribution q(y) is not needed for decoding x, both the sender and the receiver can run the same algorithm for determining q(y) from ....

Geoffrey E. Hinton and Drew van Camp: `Keeping neural networks simple by minimizing the description length of the weights'. In: Proceedings of the COLT'93, (Santa Cruz, California, 1993) pp. 5--13


Blurred Face Recognition via a Hybrid Network Architecture - Re   (Correct)

....parameter that is unknown a priori. Under a Bayesian formulation this parameter is a hyperparameter; therefore, one can integrate the network predictions over its posterior distribution [10, 16]. In the same way, integrating over the posterior distribution of the weights can be considered [5]. A rough approximation of such integration leads to combining suboptimal neural networks into regression ensembles, which classify by the Bayesian rule from an average over all ensemble members' outputs. We consider three types of regression ensembles: U unconstrained ensemble, corresponding to ....

G. E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. Sixth ACM Conference on Computational Learning Theory, July 1993.


Variational Bayesian Inference for Partially Observed Diffusions - Bo Wang And (2004)   (Correct)

No context found.

G. E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory, pages 5--13, Santa Cruz, California, July 1993. ACM Press, New York.


PAC-MDL bounds - Avrim Blum And (2003)   (1 citation)  (Correct)

No context found.

G. E. Hinton and D. van Camp, Keeping neural networks simple by minimizing the description length of the weights, COLT 1993.


Bayesian Learning Of Logical Hidden Markov Models - Raiko Kersting Karhunen   (Correct)

No context found.

Geoffrey E. Hinton and Drew van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proc. COLT'93, pages 5-13, Santa Cruz, California, USA, July 26-28, 1993.


Hierarchical Nonlinear Factor Analysis - Raiko (2001)   (Correct)

No context found.

G. E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory, pages 5-13, Santa Cruz, California, USA, July 26-28, 1993.


Constructing Graphical Models for Bayesian Ensemble.. - Harri Valpola Tapani   (Correct)

No context found.

G. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory, pages 5--13, Santa Cruz, CA, USA, 1993.


Building Blocks For Variational Bayesian Learning Of.. - Raiko, Valpola.. (2006)   (Correct)

No context found.

G. E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory, pages 5--13, Santa Cruz, CA, USA, 1993.


Advances on BYY Harmony Learning: Information Theoretic.. - Xu (2004)   (Correct)

No context found.

G. E. Hinton and D. van Camp, "Keeping neural networks simple by minimizing the description length of the weights," in Proc. 6th ACM Conf. Computational Learning Theory, Santa Cruz, CA, July 1993.


Accelerating Cyclic Update Algorithms for Parameter.. - Antti Honkela Harri (2003)   (Correct)

No context found.

Hinton, G. E. and D. van Camp: 1993, `Keeping neural networks simple by minimizing the description length of the weights'. In: Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory. Santa Cruz, CA, USA, pp. 5--13.


The Structure of Bayesian Neural Network Posteriors - Lawrence, Azzouzi   (Correct)

No context found.

G. E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5-13, 1993.


Nonlinear Blind Source Separation by Variational.. - VALPOLA, OJA, ILIN, .. (1999)   (Correct)

No context found.

G. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory, pages 5--13, Santa Cruz, CA, USA, 1993.


PAC-MDL bounds - Blum, Langford (2003)   (1 citation)  (Correct)

No context found.

G. E. Hinton and D. van Camp, Keeping neural networks simple by minimizing the description length of the weights, COLT 1993.


Bayesian Learning of Logical Hidden Markov Models - Raiko, Kersting, Karhunen..   (Correct)

No context found.

Geoffrey E. Hinton and Drew van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proc. COLT'93, pages 5-13, Santa Cruz, California, USA, July 26-28, 1993.


On-Line Variational Bayesian Learning - Honkela, Valpola (2003)   (Correct)

No context found.

G. E. Hinton and D. van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory, pages 5--13, Santa Cruz, CA, USA, 1993.


A Comparison of State-of-the-Art Classification Techniques.. - Lerner, Lawrence (2001)   (Correct)

No context found.

Hinton GE, van Camp D. Keeping neural networks simple by minimizing the description length of the weights. Proceedings of the Sixth Annual Conference on Computational Learning Theory 1993: 5--13


An Ensemble Learning Approach to Nonlinear Dynamic.. - Valpola, Honkela.. (2002)   (Correct)

No context found.

G. Hinton and D. van Camp, "Keeping neural networks simple by minimizing the description length of the weights," in Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory, (Santa Cruz, CA, USA), pp. 5--13, 1993.


Nonlinear Independent Component Analysis Chapter for ICA.. - Karhunen   (Correct)

No context found.

G. Hinton and D. van Camp, "Keeping neural networks simple by minimizing the description length of the weights," in Proc. of the 6th Annual ACM Conf. on Computational Learning Theory, Santa Cruz, CA, USA, 1993, pp. 5-13.


A Comparison of State-of-the-Art Classification Techniques.. - Lerner, Lawrence (2001)   (Correct)

No context found.

Hinton GE and van Camp D. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5--13, 1993.
