Training products of experts by minimizing contrastive divergence

by G. E. Hinton
Venue: Neural Computation

Results 1 - 10 of 850

A fast learning algorithm for deep belief nets

by Geoffrey E. Hinton, Simon Osindero - Neural Computation, 2006
"... We show how to use “complementary priors ” to eliminate the explaining away effects that make inference difficult in densely-connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a ..."
Abstract - Cited by 970 (49 self)
We show how to use “complementary priors” to eliminate the explaining away effects that make inference difficult in densely-connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modelled by long ravines in the free-energy landscape of the top-level associative memory and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.

Citation Context

...izing the Kullback-Leibler divergence, KL(P^0 || P^∞_θ), between the distribution of the data, P^0, and the equilibrium distribution defined by the model, P^∞_θ. In contrastive divergence learning (Hinton, 2002), we only run the Markov chain for n full steps [3] before measuring the second correlation. This is equivalent to ignoring the derivatives ... [Footnote 3: Each full step consists of updating h given v then updating ...
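
As a concrete illustration of the n-step chain described in this excerpt, here is a minimal NumPy sketch of how the two correlations used by contrastive divergence might be computed for a binary RBM. The parameter names (W for the weight matrix, b and c for visible and hidden biases) and the function itself are illustrative, not code from the paper.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd_correlations(v0, W, b, c, n=1, rng=None):
        # Sketch: n-step contrastive divergence statistics for a binary RBM.
        # W has shape (num_visible, num_hidden); b, c are visible/hidden biases.
        rng = rng or np.random.default_rng()

        # First correlation, measured with the data vector v0 clamped.
        h_prob = sigmoid(c + v0 @ W)
        positive = np.outer(v0, h_prob)

        v = v0.copy()
        for _ in range(n):
            # One full step: update h given v, then update v given h.
            h = (rng.random(h_prob.shape) < h_prob).astype(float)
            v_prob = sigmoid(b + h @ W.T)
            v = (rng.random(v_prob.shape) < v_prob).astype(float)
            h_prob = sigmoid(c + v @ W)

        # Second correlation, measured after n full steps of the chain.
        negative = np.outer(v, h_prob)
        return positive, negative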

A Neural Probabilistic Language Model

by Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin - Journal of Machine Learning Research, 2003
"... A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen ..."
Abstract - Cited by 447 (19 self)
A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows taking advantage of longer contexts.

Citation Context

...cated in the early days of connectionism (Hinton, 1986, Elman, 1990). More recently, Hinton’s approach was improved and successfully demonstrated on learning several symbolic relations (Paccanaro and Hinton, 2000). The idea of using neural networks for language modeling is not new either (e.g. Miikkulainen and Dyer, 1991). In contrast, here we push this idea to a large scale, and concentrate on learning a sta...
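
To make the architecture described in the abstract above concrete, here is a minimal sketch of the forward pass of such a model: each context word is mapped to a learned feature vector, the vectors are concatenated and passed through a tanh hidden layer, and a softmax gives the probability of the next word. All parameter names are illustrative, and the direct word-to-output connections used in the paper are omitted for brevity.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def next_word_probs(context_ids, C, H, d, U, b):
        # C: (vocab_size, embed_dim) word feature (embedding) matrix
        # H: (hidden_dim, (n-1)*embed_dim), d: hidden biases
        # U: (vocab_size, hidden_dim),      b: output biases
        x = np.concatenate([C[i] for i in context_ids])  # distributed context
        h = np.tanh(d + H @ x)
        return softmax(b + U @ h)  # probability of every word in the vocabulary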

Greedy layer-wise training of deep networks

by Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, 2006
"... Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allow ..."
Abstract - Cited by 394 (48 self)
Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

Citation Context

...0 is a sample from Q(h0|v0) and (vk, hk) is a sample of the Markov chain, and the expectation can be easily computed thanks to P(hk|vk) factorizing. The idea of the Contrastive Divergence algorithm (Hinton, 2002) is to take k small (typically k = 1). A pseudo-code for Contrastive Divergence training (with k = 1) of an RBM with binomial input and hidden units is presented in the Appendix (Algorithm RBMupdate(...
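
The RBMupdate pseudo-code itself is in the paper's appendix; as a rough sketch of what a k = 1 update looks like for binary units (reusing numpy and the sigmoid helper from the earlier sketch, with illustrative names and an assumed learning rate lr):

    def rbm_update_cd1(v0, W, b, c, lr=0.1, rng=None):
        # One CD-1 parameter update for a binary RBM (illustrative sketch,
        # not the paper's exact RBMupdate pseudo-code).
        rng = rng or np.random.default_rng()
        h0_prob = sigmoid(c + v0 @ W)
        h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
        v1_prob = sigmoid(b + h0 @ W.T)              # one reconstruction step
        v1 = (rng.random(v1_prob.shape) < v1_prob).astype(float)
        h1_prob = sigmoid(c + v1 @ W)
        W += lr * (np.outer(v0, h0_prob) - np.outer(v1, h1_prob))
        b += lr * (v0 - v1)
        c += lr * (h0_prob - h1_prob)
        return W, b, c

In the greedy layer-wise scheme the abstract describes, a first RBM would be trained this way on the raw input, and its hidden activations would then serve as the data for training the next RBM.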

Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations

by Honglak Lee, Roger Grosse, Rajesh Ranganath, Andrew Y. Ng - In ICML ’09, 2009
"... ..."
Abstract - Cited by 369 (19 self)
Abstract not found

Fast exact inference with a factored model for natural language parsing.

by Dan Klein, Christopher D. Manning - In Advances in Neural Information Processing Systems, 2003
"... Abstract We present a novel generative model for natural language tree structures in which semantic (lexical dependency) and syntactic (PCFG) structures are scored with separate models. This factorization provides conceptual simplicity, straightforward opportunities for separately improving the com ..."
Abstract - Cited by 306 (9 self)
We present a novel generative model for natural language tree structures in which semantic (lexical dependency) and syntactic (PCFG) structures are scored with separate models. This factorization provides conceptual simplicity, straightforward opportunities for separately improving the component models, and a level of performance comparable to similar, non-factored models. Most importantly, unlike other modern parsing models, the factored model admits an extremely effective A* parsing algorithm, which enables efficient, exact inference.

Citation Context

...gs. Therefore, the total mass assigned to valid structures will be less than one. We could imagine fixing this by renormalizing. For example, this situation fits into the product-of-experts framework [6], with one semantic expert and one syntactic expert that must agree on a single structure. However, since we are presently only interested in finding most-likely parses, no global renormalization cons...
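
The combination the excerpt refers to is just a product of two separately trained experts, i.e. a sum of log scores over the same tree. A hypothetical interface might look like the following; dep_model and pcfg_model, and their log_prob method, are assumed names for illustration, not the paper's API.

    def factored_log_score(tree, dep_model, pcfg_model):
        # Product-of-experts scoring: the dependency (semantic) expert and the
        # PCFG (syntactic) expert each score the same tree; the unnormalized
        # combined score is the product of their probabilities, so we add logs.
        # As noted above, without global renormalization the total mass over
        # valid structures is less than one, which is harmless when only the
        # most likely parse is needed.
        return dep_model.log_prob(tree) + pcfg_model.log_prob(tree)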

Fields of experts: A framework for learning image priors

by Stefan Roth, Michael J. Black - In CVPR, 2005
"... We develop a framework for learning generic, expressive image priors that capture the statistics of natural scenes and can be used for a variety of machine vision tasks. The approach extends traditional Markov Random Field (MRF) models by learning potential functions over extended pixel neighborhood ..."
Abstract - Cited by 292 (4 self)
We develop a framework for learning generic, expressive image priors that capture the statistics of natural scenes and can be used for a variety of machine vision tasks. The approach extends traditional Markov Random Field (MRF) models by learning potential functions over extended pixel neighborhoods. Field potentials are modeled using a Products-of-Experts framework that exploits nonlinear functions of many linear filter responses. In contrast to previous MRF approaches, all parameters, including the linear filters themselves, are learned from training data. We demonstrate the capabilities of this Field of Experts model with two example applications, image denoising and image inpainting, which are implemented using a simple, approximate inference scheme. While the model is trained on a generic image database and is not tuned toward a specific application, we obtain results that compete with and even outperform specialized techniques.

Citation Context

...ery practical, because it may take a very long time until the Markov chain approximately converges. Instead of running the Markov chain until convergence we use the idea of contrastive divergence [12] to initialize the sampler at the data points and only run it for a small, fixed number of steps. If we denote the data distribution as p0 and the distribution after j MCMC iterations as pj, the cont...
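
As a rough sketch of what such a prior looks like in code, an (unnormalized) Field-of-Experts log prior sums an expert log potential over every linear filter response at every image position. The Student-t form of the experts and all names below are assumptions for illustration.

    import numpy as np
    from scipy.signal import convolve2d

    def foe_log_prior(image, filters, alphas):
        # Unnormalized log p(image) under a Field-of-Experts style prior:
        # apply each learned filter everywhere and sum Student-t expert
        # log potentials over the responses.
        logp = 0.0
        for J, alpha in zip(filters, alphas):
            resp = convolve2d(image, J, mode="valid")  # response at each clique
            logp += np.sum(-alpha * np.log1p(0.5 * resp ** 2))
        return logp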

Learning multiple layers of features from tiny images

by Alex Krizhevsky, 2009
"... Groups at MIT and NYU have collected a dataset of millions of tiny colour images from the web. It is, in principle, an excellent dataset for unsupervised training of deep generative models, but previous researchers who have tried this have found it difficult to learn a good set of filters from the ..."
Abstract - Cited by 280 (5 self)
Groups at MIT and NYU have collected a dataset of millions of tiny colour images from the web. It is, in principle, an excellent dataset for unsupervised training of deep generative models, but previous researchers who have tried this have found it difficult to learn a good set of filters from the images. We show how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex. Using a novel parallelization algorithm to distribute the work among multiple machines connected on a network, we show how training such a model can be done in reasonable time. A second problematic aspect of the tiny images dataset is that there are no reliable class labels which makes it hard to use for object recognition experiments. We created two sets of reliable labels. The CIFAR-10 set has 6000 examples of each of 10 classes and the CIFAR-100 set has 600 examples of each of 100 non-overlapping classes. Using these labels, we show that object recognition is significantly

Citation Context

...cedure infinitely many times. After infinitely many iterations, the model will have forgotten its starting point and we will be sampling from its equilibrium distribution. However, it has been shown in [5] that this expectation can be approximated well in finite time by a procedure known as Contrastive Divergence (CD). The CD learning procedure approximates (1.6) by running the sampling chain for only a ...

Deep Neural Networks for Acoustic Modeling in Speech Recognition

by Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, Brian Kingsbury
"... Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative ..."
Abstract - Cited by 272 (47 self)
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feedforward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks with many hidden layers that are trained using new methods have been shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.

Citation Context

... stochastic binary “visible” units that represent binary input data connected to a layer of stochastic binary hidden units that learn to model significant non-independencies between the visible units [20]. There are undirected connections between visible and hidden units but no visible-visible or hidden-hidden connections. An RBM is a type of Markov Random Field (MRF) but differs from most MRFs in se...
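
In hybrid systems of the kind surveyed above, the per-frame posteriors produced by the network are typically converted into quantities the HMM decoder can use in place of GMM likelihoods by dividing by the state priors. A minimal sketch of that step, assuming this standard recipe and using illustrative names:

    import numpy as np

    def scaled_log_likelihoods(log_posteriors, log_state_priors):
        # Convert per-frame DNN outputs log p(state | frame) into scaled
        # log likelihoods, log p(frame | state) - log p(frame), by subtracting
        # the log state priors, so they can drive HMM (Viterbi) decoding.
        return log_posteriors - log_state_priors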

Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition

by George E. Dahl, Dong Yu, Li Deng, Alex Acero - IEEE Transactions on Audio, Speech, and Language Processing, 2012
"... We propose a novel context-dependent (CD) model for large vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pretrained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to pr ..."
Abstract - Cited by 254 (50 self)
We propose a novel context-dependent (CD) model for large vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pretrained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively.
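
The absolute and relative figures quoted in the abstract are consistent under the usual reading that relative error reduction equals the absolute accuracy gain divided by the baseline sentence error rate. A quick check; the baseline error rates printed below are implied values, not numbers reported in the excerpt:

    def implied_baseline_error(abs_gain_points, relative_reduction):
        # relative error reduction = absolute accuracy gain / baseline error rate
        return abs_gain_points / relative_reduction

    print(implied_baseline_error(5.8, 0.160))  # ~36.3% baseline error (MPE-trained CD-GMM-HMM)
    print(implied_baseline_error(9.2, 0.232))  # ~39.7% baseline error (ML-trained CD-GMM-HMM)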

Boltzmann machines

by Geoffrey E. Hinton, 2007
"... A Boltzmann Machine is a network of symmetrically connected, neuronlike units that make stochastic decisions about whether to be on or off. Boltzmann machines have a simple learning algorithm that allows them to discover interesting features in datasets composed of binary vectors. The learning algor ..."
Abstract - Cited by 228 (21 self)
A Boltzmann Machine is a network of symmetrically connected, neuronlike units that make stochastic decisions about whether to be on or off. Boltzmann machines have a simple learning algorithm that allows them to discover interesting features in datasets composed of binary vectors. The learning algorithm is very slow in networks with many layers of feature detectors, but it can be made much faster by learning one layer of feature detectors at a time. Boltzmann machines are used to solve two quite different computational problems. For a search problem, the weights on the connections are fixed and are used to represent the cost function of an optimization problem. The stochastic dynamics of a Boltzmann machine then allow it to sample binary state vectors that represent good solutions to the optimization problem. For a learning problem, the Boltzmann machine is shown a set of binary data vectors and it must find weights on the connections so that the data vectors are good solutions to the optimization problem defined by those weights. To solve a learning problem, Boltzmann machines make many small updates to their weights, and each update requires them to solve many different search problems.

The stochastic dynamics of a Boltzmann machine: When unit i is given the opportunity to update its binary state, it first computes its total input, z_i, which is the sum of its own bias, b_i, and the weights on connections coming from other active units: z_i = b_i + Σ_j s_j w_ij, where s_j is 1 if unit j is on and 0 otherwise.
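
To round out the update rule the abstract begins to describe, a minimal sketch of one stochastic unit update at temperature 1; the logistic form of the turn-on probability is the standard Boltzmann machine rule, and the names are illustrative.

    import numpy as np

    def update_unit(i, s, b, W, rng=None):
        # One stochastic update of unit i: total input z_i = b_i + sum_j s_j * w_ij,
        # then turn the unit on with probability sigma(z_i).
        rng = rng or np.random.default_rng()
        z = b[i] + s @ W[i]                     # W symmetric with zero diagonal
        p_on = 1.0 / (1.0 + np.exp(-z))
        s[i] = 1.0 if rng.random() < p_on else 0.0
        return s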