Results 1  10
of
839
A fast learning algorithm for deep belief nets
 Neural Computation
, 2006
"... We show how to use “complementary priors ” to eliminate the explaining away effects that make inference difficult in denselyconnected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a ..."
Abstract

Cited by 968 (49 self)
 Add to MetaCart
(Show Context)
We show how to use “complementary priors ” to eliminate the explaining away effects that make inference difficult in denselyconnected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that finetunes the weights using a contrastive version of the wakesleep algorithm. After finetuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The lowdimensional manifolds on which the digits lie are modelled by long ravines in the freeenergy landscape of the toplevel associative memory and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind. 1
A Neural Probabilistic Language Model
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2003
"... A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen ..."
Abstract

Cited by 443 (19 self)
 Add to MetaCart
(Show Context)
A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on ngrams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on stateoftheart ngram models, and that the proposed approach allows to take advantage of longer contexts.
Greedy layerwise training of deep networks
, 2006
"... Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multilayer neural networks have many levels of nonlinearities allow ..."
Abstract

Cited by 394 (48 self)
 Add to MetaCart
(Show Context)
Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multilayer neural networks have many levels of nonlinearities allowing them to compactly represent highly nonlinear and highlyvarying functions. However, until recently it was not clear how to train such deep networks, since gradientbased optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layerwise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also conrm the hypothesis that the greedy layerwise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are highlevel abstractions of the input, bringing better generalization.
Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations
 IN ICML’09
, 2009
"... ..."
Fast exact inference with a factored model for natural language parsing.
 In Advances in Neural Information Processing Systems,
, 2003
"... Abstract We present a novel generative model for natural language tree structures in which semantic (lexical dependency) and syntactic (PCFG) structures are scored with separate models. This factorization provides conceptual simplicity, straightforward opportunities for separately improving the com ..."
Abstract

Cited by 304 (9 self)
 Add to MetaCart
(Show Context)
Abstract We present a novel generative model for natural language tree structures in which semantic (lexical dependency) and syntactic (PCFG) structures are scored with separate models. This factorization provides conceptual simplicity, straightforward opportunities for separately improving the component models, and a level of performance comparable to similar, nonfactored models. Most importantly, unlike other modern parsing models, the factored model admits an extremely effective A* parsing algorithm, which enables efficient, exact inference.
Fields of experts: A framework for learning image priors
 In CVPR
, 2005
"... We develop a framework for learning generic, expressive image priors that capture the statistics of natural scenes and can be used for a variety of machine vision tasks. The approach extends traditional Markov Random Field (MRF) models by learning potential functions over extended pixel neighborhood ..."
Abstract

Cited by 292 (4 self)
 Add to MetaCart
(Show Context)
We develop a framework for learning generic, expressive image priors that capture the statistics of natural scenes and can be used for a variety of machine vision tasks. The approach extends traditional Markov Random Field (MRF) models by learning potential functions over extended pixel neighborhoods. Field potentials are modeled using a ProductsofExperts framework that exploits nonlinear functions of many linear filter responses. In contrast to previous MRF approaches all parameters, including the linear filters themselves, are learned from training data. We demonstrate the capabilities of this Field of Experts model with two example applications, image denoising and image inpainting, which are implemented using a simple, approximate inference scheme. While the model is trained on a generic image database and is not tuned toward a specific application, we obtain results that compete with and even outperform specialized techniques. 1.
Learning multiple layers of features from tiny images
, 2009
"... Groups at MIT and NYU have collected a dataset of millions of tiny colour images from the web. It is, in principle, an excellent dataset for unsupervised training of deep generative models, but previous researchers who have tried this have found it difficult to learn a good set of filters from the ..."
Abstract

Cited by 275 (5 self)
 Add to MetaCart
(Show Context)
Groups at MIT and NYU have collected a dataset of millions of tiny colour images from the web. It is, in principle, an excellent dataset for unsupervised training of deep generative models, but previous researchers who have tried this have found it difficult to learn a good set of filters from the images. We show how to train a multilayer generative model that learns to extract meaningful features which resemble those found in the human visual cortex. Using a novel parallelization algorithm to distribute the work among multiple machines connected on a network, we show how training such a model can be done in reasonable time. A second problematic aspect of the tiny images dataset is that there are no reliable class labels which makes it hard to use for object recognition experiments. We created two sets of reliable labels. The CIFAR10 set has 6000 examples of each of 10 classes and the CIFAR100 set has 600 examples of each of 100 nonoverlapping classes. Using these labels, we show that object recognition is significantly
Deep Neural Networks for Acoustic Modeling in Speech Recognition
"... Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative ..."
Abstract

Cited by 268 (46 self)
 Add to MetaCart
(Show Context)
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feedforward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks with many hidden layers, that are trained using new methods have been shown to outperform Gaussian mixture models on a variety of speech recognition benchmarks, sometimes by a large margin. This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition. I.
ContextDependent Pretrained Deep Neural Networks for Large Vocabulary Speech Recognition
 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING
, 2012
"... We propose a novel contextdependent (CD) model for large vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pretrained deep neural network hidden Markov model (DNNHMM) hybrid architecture that trains the DNN to pr ..."
Abstract

Cited by 252 (49 self)
 Add to MetaCart
We propose a novel contextdependent (CD) model for large vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pretrained deep neural network hidden Markov model (DNNHMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pretraining algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CDDNNHMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CDDNNHMMs can significantly outperform the conventional contextdependent Gaussian mixture model (GMM)HMMs, with an absolute sentence accuracy improvement of 5.8 % and 9.2 % (or relative error reduction of 16.0 % and 23.2%) over the CDGMMHMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively.
Boltzmann machines
, 2007
"... A Boltzmann Machine is a network of symmetrically connected, neuronlike units that make stochastic decisions about whether to be on or off. Boltzmann machines have a simple learning algorithm that allows them to discover interesting features in datasets composed of binary vectors. The learning algor ..."
Abstract

Cited by 229 (21 self)
 Add to MetaCart
A Boltzmann Machine is a network of symmetrically connected, neuronlike units that make stochastic decisions about whether to be on or off. Boltzmann machines have a simple learning algorithm that allows them to discover interesting features in datasets composed of binary vectors. The learning algorithm is very slow in networks with many layers of feature detectors, but it can be made much faster by learning one layer of feature detectors at a time. Boltzmann machines are used to solve two quite different computational problems. For a search problem, the weights on the connections are fixed and are used to represent the cost function of an optimization problem. The stochastic dynamics of a Boltzmann machine then allow it to sample binary state vectors that represent good solutions to the optimization problem. For a learning problem, the Boltzmann machine is shown a set of binary data vectors and it must find weights on the connections so that the data vectors are good solutions to the optimization problem defined by those weights. To solve a learning problem, Boltzmann machines make many small updates to their weights, and each update requires them to solve many different search problems. The stochastic dynamics of a Boltzmann machine When unit i is given the opportunity to update its binary state, it first computes its total input, zi, which is the sum of its own bias, bi, and the weights on connections coming from other active units: zi = bi + �