Results 1-10 of 155
Context-Dependent Pretrained Deep Neural Networks for Large Vocabulary Speech Recognition
IEEE Transactions on Audio, Speech, and Language Processing, 2012
Cited by 254 (50 self)
We propose a novel context-dependent (CD) model for large vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pretrained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pretraining algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively.
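The absolute and relative figures quoted in this abstract are linked by simple arithmetic: relative error reduction equals the absolute accuracy gain divided by the baseline error rate. The sketch below inverts that relation to estimate the baseline sentence error rates implied by the quoted numbers. The function name is hypothetical, the baselines are not stated in the abstract, and because the quoted percentages are rounded the results are only approximate.

```python
def implied_baseline_error(abs_gain_points, rel_reduction_percent):
    """Baseline error rate (in percent) implied by an absolute accuracy gain
    (percentage points) and the matching relative error reduction (percent).

    Uses: rel_reduction = abs_gain / baseline_error.
    """
    return 100.0 * abs_gain_points / rel_reduction_percent


if __name__ == "__main__":
    # Figures quoted in the abstract; the outputs are inferred, not reported.
    mpe_baseline = implied_baseline_error(5.8, 16.0)
    ml_baseline = implied_baseline_error(9.2, 23.2)
    print(f"Implied MPE baseline sentence error: {mpe_baseline:.1f}%")
    print(f"Implied ML baseline sentence error:  {ml_baseline:.1f}%")
```

Under these assumptions both baselines come out somewhere between 35% and 40% sentence error, which is consistent with the ML-trained system being the weaker of the two.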
Representation learning: A review and new perspectives.
IEEE Conf. Comp. Vision Pattern Recog. (CVPR), 2005
Cited by 173 (4 self)
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.
Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion
2010
Random search for hyperparameter optimization
Journal of Machine Learning Research
Cited by 125 (16 self)
Grid search and manual search are the most widely used strategies for hyperparameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyperparameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyperparameters to validation set performance reveals that for most data sets only a few of the hyperparameters really matter, but that different hyperparameters are important on different data sets. This phenomenon makes ...
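The observation that only a few hyperparameters really matter can be illustrated with a small sketch. It is not the paper's experimental setup: the objective function, the hyperparameter names (lr, width), and the trial counts are all hypothetical. With the same budget of 16 trials, a 4 x 4 grid tests only 4 distinct values of each hyperparameter, while random sampling tests 16 distinct values of the one that matters.

```python
import itertools
import random


def toy_validation_error(lr, width):
    """Hypothetical objective: sensitive to lr, nearly flat in width."""
    return (lr - 0.37) ** 2 + 1e-4 * (width - 0.5) ** 2


def grid_trials(points_per_axis):
    """All combinations of evenly spaced values in [0, 1] on both axes."""
    axis = [i / (points_per_axis - 1) for i in range(points_per_axis)]
    return list(itertools.product(axis, axis))


def random_trials(n_trials, seed=0):
    """Independent uniform draws in [0, 1] for both hyperparameters."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n_trials)]


def best_error(trials):
    """Lowest toy validation error among the given trials."""
    return min(toy_validation_error(lr, width) for lr, width in trials)


def distinct_values(trials, index):
    """Number of different values tried for one hyperparameter."""
    return len({trial[index] for trial in trials})


if __name__ == "__main__":
    grid = grid_trials(4)       # 16 trials in total
    rand = random_trials(16)    # same budget
    print("distinct lr values  grid:", distinct_values(grid, 0),
          " random:", distinct_values(rand, 0))
    print("best error          grid:", best_error(grid),
          " random:", best_error(rand))
```

Which search wins on a single run depends on the seed and the objective; the sketch only shows why random search covers the important axis more densely for a fixed number of trials.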
Deep learning via Hessian-free optimization
Cited by 76 (5 self)
We develop a 2nd-order optimization method based on the “Hessian-free” approach, and apply it to training deep autoencoders. Without using pretraining, we obtain results superior to those reported by Hinton & Salakhutdinov (2006) on the same tasks they considered. Our method is practical, easy to use, scales nicely to very large datasets, and isn’t limited in applicability to autoencoders, or any specific model class. We also discuss the issue of “pathological curvature” as a possible explanation for the difficulty of deep learning and how 2nd-order optimization, and our method in particular, effectively deals with it.
Deep Sparse Rectifier Neural Networks
Cited by 57 (17 self)
While logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neurons, the latter work better for training multilayer neural networks. This paper shows that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks in spite of the hard nonlinearity and non-differentiability at zero, creating sparse representations with true zeros, which seem remarkably suitable for naturally sparse data. Even though they can take advantage of semi-supervised setups with extra-unlabeled data, deep rectifier networks can reach their best performance without requiring any unsupervised pretraining on purely supervised tasks with large labeled datasets. Hence, these results can be seen as a new milestone in the attempts at understanding the difficulty in training deep but purely supervised neural networks, and closing the performance gap between neural networks learnt with and without unsupervised pretraining.
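The phrase "sparse representations with true zeros" can be made concrete with a minimal sketch. It assumes a single randomly initialized fully connected layer with zero biases; the layer sizes, helper names, and Gaussian initialization are illustrative choices, not taken from the paper. A rectifier unit outputs exactly 0.0 whenever its pre-activation is non-positive, whereas a hyperbolic tangent unit is zero only when its pre-activation is exactly zero.

```python
import math
import random


def rectifier(x):
    """Rectifier activation max(0, x): exactly zero for non-positive input."""
    return max(0.0, x)


def hidden_layer(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one output per row of W."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def fraction_exact_zeros(values):
    """Share of activations that are exactly 0.0."""
    return sum(1 for v in values if v == 0.0) / len(values)


def random_layer(n_in, n_out, seed=0):
    """Hypothetical setup: Gaussian weights and inputs, zero biases."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    inputs = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
    return inputs, weights, biases


if __name__ == "__main__":
    x, W, b = random_layer(20, 200)
    print("rectifier zeros:", fraction_exact_zeros(hidden_layer(x, W, b, rectifier)))
    print("tanh zeros:     ", fraction_exact_zeros(hidden_layer(x, W, b, math.tanh)))
```

With symmetric random weights, roughly half of the rectifier units are expected to be inactive, while the tanh layer produces a dense vector of small nonzero values.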
Tiled convolutional neural networks
NIPS, in press, 2010
Cited by 54 (7 self)
Convolutional neural networks (CNNs) have been successfully applied to many tasks such as digit and object recognition. Using convolutional (tied) weights significantly reduces the number of parameters that have to be learned, and also allows translational invariance to be hardcoded into the architecture. In this paper, we consider the problem of learning invariances, rather than relying on hardcoding. We propose tiled convolution neural networks (Tiled CNNs), which use a regular “tiled” pattern of tied weights that does not require that adjacent hidden units share identical weights, but instead requires only that hidden units k steps away from each other have tied weights. By pooling over neighboring units, this architecture is able to learn complex invariances (such as scale and rotational invariance) beyond translational invariance. Further, it also enjoys much of CNNs’ advantage of having a relatively small number of learned parameters (such as ease of learning and greater scalability). We provide an efficient learning algorithm for Tiled CNNs based on Topographic ICA, and show that learning complex invariant features allows us to achieve highly competitive results for both the NORB and CIFAR-10 datasets.
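The tiling rule described in this abstract, where hidden units k steps apart share weights, can be sketched in one dimension. This is an illustrative simplification under stated assumptions: the function name and the 1-D setting are hypothetical, and the sketch omits the pooling layer and the Topographic ICA training that the paper relies on. Position p uses filter p mod k, so k = 1 reduces to an ordinary convolution with fully tied weights.

```python
def tiled_conv1d(signal, filters):
    """1-D 'valid' convolution with tiled weight sharing.

    filters is a list of k equal-length filters; the hidden unit at
    position p uses filters[p % k], so units k steps apart share weights.
    """
    k = len(filters)
    width = len(filters[0])
    outputs = []
    for pos in range(len(signal) - width + 1):
        kernel = filters[pos % k]
        window = signal[pos:pos + width]
        outputs.append(sum(w * x for w, x in zip(kernel, window)))
    return outputs


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 5]
    # k = 2: even positions use the first filter, odd positions the second.
    print(tiled_conv1d(signal, [[1, 0], [0, 1]]))
    # k = 1: ordinary convolution, every position uses the same filter.
    print(tiled_conv1d(signal, [[1, 1]]))
```

The number of learned parameters grows with k rather than with the number of hidden units, which is the trade-off between flexibility and parameter count that the abstract points to.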
On Random Weights and Unsupervised Feature Learning
Cited by 45 (6 self)
Recently two anomalous results in the literature have shown that certain feature learning architectures can perform very well on object recognition tasks, without training. In this paper we pose the question, why do random weights sometimes do so well? Our answer is that certain convolutional pooling architectures can be inherently frequency selective and translation invariant, even with random weights. Based on this we demonstrate the viability of extremely fast architecture search by using random weights to evaluate candidate architectures, thereby sidestepping the time-consuming learning process. We then show that a surprising fraction of the performance of certain state-of-the-art methods can be attributed to the architecture alone.
The Manifold Tangent Classifier
Cited by 32 (10 self)
We combine three important ideas present in previous work for building classifiers: the semi-supervised hypothesis (the input distribution contains information about the classifier), the unsupervised manifold hypothesis (data density concentrates near low-dimensional manifolds), and the manifold hypothesis for classification (different classes correspond to disjoint manifolds separated by low density). We exploit a novel algorithm for capturing manifold structure (high-order contractive autoencoders) and we show how it builds a topological atlas of charts, each chart being characterized by the principal singular vectors of the Jacobian of a representation mapping. This representation learning algorithm can be stacked to yield a deep architecture, and we combine it with a domain knowledge-free version of the TangentProp algorithm to encourage the classifier to be insensitive to local direction changes along the manifold. Record-breaking classification results are obtained.
A Connection between Score Matching and Denoising Autoencoders
2010
Cited by 29 (1 self)
Denoising autoencoders have been previously shown to be competitive alternatives to Restricted Boltzmann Machines for unsupervised pretraining of each layer of a deep architecture. We show that a simple denoising autoencoder training criterion is equivalent to matching the score (with respect to the data) of a specific energy-based model to that of a nonparametric Parzen density estimator of the data. This yields several useful insights. It defines a proper probabilistic model for the denoising autoencoder technique which makes it in principle possible to sample from them or to rank examples by their energy. It suggests a different way to apply score matching that is related to learning to denoise and does not require computing second derivatives. It justifies the use of tied weights between the encoder and decoder, and suggests ways to extend the success of denoising autoencoders to a larger family of energy-based models.
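The equivalence stated in this abstract can be sketched as a short derivation. The notation is an assumption, since the abstract defines no symbols: Gaussian corruption with variance \sigma^2, a sigmoid encoder with weights W and bias b, a linear decoder with tied weights W^T and bias c, and q_\sigma for the Parzen-smoothed data distribution. Under those assumptions, matching the model score \psi to the score of the corruption process gives a squared reconstruction error, and no second derivatives appear.

```latex
% Assumed setup: \tilde{x} = x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)
\begin{align}
  % Energy of the model (assumed form, with softplus hidden-unit terms)
  E(x; W, b, c) &= -\frac{1}{\sigma^2}\Big( \langle c, x \rangle - \tfrac{1}{2}\lVert x \rVert^2
      + \sum_j \operatorname{softplus}\big(\langle W_j, x \rangle + b_j\big) \Big) \\
  % Model score: negative gradient of the energy with respect to the data
  \psi(x) = -\frac{\partial E}{\partial x}
      &= \frac{1}{\sigma^2}\Big( W^{\top} \operatorname{sigmoid}(W x + b) + c - x \Big) \\
  % Score of the Gaussian corruption process
  \frac{\partial \log q_\sigma(\tilde{x} \mid x)}{\partial \tilde{x}}
      &= \frac{1}{\sigma^2}\,(x - \tilde{x}) \\
  % Denoising score matching objective reduces to reconstruction error
  J(\theta) = \mathbb{E}_{q_\sigma(x, \tilde{x})}\Big[ \tfrac{1}{2}
      \big\lVert \psi(\tilde{x}) - \tfrac{1}{\sigma^2}(x - \tilde{x}) \big\rVert^2 \Big]
      &= \frac{1}{2\sigma^4}\, \mathbb{E}_{q_\sigma(x, \tilde{x})}\Big[
      \big\lVert W^{\top} \operatorname{sigmoid}(W \tilde{x} + b) + c - x \big\rVert^2 \Big]
\end{align}
```

In the last line the \tilde{x} terms cancel, leaving the usual denoising autoencoder criterion up to the constant factor 1/(2\sigma^4). The decoder weights W^T arise from differentiating the energy, which is one way to read the abstract's remark about tied weights.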