Results 1–10 of 135
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
 IEEE Transactions on Audio, Speech, and Language Processing
, 2012
Cited by 224 (42 self)
Abstract—We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum likelihood (ML) criteria, respectively. Index Terms—Speech recognition, deep belief network, context-dependent phone, LVSR, DNN-HMM, ANN-HMM.
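The decoding step of the hybrid architecture described in this abstract can be sketched briefly: the DNN emits posteriors over senones, and the HMM decoder consumes them as scaled likelihoods obtained by dividing out the senone priors. The sketch below is a minimal illustration, not the paper's implementation; the sizes and the `senone_priors` values are made up (in practice priors come from forced alignments).

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_likelihoods(logits, senone_priors):
    """Turn DNN senone posteriors p(s|x) into scaled likelihoods
    p(x|s) proportional to p(s|x) / p(s), the quantity an HMM
    decoder plugs in where a GMM likelihood used to go."""
    posteriors = softmax(logits)
    return posteriors / senone_priors

# Toy example: one acoustic frame, 3 senones (sizes are illustrative).
rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 3))            # DNN output layer activations
priors = np.array([0.5, 0.3, 0.2])          # hypothetical senone priors
lik = scaled_likelihoods(logits, priors)
```

Frames with the same posterior but a rarer senone get a larger scaled likelihood, which is what lets the frame-level classifier drive sequence-level decoding.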
Representation Learning: A Review and New Perspectives
, 2012
Cited by 152 (4 self)
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and joint training of deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep architectures. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.
Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion
, 2010
Random search for hyper-parameter optimization
 In: Journal of Machine Learning Research
Cited by 107 (16 self)
Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets.
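The procedure this abstract argues for is easy to state in code: sample each hyper-parameter independently from its range, evaluate, keep the best. A minimal sketch follows; the `evaluate` objective and the two hyper-parameters (`lr`, `layers`) are hypothetical stand-ins for a real train-and-validate loop.

```python
import random

def evaluate(params):
    """Hypothetical validation score; in practice this would train a
    model with `params` and return held-out accuracy."""
    lr, layers = params["lr"], params["layers"]
    return -((lr - 0.01) ** 2) - 0.1 * abs(layers - 3)

def random_search(n_trials, seed=0):
    """Draw each hyper-parameter independently per trial and keep the
    best-scoring configuration seen so far."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "lr": 10 ** rng.uniform(-4, -1),  # log-uniform learning rate
            "layers": rng.randint(1, 6),      # number of hidden layers
        }
        score = evaluate(params)
        if score > best_score:
            best, best_score = params, score
    return best, best_score

best, score = random_search(100)
```

Because each trial is drawn independently, the search never wastes a full grid axis on a hyper-parameter that turns out not to matter, which is exactly the effect the Gaussian process analysis above points to.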
Deep learning via Hessian-free optimization
Cited by 74 (5 self)
We develop a 2nd-order optimization method based on the “Hessian-free” approach, and apply it to training deep autoencoders. Without using pre-training, we obtain results superior to those reported by Hinton &amp; Salakhutdinov (2006) on the same tasks they considered. Our method is practical, easy to use, scales nicely to very large datasets, and isn’t limited in applicability to autoencoders, or any specific model class. We also discuss the issue of “pathological curvature” as a possible explanation for the difficulty of deep learning and how 2nd-order optimization, and our method in particular, effectively deals with it.
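The “Hessian-free” idea in this abstract rests on two ingredients: Hessian-vector products that never form the Hessian, and conjugate gradient to solve for the update direction using only those products. The sketch below shows both on a toy quadratic (so a single CG solve is a full Newton step); it is an illustration of the mechanism, not the paper's damped, mini-batched algorithm.

```python
import numpy as np

def hvp(grad_fn, w, v, eps=1e-5):
    """Hessian-vector product by central finite differences of the
    gradient: Hv ~ (g(w + eps*v) - g(w - eps*v)) / (2*eps).
    The full Hessian is never materialized."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

def cg_solve(matvec, b, iters=50, tol=1e-10):
    """Conjugate gradient for H d = b, using only H-vector products."""
    d = np.zeros_like(b)
    r = b.copy()          # residual b - H d (d starts at zero)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = matvec(p)
        alpha = rs / (p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

# Toy objective f(w) = 0.5 w^T A w - b^T w, so the Hessian is A and
# the minimizer is A^{-1} b. One CG-based Newton step finds it.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda w: A @ w - b
w = np.zeros(2)
step = cg_solve(lambda v: hvp(grad, w, v), -grad(w))  # solve H d = -g
w = w + step
```

On real deep networks the same loop runs with a stochastic gradient function and a damped curvature matrix, but the CG-plus-matvec structure is unchanged.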
Tiled convolutional neural networks
 In NIPS, in press
, 2010
Cited by 50 (7 self)
Convolutional neural networks (CNNs) have been successfully applied to many tasks such as digit and object recognition. Using convolutional (tied) weights significantly reduces the number of parameters that have to be learned, and also allows translational invariance to be hard-coded into the architecture. In this paper, we consider the problem of learning invariances, rather than relying on hard-coding. We propose tiled convolutional neural networks (Tiled CNNs), which use a regular “tiled” pattern of tied weights that does not require that adjacent hidden units share identical weights, but instead requires only that hidden units k steps away from each other have tied weights. By pooling over neighboring units, this architecture is able to learn complex invariances (such as scale and rotational invariance) beyond translational invariance. Further, it also enjoys much of CNNs’ advantage of having a relatively small number of learned parameters (such as ease of learning and greater scalability). We provide an efficient learning algorithm for Tiled CNNs based on Topographic ICA, and show that learning complex invariant features allows us to achieve highly competitive results for both the NORB and CIFAR-10 datasets.
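The weight-tying pattern described above, where units k steps apart share weights rather than all units sharing one filter, can be made concrete in one dimension. This is a hypothetical minimal sketch of the tiling scheme only (no pooling, no TICA learning): unit i uses filter bank i % k, and k = 1 collapses back to an ordinary convolution.

```python
import numpy as np

def tiled_conv1d(x, filters, k):
    """1-D 'tiled' convolution: the hidden unit at position i applies
    filter bank i % k, so units exactly k steps apart have tied
    weights while adjacent units may differ. k = 1 recovers a
    standard (fully tied) convolution."""
    assert filters.shape[0] == k
    f_len = filters.shape[1]
    n_out = len(x) - f_len + 1
    out = np.empty(n_out)
    for i in range(n_out):
        out[i] = x[i:i + f_len] @ filters[i % k]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=16)
filters = rng.normal(size=(3, 4))   # k = 3 untied banks, filter length 4
y = tiled_conv1d(x, filters, k=3)
```

The parameter count is k times that of a plain CNN layer (here 3 x 4 weights instead of 4), still far below a fully untied layer's 13 x 4.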
Deep Sparse Rectifier Neural Networks
Cited by 49 (15 self)
While logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neurons, the latter work better for training multi-layer neural networks. This paper shows that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros, which seem remarkably suitable for naturally sparse data. Even though they can take advantage of semi-supervised setups with extra unlabeled data, deep rectifier networks can reach their best performance without requiring any unsupervised pre-training on purely supervised tasks with large labeled datasets. Hence, these results can be seen as a new milestone in the attempts at understanding the difficulty in training deep but purely supervised neural networks, and closing the performance gap between neural networks learnt with and without unsupervised pre-training.
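The “sparse representations with true zeros” claim is easy to see directly: a rectifier clamps every negative pre-activation to exactly 0.0, unlike tanh or sigmoid which merely produce small values. A minimal sketch with a random layer (the layer sizes are arbitrary):

```python
import numpy as np

def relu(z):
    """Rectifier: max(0, z). Negative inputs map to exact zeros, so
    the activation vector is genuinely sparse, not just small."""
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 64))      # random, untrained weight matrix
x = rng.normal(size=64)
h = relu(W @ x)
sparsity = np.mean(h == 0.0)        # fraction of units exactly off
```

With symmetric random inputs, roughly half the units are exactly zero, and those zeros need no computation downstream, which is part of why rectifier networks train well without pre-training.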
On Random Weights and Unsupervised Feature Learning
Cited by 44 (6 self)
Recently two anomalous results in the literature have shown that certain feature learning architectures can perform very well on object recognition tasks, without training. In this paper we pose the question: why do random weights sometimes do so well? Our answer is that certain convolutional pooling architectures can be inherently frequency selective and translation invariant, even with random weights. Based on this we demonstrate the viability of extremely fast architecture search by using random weights to evaluate candidate architectures, thereby sidestepping the time-consuming learning process. We then show that a surprising fraction of the performance of certain state-of-the-art methods can be attributed to the architecture alone.
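The translation-invariance claim can be demonstrated without any learning. The sketch below is a deliberately simplified, hypothetical example: circular correlation with a random filter followed by global max pooling gives a response that is identical for any circular shift of the input, with untrained weights.

```python
import numpy as np

def random_conv_maxpool(x, filt):
    """Circular correlation with a *random* filter, then global max
    pooling. Shifting the input circularly only permutes the
    correlation responses, so the pooled maximum is unchanged:
    invariance comes from the architecture, not from training."""
    n = len(x)
    resp = np.array([np.roll(x, -i)[:len(filt)] @ filt for i in range(n)])
    return resp.max()

rng = np.random.default_rng(0)
filt = rng.normal(size=5)                      # untrained random weights
x = rng.normal(size=32)
a = random_conv_maxpool(x, filt)
b = random_conv_maxpool(np.roll(x, 7), filt)   # shifted input, same output
```

Because properties like this hold regardless of the weight values, candidate architectures can be ranked with random weights before committing to expensive training.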
The Manifold Tangent Classifier
Cited by 32 (10 self)
We combine three important ideas present in previous work for building classifiers: the semi-supervised hypothesis (the input distribution contains information about the classifier), the unsupervised manifold hypothesis (data density concentrates near low-dimensional manifolds), and the manifold hypothesis for classification (different classes correspond to disjoint manifolds separated by low density). We exploit a novel algorithm for capturing manifold structure (high-order contractive autoencoders) and we show how it builds a topological atlas of charts, each chart being characterized by the principal singular vectors of the Jacobian of a representation mapping. This representation learning algorithm can be stacked to yield a deep architecture, and we combine it with a domain-knowledge-free version of the TangentProp algorithm to encourage the classifier to be insensitive to local direction changes along the manifold. Record-breaking classification results are obtained.
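The two objects this abstract leans on, the contractive penalty and the Jacobian whose principal singular vectors define each chart, have closed forms for a sigmoid encoder h = s(Wx + b): J = diag(h(1-h)) W and the penalty is ||J||_F^2. The sketch below computes both for one input; it is an illustration of the quantities only, not the paper's stacked, high-order training procedure, and the sizes are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    """For a sigmoid encoder h = s(Wx + b), the Jacobian dh/dx is
    J = diag(h * (1 - h)) @ W, and the contractive penalty is its
    squared Frobenius norm. Penalizing it makes h locally flat
    except along directions the data actually varies in."""
    h = sigmoid(W @ x + b)
    J = (h * (1.0 - h))[:, None] * W     # row-scale W by h*(1-h)
    return np.sum(J ** 2), J

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(8, 16))
b = np.zeros(8)
x = rng.normal(size=16)
penalty, J = contractive_penalty(x, W, b)

# Leading right-singular vectors of J: estimated local tangent
# directions of the manifold at x (the "chart" at this point).
tangents = np.linalg.svd(J)[2][:2]
```

TangentProp then penalizes the classifier's sensitivity along exactly these `tangents`.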
DEEP BELIEF NETWORKS USING DISCRIMINATIVE FEATURES FOR PHONE RECOGNITION
Cited by 28 (3 self)
Deep Belief Networks (DBNs) are multi-layer generative models. They can be trained to model windows of coefficients extracted from speech, and they discover multiple layers of features that capture the higher-order statistical structure of the data. These features can be used to initialize the hidden units of a feed-forward neural network that is then trained to predict the HMM state for the central frame of the window. Initializing with features that are good at generating speech makes the neural network perform much better than initializing with random weights. DBNs have already been used successfully for phone recognition with input coefficients that are MFCCs or filterbank outputs [1, 2]. In this paper, we demonstrate that they work even better when their inputs are speaker-adaptive, discriminative features. On the standard TIMIT corpus, they give phone error rates of 19.6% using monophone HMMs and a bigram language model, and 19.4% using monophone HMMs and a trigram language model. Index Terms—Discriminative feature transformation, deep belief networks, phone recognition.