Results 1–10 of 14
Representation learning: A review and new perspectives.
 In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
, 2013
"... AbstractThe success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can b ..."
Abstract

Cited by 173 (4 self)
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide, to varying degrees, the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, about computing representations (i.e., inference), and about the geometrical connections between representation learning, density estimation, and manifold learning.
On the difficulty of training recurrent neural networks
"... There are two widely known issues with properly training recurrent neural networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geo ..."
Abstract

Cited by 42 (6 self)
There are two widely known issues with properly training recurrent neural networks: the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric, and a dynamical-systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient-norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing-gradients problem. We validate our hypothesis and proposed solutions empirically in the experimental section.
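The gradient-norm clipping strategy this abstract describes fits in a few lines; the sketch below assumes a plain list of gradient components, and the threshold value is an illustrative choice, not one taken from the paper:

```python
import math

def clip_gradient_norm(grad, threshold=1.0):
    # Rescale the gradient so its L2 norm never exceeds `threshold`.
    # Gradients already below the threshold pass through unchanged.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        return [g * threshold / norm for g in grad]
    return list(grad)

# A gradient of norm 5 is rescaled to norm 1; a small one is untouched.
clipped = clip_gradient_norm([3.0, 4.0], threshold=1.0)
```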
A Deep and Tractable Density Estimator
"... The Neural Autoregressive Distribution Estimator (NADE) and its realvalued version RNADE are competitive density models of multidimensional data across a variety of domains. These models use a fixed, arbitrary ordering of the data dimensions. One can easily condition on variables at the beginning ..."
Abstract

Cited by 17 (2 self)
The Neural Autoregressive Distribution Estimator (NADE) and its real-valued version RNADE are competitive density models of multidimensional data across a variety of domains. These models use a fixed, arbitrary ordering of the data dimensions. One can easily condition on variables at the beginning of the ordering and marginalize out variables at the end of the ordering; however, other inference tasks require approximate inference. In this work we introduce an efficient procedure to simultaneously train a NADE model for each possible ordering of the variables by sharing parameters across all these models. We can thus use the most convenient model for each inference task at hand, and ensembles of such models with different orderings are immediately available. Moreover, unlike the original NADE, our training procedure scales to deep models. Empirically, ensembles of Deep NADE models obtain state-of-the-art density estimation performance.
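One update step of the order-agnostic training idea described above can be sketched as follows; the function name is hypothetical, and the snippet only illustrates how a random ordering plus a random prefix/suffix split produces an "observed" mask, with dimensions before the split conditioned on and those after it predicted:

```python
import random

def sample_training_mask(d, rng=random):
    # Sample a random ordering of the d dimensions and a random split point.
    # Dimensions before the split are treated as observed; the rest are
    # prediction targets. Repeating this per update trains one
    # shared-parameter model for every ordering at once.
    order = list(range(d))
    rng.shuffle(order)
    split = rng.randrange(d)
    observed = [False] * d
    for i in order[:split]:
        observed[i] = True
    return order, observed

order, observed = sample_training_mask(8)
```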
Advances in optimizing recurrent networks
 In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
"... ar ..."
(Show Context)
New types of deep neural network learning for speech recognition and related applications: An overview
 in Proc. Int. Conf. Acoust., Speech, Signal Process.
, 2013
"... In this paper, we provide an overview of the invited and contributed papers presented at the special session at ICASSP2013, entitled “New Types of Deep Neural Network Learning for Speech Recognition and Related Applications, ” as organized by the authors. We also describe the historical context in ..."
Abstract

Cited by 11 (4 self)
In this paper, we provide an overview of the invited and contributed papers presented at the special session at ICASSP 2013, entitled “New Types of Deep Neural Network Learning for Speech Recognition and Related Applications,” as organized by the authors. We also describe the historical context in which acoustic models based on deep neural networks have been developed. The technical overview of the papers presented in our special session is organized into five ways of improving deep learning methods: (1) better optimization; (2) better types of neural activation function and better network architectures; (3) better ways to determine the myriad hyperparameters of deep neural networks; (4) more appropriate ways to preprocess speech for deep neural networks; and (5) ways of leveraging multiple languages or dialects that are more easily achieved with deep neural networks than with Gaussian mixture models. Index Terms: deep neural network, convolutional neural network, recurrent neural network, optimization, spectrogram features, multitask, multilingual, speech recognition, music processing
Training Neural Networks with Stochastic Hessian-Free Optimization
"... Hessianfree (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvaturevector products that can be computed on the same order of time as gradients. In this paper we e ..."
Abstract

Cited by 4 (0 self)
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed in roughly the same time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature minibatches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediate between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
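The dropout regularizer that the paper integrates can be sketched on its own; this is the common "inverted dropout" variant rather than necessarily the exact formulation used in the paper, and the rate is an illustrative choice:

```python
import random

def dropout(activations, rate=0.5, rng=random, train=True):
    # During training, zero each unit with probability `rate` and scale the
    # survivors by 1/(1 - rate), so no rescaling is needed at test time.
    # Randomly dropping units discourages co-adaptation of feature detectors.
    if not train or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

out = dropout([1.0, 2.0, 3.0, 4.0], rate=0.5)
```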
Learning Input and Recurrent Weight Matrices in Echo State Networks
"... Abstract The traditional echo state network (ESN) is a special type of a temporally deep model, the recurrent network (RNN), which carefully designs the recurrent matrix and fixes both the recurrent and input matrices in the RNN. The ESN also adopts the linear output (or readout) units to simplify ..."
Abstract

Cited by 2 (2 self)
The traditional echo state network (ESN) is a special type of temporally deep model, the recurrent neural network (RNN), which carefully designs the recurrent matrix and fixes both the recurrent and input matrices of the RNN. The ESN also adopts linear output (or readout) units to simplify the learning of the only trainable matrix in the RNN, the output matrix. In this paper, we devise a special technique that takes advantage of the linearity of the output units in the ESN to learn the input and recurrent matrices, which was not carried out in earlier ESNs owing to the well-known difficulty of learning them. Compared with the technique of Back-Propagation Through Time (BPTT) for learning general RNNs, our proposed technique makes use of the linearity of the output units to provide constraints among the various matrices in the RNN, enabling the computation of the gradients as the learning signal in analytical form rather than by recursion as in BPTT. Experimental results on phone state classification show that learning either or both of the input and recurrent matrices in the ESN is superior to the traditional ESN without learning them, especially when longer time steps are used in analytically computing the gradients.
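The linearity of the readout units that this paper exploits is the same property that gives the classic ESN a closed-form solution for its one trainable matrix. A minimal sketch of that standard ridge-regression readout step, with hypothetical names and an illustrative ridge value (this shows the conventional ESN baseline, not the paper's new gradient technique):

```python
import numpy as np

def train_esn_readout(states, targets, ridge=1e-8):
    # Closed-form ridge regression for the linear readout of an echo state
    # network: W = (S^T S + ridge * I)^{-1} S^T Y, where S stacks the
    # collected reservoir states (T, n_reservoir) and Y the desired
    # outputs (T, n_outputs).
    S = np.asarray(states)
    Y = np.asarray(targets)
    n = S.shape[1]
    return np.linalg.solve(S.T @ S + ridge * np.eye(n), S.T @ Y)
```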
How autoencoders could provide credit assignment in deep networks via target propagation
, 2014
"... We propose to exploit reconstruction as a layerlocal training signal for deep learning. Reconstructions can be propagated in a form of target propagation playing a role similar to backpropagation but helping to reduce the reliance on derivatives in order to perform credit assignment across many l ..."
Abstract

Cited by 2 (2 self)
We propose to exploit reconstruction as a layer-local training signal for deep learning. Reconstructions can be propagated in a form of target propagation, playing a role similar to backpropagation but helping to reduce the reliance on derivatives in order to perform credit assignment across many levels of possibly strong non-linearities (which is difficult for backpropagation). A regularized autoencoder tends to produce a reconstruction that is a more likely version of its input, i.e., a small move in the direction of higher likelihood. By generalizing gradients, target propagation may also make it possible to train deep networks with discrete hidden units. If the autoencoder takes both a representation of the input and of the target (or of any side information) as input, then its reconstruction of the input representation provides a target: a representation that is more likely, conditioned on all the side information. A deep autoencoder decoding path generalizes gradient propagation in a learned way that could thus handle not just infinitesimal changes but larger, discrete ones, hopefully allowing credit assignment through a long chain of non-linear operations. In addition to each layer being a good autoencoder, the encoder also learns to please the upper layers by transforming the data into a space that is easier for them to model, flattening manifolds and disentangling factors. The motivations and theoretical justifications for this approach are laid down in this paper, along with conjectures that will have to be verified either mathematically or experimentally, including a hypothesis stating that such autoencoder-mediated target propagation could play, in brains, the role of credit assignment through many non-linear, noisy, and discrete transformations.
Recurrent deep-stacking networks for sequence classification
 In IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP)
, 2014
"... ABSTRACT Deep Stacking Networks (DSNs) are constructed by stacking shallow feedforward neural networks on top of each other using concatenated features derived from the lower modules of the DSN and the raw input data. DSNs do not have recurrent connections, making them less effective to model and ..."
Abstract

Cited by 1 (1 self)
Deep Stacking Networks (DSNs) are constructed by stacking shallow feed-forward neural networks on top of each other, using concatenated features derived from the lower modules of the DSN and the raw input data. DSNs do not have recurrent connections, making them less effective at modeling and classifying input data with temporal dependencies. In this paper, we embed recurrent connections into the DSN, giving rise to Recurrent Deep Stacking Networks (RDSNs). Each module of the RDSN consists of a special form of recurrent neural network. Generalizing from the earlier DSN, the use of linearity in the output units of the RDSN enables us to derive a closed form for computing the gradient of the cost function with respect to all network matrices without back-propagating errors. Each module in the RDSN is initialized with an echo state network, whose input and recurrent weights are fixed so as to have the echo state property. All connection weights within the module are then fine-tuned using batch-mode gradient descent, where the gradient takes an analytical form. Experiments are performed on the TIMIT dataset for frame-level phone state classification with 183 classes. The results show that the RDSN gives higher classification accuracy than a single recurrent neural network without stacking.
A primal-dual method for training recurrent neural networks constrained by the echo-state property
 In ICLR
"... We present an architecture of a recurrent neural network (RNN) with a fullyconnected deep neural network (DNN) as its feature extractor. The RNN is equipped with both causal temporal prediction and noncausal lookahead, via autoregression (AR) and movingaverage (MA), respectively. The focus of t ..."
Abstract

Cited by 1 (1 self)
We present an architecture of a recurrent neural network (RNN) with a fully connected deep neural network (DNN) as its feature extractor. The RNN is equipped with both causal temporal prediction and non-causal look-ahead, via autoregression (AR) and moving-average (MA) components, respectively. The focus of this paper is a primal-dual training method that formulates the learning of the RNN as a formal optimization problem with an inequality constraint that provides a sufficient condition for the stability of the network dynamics. Experimental results demonstrate the effectiveness of this new method, which achieves an 18.86% phone recognition error rate on the core test set of the TIMIT benchmark. The result approaches the best result of 17.7%, which was obtained using an RNN with long short-term memory (LSTM). The results also show that the proposed primal-dual training method produces lower recognition errors than the popular RNN methods developed earlier that rely on a carefully tuned threshold parameter to heuristically prevent the gradient from exploding.