Results 1-10 of 136
Gradient-based learning applied to document recognition
Proceedings of the IEEE, 1998
"... Multilayer neural networks trained with the backpropagation algorithm constitute the best example of a successful gradientbased learning technique. Given an appropriate network architecture, gradientbased learning algorithms can be used to synthesize a complex decision surface that can classify hi ..."
Abstract

Cited by 1533 (84 self)
 Add to MetaCart
Multilayer neural networks trained with the backpropagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of two-dimensional (2D) shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTNs), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank check is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal checks. It is deployed commercially and reads several million checks per day.
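As a hedged illustration of the two structural ideas the abstract credits for convolutional networks' robustness to 2D variability (local receptive fields with shared weights, followed by subsampling), here is a minimal NumPy forward-pass sketch. It is not the paper's LeNet architecture; the 28x28 input, single 5x5 filter, and 2x2 average pooling are illustrative choices only.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.empty((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def avg_pool_2x2(fmap):
    """Subsampling: average over non-overlapping 2x2 blocks."""
    H, W = fmap.shape
    fmap = fmap[:H - H % 2, :W - W % 2]
    return fmap.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28))        # stand-in for a handwritten digit
kernel = 0.1 * rng.standard_normal((5, 5))   # one shared 5x5 filter (weight sharing)
feature_map = np.tanh(conv2d_valid(image, kernel))   # local receptive fields -> 24x24
pooled = avg_pool_2x2(feature_map)                   # subsampling -> 12x12
print(feature_map.shape, pooled.shape)
```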
Connectionist Learning Procedures
Artificial Intelligence, 1989
"... A major goal of research on networks of neuronlike processing units is to discover efficient learning procedures that allow these networks to construct complex internal representations of their environment. The learning procedures must be capable of modifying the connection strengths in such a way ..."
Abstract

Cited by 410 (9 self)
 Add to MetaCart
A major goal of research on networks of neuron-like processing units is to discover efficient learning procedures that allow these networks to construct complex internal representations of their environment. The learning procedures must be capable of modifying the connection strengths in such a way that internal units which are not part of the input or output come to represent important features of the task domain. Several interesting gradient-descent procedures have recently been discovered. Each connection computes the derivative, with respect to the connection strength, of a global measure of the error in the performance of the network. The strength is then adjusted in the direction that decreases the error. These relatively simple gradient-descent learning procedures work well for small tasks, and the new challenge is to find ways of improving their convergence rate and their generalization abilities so that they can be applied to larger, more realistic tasks.
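A minimal sketch of the gradient-descent procedure the abstract describes: each connection strength is adjusted along the negative derivative of a global error measure, so that hidden ("internal") units come to encode features of the task. The toy XOR task, layer sizes, and learning rate below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
t = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets

W1 = rng.standard_normal((2, 8)); b1 = np.zeros(8)   # input  -> hidden strengths
W2 = rng.standard_normal((8, 1)); b2 = np.zeros(1)   # hidden -> output strengths
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for _ in range(10_000):
    h = sigmoid(X @ W1 + b1)                 # internal (hidden) representations
    y = sigmoid(h @ W2 + b2)                 # network output
    delta_out = (y - t) * y * (1 - y)        # derivative of squared error at the output
    delta_hid = (delta_out @ W2.T) * h * (1 - h)   # backpropagated to hidden units
    W2 -= lr * h.T @ delta_out; b2 -= lr * delta_out.sum(axis=0)
    W1 -= lr * X.T @ delta_hid; b1 -= lr * delta_hid.sum(axis=0)

y = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print(np.round(y.ravel(), 2))   # should approach [0, 1, 1, 0]
```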
Hidden Markov processes
IEEE Trans. Inform. Theory, 2002
"... Abstract—An overview of statistical and informationtheoretic aspects of hidden Markov processes (HMPs) is presented. An HMP is a discretetime finitestate homogeneous Markov chain observed through a discretetime memoryless invariant channel. In recent years, the work of Baum and Petrie on finite ..."
Abstract

Cited by 264 (5 self)
 Add to MetaCart
(Show Context)
An overview of statistical and information-theoretic aspects of hidden Markov processes (HMPs) is presented. An HMP is a discrete-time finite-state homogeneous Markov chain observed through a discrete-time memoryless invariant channel. In recent years, the work of Baum and Petrie on finite-state finite-alphabet HMPs was expanded to HMPs with finite as well as continuous state spaces and a general alphabet. In particular, statistical properties and ergodic theorems for relative entropy densities of HMPs were developed. Consistency and asymptotic normality of the maximum-likelihood (ML) parameter estimator were proved under some mild conditions. Similar results were established for switching autoregressive processes. These processes generalize HMPs. New algorithms were developed for estimating the state, parameter, and order of an HMP, for universal coding and classification of HMPs, and for universal decoding of hidden Markov channels. These and other related topics are reviewed in this paper. Index Terms—Baum–Petrie algorithm, entropy ergodic theorems, finite-state channels, hidden Markov models, identifiability, Kalman filter, maximum-likelihood (ML) estimation, order estimation, recursive parameter estimation, switching autoregressive processes, Ziv inequality.
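A small sketch of the HMP definition above, under illustrative (invented) 2-state, 2-symbol parameters: a finite-state homogeneous Markov chain is sampled, each state is observed through a memoryless channel, and the standard scaled forward recursion evaluates the likelihood of the observed sequence.

```python
import numpy as np

A  = np.array([[0.9, 0.1],     # hidden Markov chain: state transition probabilities
               [0.2, 0.8]])
B  = np.array([[0.7, 0.3],     # memoryless channel: P(observed symbol | hidden state)
               [0.1, 0.9]])
pi = np.array([0.5, 0.5])      # initial state distribution

rng = np.random.default_rng(2)
T = 200
obs = []
s = rng.choice(2, p=pi)
for _ in range(T):
    obs.append(rng.choice(2, p=B[s]))   # observe the state through the channel
    s = rng.choice(2, p=A[s])           # homogeneous Markov transition

# Scaled forward recursion: log P(obs_1..T) under the model.
alpha = pi * B[:, obs[0]]
c = alpha.sum(); alpha /= c; loglik = np.log(c)
for t in range(1, T):
    alpha = (alpha @ A) * B[:, obs[t]]
    c = alpha.sum(); alpha /= c; loglik += np.log(c)
print("log-likelihood of the observed sequence:", loglik)
```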
Network Information Criterion - Determining the Number of Hidden Units for an Artificial Neural Network Model
IEEE Transactions on Neural Networks, 1994
"... The problem of model selection, or determination of the number of hidden units, can be approached statistically, by generalizing Akaike's information criterion (AIC) to be applicable to unfaithful (i.e., unrealizable) models with general loss criteria including regularization terms. The relatio ..."
Abstract

Cited by 182 (8 self)
 Add to MetaCart
(Show Context)
The problem of model selection, or determination of the number of hidden units, can be approached statistically by generalizing Akaike's information criterion (AIC) to be applicable to unfaithful (i.e., unrealizable) models with general loss criteria, including regularization terms. The relation between the training error and the generalization error is studied in terms of the number of training examples and the complexity of a network, which reduces to the number of parameters in the ordinary statistical theory of the AIC. This relation leads to a new Network Information Criterion (NIC), which is useful for selecting the optimal network model based on a given training set.
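The following sketch shows the flavor of criterion-based selection of the number of hidden units. For brevity it uses a plain AIC-style penalty (parameter count over sample size); the NIC of the abstract replaces that count with a trace term that stays valid for unrealizable models and regularized losses. The toy regression task, network sizes, and training schedule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
x = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(x) + 0.1 * rng.standard_normal((N, 1))     # toy regression task

def train_mlp(hidden, steps=4000, lr=0.1):
    """Fit a 1-hidden-layer tanh network by full-batch gradient descent."""
    W1 = 0.5 * rng.standard_normal((1, hidden)); b1 = np.zeros(hidden)
    W2 = 0.5 * rng.standard_normal((hidden, 1))
    for _ in range(steps):
        h = np.tanh(x @ W1 + b1)
        err = h @ W2 - y
        gW2 = h.T @ err / N
        gh = (err @ W2.T) * (1 - h ** 2)
        W1 -= lr * (x.T @ gh / N); b1 -= lr * gh.mean(axis=0); W2 -= lr * gW2
    mse = float(np.mean((np.tanh(x @ W1 + b1) @ W2 - y) ** 2))
    return mse, W1.size + b1.size + W2.size

for hidden in (1, 2, 4, 8, 16):
    mse, k = train_mlp(hidden)
    crit = np.log(mse) + 2 * k / N          # AIC-style: fit term + complexity penalty
    print(f"hidden={hidden:2d}  train MSE={mse:.4f}  criterion={crit:.3f}")
```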
Adaptive On-Line Learning Algorithms for Blind Separation - Maximum Entropy and Minimum Mutual Information
Neural Computation, 1997
"... There are two major approaches for blind separation: Maximum Entropy (ME) and Minimum Mutual Information (MMI). Both can be implemented by the stochastic gradient descent method for obtaining the demixing matrix. The MI is the contrast function for blind separation while the entropy is not. To just ..."
Abstract

Cited by 133 (16 self)
 Add to MetaCart
There are two major approaches for blind separation: Maximum Entropy (ME) and Minimum Mutual Information (MMI). Both can be implemented by the stochastic gradient descent method for obtaining the demixing matrix. The MI is the contrast function for blind separation, while the entropy is not. To justify the ME, the relation between ME and MMI is first elucidated by calculating the first derivative of the entropy and proving that 1) mean subtraction is necessary in applying the ME and 2) at the solution points determined by the MI, the ME will not update the demixing matrix in the directions of increasing the cross-talking. Secondly, the natural gradient, instead of the ordinary gradient, is introduced to obtain efficient algorithms, because the parameter space is a Riemannian space consisting of matrices. The mutual information is calculated by applying the Gram-Charlier expansion to approximate the probability density functions of the outputs. Finally, we propose an efficient learn...
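A hedged sketch of the natural-gradient separation rule discussed above, W <- W + eta (I - phi(y) y^T) W, applied to zero-mean, prewhitened mixtures. The tanh score function and the Laplacian (super-Gaussian) toy sources are illustrative stand-ins; the paper derives its nonlinearity from a Gram-Charlier expansion of the output densities.

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 3, 5000
S = rng.laplace(size=(n, T))                 # independent super-Gaussian sources
A = rng.standard_normal((n, n))              # unknown mixing matrix
X = A @ S                                    # observed mixtures

X -= X.mean(axis=1, keepdims=True)           # mean subtraction (point 1 above)
d, E = np.linalg.eigh(np.cov(X))
X = (E @ np.diag(d ** -0.5) @ E.T) @ X       # prewhitening, for numerical stability

W = np.eye(n)
eta = 0.1
for _ in range(400):
    Y = W @ X
    phi = np.tanh(Y)                         # illustrative score function
    W += eta * (np.eye(n) - (phi @ Y.T) / T) @ W   # natural-gradient step

Y = W @ X
corr = np.abs(np.corrcoef(np.vstack([Y, S]))[:n, n:])
print(np.round(corr, 2))   # near a permutation matrix if separation succeeded
```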
Support vector machines for speech recognition
Proceedings of the International Conference on Spoken Language Processing, 1998
"... Statistical techniques based on hidden Markov Models (HMMs) with Gaussian emission densities have dominated signal processing and pattern recognition literature for the past 20 years. However, HMMs trained using maximum likelihood techniques suffer from an inability to learn discriminative informati ..."
Abstract

Cited by 117 (2 self)
 Add to MetaCart
Statistical techniques based on hidden Markov models (HMMs) with Gaussian emission densities have dominated the signal processing and pattern recognition literature for the past 20 years. However, HMMs trained using maximum likelihood techniques suffer from an inability to learn discriminative information and are prone to overfitting and overparameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. In this paper, we show that SVMs provide a significant improvement in performance on a static pattern classification task based on the Deterding vowel data. We also describe an application of SVMs to large vocabulary speech recognition and demonstrate an improvement in error rate on a continuous alphadigit task (OGI Alphadigits) and a large vocabulary conversational speech task (Switchboard). Issues related to the development and optimization of an SVM/HMM hybrid system are discussed.
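A minimal sketch of the static-classification part of the story, assuming scikit-learn's SVC as a stand-in for the authors' own SVM implementation and a synthetic multi-class dataset in place of the Deterding vowel data; the kernel and C value are illustrative, not the paper's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic multi-class data standing in for the vowel task (assumption).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_tr)
# The margin/penalty trade-off (C) and kernel width control generalization,
# which is the property the abstract contrasts with ML-trained HMMs.
clf = SVC(kernel="rbf", C=10.0, gamma="scale")
clf.fit(scaler.transform(X_tr), y_tr)
print("test accuracy:", clf.score(scaler.transform(X_te), y_te))
```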
Parallelized stochastic gradient descent
Advances in Neural Information Processing Systems 23, 2010
"... Abstract With the increase in available data parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization a ..."
Abstract

Cited by 97 (4 self)
 Add to MetaCart
(Show Context)
With the increase in available data, parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm, including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms ...
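A hedged sketch of one parameter-averaging scheme in the spirit of parallel SGD: the data are split over k workers, each runs an independent SGD pass over its own shard, and the resulting weight vectors are averaged. The linear-regression task and the sequential simulation of workers are assumptions made for brevity; a real deployment would run the workers as separate processes.

```python
import numpy as np

rng = np.random.default_rng(5)
d, N, k = 20, 100_000, 4
w_true = rng.standard_normal(d)
X = rng.standard_normal((N, d))
y = X @ w_true + 0.1 * rng.standard_normal(N)     # linear-regression training data

def sgd_worker(X_shard, y_shard, lr=0.01):
    """One worker: a single SGD pass over its own shard, starting from zero."""
    w = np.zeros(d)
    for i in rng.permutation(len(y_shard)):
        g = (X_shard[i] @ w - y_shard[i]) * X_shard[i]   # per-example gradient
        w -= lr * g
    return w

shards = np.array_split(np.arange(N), k)                  # disjoint data split
w_avg = np.mean([sgd_worker(X[idx], y[idx]) for idx in shards], axis=0)
print("parameter error after averaging:", np.linalg.norm(w_avg - w_true))
```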
Adaptive blind signal processing - neural network approaches
Proc. of the IEEE, 1998
"... Learning algorithms and underlying basic mathematical ideas are presented for the problem of adaptive blind signal processing, especially instantaneous blind separation and multichannel blind deconvolution/equalization of independent source signals. We discuss recent developments of adaptive learnin ..."
Abstract

Cited by 61 (9 self)
 Add to MetaCart
(Show Context)
Learning algorithms and underlying basic mathematical ideas are presented for the problem of adaptive blind signal processing, especially instantaneous blind separation and multichannel blind deconvolution/equalization of independent source signals. We discuss recent developments of adaptive learning algorithms based on the natural gradient approach and their properties concerning convergence, stability, and efficiency. Several promising schemes are proposed and reviewed in the paper. Emphasis is given to neural networks or adaptive filtering models and associated online adaptive nonlinear learning algorithms. Computer simulations illustrate the performance of the developed algorithms. Some results presented in this paper are new and are being published for the first time.
Neural Learning in Structured Parameter Spaces - Natural Riemannian Gradient
Advances in Neural Information Processing Systems, 1997
"... The parameter space of neural networks has the Riemannian metric structure. The natural Riemannian gradient should be used instead of the conventional gradient, since the former denotes the steepest descent direction of a loss function in the Riemannian space. The behavior of the stochastic gradient ..."
Abstract

Cited by 59 (6 self)
 Add to MetaCart
The parameter space of neural networks has a Riemannian metric structure. The natural Riemannian gradient should be used instead of the conventional gradient, since the former denotes the steepest descent direction of a loss function in the Riemannian space. The behavior of the stochastic gradient learning algorithm is much more effective if the natural gradient is used. The present paper studies the information-geometrical structure of perceptrons and other networks, and proves that the online learning method based on the natural gradient is asymptotically as efficient as the optimal batch algorithm. Adaptive modification of the learning constant is proposed, analyzed in terms of the Riemannian measure, and shown to be efficient. The natural gradient is finally applied to blind separation of mixed independent signal sources.
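A small sketch of online natural-gradient learning for a linear-Gaussian model, where the Fisher information reduces to E[x x^T]: the per-example gradient is preconditioned by the inverse of a running estimate of that matrix, i.e. by the Riemannian metric of the parameter space. The badly scaled input dimension, step-size schedule, and problem sizes are illustrative assumptions, not those analysed in the paper.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 10
w_true = rng.standard_normal(d)
w = np.zeros(d)
G = np.eye(d)                        # running Fisher-information estimate

for t in range(1, 20_001):
    x = rng.standard_normal(d)
    x[0] *= 5.0                      # one badly scaled input: slow for plain SGD
    y = x @ w_true + 0.1 * rng.standard_normal()
    grad = (x @ w - y) * x           # ordinary (Euclidean) per-example gradient
    G += (np.outer(x, x) - G) / (t + 1)                # online average of x x^T (the metric)
    w -= (1.0 / (t + 10)) * np.linalg.solve(G, grad)   # natural-gradient step

print("parameter error:", np.linalg.norm(w - w_true))
```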