McDermott, E. (1997). Discriminative Training for Speech Recognition. Doctoral thesis, Waseda University, Tokyo.
....the case of insufficient amount of training material, despite the fact that clever addition of material may make a substantial contribution to the solution of remaining problems. In order to tackle the first problem, discriminative training techniques such as Minimum Classification Error (MCE, see [1]) training were introduced as a powerful means to learn to separate the acoustic classes optimally, rather than to estimate the true distributions that lie underneath these classes. In MCE this is accomplished by taking an initial set of ML trained HMMs and defining a loss function over the ....
....loss computation and model adaptation are iterated until loss decrease drops below some convergence threshold. Model parameter estimation is now directly related to the ultimate goal of the models, viz. to make the optimal classification decision. Substantial accuracy improvements were reported in [1]. Moreover, discriminative training may yield an especially effective way to tackle the problem of recognising highly confusable words. However, the problem of mismatch in training and operational circumstances is not addressed by MCE. In order to learn more about this problem, we focus on a ....
[Article contains additional citation context not shown here]
E. McDermott, Discriminative Training for Speech Recognition, Ph.D. Thesis, Waseda Japan, 1997, http://www.hip.atr.co.jp/~mcd/mcd_thesis.ps.gz
....that explicitly incorporates classification performance into the training criterion. Given discriminant functions for each category, MCE defines a loss function that is a smoothed approximation of the recognition error rate, and then uses this function as the criterion function for optimization [4, 5, 6, 7]. Through minimization of this criterion function, MCE is aimed directly at minimizing classification error rather than at learning the true data probability distributions, the target of Maximum Likelihood Estimation (MLE). MCE has been used successfully in various pattern recognition tasks [1, 2, ....
....to unseen data, and commonly used in existing MCE and Maximum Mutual Information (MMI) speech recognition studies, can be rigorously linked to the theoretical classification risk. 2. THE MINIMUM CLASSIFICATION ERROR FRAMEWORK The MCE framework has been described in several publications [4, 6, 7]. For each training token, MCE maps a training pattern token x and the system parameters (e.g. all the hidden Markov model means and covariances) to a 0-1 loss function reflecting classification error. The pattern x could be a single pattern vector or a sequence of, e.g. speech-derived feature ....
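The excerpts above describe the MCE loss only in words: a smoothed stand-in for the 0-1 classification error, computed from per-category discriminant functions. The following is a minimal sketch of one commonly used form, not the formulation of any of the cited papers. It assumes a misclassification measure that compares the correct category's score against a soft maximum over its rivals, passed through a sigmoid. The function name `mce_loss` and the smoothing parameters `eta` and `alpha` are hypothetical and do not appear in the excerpts.

```python
import math

def mce_loss(scores, correct, eta=1.0, alpha=1.0):
    """Smoothed 0-1 loss for one training token (illustrative sketch).

    scores  : discriminant function values, one per category
    correct : index of the token's true category
    eta     : assumed sharpness of the soft maximum over rival categories
    alpha   : assumed steepness of the sigmoid
    """
    g_correct = scores[correct]
    rivals = [s for i, s in enumerate(scores) if i != correct]

    # Soft maximum over the rivals: (1/eta) * log(mean(exp(eta * g_j))),
    # computed relative to the largest rival score for numerical stability.
    m = max(rivals)
    mean_exp = sum(math.exp(eta * (s - m)) for s in rivals) / len(rivals)
    soft_max_rival = m + math.log(mean_exp) / eta

    # Misclassification measure: negative when the correct category wins.
    d = -g_correct + soft_max_rival

    # Sigmoid maps d to (0, 1): near 0 for a confident correct decision,
    # near 1 for a confident error, exactly 0.5 on the decision boundary.
    return 1.0 / (1.0 + math.exp(-alpha * d))
```

Because the loss is differentiable in the scores, it can be minimized by gradient-based methods. Larger values of the assumed `alpha` bring it closer to a hard error count.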
[Article contains additional citation context not shown here]
McDermott, E. (1997). Discriminative Training for Speech Recognition. Doctoral thesis, Waseda University, Tokyo.
....explicitly incorporates classification performance into the training criterion. Given discriminant functions for each category, MCE defines a loss function that is a smoothed approximation of the recognition error rate and uses this function as the criterion function for optimization [3][5][6][7]. Through minimization of this criterion function, MCE is aimed directly at minimizing classification error rather than at learning the true data probability distributions, the target of Maximum Likelihood Estimation (MLE) via Baum-Welch or Viterbi training. MCE has been used successfully to train ....
....MCE is aimed directly at minimizing classification error rather than at learning the true data probability distributions, the target of Maximum Likelihood Estimation (MLE) via Baum-Welch or Viterbi training. MCE has been used successfully to train Hidden Markov Models for speech recognition tasks [1][7][8]. Here, a new theoretical perspective on MCE is presented. This addresses the nature of the smoothness of the MCE loss function, as well as the relationship between minimization of an overall MCE loss summed over a finite set of training data and minimization of the theoretical classification ....
[Article contains additional citation context not shown here]
McDermott, E. (1997). Discriminative Training for Speech Recognition. Doctoral thesis, Waseda University, Tokyo.
....aspect of MECS is its implementation of HMM optimization based on the Minimum Classification Error (MCE) framework, this chapter devotes several sections to the theoretical background of MCE and to its specific application to HMMs. For a fuller description of MCE optimization of HMMs, see [McDermott, 1997]. 6.1 Hidden Markov models In accordance with the general properties of a speech recognizer described in Chapter 1, hidden Markov models can be used for dynamic pattern recognition, involving the recognition of sequences of pattern vectors. HMMs are a simple way of extending a static ....
....adaptation. We here describe one practical approach to second order optimization, the Quickprop algorithm [Fahlman, 1988]. This algorithm is easy to implement and parallelize, and illustrates the basic principles of Newton's method. It has been used to train both Multi-Layer Perceptrons and HMMs [McDermott, 1997]. Quickprop can be seen as a rough approximation to the classic Newton's method. The central idea in Newton's method is to build a model M of the function of interest F using the first three terms of the Taylor series expansion of the function, around the current point, for a given step ....
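The excerpt above breaks off before giving any update rule. As a rough illustration of the idea it describes, the sketch below minimizes a one-dimensional function with a Quickprop-style step: the curvature in the quadratic model is estimated from the change in gradient between two successive points (a secant approximation), and the next step jumps to the minimum of that model. This is a simplified sketch under stated assumptions, not the procedure of the cited works. The function name `quickprop_minimize`, the learning rate `lr`, the growth cap `max_growth` and the stopping tolerance `tol` are illustrative choices.

```python
def quickprop_minimize(grad, w0, lr=0.1, max_growth=1.75, steps=20, tol=1e-12):
    """Minimise a 1-D function from its gradient with a Quickprop-style step.

    grad       : callable returning the gradient at a point
    w0         : starting point
    lr         : assumed learning rate for the initial (and fallback) step
    max_growth : assumed cap on how much a step may grow relative to the last
    """
    w = w0
    prev_grad = grad(w)
    step = -lr * prev_grad              # first move: plain gradient descent
    w += step

    for _ in range(steps):
        g = grad(w)
        if abs(g) < tol:                # gradient is effectively zero: stop
            break
        denom = prev_grad - g
        if denom != 0.0 and step != 0.0:
            # Secant estimate of the Newton step: jump to the minimum of the
            # parabola fitted through the last two gradient values.
            new_step = step * g / denom
            # Limit the step so a poor curvature estimate cannot blow up.
            limit = max_growth * abs(step)
            new_step = max(-limit, min(limit, new_step))
        else:
            new_step = -lr * g          # fall back to gradient descent
        prev_grad, step = g, new_step
        w += step
    return w
```

For example, `quickprop_minimize(lambda w: 2.0 * (w - 3.0), 0.0)` moves towards the minimum of (w - 3)^2 at w = 3. On an exactly quadratic function the secant estimate of the curvature is exact, which is why only a few steps are needed.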
[Article contains additional citation context not shown here]
McDermott, E. (1997). Discriminative Training for Speech Recognition. PhD Thesis, School of Science and Engineering, Waseda University, Japan.
....of time frames ago. 4. FIRST EVALUATIONS We carried out preliminary evaluations on continuous phoneme recognition using Viterbi trained context-dependent acoustical models and trigram phoneme language models for the Timit database. For further reading on other Timit results, consult [1]. The context-dependent models are right biphone models of three states in a linear topology, clustered according to Tree Based State Clustering resulting in 1,760 physical states. The probability distribution of each of these states is modeled as a mixture of 10 Gaussian distributions. The test ....
E. McDermott, "Discriminative Training for Speech Recognition," Doctoral thesis, Waseda University, 1997.
....Quickprop algorithm We here describe one practical approach to second order optimization, the Quickprop algorithm [14]. This algorithm is easy to implement and parallelize, and illustrates the basic principles of Newton's method. It has been used to train both Multi-Layer Perceptrons and HMMs [15], and may be an attractive method for prototype based classifier design. Quickprop can be seen as a rough approximation to the classic Newton's method. The central idea in Newton's method is to build a model M of the function of interest F using the first three terms of the Taylor series ....
....5.11 Given this discriminant function, the misclassification measure (Eq. 5.52) and loss function (Eq. 5.53) used earlier can be adopted. As before, the gradient of the loss function with respect to all classifier parameters can be calculated using the chain rule. The reader should refer to [15] for the exact parameter updates and their derivation; they closely follow those for HMMs, given later on in this chapter. As in the MCE trained DTW classifier examined earlier, the parameter update becomes significantly simpler by assuming large values for (entailing that only the nearest ....
[Article contains additional citation context not shown here]
E. McDermott. Discriminative Training for Speech Recognition. PhD thesis, Waseda University, March 1997.