Support vector machines for speech recognition
- Proceedings of the International Conference on Spoken Language Processing
, 1998
Cited by 47 (2 self)
Statistical techniques based on hidden Markov models (HMMs) with Gaussian emission densities have dominated the signal processing and pattern recognition literature for the past 20 years. However, HMMs trained using maximum likelihood techniques suffer from an inability to learn discriminative information and are prone to overfitting and over-parameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. In this paper, we show that SVMs provide a significant improvement in performance on a static pattern classification task based on the Deterding vowel data. We also describe an application of SVMs to large vocabulary speech recognition, and demonstrate an improvement in error rate on a continuous alphadigit task (OGI Alphadigits) and a large vocabulary conversational speech task (Switchboard). Issues related to the development and optimization of an SVM/HMM hybrid system are discussed.
Localized spectro-temporal features for automatic speech recognition
- Proc. Eurospeech
, 2003
Cited by 11 (0 self)
Recent results from physiological and psychoacoustic studies indicate that spectrally and temporally localized time-frequency envelope patterns form a relevant basis of auditory perception. This motivates new approaches to feature extraction for automatic speech recognition (ASR) which utilize two-dimensional spectro-temporal modulation filters. The paper provides the motivation for, and a brief overview of, the work related to Localized Spectro-Temporal Features (LSTF). It further focuses on the Gabor feature approach, where a feature selection scheme is applied to automatically obtain a suitable set of Gabor-type features for a given task. The optimized feature sets are examined in ASR experiments with respect to robustness, and their statistical properties are analyzed.
Speech recognition with support vector machines in a hybrid system
- in Proc. EuroSpeech
, 2005
Cited by 10 (6 self)
While the temporal dynamics of speech can be represented very efficiently by Hidden Markov Models (HMMs), the classification of speech into single speech units (phonemes) is usually done with Gaussian mixture models, which do not discriminate well. Here, we use Support Vector Machines (SVMs) for classification by integrating this method into an HMM-based speech recognition system. In this hybrid SVM/HMM system we translate the outputs of the SVM classifiers into conditional probabilities and use them as emission probabilities in an HMM-based decoder. SVMs are very appealing due to their association with statistical learning theory. They have already shown very good classification results in other fields of pattern recognition. We train and test our hybrid system on the DARPA Resource Management (RM1) corpus. Our results show better performance than an HMM-based decoder using Gaussian mixtures.
Refining Hidden Markov Models with Recurrent Neural Networks
, 1999
Cited by 3 (1 self)
Both hidden Markov models (HMMs) and recurrent neural networks (RNNs) have been applied to sequence recognition problems. While HMMs are easy to train, they generally do not perform satisfactorily on difficult recognition problems. On the other hand, RNNs are excellent recognizers but are very hard to train. Hybrid HMM/NN approaches aim at taking advantage of the strengths of both paradigms while avoiding their respective weaknesses. This paper proposes a novel approach to combining HMMs with RNNs. We discuss an algorithm for directly mapping a trained HMM into an RNN architecture and derive a gradient-descent learning algorithm for refining the knowledge.
A hybrid HMM-based speech recognizer using kernel-based discriminants as acoustic models
- In 18th International Conference on Pattern Recognition (ICPR 2006), 20-24 August 2006, Hong Kong
, 2006
Cited by 3 (0 self)
In this paper we propose a novel, computationally efficient order-recursive training algorithm for kernel-based discriminants. We integrate this method into a hybrid HMM-based speech recognition system by translating the outputs of the kernel-based classifier into class-conditional probabilities and using them instead of Gaussian mixtures as production probabilities of an HMM-based decoder for speech recognition. The performance of the described hybrid structure is demonstrated on the DARPA Resource Management (RM1) corpus.
Kernel Fisher discriminants as acoustic models in HMM-based speech recognition
- in 10th International Conference on Speech and Computer
, 2005
Cited by 2 (1 self)
While the temporal dynamics of speech can be handled very efficiently by Hidden Markov Models (HMMs), the classification of the single speech units (phonemes) is usually done with Gaussian probability density functions, which are not discriminative. In this paper we use the Kernel Fisher Discriminant (KFD) for classification by integrating this method into an HMM-based speech recognition system. In this structure we translate the outputs of the KFD into class-conditional probabilities and use them as production probabilities in an HMM-based speech decoder. The KFD has already shown good classification results in other fields (e.g., pattern recognition). To also obtain good performance in terms of computational complexity, the KFD is implemented iteratively with a sparse greedy approach. We train and test the described hybrid structure on the Resource Management (RM1) task.
The State Based Mixture of Expert HMM with Applications to the Recognition of Spontaneous Speech
, 2001
Cited by 1 (0 self)
Dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy. Although the performance of speech recognition systems has increased substantially over the last decades, there still remain a number of tasks which pose considerable problems for current state-of-the-art techniques. One of these tasks is the recognition of spontaneous speech, which differs from read or planned speech in that its underlying dynamics change frequently over time. The negative effect of changes in acoustic background condition on recognition performance can also be observed in other situations, for instance in the case of speech that is corrupted by non-stationary noise. This thesis is concerned with the development of an acoustic model for speech recognition which automatically detects changes in the background condition of a signal and compensates for the model-data mismatch by combining the information of several expert models. These experts are specialised on the different acoustic conditions under consideration and their influence on the recognition process is determined by how well their associated condition matches
A Speaker Independent Continuous Speech Recognizer for Amharic
Cited by 1 (0 self)
The paper discusses an Amharic speaker-independent continuous speech recognizer based on an HMM/ANN hybrid approach. The model was constructed at a context-dependent phone part sub-word level with the help of the CSLU Toolkit. A promising result of 74.28% word and 39.70% sentence recognition rate was achieved. These are the best figures reported so far for speech recognition for the Amharic language.
Training HMM/ANN Hybrid Speech Recognizers by Probabilistic Sampling
Cited by 1 (0 self)
Most machine learning algorithms are sensitive to class imbalances of the training data and tend to behave inaccurately on classes represented by only a few examples. The case of neural nets applied to speech recognition is no exception, but this situation is unusual in the sense that the neural nets here act as posterior probability estimators and not as classifiers. Most remedies designed to handle the class imbalance problem in classification invalidate the proof that justifies the use of neural nets as posterior probability models. In this paper we examine one of these, the training scheme called probabilistic sampling, and show that it is fortunately still applicable. First, we argue that theoretically it makes the net estimate scaled class-conditionals instead of class posteriors, but for the hidden Markov model speech recognition framework this causes no problems, and in fact fits it even better. Second, we carry out experiments to show the feasibility of this training scheme. In the experiments we create and examine a transition between the conventional and the class-based sampling, knowing that in practice the conditions of the mathematical proofs are unrealistic. The results show that the optimal performance can indeed be attained somewhere in between, and is slightly better than the scores obtained in the traditional way.
Lexical Structure for Dialogue Act Recognition
Cited by 1 (0 self)
This paper deals with automatic dialogue act (DA) recognition in Czech. Dialogue acts are sentence-level labels that represent different states of a dialogue, such as questions, hesitations,... In our application, a multimodal reservation system, four dialogue acts are considered: statements, orders, yes/no questions and other questions. The main contribution of this work is to propose and compare several approaches that recognize dialogue acts based on three types of information: lexical information, prosody and word positions. These approaches are tested on a Czech Railways corpus that contains human-human dialogues, which are transcribed both manually and with an automatic speech recognizer for comparison. The experimental results confirm that every type of feature (lexical, prosodic and word positions) brings relevant and somewhat complementary information. The proposed methods that take into account word positions are especially interesting, as they bring global information about the structure of the sentence, in contrast to traditional n-gram models that only capture local cues. When word sequences are estimated from a speech recognizer, the resulting decrease in accuracy of all proposed approaches is very small (about 3%), which confirms the capability of the proposed approaches to perform well in real applications. Index Terms: dialogue act, language model, prosody, sentence structure, speech recognition

