Results 1 - 10 of 21
A tutorial on hidden Markov models and selected applications in speech recognition
Proceedings of the IEEE, 1989
Cited by 3117 (0 self)
Although initially introduced and studied in the late 1960s and early 1970s, statistical methods of Markov source or hidden Markov modeling have become increasingly popular in the last several years. There are two strong reasons why this has occurred. First, the models are very rich in mathematical structure and hence can form the theoretical basis for use in a wide range of applications. Second, the models, when applied properly, work very well in practice for several important applications. In this paper we attempt to carefully and methodically review the theoretical aspects of this type of statistical modeling and show how they have been applied to selected problems in machine recognition of speech.
Signal modeling techniques in speech recognition
Proceedings of the IEEE, 1993
Cited by 99 (5 self)
We have seen three important trends develop in the last five years in speech recognition. First, heterogeneous parameter sets that mix absolute spectral information with dynamic, or time-derivative, spectral information have become common. Second, similarity transform techniques, often used to normalize and decorrelate parameters in some computationally inexpensive way, have become popular. Third, the signal parameter estimation problem has merged with the speech recognition process so that more sophisticated statistical models of the signal’s spectrum can be estimated in a closed-loop manner. In this paper, we review the signal processing components of these algorithms. These algorithms are presented as part of a unified view of the signal parameterization problem in which there are three major tasks: measurement, transformation, and statistical modeling. This paper is by no means a comprehensive survey of all possible techniques of signal modeling in speech recognition. There are far too many algorithms in use today to make an exhaustive survey feasible (and cohesive). Instead, this paper is meant to serve as a tutorial on signal processing in state-of-the-art speech recognition systems and to review those techniques most commonly used. In keeping with this goal, a complete mathematical description of each algorithm has been included in the paper.
A Maximum-Likelihood Approach to Stochastic Matching for Robust Speech Recognition
IEEE Transactions on Speech and Audio Processing, 1996
Cited by 86 (14 self)
Ananth Sankar and Chin-Hui Lee, Speech Research Department, AT&T Bell Laboratories, Murray Hill, NJ 07974. Recently there has been much interest in the problem of improving the performance of automatic speech recognition (ASR) systems in adverse environments. When there is a mismatch between the training and testing environments, ASR systems suffer a degradation in performance. The goal of robust speech recognition is to remove the effect of this mismatch so as to bring the recognition performance as close as possible to the matched conditions. In speech recognition, the speech is usually modeled by a set of hidden Markov models (HMM) X. During recognition the observed utterance Y is decoded using these models. Due to the mismatch between training and testing conditions, this often results in a degradation in performance compared to the matched conditions. The mismatch b...
Survey of the State of the Art in Human Language Technology
1995
Cited by 47 (0 self)
Contents
1 Spoken Language Input (Ron Cole & Victor Zue, chapter editors)
1.1 Overview (Victor Zue & Ron Cole)
1.2 Speech Recognition (Victor Zue, Ron Cole, & Wayne Ward)
1.3 Signal Representation (Melvyn J. Hunt)
1.4 Robust Speech Recognition (Richard M. Stern)
1.5 HMM Methods in Speech Recognition (Renato De Mori & Fabio Brugnara)
1.6 Language Representation (Salim Roukos)
1.7 Speaker Recognition ...
A Model of Dynamic Auditory Perception and Its Application to Robust Word Recognition
1997
Cited by 22 (7 self)
This paper describes two mechanisms that augment the common automatic speech recognition (ASR) front end and provide adaptation and isolation of local spectral peaks. A dynamic model consisting of a linear filterbank with a novel additive logarithmic adaptation stage after each filter output is proposed. An extensive series of perceptual forward masking experiments, together with previously reported forward masking data, determine the model's dynamic parameters. Once parameterized, the simple exponential dynamic mechanism predicts the nature of forward masking data from several studies across wide ranging frequencies, input levels, and probe delay times. An initial evaluation of the dynamic model together with a local peak isolation mechanism as a front end for dynamic time warp (DTW) and hidden Markov model (HMM) word recognition systems shows an improvement in robustness to background noise when compared to Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), and relative spectra (RASTA) based front ends.
Spectral Signal Processing for ASR
Proc. ASRU’99, 1999
Cited by 17 (0 self)
The paper begins by discussing the difficulties in obtaining repeatable results in speech recognition. Theoretical arguments are presented for and against copying human auditory properties in automatic speech recognition. The "standard" acoustic analysis for automatic speech recognition, consisting of mel-scale cepstrum coefficients and their temporal derivatives, is described. Some variations and extensions of the standard analysis (PLP, cepstrum correlation methods, LDA, and variants on log power) are then discussed. These techniques pass the test of having been found useful at multiple sites, especially with noisy speech. The extent to which auditory properties can account for the advantage found for particular techniques is considered. It is concluded that the advantages do not in fact stem from auditory properties, and that there is so far little or no evidence that the study of the human auditory system has contributed to advances in automatic speech recognition. Contributio...
An Efficient and Scalable 2D DCT-Based Feature Coding Scheme for Remote Speech Recognition
2001
Cited by 10 (1 self)
A 2D DCT-based approach to compressing acoustic features for remote speech recognition applications is presented. The coding scheme involves computing a 2D DCT on blocks of feature vectors, followed by uniform scalar quantization, run-length coding, and Huffman coding. Digit recognition experiments were conducted in which training was done with unquantized cepstral features from clean speech, and testing used the same features after coding and decoding with 2D DCT and entropy coding, at various levels of acoustic noise. The coding scheme results in recognition performance comparable to that obtained with unquantized features at low bitrates. 2D DCT coding of MFCCs together with a method for variable frame rate analysis [Zhu and Alwan, 2000] and peak isolation [Strope and Alwan, 1997] maintains the noise robustness of these algorithms at low SNRs even at 624 bps. The low-complexity scheme is scalable, resulting in graceful degradation in performance with decreasing bit rate.
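The first two stages named in this abstract (a 2D DCT over a block of feature vectors, then uniform scalar quantization) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' coder: the block size, the step size, the function names, and the sample values are all assumptions, and the run-length and Huffman stages are left out.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis, returned as an n x n list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def dct2(block):
    """2D DCT of a (frames x coefficients) block: C_t * X * C_f^T."""
    ct = dct_matrix(len(block))
    cf = dct_matrix(len(block[0]))
    return matmul(matmul(ct, block), transpose(cf))

def idct2(coeffs):
    """Inverse of dct2; valid because both bases are orthonormal."""
    ct = dct_matrix(len(coeffs))
    cf = dct_matrix(len(coeffs[0]))
    return matmul(matmul(transpose(ct), coeffs), cf)

def quantize(coeffs, step):
    """Uniform scalar quantization: one integer index per coefficient."""
    return [[int(round(c / step)) for c in row] for row in coeffs]

def dequantize(indices, step):
    return [[q * step for q in row] for row in indices]

# Hypothetical block: 4 frames of 3 cepstral-like values (made-up numbers).
block = [[1.0, 2.0, 3.0],
         [1.5, 2.5, 2.0],
         [0.5, 1.0, 4.0],
         [2.0, 0.0, 1.0]]
step = 0.1  # assumed step size; a larger step means fewer bits and more distortion
indices = quantize(dct2(block), step)
reconstructed = idct2(dequantize(indices, step))
```

Because the transform is orthonormal, the reconstruction error has the same energy as the quantization error in the coefficient domain, so the step size trades bit rate directly against distortion.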
Filtering the Time Sequences of Spectral Parameters for Speech Recognition
1997
Cited by 10 (1 self)
In automatic speech recognition, the signal is usually represented by a set of time sequences of spectral parameters (TSSPs) that model the temporal evolution of the spectral envelope frame-to-frame. Those sequences are then filtered either to make them more robust to environmental conditions or to compute differential parameters (dynamic features), which enhance discrimination. In this paper, we apply frequency analysis to TSSPs in order to provide an interpretation framework for the various types of parameter filters used so far. Thus, the analysis of the average long-term spectrum of the successfully filtered sequences reveals a combined effect of equalization and band selection that provides insights into TSSP filtering. Also, we show in the paper that, when supplementary differential parameters are not used, the recognition rate can be improved even for clean speech, just by properly filtering the TSSPs. To support this claim, a number of experimental results are presented, bot...
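As a rough illustration of what filtering a time sequence to compute differential parameters can look like, the sketch below applies the widely used regression-based delta formula to one parameter's sequence. It is not taken from the paper: the function name, the window half-width, and the edge handling (repeating the first and last frames) are assumptions.

```python
def delta(sequence, half_window=2):
    """Regression-based delta of one parameter's time sequence.

    d[t] = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2)
    Frames beyond either end are replaced by the nearest edge frame.
    """
    n_frames = len(sequence)
    norm = 2.0 * sum(n * n for n in range(1, half_window + 1))
    out = []
    for t in range(n_frames):
        acc = 0.0
        for n in range(1, half_window + 1):
            ahead = sequence[min(t + n, n_frames - 1)]
            behind = sequence[max(t - n, 0)]
            acc += n * (ahead - behind)
        out.append(acc / norm)
    return out

# Hypothetical sequences of one cepstral coefficient over time.
ramp_deltas = delta([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
flat_deltas = delta([5.0] * 6)
```

Seen as a filter, this is an antisymmetric FIR filter applied along the time axis. A constant sequence maps to zero, so the filter removes the DC component of the sequence, which is one simple instance of the band-selection effect the abstract mentions.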
Scalable Distributed Speech Recognition Using Multi-Frame GMM-Based Block Quantization
Cited by 5 (2 self)
In this paper, we propose the use of the multi-frame Gaussian mixture model-based block quantizer for the coding of Mel frequency-warped cepstral coefficient (MFCC) features in distributed speech recognition (DSR) applications. This coding scheme exploits intraframe correlation via the Karhunen-Loève transform (KLT) and interframe correlation via the joint processing of adjacent frames, together with the computational simplicity of scalar quantization. The proposed coder is bit-rate scalable, which means that the bitrate can be adjusted without the need for re-training of the quantizers. Static parameters such as the probability density function (PDF) model and KLT orthogonal matrices are stored at the encoder and decoder, and bit allocations are calculated 'on-the-fly' without intensive processing. This coding scheme is evaluated in this paper on the Aurora-2 database in a DSR framework. It is shown that this coding scheme achieves high recognition performance at lower bitrates, with a word error rate (WER) of 2.5% at 800 bps, which is less than 1% degradation from the baseline word recognition accuracy, and graceful degradation down to a WER of 7% at 300 bps.
Cepstrum Derived From Differentiated Power Spectrum for Robust Speech Recognition
2003
Cited by 4 (0 self)
In this paper, cepstral features derived from the differential power spectrum (DPS) are proposed for improving the robustness of a speech recognizer in presence of background noise. These robust features are computed from the speech signal of a given frame through the following four steps. First, the short-time power spectrum of speech signal is computed from the speech signal through the fast Fourier transform algorithm. Second, DPS is obtained by differentiating the power spectrum with respect to frequency. Third, the magnitude of DPS is projected from linear frequency to the mel scale and smoothed by a filter bank. Finally, the outputs of the filter bank are transformed to cepstral coefficients by the discrete cosine transform after a nonlinear transformation. It is shown that this new feature set can be decomposed as the superposition of the standard cepstrum and its nonlinearly liftered counterpart. While a linear lifter has no effect on the continuous density hidden Markov model based speech recognition, we show that the proposed feature set embedded with a nonlinear liftering transformation is quite effective for robust speech recognition. For this, we conduct a number of speech recognition experiments (including isolated word recognition, connected digits recognition, and large vocabulary continuous speech recognition) in various operating environments and compare the DPS features with the standard mel-frequency cepstral coefficient features used with cepstral mean normalization and spectral subtraction techniques. © 2003 Elsevier B.V. All rights reserved. Keywords: Robust speech recognition; Hidden Markov model; Differential power spectrum; Linear liftering; Cepstral mean normalization; Spectral subtraction. 1. Introduction. Speech signal carries information from many sources. But not all information is relevant or important for speech recognition. In speech recognition, the first crucial step i...

