Automatic Generation Of Detailed Pronunciation Lexicons
1995
Cited by 36 (3 self)
We explore different ways of "spelling" a word in a speech recognizer's lexicon and how to obtain those spellings. In particular, we compare using as the source of sub-word units for which we build acoustic models (1) a coarse phonemic representation, (2) a single, fine phonetic realization, and (3) multiple phonetic realizations with associated likelihoods. We describe how we obtain these different pronunciations from text-to-speech systems and from procedures that build decision trees trained on phonetically-labeled corpora. We evaluate these methods applied to speech recognition with the DARPA Resource Management (RM) and the North American Business News (NAB) tasks. For the RM task (with perplexity 60 grammar), we obtain 93.4% word accuracy using phonemic pronunciations, 94.1% using a single phonetic pronunciation per word, and 96.3% using multiple phonetic pronunciations per word with associated likelihoods. For the NAB task (with 60K vocabulary and 34M 1-5 grams), we obtain 87.3% word accuracy with phonemic pronunciations and 90.0% using multiple phonetic pronunciations
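As a rough illustration of the difference between a single pronunciation per word and multiple pronunciations with associated likelihoods, the sketch below stores a tiny lexicon both ways. It is not the paper's system: the words, phone symbols, probabilities, and the function names `single_best` and `log_weights` are all invented for the example.

```python
import math

# Hypothetical lexicon (invented entries): word -> list of (phone sequence, probability).
lexicon = {
    "the": [(("dh", "ax"), 0.7), (("dh", "iy"), 0.3)],
    "data": [(("d", "ey", "t", "ax"), 0.6), (("d", "ae", "t", "ax"), 0.4)],
}

def single_best(lex):
    """Keep only the most likely pronunciation of each word."""
    return {word: max(prons, key=lambda p: p[1])[0] for word, prons in lex.items()}

def log_weights(lex):
    """Keep every pronunciation, with a log-probability a decoder could add to a path score."""
    return {
        word: [(phones, math.log(prob)) for phones, prob in prons]
        for word, prons in lex.items()
    }

best = single_best(lexicon)
weighted = log_weights(lexicon)
```

Here `best` corresponds to option (2) in the abstract and `weighted` to option (3); how the weights enter the recognizer's search is left open.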
Identifying Non-Linguistic Speech Features
Proc Eurospeech
Cited by 24 (13 self)
Over the last decade, technological advances have been made which enable us to envision real-world applications of speech technologies. It is possible to foresee applications, for example, information centers in public places such as train stations and airports, where the spoken query is to be recognized without even prior knowledge of the language being spoken. Other applications may require accurate identification of the speaker for security reasons, including control of access to confidential information or for telephone-based transactions.
A Phone-based Approach to Non-Linguistic Speech Feature Identification
Computer Speech and Language, 1995
Cited by 14 (9 self)
In this paper we present a general approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. The basic idea is to process the unknown speech signal by feature-specific phone model sets in parallel, and to hypothesize the feature value associated with the model set having the highest likelihood. This technique is shown to be effective for text-independent gender, speaker, and language identification. Text-independent speaker identification accuracies of 98.8% on TIMIT (168 speakers) and 99.2% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% with 2 utterances for both corpora. Experiments in which speaker-specific models were estimated without using the phonetic transcriptions for the TIMIT speakers had the same identification accuracies as those obtained with the use of the transcriptions. French/English language identification is better than 99% with 2s of read, laboratory speech. On spontaneous teleph...
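The decision rule described in this abstract (score the signal with every feature-specific model set, then pick the one with the highest likelihood) can be sketched as follows. This is a toy under stated assumptions: each "model set" is replaced by a single 1-D Gaussian rather than a set of phone models, and the labels, parameters, and frame values are invented.

```python
import math

def gaussian_loglik(frames, mean, var):
    """Total log-likelihood of 1-D frames under one Gaussian (toy stand-in for a phone model set)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )

def identify(frames, model_sets):
    """Score the frames with every model set and return the label with the highest likelihood."""
    scores = {
        label: gaussian_loglik(frames, mean, var)
        for label, (mean, var) in model_sets.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical model sets: feature value -> (mean, variance).
model_sets = {"french": (0.0, 1.0), "english": (3.0, 1.0)}
frames = [2.7, 3.2, 2.9, 3.4]
label, scores = identify(frames, model_sets)
```

The same argmax applies whether the competing model sets represent genders, speakers, or languages; only the models being scored change.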
The LIMSI Continuous Speech Dictation System
Cited by 11 (1 self)
A major axis of research at LIMSI is directed at multilingual, speaker-independent, large vocabulary speech dictation. In this paper the LIMSI recognizer which was evaluated in the ARPA NOV93 CSR test is described, and experimental results on the WSJ and BREF corpora under closely matched conditions are reported. For both corpora word recognition experiments were carried out with vocabularies containing up to 20k words. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. The recognizer uses a time-synchronous graph-search strategy which is shown to still be viable with a 20k-word vocabulary when used with bigram back-off language models. A second forward pass, which makes use of a word graph generated with the bigram, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sex-dependent models.
Speaker Verification Over The Telephone
Speech Communication, 2000
Cited by 8 (3 self)
The aim of the research reported in this paper was to assess the capability of state-of-the-art methods for speaker verification in order to determine if high enough performance levels could be obtained to support the development of telecom applications. This experimental study quantified speaker recognition performance out of the context of any specific application, as a function of factors more-or-less acknowledged to affect the accuracy. Some issues investigated are: the speaker model (Gaussian mixture models are compared with phone-based models); the influence of the amount and content of training and test data on performance; performance degradation due to model ageing and how this can be counteracted by using adaptation techniques; and achievable performance levels using text-dependent and text-independent recognition modes. In particular the effect of linguistic content on performance is shown for both read and spontaneous speech. These and other factors were addressed using a large corpus of read and spontaneous speech (over 2000 hours collected from 100 target speakers and 1000 impostors) in French designed and recorded for the purpose of this study. On this data, the lowest equal error rate is 1% for the text-dependent mode when 2 trials are allowed per attempt and with a minimum of 1.5s of speech per trial.
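For readers unfamiliar with the equal error rate quoted above, it is the operating point at which the false acceptance rate (impostors accepted) equals the false rejection rate (target speakers rejected). A minimal sketch of estimating it from verification scores follows; the scores are invented, and the threshold sweep over observed score values is one simple estimator, not necessarily the procedure used in the paper.

```python
def error_rates(targets, impostors, threshold):
    """Return (false acceptance rate, false rejection rate) when accepting scores >= threshold."""
    far = sum(s >= threshold for s in impostors) / len(impostors)
    frr = sum(s < threshold for s in targets) / len(targets)
    return far, frr

def equal_error_rate(targets, impostors):
    """Sweep thresholds over observed scores; return (EER estimate, threshold) where FAR and FRR are closest."""
    best = None
    for t in sorted(set(targets) | set(impostors)):
        far, frr = error_rates(targets, impostors, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

# Invented scores: higher means "more likely the claimed speaker".
targets = [0.9, 0.8, 0.75, 0.6, 0.4]
impostors = [0.7, 0.5, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(targets, impostors)
```

With only a handful of scores the two error curves may never cross exactly, which is why the sketch averages the two rates at the closest point.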
Experiments With Speaker Verification Over The Telephone
Proc. Eurospeech’95
Cited by 7 (2 self)
In this paper we present a study on speaker verification showing achievable performance levels for both high quality speech and telephone speech and for two operational modes, i.e. text-dependent and text-independent speaker verification. A statistical modeling approach is taken, where for text-independent verification the talker is viewed as a source of phones, modeled by a fully connected Markov chain, where the lexical and syntactic structures of the language are approximated by local phonotactic constraints. A first series of experiments was carried out on high quality speech from the BREF corpus to validate this approach and resulted in an a posteriori equal error rate of 0.3% in text-dependent as well as in text-independent mode. A second series of experiments was carried out on a telephone corpus recorded specifically for speaker verification algorithm development. On this data, the lowest equal error rate is 2.9% for the text-dependent mode when 2 trials are allowed per attempt...

