Results 1 - 5 of 5
Sub-phonetic modeling for capturing pronunciation variation in conversational speech synthesis
- in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, 2006
Abstract - Cited by 14 (5 self)
In this paper we address the issue of pronunciation modeling for conversational speech synthesis. We experiment with two different HMM topologies (a fully connected state model and a forward connected state model) for sub-phonetic modeling, to capture the deletion and insertion of sub-phonetic states during the speech production process. We show that both topologies achieve a higher log-likelihood than the traditional 5-state sequential model. We also study the first and second mentions of content words and their influence on pronunciation variation. Finally, we report phone recognition experiments using the modified HMM topologies.
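As a rough illustration of how the three topologies named in this abstract differ, the sketch below builds the allowed-transition structure for each one. It rests on assumptions that the abstract does not confirm: "sequential" is read as self-loop plus next-state only, "forward connected" as any transition to the same or a later state (so states can be skipped), and "fully connected" as any state to any state. The function names are hypothetical, and non-emitting entry and exit states are ignored.

```python
def topology_mask(n_states, kind):
    """Return an n x n 0/1 matrix; mask[i][j] == 1 means i -> j is allowed.

    The three 'kind' values are an assumed reading of the abstract,
    not the authors' definitions.
    """
    mask = [[0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        for j in range(n_states):
            if kind == "sequential":
                # Self-loop or step to the immediately following state.
                allowed = j == i or j == i + 1
            elif kind == "forward":
                # Self-loop or jump to any later state (allows skips).
                allowed = j >= i
            elif kind == "full":
                # Every state may follow every state.
                allowed = True
            else:
                raise ValueError(f"unknown topology: {kind}")
            mask[i][j] = int(allowed)
    return mask


def count_transitions(mask):
    """Count the allowed transitions in a mask."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    for kind in ("sequential", "forward", "full"):
        print(kind, count_transitions(topology_mask(5, kind)))
```

Under these assumptions, the extra arcs in the forward and fully connected masks are what would let a trained model skip or revisit sub-phonetic states.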
Automatic Building of Synthetic Voices from Large Multi-Paragraph Speech Databases
Abstract - Cited by 9 (1 self)
Large multi-paragraph speech databases encapsulate prosodic and contextual information beyond the sentence level, which could be exploited to build natural-sounding voices. This paper discusses our efforts on automatically building synthetic voices from large multi-paragraph speech databases. We show that the primary issue, segmentation of a large speech file, can be addressed with modifications to the forced-alignment technique, and that the proposed technique is independent of the duration of the audio file. We also discuss how this framework could be extended to build a large number of voices from public-domain large multi-paragraph recordings.
Index Terms: speech synthesis, large multi-paragraph speech databases, forced-alignment, public domain recordings
TTS from zero: Building synthetic voices for new languages
, 2009
Abstract - Cited by 4 (0 self)
A developer wanting to create a speech synthesizer in a new voice for an under-resourced language faces hard problems. These include difficult decisions in defining a phoneme set and a laborious process of accumulating a pronunciation lexicon. Previously this has been handled through involvement of a language technologies expert. By definition, experts are in short supply. The goal of this thesis is to lower barriers facing a non-technical user in building “TTS from Zero.” Our approach focuses on simplifying the lexicon building task by having the user listen to and select from a list of pronunciation alternatives. The candidate pronunciations are predicted by grapheme-to-phoneme (G2P) rules that are learned incrementally as the user works through the vocabulary. Studies demonstrate success for Iraqi, Hindi, German, and Bulgarian, among others. We compare various word selection strategies that the active learner uses to acquire maximally predictive rules. Incremental G2P learning enables iterative voice building. Beginning with 20 minutes of recordings, a bootstrapped synthesizer provides pronunciation examples for lexical review, which is fed into the next round of training with more recordings to create a larger, better voice... and so
Re-Engineering Letter-to-Sound Rules
, 2001
Abstract - Cited by 3 (0 self)
Using finite-state automata for the text analysis component in a text-to-speech system is problematic in several respects: the rewrite rules from which the automata are compiled are difficult to write and maintain, and the resulting automata can become very large and therefore inefficient. Converting the knowledge represented explicitly in rewrite rules into a more efficient format is difficult. We take an indirect route, learning an efficient decision tree representation from data and tapping information contained in existing rewrite rules, which increases performance compared to learning exclusively from a pronunciation lexicon.