Sub-phonetic Modeling for Capturing Pronunciation Variations for Conversational Speech Synthesis. In: (2006)

by K Prahallad, A W Black, M Ravishankar
Venue:Proc. ICASSP
Results 1 - 10 of 14

Automatic Building of Synthetic Voices from Large Multi-Paragraph Speech Databases

by Kishore Prahallad, Arthur R Toth
Abstract - Cited by 9 (1 self)
Large multi-paragraph speech databases encapsulate prosodic and contextual information beyond the sentence level which could be exploited to build natural-sounding voices. This paper discusses our efforts on automatic building of synthetic voices from large multi-paragraph speech databases. We show that the primary issue of segmenting large speech files can be addressed with modifications to the forced-alignment technique, and that the proposed technique is independent of the duration of the audio file. We also discuss how this framework could be extended to build a large number of voices from public-domain large multi-paragraph recordings. Index Terms: speech synthesis, large multi-paragraph speech databases, forced-alignment, public domain recordings

Citation Context

...g produced/uttered by human beings. For example, it is known that the prosody and acoustics of a word spoken in a sentence differ significantly from those of the same word spoken in isolation. The work done in [3] [4] [5] [6] [7] [8] suggests that a similar prosodic and acoustic difference exists for sentences spoken in isolation versus sentences spoken in paragraphs, and similarly for paragraphs. So...

TTS from zero: Building synthetic voices for new languages

by John Kominek, Alexander I. Rudnicky , 2009
Abstract - Cited by 4 (0 self)
A developer wanting to create a speech synthesizer in a new voice for an under-resourced language faces hard problems. These include difficult decisions in defining a phoneme set and a laborious process of accumulating a pronunciation lexicon. Previously this has been handled through involvement of a language technologies expert. By definition, experts are in short supply. The goal of this thesis is to lower barriers facing a non-technical user in building “TTS from Zero”. Our approach focuses on simplifying the lexicon building task by having the user listen to and select from a list of pronunciation alternatives. The candidate pronunciations are predicted by grapheme-to-phoneme (G2P) rules that are learned incrementally as the user works through the vocabulary. Studies demonstrate success for Iraqi, Hindi, German, and Bulgarian, among others. We compare various word selection strategies that the active learner uses to acquire maximally predictive rules. Incremental G2P learning enables iterative voice building. Beginning with 20 minutes of recordings, a bootstrapped synthesizer provides pronunciation examples for lexical review, which is fed into the next round of training with more recordings to create a larger, better voice... and so

Citation Context

...honetic transcriptions for the purpose of learning probabilistic phonological rules [158]. Closer to our purpose, Prahallad altered the architecture of a speech decoder at the level of phone topology [126]. The topologies were either a) linear 5-state sequential with no skip states, b) fully connected 5-state networks, or c) linear 5-state sequential with fully connecting forward jumps. The states are ...
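The three 5-state phone topologies described in this excerpt can be sketched as boolean transition masks. This is an illustrative reconstruction; the matrix encoding and the names `linear`/`ergodic`/`forward` are ours, not from the cited work.

```python
import numpy as np

def topology_mask(kind, n=5):
    """Return an n x n boolean matrix of allowed HMM state transitions.

    'linear'  : self-loops plus next-state transitions, no skip states
    'ergodic' : fully connected network
    'forward' : self-loops plus every forward jump
    """
    allowed = np.zeros((n, n), dtype=bool)
    for i in range(n):
        if kind == "linear":
            allowed[i, i] = True            # self-loop
            if i + 1 < n:
                allowed[i, i + 1] = True    # next state only
        elif kind == "ergodic":
            allowed[i, :] = True            # any state reachable
        elif kind == "forward":
            allowed[i, i:] = True           # self-loop and forward jumps
        else:
            raise ValueError(kind)
    return allowed
```

During Baum-Welch training, transition probabilities outside the mask would be pinned to zero, which is what distinguishes the three decoder architectures.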

Text-To-Speech for Languages without an Orthography

by Sukhada Palkar, Alan W Black, Alok Parlikar
Abstract - Cited by 1 (1 self)
Speech synthesis models are typically built from a corpus of speech that has accurate transcriptions. However, many of the languages of the world do not have a standardized writing system. This paper is an initial attempt at building synthetic voices for such languages. It may seem useless to develop a text-to-speech system when there is no text available. But we will discuss some well defined use cases where we need these models. We will present our method to build synthetic voices from only speech data. We will present experimental results and oracle studies that show that we can automatically devise an artificial writing system for these languages, and build synthetic voices that are understandable and usable.

Citation Context

...t and long vowels could improve our models. We decoded the speech data with the previous English acoustic model. We then aligned the speech to these phonetic transcripts using an EHMM alignment tool (Prahallad et al., 2006). We determined the duration of different vowels and clustered them into two groups based on the duration. We then labeled these vowel clusters as being two different vowels when training the synthet...
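The duration-based two-way vowel split described above could be sketched as a toy one-dimensional 2-means clustering. The excerpt does not specify the actual clustering method used, so this is only an illustration of the idea:

```python
def split_short_long(durations, iters=20):
    """Cluster vowel durations (in seconds) into two groups by 1-D 2-means.

    Toy reconstruction of the short/long vowel split; centroids start at
    the extremes and are refined by alternating assign/update steps.
    """
    c_short, c_long = min(durations), max(durations)
    for _ in range(iters):
        short = [d for d in durations if abs(d - c_short) <= abs(d - c_long)]
        long_ = [d for d in durations if abs(d - c_short) > abs(d - c_long)]
        if short:
            c_short = sum(short) / len(short)
        if long_:
            c_long = sum(long_) / len(long_)
    return short, long_

# hypothetical vowel durations: four short tokens, three long tokens
short, long_ = split_short_long([0.05, 0.06, 0.07, 0.08, 0.18, 0.20, 0.22])
```

Each cluster would then be relabeled as a distinct vowel identity before retraining the synthesizer.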

QUALITY CONTROL OF AUTOMATIC LABELLING USING HMM-BASED SYNTHESIS

by Sathish Pammi, Marcela Charfuelan, Marc Schröder , 2009
Abstract - Cited by 1 (1 self)
This paper presents a measure to verify the quality of automatically aligned phone labels. The measure is based on a similarity cost between automatically generated phonetic segments and phonetic segments generated by an HMM-based synthesiser. We investigate the effectiveness of the measure for identifying problems of three types: alignment errors, phone identity problems and noise insertion. Our experiments show that the measure is best at finding noise errors, followed by phone identity mismatches and serious misalignments.

Citation Context

...r labelling speech data is manual labelling by linguistic experts. This task is both time-consuming and complicated; therefore automatic methods and algorithms have been developed over recent years [1, 2, 3]. Automatic labelling methods are still error-prone, so they are often followed by a stage of manual correction, which can be guided by confidence information indicating which labels to v...

Comparison of phonetic segmentation tools for european Portuguese

by Luís Figueira , Luís C Oliveira - in Proceedings of the 8th International Conference on Computational Processing of the Portuguese Language , 2008
Abstract - Cited by 1 (0 self)
Abstract. Currently, the majority of the text-to-speech synthesis systems that provide the most natural output are based on the selection and concatenation of variable-size speech units chosen from an inventory of recordings. There are many different approaches to performing automatic speech segmentation; the most used are based on Hidden Markov Models (HMM). In this work we compare several phonetic segmentation tools, based on different technologies, and study the transition types where each segmentation tool achieves better results. To evaluate the segmentation tools we chose the criterion of the number of phonetic transitions (phone borders) with an error below 20ms when compared to the manual segmentation. This value is commonly used in the literature [6] as an upper bound on a phone error. Afterwards, we combine the individual segmentation tools, taking advantage of their differing behavior according to the phonetic transition type. This approach improves the results obtained with any standalone tool used by itself. Since the goal of this work is the evaluation of fully automatic tools, we did not use any manual segmentation data to train models. The only manual information used during this study was the phonetic sequence. The speech data was recorded by a professional male native European Portuguese speaker. The corpus contains 724 utterances, corresponding to 87 minutes of speech (including silences). It was manually segmented at the phonetic level by two expert phoneticians. It has a total of 45282 phones, with the following distribution by phonetic classes: vowels (45%), plosives (19.2%), fricatives (14.6%), liquids (9.9%), nasal consonants (5.7%) and silences (5.5%). The data was split into 5 training/test sets, with a ratio of 4/1 of the available data, without superposition.
For this work we selected the following phonetic segmentation tools: Multiple Acoustic Features-Dynamic Time Warping (MAF-DTW): a tool that improves the performance of the traditional DTW alignment algorithm by using a combination of multiple acoustic
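The evaluation criterion described in this abstract (boundaries within 20ms of the manual segmentation, plus the absolute mean error used later in the excerpt) can be sketched as follows. This is our own illustration, assuming both segmentations share the same phone sequence so boundaries pair up one-to-one:

```python
def boundary_scores(auto_ms, manual_ms, tol_ms=20.0):
    """Compare automatic vs. manual phone boundary times (milliseconds).

    Returns (absolute mean error, fraction of boundaries within tol_ms).
    """
    assert len(auto_ms) == len(manual_ms)
    errors = [abs(a - m) for a, m in zip(auto_ms, manual_ms)]
    ame = sum(errors) / len(errors)              # Absolute Mean Error
    within = sum(e <= tol_ms for e in errors) / len(errors)
    return ame, within

# hypothetical boundaries: errors of 5, 10, 5 and 40 ms
ame, within = boundary_scores([100, 210, 335, 480], [95, 200, 330, 520])
# ame == 15.0, within == 0.75 (three of four boundaries inside 20ms)
```

A per-transition-type breakdown, as the paper does, would simply group the errors by the phone classes on either side of each boundary before scoring.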

Citation Context

... HMM/Multi-Layer Perceptron (MLP) acoustic model combining posterior phone probabilities generated by several MLPs trained on distinct input features [7,8]. The MLP network weights were re-estimated to adapt the models to the speaker; Hidden Markov Model Toolkit (HTK) [9], using unsupervised speaker-adapted, context-independent Hidden Markov Models (HMM). The models were adapted based on initial segmentations generated by the MAF-DTW tool. The models have a left-right topology, with 5 states each (3 emitting states); eHMM: a phonetic alignment tool oriented towards speech synthesis tasks [10], developed at Carnegie Mellon University and distributed together with a set of tools for building voices for Festival, called Festvox 2.1 [11]. The adopted model topology is the same as described for HTK; eHMM was also used to adapt the acoustic models to the speaker. In Table 1 we present the overall performance of each segmentation tool. From this table, it can be seen that MAF-DTW is the tool with the worst performance in terms of Absolute Mean Error (AME): 41ms. This value is almost twice the second-worst result (eHMM). This was expected, as DTW algorithms are usu...

Modeling Pause-Duration for Style-Specific Speech Synthesis

by Alok Parlikar, Alan W Black
Abstract - Cited by 1 (0 self)
A major contribution to speaking style comes from both the location of phrase breaks in an utterance, as well as the duration of these breaks. This paper is about modeling the duration of style specific breaks. We look at six styles of speech here. We present analysis that shows that these styles differ in the duration of pauses in natural speech. We have built CART models to predict the pause duration in these corpora and have integrated them into the Festival speech synthesis system. Our objective results show that if we have sufficient training data, we can build style specific models. Our subjective tests show that people can perceive the difference between different models and that they prefer style specific models over simple pause duration models.
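The CART pause-duration models described in this abstract can be illustrated in miniature with a depth-1 regression tree (a single split minimising squared error). This is a stand-in for illustration only; the actual models would grow deeper trees over many contextual features:

```python
def best_stump(X, y):
    """Fit a depth-1 regression tree (one split) minimising squared error.

    X: one predictor value per example (e.g. a break-strength feature);
    y: observed pause durations in seconds.
    Returns (sse, threshold, left_mean, right_mean).
    """
    best = None
    for thr in sorted(set(X)):
        left = [yi for xi, yi in zip(X, y) if xi <= thr]
        right = [yi for xi, yi in zip(X, y) if xi > thr]
        if not left or not right:
            continue  # a split must leave examples on both sides
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - ml) ** 2 for v in left)
               + sum((v - mr) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    return best

# hypothetical data: weak breaks (1, 2) get short pauses, strong breaks (3) long ones
sse, thr, ml, mr = best_stump([1, 1, 2, 2, 3, 3],
                              [0.10, 0.12, 0.11, 0.13, 0.40, 0.44])
```

A style-specific model amounts to fitting one such tree per speech style, which is why the paper needs sufficient per-style training data.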

Citation Context

..., “audio-book”, but is different from the TATS corpus. We extracted the pause duration from natural speech in our corpora. To do that, we force-aligned the speech and transcriptions using an EHMM tool [16] that allows short silences to be inserted during the alignment. We used these alignments to find the length of these inserted silences. We ignored all inserted pauses that were less than 80ms...
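The pause-extraction step described above (collect inserted silences from the alignment, discard those under 80ms) can be sketched as a simple filter over alignment labels. The silence label name `pau` is our assumption for illustration; actual EHMM label formats may differ:

```python
def extract_pauses(labels, min_ms=80.0):
    """Collect inserted-silence durations from a phone alignment.

    labels: (start_ms, end_ms, phone) triples; silences shorter than
    min_ms are discarded, matching the 80ms floor described above.
    """
    pauses = []
    for start, end, phone in labels:
        if phone == "pau" and (end - start) >= min_ms:
            pauses.append(end - start)
    return pauses

# hypothetical alignment: one 120ms pause kept, one 50ms pause dropped
labels = [(0, 120, "pau"), (120, 200, "ax"), (200, 250, "pau"), (250, 400, "b")]
```

The surviving durations would then serve as the response variable for the pause-duration models.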

Optimizing segment label ...

by Alan W Black, John Kominek , 2009
Abstract
This paper introduces a new optimization technique for moving segment labels (phone and subphonetic) to optimize statistical parametric speech synthesis models. The choice of objective measures is investigated thoroughly and listening tests show the results to significantly improve the quality of the generated speech equivalent to increasing the database size by 3 fold.

... pronunciation variation with context-dependent articulatory feature decision ...

by Sam Bowman, Karen Livescu

Abstract

Citation Context

...tion modeling in isolation, and note that it is an important problem in its own right (as part of the study of phonology) as well as necessary for other tasks (such as conversational speech synthesis [11]). 2. Feature-based pronunciation modeling The pronunciation models we consider are based on ideas from autosegmental [3] and articulatory [1] phonology. In such a model, the typical single sequence o...

CLUSTERGEN: A Statistical Parametric Synthesizer Using Trajectory Modeling (INTERSPEECH 2006 - ICSLP)

by unknown authors
Abstract
Unit selection synthesis has shown itself to be capable of producing high quality natural sounding synthetic speech when constructed from large databases of well-recorded, well-labeled speech. However, the cost in time and expertise of building such voices is still too expensive and specialized to be able to build individual voices for everyone. The quality in unit selection synthesis is directly related to the quality and size of the database used. As we require our speech synthesizers to have more variation, style and emotion, for unit selection synthesis, much larger databases will be required. As an alternative, more recently we have started looking for parametric models for speech synthesis, that are still trained from databases of natural speech but are more robust to errors and allow for better modeling of variation. This paper presents the CLUSTERGEN synthesizer which is implemented within the Festival/FestVox voice building environment. As well as the basic technique, three methods of modeling dynamics in the signal are presented and compared: a simple point model, a basic trajectory model and a trajectory model with overlap and add. Index Terms: speech synthesis, statistical parametric synthesis, trajectory HMMs.

Citation Context

...by others. 2.1. Training The first stage, which is not technically part of the CLUSTERGEN synthesizer, is to label the database using an HMM labeler. For the results presented here, we have used EHMM [11], which is included within the latest FestVox release. It uses Baum-Welch from a flat start to train context-independent HMM models, which it then uses to force-align the phonemes generated from the t...

Blizzard 2008: Experiments on Unit Size for Unit Selection Speech Synthesis

by E. Veera Raghavendra, Srinivas Desai, B. Yegnanarayana, Alan W Black, Kishore Prahallad
Abstract
This paper describes the techniques and approaches developed at IIIT Hyderabad for building synthetic voices in the Blizzard 2008 speech synthesis challenge. We have submitted three different voices: an English full voice, an English ARCTIC voice and a Mandarin voice. Our system is identified as D. In building the three voices, our approach has been to experiment with and exploit syllable-like large units for concatenative synthesis. In spite of the large database supplied in Blizzard 2008, we find that a back-off strategy is essential when using syllable-like units. In this paper, we propose a novel technique of approximate matching of syllables as a back-off technique for building voices. Index Terms: speech synthesis, unit size, tonal unit, prominence

Citation Context

...speech analysis respectively. In turn they produce phone sequences and signal features: fundamental frequency, mel-cepstral coefficients (MCEP) and energy. The phone sequence and MCEPs are passed to EHMM [10] for labeling the speech signal with respect to the phone sequence of the utterance. EHMM produces the labels with phones and their time stamps in the speech signal. The rest of the procedure is broken in...
