Automatic building of synthetic voices from large multi-paragraph speech databases, in INTERSPEECH (2007)

by K Prahallad, A R Toth, A W Black

Results 1 - 9 of 9

Automatic sentence selection from speech corpora including diverse speech for improved HMM-TTS synthesis quality

by Norbert Braunschweiler, Sabine Buchholz - in Proc. Interspeech, 2011
"... Using publicly available audiobooks for HMM-TTS poses new challenges. This paper addresses the issue of diverse speech in audiobooks. The aim is to identify diverse speech likely to have a negative effect on HMM-TTS quality. Manual removal of diverse speech was found to yield better synthesis qualit ..."
Abstract - Cited by 8 (0 self)
Using publicly available audiobooks for HMM-TTS poses new challenges. This paper addresses the issue of diverse speech in audiobooks. The aim is to identify diverse speech likely to have a negative effect on HMM-TTS quality. Manual removal of diverse speech was found to yield better synthesis quality despite halving the training corpus. To handle large amounts of data an automatic approach is proposed. The approach uses a small set of acoustic and text based features. A series of listening tests showed that the manual selection is most preferred, while the automatic selection showed significant preference over the full training set. Index Terms: speech synthesis, HMM-TTS, corpus creation, diverse speech, speaking styles, audiobooks
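
The automatic approach is described only at a high level in the abstract, so the following Python sketch is an assumption about what such a feature-based filter could look like: sentences whose pitch statistics are outliers for the speaker, or whose text suggests quoted dialogue, are flagged as diverse and excluded from training. The feature set and thresholds are illustrative, not those of the paper.

```python
# Hypothetical sketch (assumed features, not the paper's): flag a sentence as
# "diverse" if its mean F0 is an outlier for the speaker or its text suggests
# quoted dialogue, then train only on the remainder.
import numpy as np

def is_diverse(f0_mean, f0_median, f0_iqr, text, z_max=2.0):
    pitch_outlier = abs(f0_mean - f0_median) > z_max * f0_iqr
    direct_speech = '"' in text or "!" in text   # crude text-based cue
    return pitch_outlier or direct_speech

def select_sentences(sentences):
    """sentences: list of dicts with 'f0_mean' (Hz) and 'text' keys."""
    f0 = np.array([s["f0_mean"] for s in sentences])
    median = np.median(f0)
    iqr = np.subtract(*np.percentile(f0, [75, 25]))   # robust spread estimate
    return [s for s in sentences
            if not is_diverse(s["f0_mean"], median, iqr, s["text"])]
```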

Citation Context

... In an effort to realize more expressive synthetic speech, especially for long coherent texts, the use of other speech corpora like audiobooks has become the focus of research for synthetic voice building [1]. However, publicly available audiobooks are often read by non-professional readers under non-optimal recording conditions. Many of these audiobooks include a wide variety of speaking styles with one ...

Lightly Supervised GMM VAD to use Audiobook for Speech Synthesiser

by Yoshitaka Mamiya, Junichi Yamagishi, Oliver Watts, Robert A. J. Clark, Simon King, Adriana Stan - in Proc. ICASSP, 2013
"... Audiobooks have been focused on as promising data for training Text-to-Speech (TTS) systems. However, they usually do not have a correspondence between audio and text data. Moreover, they are usually divided only into chapter units. In practice, we have to make a correspondence of audio and text dat ..."
Abstract - Cited by 7 (5 self)
Audiobooks have been focused on as promising data for training Text-to-Speech (TTS) systems. However, they usually do not have a correspondence between audio and text data. Moreover, they are usually divided only into chapter units. In practice, we have to establish a correspondence between audio and text data before we use them for building TTS synthesisers. However, aligning audio and text data is time-consuming and involves manual labor. It also requires persons skilled in speech processing. Previously, we have proposed to use graphemes for automatically aligning speech and text data. This paper further integrates a lightly supervised voice activity detection (VAD) technique to detect sentence boundaries as a pre-processing step before the grapheme approach. This lightly supervised technique requires time stamps of speech and silence only for the first fifty sentences. Combining those, we can semi-automatically build TTS systems from audiobooks with minimum manual intervention. From subjective evaluations we analyse how the grapheme-based aligner and/or the proposed VAD technique impact the quality of HMM-based speech synthesisers trained on audiobooks. Index Terms: voice activity detection, lightly supervised, audiobook, HMM-based speech synthesis
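
As a rough illustration of the GMM-based VAD idea (not the authors' implementation), the sketch below fits one mixture to speech frames and one to silence frames from a small labelled bootstrap set, then classifies the remaining frames by log-likelihood ratio; in the paper, the bootstrap labels come from time stamps for only the first fifty sentences.

```python
# Minimal GMM-VAD sketch: one mixture per class, frame-wise LLR decision.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_vad(features, labels, n_components=4):
    """features: (n_frames, dim) frames; labels: 1 = speech, 0 = silence."""
    speech_gmm = GaussianMixture(n_components=n_components).fit(features[labels == 1])
    silence_gmm = GaussianMixture(n_components=n_components).fit(features[labels == 0])
    return speech_gmm, silence_gmm

def classify(features, speech_gmm, silence_gmm):
    # Frame-wise log-likelihood ratio; positive means speech.
    llr = speech_gmm.score_samples(features) - silence_gmm.score_samples(features)
    return llr > 0.0
```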

Lightly Supervised Discriminative Training of Grapheme Models for Improved Sentence-level Alignment of Speech and Text Data

by Adriana Stan, Peter Bell, Junichi Yamagishi, Simon King - in Proc. Interspeech, 2013
"... This paper introduces a method for lightly supervised discriminative training using MMI to improve the alignment of speech and text data for use in training HMM-based TTS systems for low-resource languages. In TTS applications, due to the use of long-span contexts, it is important to select training ..."
Abstract - Cited by 5 (4 self)
This paper introduces a method for lightly supervised discriminative training using MMI to improve the alignment of speech and text data for use in training HMM-based TTS systems for low-resource languages. In TTS applications, due to the use of long-span contexts, it is important to select training utterances which have wholly correct transcriptions. In a low-resource setting, when using poorly trained grapheme models, we show that the use of MMI discriminative training at the grapheme level enables us to increase the amount of correctly aligned data by 40%, while maintaining a 7% sentence error rate and 0.8% word error rate. We present the procedure for lightly supervised discriminative training with regard to the objective of minimising sentence error rate. Index Terms: automatic alignment, grapheme models, light supervision, MMI, text-to-speech
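
The selection criterion itself is simple to state: keep only utterances whose recognised word sequence matches the book text exactly, so that every retained sentence has a wholly correct transcription. A minimal sketch, assuming a `recognise` function standing in for the grapheme-model decoder (an assumption, not the paper's code):

```python
# Hypothetical selection step: retain utterances with zero word errors against
# the book text, so TTS training sees only wholly correct transcriptions.
def select_correct_utterances(utterances, recognise):
    selected = []
    for audio, book_text in utterances:
        hypothesis = recognise(audio)            # decoder biased towards book_text
        if hypothesis.split() == book_text.split():
            selected.append((audio, book_text))  # exact match: safe for training
    return selected
```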

Citation Context

...nology applications, requires a reliable matching transcription to be obtained. The use of an existing automatic speech recognition (ASR) system for this purpose has been proposed by many researchers [3, 8, 9, 10, 11, 12, 13]. But these methods are applicable only to languages where the resources for training a good speaker-independent ASR system already exist. For an under-resourced language, the only audio data available ...

TUNDRA: A Multilingual Corpus of Found Data for TTS Research Created with Light Supervision

by A Stan, O Watts, Y Mamiya, M Giurgiu, R A J Clark, J Yamagishi, S King - in Proc. Interspeech, 2013
"... Abstract Simple4All Tundra (version 1.0) is the first release of a standardised multilingual corpus designed for text-to-speech research with imperfect or found data. The corpus consists of approximately 60 hours of speech data from audiobooks in 14 languages, as well as utterance-level alignments ..."
Abstract - Cited by 1 (1 self)
Simple4All Tundra (version 1.0) is the first release of a standardised multilingual corpus designed for text-to-speech research with imperfect or found data. The corpus consists of approximately 60 hours of speech data from audiobooks in 14 languages, as well as utterance-level alignments obtained with a lightly-supervised process. Future versions of the corpus will include finer-grained alignment and prosodic annotation, all of which will be made freely available. This paper gives a general outline of the data collected so far, as well as a detailed description of how this has been done, emphasizing the minimal language-specific knowledge and manual intervention used to compile the corpus. To demonstrate its potential use, text-to-speech systems have been built for all languages using unsupervised or lightly supervised methods, also briefly presented in the paper.

Citation Context

... database of ‘found’ data, which we release under the name Tundra. There has been much recent interest in using found data to produce TTS systems, in particular, speech data from audiobook recordings [1, 2, 3, 4, 5, 6, 7]. We note that the Arctic databases [8] have provided a valuable resource for research into TTS using conventional purpose-recorded databases, in that they are freely available and serve as a common p...

Modeling Pause-Duration for Style-Specific Speech Synthesis

by Alok Parlikar, Alan W Black
"... A major contribution to speaking style comes from both the location of phrase breaks in an utterance, as well as the duration of these breaks. This paper is about modeling the duration of style specific breaks. We look at six styles of speech here. We present analysis that shows that these styles di ..."
Abstract - Cited by 1 (0 self)
A major contribution to speaking style comes from both the location of phrase breaks in an utterance and the duration of these breaks. This paper is about modeling the duration of style-specific breaks. We look at six styles of speech here. We present analysis that shows that these styles differ in the duration of pauses in natural speech. We have built CART models to predict the pause duration in these corpora and have integrated them into the Festival speech synthesis system. Our objective results show that if we have sufficient training data, we can build style-specific models. Our subjective tests show that people can perceive the difference between different models and that they prefer style-specific models over simple pause duration models.
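
A CART pause-duration model of the kind described can be sketched with an off-the-shelf regression tree; the feature names and durations below are illustrative assumptions, not the paper's data.

```python
# Hypothetical CART-style pause-duration model: a regression tree predicts
# break duration (seconds) from simple break-context features.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Each row: [is_paragraph_end, is_sentence_end, words_since_last_break]
X = np.array([[1, 1, 12], [0, 1, 9], [0, 0, 4], [1, 1, 15], [0, 0, 3]])
y = np.array([0.85, 0.55, 0.20, 0.90, 0.15])   # observed pause durations (s)

tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=1).fit(X, y)
print(tree.predict([[0, 1, 10]]))  # predicted pause at a sentence-end break
```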

Citation Context

... breaks between utterances in a paragraph, and breaks at the ends of paragraphs. Modeling the duration of these breaks is tricky, since databases from which we can train them are not readily available. [20] have proposed a method with which large speech corpora could be aligned to their text, to automatically build a corpus for TTS voices that includes information at sentence and paragraph boundaries. O...

Automatic Alignment of Audiobooks in Afrikaans

by Charl J. Van Heerden, Febe De Wet, Marelie H. Davel
"... Abstract—This paper reports on the automatic alignment of audiobooks in Afrikaans. An existing Afrikaans pronunciation dictionary and corpus of Afrikaans speech data are used to generate baseline acoustic models. The baseline system achieves an average duration independent overlap rate of 0.977 on t ..."
Abstract
This paper reports on the automatic alignment of audiobooks in Afrikaans. An existing Afrikaans pronunciation dictionary and corpus of Afrikaans speech data are used to generate baseline acoustic models. The baseline system achieves an average duration-independent overlap rate of 0.977 on the first three chapters of an audio version of “Ruiter in die Nag”, an Afrikaans book by Mikro. The average duration-independent overlap rate increases to 0.990 when the speech data from the audiobook is used to perform Maximum A Posteriori adaptation on the baseline models. The corresponding value for models trained on the audiobook data is 0.996. An automatic measure of alignment accuracy is also introduced and compared to accuracies measured relative to a gold standard.
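
The Maximum A Posteriori adaptation step mentioned here follows the standard relevance-factor update for Gaussian means, mu_map = (tau * mu_prior + sum_t gamma_t x_t) / (tau + sum_t gamma_t). A minimal sketch, assuming frame posteriors are already available from a forced alignment:

```python
# Standard MAP mean update for one Gaussian, given adaptation frames and
# their occupation probabilities (gamma). tau is the relevance factor.
import numpy as np

def map_adapt_mean(mu_prior, frames, posteriors, tau=16.0):
    """mu_prior: (dim,); frames: (T, dim); posteriors: (T,) soft counts."""
    occ = posteriors.sum()              # total occupation for this Gaussian
    weighted_sum = posteriors @ frames  # sum_t gamma_t * x_t
    return (tau * mu_prior + weighted_sum) / (tau + occ)
```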

Citation Context

...l alignments between the audio and text versions of audiobooks are used either to enhance the level of accessibility of the books [2], [3] or to develop resources for text-to-speech (TTS) development [4], [5], [6]. A large project was undertaken in Portugal to improve the access to digital audiobooks by print-disabled readers [2]. Amongst other things, an ASR system was developed to automatically ali...

Using Adaptation to Improve Speech Transcription Alignment in Noisy and Reverberant Environments

by Y. Mamiya, A. Stan, J. Yamagishi, P. Bell, O. Watts, R. A. J. Clark, S. King , 2013
"... When using data retrieved from the internet to create new speech databases, the recording conditions can often be highly variable within and between sessions. This variance influences the overall performance of any automatic speech and text alignment techniques used to process this data. In this pap ..."
Abstract
When using data retrieved from the internet to create new speech databases, the recording conditions can often be highly variable within and between sessions. This variance influences the overall performance of any automatic speech and text alignment techniques used to process this data. In this paper we discuss the use of speaker adaptation methods to address this issue. Starting from a baseline system for automatic sentence-level segmentation and speech and text alignment based on GMMs and grapheme HMMs, respectively, we employ Maximum A Posteriori (MAP) and Constrained Maximum Likelihood Linear Regression (CMLLR) techniques to model the variation in the data in order to increase the amount of confidently aligned speech. We tested 29 different scenarios, which include reverberation, 8-talker babble noise and white noise, each in various combinations and SNRs. Results show that the MAP-based segmentation’s performance is very much influenced by the noise type, as well as the presence or absence of reverberation. On the other hand, the CMLLR adaptation of the acoustic models gives an average 20% increase in the aligned data percentage for the majority of the studied scenarios. Index Terms: speech alignment, speech segmentation, adaptive training, CMLLR, MAP, VAD
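
CMLLR is a feature-space transform, so at application time each frame x is simply mapped to A x + b; a minimal sketch of that application step follows (estimating A and b requires the usual EM auxiliary-function optimisation, which is omitted here):

```python
# Applying a CMLLR (feature-space) transform: one set of acoustic models can
# then fit data recorded under different conditions.
import numpy as np

def apply_cmllr(features, A, b):
    """features: (T, dim); A: (dim, dim); b: (dim,). Returns transformed frames."""
    return features @ A.T + b
```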

Citation Context

...recognition or synthesis systems for new domains or new languages. Automatic alignment of speech with imperfect transcripts has already been well addressed in the previous work of others, for example [1, 2, 3, 4, 5, 6, 7]. Unfortunately, all of these approaches make use of expert knowledge and/or expensive resources, such as very good speaker-independent acoustic models or large vocabulary ‘biased’ language models, an...

Annotating Speech Corpus for Prosody Modeling in Indian Language Text to Speech Systems

by Kiruthiga S, Krishnamoorthy K
"... A spoken language system, it may either be a speech synthesis or a speech recognition system, starts with building a speech corpora. We give a detailed survey of issues and a methodology that selects the appropriate speech unit in building a speech corpus for Indian language Text to Speech systems. ..."
Abstract
A spoken language system, whether a speech synthesis or a speech recognition system, starts with building a speech corpus. We give a detailed survey of issues and a methodology for selecting the appropriate speech unit when building a speech corpus for Indian language Text to Speech systems. The paper ultimately aims to improve the intelligibility of the synthesized speech in Text to Speech synthesis systems. To begin with, an appropriate text file is selected for building the speech corpus. Then a corresponding speech file is generated and stored. This speech file is the phonetic representation of the selected text file. The speech file is processed at different levels, viz. paragraphs, sentences, phrases, words, syllables and phones. These are called the speech units of the file. Research has been done taking each of these units as the basic unit for processing. This paper analyses the research done using phones, diphones, triphones, syllables and polysyllables as the basic unit for speech synthesis. The paper also provides a recommended set of combinations for polysyllables. Concatenative speech synthesis involves the concatenation of these basic units to synthesize intelligible, natural-sounding speech. The speech units are annotated with relevant prosodic information about each unit, manually or automatically, based on an algorithm. The database consisting of the units along with their annotated information is called the annotated speech corpus. A clustering technique is used on the annotated speech corpus that provides a way to select the appropriate unit for concatenation, based on the lowest total join cost of the speech unit.
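
The final selection step, choosing units by lowest total join cost, is a standard dynamic-programming search. A minimal sketch, assuming each candidate unit is a matrix of acoustic frames and using a simple spectral-distance join cost (both assumptions, not the paper's exact costs):

```python
# Hypothetical unit-selection sketch: pick one candidate unit per target
# position so the summed join cost across concatenation points is minimal.
import numpy as np

def join_cost(unit_a, unit_b):
    # Spectral mismatch at the concatenation point (e.g. MFCC distance).
    return float(np.linalg.norm(unit_a[-1] - unit_b[0]))

def select_units(candidates):
    """candidates: list over positions; each entry is a list of (T_i, dim) arrays."""
    best = [np.zeros(len(candidates[0]))]      # cumulative cost per first candidate
    back = []
    for prev, cur in zip(candidates, candidates[1:]):
        costs = np.array([[best[-1][i] + join_cost(p, c)
                           for i, p in enumerate(prev)] for c in cur])
        back.append(costs.argmin(axis=1))      # best predecessor per candidate
        best.append(costs.min(axis=1))
    path = [int(best[-1].argmin())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))                # chosen unit index per position
```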

Overview of NITECH HMM-based speech synthesis system for Blizzard Challenge 2013

by Shinji Takaki, Kei Sawada, Kei Hashimoto, Keiichiro Oura, Keiichi Tokuda
"... This paper describes a hidden Markov model (HMM) based speech synthesis system developed for the Blizzard Challenge 2013. In the Blizzard Challenge 2013, audiobooks are provided as training data. In this paper, we focus on a construction of databases for training acoustic models from audiobooks. An ..."
Abstract
This paper describes a hidden Markov model (HMM) based speech synthesis system developed for the Blizzard Challenge 2013. In the Blizzard Challenge 2013, audiobooks are provided as training data. In this paper, we focus on the construction of databases for training acoustic models from audiobooks. An automatic alignment technique based on speech recognition is used to obtain pairs of audio and transcriptions. We also focus on training highly natural and neutral acoustic models from audiobooks. Audiobooks consist of speech with various qualities, styles, emotions, etc., and it is necessary to handle such data appropriately for training high quality acoustic models. We pruned non-neutral and unreliable speech data from the aligned data with multiple techniques, and trained acoustic models that normalize differences in speaking style, recording conditions, and file format among chapters using adaptive training for each chapter. Subjective evaluation results show that the developed system synthesizes highly natural and intelligible speech. Index Terms: speech synthesis, hidden Markov model, audiobook, data pruning, adaptive training
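
The pruning techniques are not spelled out in the abstract, so the sketch below shows one plausible likelihood-based variant (an assumption, not the system's actual criteria): utterances whose average alignment log-likelihood falls far below the corpus mean are dropped before acoustic-model training.

```python
# Hypothetical likelihood-based pruning: discard utterances whose per-frame
# alignment log-likelihood is an outlier relative to the corpus distribution.
import numpy as np

def prune_utterances(utterances, loglik_per_frame, n_sigma=2.0):
    """utterances: list; loglik_per_frame: (N,) average log-likelihoods."""
    ll = np.asarray(loglik_per_frame)
    threshold = ll.mean() - n_sigma * ll.std()
    return [u for u, l in zip(utterances, ll) if l >= threshold]
```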

Citation Context

... for synthesizing speech. This is therefore a very important problem and is under discussion. Techniques to handle large speech corpora such as audiobooks for speech synthesis have been proposed [8, 9, 10]. In this paper, the lightly supervised technique is used for the alignment because helpful texts corresponding to the audio exist in audiobooks. Using this technique, the pairs of transcriptions and ...
