Results 1 - 9 of 9
Automatic sentence selection from speech corpora including diverse speech for improved HMM-TTS synthesis quality
In Proc. Interspeech, 2011. Cited by 8 (0 self).
Abstract: Using publicly available audiobooks for HMM-TTS poses new challenges. This paper addresses the issue of diverse speech in audiobooks. The aim is to identify diverse speech likely to have a negative effect on HMM-TTS quality. Manual removal of diverse speech was found to yield better synthesis quality despite halving the training corpus. To handle large amounts of data, an automatic approach is proposed that uses a small set of acoustic and text-based features. A series of listening tests showed that the manual selection was most preferred, while the automatic selection was significantly preferred over the full training set. Index Terms: speech synthesis, HMM-TTS, corpus creation, diverse speech, speaking styles, audiobooks
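A minimal sketch of the kind of acoustic filter such automatic selection implies, in Python (librosa assumed; the two features and their thresholds are illustrative stand-ins, not the paper's actual feature set):

import numpy as np
import librosa

def looks_diverse(wav_path, f0_std_max=40.0, rms_range_max=0.25):
    # Flag an utterance as 'diverse' (e.g. character voices, shouting)
    # from coarse pitch and energy statistics. Thresholds are
    # illustrative assumptions, not values from the paper.
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]
    rms = librosa.feature.rms(y=y)[0]
    f0_std = float(f0.std()) if f0.size else 0.0
    return f0_std > f0_std_max or float(rms.max() - rms.min()) > rms_range_max

# Keep only utterances predicted to be neutral:
# training_set = [p for p in wav_paths if not looks_diverse(p)]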
Lightly Supervised GMM VAD to use Audiobook for Speech Synthesiser
In Proc. ICASSP, 2013. Cited by 7 (5 self).
Abstract: Audiobooks have attracted attention as promising data for training Text-to-Speech (TTS) systems. However, they usually lack a correspondence between audio and text data, and they are usually divided only into chapter units. In practice, the audio and text data must be aligned before they can be used for building TTS synthesisers. However, aligning audio and text data is time-consuming, involves manual labour, and requires persons skilled in speech processing. Previously, we proposed using graphemes to automatically align speech and text data. This paper further integrates a lightly supervised voice activity detection (VAD) technique to detect sentence boundaries as a pre-processing step before the grapheme approach. This lightly supervised technique requires time stamps of speech and silence only for the first fifty sentences. Combining these, we can semi-automatically build TTS systems from audiobooks with minimal manual intervention. From subjective evaluations we analyse how the grapheme-based aligner and/or the proposed VAD technique affect the quality of HMM-based speech synthesisers trained on audiobooks. Index Terms: voice activity detection, lightly supervised, audiobook, HMM-based speech synthesis
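A rough sketch of such a lightly supervised VAD in Python (scikit-learn and librosa assumed; the MFCC features and GMM sizes are assumptions, and the temporal smoothing a real system needs is omitted):

import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def train_vad(y, sr, labelled_spans):
    # labelled_spans: (start, end) speech times for the first fifty
    # sentences, mirroring the paper's light supervision.
    hop = 160  # 10 ms frames at 16 kHz
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop).T
    t = np.arange(len(feats)) * hop / sr
    speech = np.zeros(len(feats), dtype=bool)
    for start, end in labelled_spans:
        speech |= (t >= start) & (t < end)
    gmm_sp = GaussianMixture(8).fit(feats[speech])
    gmm_sil = GaussianMixture(8).fit(feats[~speech])
    return gmm_sp, gmm_sil, feats

def speech_frames(gmm_sp, gmm_sil, feats):
    # Frame-wise likelihood-ratio decision; sentence boundaries can
    # then be read off long runs of silence frames.
    return gmm_sp.score_samples(feats) > gmm_sil.score_samples(feats)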
Lightly Supervised Discriminative Training of Grapheme Models for Improved Sentence-level Alignment of Speech and Text Data
In Proc. Interspeech (accepted), 2013. Cited by 5 (4 self).
Abstract: This paper introduces a method for lightly supervised discriminative training using MMI to improve the alignment of speech and text data for use in training HMM-based TTS systems for low-resource languages. In TTS applications, due to the use of long-span contexts, it is important to select training utterances whose transcriptions are wholly correct. In a low-resource setting, when using poorly trained grapheme models, we show that MMI discriminative training at the grapheme level enables us to increase the amount of correctly aligned data by 40% while maintaining a 7% sentence error rate and a 0.8% word error rate. We present the procedure for lightly supervised discriminative training with regard to the objective of minimising sentence error rate. Index Terms: automatic alignment, grapheme models, light supervision, MMI, text-to-speech
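For reference, the MMI criterion being optimised here can be written in its standard form (notation assumed; the paper may differ in scaling details):

\[
\mathcal{F}_{\mathrm{MMI}}(\lambda) \;=\; \sum_{u} \log
\frac{p_{\lambda}(O_u \mid H_u)^{\kappa}\, P(H_u)}
     {\sum_{H} p_{\lambda}(O_u \mid H)^{\kappa}\, P(H)}
\]

where O_u is the audio of utterance u, H_u its reference grapheme transcription, the denominator sums over competing hypotheses, and \kappa is an acoustic scale factor.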
TUNDRA: A Multilingual Corpus of Found Data for TTS Research Created with Light Supervision
In Proc. Interspeech, 2013. Cited by 1 (1 self).
Abstract: Simple4All Tundra (version 1.0) is the first release of a standardised multilingual corpus designed for text-to-speech research with imperfect or found data. The corpus consists of approximately 60 hours of speech data from audiobooks in 14 languages, as well as utterance-level alignments obtained with a lightly supervised process. Future versions of the corpus will include finer-grained alignment and prosodic annotation, all of which will be made freely available. This paper gives a general outline of the data collected so far and a detailed description of how this has been done, emphasizing the minimal language-specific knowledge and manual intervention used to compile the corpus. To demonstrate its potential use, text-to-speech systems have been built for all languages using unsupervised or lightly supervised methods, which are also briefly presented in the paper.
Modeling Pause-Duration for Style-Specific Speech Synthesis
"... A major contribution to speaking style comes from both the location of phrase breaks in an utterance, as well as the duration of these breaks. This paper is about modeling the duration of style specific breaks. We look at six styles of speech here. We present analysis that shows that these styles di ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Abstract: A major contribution to speaking style comes from both the location of phrase breaks in an utterance and the duration of these breaks. This paper is about modeling the duration of style-specific breaks. We look at six styles of speech and present analysis showing that these styles differ in the duration of pauses in natural speech. We built CART models to predict pause duration in these corpora and integrated them into the Festival speech synthesis system. Our objective results show that, given sufficient training data, we can build style-specific models. Our subjective tests show that listeners can perceive the difference between the models and prefer style-specific models over simple pause-duration models.
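A toy version of such a model using a scikit-learn regression tree as a stand-in for Festival's CART (the feature set below is invented for illustration; real features would come from Festival's utterance structure):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical break features: [punctuation class, words since last
# break, syllables in previous word, relative position in utterance].
X = np.array([[1, 7, 2, 0.4],
              [2, 12, 1, 1.0],
              [0, 3, 3, 0.2]])
y = np.array([0.25, 0.60, 0.10])   # observed pause durations, seconds

cart = DecisionTreeRegressor(max_depth=4).fit(X, y)
print(cart.predict([[1, 8, 2, 0.5]]))  # predicted pause duration

With one such tree trained per style, the synthesiser can look up a style-specific duration at each predicted break.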
Competency Area
"... Abstract—This paper reports on the automatic alignment of audiobooks in Afrikaans. An existing Afrikaans pronunciation dictionary and corpus of Afrikaans speech data are used to generate baseline acoustic models. The baseline system achieves an average duration independent overlap rate of 0.977 on t ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract: This paper reports on the automatic alignment of audiobooks in Afrikaans. An existing Afrikaans pronunciation dictionary and a corpus of Afrikaans speech data are used to generate baseline acoustic models. The baseline system achieves an average duration-independent overlap rate of 0.977 on the first three chapters of an audio version of “Ruiter in die Nag”, an Afrikaans book by Mikro. The average duration-independent overlap rate increases to 0.990 when the speech data from the audiobook are used to perform Maximum A Posteriori adaptation of the baseline models. The corresponding value for models trained directly on the audiobook data is 0.996. An automatic measure of alignment accuracy is also introduced and compared to accuracies measured relative to a gold standard.
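The abstract does not spell out its "duration independent overlap rate", but one plausible per-segment formulation is shared duration over the union of the reference and hypothesised intervals, averaged over segments (this definition is an assumption for illustration only):

def overlap_rate(ref, hyp):
    # ref, hyp: (start, end) pairs for the same word/segment sequence.
    rates = []
    for (rs, re), (hs, he) in zip(ref, hyp):
        shared = max(0.0, min(re, he) - max(rs, hs))
        union = max(re, he) - min(rs, hs)
        rates.append(shared / union if union > 0 else 1.0)
    return sum(rates) / len(rates)

print(overlap_rate([(0.0, 0.5), (0.5, 0.9)],
                   [(0.02, 0.48), (0.50, 1.00)]))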
Using Adaptation to Improve Speech Transcription Alignment in Noisy and Reverberant Environments
2013.
Abstract: When using data retrieved from the internet to create new speech databases, the recording conditions are often highly variable within and between sessions. This variance influences the overall performance of any automatic speech and text alignment techniques used to process the data. In this paper we discuss the use of speaker adaptation methods to address this issue. Starting from a baseline system for automatic sentence-level segmentation and speech-and-text alignment, based on GMMs and grapheme HMMs respectively, we employ Maximum A Posteriori (MAP) and Constrained Maximum Likelihood Linear Regression (CMLLR) techniques to model the variation in the data and increase the amount of confidently aligned speech. We tested 29 different scenarios, covering reverberation, 8-talker babble noise, and white noise in various combinations and SNRs. Results show that the performance of the MAP-based segmentation is strongly influenced by the noise type, as well as by the presence or absence of reverberation. On the other hand, CMLLR adaptation of the acoustic models gives an average 20% increase in the percentage of aligned data for the majority of the studied scenarios. Index Terms: speech alignment, speech segmentation, adaptive training, CMLLR, MAP, VAD
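For context, CMLLR (also known as fMLLR) estimates a single affine feature transform per condition; in the usual formulation, the adapted likelihood of an observation o_t under a Gaussian with mean \mu and covariance \Sigma is

\[
\log p(o_t) \;=\; \log \mathcal{N}\!\left(A o_t + b;\; \mu,\; \Sigma\right) \;+\; \log \lvert A \rvert
\]

so one (A, b) pair learned from a noisy or reverberant session moves all of that session's features towards the clean acoustic models before alignment.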
Annotating Speech Corpus for Prosody Modeling in Indian Language Text to Speech Systems
"... A spoken language system, it may either be a speech synthesis or a speech recognition system, starts with building a speech corpora. We give a detailed survey of issues and a methodology that selects the appropriate speech unit in building a speech corpus for Indian language Text to Speech systems. ..."
Abstract
- Add to MetaCart
Abstract: A spoken language system, whether for speech synthesis or speech recognition, starts with building a speech corpus. We give a detailed survey of the issues, and a methodology for selecting the appropriate speech unit, in building a speech corpus for Indian-language Text to Speech systems. The paper ultimately aims to improve the intelligibility of the synthesized speech in Text to Speech synthesis systems. To begin with, an appropriate text file is selected for building the speech corpus. A corresponding speech file, the spoken realisation of the selected text, is then recorded and stored. The speech file is processed at different levels: paragraphs, sentences, phrases, words, syllables and phones. These are called the speech units of the file. Research has been carried out taking each of these units as the basic unit for processing. This paper analyses work using phones, diphones, triphones, syllables and polysyllables as the basic unit for speech synthesis, and provides a recommended set of combinations for polysyllables. Concatenative speech synthesis concatenates these basic units to synthesize intelligible, natural-sounding speech. The speech units are annotated with relevant prosodic information about each unit, manually or automatically based on an algorithm. The database consisting of the units along with their annotated information is called the annotated speech corpus. A clustering technique applied to the annotated speech corpus provides a way to select the appropriate unit for concatenation, based on the lowest total join cost of the speech unit.
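A minimal sketch of the join-cost idea mentioned at the end, in Python (the spectral-distance cost below is a common textbook choice and an assumption here, not the paper's exact cost):

import numpy as np

def join_cost(unit_a, unit_b):
    # unit_a, unit_b: MFCC matrices (frames x coefficients) for two
    # candidate units; cost is the discontinuity at the join point.
    return float(np.linalg.norm(unit_a[-1] - unit_b[0]))

def total_join_cost(units):
    # Total cost of concatenating a candidate unit sequence; unit
    # selection keeps the sequence minimising this (plus target costs).
    return sum(join_cost(a, b) for a, b in zip(units, units[1:]))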
Overview of NITECH HMM-based speech synthesis system for Blizzard Challenge 2013
"... This paper describes a hidden Markov model (HMM) based speech synthesis system developed for the Blizzard Challenge 2013. In the Blizzard Challenge 2013, audiobooks are provided as training data. In this paper, we focus on a construction of databases for training acoustic models from audiobooks. An ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract: This paper describes a hidden Markov model (HMM) based speech synthesis system developed for the Blizzard Challenge 2013, in which audiobooks are provided as training data. We focus on the construction of databases for training acoustic models from audiobooks. An automatic alignment technique based on speech recognition is used to obtain pairs of audio and transcriptions. We also focus on training natural, neutral acoustic models from audiobooks. Audiobooks contain speech of varying quality, style, and emotion, and such data must be handled appropriately to train high-quality acoustic models. We pruned non-neutral and error-prone speech data from the aligned data using multiple techniques, and trained acoustic models that normalise differences in speaking style, recording conditions, and file formats among chapters, using adaptive training per chapter. Subjective evaluation results show that the developed system synthesized natural and intelligible speech. Index Terms: speech synthesis, hidden Markov model, audiobook, data pruning, adaptive training
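One simple form the described pruning can take is likelihood-based outlier removal against a neutral reference model, sketched below (score_fn and the kept fraction are assumptions; the paper combines multiple pruning techniques):

import numpy as np

def prune_utterances(utts, score_fn, keep_fraction=0.8):
    # score_fn: average per-frame log-likelihood of an utterance under
    # a reference 'neutral' acoustic model (assumed to exist).
    scores = np.array([score_fn(u) for u in utts])
    cutoff = np.quantile(scores, 1.0 - keep_fraction)
    return [u for u, s in zip(utts, scores) if s >= cutoff]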