Huang, X., Acero, A., Adcock, J., Hon, H., Goldsmith, J., Liu, J., Plumpe, M., 1996. Whistler: a trainable text-to-speech system. In: Proc. ICSLP-96, Philadelphia, PA, pp. 659--662.
....a search heuristic which maximized continuity in formant structure and pitch. As a proof of concept, the next section discusses experiments performed automatically with no human guidance. 3.5 Sub-phonetic: automatic experiments. In this section, concatenations are made at the sub-phonetic level [60] and their locations are automatically determined by a search heuristic seeking to maximize continuity in formant structure and pitch. To enable the concatenation to occur at any point within a phone, the units are designated to be individual pitch periods with no phonetic identity. This is the ....
....entirely natural with neither formant nor pitch discontinuities. Though individual pitch periods offer a great amount of control, it may be possible to retain much of this control by properly segmenting semivowels and diphthongs and by adjusting segment boundaries to coincide with pitch periods [60]. For example, in Chapter 4, it is discovered from data that intervocalic [l] can be divided into half-phones at the spectral extremum [51] and that these half-phones can be used for concatenations. The spectral extrema that would have been found anyway by the sub-phonetic search are pushed out ....
X. Huang, A. Acero, J. Adcock, H. Hon, J. Goldsmith, J. Liu, and M. Plumpe, "Whistler: A trainable text-to-speech system," in Proc. ICSLP '96, Philadelphia, PA, Oct. 1996, vol. 4, pp. 2387--2390.
....more suitable for a given phonetic and prosodic context, hence reducing the amount of additional signal processing. Selection of variable-size units from large single-speaker speech databases usually involves minimizing distortion introduced when selected units are modified and concatenated, e.g. [54, 35, 34, 19, 24, 7]. This distortion is represented in terms of two cost functions: 1) a target cost, which is an estimate of the difference between the database unit u_i and the target t_i, and 2) a concatenation cost, which is an estimate of the quality of concatenation of two units u_{i-1} and u_i. The task of ....
X. Huang et al., "Whistler: a trainable text-to-speech system," Proc. ICSLP, 4:169-172, 1996.
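The two-cost formulation quoted in the context above (a target cost per unit plus a concatenation cost per join) is usually minimized with a dynamic-programming search. The following is a minimal Python sketch of that idea, not the method of any cited paper: the function name `select_units`, the argument layout, and the toy numeric costs in the usage lines are all hypothetical, and each target is assumed to have at least one candidate unit.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")  # target specification type (hypothetical)
U = TypeVar("U")  # database unit type (hypothetical)


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    concat_cost: Callable[[U, U], float],
) -> List[U]:
    """Return one candidate per target, minimizing the summed
    target costs plus the concatenation costs of adjacent units."""
    if not targets:
        return []
    # best[i][j]: cheapest total cost of a path ending in candidates[i][j]
    best = [[target_cost(targets[0], u) for u in candidates[0]]]
    # back[i][j]: index of the predecessor chosen for candidates[i][j]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for u in candidates[i]:
            costs = [
                best[i - 1][k] + concat_cost(prev, u)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + target_cost(targets[i], u))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    return path[::-1]


# Toy usage: units and targets are plain numbers, costs are distances.
chosen = select_units(
    [1.0, 2.0, 3.0],
    [[0.0, 1.0], [2.5, 5.0], [3.0, 9.0]],
    target_cost=lambda t, u: abs(t - u),
    concat_cost=lambda a, b: 0.1 * abs(a - b),
)
print(chosen)  # -> [1.0, 2.5, 3.0]
```

The search keeps only the cheapest path into each candidate, so its cost grows with the number of candidate pairs at each join rather than with the number of complete unit sequences.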
....segmentation of the speech corpus, which is a labor-intensive process. Nevertheless, more contexts than diphones should be taken into account to achieve more natural synthetic voice quality. The suitable unit seems to be a triphone. This paper presents a new approach to speech synthesis [3][4][5]. It is based on recent advances in speech recognition technology. It uses triphones as the basic speech units. These units contain both the stationary parts of the phone and the transitional information of the adjacent phones. HMMs are used to model triphones. They are trained on the basis of the ....
....recognition community, and thus can be applied to get rid of the above-mentioned problems. For each set of triphones derived from the same base phone, corresponding states are clustered using binary decision trees (see Figure 1). HMM states in triphones are often called fenemes [4] or senones [5]. A binary decision tree is constructed automatically for each feneme using a large list of questions concerning immediate phonetic context, and two clustering parameters. The process runs as follows. Corresponding fenemes of triphones derived from the same base phone are pooled. All the data ....
[Article contains additional citation context not shown here]
Huang X., Acero A., Adcock J., Hon H-W., Goldsmith J., Liu J., and Plumpe M. (1996): "Whistler: A Trainable Text-to-Speech System"; Proceedings of ICSLP '96, Philadelphia, pp. 2387-2390.
....continuity measures. It also describes a new continuity measure developed at IBM which substantially outperforms all other measures tested. 1. Introduction. Data-driven unit selection concatenative approaches to speech synthesis have become increasingly popular in recent years [1], [2], [3], [4], [5], [6], [7]. In these systems speech is synthesised by concatenating units selected from a database typically containing many thousands of units. The selection is usually made using a dynamic programming search to optimise a cost function given some target specification of the sentence to ....
Huang, X., Acero, A., Adcock, J., Hon, H-W., Goldsmith, J., Liu, J., and Plumpe, M. (1996) Whistler: A Trainable Text-to-Speech System, Proc. ICSLP'96, Philadelphia.
....producing audio and compile a set of event-triggered rules to govern the generation of the nonverbal behavior. The first approach must be used for recorded audio-based animation or TTS engines such as Festival [32] while the second must be used with TTS engines such as Microsoft's Whistler [19]. We have used both approaches in our systems, and the current toolkit is capable of producing both kinds of animation schedules, but we will focus our discussion here on absolute time-based scheduling with a TTS engine such as Festival. The first step in time-based scheduling is to extract only ....
Huang, X., Acero, A., Adcock, J., Hon, H.-W., Goldsmith, J., Liu, J., and Plumpe, M., Whistler: A Trainable Text-to-Speech System. Proc. 4th Int'l. Conf. on Spoken Language Processing (ICSLP '96), pp. 2387-2390, Piscataway, NJ, 1996.
....producing audio and compile a set of event-triggered rules to govern the generation of the nonverbal behavior. The first approach must be used for recorded audio-based animation or TTS engines such as Festival [32] while the second must be used with TTS engines such as Microsoft's Whistler [19]. We have used both approaches in our systems, and the current toolkit is capable of producing both kinds of animation schedules, but we will focus our discussion here on absolute time-based scheduling with a TTS engine such as Festival. The first step in time-based scheduling is to extract only ....
Huang, X., Acero, A., Adcock, J., Hon, H.-W., Goldsmith, J., Liu, J., and Plumpe, M., Whistler: A Trainable Text-to-Speech System. Proc. 4th Int'l. Conf. on Spoken Language Processing (ICSLP '96), pp. 2387-2390, Piscataway, NJ, 1996.
....Segmentation. In the initial segmentation, a word utterance of a new speaker is roughly divided into phonemic segments by DP alignment with a reference speaker's segmented utterance. Most automatic segmentation systems for speech synthesis use powerful HMM speech alignment techniques [1 3]. Currently, we use a typical DP matching as an alignment tool since it is capable of segmenting word utterances for this rough alignment purpose, although a more powerful aligner would be better. This rough alignment gives initial values for phonemic boundaries of the input utterance ....
X. Huang et al., "Whistler: A Trainable Text-to-Speech System," Proceedings of ICSLP'96, pp.2387-2390, 1996.
....1. INTRODUCTION. In recent years corpus-based approaches to unit selection for concatenative speech synthesis have become increasingly popular due to their improved sensitivity to unit context, both phonetic and prosodic, compared to earlier diphone and polyphone approaches [1], [3], [4], [7], [8]. These systems are usually based on large speech databases, typically from 30 minutes to several hours in duration, and use sophisticated search algorithms to determine which segments to concatenate to synthesise a given sentence. In many of these systems however, all, or nearly all, the ....
Huang, X., Acero, A., Adcock, J., Hon, H-W., Goldsmith, J., Liu, J., and Plumpe, M. (1996) Whistler: A Trainable Text-to-Speech System, Proc. ICSLP'96, Philadelphia, pp. 2387--2390.
.... clustering tree, triphone contexts that do not occur in the database and were unseen during training can be reconstructed and mapped appropriately, a standard procedure in speech recognition (Jelinek and Mercer, 1980; Young, 1992). A similar approach was implemented in Microsoft's TTS system (Huang et al. 1996; Hon et al. 1998). 7 Word and syllable concatenation. Attempts to record and play back words have not been successful, largely due to the large and changing number of words and the need to make contextual adjustments (Allen, 1992, page 768). For restricted domains a version of the unit ....
Huang, Xuedong, Alex Acero, Jim Adcock, Hsiao-Wuen Hon, John Goldsmith, Jingsong Liu, and Mike Plumpe. 1996. Whistler: A trainable text-to-speech system. In Proceedings of the International Conference on Spoken Language Processing (Philadelphia, PA), volume 4, pages 2387-2390.
....the system may have to initiate clarification dialogue to reduce the amount of information returned from the back end, in order not to generate unwieldy verbal responses. On the speech side, recent work in synthesis based on non uniform units has resulted in much improved synthetic speech quality [33, 18]. However, we must continue to improve speech synthesis capabilities, particularly with regard to the encoding of prosodic and paralinguistic information such as emotion. As is the case on the input side, we must also develop integration strategies for language generation and speech synthesis. ....
Huang, X. "WHISTLER: A Trainable Text-to-Speech System," Proc. ICSLP, 2387-2390.
.... These include linguistically motivated units such as words, syllables, demi-syllables, diphones, and phones, as well as automatically derived variable-length units [4, 25]; other units include sub-phonetic segments corresponding to the states in a trained hidden Markov model (HMM) of a phone [5, 12]. Diphones have been the most popular synthesis units because they provide a reasonable tradeoff between capturing many coarticulation effects, minimizing concatenation discontinuities, and being relatively small in number. A diphone segment captures the transition between two phones by starting ....
.... have typically been augmented to include some longer units that are three or four phones in length [28]. More recently there has been work with phone and sub-phone units, primarily due to application of continuous speech recognition modeling techniques (e.g. HMMs) to the speech synthesis task [5, 12]. In the past, the use of phone-based units for concatenative synthesis was difficult because of the strong contextual variation in the acoustic realizations of each phone and the resulting problems of storing multiple variants of each phone, selecting the appropriate units during synthesis, and ....
X. Huang, A. Acero, J. Adcock, H. Hon, J. Goldsmith, J. Liu, and M. Plumpe, "Whistler: A trainable text-to-speech system," in Proceedings of the International Conference on Spoken Language Processing, Philadelphia, PA, vol. 4, pp. 2387--2390, Oct. 1996.
....Figure 1-1, current synthesizers have tended to strive for flexibility of vocabulary and sentences at the expense of naturalness (i.e. arbitrary words can be synthesized, but do not sound very natural). This applies to articulatory, terminal analog, and concatenative methods of speech synthesis [2, 4, 8, 16, 17, 22, 28]. [Figure 1-1: Synthesis development curve in this thesis work. Axes: naturalness/quality against sentence/vocabulary flexibility; labels: Our Approach, Usual Approach, Pre-recorded Speech, Ultimate Synthesis.] An alternative strategy is one which seeks to maintain naturalness by operating in a constrained domain. There ....
....cost [2, 4, 17, 28]. Target costs can incorporate information about phonological environment, spectral measures, and prosody measures. Concatenation costs can incorporate information about spectral continuity and prosody continuity. It can also contain trigram context in the form of triphones [16]. However, most of these works have operated over a generic speech corpus not specifically designed for concatenative speech synthesis. In this thesis, we attempt to define an appropriate set of units, and design a corpus which can be used for synthesis of isolated words. Because concatenative ....
X. Huang, A. Acero, J. Adcock, H. Hon, J. Goldsmith, J. Liu, and M. Plumpe. Whistler: A trainable text-to-speech system. In Proc. ICSLP '96, pages 2387--2390, Philadelphia, PA, October 1996.
....how current synthesizers have tended to strive for flexibility of vocabulary and sentences at the expense of naturalness (i.e. arbitrary words and sentences can be synthesized, but do not sound very natural). This applies to articulatory, rule-based, and concatenative methods of speech synthesis [2, 5, 6, 9]. An alternative strategy is one which seeks to maintain naturalness by operating in a constrained domain. There are potentially many applications where this mode of operation is perfectly suitable. In conversational systems for example, the domain of operation is often quite limited, and is known ....
X. Huang, A. Acero, J. Adcock, H. Hon, J. Goldsmith, J. Liu, and M. Plumpe, "Whistler: A trainable text-to-speech system," in Proc. ICSLP, Philadelphia, PA, pp. 2387--2390, Oct. 1996.
....the system may have to initiate clarification dialogue to reduce the amount of information returned from the back end, in order not to generate unwieldy verbal responses. On the speech side, recent work in synthesis based on non uniform units has resulted in much improved synthetic speech quality [33, 18]. However, we must continue to improve speech synthesis capabilities, particularly with regard to the encoding of prosodic and paralinguistic information such as emotion. As is the case on the input side, we must also develop integration strategies for language generation and speech synthesis. ....
Huang, X. "WHISTLER: A Trainable Text-to-Speech System," Proc. ICSLP, 2387-2390.
....and prosodic context, hence reducing the amount of additional signal processing. The unit selection process, however, becomes more complex as it requires a dynamic search. 2.4.2 Selecting Units from Multiple Candidates. Selection of variable-size units from large single-speaker speech databases [82, 42, 39, 11, 24, 38, 6, 31] is typically based on minimizing acoustic distortion introduced when selected units are modified and concatenated. This distortion is represented in terms of two cost functions (Figure 2.4): 1) target cost C(u_i, t_i), which is an estimate of the difference between the database unit u_i and ....
X. Huang, A. Acero, J. Adcock, H. Hon, J. Goldsmith, J. Liu, and M. Plumpe. Whistler: a trainable text-to-speech system. In Proceedings of the Intl. Conf. on Spoken Language Processing, volume 4, pages 169-172, 1996.
.... (TTS) systems (including both formant and concatenative synthesizers) that require human experts to hand-craft and fine-tune the synthesis units (or unit parameters) [2][3]. Whistler uses an automatic procedure to configure and generate the synthesis units directly from any recording database [7][8]. Whistler is a concatenative synthesizer, using units that are segmented by Microsoft's speech recognition engine Whisper [6]. Whistler uses decision-tree clustered phone-based units. Each unit is a cluster of phones, whose phonetic contexts and other characteristics such as stress are used to ....
....To estimate the smoothing parameters of each unit, we need to first define the number of states N and their time instants T. To do this we select the unit with highest probability from the segmentation HMM that lies in a neighborhood of the mean pitch, duration, and amplitude (discarding outliers) [7], and call this the golden unit, whose parameters will be used as the static mean. We set N and T to be the number of states and the time instants for the states of the golden unit. We can estimate the means and variances in (1) for each unit u, from all the instances of that unit u_j present in a ....
Huang X., Acero A., Adcock J., Hon H., Goldsmith J., Liu J., and Plumpe M. "Whistler: A Trainable Text-to-Speech System". proc. ICSLP. Philadelphia, Oct, 1996.
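The "golden unit" choice described in the context above (the highest-probability instance among those near the mean pitch, duration, and amplitude) can be illustrated with a short Python sketch. The excerpt does not say how the neighborhood is defined, so the threshold used here (a multiple of the standard deviation, `max_dev`) is an assumption, as are the names `UnitInstance` and `pick_golden_unit` and the fallback to all instances when every one is rejected.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class UnitInstance:
    """One occurrence of a unit in the database (hypothetical layout)."""
    log_prob: float   # score from the segmentation HMM
    pitch: float
    duration: float
    amplitude: float


def pick_golden_unit(instances: List[UnitInstance],
                     max_dev: float = 1.5) -> UnitInstance:
    """Pick the highest-scoring instance whose pitch, duration, and
    amplitude all lie within max_dev standard deviations of the mean.
    The standard-deviation neighborhood is an assumed definition."""
    if not instances:
        raise ValueError("no instances to choose from")

    def near_mean(attr: str) -> Callable[[UnitInstance], bool]:
        values = [getattr(u, attr) for u in instances]
        mu, sigma = mean(values), pstdev(values)
        return lambda u: abs(getattr(u, attr) - mu) <= max_dev * sigma

    checks = [near_mean(a) for a in ("pitch", "duration", "amplitude")]
    inliers = [u for u in instances if all(ok(u) for ok in checks)]
    # Assumed fallback: if every instance is an outlier, use them all.
    pool = inliers or instances
    return max(pool, key=lambda u: u.log_prob)
```

With this definition, an instance with the best HMM score but an atypical pitch is skipped in favor of the best-scoring typical instance, which is the behavior the excerpt attributes to outlier discarding.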
....This data-driven formant synthesizer bridges the gaps between rule-based formant synthesizers and concatenative synthesizers by synthesizing speech that is both smooth and resembles the speaker in the training data. 1. INTRODUCTION. Both rule-based formant synthesis [2] and concatenative synthesis [4][5] yield unnatural speech, although for different reasons. Concatenative synthesizers sound quite natural within a unit, but overall naturalness can be low due to the presence of discontinuities at unit boundaries. Rule-based formant synthesizers never exhibit such discontinuities, but their ....
....the smoothness of the synthesized speech, the synthesized speech of [7] exhibited formant bandwidths that are wider than the natural ones. This prompted us to investigate using formants as features instead of cepstrum. In our experiments we used 3-state tree-clustered context-dependent phone HMMs [4], where each state is modeled with one Gaussian density function with diagonal covariance matrices. A total of 24 parameters per state are then needed: 3 formant means, 3 formant variances, 3 bandwidth means, 3 bandwidth variances, as well as the corresponding delta parameters. The first three ....
Huang X., Acero A., Adcock J., Hon H., Goldsmith J., Liu J., and Plumpe M. "Whistler: A Trainable Text-to-Speech System". International Conference on Spoken Language Processing. Philadelphia, pp. 2387-2390. Oct, 1996.
....use of probabilistic learning methods which are aimed at the same optimization criteria. Through this automatic unit generation, Whistler can automatically produce synthetic speech that sounds very natural and resembles the acoustic characteristics of the original speaker. 1. INTRODUCTION. In [4][7], we have presented Whistler: Microsoft's Trainable Text-to-Speech (TTS) System. In contrast to most other TTS systems (including both formant and concatenative synthesizers) [1][2][12], which require human experts to hand-craft and fine-tune the synthesis units (or unit parameters), Whistler uses ....
....detail the design issues and improvements we made to the construction of the synthesis unit inventory automatically from speech databases. Whistler is based upon a concatenative synthesizer whose unit inventory is generated by cutting speech segments from a database recorded by a target speaker [4][7][11]. There are typically three phases in the process of building a unit inventory: 1. Determine the synthesis units and derive the conversion between a phoneme string and a unit string. 2. Segmentation of each unit from spoken speech. 3. Selection of one (or a few) good unit instance when many ....
[Article contains additional citation context not shown here]
Huang X., Acero A., Adcock J., Hon H., Goldsmith J., Liu J., and Plumpe M. "Whistler: A Trainable Text-to-Speech System". International Conference on Spoken Language Processing. Philadelphia, Oct, 1996
....acoustics and pitch as well as when it was taken in combination with synthetic pitch. The synthetic pitch was found to be the aspect of the system that results in greatest quality degradation. 1. INTRODUCTION. We have presented Whistler, Microsoft's Trainable Text-to-Speech (TTS) system, in [1][2]. We will primarily look at three aspects of the system: the pitch (fundamental frequency), phoneme duration, and acoustics. Whistler has a concatenative synthesizer, using context-dependent phoneme units that are automatically selected from a training database. The pitch is generated by rule, ....
Huang X., Acero A., Adcock J., Hon H., Goldsmith J., Liu J., and Plumpe M. "Whistler: A Trainable Text-to-Speech System". Proceedings International Conference on Spoken Language Processing. Philadelphia, Oct, 1996.
....rate reduction at the 5 false rejection point. A natural direction for future work is in generating data for words with less than 4 occurrences by concatenative synthesis, using the real acoustic segments (which could be either senones or triphones) as is done in Microsoft's Whistler TTS system [9]. The use of the real acoustic segments in the synthesis may ensure that the synthesized data exhibits the same similarity patterns as the real data. Although our confidence measure is being measured in the supervised adaptation scenario, it should be feasible to extend this work to unsupervised ....
Huang, X., Acero, A., Adcock, J., Hon, H.-W., Goldsmith, J., Liu, J., and Plumpe, M., "Whistler: A Trainable Text-to-Speech System," ICASSP-96 4:2397--2390, 1996.
No context found.
Huang, X., Acero, A., Adcock, J., Hon, H., Goldsmith, J., Liu, J., Plumpe, M., 1996. Whistler: a trainable text-to-speech system. In: Proc. ICSLP-96, Philadelphia, PA, pp. 659--662.
No context found.
X. Huang, A. Acero, J. Adcock, H.W. Hon, J. Goldsmith, J. Liu, and M. Plumpe, "WHISTLER: A Trainable Text-to-Speech System," Proc. ICSLP, 2387--2390, 1996.
No context found.
X. Huang, A. Acero, J. Adcock, H. Hon, J. Goldsmith, J. Liu, and M. Plumpe, "Whistler: A trainable text-to-speech system," in ICSLP '96, Philadelphia, PA, 1996.