Stylianou, Y., Dutoit, T. and J. Schroeter, "Diphone concatenation using a harmonic plus noise model of speech," Proc. of Eurospeech, 1997, pp. 613-616.
....[IEEE Transactions on Speech and Audio Processing, vol. 9, no. 1, January 2001, p. 40. Table I: Composition of material for the perceptual experiment; the total number of CVC stimuli is 2415 = 23 C × 5 V × 21 C.] ....thesis overlap and add (MBROLA) [8] and harmonic plus noise modeling (HNM) [28]. Reference [5] presents several different techniques for spectral smoothing, none of which the authors found really satisfactory. A fourth approach is to include context-sensitive or specialized units in the database [24]. This implies that one knows which contexts can be clustered so as to keep the ....
Y. Stylianou, T. Dutoit, and J. Schroeter, "Diphone concatenation using a harmonic plus noise model of speech," in Proc. 5th Eur. Conf. Speech Communication Technology (Eurospeech'97), vol. 2, Rhodes, Greece, 1997, pp. 613--615.
....of Edinburgh. Within the general architecture of Festival, FlexTalk modules provide text analysis and produce the phonetic prosodic specifications. Unit selection is based on CHATR's implementation of unit selection, but has been modified extensively. Synthesis is typically performed by HNM [10]. At AT&T, the unit selection approach has been extended by using half-phone units [5]. This, together with a large and accurately labeled speech database, has resulted in very high quality synthesis [2, 1]. 4. CACHING CONCATENATION COSTS. This section outlines the motivation for ....
Y. Stylianou, T. Dutoit, and J. Schroeter. Diphone concatenation using a harmonic plus noise model of speech. In Proceedings of Eurospeech'97, Rhodes, Greece, 1997.
....of synthesis conditions. There were a total of six variations of prosody generation used, and two synthesizers. The segments and durations specified were the same in every case. There were a total of 144 test items. The test conditions examined were as follows: • Two synthesizers, HNM [9] and PSOLA [6], were used to synthesize the test utterances. • A natural prosody control case (Nat). A smoothed F0 contour was extracted directly from the recorded test utterances using the icda program available with the Festival system. The contour and the segment specification were fed to ....
J. Schroeter Y. Stylianou, T. Dutoit. Diphones concatenation using a harmonic plus noise model of speech. Proc. EUROSPEECH, Sept. 1997.
.... Selection among candidates is done dynamically at synthesis, in a manner that is based on and extends unit selection implemented in the CHATR synthesis system [1], [4]. Selected units may be either phones or diphones, and they can be synthesized by a variety of methods, including PSOLA [5], HNM [11], and simple unit concatenation. The AT&T system, with CHATR unit selection, was implemented within the framework of the Festival Speech Synthesis System [2]. The voice database amounted to approximately one and one-half hours of speech and was constructed from read text taken from three sources. ....
....Because of the rich database and the use of prosodic targets in unit selection, the results are often acceptable. For the purpose of this paper, we denote this method as WAV. Two synthesis options were added in the experimental AT&T system. The new modules implement Harmonic plus Noise Model (HNM) [11] and PSOLA synthesis, and may be selected at run time. Both optionally support prosody modification. HNM also performs spectral smoothing at concatenation points. It does this by attenuating mismatched formants near the concatenation boundary, effecting a fade-out/fade-in formant transition. For ....
Y. Stylianou, T. Dutoit, and J. Schroeter. Diphones concatenation using a harmonic plus noise model of speech. Proc. EUROSPEECH, Sept. 1997.
....For each of the three conditions, 12 utterances were chosen from the test set. Six utterances were taken from the WSJ corpus and six from the Prompts corpus. The chosen utterances exhibit a large variety of complexity. We used the Festival speech synthesis system [2] with an HNM synthesizer [15] and a PSOLA synthesizer [10] to produce the stimuli with the desired F0 contours. The test utterances were presented in a randomized order to four groups of 10-11 listeners (43 total) experienced with voice quality tests but not familiar with text-to-speech synthesis. They judged the quality of ....
J. Schroeter Y. Stylianou, T. Dutoit. Diphones concatenation using a harmonic plus noise model of speech. Proc. EUROSPEECH, Sept. 1997.
....harmonic synthesizer uses the same parametric representation (harmonic amplitudes and phases) as the one used in the MBROLA synthesizer (actually it even uses the same harmonic analyzer), but it synthesizes speech in a completely harmonic way. The harmonic model we use is very similar to the HNM model [7], with the following differences: the database (of harmonic parameters) is obtained by processing pitch-asynchronous, fixed-frame-size MBE analysis results to obtain pitch-synchronous representations (there is no need for accurate pitch marking). The synthesis stage is pitch asynchronous. Duration ....
Stylianou, Y., Dutoit T. and J. Schroeter "Diphone concatenation using a harmonic plus noise model of speech," Proc. of Eurospeech, 1997, pp.613-616.
....on a pitch-synchronous Harmonic plus Noise (HNM) representation of speech. HNM has shown the capability of providing high quality prosodic modifications [10] without the buzziness and tonal quality encountered in previously reported methods. Recently, HNM has been proposed for diphone concatenation [11], and informal listening tests have shown that HNM-based synthetic speech is of high quality. Note that HNM does not require pitch marks, unlike other pitch-synchronous speech representations. In order to select a speech representation for our next generation TTS, it was decided to compare TD-PSOLA, ....
Y. Stylianou, T. Dutoit, and J. Schroeter, "Diphone Concatenation using a Harmonic plus Noise Model of Speech," Proc. EUROSPEECH, 1997.
....significantly increasing naturalness. Clearly, a high quality database with accurate, multi-level labels is a crucial element of the system. More details can be found in [2]. 4. HNM SYNTHESIS BACKEND. The speech representation of choice for our female voice is the Harmonic plus Noise Model (HNM) [3]. In HNM the speech spectrum is divided into two bands: a low band, which is represented by harmonically related sinusoids with slowly varying amplitudes and frequencies, and a high band that is instantiated by a time-varying AR model that is excited by Gaussian noise. HNM analysis consists of 3 ....
Y. Stylianou, T. Dutoit, J. Schroeter (1997) "Diphone concatenation using a Harmonic plus Noise Model of speech." In: Eurospeech '97, pp. 613-616.
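The two-band description in the excerpt above can be illustrated with a minimal Python sketch. It is not the cited authors' implementation: the function name `synthesize_hnm_frame` and all of its parameters are hypothetical, amplitudes and pitch are held constant within the frame (the excerpt says they vary slowly), and the noise part is simply Gaussian noise passed through a fixed all-pole (AR) filter, without restricting it to the high band.

```python
import math
import random


def synthesize_hnm_frame(f0, amps, phases, ar_coeffs, noise_gain,
                         n_samples, fs, seed=0):
    """Return one frame: harmonic part plus AR-filtered Gaussian noise.

    Hypothetical helper for illustration only.
    f0        : pitch in Hz (constant within the frame, an assumption)
    amps      : amplitude of harmonic k+1 at index k
    phases    : phase (radians) of harmonic k+1 at index k
    ar_coeffs : a_1..a_p of the all-pole filter 1 / (1 + sum a_i z^-i)
    """
    rng = random.Random(seed)

    # Harmonic part: sum of sinusoids at integer multiples of f0.
    harmonic = []
    for n in range(n_samples):
        t = n / fs
        sample = 0.0
        for k, (a, p) in enumerate(zip(amps, phases)):
            sample += a * math.cos(2.0 * math.pi * (k + 1) * f0 * t + p)
        harmonic.append(sample)

    # Noise part: white Gaussian excitation through the AR filter,
    # y[n] = e[n] - sum_i a_i * y[n - i].
    noise = []
    for n in range(n_samples):
        y = noise_gain * rng.gauss(0.0, 1.0)
        for i, a in enumerate(ar_coeffs, start=1):
            if n - i >= 0:
                y -= a * noise[n - i]
        noise.append(y)

    return [h + v for h, v in zip(harmonic, noise)]
```

For example, `synthesize_hnm_frame(100.0, [1.0, 0.5], [0.0, 0.0], [-0.5], 0.05, 160, 8000)` produces 20 ms of a two-harmonic signal at 8 kHz with a small amount of filtered noise added. A fuller model would update the parameters frame by frame and confine the noise to frequencies above the harmonic band.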
....are located. While this simplifies the analysis process, it increases the complexity of synthesis. In synthesis, the inter-frame incoherence problem (phase mismatch between frames from different acoustic units) has to be taken into account. In previously reported versions of HNM for synthesis [25], [26], cross-correlation functions have been used for estimating phase mismatches. However, this approach increased the complexity of the synthesizer while sometimes lacking efficiency. In this workshop, a novel method for synchronization of signals is presented [27]. The method is based on ....
....technique around a concatenation point, t_i. First, the differences of the pitch values and of the amplitudes of each harmonic are measured at t_i. Then, these differences are weighted and propagated left and right from t_i. [Footnote 2: This was used in a previously reported HNM version for speech synthesis [25].] The number of frames used in the interpolation process depends on the variance of the number of harmonics and the size, in frames, of the basic units (e.g. phoneme) across the concatenation point. This simple linear interpolation of the spectral envelopes makes formant discontinuities ....
Y. Stylianou, T. Dutoit, and J. Schroeter, "Diphone Concatenation using a Harmonic plus Noise Model of Speech," Proc. EUROSPEECH, pp. 613--616, 1997.
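The smoothing step described in the last excerpt (measure the parameter difference at t_i, then weight it and propagate it left and right) might look roughly like the Python sketch below. It is only an illustration under stated assumptions: `smooth_boundary` is a hypothetical name, the difference is split equally between the two sides, the weights fall off linearly, and the number of frames on each side is passed in directly instead of being derived from the variance of the number of harmonics and the unit sizes as the excerpt describes.

```python
def smooth_boundary(left, right, n_left, n_right):
    """Linearly spread a parameter mismatch across a concatenation point.

    Hypothetical helper for illustration only.
    left, right     : per-frame values of one parameter (e.g. pitch or one
                      harmonic amplitude) before and after the boundary;
                      both lists are assumed non-empty
    n_left, n_right : number of frames to modify on each side (assumed
                      given here, not computed as in the cited work)
    """
    n_left = min(n_left, len(left))
    n_right = min(n_right, len(right))

    # Mismatch measured at the concatenation point.
    diff = right[0] - left[-1]

    out_left = list(left)
    out_right = list(right)

    # Each side absorbs half of the difference at the boundary; the
    # correction decays linearly to zero further away from it.
    for j in range(n_left):
        weight = 0.5 * (n_left - j) / n_left
        out_left[len(left) - 1 - j] += weight * diff
    for j in range(n_right):
        weight = 0.5 * (n_right - j) / n_right
        out_right[j] -= weight * diff

    return out_left, out_right
```

With `left = [100, 100, 100, 100]`, `right = [120, 120, 120, 120]` and two frames per side, the two units meet at 110 at the boundary, and frames farther than two positions from it are left unchanged.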