| Klatt, D. H. Review of Text-to-Speech Conversion for English. Journal of the Acoustical Society of America, 82, 3 (1987), 737-793. |
....on ANNs or HMMs. While the first model type requires explicit rules formulated by an expert, the other three types extract their knowledge from phone duration distributions calculated from large spoken language resources. Much work in the field of constructing duration models was done by Klatt [2], Kohler [3] and van Santen [4]. The goal of any duration model to generate natural sounding timing cannot in any practical sense be achieved because durational phenomena are too complex (van Santen 1993 [1, p. 1398]). Five years later he writes: there is a sizable amount of durational ....
Dennis H. Klatt, "Review of text-to-speech conversion for English," J. of the Acoustical Society of America, vol. 82, no. 3, pp. 737--793, 1987.
....form the test set for synthesis evaluations. 153 6.2 Makeup of automatically acquired intonation classes. 159 6.3 Recognition error analysis for different synthesis configurations. 168 6.4 Word error rates for speech synthesized with rescoring. 170 Introduction Speech synthesis [71] is an automatic encoding process carried out by machine through which symbols conveying linguistic information are converted into an acoustic pressure waveform. The speech message may originate from a representation of the meaning to be communicated. The process of language generation produces a ....
....Now, the machine can then retrieve information based on the query and formulate a reply via language generation. The symbolic response is then transformed back to the speech domain by the process of speech synthesis described above. 1.1 Background Although speech synthesis has had a long history [71, 87], progress is still being made [43, 152] and recent attention in the field has been primarily focused on concatenating real human speech selected from a large corpus or inventory. The method of concatenation [107] is not a new idea and has been applied to other parametric speech representations ....
[Article contains additional citation context not shown here]
D. Klatt, "Review of text to speech conversion for English," Journal of the Acoustical Society of America, vol. 82, pp. 737--793, 1987.
....of each phoneme, resulting in a sorted list of frequencies of the form ### ## , ## ## , ## ## , ## ## , ## ## , ## ## , ##, where # is a constant upper limit frequency beyond which no spectral modification is desired. When formants were not visible, we used formant frequencies from locus theory [3]. During synthesis, # has access to both the original features## ## and the desired features # # , using them to calculate a non uniform sampling of the original frequency locations. To obtain the final spectrum, the magnitude and unwrapped phases are interpolated at the new frequencies. ....
D. Klatt, "Review of text-to-speech conversion for English," J. Acoust. Soc. Am., vol. 82, no. 3, pp. 737--793, September 1987.
....the power to speak and listen can create a user friendly, hands free and eyes free environment for the user, and the speech medium can provide an efficient and economical mode of transmission. Great strides have been made in many areas of speech research over the past few decades. Speech synthesizers [41] have achieved a reasonable degree of clarity and naturalness, and are striving to cover unlimited vocabularies. Speech recognizers are now capable of speaker independent, large vocabulary, continuous speech recognition. The speech input may either be read or spontaneous. 1 Vocabulary sizes can ....
....attempts to capture letter sound regularities for the development of pronunciation and spelling systems. 1.4 Previous Work 1.4.1 Letter to Sound Generation A myriad of approaches have been applied to the problem of letter to sound generation. Excellent reviews can be found in [18], [29] and [41]. The various approaches have given rise to a wide range of letter to sound generation accuracies. Many of these accuracies are based on different corpora, and some corpora may be more difficult than others. Furthermore, certain systems are evaluated by human subjects, while others have their ....
Klatt, D., "Review of Text-to-speech Conversion for English," JASA 82 (3), Acoustical Society of America, pp. 737-793, 1987.
....of intelligibility, they typically sound unnatural. The process of deriving these rules is not only labor intensive but also difficult to generalize to a new language, a new voice, or a new speech style. For prosody modeling, most TTS systems use linguistic rules to define the prosody parameters [5,11]. Only limited natural language processing is generally used prior to prosody parameter generation. These rule based prosody models tend to sound robotic. Moreover, while these rules may have been derived from speech of a donor speaker, the resulting synthetic prosody typically does not resemble ....
Klatt D. "Review of text-to-speech conversion for English". Journal of the Acoustical Society of America, 82(3):737-793, 1987.
....applications of HMMs include automatic speech segmentation [100, 55, 38] and smoothing waveforms at the concatenation points [70]. 2.4 Unit Selection 2.4.1 Synthesis Unit Choosing the inventory of units is a subject of ongoing research. Diphone based systems have been offered for many years [51]. A diphone database contains the transitions between all pairs of phones that can exist in a given language. While English has approximately 50 phones, the total number of diphones ranges between 1500 and 2000, as some combinations never occur. Diphone based systems can produce very intelligible ....
D. Klatt. Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82(3):737-793, 1987.
....Zero or no stress (NS) is assigned to all remaining syllables. Syllable stress is extremely useful in speech processing. Most pioneering work in utilizing stress has been in the area of text to speech synthesis, where the need for producing intelligible and natural sounding speech is paramount [5]. In the areas of speech recognition and understanding, in spite of the potential benefits from prosodic information, its involvement up until now has been limited [1]. Syllable stress can be useful in facilitating lexical access in isolated word recognition systems. It is estimated that 75 ....
Klatt, D., "Review of text-to-speech conversion for English", J. Acoust. Soc. America, vol.82, no.3, pp.737-793, 1987.
....of Portuguese. With the exception of lexical stress assignment, the linguist and phonetic module was built using a rule compiler combined with a set of auxiliary functions written in the C language. The use of a rule compiler has the advantage of imposing a more structured rule definition [6] and enabling the system development by researchers with less programming skills. SCYLA, Speech Compiler for Your LAnguage, the rule compiler developed by CSELT [7] was selected because of three basic features of its multi level structures, allowing each procedure to access simultaneously all the ....
Klatt, D. H. (1987), "Review of Text-to-Speech Conversion for English", Journal of the Acoustical Society of America, 82(3), 737-793.
....Angus B. Grieve Smith Linguistics Department, Humanities 526, The University of New Mexico, Albuquerque, NM 87131 USA grvsmth unm.edu Abstract. Development of sign synthesis (also known as text to sign) can benefit from studying the history of its older cousin, speech synthesis. As Klatt [1] outlines the basic architecture of a speech synthesis application, I will discuss the architecture of a sign synthesis application and mention some of the applications and prototypes currently available. I will focus on SignSynth, a CGI based articulatory sign synthesis prototype I am ....
....boundaries [8]. 3. Sign Synthesis Architecture Sign synthesis is essentially the same task as speech synthesis; the difference is the form of the output. The architecture of a sign synthesis application is thus almost identical to a speech synthesis application. 3.1 Basic Architecture Klatt [1] describes the basic architecture of a speech synthesis application: Input text is acted on by some analysis routines that produce an abstract underlying linguistic representation. This representation is fed into synthesis routines that produce output speech. The architecture is summarized in ....
[Article contains additional citation context not shown here]
Klatt, D.: Review of Text-to-Speech Conversion for English. Journal of the Acoustical Society of America 82 (1987)
....with millions of names and acronyms. Moreover, in order to sound natural, the intonation of the sentences must be appropriately generated. The development of TTS synthesis can be traced back to the 1930s when Dudley's Voder, developed by Bell Laboratories, was demonstrated at the World's Fair [18]. Taking advantage of increasing computation power and storage technology, TTS researchers have been able to generate high quality commercial multilingual text to speech systems, although the quality is inferior to human speech for general purpose applications. The basic TTS components are shown ....
Klatt, D., "Review of Text-to-Speech Conversion for English," Journal of Acoustical Society of America, 1987, 82, pp. 737-793.
....interaction with the user. Such actions usually include responding to a user query, asking for additional information, requesting clarification, or simply prompting the user to speak, etc. The importance of prosody to the naturalness and intelligibility of speech is evident in speech synthesis (Klatt 1987). It is not surprising that much prosodic modeling work has been carried out on the prediction side for such applications. In a typical speech synthesis system, some linguistic analysis is usually performed, and prosodic tags (such as phrase boundaries, pitch accents, boundary tones, lexical ....
Klatt, D. H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America 82 (3), 737--793.
.... production component requires more concentration than natural speech communication between humans [53]. It has been proposed by many researchers that one important area in which the intelligibility and naturalness of synthesized speech can be improved is the prosody of the synthesized speech [28, 32, 46, 53]. Prosody is an important component of human speech that helps carry important semantic, syntactic, and discourse information of the utterance. Prosody is found in the acoustic characteristics of human speech which include pauses, fundamental frequency (F0) contours, and energy of the sounds. ....
....pre recorded waveforms or canned speech, and the synthesis of an acoustic waveform from unrestricted text. Both methods have a relatively long and varied history (see [28] for a more complete account of the history of speech synthesis). The first method is currently used in many telephony applications, where a caller hears prerecorded utterances such as "The number you requested, 555 1212, can be automatically dialed for an additional 35 cents." in response to a request for information. Pre recorded speech can be natural sounding and effective in many applications. This method ....
[Article contains additional citation context not shown here]
Klatt, D. (1987). "Review of Text-to-Speech conversion for English." Journal of the Acoustical Society of America, 82(3).
....a general sense of the history of the field. To this end, we present in this section a brief overview of work on speech synthesis and text analysis for text to speech synthesis. Complete reviews are found elsewhere in published materials, from which much of the current material is drawn, including [Klatt, 1987, Allen, 1992, Olive, 1997]. 3.1.1 The Synthesis of Speech Sounds The history of speech synthesis can be traced to the late 18th century and the first attempts by Kratzenstein and von Kempelen to build mechanical devices to mimic the sounds produced by the human vocal apparatus. These ....
....of speech are collectively referred to as prosody. Aspects of prosody in speech production have been investigated in various phonetic and linguistic traditions for centuries, but their investigation in the context of speech synthesis is, of course, much more recent. For example, according to [Klatt, 1987], the first implemented algorithm for determining a pitch contour was done by Ignatius Mattingly in 1966, and was incorporated into the Holmes rule based synthesizer. Intonation patterns were constructed using three basic intonational tunes (falling, rising, and fall-rise) which were aligned ....
[Article contains additional citation context not shown here]
Klatt, D. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82:737--793.
.... be found in the lexicon are usually transcribed by rule, but there are other possibilities as well, for example the analogy strategy, where a viable pronunciation for a previously unseen word is created by using the knowledge of similar letter patterns in words the system knows how to pronounce [Klatt, 1987]. While lexicon based transcription methods usually need so called Letter to Sound (LTS) rules to fall back upon, rule based methods have to keep an exception lexicon with words whose pronunciations diverge from the language specific rules, for example loan words. There are expert rule based ....
Klatt, D. H., (1987), "Review of text-to-speech conversion for English" in Journal of Acoustical Society of America, vol. 82, no. 3, pp. 737-793.
....example. The models are also often relatively easy to optimize, in some useful sense, and can be rapidly retrained on new databases to model new speakers or new languages. Most current speech synthesis systems, however, whether using rule based formant synthesis (Klatt, 1982; Allen, Hunnicutt & Klatt, 1987; Hallahan, 1996) or more recently concatenation synthesis (Olive, 1990; Sorin, 1994) generally use rules and/or concatenation units generated by hand. The research described in this paper was conducted to determine whether R. E. Donovan is now at the IBM T. J. Watson Research Center, PO Box ....
....Figure 1. A synthetic speech waveform (a) and wideband spectrograms (b, c) of the sentence fragment "When a sailor in a". The synthetic speech (a, b) was produced using a system trained on the M2 database. The natural speech (c) was spoken by the speaker used in the M2 database. (Klatt, 1987). Clustering questions did exist in the question list which could have split the vowel nodes by asking about the voicing of the following plosive, but they were often not used during tree building. The problem was perhaps that, with the acoustic analysis used, the acoustic difference between such ....
Klatt, D. H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82, 737--793.
....from spoken speech. 3. Selection of one (or a few) good unit instance when many are available in the corpus. The synthesis unit needs to be scaleable and reflect spectral variations of different allophones, so one can build the optimal TTS systems under various resource requirements. Diphones [5], which are the most popular synthesis unit employed today, however, fail to meet both criteria. Whistler uses the decision tree clustered phone based unit [9], [4], [7] as the synthesis unit. The decision tree clustered phone based unit not only can effectively model contextual variation, but also ....
....unit. In Section 3 we then discuss how to segment the speech recording into units. In Section 4 we describe the process of automatic unit selection, including multiple instance case. Finally we summarize our major findings and outline our future work. 2. SYNTHESIS UNIT 2.1 Diphone Diphone [5][12], which contains the transitions between two phones, has been chosen as the synthesis unit for concatenative synthesizers. There are about 1500 to 2000 diphones in English, and the diphone mapping for a phoneme string is straightforward. The word HELLO hh ah l ow can be mapped into the ....
Klatt D. "Review of text-to-speech conversion for English". Journal of the Acoustical Society of America, 82(3):737-793, 1987.
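The diphone mapping mentioned in the excerpt above can be illustrated with a minimal sketch. The excerpt's own expansion of HELLO is cut off, so the output format here is an assumption: the function name `phones_to_diphones`, the `sil` boundary label, and the `a-b` naming of units are all hypothetical choices, not taken from the cited paper.

```python
def phones_to_diphones(phones, boundary="sil"):
    """Pair each phone with its successor to name the diphone units.

    The utterance is padded with a boundary label (assumed: "sil") so that
    the transitions into the first phone and out of the last one are covered.
    """
    padded = [boundary, *phones, boundary]
    # zip() of the list with itself shifted by one yields adjacent pairs.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Phone string for HELLO as given in the excerpt.
hello_diphones = phones_to_diphones(["hh", "ah", "l", "ow"])
print(hello_diphones)
```

A string of n phones yields n + 1 units under this padding convention, which is why the mapping is described as straightforward.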
....recognition, speech synthesis. I. INTRODUCTION A COMMON problem in speech processing is the conversion of a written language to a set of phonetic symbols. Such algorithms are often called letter to sound rules in English, and are commonly a core component of a text to speech synthesis system [1]. More recently, as interest in large vocabulary speech recognition has grown, and speech database projects have become more ambitious, such algorithms Manuscript received November 15, 1993; revised March 23, 1998. This work was performed while the authors were with Texas Instruments, Tsukuba, ....
D. H. Klatt, "Review of text-to-speech conversion for English," J. Acoust. Soc. Amer., vol. 82, pp. 737--793, Sept. 1987.
.... reduction and coarticulation show that phonemes in natural speech do not reach the same acoustic targets in all circumstances [14], [15], [16], [17]. In formant synthesis, researchers have specified rules to mimic vowel reduction, for example by manipulating formant targets or transition slopes [18], [19], [20]. Unfortunately, these rules are not easily applied in the concatenative synthesis framework because they require tracking of formant trajectories in natural speech and control over individual formant frequencies and bandwidths during synthesis. TO APPEAR IN IEEE TRANSACTIONS ON AUDIO ....
D. H. Klatt, "Review of text-to-speech conversion for English," Journal of the Acoustical Society of America, vol. 82, no. 3, pp. 737-793, Sept. 1987.
....according to which there is a lexical route for the pronunciation of known words and a parallel route utilising abstract letter to sound rules for the pronunciation of unknown, or novel, words (Fig. 1). In the case of letter to sound conversion within a TTS system, the standard approach (e.g. Klatt, 1987, Fig. 30, p. 768) also has two routes. It utilises a pronouncing dictionary for known words and a set of general purpose, context dependent translation rules (e.g. Ainsworth, 1973; Elovitz et al., 1976; Hunnicutt, 1976) which is invoked if the input word is not in the system's dictionary. Unlike ....
....to the rules can either be embedded in the rule set (i.e. treated as highly specialised rules) or segregated out and placed in a dictionary, there is a real problem in assessing the rules alone, in the absence of a dictionary, i.e. in deciding where the division between the two lies. Klatt (1987, p. 772) states: A moderate sized exceptions dictionary can hide the deficiencies of a weak set of letter to sound rules, but at a high cost in terms of storage requirements. According to Pols (1989, p. 60): There is very little experience in evaluating the text specific part (text ....
Klatt, D.H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82, 737--793.
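The two-route arrangement described in the excerpts above (a pronouncing dictionary for known words, with letter to sound rules invoked only when the word is missing) can be sketched as follows. This is an illustrative toy, not the scheme from any cited system: `TOY_LEXICON`, `TOY_RULES`, and `pronounce` are hypothetical names, and the rules are context-free single-letter mappings, whereas the rule sets cited above are context dependent.

```python
# Hypothetical exceptions dictionary: whole words with stored pronunciations.
TOY_LEXICON = {
    "hello": ["hh", "ah", "l", "ow"],
}

# Hypothetical fallback rules: one phone per letter, with no context.
TOY_RULES = {
    "a": "ae", "b": "b", "d": "d", "e": "eh", "g": "g",
    "i": "ih", "n": "n", "o": "aa", "s": "s", "t": "t",
}


def pronounce(word, lexicon=TOY_LEXICON, rules=TOY_RULES):
    """Return (phones, route), trying the dictionary before the rules."""
    key = word.lower()
    if key in lexicon:
        # Lexical route: the word is known, so the rules are never consulted.
        return list(lexicon[key]), "lexicon"
    # Rule route: translate letter by letter, skipping letters with no rule.
    return [rules[ch] for ch in key if ch in rules], "rules"


print(pronounce("Hello"))
print(pronounce("bat"))
```

Because the dictionary is checked first, a larger dictionary hides more rule errors, which is the evaluation difficulty the excerpt attributes to Klatt: the measured accuracy depends on where the line between dictionary and rules is drawn.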
....utterance must be produced in a way that fits with the surrounding discourse context: following from the preceding utterance and providing the right base for the subsequent utterance. A fundamental problem in speech synthesis is the absence of algorithms for automatic intonational variation [60]. This lack stems not from the inability of synthesis systems to produce natural sounding intonation; systems such as DECtalk and the AT&T synthesizer can be hand tweaked to sound quite natural. However, no system currently represents the discourse level information necessary to assign ....
Dennis H. Klatt. Review of text-to-speech conversion for English. JASA, 82(3):737--793, 1987.
....devices, hand held computers, and automotive systems. While speech synthesizers can add value to portable products, their usage must not result in significantly higher product costs. Concatenation and synthesis by rule have been the two traditional techniques used for speech synthesis (Klatt, 1987). Concatenative systems store patterns produced by an analysis of speech, diphones or demisyllables, and concatenate these stored patterns, adjusting their duration and smoothing transitions to produce speech parameters corresponding to the phonetic representation. One problem with concatenative ....
Klatt, D.H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America 82, no. 3 (September):737-93.
....per se, but also its distance to the letter whose pronunciation is under consideration. This paper demonstrates how this assumption, combined with a simple decision tree building algorithm can produce convincing results. A necessary component in a text to speech system is also stress assignment [8]. Since this is even more dependent on issues other than local letter context [2], we have decided to concentrate only on grapheme to phoneme conversion in this work. In a working text to speech system, it might be necessary to construct or at least complete the set of text to phonemes rules by ....
D. H. Klatt. Review of text-to-speech conversion for English. Journal of Acoustical Society of America, 82(3):737--793, September 1987.
.... example, segmental duration depends on at least six factors: segmental identity, some characterization of the identities of the surrounding segments, word stress, sentence accent, position in the syllable, position of the syllable in the word, and position of the word in the phrase and utterance [2, 7, 16]. For intonation, at least that many factors are relevant. As a result, coverage indices of even large training corpora are very low, which means that for prosodic modeling to work we must use models with strong generalization properties, and avoid list like approaches altogether. 5.2. ....
D.H. Klatt. Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82(3):737--793, 1987.
.... phonology, speech is represented by concatenating elements of a limited alphabet [1]. In mapping phonemes into physical units of sounds, linguists recognize a second intermediate level of representation called allophone and the allophonic rule to capture prosodic and contextual variations [2]. The output of this model is a systematic phonetic transcription. For example, an utterance of the word peep is represented by sequences of the hypothesized segmental phonemes [p], [i] and [p]. In addition, there is a contextual variation in the pronunciation of the phoneme [p] which is ....
....applying contextual rules that first adjust feature values 6 7 to reflect contextual variability and then translate the features into successive target values for production parameters in time. This process transforms the phonological units of contrast (phonemes) into physical units of sounds [2]. This traditional model of allophonic segments has been widely used in many machine recognition systems [8, 9]. However, research in phonetics and phonology over the last few decades has shown that such a traditional phonetic representation has problems capturing the complexity behind actual ....
D. H. Klatt, "Review of text-to-speech conversion for English," Journal of the Acoustical Society of America, vol. 82, pp. 737--793, September 1987.
....prosodic cues may provide valuable information for computational models with limited semantic knowledge, even though the cues may be only redundant information for human listeners with a detailed knowledge of the world. In addition, prosody is a limiting factor in speech synthesis applications [63]. There are three major reasons for the problems of current models: first, the hand crafting of language understanding systems leads to a competence based model rather than a performance based model. Second, we do not understand well how to treat various phenomena as information, rather than ....
.... in the human vocal apparatus, models of articulation for synthesizing phonetic segments, theories of the relationship between prosody and syntax semantics for predicting abstract prosodic patterns, and models of intonation and duration for interpreting those prosodic patterns acoustically [63, 24]. 2. Computational Models of Variability. Explicit models of variability are needed in synthesis to avoid monotony, an issue both for synthesis of long monologues and long human computer interactive sessions. In addition, models that can account for variability are more likely to also be useful ....
D. H. Klatt. Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82(3):737--793, 1987.
....prosodic cues may provide valuable information for computational models with limited semantic knowledge, even though the cues may be only redundant information for human listeners with a detailed knowledge of the world. In addition, prosody is a limiting factor in speech synthesis applications [62]. Another issue that deserves further investigation is turn taking, namely how conversational participants signal that they want to take the floor, that they are ready to release the floor, that they require clarification or that they have understood. Research on turn taking dynamics and ....
.... in the human vocal apparatus, models of articulation for synthesizing phonetic segments, theories of the relationship between prosody and syntax semantics for predicting abstract prosodic patterns, and models of intonation and duration for interpreting those prosodic patterns acoustically [62, 26]. 2. Computational models of variability. Explicit models of variability are needed in synthesis to avoid monotony, an issue both for synthesis of long monologues and long human computer interactive sessions. In addition, models that can account for variability are more likely to also be useful ....
D. H. Klatt. Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82(3):737--793, 1987.
....module followed by the phonetic module. The proposed model thus appears to capture most of the grammatical information needed to generate F0. 1. Introduction Generating acceptable prosody is currently one of the most challenging tasks in the development of text to speech synthesis systems (Klatt, 1987; Collier, 1991). High quality prosody is indeed essential, both for comprehension (Silverman, 1993) and for acceptable synthesis, especially when long read texts are involved. Many studies have shown that although semantic and pragmatic factors enter into play, syntax is a major determining ....
Klatt, D. H. (1987). Review of text-to-speech conversion for English, Journal of the Acoustical Society of America, 82, 3, 737-793.
....Figure 1. Plot of F2 measured at consonant release against F2 at the steady state, or stationary point, of the vowel (horizontal axis: F2 vowel, Hz), with regression line ($r^2 = 0.97$). Since the variation is linear, it can be described by a simple equation of the form shown in (8) (Klatt 1987): (8) $F2(C) = k_1 (F2(V) - L) + L$, where $L$ is the target F2, or F2 locus, for the consonant, and $k_1$ depends on the consonant and the style and rate of speech, and determines the slope of the line in a plot like fig. 1. The interpretation of (8) is that there is a target F2 value, or locus, for ....
....the analysis of contextual neutralization, it is necessary for the number of contrasting vowels to vary according to context. So, following the model outlined here, a constraint favouring maximizing the number of contrasts should be added to these models so that the number of contrasts in a given context is a result of the trade off between this constraint and the cost in increased effort .... (Footnote 4: Klatt (1987) and Sussman et al. (1991) argue that velars actually have two F2 loci: a high locus before front vowels, and a low locus before back vowels. This would obviously exclude any fronting effect.)
Klatt, Dennis H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America 82, 737-793.
....consonant is linear, it can be expressed as an equation of the following form, known as a locus equation: (1) $F2_C = k_1 F2_V + c_1$, where $k_1$ and $c_1$, the slope and intercept of the line, are fixed for a given consonant. This relationship can be expressed in an alternative form, due to Klatt (1987): (2) $F2_C = k_1 (F2_V - F2_L) + F2_L$. The formulation in (2) is equivalent to (1) where $F2_L = c_1 / (1 - k_1)$. The interpretation of (2) is that there is a target or locus for F2 for a given consonant, $F2_L$, but the actual value of F2 at the consonant deviates towards the F2 value in the vowel ....
Klatt, Dennis H. (1987). `Review of text-to-speech conversion for English'. Journal of the Acoustical Society of America 82, 737-793.
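The equivalence between the two forms of the locus equation quoted in the excerpt above can be checked numerically with a short sketch. The function names and the slope and intercept values below are made up for illustration (they are not measurements from the cited papers); only the two formulas and the relation between the locus and the intercept come from the excerpt.

```python
def f2_onset_from_line(f2_vowel, k1, c1):
    """Form (1): F2 at the consonant as a line in the vowel's F2."""
    return k1 * f2_vowel + c1


def locus_from_line(k1, c1):
    """Locus implied by a fitted line: F2_L = c1 / (1 - k1)."""
    if k1 == 1:
        raise ValueError("a slope of exactly 1 implies no finite locus")
    return c1 / (1 - k1)


def f2_onset_from_locus(f2_vowel, k1, f2_locus):
    """Form (2): the onset is the locus, pulled towards the vowel by k1."""
    return k1 * (f2_vowel - f2_locus) + f2_locus


# Illustrative values only (Hz for c1 and the vowel F2, k1 is unitless).
K1, C1 = 0.5, 900.0
locus = locus_from_line(K1, C1)

print(locus)
print(f2_onset_from_line(2200.0, K1, C1))
print(f2_onset_from_locus(2200.0, K1, locus))
```

With a slope of 0 the onset always sits at the locus, and as the slope approaches 1 the onset tracks the vowel, which matches the excerpt's reading of the slope as the degree of deviation towards the vowel's F2.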
....problems for the rule based approach. In practice, exceptions to the rules can either be embedded in the rule set (i.e. treated as highly specialised rules) or separated out and placed in the dictionary. For instance, Elovitz et al. include about 60 very common whole words as rules. As Klatt [10] states: A moderate sized exceptions dictionary can hide the deficiencies of a weak set of letter to sound rules, but at a high cost in terms of storage requirements. Hence, there is a real problem in deciding where the division between the two lies. Moreover, Bernstein and Nessly [9] write: ....
D. H. Klatt. Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82:737--793, 1987.
....problems for the rule based approach. In practice, exceptions to the rules can either be embedded in the rule set (i.e. treated as highly specialised rules) or separated out and placed in the dictionary. For instance, Elovitz et al. include about 60 very common whole words as rules. As Klatt [10] states: A moderate sized exceptions dictionary can hide the deficiencies of a weak set of letter to sound rules, but at a high cost in terms of storage requirements. Hence, there is a real problem in deciding where the division between the two lies. Moreover, Bernstein and Nessly [9] write: ....
D. H. Klatt. Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82:737--793, 1987.
....of minimal concatenative units, as suggested by monophonemic analyses of diphthongs and affricates, the phoneme is no longer the only candidate for underlying unit. Syllables, moras (demisyllables) and other units for which concatenation based speech synthesis models exist (for an overview, see Klatt 1987) can also be realized as a succession of freeze frames. Even models based on units of highly questionable psychological reality, such as diphones (units composed of the second half of one phone and the first half of the following phone) work reasonably well. The logic of our enterprise dictates ....
Klatt, Dennis H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America 82 3, 737--793.
.... English (even more). Various ameliorations have been brought to this simple framework, which either aimed at complexifying rule formalisms in various ways (see for example Divay, 1990 [3]) or at giving much more data (morphosyntactic decomposition, etc.) to the transcribing system (see Klatt, 1987 [5] for a review). It is commonly admitted (see for example Vitale, 1991 [6]) that the transcription of proper names remains a major weakness of most TTS systems. Various reasons have been proposed (difficulties with foreign names, etc.) which all eventually sum up that the hypothesis (1) is probably ....
Klatt, D.H. (1987) "Review of text-to-speech conversion for English". Journal of the Acoustical Society of America 82/3, pp 737--793.
....and incomplete algorithms are available to model this complex process. As a result, despite much progress in the field, completely natural speech synthesis is still an elusive goal. Research in speech synthesis has had a long history. Comprehensive reviews can be found in Klatt's 1987 JASA article [15], which covers work up to the late 1980s, and in Dutoit's 1997 book [6], which includes more recent work done in the last decade. Many of the earlier approaches to speech synthesis were based on knowledge engineered rules derived from linguistic theories and acoustic analyses. These systems were ....
....speech. The physical correlates are, respectively, the duration, energy, and fundamental frequency of the speech. The pattern or trajectory of these parameters as a function of time carries linguistically significant information and is the key to producing natural sounding speech. As observed in [15], intensity contributes to the perceived stress and syllabic structure in speech; duration affects the rhythm, stress, emphasis, and syntactic structure of the utterance; and the fundamental frequency conveys information about the intonation, stress, emphasis, gender, and emotional state of the ....
[Article contains additional citation context not shown here]
D. H. Klatt, "Review of text-to-speech conversion for English," Journal of the Acoustical Society of America, vol. 82, pp. 737--793, 1987.
....is most directly useful with a vocal tract model synthesizer. However, the construction of a functional vocal tract model synthesizer is not yet feasible. Not enough is known about what is intrinsic and necessary, and the computational costs of incorporating what is known are currently too high [14]. discourse events. Its descriptors describe only speech events. Secondly, because the perceptual parameters are explicitly declared, this model is well suited for testing the perceptual effects of affect in speech. Finally, current technology supports such a model. Most commercial synthesizers ....
....rise (H H or L H ) or a period to simulate finality (L L ) Pitch accents There is no direct DECtalk translation for the Generative Intonation pitch accents. The stresses recognized by Klattalk, and therefore by the DECtalk, are always excursions upwards from the original hat rise contour [14]. Thus only approximations of H accents H , H L and L H are possible. The rise and fall markings may be a rough approximation of L accents (L , L H, H L ) but their effects are too unpredictable to be of use. Consequently, pitch accent notation is significant only insofar as it ....
[Article contains additional citation context not shown here]
Dennis H. Klatt. Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82(3):737--793, Sept 1987.
No context found.
Klatt, D. H. Review of Text-to-Speech Conversion for English. Journal of the Acoustical Society of America, 82, 3 (1987), 737-793.
No context found.
D.H. Klatt. Review of text-to-speech conversion for English. J. Acoust. Soc. Am., Vol. 82, No. 3, pp. 737--793, 1987.
No context found.
Dennis H. Klatt. 1987. Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82(3):737--793, September.
No context found.
Dennis H. Klatt. Review of text-to-speech conversion for English. J. Acoust. Soc. Am., 82(3), 1987.
No context found.
D. Klatt, "Review of text-to-speech conversion for English," J. Acoust. Soc. Am., 82(3), 737--793, 1987.
No context found.
Klatt, D. H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America 82 (3), 737--793.
No context found.
D. H. Klatt (1987), "Review of text-to-speech conversion for English," J. Acoust. Soc. Amer. 82, pp. 737--793.
No context found.
Klatt, D., 1987. Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82 (3), 737-793.
No context found.
Dennis H. Klatt, "Review of text-to-speech conversion for English," JASA, vol. 82, no. 3, pp. 737--793, 1987.
No context found.
D. H. Klatt, "Review of text-to-speech conversion for English," Journal of the Acoustical Society of America, vol. 82(3), pp. 737--793, Sep. 1987.
No context found.
Klatt, D. "Review of text-to-speech conversion for English," JASA vol. 82, no. 3, pp. 737-793, 1987.
No context found.
Klatt D. "Review of text-to-speech conversion for English". Journal of the Acoustical Society of America, 82(3):737-793, 1987.
No context found.
D. H. Klatt. Review of text-to-speech conversion for English. J. Acoust. Soc. Am., 82(3):737--793, 1987.
No context found.
Klatt D.H.: Review of text-to-speech conversion for English, Journal of the Acoustical Society of America 82, 737--793, 1987.
No context found.
Klatt, D.H.: Review of text-to-speech conversion for English. Journal of the Acoustical Society of America 82, No. 3, 737-793 (1987).