Prosody Modeling with Soft Templates
, 2001
Cited by 20 (4 self)
This paper describes a novel prosody generation model. We intend it to broadly support many linguistic theories and multiple languages, for the model imposes no restriction on accent categories and shapes. This capability is crucial to the next generation of Text-to-Speech systems, which will need to synthesize intonation variations for different speech acts, emotions, and styles of speech. The system supports mark-up tags that are mathematically defined and generate f0 deterministically. Underlying the tags is an articulatory model of accent interaction that balances physiological and communication constraints. We specify the model by way of an algorithm for calculating the pitch, and by way of examples. The model allows localized, linguistically reasonable tags, and is suitable for a data-driven fitting process.
Structured speech modeling
- IEEE Transactions on Audio, Speech and Language Processing (Special Issue on Rich Transcription)
, 2006
Cited by 19 (11 self)
Modeling the dynamic structure of speech is a novel paradigm in speech recognition research within the generative modeling framework, and it offers the potential to overcome limitations of the current hidden Markov modeling approach. Analogous to structured language models, where syntactic structure is exploited to represent long-distance relationships among words [5], the structured speech model described in this paper makes use of the dynamic structure in the hidden vocal tract resonance space to characterize long-span contextual influence among phonetic units. A general overview is provided first on hierarchically classified types of dynamic speech models in the literature. A detailed account is then given for a specific model type called the hidden trajectory model, and we describe detailed steps of model construction and the parameter estimation algorithms. We show how the use of resonance target parameters and their temporal filtering enables joint modeling of long-span coarticulation and phonetic reduction effects. Experiments on phonetic recognition evaluation demonstrate superior recognizer performance over a modern hidden Markov model-based system. Error analysis shows that the greatest performance gain occurs within the sonorant speech class. Index Terms: hidden dynamics, hidden trajectory, long-span modeling, maximum likelihood, nonlinear prediction, parameter learning, structured modeling, vocal tract resonance.
When cues collide: Use of stress and statistical cues to word boundaries by 7- to 9-month-old infants
- Developmental Psychology
, 2003
Cited by 19 (1 self)
Prior research suggests that stress cues are particularly important for English-hearing infants' detection of word boundaries. It is unclear, though, how infants learn to attend to stress as a cue to word segmentation. This series of experiments was designed to explore infants' attention to conflicting cues at different ages. Experiment 1 replicated previous findings: When stress and statistical cues indicated different word boundaries, 9-month-old infants used syllable stress as a cue to segmentation while ignoring statistical cues. However, in Experiment 2, 7-month-old infants attended more to statistical cues than to stress cues. These results raise the possibility that infants use their statistical learning abilities to locate words in speech and use those words to discover the regular pattern of stress cues in English. Infants at different ages may deploy different segmentation strategies as a function of their current linguistic experience. To achieve mastery of their native language, infants must identify and learn words. Identifying words in an unfamiliar language is no simple task. Unlike the white spaces that mark the boundaries between words in a written text, speakers do not consistently place silent pauses between words when speaking (e.g., Cole & Jakimik,
Landmark-Based Speech Recognition: Report of the 2004 Johns Hopkins Summer Workshop
, 2005
Hierarchical structure and word strength prediction of Mandarin prosody
- International Journal of Speech Technology
, 2003
Cited by 13 (9 self)
We use Stem-ML to build an automatic learning system for Mandarin prosody that allows us to make quantitative measurements of prosodic strengths. Stem-ML is a phenomenological model of the muscle dynamics and planning process that controls the tension of the vocal folds. Because Stem-ML describes the interactions between nearby tones or accents, we were able to use a highly constrained model with only one accent template for each lexical tone category, and a single prosodic strength per word. The model accurately reproduces the intonation of the speaker, capturing 87% of the variance of. The result reveals strong alternating metrical patterns in words, and shows that the speaker uses word strength to mark a hierarchy of boundaries.
A Syllable, Articulatory-Feature, and Stress-Accent Model of Speech Recognition
, 2002
Cited by 11 (4 self)
Current-generation automatic speech recognition (ASR) systems assume that words are readily decomposable into constituent phonetic components ("phonemes"). A detailed linguistic dissection of state-of-the-art speech recognition systems indicates that the conventional phonemic "beads-on-a-string" approach is of limited utility, particularly with respect to informal, conversational material. The study shows that there is a significant gap between the observed data and the pronunciation models of current ASR systems. It also shows that many important factors affecting recognition performance are not modeled explicitly in these systems.
Modeling language evolution
- Foundations of Computational Mathematics
, 2004
Cited by 11 (3 self)
A purpose of this paper is to understand the evolution of the languages used by the agents of a society. We focus on language features in which convexity plays a central role. In our model a language is a function from a set X of objects or meanings to a
The time course of spoken word learning and recognition: Studies with artificial lexicons
- Journal of Experimental Psychology: General
, 2003
Cited by 10 (5 self)
The time course of spoken word recognition depends largely on the frequencies of a word and its competitors, or neighbors (similar-sounding words). However, variability in natural lexicons makes systematic analysis of frequency and neighbor similarity difficult. Artificial lexicons were used to achieve precise control over word frequency and phonological similarity. Eye tracking provided time course measures of lexical activation and competition (during spoken instructions to perform visually guided tasks) both during and after word learning, as a function of word frequency, neighbor type, and neighbor frequency. Apparent shifts from holistic to incremental competitor effects were observed in adults and neural network simulations, suggesting such shifts reflect general properties of learning rather than changes in the nature of lexical representations. Current models of spoken word recognition share a set of core assumptions that correspond to what Marslen-Wilson (1993) called the macrostructure of spoken word recognition: As speech is heard, multiple lexical candidates are activated and compete for recognition with strengths proportional to their similarity with the input and their prior probabilities (frequencies of occurrence).

