Results 1 - 10 of 146
Insights Into Spoken Language Gleaned From Phonetic Transcription Of The Switchboard Corpus, 1996
"... Models of speech recognition (by both human and machine) have traditionally assumed the phoneme to serve as the fundamental unit of phonetic and phonological analysis. However, phoneme-centric models have failed to provide a convincing theoretical account of the process by which the brain extracts m ..."
Abstract
-
Cited by 105 (16 self)
- Add to MetaCart
(Show Context)
Models of speech recognition (by both human and machine) have traditionally assumed the phoneme to serve as the fundamental unit of phonetic and phonological analysis. However, phoneme-centric models have failed to provide a convincing theoretical account of the process by which the brain extracts meaning from the speech signal and have fared poorly in automatic recognition of natural, informal speech (e.g., the Switchboard corpus). Over the past five months the Switchboard Transcription Project has phonetically transcribed a portion of the Switchboard corpus in an effort to better understand the failure of phoneme-centric models for machine recognition of speech, as well as to provide a database through which to improve the performance of recognition systems focused on conversational dialogs. Transcription of spoken dialogs illustrates the pitfalls of a phoneme-based system. Many words are articulated in such a fashion as to either omit or significantly transform the phonetic properti...
Incorporating Information From Syllable-length Time Scales into Automatic Speech Recognition - In ICASSP, 1998
"... Incorporating the concept of the syllable into speech recognition may improve recognition accuracy through the integration of information over syllable-length time spans. Evidence from psychoacoustics and phonology suggests that humans use the syllable as a basic perceptual unit. Nonetheless, the ex ..."
Abstract
-
Cited by 66 (4 self)
- Add to MetaCart
Incorporating the concept of the syllable into speech recognition may improve recognition accuracy through the integration of information over syllable-length time spans. Evidence from psychoacoustics and phonology suggests that humans use the syllable as a basic perceptual unit. Nonetheless, the explicit use of such long-time-span units is comparatively unusual in automatic speech recognition systems for English. The work described in this thesis explored the utility of information collected over syllable-related time scales. The first approach involved integrating syllable segmentation information into the speech recognition process. The addition of acoustically based syllable onset estimates [184] resulted in a 10% relative reduction in word-error rate. The second approach began with developing four speech recognition systems based on long-time-span features and units, including modulation spectrogram features [80]. Error analysis suggested a strategy of system combination, which led to the implementation of methods that merged the outputs of syllable-based recognition systems with the phone-oriented baseline system at the frame level, the syllable level, and the whole-utterance level. These combined systems exhibited relative improvements of 20-40% over the baseline system for clean and reverberant speech test cases.
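The frame-level merging strategy lends itself to a compact illustration. The sketch below averages per-frame class posteriors from a syllable-based system and a phone-based baseline in the log domain; the function name, array shapes, and equal default weight are illustrative assumptions, not the thesis's actual implementation.

```python
# Hypothetical sketch of frame-level posterior combination; names and
# shapes are assumptions, not the thesis's code.
import numpy as np

def combine_frame_posteriors(post_syllable, post_phone, weight=0.5):
    """post_syllable, post_phone: (n_frames, n_classes) arrays of per-frame
    class posteriors (rows sum to 1). Returns the renormalized log-domain
    weighted average."""
    eps = 1e-10  # guard against log(0)
    log_mix = (weight * np.log(post_syllable + eps)
               + (1.0 - weight) * np.log(post_phone + eps))
    mix = np.exp(log_mix)
    return mix / mix.sum(axis=1, keepdims=True)
```

Log-domain averaging (a geometric mean at weight=0.5) is more conservative than linear averaging: a class scores well only when both systems support it.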
Multiresolution spectrotemporal analysis of complex sounds - J Acoust Soc Am
"... A computational model of auditory analysis is described that is inspired by psychoacoustical and neurophysiological findings in early and central stages of the auditory system. The model provides a unified multiresolution representation of the spectral and temporal features of sound likely critical ..."
Abstract
-
Cited by 65 (3 self)
- Add to MetaCart
A computational model of auditory analysis is described that is inspired by psychoacoustical and neurophysiological findings in the early and central stages of the auditory system. The model provides a unified multiresolution representation of the spectral and temporal features of sound likely critical in the perception of timbre. Several types of complex stimuli are used to demonstrate the spectrotemporal information extracted and represented by the model. Also outlined are several reconstruction algorithms to resynthesize the sound so as to evaluate the fidelity of the representation and the contribution of different features and cues to the sound percept. Simplified versions of this model's representations have already been used in a variety of applications, such as the assessment of speech intelligibility [Elhilali et al., 2003, Chi et al., 1999] and explaining the perception of monaural phase sensitivity [Carlyon and Shamma, 2002].
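To make the multiresolution idea concrete, here is a loose simplification: filtering a (log-frequency x time) spectrogram with a 2D Gabor-like kernel tuned to a temporal rate (Hz) and a spectral scale (cycles/octave). The separable kernel design, supports, and parameter values are all assumptions; this is not the authors' cortical model code.

```python
# Loose sketch of one rate/scale-tuned spectrotemporal filter; all
# parameters are illustrative assumptions.
import numpy as np
from scipy.signal import fftconvolve

def spectrotemporal_response(spec, rate_hz, scale_cpo,
                             frame_rate=100.0, bins_per_octave=24):
    """spec: (freq_bins, time_frames) auditory spectrogram. Returns the
    magnitude response to a single 2D Gabor-like filter."""
    t = np.arange(-0.25, 0.25, 1.0 / frame_rate)     # +/- 250 ms support
    f = np.arange(-1.0, 1.0, 1.0 / bins_per_octave)  # +/- 1 octave support
    h_t = np.cos(2 * np.pi * rate_hz * t) * np.exp(-(t * rate_hz) ** 2)
    h_f = np.cos(2 * np.pi * scale_cpo * f) * np.exp(-(f * scale_cpo) ** 2)
    kernel = np.outer(h_f, h_t)  # separable spectral x temporal kernel
    return np.abs(fftconvolve(spec, kernel, mode="same"))
```

A bank of such filters over a grid of rates and scales would yield a multiresolution representation in the spirit of the one described above.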
Should recognizers have ears? - Speech Communication, 1998
"... The paper discusses author’s experience with applying auditory knowledge to automatic recognition of speech. It indirectly argues against blind implementing of scattered accidental knowledge which may be irrelevant to a speech recognition task. It advances the notion that the reason for applying kno ..."
Abstract
-
Cited by 59 (4 self)
- Add to MetaCart
(Show Context)
The paper discusses the author's experience with applying auditory knowledge to automatic recognition of speech. It indirectly argues against blindly implementing scattered, accidental knowledge that may be irrelevant to a speech recognition task. It advances the notion that the reason for applying knowledge of human auditory perception in engineering applications should be the ability of perception to suppress some parts of the information in the speech message. Three properties of human speech perception are discussed in some detail: limited spectral resolution, use of information from roughly syllable-length segments, and the ability to alleviate unreliable cues. Overall, we are advocating selective use of auditory knowledge, optimized on real speech data. [Fig. I: A good hard working man. Fig. II: A foolish man?]
What HMMs can do, 2002
"... Since their inception over thirty years ago, hidden Markov models (HMMs) have have become the predominant methodology for automatic speech recognition (ASR) systems — today, most state-of-the-art speech systems are HMM-based. There have been a number of ways to explain HMMs and to list their capabil ..."
Abstract
-
Cited by 50 (5 self)
- Add to MetaCart
(Show Context)
Since their inception over thirty years ago, hidden Markov models (HMMs) have become the predominant methodology for automatic speech recognition (ASR) systems; today, most state-of-the-art speech systems are HMM-based. There have been a number of ways to explain HMMs and to list their capabilities, each having both advantages and disadvantages. In an effort to better understand what HMMs can do, this tutorial analyzes HMMs by exploring a novel way in which an HMM can be defined, namely in terms of random variables and conditional independence assumptions. We prefer this definition as it allows us to reason more thoroughly about the capabilities of HMMs. In particular, it is possible to deduce that there are, in theory at least, no limitations to the class of probability distributions representable by HMMs. This paper concludes that, in the search for a model to supersede the HMM for ASR, rather than trying to correct for HMM limitations in the general case, new models should be sought on the basis of their potential for better parsimony, computational requirements, and noise insensitivity.
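The random-variable definition is easy to state concretely: an HMM factors the joint distribution over hidden states q_1..q_T and observations x_1..x_T via the Markov assumption on states and the assumption that each observation depends only on its state. A minimal forward recursion under that factorization might look like the sketch below; the parameter layout is an assumption for illustration, not the tutorial's notation.

```python
# Minimal HMM forward algorithm under the two conditional-independence
# assumptions; the parameter layout is an illustrative assumption.
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """pi: (S,) initial state probabilities; A: (S, S) transitions with
    A[i, j] = p(q_t = j | q_{t-1} = i); B: (S, V) discrete emission
    probabilities; obs: sequence of symbol indices. Returns p(obs)."""
    alpha = pi * B[:, obs[0]]          # alpha_1(s) = p(x_1, q_1 = s)
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]  # propagate states, weight by emission
    return float(alpha.sum())
```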
Understanding Speech Understanding: Towards A Unified Theory Of Speech Perception, 1996
"... Ever since Helmholtz, the perceptual basis of speech has been associated with the energy distribution across frequency. However, there is now accumulating evidence that speech understanding does not require a detailed spectral portraiture of the signal. As a consequence, a new theoretical perspectiv ..."
Abstract
-
Cited by 45 (7 self)
- Add to MetaCart
(Show Context)
Ever since Helmholtz, the perceptual basis of speech has been associated with the energy distribution across frequency. However, there is now accumulating evidence that speech understanding does not require a detailed spectral portraiture of the signal. As a consequence, a new theoretical perspective, focused on time, is beginning to emerge. This framework emphasizes the temporal evolution of coarse spectral patterns as the primary carrier of information within the speech signal, and provides an efficient and effective means of shielding linguistic information against the potentially hostile forces of the natural soundscape, such as reverberation and background acoustic interference. The auditory system may extract this relational information through computation of the low-frequency modulation spectrum in the auditory cortex, and this representation provides a principled basis for segmentation of the speech signal into syllabic units. Because of the systematic relationship between the syllable and higher-level lexicogrammatical organization it is possible, in principle, to gain direct access to the lexicon and grammar through such an auditory analysis of speech.
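The low-frequency modulation spectrum invoked above can be sketched directly: extract the amplitude envelope of a band-limited portion of the signal and examine its energy at syllabic rates. Band edges, filter order, and the 20 Hz cutoff below are assumptions for illustration.

```python
# Hedged sketch of a low-frequency modulation spectrum; band edges and
# cutoffs are illustrative assumptions.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_spectrum(x, fs, band=(300.0, 2000.0)):
    """x: mono waveform at sample rate fs (Hz). Returns (freqs, power) for
    the amplitude-envelope spectrum of one analysis band, below 20 Hz."""
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    envelope = np.abs(hilbert(sosfiltfilt(sos, x)))
    envelope -= envelope.mean()  # remove the DC component
    power = np.abs(np.fft.rfft(envelope)) ** 2
    freqs = np.fft.rfftfreq(len(envelope), 1.0 / fs)
    keep = freqs <= 20.0  # syllabic-rate region
    return freqs[keep], power[keep]
```

For conversational speech the envelope spectrum typically peaks around 3-5 Hz, roughly the syllable rate, which is what makes it a plausible basis for syllabic segmentation.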
Speech perception at the interface of neurobiology and linguistics - Philos. Trans. R. Soc. Lond. B Biol. Sci., 2008
"... Speech perception consists of a set of computations that take continuously varying acoustic waveforms as input and generate discrete representations that make contact with the lexical representations stored in long-term memory as output. Because the perceptual objects that are recognized by the spe ..."
Abstract
-
Cited by 44 (1 self)
- Add to MetaCart
Speech perception consists of a set of computations that take continuously varying acoustic waveforms as input and generate discrete representations that make contact with the lexical representations stored in long-term memory as output. Because the perceptual objects that are recognized by speech perception enter into subsequent linguistic computation, the format that is used for lexical representation and processing fundamentally constrains the speech perceptual processes. Consequently, theories of speech perception must, at some level, be tightly linked to theories of lexical representation. Minimally, speech perception must yield representations that smoothly and rapidly interface with stored lexical items. Adopting the perspective of Marr, we argue and provide neurobiological and psychophysical evidence for the following research programme. First, at the implementational level, speech perception is a multi-time resolution process, with perceptual analyses occurring concurrently on at least two time scales (approx. 20-80 ms, approx. 150-300 ms), commensurate with (sub)segmental and syllabic analyses, respectively. Second, at the algorithmic level, we suggest that perception proceeds on the basis of internal forward models, or uses an 'analysis-by-synthesis' approach. Third, at the computational level (in the sense of Marr), the theory of lexical representation that we adopt is principally informed by phonological research and assumes that words are represented in the mental lexicon in terms of sequences of discrete segments composed of distinctive features. One important goal of the research programme is to develop linking hypotheses between putative neurobiological primitives (e.g. temporal primitives) and those primitives derived from linguistic inquiry, to arrive ultimately at a biologically sensible and theoretically satisfying model of representation and computation in speech.
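The implementational-level claim of concurrent analyses on two time scales can be illustrated with a dual-window short-time analysis; the specific window lengths below (roughly 25 ms and 200 ms, one from each quoted range) are assumptions for illustration.

```python
# Sketch of concurrent two-time-scale analysis; window lengths are
# illustrative assumptions within the ranges quoted in the abstract.
import numpy as np
from scipy.signal import stft

def dual_resolution_stft(x, fs):
    """Returns magnitude spectrograms of x at a (sub)segmental (~25 ms)
    and a syllabic (~200 ms) time scale."""
    n_short = int(0.025 * fs)  # fine temporal detail, coarse frequency
    n_long = int(0.200 * fs)   # coarse temporal detail, fine frequency
    _, _, S_short = stft(x, fs=fs, nperseg=n_short)
    _, _, S_long = stft(x, fs=fs, nperseg=n_long)
    return np.abs(S_short), np.abs(S_long)
```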
On The Origins Of Speech Intelligibility In The Real World - ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, 1997
"... Current-generation speech recognition systems seek to identify words via analysis of their underlying phonological constituents. Although this stratagem works well for carefully enunciated speech emanating from a pristine acoustic environment, it has fared less well for recognizing speech spoken und ..."
Abstract
-
Cited by 39 (12 self)
- Add to MetaCart
Current-generation speech recognition systems seek to identify words via analysis of their underlying phonological constituents. Although this stratagem works well for carefully enunciated speech emanating from a pristine acoustic environment, it has fared less well for recognizing speech spoken under more realistic conditions, such as (1) moderate to high levels of background noise, (2) moderately reverberant acoustic environments, and (3) spontaneous, informal conversation. Under such "real-world" conditions the acoustic properties of speech make it difficult to partition the acoustic stream into readily definable phonological units, thus rendering the process of word recognition highly vulnerable to departures from "canonical" patterns. Analysis of informal, spontaneous speech indicates that the stability of linguistic representation is more likely to reside at the syllabic and phrasal levels than at the phonological level. In consequence, attempts to represent words merely as sequences of ...
Intelligibility Of Speech With Filtered Time Trajectories Of Spectral Envelopes - Proceedings of the International Conference on Spoken Language Processing, 1996
"... The effect of filtering the time trajectories of spectral envelopes on speech intelligibility was investigated. Since LPC cepstrum forms the basis of many automatic speech recognition systems, we filtered time trajectories of LPC cepstrum of speech sounds, and the modified speech was reconstructed a ..."
Abstract
-
Cited by 38 (7 self)
- Add to MetaCart
(Show Context)
The effect of filtering the time trajectories of spectral envelopes on speech intelligibility was investigated. Since the LPC cepstrum forms the basis of many automatic speech recognition systems, we filtered the time trajectories of the LPC cepstrum of speech sounds, and the modified speech was reconstructed after the filtering. For processing, we applied low-pass, high-pass and band-pass filters. The accuracy results from perceptual experiments with Japanese syllables show that speech intelligibility is not severely impaired as long as the filtered spectral components have (1) a rate of change faster than 1 Hz when high-pass filtered, (2) a rate of change slower than 24 Hz when low-pass filtered, and (3) a rate of change between 1 and 16 Hz when band-pass filtered.
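The manipulation described maps naturally onto a small amount of code: band-pass filter the time trajectory of each cepstral coefficient, keeping modulation rates between roughly 1 and 16 Hz. The filter design and order below are assumptions, not the authors' exact processing chain (which also resynthesizes speech from the modified cepstra).

```python
# Hedged sketch of trajectory filtering; filter design is an assumption.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_cepstral_trajectories(cep, frame_rate=100.0,
                                   low_hz=1.0, high_hz=16.0):
    """cep: (n_frames, n_coeffs) cepstrum matrix sampled at frame_rate (Hz).
    Filters each coefficient's trajectory independently, zero-phase."""
    sos = butter(2, [low_hz, high_hz], btype="bandpass",
                 fs=frame_rate, output="sos")
    return sosfiltfilt(sos, cep, axis=0)  # filter along the time axis
```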
Temporal Properties of Spontaneous Speech -- A Syllable-Centric Perspective
"... Temporal properties associated with the speech signal are potentially important for understanding spoken language. Five hours of spontaneous American English dialogue material (from the SWITCHBOARD corpus) were hand-labeled and segmented at the phonetic-segment level; a fortyfive -minute subset was ..."
Abstract
-
Cited by 32 (6 self)
- Add to MetaCart
Temporal properties associated with the speech signal are potentially important for understanding spoken language. Five hours of spontaneous American English dialogue material (from the SWITCHBOARD corpus) were hand-labeled and segmented at the phonetic-segment level; a forty-five-minute subset was also manually annotated (at the syllabic level) with respect to stress accent. Statistical analysis of the corpus indicates that much of the temporal variation observed at the syllabic and phonetic-segment levels can be accounted for in terms of two basic parameters: (1) stress-accent pattern and (2) position of the segment within the syllable. Segments are generally longest in heavily accented syllables and shortest in syllables without accent. However, the magnitude of accent's impact on duration varies as a function of syllable position. The duration of the nucleus is heavily affected by accent level (heavily accented nuclei are, on average, twice as long as their unaccented counterparts), while the duration of the onset is also significantly affected, but to a lesser degree. In contrast, accent has relatively little impact on the duration of the coda. This pattern of durational variation is incommensurate with segmental models and instead implies the importance of syllable structure (and stress accent) for understanding spoken language.
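The grouped duration statistics reported above amount to a simple aggregation; a toy version follows, assuming a flat record layout and label set that are purely illustrative.

```python
# Toy aggregation of segment durations by accent level and syllable
# position; the record layout and labels are illustrative assumptions.
from collections import defaultdict

def mean_durations(segments):
    """segments: iterable of dicts with keys 'accent' ('none'|'light'|
    'heavy'), 'position' ('onset'|'nucleus'|'coda'), 'duration' (seconds).
    Returns {(accent, position): mean duration}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for seg in segments:
        key = (seg["accent"], seg["position"])
        sums[key] += seg["duration"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```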