Recent advances in the automatic recognition of audio-visual speech
- PROC. IEEE
, 2003
Cited by 64 (10 self)
Visual speech information from the speaker’s mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human-computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: first, the visual front end design, based on a cascade of linear image transforms of an appropriate video region-of-interest, and second, audio-visual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audio-visual speech asynchrony, and incorporating modality reliability estimates into the bimodal recognition process. We also briefly touch upon the issue of audio-visual adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, though less so for visually challenging environments and large-vocabulary tasks.
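The abstract above mentions decision fusion with modality reliability estimates but does not give a formula. As a rough illustration only, the sketch below assumes one widely used form of decision fusion: a weighted sum of per-class log-likelihoods from the audio and visual streams, with a single stream weight standing in for the reliability estimate. The function names, the labels "ba" and "ga", and all scores are hypothetical and are not taken from the paper.

```python
import math


def fuse_scores(audio_loglik, visual_loglik, audio_weight):
    """Weighted sum of per-class log-likelihoods (assumed fusion rule).

    audio_weight in [0, 1] stands in for an audio reliability estimate;
    the visual stream receives the complementary weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return {
        label: audio_weight * audio_loglik[label]
        + (1.0 - audio_weight) * visual_loglik[label]
        for label in audio_loglik
    }


def best_label(scores):
    """Return the label with the highest fused score."""
    return max(scores, key=scores.get)


# Hypothetical scores: the audio stream slightly prefers "ga",
# the visual stream strongly prefers "ba".
audio = {"ba": math.log(0.45), "ga": math.log(0.55)}
visual = {"ba": math.log(0.80), "ga": math.log(0.20)}

# A high audio weight models clean audio; a low one models noisy audio.
clean_choice = best_label(fuse_scores(audio, visual, audio_weight=0.9))
noisy_choice = best_label(fuse_scores(audio, visual, audio_weight=0.3))
print(clean_choice, noisy_choice)
```

With these made-up numbers the decision follows the audio stream when it is trusted and shifts to the visual stream when the audio weight is lowered, which is the qualitative behavior a reliability-weighted scheme aims for.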
Audio-visual automatic speech recognition: An overview
- Issues in Visual and Audio-visual Speech Processing
, 2004
Cited by 41 (0 self)
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium-vocabulary transaction processing tasks in relatively controlled environments. However, ASR performance has yet to reach the level required for speech to become a truly pervasive user interface. Indeed, even in “clean” acoustic environments, and for a variety of tasks, state-of-the-art ASR system
Photo-Realistic Talking-Heads from Image Samples
, 2000
Cited by 25 (3 self)
This paper describes a system for creating a photo-realistic model of the human head that can be animated and lip-synched from phonetic transcripts of text. Combined with a state-of-the-art text-to-speech synthesizer (TTS), it generates video animations of talking heads that closely resemble real people. To obtain a natural-looking head, we choose a "data-driven" approach. We record a talking person and apply image recognition to automatically extract bitmaps of facial parts. These bitmaps are normalized and parameterized before being entered into a database. For synthesis, the TTS provides the audio track, as well as the phonetic transcript from which trajectories in the space of parameterized bitmaps are computed for all facial parts. Sampling these trajectories and retrieving the corresponding bitmaps from the database produces animated facial parts. These facial parts are then projected and blended onto an image of the whole head using its pose information. This talking head model can produce new, never recorded speech of the person who was originally recorded. Talking-head animations of this type are useful as a front-end for agents and avatars in multimedia applications such as virtual operators, virtual announcers, help desks, and educational and expert systems.
Speaker Independent Audio-Visual Database For Bimodal ASR
- Proc. Europ. Tut. Work. Audio-Visual Speech Proc., Rhodes
, 1997
Cited by 23 (3 self)
This paper describes the audio-visual database collected at AT&T Labs-Research for the study of bimodal speech recognition. To date, this database consists of two multiple-speaker parts, namely isolated confusable words and connected letters, thus allowing the study of some popular and relatively simple speaker-independent audio-visual recognition tasks. In addition, a single-speaker connected digits database is collected to facilitate speedy development and testing of various algorithms. Intentionally, no lip markings are used on the subjects during data collection. Development of robust and speaker-independent algorithms for mouth location and lip contour extraction is thus necessary in order to obtain informative features about visual speech (visual front end). We describe our approach to this problem, and we report our automatic speech-reading and audio-visual speech recognition results on the single-speaker connected digits task.
Audio-visual and Multimodal Speech Systems
- In D. Gibbon (Ed.) Handbook of Standards and Resources for Spoken Language Systems - Supplement Volume
Cited by 12 (0 self)
Figure 13: Multimodal Design Space (adapted from [224]). Recoverable labels: Signal Level, Semantic Level.
... system in the design space is the pivotal center of its features. According to the characterization of an interaction along the two dimensions, fusion and use of modalities, four basic types of multimodal interactions can be distinguished: alternative, synergistic, exclusive, and concurrent multimodal interaction, as shown in Figure 13. Obviously, synergistic systems subsume the other three classes of multimodal systems. Therefore, architectural models of multimodal integration (as presented in the next subsection and in Section 9) are sufficient if they are able to model synergistic cooperation of modalities.
6.2.2 Fusion of Multimodal Input
Fusion of multimodal input events can occur on different levels, ranging from signal-level to semantic-level. Signal-level fusion (or lexical fusion [224]) performs the combination of multimodal input at the level of the input signal. Signal-level fusion has...
A Cascade Visual Front End for Speaker Independent Automatic Speechreading
- International Journal of Speech Technology
, 2001
Cited by 7 (4 self)
We propose a three-stage pixel-based visual front end for automatic speechreading (lipreading) that results in significantly improved recognition performance of spoken words or phonemes. The proposed algorithm is a cascade of three transforms applied on a three-dimensional video region-of-interest that contains the speaker's mouth area. The first stage is a typical image compression transform that achieves a high-energy, reduced-dimensionality representation of the video data. The second stage is a linear discriminant analysis based data projection, which is applied on a concatenation of a small number of consecutive image-transformed video data. The third stage is a data rotation by means of a maximum likelihood linear transform that optimizes the likelihood of the observed data under the assumption of their class-conditional multivariate normal distribution with diagonal covariance. We apply the algorithm to visual-only 52-class phonetic and 27-class visemic classification on a 162-subject, 8-hour long, large-vocabulary, continuous speech audio-visual database. We demonstrate significant classification accuracy gains by each added stage of the proposed algorithm, which, when combined, can reach up to 27% improvement. Overall, we achieve a 60% (49%) visual-only frame-level visemic classification accuracy with (without) use of test set viseme boundaries. In addition, we report improved audio-visual phonetic classification over the use of a single-stage image transform visual front end. Finally, we discuss preliminary speech recognition results.
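To make the front of this cascade concrete, here is a minimal sketch of the first stage and of the frame concatenation that feeds the second stage. The abstract only says "a typical image compression transform", so the 2-D DCT used here is an assumption, as are the function names, the 4x4 toy region-of-interest, and the choices of three retained coefficients and one frame of context. The linear discriminant analysis projection and the maximum likelihood linear transform are not implemented.

```python
import math


def dct2(block):
    """Unnormalized 2-D DCT-II of a square block (list of lists)."""
    n = len(block)
    return [
        [
            sum(
                block[x][y]
                * math.cos(math.pi * (x + 0.5) * u / n)
                * math.cos(math.pi * (y + 0.5) * v / n)
                for x in range(n)
                for y in range(n)
            )
            for v in range(n)
        ]
        for u in range(n)
    ]


def top_energy_indices(coeff_frames, keep):
    """Positions of the `keep` coefficients with the most energy over all frames."""
    n = len(coeff_frames[0])
    energy = {
        (u, v): sum(frame[u][v] ** 2 for frame in coeff_frames)
        for u in range(n)
        for v in range(n)
    }
    return sorted(energy, key=energy.get, reverse=True)[:keep]


def stack_frames(features, context):
    """Concatenate each frame with `context` neighbours per side (edges clamped)."""
    last = len(features) - 1
    stacked = []
    for t in range(len(features)):
        window = []
        for k in range(t - context, t + context + 1):
            window.extend(features[min(max(k, 0), last)])
        stacked.append(window)
    return stacked


# Toy data: three 4x4 "mouth region" frames with arbitrary pixel values.
frames = [
    [[float((x * y + t) % 5) for y in range(4)] for x in range(4)]
    for t in range(3)
]
coeffs = [dct2(frame) for frame in frames]
kept = top_energy_indices(coeffs, keep=3)
features = [[c[u][v] for (u, v) in kept] for c in coeffs]
stacked = stack_frames(features, context=1)  # input to a discriminant projection
print(len(stacked), len(stacked[0]))
```

Selecting coefficients by total energy across frames is one simple way to get a "high-energy, reduced-dimensionality" representation. The stacked vectors are what a discriminant projection would then reduce further.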
Sample-Based Talking-Head Synthesis
- in PhD Thesis, Signal Processing Lab, Swiss Federal Institute of Technology
, 2002
Cited by 7 (0 self)
Audio-Visual Unit Selection for the Synthesis of Photo-Realistic Talking-Heads
- in Proc. Int. Conf. Multimedia Expo
, 2000
Cited by 7 (2 self)
This paper investigates audio-visual unit selection for the synthesis of photo-realistic, speech-synchronized talking-head animations. These animations are synthesized from recorded video samples of a subject speaking in front of a camera, resulting in a photo-realistic appearance. The lip-synchronization is obtained by optimally selecting and concatenating variable-length video units of the mouth area. Synthesizing a new speech animation from these recorded units starts with audio speech and its phonetic annotation from a text-to-speech synthesizer. Then, optimal image units are selected from the recorded set using a Viterbi search through a graph of candidate image units. Costs are attached to the nodes and arcs of the graph that are computed from similarities in both the acoustic and visual domain. While acoustic similarities are computed by simple phonetic matching, visual similarities are estimated using a hierarchical metric that uses high-level features (position and sizes of facial parts) and low-level features (projection of the image pixels on principal components of the database). This method preserves coarticulation and temporal coherence, producing smooth, lip-synched animations. Once the database has been prepared, this system can produce animations from ASCII text fully automatically. Keywords: facial animation, talking-heads, sample-based image synthesis, computer vision.
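The Viterbi search over candidate units described above can be sketched as a dynamic program over a lattice with node costs and arc costs. The cost functions below are deliberately simple stand-ins, not the paper's hierarchical metric: the node cost is a 0/1 phonetic mismatch, and the arc cost is the difference in a single hypothetical "mouth opening" value between consecutive units. All names and data are illustrative assumptions.

```python
def viterbi_select(candidates, target_cost, concat_cost):
    """Pick one unit per slot, minimising node (target) plus arc (concat) costs.

    candidates[i] is the list of candidate units for slot i.
    Returns (total_cost, chosen_units).
    """
    # best[i][j] = (cheapest cost of a path ending at candidates[i][j], back-pointer)
    best = [[(target_cost(0, unit), None) for unit in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            arrive = [
                best[i - 1][k][0] + concat_cost(prev, unit)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(arrive)), key=arrive.__getitem__)
            row.append((arrive[k_best] + target_cost(i, unit), k_best))
        best.append(row)

    # Trace back from the cheapest final node.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return total, path


# Hypothetical units: (phoneme label, mouth opening in arbitrary units).
targets = ["m", "a", "p"]
candidates = [
    [("m", 0.1), ("m", 0.5)],
    [("a", 0.9), ("a", 0.4)],
    [("p", 0.0), ("b", 0.1)],
]


def target_cost(i, unit):
    """Stand-in node cost: 0 if the unit's phoneme matches the target, else 1."""
    return 0.0 if unit[0] == targets[i] else 1.0


def concat_cost(prev, unit):
    """Stand-in arc cost: jump in mouth opening between consecutive units."""
    return abs(prev[1] - unit[1])


total, path = viterbi_select(candidates, target_cost, concat_cost)
print(total, path)
```

The search cost grows with the number of slots times the square of the number of candidates per slot, rather than with the number of possible paths, which is why a Viterbi search is practical for this kind of lattice.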

