Results 1 -
4 of
4
Recent advances in the automatic recognition of audio-visual speech
- PROC. IEEE
, 2003
"... Visual speech information from the speaker’s mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human computer interface. In this paper, we review the main components of audio-visual automatic speech r ..."
Abstract
-
Cited by 64 (10 self)
- Add to MetaCart
Visual speech information from the speaker’s mouth region has been successfully shown to improve noise robustness of automatic speech recognizers, thus promising to extend their usability in the human computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: First, the visual front end design, based on a cascade of linear image transforms of an appropriate video region-of-interest, and subsequently, audio-visual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audio-visual speech asynchrony, and incorporating modality reliability estimates to the bimodal recognition process. We also briefly touch upon the issue of audio-visual adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, though less so for visually challenging environments and large vocabulary tasks.
Audio-visual automatic speech recognition: An overview
- Issues in Visual and Audio-visual Speech Processing
, 2004
"... We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, ASR performance has yet to reach the level required for speech to become a truly per ..."
Abstract
-
Cited by 41 (0 self)
- Add to MetaCart
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, ASR performance has yet to reach the level required for speech to become a truly pervasive user interface. Indeed, even in “clean ” acoustic environments, and for a variety of tasks, state of the art ASR system
efase: Expressive facial animation synthesis and editing with phoneme-level controls
- In Proc. of ACM SIGGGRAPH/Eurographics Symposium on Computer Animation
, 2006
"... This paper presents a novel data-driven animation system for expressive facial animation synthesis and editing. Given novel phoneme-aligned speech input and its emotion modifiers (specifications), this system automatically generates expressive facial animation by concatenating captured motion data w ..."
Abstract
-
Cited by 14 (8 self)
- Add to MetaCart
This paper presents a novel data-driven animation system for expressive facial animation synthesis and editing. Given novel phoneme-aligned speech input and its emotion modifiers (specifications), this system automatically generates expressive facial animation by concatenating captured motion data while animators establish constraints and goals. A constrained dynamic programming algorithm is used to search for best-matched captured motion nodes by minimizing a cost function. Users optionally specify “hard constraints " (motion-node constraints for expressing phoneme utterances) and “soft constraints " (emotion modifiers) to guide the search process. Users can also edit the processed facial motion node database by inserting and deleting motion nodes via a novel phoneme-Isomap interface. Novel facial animation synthesis experiments and objective trajectory comparisons between synthesized facial motion and captured motion demonstrate that this system is effective for producing realistic expressive facial animations. Categories and Subject Descriptors (according to ACM CCS): I.3.7 [Computer Graphics]: Three-Dimensional
U.: Expressive speech animation synthesis with phoneme-level control
- Computer Graphics Forum
"... This paper presents a novel data-driven expressive speech animation synthesis system with phonemelevel controls. This system is based on a pre-recorded facial motion capture database, where an actress was directed to recite a pre-designed corpus with four facial expressions (neutral, happiness, ange ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
This paper presents a novel data-driven expressive speech animation synthesis system with phonemelevel controls. This system is based on a pre-recorded facial motion capture database, where an actress was directed to recite a pre-designed corpus with four facial expressions (neutral, happiness, anger, and sadness). Given new phoneme-aligned expressive speech and its emotion modifiers as inputs, a constrained dynamic programming algorithm is used to search for best-matched captured motion clips from the processed facial motion database by minimizing a cost function. Users optionally specify “hard constraints” (motion-node constraints for expressing phoneme utterances) and “soft constraints ” (emotion modifiers) to guide this search process. We also introduce a phoneme-Isomap interface for visualizing and interacting phoneme clusters that are typically composed of thousands of facial motion capture frames. On top of this novel visualization interface, users can conveniently remove contaminated motion subsequences from a large facial motion dataset. Facial animation synthesis experiments and objective comparisons between synthesized facial motion and captured motion showed that this system is effective for producing realistic expressive speech animations.

