Results 1 - 5 of 5
Continuous Pose-Invariant Lipreading
Abstract
Cited by 1 (0 self)
In audio-visual automatic speech recognition (AVASR), no research to date has been conducted into the problem of recognising visual speech whilst the speaker is moving their head. In this paper, we extend our current system to deal with this task, which we entitle continuous pose-invariant lipreading. By developing an AVASR system which can deal with such a scenario, we believe we are making the system effectively “real-world”, as it requires little cooperation from the user and as such can be used in a host of realistic applications (e.g. mobile phones, in-vehicle systems). In this proof-of-concept paper, we show, via experiments on the CUAVE database, that recognising visual speech whilst a speaker is moving their head during the utterance is feasible.
Index Terms: audio-visual automatic speech recognition (AVASR), lipreading, pose-invariance, pose-estimation.
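The abstract above does not describe the system's internals, but one plausible shape for continuous pose-invariant lipreading is to estimate head pose per frame and score each frame with a visual model trained for the nearest pose. The sketch below is a hypothetical illustration, not the paper's implementation: the pose set, the per-frame yaw estimates, and the pose-specific models are all assumptions.

```python
# Sketch (hypothetical): continuous pose-invariant lipreading by quantizing
# a per-frame head-pose estimate and switching between pose-specific models.
POSES = [-30, 0, 30]  # degrees of yaw for which visual models are assumed trained

def nearest_pose(yaw_estimate):
    """Quantize a continuous per-frame yaw estimate to the closest trained pose."""
    return min(POSES, key=lambda p: abs(p - yaw_estimate))

def score_utterance(frames, models):
    """Sum per-frame log-likelihoods, switching models as the head moves.

    `frames` is a list of (yaw_estimate, feature_vector) pairs;
    `models` maps a pose to a callable returning that frame's log-likelihood.
    """
    return sum(models[nearest_pose(yaw)](feat) for yaw, feat in frames)
```

The point of the sketch is that the model choice is made frame by frame, so an utterance in which the speaker turns mid-word is still scored consistently.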
Abstract
As one of the techniques for robust speech recognition under noisy environments, audio-visual speech recognition (AVSR), which uses dynamic lip scene information together with audio information, is attracting attention, and research in the area has advanced in recent years. However, in visual speech recognition (VSR), when a face turns sideways, the shape of the lips as viewed from the camera changes and the recognition accuracy degrades significantly. Therefore, many conventional VSR methods are limited to situations in which the face is viewed from the front. This paper proposes a VSR method that converts faces viewed from various directions into faces viewed from the front using Active Appearance Models (AAM). In the experiment, even when the face direction changed by about 30 degrees relative to a frontal view, the recognition accuracy improved significantly.
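The AAM-based conversion above can be illustrated in parameter space: if paired non-frontal and frontal AAM parameter vectors are available for training, a linear map between them can be estimated by least squares. This is only a minimal sketch under that assumption, with synthetic data standing in for parameters fitted by a real AAM; synthesizing the frontal image from the mapped parameters is not shown.

```python
import numpy as np

# Sketch: pose normalization in AAM parameter space.
# Assumption: rows of P (parameters fitted to non-frontal faces) are paired
# with rows of F (parameters for the same faces viewed frontally).
rng = np.random.default_rng(0)
n_pairs, n_params = 200, 20
P = rng.normal(size=(n_pairs, n_params))          # non-frontal AAM parameters
W_true = rng.normal(size=(n_params, n_params))    # synthetic ground-truth map
F = P @ W_true                                    # frontal AAM parameters

# Least-squares estimate of the linear map W such that P @ W ~= F.
W, *_ = np.linalg.lstsq(P, F, rcond=None)

# At test time, map a non-frontal parameter vector to a frontal estimate;
# the AAM would then synthesize the frontal view from f_est (not shown).
p_test = rng.normal(size=n_params)
f_est = p_test @ W
```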
Abstract
Multi-pose Lipreading and Audio-Visual Speech Recognition
Virginia Estellers and Jean-Philippe Thiran
In this article, we study the adaptation of visual and audio-visual speech recognition systems to non-ideal visual conditions. We focus on overcoming the effects of a changing pose of the speaker, a problem encountered in natural situations where the speaker moves freely and does not keep a frontal pose in relation to the camera. To handle these situations, we introduce a pose normalization block in a standard system and generate virtual frontal views from non-frontal images. The proposed method is inspired by pose-invariant face recognition and relies on linear regression to find an approximate mapping between images from different poses. We integrate the proposed pose normalization block at different stages of the speech recognition system and quantify the loss of performance related to pose changes and pose normalization techniques. In audio-visual experiments we also analyze the integration of the audio and visual streams. We show that an audio-visual system should account for non-frontal poses and normalization techniques in terms of the weight assigned to the visual stream in the classifier.
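The linear-regression mapping this abstract mentions can be sketched directly in image space: learn a matrix that maps vectorized non-frontal mouth images to virtual frontal views from paired training images. The sketch below uses synthetic data as a stand-in for real paired-pose images and ridge regression for numerical stability; it illustrates the general idea, not the authors' implementation.

```python
import numpy as np

# Sketch: learn a linear map A with X @ A ~= Y, where each row of X is a
# vectorized non-frontal image and the matching row of Y its frontal view.
# Assumption: synthetic Gaussian data stands in for real paired images.
rng = np.random.default_rng(1)
n_pix = 16 * 16                         # small image size for the sketch
X = rng.normal(size=(500, n_pix))       # non-frontal images, one per row
A_true = np.eye(n_pix) + 0.01 * rng.normal(size=(n_pix, n_pix))
Y = X @ A_true                          # corresponding frontal images

# Ridge regression: the penalty lam keeps the solve stable when pixels
# are strongly correlated, as neighbouring mouth pixels are in practice.
lam = 1e-3
A = np.linalg.solve(X.T @ X + lam * np.eye(n_pix), X.T @ Y)

# Pose normalization at test time: a new non-frontal image is mapped to a
# virtual frontal view before visual features are extracted from it.
x_new = rng.normal(size=n_pix)
virtual_frontal = x_new @ A
```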
MULTIPOSE AUDIO-VISUAL SPEECH RECOGNITION
Abstract
In this paper we study the adaptation of visual and audio-visual speech recognition systems to non-ideal visual conditions. We focus on the effects of a changing pose of the speaker relative to the camera, a problem encountered in natural situations. To that purpose, we introduce a pose normalization technique and perform speech recognition from multiple views by generating virtual frontal views from non-frontal images. The proposed method is inspired by pose-invariant face recognition studies and relies on linear regression to find an approximate mapping between images from different poses. Lipreading experiments quantify the loss of performance related to pose changes and the proposed pose normalization techniques, while audio-visual results analyse how an audio-visual system should account for non-frontal poses in terms of the weight assigned to the visual modality in the audio-visual classifier.
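The stream-weighting question these abstracts raise can be made concrete with a standard weighted log-likelihood fusion: a visual weight scales the visual stream's contribution, and would typically be lowered for non-frontal poses or imperfect normalization. The function names and values below are illustrative assumptions, not taken from the papers.

```python
# Sketch: weighted audio-visual fusion. The visual weight lam in [0, 1]
# controls how much the classifier trusts the visual stream; a pose-aware
# system would reduce lam when the speaker is not frontal to the camera.
def av_log_likelihood(log_p_audio, log_p_visual, lam):
    """Combine one candidate's stream log-likelihoods with visual weight lam."""
    return (1.0 - lam) * log_p_audio + lam * log_p_visual

def decide(candidates, lam):
    """Pick the word with the highest weighted audio-visual score.

    `candidates` maps word -> (audio log-likelihood, visual log-likelihood).
    """
    return max(candidates, key=lambda w: av_log_likelihood(*candidates[w], lam))
```

With lam = 0 the decision falls back to audio-only recognition; with lam = 1 it is pure lipreading, which is why the weight itself must track visual conditions.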