Audio-visual Information Fusion In Human Computer Interfaces and Intelligent Environments: A Survey
Citations: 6 (5 self)
BibTeX
@MISC{Shivappa_audio-visualinformation,
author = {Shankar T. Shivappa and Mohan M. Trivedi and Bhaskar D. Rao},
title = {Audio-visual Information Fusion In Human Computer Interfaces and Intelligent Environments: A Survey},
year = {}
}
Abstract
Microphones and cameras have been used extensively to observe and detect human activity and to facilitate natural modes of interaction between humans and intelligent systems. The human brain processes the audio and video modalities, extracting complementary and robust information from them. Intelligent systems with audio-visual sensors should be capable of achieving similar goals. The audio-visual information fusion strategy is a key component in designing such systems. In this paper we survey exclusively the fusion techniques used in various audio-visual information fusion tasks. The fusion strategy tends to depend mainly on the model, probabilistic or otherwise, that the particular task uses to process sensory information into higher-level semantic information. The models themselves are task-oriented. We describe the fusion strategies and the corresponding models used in audio-visual tasks such as speech recognition, tracking, biometrics, affective state recognition, and meeting scene analysis. We also review the challenges, the existing solutions, and the unresolved or partially resolved issues in these fields. Specifically, we discuss established and upcoming work in hierarchical fusion strategies and crossmodal learning techniques, identifying these as critical areas of research in the future development of intelligent systems.







