Microphone array driven speech recognition: influence of localization on the word error rate
- 2nd Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms, 2005
Cited by 13 (4 self)
Abstract. Interest within the automatic speech recognition (ASR) research community has recently focused on the recognition of speech captured with one or more microphones located in the far field, rather than being mounted on a headset and positioned next to the speaker’s mouth. Far field ASR is a natural application for beamforming techniques using an array of microphones. A prerequisite for applying such techniques, however, is a reliable means of speaker localization. In this work, we compare the accuracy of source localization systems based on only audio features, only video features, as well as a combination of audio and video features using speech data collected during seminars held by actual speakers. We also investigate the influence of source localization accuracy on the word error rate (WER) of a far field ASR system, comparing the WERs obtained with position estimates from several automatic source localizers with those obtained from true speaker positions. Our results reveal that accurate speaker localization is crucial for minimizing the error rate of a far field ASR system.
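The WER comparison described in this abstract rests on a standard measure: the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. A minimal Python sketch follows, assuming whitespace tokenization and equal costs for substitutions, deletions, and insertions. The function name is hypothetical and this is not the authors' scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.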
Boosting-based multimodal speaker detection for distributed meetings
- in IEEE International Workshop on Multimedia Signal Processing (MMSP), 2006
Cited by 8 (2 self)
Abstract. Speaker detection is a very important task in distributed meeting applications. This paper discusses a number of challenges we met while designing a speaker detector for the Microsoft RoundTable distributed meeting device, and proposes a boosting-based multimodal speaker detection (BMSD) algorithm. Instead of performing sound source localization (SSL) and multiperson detection (MPD) separately and subsequently fusing their individual results, the proposed algorithm uses boosting to select features from a combined pool of both audio and visual features simultaneously. The result is a very accurate speaker detector with extremely high efficiency. The algorithm reduces the error rate of the SSL-only approach by 47%, and that of the SSL and MPD fusion approach by 27%.
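The idea of letting boosting pick features from a single pool that mixes audio and visual measurements can be illustrated with generic discrete AdaBoost over threshold stumps. The sketch below is not the BMSD algorithm itself. The feature layout, function names, and toy data are assumptions made for illustration only.

```python
import math


def stump_predict(sample, feature, threshold, polarity):
    """Weak classifier: thresholds one feature of the pooled vector."""
    return polarity if sample[feature] >= threshold else -polarity


def train_adaboost(samples, labels, rounds):
    """samples: pooled feature vectors (audio and visual features concatenated).
    labels: +1 (speaker) or -1 (non-speaker).
    Returns a list of (feature, threshold, polarity, alpha) tuples."""
    n = len(samples)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        # Exhaustive search: every feature in the pool competes in every round.
        for feature in range(len(samples[0])):
            for threshold in sorted({s[feature] for s in samples}):
                for polarity in (1, -1):
                    error = sum(
                        w for s, y, w in zip(samples, labels, weights)
                        if stump_predict(s, feature, threshold, polarity) != y
                    )
                    if best is None or error < best[0]:
                        best = (error, feature, threshold, polarity)
        error, feature, threshold, polarity = best
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - error) / error)
        model.append((feature, threshold, polarity, alpha))
        # Increase the weight of misclassified samples, then renormalize.
        weights = [
            w * math.exp(-alpha * y * stump_predict(s, feature, threshold, polarity))
            for s, y, w in zip(samples, labels, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, sample):
    score = sum(a * stump_predict(sample, f, t, p) for f, t, p, a in model)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Hypothetical pooled vectors: [audio_0, audio_1, visual_0].
    samples = [[0.5, 0.1, 0.9], [0.4, 0.9, 0.8], [0.5, 0.2, 0.1], [0.4, 0.8, 0.2]]
    labels = [1, 1, -1, -1]
    model = train_adaboost(samples, labels, rounds=2)
    print([feature for feature, _, _, _ in model])
```

The indices of the selected features show which modality each round relied on, which is the sense in which boosting performs feature selection across the pooled audio and visual measurements.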
Multiple Object Tracking Performance Metrics and Evaluation in a Smart Room Environment
Cited by 6 (4 self)
Simultaneous tracking of multiple persons in real world environments is an active research field and several approaches have been proposed, based on a variety of features and algorithms. Recently, there has been a growing interest in organizing systematic evaluations to compare the various techniques. Unfortunately, the lack of common metrics for measuring the performance of multiple object trackers still makes it hard to compare their results. In this work, we introduce two intuitive and general metrics to allow for objective comparison of tracker characteristics, focusing on their precision in estimating object locations, their accuracy in recognizing object configurations and their ability to consistently label objects over time. We also present a novel system for tracking multiple users in
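The abstract does not give formulas for the two metrics. A commonly used formulation of such a pair, known in the CLEAR evaluations as MOTP (precision) and MOTA (accuracy), can be sketched as follows. The sketch assumes that hypotheses have already been matched to ground-truth objects in each frame, and all class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FrameStats:
    """Per-frame counts after hypotheses have been matched to ground truth."""
    num_ground_truth: int   # objects present in this frame
    misses: int             # ground-truth objects with no matched hypothesis
    false_positives: int    # hypotheses with no matched ground-truth object
    mismatches: int         # identity switches relative to earlier frames
    match_distances: List[float] = field(default_factory=list)  # one per matched pair


def motp(frames: List[FrameStats]) -> float:
    """Precision: mean position error over all matched pairs in all frames."""
    total_matches = sum(len(f.match_distances) for f in frames)
    if total_matches == 0:
        raise ValueError("no matched pairs")
    return sum(sum(f.match_distances) for f in frames) / total_matches


def mota(frames: List[FrameStats]) -> float:
    """Accuracy: 1 minus the ratio of all errors to all ground-truth objects."""
    total_ground_truth = sum(f.num_ground_truth for f in frames)
    if total_ground_truth == 0:
        raise ValueError("no ground-truth objects")
    errors = sum(f.misses + f.false_positives + f.mismatches for f in frames)
    return 1.0 - errors / total_ground_truth


if __name__ == "__main__":
    frames = [
        FrameStats(2, 0, 0, 0, [0.1, 0.3]),
        FrameStats(2, 1, 1, 0, [0.2]),
    ]
    print(motp(frames), mota(frames))
```

Separating the two values keeps localization precision independent of configuration errors: a tracker can place matched objects very precisely and still score poorly on accuracy if it misses people or swaps identities.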
Audio-visual Information Fusion In Human Computer Interfaces and Intelligent Environments: A Survey
Cited by 6 (5 self)
Microphones and cameras have been extensively used to observe and detect human activity and to facilitate natural modes of interaction between humans and intelligent systems. The human brain processes the audio and video modalities, extracting complementary and robust information from them. Intelligent systems with audio-visual sensors should be capable of achieving similar goals. The audio-visual information fusion strategy is a key component in designing such systems. In this paper we exclusively survey the fusion techniques used in various audio-visual information fusion tasks. The fusion strategy used tends to depend mainly on the model, probabilistic or otherwise, used in the particular task to process sensory information to obtain higher level semantic information. The models themselves are task oriented. In this paper we describe the fusion strategies and the corresponding models used in audio-visual tasks such as speech recognition, tracking, biometrics, affective state recognition and meeting scene analysis. We also review the challenges, the existing solutions, and the unresolved or partially resolved issues in these fields. Specifically, we discuss established and upcoming work in hierarchical fusion strategies and cross-modal learning techniques, identifying these as critical areas of research in the future development of intelligent systems.
Capturing Interactions in Meetings with Omnidirectional Cameras
- International Workshop on Multimedia Technologies in E-learning and Collaboration, 2003
Cited by 5 (1 self)
To provide intelligent services in smart environments it is necessary to acquire information about the room, the people in it and their interactions. This includes, for example, the number of people, their identities, locations, postures, body and head orientations, among others. This paper gives an overview of the perceptual technology evaluations that were conducted in the CHIL project, specifically those held in the CLEAR 2006 and 2007 evaluation workshops. We then summarize the main achievements and lessons learnt in the project in the areas of person tracking, person identification and head pose estimation, all of which are critical perception components for building perceptive smart environments.
Detection and Localization of 3D Audio-Visual Objects Using Unsupervised Clustering
Cited by 5 (3 self)
This paper addresses the issues of detecting and localizing objects in a scene that are both seen and heard. We explain the benefits of a human-like configuration of sensors (binaural and binocular) for gathering auditory and visual observations. It is shown that the detection and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data into a common audio-visual 3D representation via a pair of mixture models. Inference is performed by a version of the expectation-maximization algorithm, which is formally derived, and which provides cooperative estimates of both the auditory activity and the 3D position of each object. We describe several experiments with single- and multiple-speaker detection and localization, in the presence of other audio sources.
3D head tracking using the particle filter with cascaded classifiers
- In: BMVC (2006)
Cited by 4 (0 self)
We propose a method for real-time people tracking using multiple cameras. The particle filter framework is known to be effective for tracking people, but most existing methods adopt only simple perceptual cues such as color histograms or contour similarity for hypothesis evaluation. To improve the robustness and accuracy of tracking, more sophisticated hypothesis evaluation is indispensable. We therefore present a novel technique for human head tracking using cascaded classifiers based on AdaBoost and Haar-like features for hypothesis evaluation. In addition, we use multiple classifiers, each of which is trained to detect one direction of a human head. During real-time tracking the most suitable classifier is adaptively selected by considering each hypothesis and the known camera position. Our experimental results demonstrate the effectiveness and robustness of our method.
Multimodal fusion for multimedia analysis: a survey
, 2010
Cited by 4 (1 self)
This survey aims at providing multimedia researchers with a state-of-the-art overview of fusion strategies, which are used for combining multiple modalities in order to accomplish various multimedia analysis tasks. The existing literature on multimodal fusion research is presented through several classifications based on the fusion methodology and the level of fusion (feature, decision, and hybrid). The fusion methods are described from the perspective of the basic concept, advantages, weaknesses, and their usage in various analysis tasks as reported in the literature. Moreover, several distinctive issues that influence a multimodal fusion process, such as the use of correlation and independence, confidence level, contextual information, synchronization between different modalities, and the optimal modality selection, are also highlighted. Finally, we present the open issues for further research in the area of multimodal fusion.
Head Orientation Estimation using Particle Filtering in Multiview Scenarios
Cited by 3 (0 self)
Abstract. This paper presents a novel approach to estimating the head pose and 3D face orientation of several people in low-resolution sequences from multiple calibrated cameras. Spatial redundancy is exploited and the head in the scene is approximated by an ellipsoid. Skin patches from each detected head are located in each camera view. Data fusion is performed by back-projecting skin patches from single images onto the estimated 3D head model, thus providing a synthetic reconstruction of the head appearance. A particle filter is employed to estimate the head pan angle of the person under study. A likelihood function based on the face appearance is introduced. Experimental results demonstrating the effectiveness of the proposed algorithm are provided for the SmartRoom scenario of the CLEAR Evaluation 2007 Head Orientation dataset.
1 Video Head Pose Estimation
This section presents a new approach to multi-camera head pose estimation from low-resolution images based on Particle Filtering (PF) [1]. A spatial and color analysis of these input images is performed and redundancy among cameras is exploited to produce a synthetic reconstruction of the head of the person. This information is used to construct the likelihood function that will weight the particles of this PF based on visual information. The head orientation estimate is computed as the expectation of the pan angle, thus producing a real-valued output. For a given frame in the video sequence, a set of N images is obtained from the N cameras. Each camera is modeled using a pinhole camera model based on perspective projection. Accurate calibration information is available. Bounding boxes describing the head of a person in multiple views are used to segment the area of interest where the colour module will be applied.
The center and size of the bounding box define an ellipsoid model H = {c, R, s}, where c is the center, R the rotation along each axis centered on c, and s the length of each axis. Colour information is processed as described in the following subsection.
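Computing the output as the expectation of the pan angle over weighted particles can be sketched as below. The excerpt does not say how the wrap-around at 360 degrees is handled, so this sketch assumes a weighted circular mean, which is a substitution of our own and not necessarily the authors' estimator. The function name and the degree convention are also assumptions.

```python
import math
from typing import Sequence


def expected_pan_angle(angles_deg: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted circular mean of particle pan angles, in degrees in [0, 360)."""
    if len(angles_deg) != len(weights) or len(angles_deg) == 0:
        raise ValueError("need one weight per particle and at least one particle")
    total = sum(weights)
    if total <= 0:
        raise ValueError("particle weights must sum to a positive value")

    # Average the unit vectors instead of the raw angles, so that particles
    # at 350 and 10 degrees average to about 0 rather than 180.
    x = sum(w * math.cos(math.radians(a)) for a, w in zip(angles_deg, weights)) / total
    y = sum(w * math.sin(math.radians(a)) for a, w in zip(angles_deg, weights)) / total
    return math.degrees(math.atan2(y, x)) % 360.0


if __name__ == "__main__":
    # Two equally weighted particles on either side of 90 degrees.
    print(expected_pan_angle([80.0, 100.0], [0.5, 0.5]))
```

When all particles lie within a narrow range away from the wrap-around point, the result stays close to the ordinary weighted mean of the angles.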
Estimating the Lecturer’s Head Pose in Seminar Scenarios - A Multi-view Approach
- 2nd Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI 2005), 2005
Cited by 3 (1 self)
Abstract. In this paper, we present a system to track the horizontal head orientation of a lecturer in a smart seminar room, which is equipped with several cameras. We automatically detect and track the face of the lecturer and use neural networks to classify his or her face orientation in each camera view. By combining the single estimates of the speaker’s head orientation from multiple cameras into one joint hypothesis, we improve overall head pose estimation accuracy. We conducted experiments on annotated recordings from real seminars. Using the proposed fully automatic system, we are able to correctly determine the lecturer’s head pose 59% of the time for 8 orientation classes. In 92% of the time, the correct pose class or a neighbouring pose class (i.e. a 45 degree error) was estimated.
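The two figures reported here, exact-class accuracy and accuracy within one neighbouring class, can be computed as in the sketch below. It assumes that the 8 orientation classes are numbered 0 to 7 around the circle in 45 degree steps, so that classes 0 and 7 are neighbours. The function names and the sample data are hypothetical.

```python
from typing import Sequence


def class_distance(a: int, b: int, num_classes: int = 8) -> int:
    """Number of class steps between two orientation classes on a circle."""
    d = abs(a - b) % num_classes
    return min(d, num_classes - d)


def pose_accuracy(predicted: Sequence[int], actual: Sequence[int],
                  tolerance: int = 0, num_classes: int = 8) -> float:
    """Fraction of frames whose prediction is within `tolerance` class steps."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("need equally long, non-empty label sequences")
    hits = sum(
        class_distance(p, a, num_classes) <= tolerance
        for p, a in zip(predicted, actual)
    )
    return hits / len(predicted)


if __name__ == "__main__":
    predicted = [0, 1, 7, 4]
    actual = [0, 2, 0, 0]
    print(pose_accuracy(predicted, actual))               # exact class only
    print(pose_accuracy(predicted, actual, tolerance=1))  # allow a 45 degree error
```

Setting `tolerance=1` corresponds to accepting the correct class or either of its two neighbours.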

