Providing the Basis for Human-Robot-Interaction: A Multi-Modal Attention System for a Mobile Robot
, 2003
Cited by 33 (5 self)
In order to enable the widespread use of robots in home and office environments, systems with natural interaction capabilities have to be developed. A prerequisite for natural interaction is the robot's ability to automatically recognize when and how long a person's attention is directed towards it for communication. As in open environments several persons can be present simultaneously, the detection of the communication partner is of particular importance. In this paper we present an attention system for a mobile robot which enables the robot to shift its attention to the person of interest and to maintain attention during interaction. Our approach is based on a method for multi-modal person tracking which uses a pan-tilt camera for face recognition, two microphones for sound source localization, and a laser range finder for leg detection. Shifting of attention is realized by turning the camera into the direction of the person which is currently speaking. From the orientation of the head it is decided whether the speaker addresses the robot. The performance of the proposed approach is demonstrated with an evaluation. In addition, qualitative results from the performance of the robot at the exhibition part of the ICVS'03 are provided.
Automatic Punctuation And Disfluency Detection In Multi-Party Meetings Using Prosodic And Lexical Cues
, 2002
Cited by 21 (4 self)
We investigate automatic approaches to finding "hidden" spontaneous speech events, such as sentence boundaries and disfluencies, in multi-party meetings. Hidden events are characterized prosodically by a large array of automatically extracted energy, duration, and pitch features, and are modeled by decision tree classifiers; lexical cues are modeled by N-gram language models. Both sources of information are combined in a hidden Markov model framework. Results show that combined classifiers achieve higher accuracy than either single knowledge source alone. We also study classifiers that use only the preceding context for predicting events, simulating online processing. We find that prosodic features are more robust than are language model features to this constraint. Finally, we examine the effect of automatic word recognition errors, in both training and testing, on classification accuracy. We find that lexical models degrade much more severely than do prosodic models in this case, again showing the relative robustness of prosodic information for hidden-event detection in natural conversation.
Can Prosody Aid the Automatic Processing of Multi-Party Meetings? Evidence from Predicting . . .
- IN PROC. ISCA TUTORIAL AND RESEARCH WORKSHOP ON PROSODY IN SPEECH RECOGNITION AND UNDERSTANDING (PROSODY
, 2001
Cited by 17 (2 self)
We investigate whether probabilistic modeling of prosody can aid various automatic labeling tasks essential for processing of multi-party meetings. Task 1, automatic punctuation, seeks to classify sentence boundaries and disfluencies. Task 2, jump-in points, predicts locations within foreground speech at which background speakers start talking; Task 3, jump-in words, examines characteristics of the speech they use to do so. Data are from the ICSI Meeting Recorder corpus. To infer inherent cues, analyses are based on close-talking microphone signals and recognizer forced alignments. As a generous baseline for word-level cues, we compare prosodic models to those of a language model given the true words. Results for Task 1 show prosody reduces classification error by 10% relative over the cheating language model; furthermore when this task is run in "online" mode the prosodic model degrades less than does the language model. For Task 2, the language model provides no information, while the prosodic model reduces entropy by 13% over chance. For Task 3, a prosodic model reduces entropy by 25% over chance. Analyses also show interesting prosodic patterns, which differ over tasks. Task 1 uses cues similar to those for Switchboard (but not Broadcast News) data. Task 2 predicts jump-in points that look prosodically like sentence boundaries but that are not actually such boundaries. And Task 3 shows that speakers "raise" their voice when starting during another's talk, compared to starting during silence. These results provide evidence that prosodic modeling can be of use for the automatic processing of meetings. Further results and implications for future automatic meeting processing systems are discussed.
Towards a humanoid museum guide robot that interacts with multiple persons
- in Proc. of the IEEE/RSJ International Conference on Humanoid Robots (Humanoids
, 2005
Cited by 7 (2 self)
The purpose of our research is to develop a humanoid museum guide robot that performs intuitive, multimodal interaction with multiple persons. In this paper, we present a robotic system that makes use of visual perception, sound source localization, and speech recognition to detect, track, and involve multiple persons into interaction. Depending on the audio-visual input, our robot shifts its attention between different persons. In order to direct the attention of its communication partners towards exhibits, our robot performs gestures with its eyes and arms. As we demonstrate in practical experiments, our robot is able to interact with multiple persons in a multimodal way and to shift its attention between different people. Furthermore, we discuss experiences made during a two-day public demonstration of our robot.
Understanding referring expressions in situated language: Some challenges for real-world agents
- In Proceedings of the First International Workshop on Language Understanding and Agents for the Real
, 2003
Cited by 6 (0 self)
Past research efforts to construct embodied conversational agents have focused on features of multi-channel communication that are made possible by embodiment. This paper draws attention instead to 4 crucial aspects of the linguistic forms that an embodied agent will encounter, and that language understanding agents must master in order to converse and collaborate with humans in the real world. If language technology is to be ready to support conversation in the copresent technologies that will soon arrive, these deficiencies in linguistic competence must be addressed now.
Integrating vision and speech for conversations with multiple persons
- in Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS
, 2005
Cited by 5 (2 self)
An essential capability for a robot designed to interact with humans is to show attention to the people in its surroundings. To enable a robot to involve multiple persons into interaction requires the maintenance of an accurate belief about the people in the environment. In this paper, we use a probabilistic technique to update the knowledge of the robot based on sensory input. In this way, the robot is able to reason about the uncertainty in its belief about people in the vicinity and is able to shift its attention between different persons. Even people who are not the primary conversational partners are included into the interaction. In practical experiments with a humanoid robot, we demonstrate the effectiveness of our approach.
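As a rough illustration of the kind of probabilistic update this abstract describes, the sketch below applies a recursive Bayes update to a binary "person present" belief. The abstract does not name the technique the authors use, so the binary model, the function name `update_presence_belief`, and the detection probabilities are all assumptions made for this example.

```python
def update_presence_belief(prior, detected, p_hit=0.9, p_false_alarm=0.1):
    """Return P(person present) after one sensor reading.

    prior         -- belief that a person is present before the reading
    detected      -- True if the (hypothetical) detector fired
    p_hit         -- assumed P(detector fires | person present)
    p_false_alarm -- assumed P(detector fires | nobody present)
    """
    if detected:
        like_present, like_absent = p_hit, p_false_alarm
    else:
        like_present, like_absent = 1.0 - p_hit, 1.0 - p_false_alarm

    # Bayes' rule: posterior is likelihood times prior, normalized.
    numerator = like_present * prior
    evidence = numerator + like_absent * (1.0 - prior)
    return numerator / evidence


# Start undecided, then fold in a sequence of hypothetical detections.
belief = 0.5
for reading in (True, True, False):
    belief = update_presence_belief(belief, reading)
```

Because the belief is carried forward between readings, a single missed detection lowers it without discarding the earlier evidence, which is one way a robot could keep track of people who briefly leave its field of view.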
Fritz -- A Humanoid Communication Robot
, 2007
Cited by 5 (3 self)
In this paper, we present the humanoid communication robot Fritz. Our robot communicates with people in an intuitive, multimodal way. Fritz uses speech, facial expressions, eye-gaze, and gestures to interact with people. Depending on the audio-visual input, our robot shifts its attention between different persons in order to involve them into the conversation. It performs human-like arm gestures during the conversation and also uses pointing gestures generated with eyes, head, and arms to direct the attention of its communication partners towards objects of interest. To express its emotional state, the robot generates facial expressions and adapts the speech synthesis. We discuss experiences made during two public demonstrations of our robot.
Prosody-Based Automatic Detection of Punctuation and Interruption Events in the ICSI Meeting Recorder Corpus
, 2002
Cited by 3 (0 self)
This report focuses on extending the use of prosody to the domain of natural meetings using a collection recorded at the International Computer Science Institute (ICSI). This corpus presents new challenges, because speakers are familiar with one another, have access to other cues such as gesture, are not typically constrained to one topic as they are in corpora such as Broadcast News or Switchboard, and because of the high degree of speaker overlap and presence of multiple speakers. This study uses automatically derived prosodic features, including stylized pitch, pause durations and energy statistics, based on both forced alignments and recognized words, to build a prosodic classifier for various events of interest. An analysis of performance degradations in the ASR-based feature set is included and provides some useful observations regarding the feasibility of a fully automatic system in the presence of word errors. The value of "online" classifiers, which have no access to future features and would therefore be used in real time systems, is assessed and compared to the case of the full feature set. This comparison is relevant for ongoing research (Y. Matsusaka, et al., 2001) where robotic conversational agents participate in meetings and interact with human participants. In order for such a machine to function well, it must master the prediction of pragmatic and semantic events. Where applicable, results are compared to a language model classifier, which provides a measure of the usefulness of words alone in our classification tasks. Finally, the performance of a combined prosodic and language classifier is assessed. These event classification systems allow for feature analysis across tasks that provide important insights on the usefulness of various cues in both hum...
Multimodal conversation between a humanoid robot and multiple persons
- in Proc. of the Workshop on Modular Construction of Humanlike Intelligence at the Twentieth National Conferences on Artificial Intelligence (AAAI
, 2005
Cited by 3 (2 self)
Attracting people and involving multiple persons into an interaction is an essential capability for a humanoid robot. A prerequisite for such a behavior is that the robot is able to sense people in its vicinity and to know where they are located. In this paper, we propose an approach that maintains a probabilistic belief about people in the surroundings of the robot. Using this belief, the robot is able to memorize people even if they are currently outside its limited field of view. Furthermore, we use a technique to localize a speaker in the environment. In this way, even people who are currently not the primary conversational partners or who are not stored in the robot’s belief can attract its attention. To enrich human-robot interaction and to express how the robot changes its mood, we apply a technique to change its facial expressions. As we demonstrate in practical experiments, by integrating the presented techniques into its control architecture, our robot is able to interact with multiple persons in a multimodal way and to shift its attention between different people.
Robust Recognition of Simultaneous Speech by a Mobile Robot
Cited by 2 (2 self)
This paper describes a system that gives a mobile robot the ability to perform automatic speech recognition with simultaneous speakers. A microphone array is used along with a real-time implementation of geometric source separation (GSS) and a postfilter that gives a further reduction of interference from other sources. The postfilter is also used to estimate the reliability of spectral features and compute a missing feature mask. The mask is used in a missing feature theory-based speech recognition system to recognize the speech from simultaneous Japanese speakers in the context of a humanoid robot. Recognition rates are presented for three simultaneous speakers located at 2 m from the robot. The system was evaluated on a 200-word vocabulary at different azimuths between sources, ranging from 10° to 90°. Compared to the use of the microphone array source separation alone, we demonstrate an average reduction in relative recognition error rate of 24% with the postfilter and of 42% when the missing features approach is combined with the postfilter. We demonstrate the effectiveness of our multisource microphone array postfilter and the improvement it provides when used in conjunction with the missing features theory. Index Terms: Cocktail party, geometric source separation (GSS), microphone array, missing feature theory, robot audition, speech recognition.
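To make the reported figures concrete: a relative error rate reduction compares the error of an improved system against that of a baseline. The sketch below computes it in Python. The function name and the three error rates are hypothetical, chosen only so that the ratios come out to the 24% and 42% quoted in the abstract; they are not results from the paper.

```python
def relative_error_reduction(baseline_error, improved_error):
    """Fraction of the baseline error removed by the improved system."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - improved_error) / baseline_error


# Hypothetical word error rates (assumptions, not values from the paper).
baseline = 0.50               # source separation alone
with_postfilter = 0.38        # separation plus postfilter
with_missing_features = 0.29  # postfilter plus missing feature mask

print(round(relative_error_reduction(baseline, with_postfilter), 2))        # 0.24
print(round(relative_error_reduction(baseline, with_missing_features), 2))  # 0.42
```

A relative reduction says nothing about the absolute error rate: the same 24% figure could describe a drop from 50% to 38% or from 10% to 7.6%.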

