Observations on overlap: Findings and implications for automatic processing of multi-party conversation
- Proc. EUROSPEECH, 2001
- Cited by 51 (10 self)
We examine the distribution of overlapping speech in different corpora of natural multi-party conversations, including two types of meetings, and two corpora of telephone conversations. Analyses are based on forced alignment and speech recognition using an identical recognizer across tasks. Three results are discussed. First, all corpora show high overall rates of overlap, with similar rates for meetings and telephone conversations. Second, speech recognition performance in non-overlapped regions of meetings is no worse than that in single-channel telephone conversations, while recognition in overlap regions degrades considerably. Finally, interrupt locations are associated with endpoints of word-level events in a speaker’s turn, including backchannels, discourse markers, and disfluencies. Results suggest that overlap is an important inherent characteristic of conversational speech that should not be ignored; on the contrary, it should be jointly modeled with acoustic and language model information in machine processing of conversation.
Multispeaker speech activity detection for the ICSI Meeting Recorder
- in Proceedings IEEE Automatic Speech Recognition and Understanding Workshop, Madonna di Campiglio, 2001
- Cited by 32 (4 self)
As part of a project into speech recognition in meeting environments, we have collected a corpus of multi-channel meeting recordings. We expected the identification of speaker activity to be straightforward given that the participants had individual microphones, but simple approaches yielded unacceptably erroneous labelings, mainly due to crosstalk between nearby speakers and wide variations in channel characteristics. Therefore, we have developed a more sophisticated approach for multichannel speech activity detection using a simple hidden Markov model (HMM). A baseline HMM speech activity detector has been extended to use mixtures of Gaussians to achieve robustness for different speakers under different conditions. Feature normalization and crosscorrelation processing are used to increase the channel independence and to detect crosstalk. The use of both energy normalization and crosscorrelation based postprocessing results in a 35% relative reduction of the frame error rate. Speech recognition experiments show that it is beneficial in this multispeaker setting to use the output of the speech activity detector for presegmenting the recognizer input, achieving word error rates within 10% of those achieved with manual turn labeling.
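The abstract above mentions crosscorrelation processing to detect crosstalk between channels. The following is a minimal sketch of that general idea, not the authors' implementation: it computes the peak of the normalized cross-correlation between two equal-length frames from different microphones and flags likely crosstalk when the peak is high. The function names, the lag window, and the 0.7 threshold are all assumptions chosen for illustration.

```python
import math


def normalized_xcorr_peak(a, b, max_lag):
    """Return (peak, lag) of the normalized cross-correlation of two
    equal-length frames, searching lags in [-max_lag, max_lag]."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    a0 = [x - mean_a for x in a]
    b0 = [x - mean_b for x in b]
    norm = math.sqrt(sum(x * x for x in a0) * sum(x * x for x in b0))
    if norm == 0.0:
        # A constant (e.g. silent) frame carries no correlation evidence.
        return 0.0, 0
    best_peak, best_lag = 0.0, 0
    for lag in range(-max_lag, max_lag + 1):
        # Correlate a[i] with b[i + lag] over the samples where both exist.
        total = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                total += a0[i] * b0[j]
        value = total / norm
        if abs(value) > abs(best_peak):
            best_peak, best_lag = value, lag
    return best_peak, best_lag


def looks_like_crosstalk(a, b, max_lag=8, threshold=0.7):
    """Hypothetical decision rule: a strong correlation peak suggests that
    both channels picked up the same talker. The threshold is illustrative."""
    peak, _ = normalized_xcorr_peak(a, b, max_lag)
    return abs(peak) >= threshold
```

A positive lag here means the second channel receives the signal later than the first, which is what one would expect for a microphone farther from the active talker.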
Audiovisual probabilistic tracking of multiple speakers in meetings
- IEEE Transactions on Audio, Speech, and Language Processing, 2007
The Mad Hatter's Cocktail Party: A Social Mobile Audio Space Supporting Multiple Simultaneous Conversations
- Proceedings of the Conference on Computer Human Interaction, 2003
- Cited by 25 (7 self)
This paper presents a mobile audio space intended for use by gelled social groups. In face-to-face interactions in such social groups, conversational floors change frequently, e.g., two participants split off to form a new conversational floor, a participant moves from one conversational floor to another, etc. To date, audio spaces have provided little support for such dynamic regroupings of participants, either requiring that the participants explicitly specify with whom they wish to talk or simply presenting all participants as though they are in a single floor. By contrast, the audio space described here monitors participant behavior to identify conversational floors as they emerge. The system dynamically modifies the audio delivered to each participant to enhance the salience of the participants with whom they are currently conversing. We report a user study of the system, focusing on conversation analytic results.
The ICSI Meeting Project: Resources and Research
- in Proc. of ICASSP 2004 Meeting Recognition Workshop, 2004
- Cited by 20 (4 self)
This paper provides a progress report on ICSI’s Meeting Project, including both the data collected and annotated as part of the project, as well as the research lines such materials support. We include a general description of the official “ICSI Meeting Corpus”, as currently available through the Linguistic Data Consortium, discuss some of the existing and planned annotations which augment the basic transcripts provided there, and describe several research efforts that make use of these materials. The corpus supports wide-ranging efforts, from low-level processing of the audio signal (including automatic speech transcription, speaker tracking, and work on far-field acoustics) to higher-level analyses of meeting structure, content, and interactions (such as topic and sentence segmentation, and automatic detection of dialogue acts and meeting “hot spots”).
Laughter detection in meetings
- in Proc. NIST Meeting Recognition Workshop, 2004
- Cited by 20 (0 self)
We build a system to automatically detect laughter events in meetings, where laughter events are defined as points in the meeting where a number of the participants (more than just one) are laughing simultaneously. We implement our system using a support vector machine classifier trained on mel-frequency cepstral coefficients (MFCCs), delta MFCCs, modulation spectrum, and spatial cues from the time delay between two desktop microphones. We run our experiments on the ‘Bmr’ subset of the ICSI Meeting Recorder corpus using just two table-top microphones and obtain detection results with a correct accept rate of 87% and a false alarm rate of 13%.
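The abstract above reports a correct accept rate and a false alarm rate but does not define them. A minimal sketch, assuming the conventional definitions (correct accepts over all true events, false alarms over all true non-events), shows how such rates could be computed from per-segment labels. The function name and the toy labels are hypothetical and are not taken from the paper.

```python
def detection_rates(truth, predicted):
    """Return (correct_accept_rate, false_alarm_rate) for binary labels.

    truth and predicted are equal-length sequences of 0/1 values, where 1
    marks a segment labeled as an event (for example, laughter).
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    hits = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    false_alarms = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    positives = sum(1 for t in truth if t == 1)
    negatives = len(truth) - positives
    # Guard against division by zero when one class is absent.
    correct_accept = hits / positives if positives else 0.0
    false_alarm = false_alarms / negatives if negatives else 0.0
    return correct_accept, false_alarm


# Toy labels, invented for illustration only.
toy_truth = [1, 1, 1, 1, 0, 0, 0, 0]
toy_predicted = [1, 1, 1, 0, 1, 0, 0, 0]
toy_rates = detection_rates(toy_truth, toy_predicted)
```

Under these definitions the two rates are computed over different denominators, so they need not sum to one in general.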
Can Prosody Aid the Automatic Processing of Multi-Party Meetings? Evidence from Predicting . . .
- in Proc. ISCA Tutorial and Research Workshop on Prosody in Speech Recognition and Understanding (PROSODY, 2001
- Cited by 17 (2 self)
We investigate whether probabilistic modeling of prosody can aid various automatic labeling tasks essential for processing of multi-party meetings. Task 1, automatic punctuation, seeks to classify sentence boundaries and disfluencies. Task 2, jump-in points, predicts locations within foreground speech at which background speakers start talking; Task 3, jump-in words, examines characteristics of the speech they use to do so. Data are from the ICSI Meeting Recorder corpus. To infer inherent cues, analyses are based on close-talking microphone signals and recognizer forced alignments. As a generous baseline for word-level cues, we compare prosodic models to those of a language model given the true words. Results for Task 1 show prosody reduces classification error by 10% relative over the cheating language model; furthermore when this task is run in "online" mode the prosodic model degrades less than does the language model. For Task 2, the language model provides no information, while the prosodic model reduces entropy by 13% over chance. For Task 3, a prosodic model reduces entropy by 25% over chance. Analyses also show interesting prosodic patterns, which differ over tasks. Task 1 uses cues similar to those for Switchboard (but not Broadcast News) data. Task 2 predicts jump-in points that look prosodically like sentence boundaries but that are not actually such boundaries. And Task 3 shows that speakers "raise" their voice when starting during another's talk, compared to starting during silence. These results provide evidence that prosodic modeling can be of use for the automatic processing of meetings. Further results and implications for future automatic meeting processing systems are discussed.
Pitch-based emphasis detection for characterization of meeting recordings
- in Proc. ASRU, Virgin Islands, 2003
- Cited by 13 (0 self)
The automatic extraction of key utterances in spoken data has emerged as an interesting and difficult topic in automatic speech recognition. “Emphasis” or “excitement” may be a useful identifier for these utterances of interest. In this paper, we undertake the task of reliably and automatically identifying emphasized or excited utterances in natural speech in a meeting setting. We start by endeavoring to establish reliable ground truth emphasis labels by using several hand-labelers. The results show that human listeners can reliably identify emphasized utterances in meeting recordings. We then build an automatic emphasis detection system, which uses normalized pitch as its only acoustic predictor. The results show that this pitch-based emphasis detection scheme can distinguish between non-emphasized and emphasized utterances with an accuracy of 92% when ambiguous cases are excluded, a rate comparable to human interlabeler agreement.
Speech Enhancement and Recognition in Meetings With an Audio–Visual Sensor Array
- Cited by 8 (0 self)
Analyzing group interactions in conversations: a review
- in Proc. IEEE Int. Conf. Multisensor Fusion and Integration for Intelligent Systems ’06, 2006

