by Mark Stevenson, Robert Gaizauskas
In Proceedings of the 6th ANLP
http://www.dcs.shef.ac.uk/~marks/publications/rmem1.ps
Abstract:
This paper explores the problem of identifying sentence boundaries in the transcriptions produced by automatic speech recognition systems. An experiment which determines the level of human performance for this task is described, as well as a memory-based computational approach to the problem.

1 The Problem

This paper addresses the problem of identifying sentence boundaries in the transcriptions produced by automatic speech recognition (ASR) systems. This is unusual in the field of text processing, which has generally dealt with well-punctuated text: some of the most commonly used texts in NLP are machine-readable versions of highly edited documents such as newspaper articles or novels. However, there are many types of text which are not edited in this way, and the example on which we concentrate in this paper is the output from ASR systems. These differ from the sort of texts normally used in NLP in a number of ways: the text is generally in single case (usually upper case), unpunctuated, and may contain transcription errors. Figure 1 compares a short text in the format which would be produced by an ASR system with a fully punctuated version which includes case information. For the remainder of this paper, error-free texts such as newspaper articles or novels shall be referred to as "standard text" and the output from a speech recognition system as "ASR text".
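The contrast between standard text and ASR text can be illustrated with a minimal sketch. It is not the authors' method or data: the function names `to_asr_text` and `boundary_labels` are hypothetical, the sentence splitter is a naive stand-in that assumes sentences end in `.`, `!` or `?`, and transcription errors are not simulated. It only shows how a punctuated, mixed-case passage loses its boundary cues once reduced to single-case, unpunctuated form, and how the boundaries could be kept as per-word labels.

```python
import re

def to_asr_text(standard_text):
    """Approximate ASR-style output: upper case only, no punctuation.

    Assumption: transcription errors are ignored in this sketch.
    """
    # Replace everything except word characters (letters, digits,
    # underscore), whitespace, and apostrophes with a space.
    stripped = re.sub(r"[^\w\s']", " ", standard_text)
    # Collapse the runs of whitespace left behind by removed punctuation.
    collapsed = re.sub(r"\s+", " ", stripped).strip()
    return collapsed.upper()

def boundary_labels(standard_text):
    """Label each word 1 if it ends a sentence in the standard text, else 0.

    Assumption: sentences end in '.', '!' or '?' followed by whitespace,
    which is a deliberately naive splitter used only for illustration.
    """
    labels = []
    for sentence in re.split(r"(?<=[.!?])\s+", standard_text.strip()):
        words = to_asr_text(sentence).split()
        if not words:
            continue
        # Every word but the last is sentence-internal.
        labels.extend([0] * (len(words) - 1) + [1])
    return labels

if __name__ == "__main__":
    text = "The cat sat. It purred, then slept!"
    print(to_asr_text(text))
    print(boundary_labels(text))
```

The labels line up one-to-one with the words of the ASR-style string, so a boundary detector working on ASR text would have to recover them without the punctuation and case cues that the standard text provides.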