Sub-Word-Based Language Models for Speech Recognition: Implications for Spoken Document Retrieval
BibTeX
@MISC{Larson_sub-word-basedlanguage,
author = {Martha Larson},
title = {Sub-Word-Based Language Models for Speech Recognition: Implications for Spoken Document Retrieval},
year = {}
}
Abstract
r exact morphology. For IR purposes, a document might be adequately modeled with a vector of indexing features that is a histogram of word stems. A Spoken Document Retrieval (SDR) system requires language representations suited for speech recognition as well as for IR. For additional general information concerning SDR, refer to [4]. The first step in the design of any language model involves the choice of what units to use as fundamental features. For LVCSR language models, these features are the underlying inventory of base units over which the statistical model is defined. For IR language models, these features are the indexing features that will be used to compute the relevance of the document to the user query. This extended abstract explores the base units of language models as a point of contact between language models for LVCSR and for IR, and tries to shed light on how SDR systems can be built with an optimal interface between speech recognition and IR. The first section sketches dist







