Moving Beyond the 'Beads-On-A-String' Model of Speech
- In Proc. IEEE ASRU Workshop, 1999
Cited by 48 (0 self)
The notion that a word is composed of a sequence of phone segments, sometimes referred to as 'beads on a string', has formed the basis of most speech recognition work for over 15 years. However, as more researchers tackle spontaneous speech recognition tasks, that view is being called into question. This paper raises problems with the phoneme as the basic subword unit in speech recognition, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech. We offer two different alternatives -- automatically derived subword units and linguistically motivated distinctive feature systems -- and discuss current work in these directions. In addition, we look at problems that arise in acoustic modeling when trying to incorporate higher-level structure with these two strategies. 1. INTRODUCTION It has often been noted that automatic speech recognition performance is much worse on spontaneous speech than on carefully planned or r...
Towards unsupervised pattern discovery in speech
- Peter Hagedorn, Wolfgang Konrad and J. Wallaschek, The Journal of Sound and Vibration, 2005
Cited by 27 (6 self)
We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a prespecified inventory of lexical units (i.e., phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multiword phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream. Index Terms: speech processing, unsupervised pattern discovery, word acquisition.
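The dynamic programming technique mentioned in this abstract is commonly understood to be dynamic time warping (DTW). The following is a minimal sketch of plain, whole-sequence DTW only; it does not reproduce the segmental variant the authors describe, and the function name `dtw_distance` and the use of Euclidean frame distance are assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Return the classic DTW alignment cost between two sequences of
    feature vectors (each vector is a tuple of floats).

    Assumption: Euclidean distance between frames; the abstract does not
    specify the local distance measure.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Hypothetical one-dimensional "utterances": the second repeats one frame.
x = [(0.0,), (1.0,), (2.0,)]
y = [(0.0,), (1.0,), (1.0,), (2.0,)]
print(dtw_distance(x, y))
```

A segmental variant would additionally search for matching sub-paths inside the two sequences rather than forcing an end-to-end alignment.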
Automatic generation of subword units for speech recognition systems
- IEEE Transactions on Speech and Audio Processing
Cited by 23 (1 self)
Large vocabulary continuous speech recognition (LVCSR) systems traditionally represent words in terms of smaller subword units. Both during training and during recognition, they require a mapping table, called the dictionary, which maps words into sequences of these subword units. The performance of the LVCSR system depends critically on the definition of the subword units and the accuracy of the dictionary. In current LVCSR systems, both these components are manually designed. While manually designed subword units generalize well, they may not be the optimal units of classification for the specific task or environment for which an LVCSR system is trained. Moreover, when human expertise is not available, it may not be possible to design good subword units manually. There is clearly a need for data-driven design of these LVCSR components. In this paper, we present a complete probabilistic formulation for the automatic design of subword units and dictionary, given only the acoustic data and their transcriptions. The proposed framework permits easy incorporation of external sources of information, such as the spellings of words in terms of a nonideographic script. Index Terms: learning, lexical representation, maximum-likelihood, speech recognition, subword units.
Pronunciation Adaptation At the Lexical Level
- Proceedings ISCA ITRW Workshop Adaptation Methods for Speech Recognition, Sophia Antipolis, France [on CD-ROM], 2001
Cited by 14 (8 self)
There are various kinds of adaptation which can be used to enhance the performance of automatic speech recognizers. This paper is about pronunciation adaptation at the lexical level, i.e. about modeling pronunciation variation at the lexical level. In the early years of automatic speech recognition (ASR) research, the amount of pronunciation variation was limited by using isolated words. Since the focus gradually shifted from isolated words to conversational speech, the amount of pronunciation variation present in the speech signals has increased, as has the need to model it. This is reflected by the growing attention for this topic. In this paper, an overview of the studies on lexicon adaptation is presented. Furthermore, many examples are mentioned of situations in which lexicon adaptation is likely to improve the performance of speech recognizers. Finally, it is argued that some assumptions made in current standard ASR systems are not in line with the properties of the speech signals. Consequently, the problem of pronunciation variation at the lexical level probably cannot be solved by simply adding new transcriptions to the lexicon, as it is generally done at the moment.
Automatically learning the units of speech by non-negative matrix factorisation
- in Proc. European Conference on Speech Communication and Technology, 2007
Cited by 7 (5 self)
We present an unsupervised technique to discover the (word-sized) speech units into which a corpus of utterances can be decomposed. First, a fixed-length high-dimensional vector representation of the utterances is obtained. Then, the resulting matrix is decomposed in terms of additive units by applying the non-negative matrix factorisation algorithm. On a small vocabulary task, the obtained basis vectors each represent one of the uttered words. We also investigate the amount of speech data that is needed to obtain a correct set of basis vectors. By decreasing the number of occurrences of the words in the corpus, an indication of the learning rate of the system is obtained. Index Terms: matrix factorisation, word segmentation, phone lattices, language acquisition.
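The decomposition step in this abstract can be illustrated with a small sketch of non-negative matrix factorisation. The abstract does not say which objective or update rule the authors use, so the code below assumes the Frobenius-norm multiplicative updates of Lee and Seung; the names `nmf` and `frobenius_error` and the toy matrix are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(v, rank, iters=300, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (n x m) as w (n x rank) times
    h (rank x m) using multiplicative updates (assumed objective:
    squared Frobenius error)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random initialisation keeps all updates defined.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


def frobenius_error(v, w, h):
    """Squared reconstruction error between v and w times h."""
    approx = matmul(w, h)
    return sum((x - y) ** 2
               for row_v, row_a in zip(v, approx)
               for x, y in zip(row_v, row_a))


# Toy data: columns play the role of utterances, rows of features.
v = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [1.0, 3.0, 2.0]]
w, h = nmf(v, rank=2)
print(frobenius_error(v, w, h))
```

In the setting the abstract describes, each column of `w` would be read as a candidate basis vector for one recurring unit, and each column of `h` as the activations of those units in one utterance.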
ROBUST MUSIC IDENTIFICATION, DETECTION, AND ANALYSIS
Cited by 3 (2 self)
In previous work, we presented a new approach to music identification based on finite-state transducers and Gaussian mixture models. Here, we expand this work and study the performance of our system in the presence of noise and distortions. We also evaluate a song detection method based on a universal background model in combination with a support vector machine classifier and provide some insight into why our transducer representation allows for accurate identification even when only a short song snippet is available.
Single speaker segmentation and inventory selection using dynamic time warping self organization and joint multigram mapping
- in SSW06 ISCA Workshop, 2007
Cited by 2 (1 self)
In speech synthesis the inventory of units is decided by inspection and on the basis of phonological and phonetic expertise. The ephone (or emergent phone) project at CSTR is investigating how self-organisation techniques can be applied to build an inventory based on collected acoustic data together with the constraints of a synthesis lexicon. In this paper we describe a prototype inventory creation method using dynamic time warping (DTW) for acoustic clustering and a joint multigram approach for relating a series of symbols that represent the speech to these emerged units. We initially examined two symbol sets: (1) a baseline of standard phones and (2) orthographic symbols. The success of the approach is evaluated by comparing word boundaries generated by the emergent phones against those created using state-of-the-art HMM segmentation. Initial results suggest the DTW segmentation can match word boundaries with a root mean square error (RMSE) of 35 ms. Mapping units onto phones resulted in a higher RMSE of 103 ms. This error increased when multiple multigram types were added and when the default unit clustering was altered from 40 (our baseline) to 10. Orthographic matching had a higher RMSE of 125 ms. To conclude, we discuss future work that we believe can reduce this error rate to a level sufficient for the techniques to be applied to a unit selection synthesis system. Index Terms: speech synthesis, unit selection.
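The evaluation metric named in this abstract, the RMSE between two sets of word boundaries, can be written down directly. The sketch below assumes the boundaries are already paired one-to-one and expressed in seconds; the function name `boundary_rmse` and the sample values are hypothetical and are not taken from the paper.

```python
import math


def boundary_rmse(predicted, reference):
    """Root mean square error between paired boundary times (seconds).

    Assumption: predicted[i] and reference[i] refer to the same word
    boundary; the abstract does not describe how boundaries are paired.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical boundaries from an emergent-unit segmentation and from an
# HMM forced alignment of the same utterance.
emergent = [0.10, 0.50]
hmm = [0.13, 0.46]
print(f"{boundary_rmse(emergent, hmm) * 1000:.1f} ms")
```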
Efficient and Robust Music Identification with Weighted Finite-State Transducers
Cited by 2 (1 self)
We present an approach to music identification based on weighted finite-state transducers and Gaussian mixture models, inspired by techniques used in large-vocabulary speech recognition. Our modeling approach is based on learning a set of elementary music sounds in a fully unsupervised manner. While the space of possible music sound sequences is very large, our method enables the construction of a compact and efficient representation for the song collection using finite-state transducers. This paper gives a novel and substantially faster algorithm for the construction of factor transducers, the key representation of song snippets supporting our music identification technique. The complexity of our algorithm is linear with respect to the size of the suffix automaton constructed. Our experiments further show that it helps speed up the construction of the weighted suffix automaton in our task by a factor of 17 with respect to our previous method using the intermediate steps of determinization and minimization. We show that, using these techniques, a large-scale music identification system can be constructed for a database of over 15,000 songs while achieving an identification accuracy of 99.4% on undistorted test data, and performing robustly in the presence of noise and distortions. Index Terms: music identification, content-based information retrieval, factor automata, suffix automata, finite-state transducers.
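To make the role of a suffix automaton concrete, here is a minimal sketch of the textbook online construction for a single, unweighted symbol sequence. It is not the weighted factor-transducer algorithm the paper proposes, and it covers one sequence rather than a song collection; the class name `SuffixAutomaton` and the toy input are assumptions for illustration. The resulting automaton accepts exactly the factors (contiguous subsequences) of the input, which is the property that lets a short snippet be looked up.

```python
class SuffixAutomaton:
    """Unweighted suffix automaton of one sequence of hashable symbols."""

    def __init__(self, seq):
        self.next = [{}]     # transitions of each state
        self.link = [-1]     # suffix link of each state
        self.length = [0]    # length of the longest string reaching a state
        last = 0
        for sym in seq:
            cur = self._new_state(self.length[last] + 1)
            p = last
            # Add the new symbol to every suffix state that lacks it.
            while p != -1 and sym not in self.next[p]:
                self.next[p][sym] = cur
                p = self.link[p]
            if p == -1:
                self.link[cur] = 0
            else:
                q = self.next[p][sym]
                if self.length[p] + 1 == self.length[q]:
                    self.link[cur] = q
                else:
                    # Split q so that state lengths stay consistent.
                    clone = self._new_state(self.length[p] + 1)
                    self.next[clone] = dict(self.next[q])
                    self.link[clone] = self.link[q]
                    while p != -1 and self.next[p].get(sym) == q:
                        self.next[p][sym] = clone
                        p = self.link[p]
                    self.link[q] = clone
                    self.link[cur] = clone
            last = cur

    def _new_state(self, length):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        return len(self.next) - 1

    def contains(self, factor):
        """Return True if `factor` occurs contiguously in the sequence."""
        state = 0
        for sym in factor:
            if sym not in self.next[state]:
                return False
            state = self.next[state][sym]
        return True


# Hypothetical "song" written as a string of elementary sound labels.
song = SuffixAutomaton("abcbc")
print(song.contains("bcb"), song.contains("cc"))
```

Lookup time depends only on the length of the queried snippet, not on the length of the indexed sequence.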
Clustering Wide-Contexts and HMM Topologies for Spontaneous Speech Recognition
- 2001
In most speech recognition systems today, all the acoustic variation associated with a phoneme is characterized in terms of the identity of its neighboring phonemes. The neighbors influence only the state observation density of a fixed Hidden Markov Model. Other sources of variation are captured implicitly by using Gaussian mixture models for the state observations. Consequently, these models can be very broad, particularly for casual spontaneous speech. In this thesis, we explore conditioning of phonemes on higher level linguistic structure, specifically syllable- and word-level structure to learn models for phonemes that are more specific to the context, reporting experimental results on a large vocabulary (35k words) conversational speech task (Switchboard). In particular, this thesis makes three main contributions related to wide context conditioning. First, we demonstrate that syllable- and word-level structure can be incorporated into current acoustic models to improve recognition accuracy over triphones. For a fixed number of parameters, these models are computationally more efficient than pentaphones, both in training and in testing. In addition, use of syllable and word features leads to a small but significant improvement in performance. The wide-contexts used in our acoustic model can implicitly capture re-syllabification effects to a certain extent. However, we find that explicitly modeling re-syllabification does not improve recognition further, because there are only a small number of phones that exhibit acoustic difference after re-syllabification. The second contribution addresses the difficulties that arise when a large number of additional conditioning features are used. As the number of conditioning features increases, the training cost can increase exponentially. 
Moreover, a large fraction of the training labels tends to have too few examples to have reliable statistics associated with them, and this could potentially cause decision trees to learn bad clusters. A new method has been developed for clustering with multiple stages, where each stage clusters a different subset of features and can optionally use the partitions learned in the previous stages. Apart from reducing the risk of unreliable statistics, it is designed to ameliorate the data fragmentation problem and is computationally less expensive. This method was successfully demonstrated with pentaphones, resulting in equivalent performance at a lower cost. Finally, a new algorithm is described to design context-specific HMMs. The idea is to model reduction of a phone for certain contexts, and to learn a more constrained topology. Using contextual information, the algorithm clusters HMM paths where each path has a different number of states. An HMM distance measure has been formulated to prune out the paths which are similar. During decoding, the paths are allocated dynamically for each sub-word unit according to their context. We investigated this algorithm to model phone topologies, finding improved characterization of speech given known word sequences but no significant improvement in word error rate.

