K. Seymore and R. Rosenfeld, "Scalable Backoff Language Models," in Proceedings of the International Conference on Spoken Language Processing, Philadelphia, USA, 1996.
....is built statically, these optimizations can be performed entirely off-line. The composition, determinization and minimization algorithms, and a non-deterministic language model back-off representation [8], remove redundancy and minimize the size of the recognition transducer. Language model shrinking [9] allows control of the size and accuracy trade-off even for very large vocabulary tasks. 4. COMPARISON OF APPROACHES All experimental comparisons were carried out on the DARPA North American Business News (NAB) 1995 evaluation corpus. There exist recognition systems that have been optimized for ....
....Even when exploiting the backing-off structure, the LM transducer G may be large, resulting in a large HCLG recognition transducer and causing the recognizer to use a lot of computational resources. To reduce these requirements, G may be shrunk to a smaller size, i.e. with fewer events modeled [9]. This approximation may affect the recognition accuracy. Figure 1 shows that for this task the effect on the word error rate (WER) is quite small for moderate shrinking factors. Figure 1 also shows that even with the large difference in size of the final transducers (see Table 3) the relation ....
K. Seymore and R. Rosenfeld, "Scalable Backoff Language Models," in Int. Conf. on Spoken Language Processing, Philadelphia, Pennsylvania, 1996, pp. 232-235.
....of FSMs for their use in NL processing applications [13]. Simple as they are, FSMs generally need to be huge in order to be useful approximations to complex languages. Consequently, building them by some procedure of automatic learning from large enough sets of training data becomes a must [11, 23, 17, 24, 26]. The present work continues a line of research centered on language understanding and subsequential transducer learning [6, 20, 21, 24]. Here, more powerful techniques are being tested on a complex, useful task of practical interest. 2 Subsequential Transduction Learning The Finite State or ....
K. Seymore, R. Rosenfeld. "Scalable Backoff Language Models". Proc. ICSLP-96, pp. 232-235. Philadelphia, 1996.
....on the size of the network in practice. 2] Simple as they are, FSMs generally need to be huge in order to be useful approximations to complex languages. For instance, an adequate 3-gram language model for the language of the Wall Street Journal is an FSM that may have as many as 20 million edges [23]. Obviously, there is no point in trying to manually build such models on the basis of a priori knowledge about the language to be modeled: the success lies in the possibility of automatically learning them from large enough sets of training data [8, 23]. This is also the case for the finite state ....
.... an FSM that may have as many as 20 million edges [23]. Obviously, there is no point in trying to manually build such models on the basis of a priori knowledge about the language to be modeled: the success lies in the possibility of automatically learning them from large enough sets of training data [8, 23]. This is also the case for the finite state LU models used in the work presented in this paper [15, 24, 26]. 2 Subsequential Transduction The following definitions follow closely those given in Berstel [4] with some small variations for the sake of brevity. A Finite State Transducer (FST) is a ....
K. Seymore, R. Rosenfeld. "Scalable Backoff Language Models". Proc. ICSLP-96, pp. 232-235. Philadelphia, 1996.
....use in NL processing applications [10]. Simple as they are, FSMs generally need to be huge in order to be useful approximations to complex languages. For instance, an adequate 3-gram language model for the language of the Wall Street Journal is an FSM that may have as many as 20 million edges [15]. Obviously, there is no point in trying to manually build such models on the basis of a priori knowledge about the language to be modeled: the success lies in the possibility of automatically learning them from large enough sets of training data [8, 15]. This is also the case for the finite state ....
.... an FSM that may have as many as 20 million edges [15]. Obviously, there is no point in trying to manually build such models on the basis of a priori knowledge about the language to be modeled: the success lies in the possibility of automatically learning them from large enough sets of training data [8, 15]. This is also the case for the finite state translation models used in the work presented in this paper [11, 16, 17]. 2. SUBSEQUENTIAL TRANSDUCERS The difficulty of a translation task depends on many factors. One of the most important is the asynchrony, or distance at which words of the ....
K. Seymore, R. Rosenfeld. "Scalable Backoff Language Models". Proc. ICSLP-96, pp. 232-235. Philadelphia, 1996.
....Jianfeng Gao Natural Language Group Microsoft Research China Beijing 100080, P.R. C jfgao microsoft.com http: www.microsoft.com china research ABSTRACT Several techniques are known for reducing the size of language models, including count cutoffs [1], Weighted Difference pruning [2], Stolcke pruning [3] and clustering [4]. We compare all of these techniques and show some surprising results. For instance, at low pruning thresholds, Weighted Difference and Stolcke pruning underperform count cutoffs. We then show novel clustering techniques that can be combined with Stolcke ....
....is typically comparable in size to the data on which it is trained. Some form of size reduction is therefore critical for any practical application. Many different approaches have been suggested for reducing the size of language models, including count cutoffs [1], Weighted Difference pruning [2], Stolcke pruning [3] and clustering [4]. In this paper, we first present a comparison of these various techniques, and then we demonstrate a new technique that combines a novel form of clustering with Stolcke pruning, performing up to a factor of 3, or more, better than Stolcke pruning alone. ....
K. Seymore, R. Rosenfeld. "Scalable backoff language models", Proc. ICSLP, Vol. 1, pp. 232-235, Philadelphia, 1996.
....transcripts (3.5M words), Call Home transcripts (3M words) and Broadcast News transcripts (165M words). The contributions of the training data from the three sources were effectively weighted in a ratio of (1:1:15). The 3-gram model was shrunk using the technique of Seymore and Rosenfeld [1], giving a model with 497855 states and 1554689 arcs (45642 1-grams, 452212 2-grams and 558979 3-grams). For training the larger, rescoring language model, in addition to the training text used for the first pass model (from Switchboard, Call Home and Broadcast News), transcripts from various ....
K. Seymore and R. Rosenfeld. Scalable backoff language models. In Proceedings of ICSLP, Philadelphia, Pennsylvania, 1996.
....NA News corpus of newspaper text). The first pass trigram model was built by first constructing a backoff language model from the 271 million words of training text, yielding 15.8 million 2-grams and 22.4 million 3-grams. This model was reduced in size, using the approach of Seymore and Rosenfeld [7], to 1.4 million 2-grams and 1.1 million 3-grams. When composed with the lexicon, this smaller trigram model yielded a manageable sized network. The second pass model used 6.2 million 2-grams, 7.8 million 3-grams, and 4.0 million 4-grams. For this model, the three transcription sources (SDR, HUB4, ....
K. Seymore and R. Rosenfeld. Scalable backoff language models. In Proceedings of the ICSLP-96, 1996.
....by the Viterbi algorithm. The first recognition pass used a pruned trigram language model, the second an unpruned 6-gram model. Both first and second pass models were Katz [5] backoff language models. The first pass trigram model was pruned using the approach of Seymore and Rosenfeld [7] with a pruning threshold of 100. In addition to the transcriptions of previous SDR evaluations we also used the transcripts of the Hub4 evaluations and two printed media sources (the LDC North American news corpus and United Press International (ClariNet)). Different language models were ....
Kristie Seymore and Ronald Rosenfeld. Scalable backoff language models. In ICSLP'96, volume 1, 1996.
....NA News corpus of newspaper text). The first pass trigram model was built by first constructing a backoff language model from the 271 million words of training text, yielding 15.8 million 2-grams and 22.4 million 3-grams. This model was reduced in size, using the approach of Seymore and Rosenfeld [7], to 1.4 million 2-grams and 1.1 million 3-grams. When composed with the lexicon, this smaller trigram model yielded a manageable sized network. The second pass d tf factor: t idf factor: b pivoted byte length normalization factor: ....
Kristie Seymore and Ronald Rosenfeld. Scalable backoff language models. In ICSLP'96, volume 1, 1996.
....of manageable size. In the second pass, these word lattices are rescored with a more detailed 4-gram language model. The best path is extracted from the rescored lattices. Both models are based on the Katz backoff technique [8] and are pruned using the shrinking method of Seymore and Rosenfeld [14]. 2.3.3. ASR Performance The performance of our recognition component on the TREC7 test set was 32.4% word error rate (WER). This was slightly better than the medium error transcriptions provided by NIST in the TREC7 competition, although considerably worse than the 24.8% WER of the top ....
Kristie Seymore and Ronald Rosenfeld. Scalable backoff language models. In Proceedings of the Fourth International Conference on Spoken Language Processing, volume 1, 1996.
....that change perplexity by less than a threshold are removed from the model. Experiments show that a production quality Hub4 LM can be reduced to 26% of its original size without increasing recognition error. We also compare the approach to a heuristic pruning criterion by Seymore and Rosenfeld [9], and show that their approach can be interpreted as an approximation to the relative entropy criterion. Experimentally, both approaches select similar sets of N-grams (about 85% overlap), with the exact relative entropy criterion giving marginally better performance. 1. Introduction N-gram ....
....minimizing model size. As pointed out in [6], pruning (selecting parameters from) a full N-gram model of higher order amounts to building a variable-length N-gram model, i.e. one in which training set contexts are not uniformly represented by N-grams of the same length. Seymore and Rosenfeld [9] showed that selecting N-grams based on their conditional probability estimates and frequency of use is more effective than the traditional absolute frequency thresholding. In this paper we revisit the problem of N-gram parameter selection by deriving a criterion that satisfies the following ....
K. Seymore and R. Rosenfeld. Scalable backoff language models. In Proc. ICSLP, vol. 1, pp. 232-235, Philadelphia, 1996.
....167.6 20 20 936,064 186.3 50 50 407,266 220.4 100 100 213,488 252.8 Table 3.1 The effect of cutoffs on the size and perplexity of a trigram language model trained on the broadcast news corpus. Other methods of reducing the number of N-grams retained by the language model have been proposed. In (Seymore and Rosenfeld, 1996), a process is described whereby N-grams are selected for removal from the model according to the difference between the original probability estimate and the backed-off probability that would result should the N-gram be removed. It is shown that for language models of equivalent size, the new method ....
....whose removal makes least difference to the relative entropy between the original and pruned model. The two approaches are similar, the difference being that when an N-gram is removed, it will have an effect on the back-off weight (see Section 2.3.3), which is not accounted for by the method of (Seymore and Rosenfeld, 1996). However, the comparison of the two techniques described in (Stolcke, 1998) demonstrates that there is little practical difference between the two pruning schemes. Both techniques, however, perform slightly better than the straightforward use of cutoffs. 3.3.2 ....
Seymore, K. and Rosenfeld, R. (1996). Scalable Backoff Language Models. In Proceedings International Conference on Spoken Language Processing, Philadelphia, USA.
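The back-off weight effect mentioned in the context above can be illustrated with a small sketch. It assumes the standard back-off normalization, in which the weight for a history redistributes the probability mass left over by the explicitly stored N-grams; the function name and the toy probabilities are hypothetical and are not taken from any of the cited works.

```python
from typing import Dict


def backoff_weight(explicit_probs: Dict[str, float],
                   lower_order_probs: Dict[str, float]) -> float:
    """Back-off weight for one history h (standard normalization, assumed here).

    alpha(h) = (1 - sum of P(w | h) over explicitly stored w)
             / (1 - sum of P(w | h') over the same w),
    where h' is the shortened history.
    """
    kept_mass = sum(explicit_probs.values())
    lower_mass = sum(lower_order_probs[w] for w in explicit_probs)
    return (1.0 - kept_mass) / (1.0 - lower_mass)


# Toy, made-up distributions for a single history.
lower = {"b": 0.1, "c": 0.1, "d": 0.8}   # P(w | h')
explicit = {"b": 0.5, "c": 0.2}          # stored P(w | h)

alpha_before = backoff_weight(explicit, lower)

# Prune the entry for "c": the weight has to be recomputed, and "c" is now
# predicted through the back-off path as alpha_after * P(c | h').
pruned = {w: p for w, p in explicit.items() if w != "c"}
alpha_after = backoff_weight(pruned, lower)
p_c_after = alpha_after * lower["c"]
```

In this toy case the weight rises from 0.375 to about 0.556 once the entry is removed, which is the side effect that a criterion comparing only the stored and backed-off estimates does not capture.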
....based on the 9.4 million trigrams observed in the training corpus. This model showed an out-of-vocabulary rate of 2.2% and a perplexity of 144 on the three-hour development test corpus. From this model, a more compact trigram language model was constructed following the procedures described in [20]. In particular, trigrams and bigrams were discarded from the model in cases where the difference between the model prediction and the backed-off prediction is less than a threshold $T$: $f \, (P_O - P_B) < T$, where $f$ is the observed n-gram frequency, $P_O$ is the n-gram prediction and $P_B$ is the ....
Seymore, K., and Rosenfeld, R., "Scalable backoff language models", Proceedings of the Fourth International Conference on Spoken Language Processing, 1996.
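The thresholding rule described in the context above can be sketched as follows. This is a minimal illustration under stated assumptions: the excerpt is garbled, so the exact form of the score (for example, whether plain or log probabilities are compared) is not recoverable from it, and the plain difference is used here only because the surrounding sentence speaks of "the difference". The function name, argument names and toy numbers are hypothetical.

```python
from typing import Dict, Tuple

NGram = Tuple[str, ...]


def prune_by_weighted_difference(
    counts: Dict[NGram, int],
    p_model: Dict[NGram, float],
    p_backoff: Dict[NGram, float],
    threshold: float,
) -> Dict[NGram, float]:
    """Keep only n-grams whose score f * (P_O - P_B) reaches the threshold.

    counts    -- observed frequency f of each n-gram
    p_model   -- explicit model prediction P_O
    p_backoff -- prediction P_B obtained by backing off to the shorter history
    """
    kept = {}
    for ngram, p_o in p_model.items():
        score = counts[ngram] * (p_o - p_backoff[ngram])
        if score >= threshold:
            kept[ngram] = p_o
    return kept


# Toy, made-up numbers purely for illustration.
counts = {("new", "york"): 10, ("new", "idea"): 2}
p_model = {("new", "york"): 0.50, ("new", "idea"): 0.11}
p_backoff = {("new", "york"): 0.10, ("new", "idea"): 0.10}
kept = prune_by_weighted_difference(counts, p_model, p_backoff, threshold=1.0)
```

With these numbers the frequent bigram whose explicit estimate differs strongly from its backed-off estimate is kept, while the rare bigram that the lower-order model already predicts almost as well is discarded.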
....NA News corpus of newspaper text). The first pass trigram model was built by first constructing a backoff language model from the 271 million words of training text, yielding 15.8 million 2-grams and 22.4 million 3-grams. This model was reduced in size, using the approach of Seymore and Rosenfeld [10], to 1.4 million 2-grams and 1.1 million 3-grams. When composed with the lexicon, this smaller trigram model yielded a manageable sized network. The second pass model used 6.2 million 2-grams, 7.8 million 3-grams, and 4.0 million 4-grams. For this model, the three transcription sources (SDR, HUB4, ....
Kristie Seymore and Ronald Rosenfeld. Scalable backoff language models. In ICSLP'96, volume 1, 1996.
....NA News corpus of newspaper text). The first pass trigram model was built by first constructing a backoff language model from the 271 million words of training text, yielding 15.8 million 2-grams and 22.4 million 3-grams. This model was reduced in size, using the approach of Seymore and Rosenfeld [7], to 1.4 million 2-grams and 1.1 million 3-grams. When composed with the lexicon, this smaller trigram model yielded a manageable sized network. The second pass d tf factor: $1 + \ln(1 + \ln(tf))$, 0 if $tf = 0$; t idf factor: $\log((N + 1) / df)$; b pivoted byte length normalization factor: 1 0:8 ....
Kristie Seymore and Ronald Rosenfeld. Scalable backoff language models. In ICSLP'96, volume 1, 1996.
....Table 2 shows the sizes and test set perplexities (excluding unknown words) of the various language models used. These were built using Katz's backoff method with frequency cutoffs of 2 for bigrams and 4 for trigrams [3], then shrunk with an epsilon of 10 using the method of Seymore and Rosenfeld [16], and finally encoded into (non-deterministic) weighted automata G [14]. Table 3 lists the sizes of the transducers created by composing lexicon transducers with their corresponding language models and determinizing the result, as described in Section 3. Finally, Table 4 lists the sizes for the ....
K. Seymore and R. Rosenfeld. Scalable backoff language models. In Proceedings of ICSLP, Philadelphia, Pennsylvania, 1996.
....that change perplexity by less than a threshold are removed from the model. Experiments show that a production quality Hub4 LM can be reduced to 26% of its original size without increasing recognition error. We also compare the approach to a heuristic pruning criterion by Seymore and Rosenfeld [9], and show that their approach can be interpreted as an approximation to the relative entropy criterion. Experimentally, both approaches select similar sets of N-grams (about 85% overlap), with the exact relative entropy criterion giving marginally better performance. 1. Introduction N-gram ....
....minimizing model size. As pointed out in [6], pruning (selecting parameters from) a full N-gram model of higher order amounts to building a variable-length N-gram model, i.e. one in which training set contexts are not uniformly represented by N-grams of the same length. Seymore and Rosenfeld [9] showed that selecting N-grams based on their conditional probability estimates and frequency of use is more effective than the traditional absolute frequency thresholding. In this paper we revisit the problem of N-gram parameter selection by deriving a criterion that satisfies the following ....
K. Seymore and R. Rosenfeld. Scalable backoff language models. In H. T. Bunnell and W. Idsardi, editors, Proc. ICSLP, vol. 1, pp. 232-235, Philadelphia, 1996.
.... space of histories $\{(w_{i-4} w_{i-3} w_{i-2} w_{i-1})\}$ $\{(w_{i-3} w_{i-2} w_{i-1})\}$ : $\{w_{i-1}\}$ or similarly $\{(g_{i-4} g_{i-3} g_{i-2} g_{i-1})\}$ $\{(g_{i-3} g_{i-2} g_{i-1})\}$ : $\{g_{i-1}\}$ 2 Even backoff models with cutoffs [18] are static as long as the cutoff values are fixed for all histories. Figure 1: Lattice of language models. Each node of the lattice ....
K. Seymore and R. Rosenfeld. Scalable backoff language models. In International Conference on Spoken Language Processing, pages 232-235, 1996.