Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400--401, March 1987.
....data to compute the probabilities based on the trigram model. When there doesn't exist enough data for the bigram model then a unigram model is used, i.e. compute probabilities based solely on how often the words appear in the training data set. This method is referred to as the back-off method [41]. Other methods include combining the trigram, bigram and unigram models using linear interpolation [15, 36]. The trigram model performs surprisingly well, and is used in many speech recognition algorithms. Obtaining results that significantly outperform the model is difficult [10, 69]. Maximum ....
Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400--401, March 1987.
....in which unknown words likely to appear. To estimate these counts, we replace all words appearing only once in the training corpus with unknown word tags UNK, before computing relative frequencies. The underlying idea of the replacement is the same as Turing's estimates in back-off smoothing (Katz, 1987). We redistribute the probability mass of low-count sequences to unseen sequences. Generalized Forward-Backward Reestimation Generalization of the Forward and Viterbi Algorithm In English part-of-speech taggers, the maximization of Equation (1) to get the most likely tag sequence, ....
Slava M. Katz. 1987. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer, IEEE Trans. ASSP-35, No.3, pp.400-401.
....in running text. Due to limitations in the amount of available training data, the so-called sparse data problem, estimating probabilities directly from observed relative frequencies may not always be very accurate. For this reason, Turing's formula, in the incarnation of Katz's back-off scheme [Katz 1987], has become a standard technique for improving parameter estimates for probabilistic language models used by speech recognizers. A more theoretical treatment of Turing's formula itself can be found in [Nádas 1985]. Zipf's law is commonly regarded as an empirically accurate description of a wide ....
Slava M. Katz. "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer". In IEEE Transactions on Acoustics, Speech, and Signal Processing 35(3), pp. 400-401, 1987.
....by linking the word phone transcriptions according to the grammar, then, as for the no-grammar case, the phone graph is converted to a large HMM by replacing each phone node by the appropriate set of phone models and establishing the proper connections with the neighboring phones. A bigram backoff [22] language model estimated on the text material from Le Monde is used for lexicons containing 5K and 20K words. In all cases, CD phone models are used for word juncture phones as well as for intra-word phones. As an example, for the 3K lexicon, the average number of instantiations of each phone ....
S.M. Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer," IEEE Trans. ASSP, 35(3), 1987.
....We first briefly describe the smoothing technique chosen. We then discuss an adaptation of the bagging technique [3]. Finally we describe an efficient method for using lexical information in the learning process. The smoothing technique The smoothing technique used is inspired by Katz smoothing [10]: a small quantity is discounted from each probability available in the automaton. The probability mass available is then redistributed on the events the automaton cannot handle (e.g. parsing NN VP with the transducer of figure 3). The redistribution of the probability mass is made via a unigram ....
S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 35(num. 3):400-401, 1987.
.... to this problem: succession laws [12], linear interpolation of the maximum likelihood estimator with another estimator, such as an a priori distribution [8] or a more general distribution [7], discounting of a certain amount of the probability mass of seen events using the Turing-Good formula [9] or absolute discounting [13]. When using discounting, the discounted probability mass is redistributed to all unseen events according to another probability distribution. This is the back-off smoothing method, proposed by Katz [9]. For Markov chains, the back-off smoothing is based on the ....
....the probability mass of seen events using the Turing-Good formula [9] or absolute discounting [13]. When using discounting, the discounted probability mass is redistributed to all unseen events according to another probability distribution. This is the back-off smoothing method, proposed by Katz [9]. For Markov chains, the back-off smoothing is based on the following idea: discount a certain amount dC from the probability mass of events which have been observed in a context of length k and redistribute this amount to all unseen events according to their probability in a context of length k ....
Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. on Acoustics, Speech and Signal Processing, ASSP-35(3):400-401, March 1987.
....To deal with phonological variability alternate pronunciations are included in the lexicon, and optional phonological rules are applied during training and recognition. The decoder uses a time-synchronous graph search strategy [Ney84] for a first pass with a bigram back-off language model (LM) [Kat87]. A trigram LM is used in a second acoustic decoding pass which incorporates the word graph generated in the first pass [Gau94b]. (Footnote: This work is partially funded by the LRE project 62 058 SQALE.) Experimental results are reported on the ARPA Wall Street Journal (WSJ) [Pau92] and BREF [Gau90, Lam91] ....
....3 Language Modeling Language modeling entails incorporating constraints on the allowable sequences of words which form a sentence. Statistical n-gram models attempt to capture the syntactic and semantic constraints by estimating the frequencies of sequences of n words. A backoff mechanism [Kat87] is used to smooth the estimates of the probabilities of rare n-grams by relying on a lower order n-gram when there is insufficient training data, and to provide a means of modeling unobserved n-grams. Another advantage of the backoff mechanism is that LM size can be arbitrarily reduced by relying ....
S.M. Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer," IEEE Trans. ASSP, 35(3), 1987.
....in natural language, and can therefore be used in speech recognition to limit the decoding search space. The most popular methods, such as statistical n-gram models, attempt to capture the syntactic and semantic constraints by estimating the frequencies of sequences of n words. A backoff mechanism [48] is generally used to smooth the estimates of the probabilities of rare n-grams by relying on a lower order n-gram when there is insufficient training data, and to provide a means of modeling unobserved n-grams. Another advantage of the backoff mechanism is that LM size can be arbitrarily reduced ....
S.M. Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer," IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-35(3), pp. 400-401, March 1987.
....of #. [Figure 2.2: Word-based distribution estimated using MLE; probability axis from 0 to 0.45 over the words swallow, crow, eagle, bird, bug, bee, insect.] To overcome this problem, we can smooth the probabilities by resorting to statistical techniques (Jelinek and Mercer, 1980; Katz, 1987; Gale and Church, 1990; Ristad and Thomas, 1995). We can, for example, employ an extended version of Laplace's Law of Succession (cf. Jeffreys, 1961; Krichevskii and Trofimov, 1981) to estimate $P(n \mid v, r)$ as $\frac{f(n, v, r) + 0.5}{f(v, r) + 0.5 N}$, where N denotes the size of the set of ....
Katz, Slava M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35(3):400--401.
....In practice, this discounted estimate c* is not used for all counts c. Large counts (where c > k for some threshold k) are assumed to be reliable. For example, k = 5 is said to be a good threshold to select. Back-off Smoothing Another method for smoothing is the back-off modeling proposed by Katz [1997]. The estimate for the n-gram is allowed to back off through progressively shorter histories. If the n-gram did not appear at all or appeared k times or less in the training data, then we use an estimate from a shorter n-gram. More formally, for n = 3, and k = 0: $P_{bo}(w_i \mid w_{i-2}, w_{i-1}) = P(w_i \mid w_{i-2},$ ....
....to build this model, we used the morphological parses of the words. Name Tag Model, which captures the name tag information (person, location, organization, and else) of the word tokens. Each model is smoothed using the Good-Turing method [Good, 1953] combined with the back-off modeling proposed by Katz [1997], as described in Chapter 2. In this work, in order to build a language model, and decode the most probable output in an HMM with the Viterbi algorithm [Viterbi, 1967], we used the publicly available SRILM toolkit, developed by Andreas Stolcke [Stolcke, 1999]. We would like to explain each model ....
Katz, S. 1997. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics Speech and Signal Processing 35(3):400-401.
....data, and what is more: it becomes undefined as soon as the denominator $\#(w_{t-n+1})$ of the MLE expression turns to zero. As a consequence, the raw ML estimates have to be smoothed by a suitable backing-off or interpolation strategy. Backing-off approaches such as Katz trigram formula [7] typically operate on a discounted version of n-gram frequencies, for example based on Jeffrey's rule or the Good-Turing estimate [2] of unseen events, reducing the occurrence counts of frequent events in favour of the rare ones. The probability mass that was saved by deriving the conditional ....
S.M. Katz. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Trans. on Acoustics, Speech and Signal Processing, 35(3):400--401, 1987.
....end, IP, and editing term, and for each reparandum, the words are linked to the corresponding words in the reparans. (Footnote: A couple of words in the source sentence describes the same concepts as a couple of words in the target sentence.) All distributions are smoothed by a simple back-off method [8] to avoid zero probabilities, with the exception that the replacement probability $P(RD_j \mid RSa_j)$ is smoothed in a more sophisticated way. It is calculated by a linear interpolation of replacement probabilities for the words, the corresponding POS tags, and the semantic class $P(RD_j \mid RSa_j)$ ....
S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. Transaction on Acoustics, Speech and Signal Processing, ASSP-35:400--401, March 1987.
.... sequence probabilities were evaluated in four different ways considering: equiprobability between features, feature frequencies, feature bigrams and feature trigrams (evaluated on the training data set). In order to avoid data sparseness for n-gram evaluation, we use the Katz back-off model [5] for distribution smoothing. It is based on the Good-Turing smoothing principle. The Good-Turing estimate states that for any n-gram that occurs $r$ times, we should consider that it occurs $r^*$ times: $r^* = (r + 1)\, n_{r+1} / n_r$, where $n_r$ is the number of n-grams that occur exactly $r$ times. Then the ....
S. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. on Acoustics, Speech and Signal Processing, 35(3):400--401, March 1987.
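Editorial illustration of the Good-Turing adjustment mentioned in the excerpt above. This is a minimal sketch, not code from any of the cited papers: it assumes the standard form r* = (r + 1) n_{r+1} / n_r, where n_r is the number of distinct n-grams seen exactly r times. The function name `good_turing_counts` and the toy counts are hypothetical, and keeping the raw count when n_{r+1} is zero is a simplification of this sketch.

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Map each observed count r to r* = (r + 1) * n_{r+1} / n_r."""
    # n[r] = number of distinct n-grams that occur exactly r times
    n = Counter(ngram_counts.values())
    adjusted = {}
    for r in sorted(n):
        if n.get(r + 1, 0) > 0:
            adjusted[r] = (r + 1) * n[r + 1] / n[r]
        else:
            # No n-gram was seen r + 1 times: keep the raw count
            # (an assumption of this sketch, not part of the formula)
            adjusted[r] = float(r)
    return adjusted

# Toy bigram counts: three singletons, two doubletons, one tripleton
counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 2, "f g": 3}
adjusted = good_turing_counts(counts)
```

In practice the adjusted counts are usually applied only to small r, as one of the excerpts on this page notes for counts above a threshold k.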
....since it is the probability of emitting the word $W_i$ given the tag $T_i$. These probabilities are estimated using a corpus where each word is tagged with its correct supertag. The contextual probabilities were estimated using the Good-Turing discounting technique [7] combined with Katz's back-off model [15] given by: $p(T_3 \mid T_1, T_2) = p(T_3 \mid T_1, T_2)$ if $p(T_3 \mid T_1, T_2) > 0$; $= \alpha(T_1, T_2)\, p(T_3 \mid T_2)$ if $p(T_2 \mid T_1) > 0$; $= p(T_3 \mid T_2)$ otherwise; $p(T_2 \mid T_1) = p(T_1, T_2)$ if $p(T_2 \mid T_1) > 0$; $= \beta(T_1)\, p_1(T_2)$ otherwise, where $\alpha(T_i, T_j)$ and $\beta(T_k)$ are constants to ensure ....
Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400--401, 1987.
....in the PST. When matching against an input text, if the character $\sigma$ is presented to the machine when it is in state w, we return to the initial state, lose all the past context and predict $\sigma$ with the prior probability $P(\sigma)$. Better backoff techniques may be worth exploring in this situation [13, 14]. 4.2 Experimental results In this section, we present a detailed study of the training of the main VLMM module in our system. 13 characters 1 10 102 103 104 10 s 106 107 10 2 10 nodes 7 8 494 3432 21082 102896 319215 139127 198653 195890 Table 4: Prefix trees trained on the AP news ....
S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400-401, 1987.
....approach include: There are no known bounds on the number of iterations required by generalized iterative scaling. Also, each iteration of generalized iterative scaling is computationally expensive. Good-Turing discounting (Good [12]) may be useful in smoothing low-count constraints (Katz [14]). However, it introduces inconsistencies so a unique maximum entropy solution may no longer exist and generalized iterative scaling may not converge. 4.2 Maximum Entropy Solution and Constraint Functions If we let x denote the current history and y denote a future, in the maximum entropy ....
....convenience although we feel that the selection is reasonable. 5.5 Good-Turing Discounting Another possible approach to smoothing is to apply Good-Turing discounting to the desired constraint values, $d_i$. The use of Good-Turing discounting with the back-off trigram model was discussed in Katz [14]. We use the same formulation here. Let r represent the count of a constraint, that is, $r = d_i N$ where N is the size of the training text. Note that r is guaranteed to be an integer when we define the desired constraint values to be the observed values in the training data as given by (EQ ....
[Article contains additional citation context not shown here]
Katz, Slava M., "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer," in IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3): 400-401, March 1987.
....is to introduce smoothing techniques. Such techniques try to estimate the high order n-gram probability with lower order probabilities, while maintaining the integrity of the probability space. For example, if the word pair $(w_1 w_2)$ has never appeared in the training corpus, the back-off bigram [51] estimates the probability $P(w_2 \mid w_1)$ as follows: $P(w_2 \mid w_1) = q(w_1)\, P(w_2)$ (4.2), where $P(w_2)$ is the unigram probability of $w_2$, and $q(w_1)$ is a normalization factor chosen so that $\sum_{w_2} P(w_2 \mid w_1) = 1$. 2. Class n-gram Model Another method of dealing with the sparse data problem is to reduce ....
S. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 35, 1987.
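Editorial illustration of the back-off bigram in the excerpt above, where an unseen pair is scored as q(w1) times the unigram probability of w2. This is a hedged sketch under stated assumptions, not the method of any cited paper: it uses absolute discounting with a fixed `discount` of 0.5 (Katz's paper uses Good-Turing discounts instead), and the names `backoff_bigram`, `followers` and the toy sentence are hypothetical.

```python
from collections import Counter

def backoff_bigram(tokens, discount=0.5):
    """Return prob(w2, w1) for a back-off bigram with absolute discounting."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    p_uni = {w: c / total for w, c in unigrams.items()}

    # followers[w1] maps each word seen after w1 to its bigram count
    followers = {}
    for (w1, w2), c in bigrams.items():
        followers.setdefault(w1, {})[w2] = c

    def prob(w2, w1):
        seen = followers.get(w1)
        if not seen:
            # History never observed: fall back to the plain unigram estimate
            return p_uni.get(w2, 0.0)
        history_count = sum(seen.values())
        if w2 in seen:
            # Seen pair: relative frequency with a fixed amount discounted
            return (seen[w2] - discount) / history_count
        # Probability mass freed by discounting the seen pairs
        freed = discount * len(seen) / history_count
        # Unigram mass of the words never seen after w1
        unseen_mass = sum(p for w, p in p_uni.items() if w not in seen)
        if unseen_mass == 0.0:
            return 0.0
        q = freed / unseen_mass  # normalization factor q(w1)
        return q * p_uni.get(w2, 0.0)

    return prob

tokens = "the cat sat on the mat the cat ran".split()
prob = backoff_bigram(tokens)
vocab = sorted(set(tokens))
total_mass = sum(prob(w, "the") for w in vocab)  # should be 1 up to rounding
```

The discount must not exceed the smallest observed bigram count, otherwise a seen pair would receive a negative probability.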
.... trigram models, maximum likelihood estimation of parameters can be unreliable (for instance, an important fraction of possible trigrams is usually not present in the available training data). In order to alleviate this problem, some techniques can be applied: Different smoothing techniques [1,4,5] are usually employed in order to assign reliable probability estimates to N-grams which are not frequent or not present in the training data. The number of parameters can be significantly reduced (and, hence, their estimation made more reliable) by clustering linguistic units into classes [1, ....
S. Katz. Estimation of Probabilities From Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Trans. on ASSP, 34(3):400-401, 1987.
....0 | t) is relatively small) then the whole probability $P(w_0 \mid w_k : w_1)$ will be very small. That will increase the perplexity artificially, yielding bad results. To solve this problem, we interpolated formula (27) with the root probabilities (which, in fact, is the smoothed n-gram model, [Kat87]). The formula according to which we computed the probabilities is $P(w_0 \mid w_k : w_1) = \lambda\, P(w_0 \mid w_k : w_1) + (1 - \lambda)\, P_{root}(w_0 \mid w_k : w_1)$ (31). In figure 7, we have plotted the variation of perplexity (computed using formula (12) and the probabilities from (31)) versus ....
....the language model corresponding to the root is smoothed using a variation of Good-Turing smoothing; the probabilities are reestimated such that they are still proportional to the old ones, but there are no more 0 probabilities. The method we used for smoothing the root probability is described in [Kat87]. The fourth and final phase constructs the language models for the nodes different from the root by interpolating between the language model probability in the parent and the one in the current node. We have: $P_{node}(w_i) = \lambda_{node}\, P_{node}(w_i) + (1 - \lambda_{node})\, P_{parent(node)}(w_i)$ (32). Depending ....
[Article contains additional citation context not shown here]
Slava Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. In IEEE Transactions on Acoustics, Speech, and Signal Processing,
No context found.
Katz, S. 1987. "Estimation of Probabilities from Sparse Data for Language Model Component of a Speech Recognizer." IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3):400-401.
No context found.
Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing. ASSP-35(3):400-401, March 1987.
No context found.
Katz, S.M. 1987. "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer". IEEE Trans. Acoustics, Speech, and Signal Processing. ASSP-35(3): 400-401.
No context found.
S.M. Katz (1987), "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer," IEEE Trans. ASSP, 35(3), pp. 400-401, March.
No context found.
S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. Acoust., Speech and Signal Proc., ASSP-35(3):400--401, 1987.
No context found.
S. M. Katz. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Trans. Acoust., Speech and Sig. Processing, ASSP-35(3):400-401, 1987.