T. Hain, P. Woodland, G. Evermann, and D. Povey. The CU-HTK March 2000.
....these models. This, however, also increases the computation and memory requirements of the system. Current desktop computers usually have sufficient computation and memory resources to support large vocabulary continuous speech recognition (LVCSR) tasks requiring complex acoustic and language models [3]. Mobile devices, however, have limited computation, memory and storage capabilities. While a simple speech recognizer with a small grammar (e.g. limited-size voice dialing) can be implemented on a mobile device, more complex LVCSR tasks require computation and memory beyond what is available, or ....
....MFCCs was determined by dropping one MFCC at a time and finding the effect on N-best recognition by DTW. The MFCC whose removal introduces the most recognition error was declared the most important, and so on. From our experiments, the coefficients were ordered from most important to least important as [2,3,0,1,6,7,4,8,5,10,11,9]. Note that the number of models retained at each intermediate step can vary depending on the unknown utterance. Also, if at any intermediate step only one model is retained, then the subsequent recognizers need not be used and the unknown utterance is classified as the digit ....
P. Woodland, T. Hain, G. Evermann, and D. Povey. CU-HTK March 2001.
....studied in the future. 4.3 Switchboard 68 Hours. For the experiments performed in this section, a 68 hour subset of the Switchboard (Hub5) acoustic training data set was used: 862 sides of Switchboard 1 and 92 sides of Call Home English. The set is described as h5train00sub in [9]. As with Minitrain, the baseline was a gender-independent, decision-tree-clustered, tied-state, cross-word triphone Gaussian mixture HMM system. The 1998 Switchboard evaluation data set was used for testing. The baseline HMM system word error rates with the order of number of free parameters are ....
T. Hain, P.C. Woodland, G. Evermann, and D. Povey. The CU-HTK March 2000.
....(pentaphones) are used. Conditioning only on phonemic context does not fully capture the acoustic variation of conversational speech. In recent years, augmenting the context with the position of the phoneme in the word has brought additional improvements to ASR performance, and it is now widely used (e.g. [55]). This is consistent with observations in linguistic studies about word position effects on different consonants, using electropalatography (EPG) [73]. The linguopalatal (tongue-palate) contact, which affects the strength and duration of the sound produced, for word-initial consonants is significantly ....
.... in a read speech corpus, where two separate trees were grown for male and female speakers, with a total of about 80 hours of speech [70, 130]. Progressively, it has been employed for larger tasks, and now as much as 250 hours of speech are clustered with pentaphones and 50 word position features [55]. When the number of feature values increases, a few factors start affecting the automatic training of decision trees. The number of unique labels tends to increase, and the associated sufficient statistics needed to train the tree require large amounts of memory. For example, Figure 5.1 shows the ....
Thomas Hain, Philip Woodland, Gunnar Evermann, and D. Povey. The CU-HTK March 2000.
No context found.
T. Hain, P. Woodland, G. Evermann, and D. Povey. The CU-HTK March 2000.
No context found.
T. Hain, P.C. Woodland, G. Evermann, and D. Povey. The CU-HTK March 2000.
No context found.
T. Hain, P. Woodland, G. Evermann, and D. Povey. The CU-HTK March 2000.
No context found.
T. Hain, P. Woodland, G. Evermann, and D. Povey. The CU-HTK March 2000.
No context found.
T. Hain, P.C. Woodland, G. Evermann, and D. Povey. The CU-HTK March 2000.