## Speaker verification and spoken language identification using a generalized i-vector framework with phonetic tokenizations and tandem features (2014)

Citations: 4 (3 self)

### Citations

297 | Front-end Factor Analysis for Speaker Verification
- Dehak, Kenny, et al.
- 2010
Citation Context: ...tor modeling has gained significant attention in both speaker verification (SV) and language identification (LID) domains due to its excellent performance, compact representation and small model size [1, 2, 3]. In this modeling, first, zero-order and first-order Baum-Welch statistics are calculated by projecting the MFCC features on those Gaussian Mixture Model (GMM) components using the occupancy posterio...
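The zero- and first-order Baum-Welch statistics described in this snippet can be sketched in a few lines of NumPy. This is an illustrative implementation, not the paper's code: the function name, array shapes, and the toy data are all assumptions.

```python
import numpy as np

def baum_welch_stats(features, means, posteriors):
    """Zero- and first-order Baum-Welch statistics for a C-component GMM.

    features:   (L, D) acoustic frames (e.g. MFCCs)
    means:      (C, D) GMM component means
    posteriors: (L, C) occupancy posteriors P(c | y_t, lambda), rows sum to 1
    """
    N = posteriors.sum(axis=0)            # (C,)   zero-order statistics
    F = posteriors.T @ features           # (C, D) first-order statistics
    F_centered = F - N[:, None] * means   # center around the GMM means
    return N, F_centered

# Toy example: 100 frames, 4 components, 3-dimensional features.
rng = np.random.default_rng(0)
y = rng.normal(size=(100, 3))
mu = rng.normal(size=(4, 3))
post = rng.random(size=(100, 4))
post /= post.sum(axis=1, keepdims=True)   # make each row a valid posterior
N, F = baum_welch_stats(y, mu, post)
```

Because each frame's posteriors sum to one, the zero-order statistics always sum to the number of frames L, which is a quick sanity check on any implementation.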

236 | Tandem connectionist feature extraction for conventional HMM systems
- Hermansky, Ellis, et al.
- 2000
Citation Context: ...order statistics on these trigrams. Second, since the number of phonemes is much smaller than the number of tied triphone states, we converted the phoneme posterior probabilities into tandem features [13, 14] and then apply GMM on top of it to generate large components tokens. This is also motivated by the hierarchical phoneme posterior probability estimator in [15]. In this setup, the GMM statistics calc...

182 | Support vector machines using GMM supervectors for speaker verification
- Campbell, Sturim, et al.
- 2006
Citation Context: ...d, within this i-vector space, variability compensation methods, such as Within-Class Covariance Normalization (WCCN) [4], Linear Discriminative Analysis (LDA) and Nuisance Attribute Projection (NAP) [5], are performed to reduce the variability for the subsequent modeling methods (e.g., Support Vector Machine (SVM), Logistic Regression [3] This research is funded in part by CMU-SYSU Collaborative Inn...

116 | Probabilistic linear discriminant analysis for inferences about identity
- Prince
- 2007
Citation Context: ...U-SYSU Collaborative Innovation Research Center and the SYSU-CMU Shunde International Joint Research Institute. and Neural Network [6, 7] for LID and Probabilistic Linear Discriminant Analysis (PLDA) [8, 9] for SV, respectively). Lei, et al. [10] and Kenny, et al. [11] recently proposed a generalized i-vector framework where decision tree senones (tied triphone states) in a general Deep Neural Network bas...

114 | Eigenvoice modeling with sparse training data
- Kenny, Boulianne, et al.
- 2005
Citation Context: ...led i-vector [2]. Considering a C-component GMM and D-dimensional acoustic features, the total variability matrix T is a CD×K matrix which is estimated the same way as learning the eigenvoice matrix in [19] except that here we consider that every utterance is produced by a new speaker [2]. As shown in fig. 3, we recently proposed the simplified supervised i-vector method [20, 21] which achieves comparab...

79 | Within-class covariance normalization for SVM-based speaker recognition
- Hatch, Kajarekar, et al.
- 2006
Citation Context: ... jointly models language, speaker and channel variabilities all together [1]. Third, within this i-vector space, variability compensation methods, such as Within-Class Covariance Normalization (WCCN) [4], Linear Discriminative Analysis (LDA) and Nuisance Attribute Projection (NAP) [5], are performed to reduce the variability for the subsequent modeling methods (e.g., Support Vector Machine (SVM), Log...

73 | Hierarchical structures of neural networks for phoneme recognition
- Schwarz, Matejka, et al.
- 2006
Citation Context: ...es-MFCC system, the tokens are the phonemes and the posterior probability P(c|z_t, λ̂) is the phoneme posterior probability (PPP). We employed the multilayer perceptron (MLP) based phoneme recognizer [22] with acoustic models from five different languages, namely Czech, Hungarian, Russian, English and Mandarin. The models for the first three languages were trained on SpeechDatE databases and provided i...

49 | Using MLP features in SRI's conversational speech recognition system
- Zhu, Stolcke, et al.
- 2005
Citation Context: ...tistics calculation remains the same except that the GMM is trained on the tandem features. This phoneme posterior probability (PPP) based tandem feature has been reported to be effective in both ASR [13, 14, 16] and LID tasks [17, 18] as front end features. GMM mean supervector modeling and conventional i-vector modeling are used to model this tandem feature in [17] and [18] for LID. In both methods, the tand...

43 | Tandem acoustic modeling in large-vocabulary recognition
- Ellis, Singh, et al.
- 2001
Citation Context: ...order statistics on these trigrams. Second, since the number of phonemes is much smaller than the number of tied triphone states, we converted the phoneme posterior probabilities into tandem features [13, 14] and then apply GMM on top of it to generate large components tokens. This is also motivated by the hierarchical phoneme posterior probability estimator in [15]. In this setup, the GMM statistics calc...
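The tandem-feature construction these snippets describe — turn per-frame phoneme posteriors into a feature stream and append it to the acoustic features — can be sketched as below. This is a minimal sketch under assumptions: the log-compression and PCA decorrelation steps follow the common tandem recipe, not necessarily the exact pipeline of this paper, and all names and dimensions are illustrative.

```python
import numpy as np

def tandem_features(ppp, mfcc, eps=1e-10, keep=None):
    """Build tandem features from phoneme posterior probabilities (PPP).

    ppp:  (L, P) per-frame phoneme posteriors (rows sum to 1)
    mfcc: (L, D) acoustic features for the same frames
    keep: optionally keep only the first `keep` decorrelated dimensions
    """
    logp = np.log(ppp + eps)        # log compresses the highly skewed posteriors
    logp -= logp.mean(axis=0)       # center before decorrelation
    # Decorrelate with PCA via SVD (a stand-in for the usual KLT step).
    _, _, vt = np.linalg.svd(logp, full_matrices=False)
    proj = logp @ vt.T
    if keep is not None:
        proj = proj[:, :keep]
    return np.hstack([mfcc, proj])  # concatenate acoustic + posterior streams

# Toy usage: 50 frames, 10 phonemes, 13-dimensional MFCCs, keep 6 dims.
rng = np.random.default_rng(0)
ppp = rng.random(size=(50, 10))
ppp /= ppp.sum(axis=1, keepdims=True)
mfcc = rng.normal(size=(50, 13))
feat = tandem_features(ppp, mfcc, keep=6)
```

The resulting feature matrix is frame-synchronous with the original acoustics, so a GMM can be trained on it exactly as on plain MFCCs, which is the point the snippet makes.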

35 | Language Recognition via i-vectors and Dimensionality Reduction
- Dehak, Torres-Carrasquillo, et al.
- 2011
Citation Context: ...tor modeling has gained significant attention in both speaker verification (SV) and language identification (LID) domains due to its excellent performance, compact representation and small model size [1, 2, 3]. In this modeling, first, zero-order and first-order Baum-Welch statistics are calculated by projecting the MFCC features on those Gaussian Mixture Model (GMM) components using the occupancy posterio...

28 | The HTK book. Entropic Cambridge Research Laboratory
- Young, Odell, et al.
- 1997
Citation Context: ...ns the same except that the GMM model is trained on the tandem features. Second, we increase the time scale of tokens and adopt the trigrams as the new type of tokens. As shown in fig. 2, HTK toolkit [23] is used to decode the PPP features and output a lattice file for each utterance which is further processed into n-gram counts and n-gram indexes by the lattice-tool toolkit [24]. The decoded n-gram c...
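The "bag of trigrams" zero-order statistics mentioned in these snippets reduce, in the simplest case, to counting n-grams over a decoded token sequence. The sketch below uses a 1-best token string as a stand-in for the lattice posteriors the paper processes with lattice-tool; with a lattice, each n-gram would carry an expected count rather than an integer one. The function name and example data are illustrative.

```python
from collections import Counter

def ngram_zero_order_stats(tokens, n=3):
    """Bag-of-n-grams counts over a decoded token sequence.

    tokens: decoded symbols (e.g. phoneme labels) in time order
    n:      n-gram order; n=3 gives the trigram tokens used in the text
    """
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

# Toy usage: the sequence a b c a b d contains 4 distinct trigrams.
counts = ngram_zero_order_stats(list("abcabd"), n=3)
```

Each count plays the role of the occupancy posterior for its trigram token; the mean of the frames spanned by that trigram would then supply the first-order statistic, as the snippet goes on to describe.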

26 | A vector space modeling approach to spoken language identification
- Li, Ma, et al.
- 2007
Citation Context: ... SV and LID. First, we explore the commonly used phonemes as the phonetic tokens and extend to even larger units such as trigrams. In this way, the bag of trigrams vector in the vector space modeling [12] is exactly the zero-order statistics on these trigrams. Second, since the number of phonemes is much smaller than the number of tied triphone states, we converted the phoneme posterior probabilities ...

24 | Full-Covariance UBM and Heavy-Tailed PLDA in I-Vector Speaker Verification
- Matejka, Glembek, et al.
- 2011
Citation Context: ...U-SYSU Collaborative Innovation Research Center and the SYSU-CMU Shunde International Joint Research Institute. and Neural Network [6, 7] for LID and Probabilistic Linear Discriminant Analysis (PLDA) [8, 9] for SV, respectively). Lei, et al. [10] and Kenny, et al. [11] recently proposed a generalized i-vector framework where decision tree senones (tied triphone states) in a general Deep Neural Network bas...

17 | Language Recognition in iVectors Space
- Martínez, Plchot, et al.
- 2011
Citation Context: ...tor modeling has gained significant attention in both speaker verification (SV) and language identification (LID) domains due to its excellent performance, compact representation and small model size [1, 2, 3]. In this modeling, first, zero-order and first-order Baum-Welch statistics are calculated by projecting the MFCC features on those Gaussian Mixture Model (GMM) components using the occupancy posterio...

14 | Language identification using phoneme recognition and phonotactic language modeling
- Zissman
- 1995
Citation Context: ...different languages, the overall performance was enhanced (method 5). This makes sense because phonetic or phonotactic LID systems usually employ parallel phoneme recognizers from different languages [12, 27]. Furthermore, the combined tandem GMM-tandem system (method 9) achieved 1.81% EER which outperformed the i-vector baseline by 30% relatively. This finding matches with the SV results which indicates t...

12 | A novel scheme for speaker recognition using a phonetically-aware deep neural network
- Lei, Scheffer, et al.
- 2014
Citation Context: ...Center and the SYSU-CMU Shunde International Joint Research Institute. and Neural Network [6, 7] for LID and Probabilistic Linear Discriminant Analysis (PLDA) [8, 9] for SV, respectively). Lei, et al. [10] and Kenny, et al. [11] recently proposed a generalized i-vector framework where decision tree senones (tied triphone states) in a general Deep Neural Network based Automatic Speech Recognition (ASR) s...

10 | Patrol team language identification system for DARPA RATS P1 evaluation
- Matějka, Plchot, et al.
- 2012
Citation Context: ... (SVM), Logistic Regression [3] This research is funded in part by CMU-SYSU Collaborative Innovation Research Center and the SYSU-CMU Shunde International Joint Research Institute. and Neural Network [6, 7] for LID and Probabilistic Linear Discriminant Analysis (PLDA) [8, 9] for SV, respectively). Lei, et al. [10] and Kenny, et al. [11] recently proposed a generalized i-vector framework where decision tre...

10 | Speaker verification using simplified and supervised i-vector modeling
- Li, Tsiartas, et al.
- 2013
Citation Context: ...[Figure 3: Schematic of the factor analysis based i-vector and simplified supervised i-vector modeling [20, 21].] ...by concatenating all the F̃_c together: F̃_c = Σ_{t=1}^{L} P(c|y_t, λ)(y_t − µ_c) / Σ_{t=1}^{L} P(c|y_t, λ). (3) The centered mean supervector F̃ can be projected as follows: F̃ → Tx, (4) where T is a rectangular tot...

10 | Simplified supervised i-vector modeling with application to robust and efficient language identification and speaker verification, Computer Speech and Language
- Li, Narayanan
- 2014
Citation Context: ...[Figure 3: Schematic of the factor analysis based i-vector and simplified supervised i-vector modeling [20, 21].] ...by concatenating all the F̃_c together: F̃_c = Σ_{t=1}^{L} P(c|y_t, λ)(y_t − µ_c) / Σ_{t=1}^{L} P(c|y_t, λ). (3) The centered mean supervector F̃ can be projected as follows: F̃ → Tx, (4) where T is a rectangular tot...
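Equations (3) and (4) in the snippet above can be sketched numerically. The sketch below is illustrative, not the paper's method: eq. (3) is implemented directly, while for eq. (4) a least-squares fit stands in for proper i-vector extraction (which is a MAP point estimate involving the GMM covariances). All names, dimensions, and the random data are assumptions.

```python
import numpy as np

def centered_mean_supervector(features, means, posteriors):
    """Eq. (3): posterior-weighted mean per component, centered on the GMM mean,
    stacked into a CD-dimensional supervector F-tilde."""
    num = posteriors.T @ features                 # (C, D) sum_t P(c|y_t) y_t
    den = posteriors.sum(axis=0)[:, None]         # (C, 1) sum_t P(c|y_t)
    Fc = num / den - means                        # centered component means
    return Fc.reshape(-1)                         # concatenate all F_c

# Eq. (4): F-tilde -> T x, with T a rectangular (CD x K) total variability matrix.
rng = np.random.default_rng(1)
C, D, K, L = 8, 13, 5, 200
y = rng.normal(size=(L, D))
mu = rng.normal(size=(C, D))
post = rng.random(size=(L, C))
post /= post.sum(axis=1, keepdims=True)
F = centered_mean_supervector(y, mu, post)        # (C*D,) supervector
T = rng.normal(size=(C * D, K))
x, *_ = np.linalg.lstsq(T, F, rcond=None)         # x plays the role of the K-dim i-vector
```

The key structural point from the snippet survives in the sketch: the CD-dimensional supervector is compressed to a K-dimensional vector through the low-rank matrix T, with K much smaller than CD.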

9 | Shifted-delta MLP features for spoken language recognition
- Wang, Leung, et al.
Citation Context: ...s the same except that the GMM is trained on the tandem features. This phoneme posterior probability (PPP) based tandem feature has been reported to be effective in both ASR [13, 14, 16] and LID tasks [17, 18] as front end features. GMM mean supervector modeling and conventional i-vector modeling are used to model this tandem feature in [17] and [18] for LID. In both methods, the tandem feature outperforme...

7 | SRILM - an extensible language modeling toolkit
- Stolcke
Citation Context: ...n fig. 2, HTK toolkit [23] is used to decode the PPP features and output a lattice file for each utterance which is further processed into n-gram counts and n-gram indexes by the lattice-tool toolkit [24]. The decoded n-gram counts are considered as the posterior probability and the mean of features within this n-gram's range is accounted as yt where t indexes the whole n-gram here. Both tandem featur...

6 | Extended phone log-likelihood ratio features and acoustic-based i-vectors for language recognition
- D'Haro, Cordoba, et al.
Citation Context: ...s the same except that the GMM is trained on the tandem features. This phoneme posterior probability (PPP) based tandem feature has been reported to be effective in both ASR [13, 14, 16] and LID tasks [17, 18] as front end features. GMM mean supervector modeling and conventional i-vector modeling are used to model this tandem feature in [17] and [18] for LID. In both methods, the tandem feature outperforme...

4 | Deep neural networks for extracting Baum-Welch statistics for speaker recognition
- Kenny, Gupta, et al.
- 2014
Citation Context: ...U Shunde International Joint Research Institute. and Neural Network [6, 7] for LID and Probabilistic Linear Discriminant Analysis (PLDA) [8, 9] for SV, respectively). Lei, et al. [10] and Kenny, et al. [11] recently proposed a generalized i-vector framework where decision tree senones (tied triphone states) in a general Deep Neural Network based Automatic Speech Recognition (ASR) system are employed as ...

3 | TRAP language identification system for RATS phase II evaluation
- Han, Ganapathy, et al.
- 2013
Citation Context: ... (SVM), Logistic Regression [3] This research is funded in part by CMU-SYSU Collaborative Innovation Research Center and the SYSU-CMU Shunde International Joint Research Institute. and Neural Network [6, 7] for LID and Probabilistic Linear Discriminant Analysis (PLDA) [8, 9] for SV, respectively). Lei, et al. [10] and Kenny, et al. [11] recently proposed a generalized i-vector framework where decision tre...

2 | Analysis of MLP-based hierarchical phoneme posterior probability estimator
- Pinto, Garimella, et al.
- 2011
Citation Context: ... probabilities into tandem features [13, 14] and then apply GMM on top of it to generate large components tokens. This is also motivated by the hierarchical phoneme posterior probability estimator in [15]. In this setup, the GMM statistics calculation remains the same except that the GMM is trained on the tandem features. This phoneme posterior probability (PPP) based tandem feature has been reported ...

1 | The 2007 NIST Language Recognition Evaluation, http://www.itl.nist.gov/iad/mig/tests/lre/2007
- NIST
- 2007
Citation Context: ...seline, the proposed methods achieved 46% and 53% relative error reduction in terms of EER and norm old minDCF. 3.2. Results on LID We also adopted the 2007 NIST Language Recognition Evaluation (LRE) [26] 30 seconds closed set general task as the evaluation database for LID. Data of target languages from Call Friend, OGI Multilingual, OGI 22 languages, NIST LRE 1996, NIST LRE 2003, NIST LRE 2005, NIST...