Results 1–10 of 115
A study of interspeaker variability in speaker verification
 IEEE Trans. Audio, Speech and Language Processing
, 2008
Cited by 131 (12 self)
Abstract — We propose a new approach to the problem of estimating the hyperparameters which define the interspeaker variability model in joint factor analysis. We tested the proposed estimation technique on the NIST 2006 speaker recognition evaluation data and obtained 10–15% reductions in error rates on the core condition and the extended data condition (as measured both by equal error rates and the NIST detection cost function). We show that when a large joint factor analysis model is trained in this way and tested on the core condition, the extended data condition and the cross-channel condition, it is capable of performing at least as well as fusions of multiple systems of other types. (The comparisons are based on the best results on these tasks that have been reported in the literature.) In the case of the cross-channel condition, a factor analysis model with 300 speaker factors and 200 channel factors can achieve equal error rates of less than 3.0%. This is a substantial improvement over the best results that have previously been reported on this task. Index Terms — Speaker verification, Gaussian mixture model, speaker factors, channel factors
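The supervector decomposition underlying JFA can be sketched numerically. Below is a toy numpy illustration; all dimensions and matrices are random stand-ins, far smaller than the 300 speaker / 200 channel factors used in the paper. An utterance supervector is modeled as the UBM mean plus a speaker-subspace term and a channel-subspace term, so two sessions of the same speaker differ only in the channel term.

```python
import numpy as np

rng = np.random.default_rng(0)
CF = 12            # supervector dimension (num_gaussians * feat_dim; tiny here)
R_s, R_c = 3, 2    # speaker / channel factor counts (paper: 300 / 200)

m = rng.normal(size=CF)         # UBM mean supervector
V = rng.normal(size=(CF, R_s))  # speaker variability subspace
U = rng.normal(size=(CF, R_c))  # channel variability subspace

y = rng.normal(size=R_s)        # speaker factors: fixed for a speaker
x1 = rng.normal(size=R_c)       # channel factors: vary per recording
x2 = rng.normal(size=R_c)

M1 = m + V @ y + U @ x1         # two sessions of the same speaker
M2 = m + V @ y + U @ x2

# The speaker part m + V @ y cancels; only the channel term differs.
assert np.allclose(M1 - M2, U @ (x1 - x2))
```

Estimating the hyperparameters means learning m, V and U (and the factor priors) from multi-session training data; the sketch only shows the generative structure those hyperparameters define.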
Joint factor analysis of speaker and session variability: Theory and algorithms
, 2005
Cited by 84 (12 self)
Abstract — We give a full account of the algorithms needed to carry out a joint factor analysis of speaker and session variability in a training set in which each speaker is recorded over many different channels, and we discuss the practical limitations that will be encountered if these algorithms are implemented on very large data sets.
Temporally weighted linear prediction features for tackling additive noise in speaker verification
, 2010
Cited by 26 (19 self)
We consider text-independent speaker verification under additive noise corruption. In the popular mel-frequency cepstral coefficient (MFCC) front-end, we substitute the conventional Fourier-based spectrum estimation with weighted linear predictive methods, which have earlier shown success in noise-robust speech recognition. We introduce two temporally weighted variants of linear predictive (LP) modeling to speaker verification and compare them to the FFT, which is normally used in computing MFCCs, and to conventional LP. We also investigate the effect of speech enhancement (spectral subtraction) on system performance with each of the four feature representations. Our experiments on the NIST 2002 SRE corpus indicate that the accuracies of the conventional and proposed features are close to each other on clean data. At the 0 dB SNR level, the baseline FFT and the better of the proposed features give EERs of 17.4% and 15.6%, respectively. These accuracies improve to 11.6% and 11.2%, respectively, when spectral subtraction is included as a preprocessing method. The new features hold promise for noise-robust speaker verification.
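Temporally weighted LP replaces the uniform error criterion of conventional LP with a per-sample weight, often derived from short-time signal energy. A minimal numpy sketch of this idea follows; the specific weighting scheme here (energy of the preceding samples) is an assumption for illustration, not the paper's exact variant.

```python
import numpy as np

def weighted_lp(x, order):
    """Temporally weighted LP: minimize sum_n w[n] * e[n]^2, where the
    weight w[n] is the short-time energy of the preceding samples
    (an assumed weighting, for illustration only)."""
    N = len(x)
    w = np.array([np.sum(x[n - order:n] ** 2) + 1e-8
                  for n in range(order, N)])
    # design matrix of lagged samples: x[n-1], ..., x[n-order]
    X = np.column_stack([x[order - k - 1:N - k - 1] for k in range(order)])
    y = x[order:]
    W = np.diag(w)
    # weighted least-squares normal equations for the predictor coefficients
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# A pure sinusoid obeys x[n] = 2*cos(w0)*x[n-1] - x[n-2] exactly, so a
# 2nd-order predictor recovers it (almost) perfectly.
t = np.arange(200)
sig = np.sin(0.3 * t)
a = weighted_lp(sig, order=2)
pred = a[0] * sig[1:-1] + a[1] * sig[:-2]
assert np.allclose(pred, sig[2:], atol=1e-6)
```

With uniform weights this reduces to the covariance method of conventional LP; the temporal weighting de-emphasizes low-energy regions where additive noise dominates.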
Language recognition in iVectors space
 Proc. Interspeech,
, 2011
Cited by 21 (1 self)
Abstract The concept of so-called iVectors, where each utterance is represented by a fixed-length, low-dimensional feature vector, has recently become very successful in speaker verification. In this work, we apply the same idea in the context of Language Recognition (LR). To recognize language in the iVector space, we experiment with three different linear classifiers: one based on a generative model, where classes are modeled by Gaussian distributions with a shared covariance matrix, and two discriminative classifiers, namely linear Support Vector Machine and Logistic Regression. The tests were performed on the NIST LRE 2009 dataset and the results were compared with a state-of-the-art LR system based on Joint Factor Analysis (JFA). While the iVector system offers better performance, it also seems to be complementary to JFA, as their fusion shows further improvement.
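The generative classifier described above is easy to sketch: each class gets its own mean, all classes share one covariance, and scoring reduces to comparing quadratic forms whose difference is linear in the input. A toy numpy version on synthetic two-dimensional "i-vectors" (random data, not LRE material):

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_shared_cov_gaussian(X, y, n_classes):
    means = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])
    centered = X - means[y]               # pool residuals over all classes
    cov = centered.T @ centered / len(X)  # single shared covariance
    return means, np.linalg.inv(cov)

def classify(X, means, cov_inv):
    # log-likelihoods up to a constant; the shared covariance makes the
    # decision boundaries linear in X
    s = np.stack([-0.5 * np.einsum('nd,df,nf->n', X - mu, cov_inv, X - mu)
                  for mu in means], axis=1)
    return s.argmax(axis=1)

# two toy "languages" in a 2-D i-vector space
X0 = rng.normal([-3.0, 0.0], 1.0, size=(100, 2))
X1 = rng.normal([+3.0, 0.0], 1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

means, cov_inv = fit_shared_cov_gaussian(X, y, 2)
acc = (classify(X, means, cov_inv) == y).mean()
assert acc > 0.95
```

The discriminative alternatives (linear SVM, logistic regression) learn the same linear decision function directly instead of deriving it from class-conditional Gaussians.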
Simplification and optimization of i-vector extraction
 in Proceedings of the 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011. IEEE Signal Processing Society,
, 2011
Cited by 19 (4 self)
ABSTRACT This paper introduces some simplifications to i-vector speaker recognition systems. I-vector extraction, as well as training of the i-vector extractor, can be an expensive task both in terms of memory and speed. Under certain assumptions, the formulas for i-vector extraction (also used in i-vector extractor training) can be simplified, leading to faster and more memory-efficient code. The first assumption is that the GMM component alignment is constant across utterances and is given by the UBM GMM weights. The second assumption is that the i-vector extractor matrix can be linearly transformed so that its per-Gaussian components are orthogonal. We use PCA and HLDA to estimate this transform.
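The effect of the first assumption can be shown directly from the standard i-vector posterior mean, w = (I + Tᵀ Σ⁻¹ N T)⁻¹ Tᵀ Σ⁻¹ f: when the alignment N is fixed to the UBM weights times the frame count, the utterance-dependent matrix depends on the utterance only through one scalar, so the expensive matrix product can be precomputed. A numpy sketch with random stand-in parameters (not a trained extractor):

```python
import numpy as np

rng = np.random.default_rng(2)
C, F, R = 8, 4, 5                  # Gaussians, feature dim, i-vector dim
CF = C * F
T = rng.normal(size=(CF, R))       # total-variability matrix (stand-in)
Sigma_inv = np.eye(CF)             # inverse UBM covariance (identity here)
w_ubm = np.full(C, 1.0 / C)        # UBM weights

def extract_standard(N_c, f):
    """N_c: per-Gaussian zeroth-order stats, f: centered first-order stats."""
    N_diag = np.repeat(N_c, F)     # expand to supervector dimension
    L = np.eye(R) + T.T @ ((N_diag[:, None] * Sigma_inv) @ T)
    return np.linalg.solve(L, T.T @ (Sigma_inv @ f))

# constant alignment: N_c = n_frames * w_ubm, so the utterance-dependent
# part of L is a scalar times a matrix G precomputed once
G = T.T @ ((np.repeat(w_ubm, F)[:, None] * Sigma_inv) @ T)

def extract_simplified(n_frames, f):
    L = np.eye(R) + n_frames * G
    return np.linalg.solve(L, T.T @ (Sigma_inv @ f))

f = rng.normal(size=CF)
w1 = extract_standard(100 * w_ubm, f)
w2 = extract_simplified(100, f)
assert np.allclose(w1, w2)
```

The second assumption goes further: if T is transformed so that G is diagonal, the solve collapses to elementwise divisions.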
Duration Mismatch Compensation for i-vector based Speaker Recognition Systems
 Proc. IEEE ICASSP,
, 2013
Cited by 16 (7 self)
ABSTRACT Speaker recognition systems trained on long-duration utterances are known to perform significantly worse when short test segments are encountered. To address this mismatch, we analyze the effect of duration variability on the phoneme distributions of speech utterances and on i-vector length. We demonstrate that, as utterance duration is decreased, the number of detected unique phonemes and the i-vector length approach zero in a logarithmic and a nonlinear fashion, respectively. Treating duration variability as an additive noise in the i-vector space, we propose three different strategies for its compensation: i) multi-duration training of the Probabilistic Linear Discriminant Analysis (PLDA) model, ii) score calibration using log duration as a Quality Measure Function (QMF), and iii) multi-duration PLDA training with synthesized short-duration i-vectors. Experiments are designed based on the 2012 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) protocol with varying test utterance duration. Experimental results demonstrate the effectiveness of the proposed schemes on short-duration test conditions, especially with the QMF calibration approach.
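The QMF idea in strategy (ii) amounts to learning an affine calibration with log duration as an extra input. A self-contained numpy sketch on synthetic trials follows; the score model, durations and labels are simulated and only mimic the qualitative duration effect, they are not SRE data.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
durations = rng.uniform(5.0, 300.0, size=n)     # test-segment durations (s)
labels = rng.integers(0, 2, size=n)             # 1 = target trial
# synthetic raw scores: the target/non-target margin shrinks with duration
scores = rng.normal(0.0, 1.0, n) + labels * (1 + 0.5 * np.log(durations / 300))

# calibrated score = a*s + b*log(d) + c, fit by logistic regression
X = np.column_stack([scores, np.log(durations), np.ones(n)])
w = np.zeros(3)
for _ in range(200):                            # plain gradient ascent
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w += 0.01 * X.T @ (labels - p) / n

def nll(w):                                     # logistic negative log-lik.
    z = X @ w
    return np.mean(np.log1p(np.exp(-z)) + (1 - labels) * z)

assert w[0] > 0                                 # raw score kept with + sign
assert nll(w) < nll(np.zeros(3))                # calibration improved the fit
```

In a real system the weights would be trained on held-out labeled trials and the calibrated score thresholded against the application operating point.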
Low-Variance Multitaper MFCC Features: a Case Study in Robust Speaker Verification
, 2012
Cited by 13 (3 self)
In speech and audio applications, the short-term signal spectrum is often represented using mel-frequency cepstral coefficients (MFCCs) computed from a windowed discrete Fourier transform (DFT). Windowing reduces spectral leakage, but the variance of the spectrum estimate remains high. An elegant extension to the windowed DFT is the so-called multitaper method, which uses multiple time-domain windows (tapers) with frequency-domain averaging. Multitapers have received little attention in speech processing even though they produce low-variance features. In this paper, we propose the multitaper method for MFCC extraction with a practical focus. We provide, firstly, a detailed statistical analysis of MFCC bias and variance using autoregressive process simulations on the TIMIT corpus. For speaker verification experiments on the NIST 2002 and 2008 SRE corpora, we consider three Gaussian mixture model based classifiers with universal background model (GMM-UBM), support vector machine (GMM-SVM) and joint factor analysis (GMM-JFA). Multitapers improve MinDCF over the baseline windowed DFT by a relative 20.4% (GMM-SVM) and 13.7% (GMM-JFA) on the interview-interview condition in NIST 2008. The GMM-JFA system further reduces MinDCF by 18.7% on the telephone data. With these improvements and generally non-critical parameter selection, multitaper MFCCs are a viable candidate for replacing the conventional MFCCs.
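A multitaper spectrum estimate averages several eigenspectra obtained with orthonormal tapers, trading a small amount of bias for a large variance reduction. The sketch below uses sine tapers, one common taper family that is easy to generate in pure numpy; it is an illustration of the method, not the paper's exact configuration.

```python
import numpy as np

def sine_tapers(N, K):
    # orthonormal sine tapers; a numpy-only alternative to DPSS tapers
    n = np.arange(N)
    return np.stack([np.sqrt(2.0 / (N + 1)) *
                     np.sin(np.pi * (k + 1) * (n + 1) / (N + 1))
                     for k in range(K)])

def multitaper_spectrum(x, K=6):
    tapers = sine_tapers(len(x), K)
    eig = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2   # K eigenspectra
    return eig.mean(axis=0)     # averaging lowers the estimator's variance

rng = np.random.default_rng(4)
N = 512
x = np.sin(2 * np.pi * 64 * np.arange(N) / N) + 0.1 * rng.normal(size=N)
S = multitaper_spectrum(x)

assert np.allclose(sine_tapers(N, 6) @ sine_tapers(N, 6).T, np.eye(6))
assert np.argmax(S) == 64       # spectral peak at the sinusoid's bin
```

For MFCC extraction, this estimate simply replaces the single windowed periodogram before the mel filterbank; the rest of the pipeline is unchanged.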
Discriminative acoustic language recognition via channel-compensated GMM statistics
 in Proc. Interspeech
, 2009
Cited by 12 (0 self)
We propose a novel design for acoustic feature-based automatic spoken language recognizers. Our design is inspired by recent advances in text-independent speaker recognition, where intra-class variability is modeled by factor analysis in Gaussian mixture model (GMM) space. We use approximations to GMM likelihoods which allow variable-length data sequences to be represented as statistics of fixed size. Our experiments on NIST LRE'07 show that variability compensation of these statistics can reduce error rates by a factor of three. Finally, we show that further improvements are possible with discriminative logistic regression training. Index Terms: acoustic language recognition, inter-session variability compensation, discriminative training
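The fixed-size representation referred to above is the pair of zeroth- and first-order Baum-Welch statistics collected under a GMM. A minimal numpy sketch with a random stand-in UBM (unit diagonal covariances assumed for brevity):

```python
import numpy as np

rng = np.random.default_rng(5)
C, F = 4, 3                        # Gaussians, feature dimension
means = rng.normal(size=(C, F))    # stand-in UBM with unit diagonal covs
weights = np.full(C, 1.0 / C)

def bw_stats(X):
    """Zeroth- and first-order Baum-Welch statistics of a frame sequence."""
    d2 = ((X[:, None, :] - means[None]) ** 2).sum(-1)   # (T, C) sq. dists
    log_g = np.log(weights) - 0.5 * d2
    g = np.exp(log_g - log_g.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)  # responsibilities gamma[t, c]
    return g.sum(axis=0), g.T @ X      # N_c: (C,), F_c: (C, F)

X = rng.normal(size=(50, F))           # 50 frames in, fixed-size stats out
N_c, F_c = bw_stats(X)
assert N_c.shape == (C,) and F_c.shape == (C, F)
assert np.isclose(N_c.sum(), len(X))   # responsibilities sum to T frames
```

Because the statistics have the same shape for any utterance length, subspace-based variability compensation and discriminative training can operate on them directly.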
Support vector machines and joint factor analysis for speaker verification
Proceedings of ICASSP 2009
Cited by 12 (2 self)
This article presents several techniques for combining Support Vector Machines (SVMs) with the Joint Factor Analysis (JFA) model for speaker verification. In this combination, the SVMs are applied to different sources of information produced by the JFA. These sources are the Gaussian Mixture Model supervectors and the speaker and common factors. We found that applying SVMs to the JFA factors gave the best results, especially when within-class covariance normalization is applied in order to compensate for channel effects. The new combination results are comparable to those of the classical JFA scoring techniques.
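Within-class covariance normalization (WCCN), mentioned above as the channel-compensation step, whitens the within-speaker scatter of the factor vectors before SVM scoring. A numpy sketch on synthetic classes (random stand-in data, not JFA outputs):

```python
import numpy as np

rng = np.random.default_rng(6)
n_classes, n_per, d = 10, 20, 5
class_means = 3.0 * rng.normal(size=(n_classes, d))
X = np.vstack([mu + rng.normal(size=(n_per, d)) for mu in class_means])
y = np.repeat(np.arange(n_classes), n_per)

def within_class_cov(X, y, n_classes):
    W = np.zeros((X.shape[1], X.shape[1]))
    for c in range(n_classes):
        Xc = X[y == c] - X[y == c].mean(axis=0)
        W += Xc.T @ Xc / len(Xc)
    return W / n_classes

W = within_class_cov(X, y, n_classes)
B = np.linalg.cholesky(np.linalg.inv(W))   # WCCN projection: x -> B.T x
Z = X @ B

# after projection, the within-class covariance is the identity
assert np.allclose(within_class_cov(Z, y, n_classes), np.eye(d))
```

Equalizing the within-class scatter this way keeps a linear SVM kernel from being dominated by channel-induced directions of variability.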
Speaker verification using simplified and supervised i-vector modeling
to appear in Proc. of ICASSP
, 2013
Cited by 11 (9 self)
This paper presents a simplified and supervised i-vector modeling framework applied to the task of robust and efficient speaker verification (SRE). First, by concatenating the mean supervector and the i-vector factor loading matrix with, respectively, the label vector and the linear classifier matrix, the traditional i-vectors are extended to label-regularized supervised i-vectors. These supervised i-vectors are optimized to not only reconstruct the mean supervectors well but also minimize the mean squared error between the original and the reconstructed label vectors, so that they become more discriminative. Second, factor analysis (FA) can be performed on the pre-normalized centered GMM first-order statistics supervector to ensure that the Gaussian statistics sub-vector of each Gaussian component is treated equally in the FA, which reduces the computational cost significantly. Experimental results are reported on the female part of the NIST SRE 2010 task with common condition 5. The proposed supervised i-vector approach outperforms the i-vector baseline by a relative 12% and 7% in terms of equal error rate (EER) and norm old minDCF values, respectively. Index Terms — Speaker verification, Simplified i-vector, Supervised i-vector
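One way to read the label-regularized objective is as a stacked least-squares problem: the latent vector must both reconstruct the statistics and predict the label vector. The numpy sketch below is an assumed simplification of that idea (single utterance, random stand-in matrices, and a hypothetical weight lam), showing that the stacked system and its normal-equations form agree.

```python
import numpy as np

rng = np.random.default_rng(7)
D, R, L = 20, 5, 3                # supervector dim, latent dim, label dim
T = rng.normal(size=(D, R))       # factor loading matrix (stand-in)
C = rng.normal(size=(L, R))       # linear classifier matrix (stand-in)
f = rng.normal(size=D)            # centered first-order statistics
l = np.array([1.0, 0.0, 0.0])     # one-hot label vector
lam = 1.0                         # label-regularization weight (hypothetical)

# stacked system [T; sqrt(lam)*C] w ~ [f; sqrt(lam)*l]: w must both
# reconstruct the statistics and predict the label vector
A = np.vstack([T, np.sqrt(lam) * C])
b = np.concatenate([f, np.sqrt(lam) * l])
w_sup, *_ = np.linalg.lstsq(A, b, rcond=None)

# same solution via the normal equations
w_ne = np.linalg.solve(T.T @ T + lam * (C.T @ C), T.T @ f + lam * (C.T @ l))
assert np.allclose(w_sup, w_ne)
```

Setting lam to zero recovers a plain reconstruction objective; increasing it pushes the latent vectors toward label-discriminative directions, which is the intuition behind the reported EER gains.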