Results 1–10 of 63
An overview of text-independent speaker recognition: from features to supervectors
, 2009
Abstract

Cited by 156 (37 self)
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We then elaborate on advanced computational techniques that address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with a discussion of future directions.
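The feature-extraction fundamentals surveyed above start from the mel-frequency cepstral front-end. As a rough illustration of that pipeline (framing, windowing, power spectrum, mel filterbank, log compression, DCT), here is a minimal sketch; the frame sizes, filter counts, and synthetic input are illustrative assumptions, not a configuration from the survey.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=8000, n_fft=256, hop=128, n_mels=20, n_ceps=12):
    """Toy MFCC front-end: frame -> Hamming window -> power spectrum
    -> mel filterbank -> log -> DCT."""
    win = np.hamming(n_fft)
    frames = np.array([signal[s:s + n_fft] * win
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel filterbank, equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    # Log filterbank energies, then DCT to decorrelate.
    logmel = np.log(spec @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

One second of 8 kHz audio with these settings yields 61 frames of 12 coefficients each.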
A study of inter-speaker variability in speaker verification
 IEEE Trans. Audio, Speech and Language Processing
, 2008
Abstract

Cited by 131 (12 self)
Abstract — We propose a new approach to the problem of estimating the hyperparameters which define the inter-speaker variability model in joint factor analysis. We tested the proposed estimation technique on the NIST 2006 speaker recognition evaluation data and obtained 10–15% reductions in error rates on the core condition and the extended data condition (as measured both by equal error rates and the NIST detection cost function). We show that when a large joint factor analysis model is trained in this way and tested on the core condition, the extended data condition and the cross-channel condition, it is capable of performing at least as well as fusions of multiple systems of other types. (The comparisons are based on the best results on these tasks that have been reported in the literature.) In the case of the cross-channel condition, a factor analysis model with 300 speaker factors and 200 channel factors can achieve equal error rates of less than 3.0%. This is a substantial improvement over the best results that have previously been reported on this task. Index Terms — Speaker verification, Gaussian mixture model, speaker factors, channel factors
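Joint factor analysis, as used above, decomposes a speaker- and session-dependent GMM mean supervector as M = m + Vy + Ux + Dz: a speaker subspace V with speaker factors y, a channel subspace U with channel factors x, and a diagonal residual D. A toy numpy sketch of that synthesis model follows; the dimensions and random matrices are stand-ins (the paper's models use e.g. 300 speaker and 200 channel factors over much larger supervectors).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; real systems use far larger supervectors.
sv_dim, n_spk, n_chan = 1000, 300, 200

m = rng.standard_normal(sv_dim)            # UBM mean supervector
V = rng.standard_normal((sv_dim, n_spk))   # speaker subspace (eigenvoices)
U = rng.standard_normal((sv_dim, n_chan))  # channel subspace (eigenchannels)
d = rng.random(sv_dim)                     # diagonal of the residual term D

y = rng.standard_normal(n_spk)   # speaker factors, fixed for this speaker
z = rng.standard_normal(sv_dim)  # residual speaker-specific offsets

def utterance_supervector(x):
    """JFA synthesis: M = m + V y + U x + D z.  Only the channel
    factors x change from one recording of the speaker to the next."""
    return m + V @ y + U @ x + d * z

x1 = rng.standard_normal(n_chan)
x2 = rng.standard_normal(n_chan)
rec1 = utterance_supervector(x1)  # two sessions of the same speaker:
rec2 = utterance_supervector(x2)  # they share m + V y + D z, differ in U x
```

The hyperparameters (V, U, D) are what the paper's estimation technique learns from data; here they are random placeholders.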
Combining Derivative and Parametric Kernels for Speaker Verification
, 2007
Abstract

Cited by 18 (1 self)
Support Vector Machine-based speaker verification (SV) has become a standard approach in recent years. These systems typically use dynamic kernels to handle the dynamic nature of the speech utterances. This paper shows that many of these kernels fall into one of two general classes, derivative and parametric kernels. The attributes of these classes are contrasted, and the conditions under which the two forms of kernel are identical are described. By avoiding these conditions, gains may be obtained by combining derivative and parametric kernels. One combination strategy is to combine at the kernel level. This paper describes a maximum-margin based scheme for learning kernel weights for the SV task. Various dynamic kernels and combinations were evaluated on the NIST 2002 SRE task, including derivative and parametric kernels based upon different model structures. The best overall performance was 7.78% EER, achieved when combining five kernels.
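Kernel-level combination, as described above, rests on the fact that a non-negative weighted sum of valid kernels is itself a valid kernel. A minimal sketch with scikit-learn's precomputed-kernel SVM follows; the two Gram matrices and the fixed weights are synthetic stand-ins (the paper builds derivative and parametric kernels from GMM statistics and learns the weights with a maximum-margin scheme).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Stand-ins for two dynamic kernels computed on the same 40 utterances.
X = rng.standard_normal((40, 8))
y = np.repeat([0, 1], 20)
K1 = X @ X.T                                                       # linear
K2 = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))   # RBF

def combine(kernels, betas):
    """Kernel-level fusion: a non-negative weighted sum of Gram
    matrices is itself a positive semi-definite kernel."""
    return sum(b * K for b, K in zip(betas, kernels))

K = combine([K1, K2], [0.3, 0.7])   # weights fixed here for illustration
svm = SVC(kernel='precomputed').fit(K, y)
train_acc = svm.score(K, y)
```

At test time the same weighted combination is applied to the test-versus-train Gram matrix before calling `svm.predict`.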
Low-Variance Multitaper MFCC Features: A Case Study in Robust Speaker Verification
, 2012
Abstract

Cited by 13 (3 self)
In speech and audio applications, the short-term signal spectrum is often represented using mel-frequency cepstral coefficients (MFCCs) computed from a windowed discrete Fourier transform (DFT). Windowing reduces spectral leakage, but the variance of the spectrum estimate remains high. An elegant extension of the windowed DFT is the so-called multitaper method, which uses multiple time-domain windows (tapers) with frequency-domain averaging. Multitapers have received little attention in speech processing even though they produce low-variance features. In this paper, we propose the multitaper method for MFCC extraction with a practical focus. We first provide a detailed statistical analysis of MFCC bias and variance using autoregressive process simulations on the TIMIT corpus. For speaker verification experiments on the NIST 2002 and 2008 SRE corpora, we consider three Gaussian mixture model based classifiers with universal background model (GMM-UBM), support vector machine (GMM-SVM) and joint factor analysis (GMM-JFA). Multitapers improve MinDCF over the baseline windowed DFT by a relative 20.4% (GMM-SVM) and 13.7% (GMM-JFA) on the interview-interview condition in NIST 2008. The GMM-JFA system further reduces MinDCF by 18.7% on the telephone data. With these improvements and generally non-critical parameter selection, multitaper MFCCs are a viable candidate for replacing conventional MFCCs.
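The variance-reduction argument above can be checked numerically: averaging periodograms over orthogonal Slepian (DPSS) tapers lowers the variance of the spectrum estimate relative to a single Hamming-windowed DFT. A small sketch using SciPy's `dpss` windows, on synthetic white-noise frames rather than speech (the taper count and bandwidth are illustrative choices):

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(x, n_tapers=6, nw=3.5):
    """Average the periodograms obtained with K orthogonal Slepian
    (DPSS) tapers; averaging over tapers cuts the variance of the
    estimate relative to a single windowed-DFT periodogram."""
    tapers = dpss(len(x), NW=nw, Kmax=n_tapers)      # shape (K, N)
    return (np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2).mean(axis=0)

# Variance comparison on synthetic white-noise frames.
rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 256))
mt = np.array([multitaper_psd(f) for f in frames])
ham = np.abs(np.fft.rfft(frames * np.hamming(256), axis=1)) ** 2

def rel_var(p):
    # scale-free variance across frames, averaged over frequency bins
    return (p.var(axis=0) / p.mean(axis=0) ** 2).mean()

lower = rel_var(mt) < rel_var(ham)   # multitaper estimate varies less
```

Feeding the multitaper spectrum into the usual mel filterbank and DCT then gives the low-variance MFCCs the paper studies.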
Variational Bayes logistic regression as regularized fusion for NIST SRE 2010
 in Proc. Odyssey: the Speaker and Language Recognition Workshop
, 2012
Abstract

Cited by 10 (8 self)
Fusion of base classifiers is seen as a way to achieve high performance in state-of-the-art speaker verification systems. Typically, we look for base classifiers that are complementary. We might also be interested in reinforcing good base classifiers by including others that are similar to them. In any case, the final ensemble size is typically small and has to be chosen based on rules of thumb. We are interested in finding a subset of classifiers with good generalization performance. We approach the problem from a sparse learning point of view, assuming that the true, but unknown, fusion weights are sparse. As a practical solution, we regularize the weighted logistic regression loss function with elastic-net and LASSO constraints. However, all regularization methods have an additional parameter that controls the amount of regularization employed, which needs to be tuned separately. In this work, we use a variational Bayes approach to automatically obtain sparse solutions without additional cross-validation. The variational Bayes method improves on the baseline in 3 out of 4 subconditions. Index Terms: logistic regression, regularization, compressed sensing, linear fusion, speaker verification
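The elastic-net-regularized fusion described above can be sketched with scikit-learn's penalized logistic regression. Note that this is the cross-validated baseline the paper's variational Bayes method is designed to replace, and that the subsystem scores below are synthetic stand-ins in which only three of eight subsystems are informative, matching the sparsity assumption on the fusion weights.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Synthetic base-classifier scores: 8 subsystems score 600 trials,
# but only the first 3 subsystems carry real speaker information.
labels = rng.integers(0, 2, 600)
informative = labels[:, None] + 0.8 * rng.standard_normal((600, 3))
noise = rng.standard_normal((600, 5))
scores = np.hstack([informative, noise])

# Elastic-net-penalized logistic regression as the regularized fusion.
# C and l1_ratio play the role of the extra regularization parameter
# that would normally require separate tuning.
fusion = LogisticRegression(penalty='elasticnet', solver='saga',
                            l1_ratio=0.9, C=0.1, max_iter=5000)
fusion.fit(scores, labels)
w = fusion.coef_.ravel()
# With a strong L1 component, the weights on the pure-noise
# subsystems are driven to (or near) zero.
```

Sweeping C and l1_ratio over a grid with held-out data is the cross-validation step the variational Bayes formulation avoids.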
Sparse classifier fusion for speaker verification
 IEEE Transactions on Audio, Speech and Language Processing
, 2013
Abstract

Cited by 9 (9 self)
Abstract—State-of-the-art speaker verification systems take advantage of a number of complementary base classifiers by fusing them to arrive at reliable verification decisions. In speaker verification, fusion is typically implemented as a weighted linear combination of the base classifier scores, where the combination weights are estimated using a logistic regression model. An alternative approach to fusion is classifier ensemble selection, which can be seen as sparse regularization applied to logistic regression. Even though score fusion has been extensively studied in speaker verification, classifier ensemble selection has received much less attention. In this work, we extensively study sparse classifier fusion on a collection of twelve I4U spectral subsystems on the NIST 2008 and 2010 speaker recognition evaluation (SRE) corpora. Index Terms—Classifier ensemble selection, experimentation, linear fusion, speaker verification.
The BOSARIS Toolkit: Theory, algorithms and code for surviving the new DCF
 in NIST SRE’11 Analysis Workshop
, 2011
Abstract

Cited by 9 (0 self)
The change of two orders of magnitude in the new DCF of SRE’10, relative to the old DCF evaluation criterion, posed a difficult challenge for participants and evaluator alike. Initially, participants were at a loss as to how to calibrate their systems, while the evaluator underestimated the required number of evaluation trials. After the fact, it is now obvious that both calibration and evaluation require very large sets of trials. This poses the challenges of (i) how to decide what number of trials is enough, and (ii) how to process such large data sets with reasonable memory and CPU requirements. After SRE’10, at the BOSARIS Workshop, we built solutions to these problems into the freely available BOSARIS Toolkit. This paper explains the principles and algorithms behind this toolkit. The main contributions of the toolkit are:
1. The Normalized Bayes Error-Rate Plot, which analyses likelihood-ratio calibration over a wide range of DCF operating points. These plots also help in judging the adequacy of the sizes of calibration and evaluation databases.
2. Efficient algorithms to compute DCF and minDCF for large score files, over the range of operating points required by these plots.
3. A new score file format, which facilitates working with very large trial lists.
4. A faster logistic regression optimizer for fusion and calibration.
5. A principled way to define the equal error rate, which is of practical interest when the absolute error count is small.
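The DCF and minDCF quantities central to the toolkit can be computed with a simple threshold sweep over the sorted scores. A sketch follows (not the BOSARIS implementation), defaulting to the SRE'10 operating point P_target = 0.001 with unit costs; the Gaussian scores in the example are synthetic.

```python
import numpy as np

def min_dcf(tar, non, p_target=0.001, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds:
    DCF(t) = c_miss*p_target*P_miss(t) + c_fa*(1-p_target)*P_fa(t),
    normalized by the cost of the better trivial (accept-all or
    reject-all) system.  p_target=0.001 is the SRE'10 'new DCF' point."""
    scores = np.concatenate([tar, non])
    labels = np.concatenate([np.ones(len(tar)), np.zeros(len(non))])
    labels = labels[np.argsort(scores)]
    # Sweep the threshold between consecutive sorted scores: misses
    # are targets below it, false alarms non-targets at or above it.
    p_miss = np.concatenate([[0.0], np.cumsum(labels) / len(tar)])
    p_fa = np.concatenate([[1.0], 1.0 - np.cumsum(1 - labels) / len(non)])
    dcf = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
    return dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))

rng = np.random.default_rng(0)
cost = min_dcf(rng.standard_normal(2000) + 2.0, rng.standard_normal(10000))
```

The normalization caps minDCF at 1.0, the cost of the trivial system, so values near 1 signal that the detector adds no value at that operating point.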
Dealing with sensor interoperability in multibiometrics: The UPM experience at the Biosecure Multimodal Evaluation 2007
Abstract

Cited by 8 (1 self)
Multimodal biometric systems make it possible to overcome some of the problems present in unimodal systems, such as non-universality, lack of distinctiveness of the unimodal trait, noise in the acquired data, etc. Integration at the matching score level is the most common approach, owing to the ease of combining the scores generated by different unimodal systems. Unfortunately, scores usually lie in application-dependent domains. In this work, we use linear logistic regression fusion, in which fused scores tend to be calibrated log-likelihood ratios and are thus independent of the application. For our experiments we use the development set of scores of the DS2 Evaluation (Access Control Scenario) of the BioSecure Multimodal Evaluation Campaign, whose objective is to compare the performance of fusion algorithms when query biometric signals originate from heterogeneous biometric devices. We compare a fusion scheme that uses linear logistic regression with a set of simple fusion rules. We observe that the proposed fusion scheme outperforms all the simple fusion rules, with the additional advantage of the application-independent nature of the resulting fused scores.
SPEAKER RECOGNITION USING SYLLABLE-BASED CONSTRAINTS FOR CEPSTRAL FRAME SELECTION
Abstract

Cited by 8 (4 self)
We describe a new GMM-UBM speaker recognition system that uses standard cepstral features, but selects different frames of speech for different subsystems. Subsystems, or “constraints”, are based on syllable-level information and combined at the score level. Results on both the NIST 2006 and 2008 test data sets for the English telephone train and test condition reveal that a set of eight constraints performs extremely well, resulting in better performance than other commonly used cepstral models. Given the still largely unexplored world of possible constraints and combinations, it is likely that the approach can be further improved. The resulting system outperforms SRI’s otherwise top cepstral-based systems on English telephone data, for both the NIST SRE 2006 and NIST SRE 2008 test data sets. Index Terms — Speaker recognition, higher-level features, GMMs, cepstral features, MFCCs, syllables
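The GMM-UBM scoring underlying the constraint subsystems is an average per-frame log-likelihood ratio between a target speaker model and the universal background model. A toy sketch with scikit-learn GMMs follows; the 2-D synthetic frames are stand-ins for cepstral features, and training a fresh target GMM instead of MAP-adapting it from the UBM is a simplification.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Background data pooled over many speakers, and enrollment data from
# one target speaker (synthetic 2-D "cepstral" frames as stand-ins).
bg = rng.standard_normal((2000, 2))
target_frames = rng.standard_normal((300, 2)) + np.array([1.5, -1.0])

# Universal background model: a GMM trained on the pooled background.
ubm = GaussianMixture(n_components=8, random_state=0).fit(bg)
# Target model (simplified: trained directly rather than MAP-adapted).
spk = GaussianMixture(n_components=8, random_state=0).fit(target_frames)

def llr_score(frames):
    """Average per-frame log p(x|spk) - log p(x|UBM).  Frame-selection
    schemes such as syllable-based constraints simply change which
    rows of `frames` enter this average."""
    return spk.score(frames) - ubm.score(frames)

test_target = rng.standard_normal((200, 2)) + np.array([1.5, -1.0])
test_impostor = rng.standard_normal((200, 2))
# Target trials should score higher than impostor trials.
```

Each constraint subsystem applies this same score to a different subset of frames, and the per-constraint scores are then fused.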
System combination using auxiliary information for speaker verification
 in Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing (ICASSP), Las Vegas
, 2008
Abstract

Cited by 7 (4 self)
Recent studies in speaker recognition have shown that score-level combination of subsystems can yield significant performance gains over individual subsystems. We explore the use of auxiliary information to aid the combination procedure. We propose a modified linear logistic regression procedure that conditions combination weights on the auxiliary information. A regularization procedure is used to control the complexity of the extended model. Several auxiliary features are explored. Results are presented for data from the 2006 NIST speaker recognition evaluation (SRE). When an estimated degree of non-nativeness for the speaker is used as auxiliary information, the proposed combination results in a 15% relative reduction in equal error rate over methods based on standard linear logistic regression, support vector machines, and neural networks.
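Conditioning the fusion weights on auxiliary information, as proposed above, can be emulated by letting each weight depend linearly on the auxiliary variable, w_i(a) = u_i + v_i·a, which reduces to ordinary logistic regression on score/auxiliary interaction features. A synthetic sketch (not the paper's exact model or regularizer; C stands in for the complexity control):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# Two subsystem scores per trial plus one auxiliary variable (e.g. an
# estimated degree of non-nativeness); all synthetic stand-ins.
n = 1000
labels = rng.integers(0, 2, n)
aux = rng.random(n)
# Subsystem 0 is reliable for low aux, subsystem 1 for high aux, so
# the best combination weights genuinely depend on aux.
s0 = labels + rng.standard_normal(n) * (0.5 + 2.0 * aux)
s1 = labels + rng.standard_normal(n) * (2.5 - 2.0 * aux)
scores = np.column_stack([s0, s1])

# w_i(a) = u_i + v_i*a is equivalent to logistic regression on the
# augmented feature vector [s, a*s]; C regularizes the extended model.
aug = np.hstack([scores, aux[:, None] * scores])
plain = LogisticRegression(C=1.0).fit(scores, labels)
cond = LogisticRegression(C=1.0).fit(aug, labels)
```

Comparing `plain` and `cond` on held-out trials shows whether the auxiliary conditioning buys anything on a given data set.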