Front End Factor Analysis for Speaker Verification
- IEEE Transactions on Audio, Speech and Language Processing, 2010
Cited by 315 (22 self)
This paper presents an extension of our previous work which proposes a new speaker representation for speaker verification. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Two speaker verification systems are proposed which use this new representation. The first system is a support vector machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. We tested three channel compensation techniques in the total variability space: within-class covariance normalization (WCCN), linear discriminant analysis (LDA), and nuisance attribute projection (NAP). We found that the best results are obtained when LDA is followed by WCCN. We achieved an equal error rate (EER) of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. We also obtained 4% absolute EER improvement for both-gender trials on the 10 s-10 s condition compared to the classical joint factor analysis scoring. Index Terms: Cosine distance scoring, joint factor analysis (JFA), support vector machines (SVMs), total variability space.
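The second system in this abstract scores a trial directly with the cosine similarity between two vectors in the total variability space. A minimal sketch of that scoring step follows, assuming the two vectors have already been extracted. The function names, the example vectors, and the threshold are hypothetical, and the channel compensation steps named in the abstract (LDA, WCCN, NAP) are not included.

```python
import math


def cosine_score(w_target, w_test):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(w_target) != len(w_test):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm_target = math.sqrt(sum(a * a for a in w_target))
    norm_test = math.sqrt(sum(b * b for b in w_test))
    if norm_target == 0.0 or norm_test == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_target * norm_test)


def accept(w_target, w_test, threshold):
    """Accept the trial when the score reaches a (hypothetical) threshold."""
    return cosine_score(w_target, w_test) >= threshold


if __name__ == "__main__":
    # Toy vectors for illustration only; real vectors have hundreds of dimensions.
    enrolled = [0.2, -1.0, 0.5]
    test = [0.1, -0.8, 0.7]
    print(cosine_score(enrolled, test))
    print(accept(enrolled, test, threshold=0.5))
```

Because the score depends only on the angle between the vectors, it ignores their magnitudes, and it needs no target-speaker model training beyond extracting the enrollment vector.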
An overview of text-independent speaker recognition: from features to supervectors
- 2009
Cited by 156 (37 self)
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with discussion on future directions.
A study of inter-speaker variability in speaker verification
- IEEE Trans. Audio, Speech and Language Processing, 2008
Cited by 131 (12 self)
We propose a new approach to the problem of estimating the hyperparameters which define the inter-speaker variability model in joint factor analysis. We tested the proposed estimation technique on the NIST 2006 speaker recognition evaluation data and obtained 10–15% reductions in error rates on the core condition and the extended data condition (as measured both by equal error rates and the NIST detection cost function). We show that when a large joint factor analysis model is trained in this way and tested on the core condition, the extended data condition and the cross-channel condition, it is capable of performing at least as well as fusions of multiple systems of other types. (The comparisons are based on the best results on these tasks that have been reported in the literature.) In the case of the cross-channel condition, a factor analysis model with 300 speaker factors and 200 channel factors can achieve equal error rates of less than 3.0%. This is a substantial improvement over the best results that have previously been reported on this task. Index Terms: Speaker verification, Gaussian mixture model, speaker factors, channel factors.
Joint factor analysis versus eigenchannels in speaker recognition
- IEEE Trans. Audio, Speech, Lang. Process, 2007
Cited by 115 (13 self)
We compare two approaches to the problem of session variability in GMM-based speaker verification, eigenchannels and joint factor analysis, on the NIST 2005 speaker recognition evaluation data. We show how the two approaches can be implemented using essentially the same software at all stages except for the enrollment of target speakers. We demonstrate the effectiveness of zt-norm score normalization and a new decision criterion for speaker recognition which can handle large numbers of t-norm speakers and large numbers of speaker factors at little computational cost. We found that factor analysis was far more effective than eigenchannel modeling. The best result we obtained was a detection cost of 0.016 on the core condition (all trials) of the evaluation. Index Terms: Speaker verification, Gaussian mixture model, speaker factors, channel factors, eigenchannels.
An overview of automatic speaker diarization systems
- IEEE TASLP, 2006
Cited by 100 (2 self)
Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization can be used for helping speech recognition, facilitating the searching and indexing of audio archives, and increasing the richness of automatic transcriptions, making them more readable. In this paper, we provide an overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, and discuss their relative merits and limitations. Performances using the different techniques are compared within the framework of the speaker diarization task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real broadcast news systems and their portability to other domains and tasks such as meetings and speaker verification. Index Terms: Speaker diarization, speaker segmentation and clustering.
Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST speaker recognition evaluation 2006
- IEEE Transactions on Audio, Speech and Signal Processing, 2007
Cited by 63 (14 self)
This paper describes and discusses the ‘STBU’ speaker recognition system, which performed well in the NIST Speaker Recognition Evaluation 2006 (SRE). STBU is a consortium
Speaker and session variability in GMM-based speaker verification
- IEEE Trans. Audio, Speech, Lang. Process, 2007
Cited by 55 (9 self)
We present a corpus-based approach to speaker verification in which maximum likelihood II criteria are used to train a large scale generative model of speaker and session variability which we call joint factor analysis. Enrolling a target speaker consists in calculating the posterior distribution of the hidden variables in the factor analysis model and verification tests are conducted using a new type of likelihood II ratio statistic. Using the NIST 1999 and 2000 speaker recognition evaluation data sets, we show that the effectiveness of this approach depends on the availability of a training corpus which is well matched with the evaluation set used for testing. Experiments on the NIST 1999 evaluation set using a mismatched corpus to train factor analysis models did not result in any improvement over standard methods but we found that, even with this type of mismatch, feature warping performs extremely well in conjunction with the factor analysis model and this enabled us to obtain very good results (equal error rates of about 6.2%). Index Terms: speaker verification, Gaussian mixture, factor analysis.
Factor analysis simplified
- in ICASSP, 2005
Cited by 40 (6 self)
We show how the factor analysis model for speaker verification can be successfully implemented using some fast approximations which result in minor degradations in accuracy and open up the possibility of training the model on very large databases such as the union of all of the Switchboard corpora. We tested our algorithms on the NIST 1999 evaluation set (carbon data as well as electret). Using warped cepstral features we obtained equal error rates of about 6.3% and minimum detection costs of about 0.022.
Short-Time Gaussianization for Robust Speaker Verification
- in Proc. ICASSP, 2002
Cited by 39 (4 self)
namely short-time Gaussianization, is proposed. Short-time Gaussianization is initiated by a global linear transformation of the features, followed by a short-time windowed cumulative distribution function (CDF) matching. First, the linear transformation in the feature space leads to local independence or decorrelation. Then the CDF matching is applied to segments of speech localized in time and tries to warp a given feature so that its CDF matches a normal distribution. It is shown that one of the recent techniques used for speaker recognition, feature warping [1], can be formulated within the framework of Gaussianization. Compared to the baseline system with cepstral mean subtraction (CMS), around relative improvement in both equal error rate (EER) and minimum detection cost function (DCF) is obtained on NIST 2001 cellular phone data evaluation.
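The short-time windowed CDF matching step described in this abstract can be sketched for a single feature stream as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name, the rank-to-probability rule `(rank - 0.5) / n`, the centered window, and the default window length are all hypothetical choices, and the global linear transformation that precedes the matching is omitted.

```python
from statistics import NormalDist


def warp_feature(values, window=301):
    """Warp one feature stream so that, within a sliding window, each value
    is replaced by the standard normal quantile of its rank.

    `values` holds one number per frame. `window` is the window length in
    frames (an assumed default); it is clipped at the edges of the stream.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    half = window // 2
    warped = []
    for t, x in enumerate(values):
        lo = max(0, t - half)
        hi = min(len(values), t + half + 1)
        segment = values[lo:hi]
        # Rank of the current frame inside its window (1 = smallest).
        rank = 1 + sum(1 for v in segment if v < x)
        # Map the rank to a probability strictly between 0 and 1.
        p = (rank - 0.5) / len(segment)
        warped.append(std_normal.inv_cdf(p))
    return warped


if __name__ == "__main__":
    print(warp_feature([1.0, 2.0, 3.0], window=3))
```

Since only ranks within the window are used, any monotonically increasing distortion of the input leaves the warped output unchanged, which is one way to see why this kind of matching is robust to slowly varying channel effects.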
Feature And Score Normalization For Speaker Verification Of Cellular Data
- in Proc. ICASSP, 2003
Cited by 34 (3 self)
This paper presents some experiments with feature and score normalization for text-independent speaker verification of cellular data. The speaker verification system is based on cepstral features and Gaussian mixture models with 1024 components. The following methods, which have been proposed for feature and score normalization, are reviewed and evaluated on cellular data: cepstral mean subtraction (CMS), variance normalization, feature warping, T-norm, Z-norm and the cohort method. We found that the combination of feature warping and T-norm gives the best results on the NIST 2002 test data (for the one-speaker detection task). Compared to a baseline system using both CMS and variance normalization and achieving a 0.410 minimal decision cost function (DCF), feature warping and T-norm respectively bring 8% and 12% relative reductions, whereas the combination of both techniques yields a 22% relative reduction, reaching a DCF of 0.320. This result approaches the state-of-the-art performance level obtained for speaker verification with land-line telephone speech.
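The relative reductions quoted in this abstract can be checked with a few lines of arithmetic. The sketch below recomputes the combined figure from the two reported DCF values (0.410 and 0.320); the helper name is hypothetical. The per-technique DCF values it prints are derived from the quoted 8% and 12% reductions and are not reported in the abstract.

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    baseline_dcf = 0.410   # CMS + variance normalization (reported)
    combined_dcf = 0.320   # feature warping + T-norm (reported)

    combined = relative_reduction(baseline_dcf, combined_dcf)
    print(f"combined relative reduction: {100 * combined:.1f}%")

    # Implied (not reported) DCF values for each technique on its own.
    for name, reduction in [("feature warping", 0.08), ("T-norm", 0.12)]:
        implied = baseline_dcf * (1.0 - reduction)
        print(f"{name}: implied DCF of about {implied:.3f}")
```

The combined figure works out to (0.410 - 0.320) / 0.410, or about 0.2195, which rounds to the 22% stated in the abstract.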