Results 1 - 10 of 81
An overview of text-independent speaker recognition: from features to supervectors
, 2009
Cited by 156 (37 self)
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with discussion on future directions.
Cosine Similarity Scoring without Score Normalization Techniques
Cited by 36 (3 self)
In recent work [1], a simplified and highly effective approach to speaker recognition based on the cosine similarity between low-dimensional vectors, termed i-vectors, defined in a total variability space was introduced. The total variability space representation is motivated by the popular Joint Factor Analysis (JFA) approach, but does not require the complication of estimating separate speaker and channel spaces and has been shown to be less dependent on score normalization procedures, such as z-norm and t-norm. In this paper, we introduce a modification to the cosine similarity that does not require explicit score normalization, relying instead on simple mean and covariance statistics from a collection of impostor speaker i-vectors. By avoiding the complication of z- and t-norm, the new approach further allows for application of a new unsupervised speaker adaptation technique to models defined in the i-vector space. Experiments are conducted on the core condition of the NIST 2008 corpora, where, with adaptation, the new approach produces an equal error rate (EER) of 4.8 % and min decision cost function (MinDCF) of 2.3 % on all female speaker trials.
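As a rough illustration of the baseline this abstract builds on, the following is a minimal sketch of plain cosine similarity scoring between two i-vectors. The function name `cosine_score` and the toy vectors are assumptions made for illustration; the paper's modified scoring, which uses impostor mean and covariance statistics, is not reproduced here because the abstract does not give its formula.

```python
import math


def cosine_score(w_target, w_test):
    """Cosine similarity between two equal-length vectors.

    Hypothetical helper: the inputs stand in for a target-speaker
    i-vector and a test-utterance i-vector.
    """
    if len(w_target) != len(w_test):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm_target = math.sqrt(sum(a * a for a in w_target))
    norm_test = math.sqrt(sum(b * b for b in w_test))
    if norm_target == 0.0 or norm_test == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_target * norm_test)


if __name__ == "__main__":
    # Toy 3-dimensional vectors; real i-vectors have hundreds of dimensions.
    score = cosine_score([0.2, -0.1, 0.7], [0.25, -0.05, 0.6])
    print(f"cosine score: {score:.4f}")
```

In a verification setting, such a score would typically be compared against a decision threshold; how that threshold is chosen is outside the scope of this sketch.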
Language recognition via i-vectors and dimensionality reduction
 in Interspeech
, 2011
Cited by 35 (4 self)
In this paper, a new language identification system is presented based on the total variability approach previously developed in the field of speaker identification. Various techniques are employed to extract the most salient features in the lower dimensional i-vector space and the system developed results in excellent performance on the 2009 LRE evaluation set without the need for any post-processing or back-end techniques. Additional performance gains are observed when the system is combined with other acoustic systems.
Graph embedding for speaker recognition
 in Proc. Interspeech, 2010
Cited by 28 (6 self)
This chapter presents applications of graph embedding to the problem of text-independent speaker recognition. Speaker recognition is a general term encompassing multiple applications. At the core is the problem of speaker comparison: given two speech recordings (utterances), produce a score which measures speaker simi
An i-vector Extractor Suitable for Speaker Recognition with both Microphone and Telephone Speech
Cited by 22 (3 self)
It is widely believed that speaker verification systems perform better when there is sufficient background training data to deal with nuisance effects of transmission channels. It is also known that these systems perform at their best when the sound environment of the training data is similar to that of the context of use (test context). For some applications, however, training data from the same type of sound environment is scarce, whereas a considerable amount of data from a different type of environment is available. In this paper, we propose a new architecture for text-independent speaker verification systems that are satisfactorily trained by virtue of a limited amount of application-specific data, supplemented with a sufficient amount of training data from some other context. This architecture is based on the extraction of parameters (i-vectors) from a low-dimensional space (total variability space) proposed by Dehak [1]. Our aim is to extend Dehak’s work to speaker recognition on sparse data, namely microphone speech. The main challenge is to overcome the fact that insufficient application-specific data is available to accurately estimate the total variability covariance matrix. We propose a method based on Joint Factor Analysis (JFA) to estimate microphone eigenchannels (sparse data) with telephone eigenchannels (sufficient data). For classification, we experimented with the following two approaches: Support Vector Machines (SVM) and Cosine Distance Scoring (CDS) classifier, based on cosine distances. We present recognition results for the part of female voices in the interview data of the NIST 2008 SRE. The best performance is obtained when our system is fused with the state-of-the-art JFA. We achieve 13 % relative improvement on equal error rate and the minimum value of detection cost function decreases from 0.0219 to 0.0164.
Duration Mismatch Compensation for I-vector based Speaker Recognition Systems
 in Proc. IEEE ICASSP
, 2013
Cited by 16 (7 self)
Speaker recognition systems trained on long duration utterances are known to perform significantly worse when short test segments are encountered. To address this mismatch, we analyze the effect of duration variability on phoneme distributions of speech utterances and i-vector length. We demonstrate that, as utterance duration is decreased, the number of detected unique phonemes and the i-vector length approach zero in a logarithmic and nonlinear fashion, respectively. Modeling duration variability as additive noise in the i-vector space, we propose three different strategies for its compensation: i) multi-duration training in Probabilistic Linear Discriminant Analysis (PLDA) model, ii) score calibration using log duration as a Quality Measure Function (QMF), and iii) multi-duration PLDA training with synthesized short duration i-vectors. Experiments are designed based on the 2012 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) protocol with varying test utterance duration. Experimental results demonstrate the effectiveness of the proposed schemes on short duration test conditions, especially with the QMF calibration approach.
Unsupervised Speaker Adaptation based on the Cosine Similarity for Text-Independent Speaker Verification
Cited by 16 (6 self)
This paper proposes a new approach to unsupervised speaker adaptation inspired by the recent success of the factor analysis-based Total Variability Approach to text-independent speaker verification [1]. This approach effectively represents speaker variability in terms of low-dimensional total factor vectors and, when paired with the simplicity of cosine similarity scoring, allows for easy manipulation and efficient computation [2]. The development of our adaptation algorithm is motivated by the desire to have a robust method of setting an adaptation threshold, to minimize the amount of required computation for each adaptation update, and to simplify the associated score normalization procedures where possible. To address the final issue, we propose the Symmetric Normalization (S-norm) method, which takes advantage of the symmetry in cosine similarity scoring and achieves performance competitive with that of the ZT-norm while requiring fewer parameter calculations. In our subsequent experiments, we also assess an attempt to completely replace the use of score normalization procedures with a Normalized Cosine Similarity scoring function [3]. We evaluated the performance of our unsupervised speaker adaptation algorithm under various score normalization procedures on the 10sec-10sec and core conditions of the 2008 NIST SRE dataset. Using no-adaptation results as our baseline, it was found that the proposed methods are consistent in successfully improving speaker verification performance to achieve state-of-the-art results.
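The symmetric normalization idea can be sketched as follows. This is an illustrative formulation under stated assumptions, not the paper's definition: the abstract does not give the formula, and the version below sums two z-score style terms, one computed from the enrollment vector against an impostor cohort and one from the test vector against the same cohort (some descriptions average the two terms instead). The names `cosine_score`, `s_norm`, and `cohort` are hypothetical.

```python
import math
import statistics


def cosine_score(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def s_norm(w_enroll, w_test, cohort):
    """Symmetrically normalized cosine score (assumed formulation).

    cohort is a list of impostor vectors. Each side of the trial is
    scored against the cohort, and the raw score is standardized with
    each side's cohort mean and standard deviation.
    """
    raw = cosine_score(w_enroll, w_test)
    enroll_scores = [cosine_score(w_enroll, c) for c in cohort]
    test_scores = [cosine_score(w_test, c) for c in cohort]
    mu_e = statistics.mean(enroll_scores)
    sd_e = statistics.pstdev(enroll_scores)
    mu_t = statistics.mean(test_scores)
    sd_t = statistics.pstdev(test_scores)
    if sd_e == 0.0 or sd_t == 0.0:
        raise ValueError("cohort scores have zero spread; cannot normalize")
    return (raw - mu_e) / sd_e + (raw - mu_t) / sd_t


if __name__ == "__main__":
    cohort = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(s_norm([1.0, 2.0], [2.0, 1.0], cohort))
```

Because cosine similarity is symmetric in its arguments, swapping the enrollment and test vectors leaves this score unchanged, which is the property the abstract refers to.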
Source-normalised-and-weighted LDA for robust speaker recognition using i-vectors
 in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (accepted)
, 2011
Cited by 15 (5 self)
The recently developed i-vector framework for speaker recognition has set a new performance standard in the research field. An i-vector is a compact representation of a speaker utterance extracted from a low-dimensional total variability subspace. Prior to classification using a cosine kernel, i-vectors are projected into an LDA space in order to reduce inter-session variability and enhance speaker discrimination. The accurate estimation of this LDA space from a training dataset is crucial to classification performance. A typical training dataset, however, does not consist of utterances acquired from all sources of interest (i.e., telephone, microphone and interview speech sources) for each speaker. This has the effect of introducing source-related variation in the between-speaker covariance matrix and results in an incomplete representation of the within-speaker scatter matrix used for LDA. Proposed is a novel source-normalised-and-weighted LDA algorithm developed to improve the robustness of i-vector-based speaker recognition under both mismatched evaluation conditions and conditions for which insufficient speech resources are available for adequate system development. Evaluated on the recent NIST 2008 and 2010 Speaker Recognition Evaluations (SRE), the proposed technique demonstrated improvements of up to 31 % in minimum DCF and EER under mismatched and sparsely-resourced conditions. Index Terms: speaker recognition, linear discriminant analysis, i-vector, total variability, source variability
Support vector machines and joint factor analysis for speaker verification
 Proceedings of ICASSP 2009
Cited by 12 (2 self)
This article presents several techniques for combining Support Vector Machines (SVM) with the Joint Factor Analysis (JFA) model for speaker verification. In this combination, the SVMs are applied to different sources of information produced by the JFA: the Gaussian Mixture Model supervectors, and the speaker and common factors. We found that using SVMs on the JFA factors gave the best results, especially when the within-class covariance normalization method is applied to compensate for the channel effect. The results of the new combination are comparable to those of other classical JFA scoring techniques.
Speaker verification using simplified and supervised i-vector modeling
 to appear in Proc. of ICASSP
, 2013
Cited by 11 (9 self)
This paper presents a simplified and supervised i-vector modeling framework that is applied in the task of robust and efficient speaker verification (SRE). First, by concatenating the mean supervector and the i-vector factor loading matrix with the label vector and the linear classifier matrix, respectively, the traditional i-vectors are extended to label-regularized supervised i-vectors. These supervised i-vectors are optimized to not only reconstruct the mean supervectors well but also minimize the mean squared error between the original and the reconstructed label vectors, such that they become more discriminative. Second, factor analysis (FA) can be performed on the pre-normalized centered GMM first order statistics supervector to ensure that the Gaussian statistics sub-vector of each Gaussian component is treated equally in the FA, which reduces the computational cost significantly. Experimental results are reported on the female part of the NIST SRE 2010 task with common condition 5. The proposed supervised i-vector approach outperforms the i-vector baseline by 12 % and 7 % relative in terms of equal error rate (EER) and norm old minDCF values, respectively. Index Terms: Speaker verification, Simplified i-vector, Supervised i-vector