Computationally Efficient and Robust BIC-Based Speaker Segmentation
In Proc. ICASSP, 2008
Abstract An algorithm for automatic speaker segmentation based on the Bayesian Information Criterion (BIC) is presented. BIC tests are not performed for every window shift (e.g., every few milliseconds), as in previous approaches, but only when a speaker change is most probable to occur. This is done by estimating the next probable change point using a model of utterance durations; the inverse Gaussian is found to fit the distribution of utterance durations best. As a result, fewer BIC tests are needed, making the proposed system less demanding in computation time and memory, and considerably more efficient with respect to missed speaker change points. A feature selection algorithm based on a branch-and-bound search strategy is applied in order to identify the most efficient features for speaker segmentation. Furthermore, a new theoretical formulation of BIC is derived by applying centering and simultaneous diagonalization. This formulation is considerably more computationally efficient than the standard BIC when the covariance matrices are estimated by estimators other than the usual maximum likelihood ones. Two commonly used pairs of figures of merit are employed and their relationship is established. Computational efficiency is achieved through the speaker utterance modeling, whereas robustness is achieved by feature selection and by applying BIC tests at appropriately selected time instants. Experimental results indicate that the proposed modifications yield superior performance compared to existing approaches.
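The BIC test at the core of this abstract can be illustrated with a minimal sketch. This is the generic ΔBIC comparison under full-covariance Gaussian models, not the paper's centered/diagonalized formulation; the penalty weight `lam` is the usual tuning parameter:

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Generic Delta-BIC: model the concatenation of two segments with one
    full-covariance Gaussian versus one Gaussian per segment.
    A positive value supports a speaker change between x and y."""
    z = np.vstack([x, y])
    n1, n2, n = len(x), len(y), len(z)
    d = z.shape[1]

    def logdet(a):
        # log-determinant of the maximum-likelihood covariance estimate
        return np.linalg.slogdet(np.cov(a, rowvar=False, bias=True))[1]

    # penalty for the extra parameters: d means + d(d+1)/2 covariance terms
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(z) - n1 * logdet(x) - n2 * logdet(y)) - lam * penalty

rng = np.random.default_rng(0)
same = delta_bic(rng.normal(0, 1, (200, 4)), rng.normal(0, 1, (200, 4)))
diff = delta_bic(rng.normal(0, 1, (200, 4)), rng.normal(5, 1, (200, 4)))
print(same < 0, diff > 0)
```

The paper's contribution is to evaluate this statistic only at time instants predicted by the inverse-Gaussian duration model, instead of at every window shift.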
Learning Speaker-Specific Characteristics with a Deep Neural Architecture, 2011
Speech signals convey various yet mixed information, ranging from linguistic to speaker-specific components. However, most acoustic representations characterize all these kinds of information as a whole, which can hinder either a speech or a speaker recognition (SR) system from achieving better performance. In this paper, we propose a novel deep neural architecture (DNA) for learning speaker-specific characteristics from mel-frequency cepstral coefficients (MFCCs), an acoustic representation commonly used in both speech recognition and SR, which results in a speaker-specific overcomplete representation. In order to learn intrinsic speaker-specific characteristics, we formulate an objective function consisting of contrastive losses, in terms of speaker similarity/dissimilarity, and data reconstruction losses used as regularization to normalize the interference of non-speaker-related information. Moreover, we employ a hybrid strategy for learning the parameters of the deep neural networks: local yet greedy layerwise unsupervised pretraining for initialization, followed by global supervised learning for the ultimate discriminative goal. With four Linguistic Data Consortium (LDC) benchmarks and two non-English corpora, we demonstrate that our overcomplete representation is robust in characterizing various speakers, regardless of whether their utterances were used in training our DNA, and highly insensitive to the text and language spoken. Extensive comparative studies suggest that our approach yields favorable results in speaker verification and segmentation. Finally, we discuss several issues concerning the proposed approach.
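The pairwise contrastive term described in this abstract can be sketched as the standard margin-based contrastive loss on a pair of embeddings. This is an illustration only: the paper's full objective also includes reconstruction losses as regularization, and the margin value here is an assumption:

```python
import numpy as np

def contrastive_loss(e1, e2, same_speaker, margin=1.0):
    """Margin-based contrastive loss on two embedding vectors:
    same-speaker pairs are pulled together (squared distance),
    different-speaker pairs are pushed at least `margin` apart."""
    d = np.linalg.norm(e1 - e2)
    if same_speaker:
        return d ** 2
    return max(0.0, margin - d) ** 2

a, b = np.array([0.2, 0.1]), np.array([0.3, 0.0])
c = np.array([2.0, 2.0])
print(contrastive_loss(a, b, True))   # small: same speaker, close embeddings
print(contrastive_loss(a, c, False))  # zero: different speakers, already beyond the margin
```

Minimizing this loss over many labeled pairs is what drives the learned representation toward speaker similarity/dissimilarity rather than linguistic content.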
Extracting Speaker-Specific Information with a Regularized Siamese Deep Network
Speech conveys different yet mixed information, ranging from linguistic to speaker-specific components, and each component should be used exclusively in its corresponding task. However, it is extremely difficult to extract a specific information component, given that nearly all existing acoustic representations carry all types of speech information. Thus, the use of the same representation in both speech and speaker recognition hinders a system from producing better performance due to interference from irrelevant information. In this paper, we present a deep neural architecture to extract speaker-specific information from MFCCs. To this end, a multi-objective loss function is proposed for learning speaker-specific characteristics, with regularization that normalizes the interference of non-speaker-related information and avoids information loss. With LDC benchmark corpora and a Chinese speech corpus, we demonstrate that the resulting speaker-specific representation is insensitive to the text/language spoken and to environmental mismatches, and hence outperforms MFCCs and other state-of-the-art techniques in speaker recognition. We discuss relevant issues and relate our approach to previous work.
An Information-Geometric Approach to Real-Time Audio Segmentation, 2013
Abstract—We present a generic approach to real-time audio segmentation in the framework of information geometry for exponential families. The proposed system detects changes by monitoring the information rate of the signals as they arrive in time. We also address shortcomings of traditional cumulative sum approaches to change detection, which assume known parameters before the change. This is done by considering exact generalized likelihood ratio test statistics, with complete estimation of the unknown parameters under the respective hypotheses. We derive an efficient sequential scheme to compute these statistics through convex duality. Finally, we provide results for the segmentation of speech into speakers and of polyphonic music into note slices. Index Terms—Audio segmentation, real-time system, information geometry, change detection.
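The generalized likelihood ratio idea with unknown parameters can be sketched in its simplest textbook form: a scan for a single mean change in a univariate Gaussian with known variance, estimating the unknown means on both sides. This is not the authors' sequential exponential-family scheme, only an illustration of the statistic:

```python
import numpy as np

def glr_change_point(x, sigma=1.0):
    """Scan a window x for a single change in mean, assuming Gaussian
    noise with known variance sigma**2; the unknown means before and
    after the change are replaced by their maximum-likelihood estimates.
    Returns (max statistic, estimated change index)."""
    n = len(x)
    best, best_k = -np.inf, None
    for k in range(2, n - 1):
        m1, m2 = x[:k].mean(), x[k:].mean()
        # GLR statistic for H1 (change at k) versus H0 (constant mean)
        stat = k * (n - k) / n * (m1 - m2) ** 2 / (2 * sigma ** 2)
        if stat > best:
            best, best_k = stat, k
    return best, best_k

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
stat, k = glr_change_point(x)
print(k)  # close to the true change point at index 100
```

The paper's contribution is to compute such statistics sequentially and efficiently for general exponential families via convex duality, rather than by this brute-force scan.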
Speaker Change Detection Based on a Graph-Partitioning Criterion, 2011
Speaker change detection involves identifying the time indices of an audio stream at which the identity of the speaker changes. In this paper, we propose novel measures for speaker change detection based on a graph-partitioning criterion over the pairwise distance matrix of the feature-vector stream. Experiments performed on both synthetic and real-world data show that the proposed approach yields promising results compared with conventional statistical measures.
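One simple instance of a graph-partitioning criterion over a pairwise distance matrix is to score each candidate split of a window by the ratio of between-block to within-block distances. This sketch illustrates the idea; the paper's exact measures may differ:

```python
import numpy as np

def split_score(D, t):
    """Score a split of the window at frame t from the pairwise distance
    matrix D: mean distance across the cut divided by the mean distance
    within the two blocks. High values suggest that frames before and
    after t come from different speakers."""
    n = len(D)
    within = np.concatenate([D[:t, :t][np.triu_indices(t, 1)],
                             D[t:, t:][np.triu_indices(n - t, 1)]])
    between = D[:t, t:].ravel()
    return between.mean() / within.mean()

rng = np.random.default_rng(0)
# two synthetic "speakers": 40 frames each, well separated in feature space
X = np.vstack([rng.normal(0, 0.5, (40, 2)), rng.normal(5, 0.5, (40, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
best_t = max(range(2, len(X) - 2), key=lambda t: split_score(D, t))
print(best_t)  # at or near the true boundary (frame 40)
```

Scanning this score over time and picking local maxima above a threshold yields candidate speaker change points.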
Rethinking Algorithm Design and Development in Speech Processing
Abstract Speech processing is typically based on a set of complex algorithms requiring many parameters to be specified. When parts of the speech processing chain do not behave as expected, trial and error is often the only way to investigate the reasons. In this paper, we present a research methodology for analyzing unexpected algorithmic behavior by making (intermediate) results of the speech processing chain perceivable and intuitively comprehensible to humans. The workflow of the process is illustrated with a real-world example that led to considerable improvements in speaker clustering. The described methodology is supported by a software toolbox available for download.
Speaker Diarization Using Divide-and-Conquer
Speaker diarization systems usually consist of two core components: speaker segmentation and speaker clustering. Current state-of-the-art speaker diarization systems typically apply hierarchical agglomerative clustering (HAC) for speaker clustering after segmentation. However, HAC's quadratic computational complexity with respect to the number of data samples inevitably limits its application to large-scale data sets. In this paper, we propose a divide-and-conquer (DAC) framework for speaker diarization. It recursively partitions the input speech stream into two sub-streams, performs diarization on each separately, and then combines the resulting diarization outputs using HAC. Experiments conducted on RT-02 and RT-03 broadcast news data show that the proposed framework is faster than the conventional segmentation-and-clustering approach while achieving comparable diarization accuracy. Moreover, the proposed framework obtains a higher speedup over the conventional approach on a larger test data set. Index Terms: speaker diarization, speaker segmentation, speaker clustering, divide-and-conquer
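The divide-and-conquer recursion can be sketched with a toy centroid-linkage HAC over segment feature vectors. The threshold, linkage, and feature model here are placeholders, not the systems described in the paper:

```python
import numpy as np

def hac(clusters, threshold):
    """Toy agglomerative clustering over lists of feature vectors:
    repeatedly merge the two clusters with the closest centroids,
    until no pair of centroids is closer than `threshold`."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        cents = [np.mean(c, axis=0) for c in clusters]
        d, i, j = min((np.linalg.norm(cents[i] - cents[j]), i, j)
                      for i in range(len(cents))
                      for j in range(i + 1, len(cents)))
        if d > threshold:
            break
        clusters[i].extend(clusters.pop(j))
    return clusters

def dac_diarize(segments, threshold=2.0, min_size=4):
    """Divide-and-conquer: recursively split the segment stream in two,
    cluster each half independently, then combine the partial
    clusterings with one more HAC pass."""
    if len(segments) <= min_size:
        return hac([[s] for s in segments], threshold)
    mid = len(segments) // 2
    return hac(dac_diarize(segments[:mid], threshold, min_size)
               + dac_diarize(segments[mid:], threshold, min_size),
               threshold)
```

Because each HAC pass runs over far fewer items than the full stream, the recursion avoids applying the quadratic-cost clustering to all segments at once.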
KNNDIST: A Non-Parametric Distance Measure for Speaker Segmentation
A novel distance measure for distance-based speaker segmentation is proposed. This distance measure is non-parametric, in contrast to the distance measures commonly used in speaker segmentation systems, which often assume a Gaussian distribution when measuring the distance between two audio segments. The proposed measure is essentially a k-nearest-neighbor distance measure. Non-vowel segment removal in the preprocessing stage is also proposed. Speaker segmentation performance is tested on artificially created conversations from the TIMIT database and on two AMI conversations. For short window lengths, the Missed Detection Rate is decreased significantly. For moderate window lengths, both the Missed Detection and False Alarm Rates decrease. The computational cost of the distance measure is high for long window lengths. Index Terms: speaker segmentation, distance measure, k-nearest-neighbor
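A k-nearest-neighbor distance between two segments can be sketched as follows; the paper's exact definition (symmetrization, normalization, choice of k) may differ, so treat this as an illustration:

```python
import numpy as np

def knn_distance(a, b, k=5):
    """Non-parametric distance between two frame sequences a and b,
    each of shape (n_frames, dim): for every frame in one segment,
    take the distance to its k-th nearest neighbour in the other
    segment, average these, and symmetrize over both directions."""
    def one_way(x, y):
        d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
        return np.sort(d, axis=1)[:, k - 1].mean()
    return 0.5 * (one_way(a, b) + one_way(b, a))

rng = np.random.default_rng(0)
s1 = rng.normal(0, 1, (100, 4))
s2 = rng.normal(0, 1, (100, 4))   # same "speaker" distribution
s3 = rng.normal(4, 1, (100, 4))   # different "speaker"
print(knn_distance(s1, s2) < knn_distance(s1, s3))
```

The quadratic pairwise-distance computation in `one_way` is also why the abstract notes a high computational cost for long window lengths.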
Using Class Purity as Criterion for Speaker Clustering in Multi-Speaker Detection Tasks
Abstract—Speaker clustering is an important step in multi-speaker detection tasks, and its performance directly affects speaker detection performance. It is observed that the shorter the average length of the single-speaker speech segments produced by segmentation, the worse the performance of the subsequent speaker recognition. A reasonable route to better multi-speaker detection performance is therefore to enlarge the average length of the post-segmentation single-speaker segments, which is equivalent to clustering as many segments from the same speaker as possible into one; in other words, the average class purity of each speaker cluster should be as high as possible. Accordingly, a speaker-clustering algorithm based on the class purity criterion is proposed, in which a Reference Speaker Model (RSM) scheme is adopted to calculate the distance between speech segments, and the maximal class purity, or equivalently the minimal within-class dispersion, is taken as the criterion. Experiments on the NIST SRE 2006 database showed that, compared with the conventional Hierarchical Agglomerative Clustering (HAC) algorithm, for speech segments with average lengths of 2, 5 and 8 seconds, the proposed algorithm increased the valid class speech length by 2.7%, 3.8% and 4.6%, respectively, and increased the target speaker detection recall rate by 7.6%, 6.2% and 5.1%, respectively.
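Class purity itself is simple to compute from true speaker labels; a minimal sketch (the RSM distance and the clustering algorithm are beyond this illustration):

```python
from collections import Counter

def class_purity(speaker_labels):
    """Purity of one cluster: the fraction of its segments belonging
    to the most frequent true speaker in that cluster."""
    counts = Counter(speaker_labels)
    return max(counts.values()) / len(speaker_labels)

def average_purity(clusters):
    """Length-weighted average purity over a whole clustering."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) * class_purity(c) for c in clusters) / total

# two clusters of true speaker labels: one impure, one pure
print(average_purity([["spk1", "spk1", "spk2"], ["spk2"]]))  # 0.75
```

During clustering, the criterion favors merges that keep this purity high, which is equivalent to keeping the within-class dispersion low.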