H. Jin, F. Kubala and R. Schwartz, "Automatic speaker clustering", Proceedings of the Speech Recognition Workshop, pp 108-111, 1997.
.... the number of clusters or the minimum size of each cluster; accordingly, one can go down the tree to obtain the desired clustering [30]. Another heuristic solution is to threshold the distance measures during the hierarchical process; the thresholding level is tuned on a training set [28]. Jin et al. [25] shed some light on automatically choosing a clustering solution. In this paper, we view clustering as a model selection problem and show that the BIC criterion is an effective termination criterion for hierarchical clustering. The diversity of BN speech data (several speakers, channel, and noise ....
H. Jin, F. Kubala and R. Schwartz, "Automatic speaker clustering", Proceedings of the Speech Recognition Workshop, pp 108-111, 1997.
....background model is the closest model to the test model. 4.3. Speaker Classification Speaker classification, as a direct product of the tree building, is very useful on many different occasions, including narrowing down the search space for speaker recognition. The systems presented in [3, 4, 5] use speaker classification for performing speaker segmentation as well as improving speech recognition accuracy through adaptation. 5. INITIAL RESULTS Please also note that if the claimant is an impostor and just happens to be closest to the claimed identity in the cohort which is picked, with ....
H. Jin and F. Kubala and R. Schwartz, "Automatic speaker clustering", Proceedings of Speech Recognition Workshop, ARPA, Chantilly, VA, Feb. 1997.
....when the segmentation is followed by a speech recognition phase. Indeed, every time that a boundary is introduced in the middle of a word, it introduces one or two word errors. It is also a characteristic of methods which arbitrarily cut the input speech into small segments and then re-cluster them [6, 7]. 2 Boundary detection The proposed algorithm uses adjacent windows which are shifted in time over the whole data. Acoustic features are computed over these windows and clustered. Different clustering algorithms are compared in terms of their speed and performance. As we will illustrate later, the ....
....noises from the end times associated with changes. Re-clustering of the features associated with each new segment can be used to eliminate spurious end times. The clustering algorithm that performs the best is different from that which has previously been used with the chop and re-cluster method [6, 7]. In this paper we use a distance measure we have developed to provide a good measure of the difference between two mixtures of Gaussians. Also, based on the same concept from which this distance measure was derived, we have developed a merging technique for merging models (mixtures of Gaussians) ....
H. Jin and F. Kubala and R. Schwartz, "Automatic Speaker Clustering," Proceedings of the Speech Recognition Workshop, Chantilly, VA, February 1997.
....all the speech segments from the same speaker into one or just a few clusters so that a training model would be built for each speaker. Later, in a decoding session, an action similar to that described in the previous section is taken to decode the incoming speech with a more suitable model [1, 4, 5, 6]. 3.3. Speaker Verification One field of speaker recognition is speaker verification, in which a speaker makes a claim on his/her identity and a sample of his/her speech is analyzed by the computer to evaluate the nature of the claim. If his/her speech accurately matches the claim, he/she is ....
H. Jin and F. Kubala and R. Schwartz, "Automatic speaker clustering", Proceedings of Speech Recognition Workshop, ARPA, Chantilly, VA, Feb. 1997.
....is recognized should be also segmented according to speaker identity. Automatic segmentation of an audio stream is frequently introduced as an efficient method to improve performances of adaptive speech recognizers. The problem of acoustic segmentation has been often addressed for a few years [7, 8, 11, 5, 6, 9]: it con .... [Footnote: This work was supported by the ESPRIT Long Term Research Project THISL (23495), http://www.dcs.shef.ac.uk/research/groups/spandh/projects/thisl] [Figure 1: Chop and Recluster] ....
....processed separately. The Bhattacharyya (3) cluster distance is used to chop the audio streams and the average linkage within groups scheme is chosen for reclustering the segments. Such algorithms are generally used to enhance the performances of unsupervised adaptation for speech recognition [7, 11, 6, 9]. Hence, performances are given in terms of error rates reduction. Since we are only concerned by the speaker based segmentation problem, we do not produce such results. We just observe that the final segmentation is always worse than the one obtained with only true speaker changes. Because of the ....
H. Jin, F. Kubala and R. Schwartz, "Automatic speaker clustering", Proc. of DARPA Speech Recognition Workshop, 1997.
....(from post evaluation analysis) 3.4. Segmentation and Clustering We experimented with the commonly adopted strategy: Use the silence segments located by a monophone recognizer as boundaries of presegments and then use some distortion measures to cluster these segments. The method proposed by [12] was implemented. We experimented with this method on concatenated WSJ utterances and found generally it worked quite well. When we experimented with the actual BN data with monophone recognition generated boundaries, we found the presegmentation generating too many very long (short) utterances. ....
H. Jin, F. Kubala, and R. Schwartz. Automatic speaker clustering. In Proc. of DARPA Speech Recognition Workshop, 1997.
.... the number of clusters or the minimum size of each cluster; accordingly, one can go down the tree to obtain the desired clustering [14]. Another heuristic solution is to threshold the distance measures during the hierarchical process; the thresholding level is tuned on a training set [10]. Jin et al. [7] shed some light on automatically choosing a clustering solution. In this paper, we are interested in detecting changes in speaker identity, environmental condition and channel condition; we call this the problem of acoustic change detection. The input audio stream can be modeled as a Gaussian ....
....condition and by 2.4 for the spontaneous condition. Table 2 also shows the error rates of MLLR using the ideal clustering by the true speaker identities. It is clear that our speaker clustering enhanced the performance of MLLR as much as the ideal clustering. 4.3. Discussion Jin et al. of BBN [7] proposed a similar automatic speaker clustering algorithm. They also used the log likelihood ratio distance measure proposed in Gish et al. [6], however, with the distances between consecutive segments scaled down by a parameter α. They performed hierarchical clustering; for any given number k, ....
[Article contains additional citation context not shown here]
H. Jin, F. Kubala and R. Schwartz, "Automatic speaker clustering", Proceedings of the Speech Recognition Workshop, pp 108-111, 1997.
....gender changes, and separation of male from female segments. Also, long silence segments are discarded. 3. Chopping of the hypothesis into small segments, averaging 4 sec. duration, using the silence detection information from step (1). 4. Clustering of the short segments into speaker clusters [6] for adaptation. 5. Cepstral mean and variance normalization applied to each segment, with speech and non-speech frames normalized separately (2-level CMS and variance normalization). Notice that in training, the cepstral normalization is applied to each speaker turn, so the above procedure ....
H. Jin, F. Kubala, R. Schwartz, "Automatic Speaker Clustering", Proceedings of the DARPA Speech Recognition Workshop, Chantilly VA, Feb. 1997.
.... broadcast news input is segmented and gender classified in one step with a context independent 2-gender phoneme decoder as described in [6]. The chopped segments are clustered automatically in an attempt to pool the data from each speaker for the benefit of unsupervised adaptation as described in [4]. The spectrum mean and variance is normalized over each segment, with speech and non-speech frames normalized separately. Gender dependent acoustic models are estimated from the training data without regard to the speech environment or signal bandwidth [12]. The gender dependent SI models are ....
Jin, H., F. Kubala, R. Schwartz, "Automatic Speaker Clustering", 1997 DARPA Speech Recognition Workshop, Chantilly VA, Feb. 1997.
....at the longer pauses located by the standard decoder. 3.6. Speaker Clustering The goal of speaker clustering is to group segments from the same speaker and condition together to improve the effectiveness of unsupervised adaptation. We have developed a fully automatic blind clustering algorithm [4] to accomplish this. We cluster segments (within each episode and gender) using a segment distance measure borrowed from our work in Speaker Identification [3]. A penalty is applied against the number of clusters created to establish a termination criterion in conjunction with the likelihood of ....
....on the whole problem, represented by the UE test. In both training and decoding for the PE test, we made no use of the given F condition labels. In training, data from all conditions was pooled. For test, we automatically identified speakers and channel conditions by a blind clustering procedure [4]. Also, though we found that we could achieve small improvements using condition specific models, we considered the gain too small to justify the additional system complexity [8]. So the F condition labels have no value for us in training, and in test, they are only useful for diagnostic purposes. ....
Jin, H., F. Kubala, R. Schwartz, "Automatic Speaker Clustering", 1997 DARPA Speech Recognition Workshop, Chantilly VA, Feb. 1997, elsewhere this volume.
....Leggetter et al. [5]. We found that if we used more than two transformations, the system memorized the recognition errors of the first pass. We improved the unsupervised adaptation on the test by clustering together different segments that appeared to be from the same speaker on the same channel [6]. We also considered supervised adaptation of the model to a known channel condition as described in the previous section. In this case, we used more transformations because there was more data and the transcriptions were known. Finally, we used SAT [7]. This can be thought of as removing the ....
Jin, H., F. Kubala, R. Schwartz, "Automatic Speaker Clustering", 1997 DARPA Speech Recognition Workshop, Chantilly VA, Feb. 1997, elsewhere this volume.