Automatic Mood Detection from Acoustic Music Data
, 2003
Cited by 33 (0 self)
Music mood describes the inherent emotional meaning of a music clip. It is helpful in music understanding, music search and some music-related applications. In this paper, a hierarchical framework is presented to automate the task of mood detection from acoustic music data, by following some music psychological theories in Western cultures. Three feature sets, intensity, timbre and rhythm, are extracted to represent the characteristics of a music clip. Moreover, a mood tracking approach is also presented for a whole piece of music. Experimental evaluations indicate that the proposed algorithms produce satisfactory results.
An overview of text-independent speaker recognition: from features to supervectors
, 2009
Cited by 31 (14 self)
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate on advanced computational techniques that address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with a discussion of future directions.
SHEEP, GOATS, LAMBS and WOLVES: A Statistical Analysis of Speaker Performance in the NIST 1998 Speaker Recognition Evaluation
- INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING
, 1998
Cited by 25 (0 self)
Performance variability in speech and speaker recognition systems can be attributed to many factors. One major factor, which is often acknowledged but seldom analyzed, is inherent differences in the recognizability of different speakers. In speaker recognition systems such differences are characterized by the use of animal names for different types of speakers, including sheep, goats, lambs and wolves, depending on their behavior with respect to automatic recognition systems. In this paper we propose statistical tests for the existence of these animals and apply these tests to hunt for such animals using results from the 1998 NIST speaker recognition evaluation.
A Survey on Wavelet Applications in Data Mining
, 2003
Cited by 21 (1 self)
Recently there has been significant development in the use of wavelet methods in various data mining processes. However, no comprehensive survey has been written on the topic. The goal of this paper is to fill the void. First, the paper presents a high-level data-mining framework that reduces the overall process into smaller components. Then applications of wavelets for each component are reviewed. The paper concludes by discussing the impact of wavelets on data mining research and outlining potential future research directions and applications.
The Gauss-tree: Efficient object identification in databases of probabilistic feature vectors
- In Proc. ICDE
, 2006
Cited by 16 (0 self)
In applications of biometric databases the typical task is to identify individuals according to features which are not exactly known. Reasons for this inexactness are varying measuring techniques or environmental circumstances. Since these circumstances are not necessarily the same when determining the features for different individuals, the exactness may vary strongly between the individuals as well as between the features. To identify individuals, similarity search on feature vectors is applicable, but even adaptable distance measures cannot handle objects that each have an individual level of exactness. Therefore, we develop a comprehensive probabilistic theory in which uncertain observations are modeled by probabilistic feature vectors (pfv), i.e. feature vectors where the conventional feature values are replaced by Gaussian probability distribution functions. Each feature value of each object is complemented by a variance value indicating its uncertainty. We define two types of identification queries, k-most-likely identification and threshold identification. For efficient query processing, we propose a novel index structure, the Gauss-tree. Our experimental evaluation demonstrates that pfv stored in a Gauss-tree significantly improve the result quality compared to traditional feature vectors. Additionally, we show that the Gauss-tree significantly speeds up query times compared to competing methods.
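The two query types named in this abstract can be illustrated with a small sketch. It is not the paper's method: it uses a plain linear scan instead of the Gauss-tree index, and it assumes (this is not stated in the abstract) that the match density between two pfv is a Gaussian in the difference of their means with the two variances added per feature, normalised over the database under a uniform prior. The names `log_match_density`, `k_most_likely` and `threshold_identification` are hypothetical.

```python
import math

def log_match_density(query, obj):
    """Log density that two pfv describe the same underlying vector.

    Each pfv is a list of (mean, variance) pairs, one per feature.
    Assumption: features are independent and the variances of query
    and object add up, as for a difference of two Gaussians.
    """
    total = 0.0
    for (mq, vq), (mo, vo) in zip(query, obj):
        var = vq + vo
        total += -0.5 * (math.log(2 * math.pi * var) + (mq - mo) ** 2 / var)
    return total

def k_most_likely(query, database, k):
    """Return the k objects with the highest posterior match probability."""
    scores = {name: log_match_density(query, pfv) for name, pfv in database.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {name: math.exp(s - top) for name, s in scores.items()}
    norm = sum(weights.values())
    posterior = {name: w / norm for name, w in weights.items()}
    return sorted(posterior.items(), key=lambda item: item[1], reverse=True)[:k]

def threshold_identification(query, database, threshold):
    """Return every object whose posterior match probability reaches the threshold."""
    ranked = k_most_likely(query, database, len(database))
    return [(name, p) for name, p in ranked if p >= threshold]

# Hypothetical two-feature database: (mean, variance) per feature.
database = {
    "a": [(0.0, 0.1), (0.0, 0.1)],
    "b": [(5.0, 0.1), (5.0, 0.1)],
}
query = [(0.2, 0.1), (0.1, 0.1)]
best = k_most_likely(query, database, 1)
```

Because the variance enters the score, an object measured with high uncertainty is penalised less for a distant mean than an object measured precisely, which a fixed distance measure cannot express.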
Improved Learning Algorithms for Mixture of Experts in Multiclass Classification
, 1999
Cited by 14 (3 self)
Mixture of experts (ME) is a modular neural network architecture for supervised learning. A double-loop Expectation-Maximization (EM) algorithm has been introduced to the ME architecture for adjusting the parameters, and the iteratively reweighted least squares (IRLS) algorithm is used to perform maximization in the inner loop [Jordan, M.I., Jacobs, R.A. (1994). Hierarchical mixture of experts and the EM algorithm, Neural Computation, 6(2), 181-214]. However, the literature reports that the IRLS algorithm is unstable, and that the ME architecture trained by the EM algorithm with IRLS in the inner loop often performs poorly in multiclass classification. In this paper, the reason for this instability is explored. We find that, due to an implicitly imposed incorrect assumption of parameter independence in multiclass classification, an incomplete Hessian matrix is used in that IRLS algorithm. Based on this finding, we apply the Newton-Raphson met...
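The remark about an incomplete Hessian can be made concrete with a standard softmax fact. The sketch below is an illustration only: the abstract does not give the derivation, so reading "parameter independence" as "dropping the cross-class terms" is an assumption, and the function names are hypothetical. For a softmax output with probabilities p, the Hessian of the negative log-likelihood with respect to the logits is diag(p) - p pᵀ, whose off-diagonal entries -p_i p_j are nonzero.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def full_hessian(logits):
    """Hessian of the softmax negative log-likelihood w.r.t. the logits.

    Equals diag(p) - p p^T and does not depend on the target class.
    """
    p = softmax(logits)
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]

def independent_hessian(logits):
    """Assumed 'independent classes' approximation: keep only the diagonal."""
    h = full_hessian(logits)
    n = len(h)
    return [[h[i][j] if i == j else 0.0 for j in range(n)] for i in range(n)]

logits = [1.0, 2.0, 0.5]
H = full_hessian(logits)
H_diag = independent_hessian(logits)
```

Each row of `H` sums to zero, so the full matrix encodes the coupling between classes; `H_diag` discards that coupling, which is the kind of information a Newton-type update would lose under the independence assumption.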
Real-Time Speaker Identification and Verification
- ACCEPTED FOR PUBLICATION IN IEEE TRANS. SPEECH & AUDIO PROCESSING
Cited by 10 (7 self)
In speaker identification, most of the computation originates from the distance or likelihood computations between the feature vectors of the unknown speaker and the models in the database. The identification time depends on the number of feature vectors, their dimensionality, the complexity of the speaker models and the number of speakers. In this paper, we concentrate on optimizing vector quantization (VQ) based speaker identification. We reduce the number of test vectors by pre-quantizing the test sequence prior to matching, and the number of speakers by pruning out unlikely speakers during the identification process. The best variants are then generalized to Gaussian mixture model (GMM) based modeling. We also apply the algorithms to efficient cohort set search for score normalization in speaker verification. We obtain a speed-up factor of 16:1 in the case of VQ-based modeling with minor degradation in the identification accuracy, and 34:1 in the case of GMM-based modeling. An equal error rate of 7% can be reached in 0.84 seconds on average when the length of the test utterance is 30.4 seconds.
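The two speed-ups described here, fewer test vectors and fewer candidate speakers, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the abstract does not say which pre-quantizer or pruning rule is used, so averaging runs of consecutive vectors and keeping a fixed fraction of speakers after each batch are stand-ins, and all names and parameters (`prequantize`, `identify`, `group`, `batch`, `keep`) are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distortion(vec, codebook):
    """Distance from a vector to its nearest code vector."""
    return min(sq_dist(vec, code) for code in codebook)

def prequantize(vectors, group):
    """Assumed pre-quantizer: replace each run of `group` vectors by its mean."""
    reduced = []
    for i in range(0, len(vectors), group):
        chunk = vectors[i:i + group]
        dim = len(chunk[0])
        reduced.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return reduced

def identify(test_vectors, codebooks, group=2, batch=2, keep=0.5):
    """Return the speaker whose codebook has the lowest accumulated distortion.

    After each batch of test vectors, only the best `keep` fraction of the
    remaining speakers survives (assumed pruning rule).
    """
    reduced = prequantize(test_vectors, group)
    totals = {speaker: 0.0 for speaker in codebooks}
    for start in range(0, len(reduced), batch):
        for vec in reduced[start:start + batch]:
            for speaker in totals:
                totals[speaker] += distortion(vec, codebooks[speaker])
        if len(totals) > 1:
            n_keep = max(1, int(len(totals) * keep))
            survivors = sorted(totals, key=totals.get)[:n_keep]
            totals = {speaker: totals[speaker] for speaker in survivors}
    return min(totals, key=totals.get)

# Hypothetical two-dimensional codebooks for three speakers.
codebooks = {
    "alice": [[0.0, 0.0], [1.0, 1.0]],
    "bob": [[5.0, 5.0], [6.0, 6.0]],
    "carol": [[10.0, 10.0], [11.0, 11.0]],
}
test_vectors = [[0.1, 0.0], [0.9, 1.1], [0.0, 0.2], [1.0, 0.9]]
winner = identify(test_vectors, codebooks)
```

All surviving speakers have been scored on the same vectors at every pruning step, so their accumulated distortions stay comparable. The trade-off is that an aggressive `keep` value can discard the true speaker early, which is consistent with the abstract's mention of a minor loss in accuracy.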
Audio-assisted scene segmentation for story browsing
- In Proceedings of the International Conference on Image and Video Retrieval
, 2003
Cited by 9 (1 self)
Content-based video retrieval requires an effective scene segmentation technique to divide a long video file into meaningful high-level aggregates of shots called scenes. Each scene is part of a story. Browsing these scenes unfolds the entire story of a film. In this paper, we first investigate recent scene segmentation techniques that belong to the visual-audio alignment approach. This approach segments a video stream into visual scenes and an audio stream into audio scenes separately and later aligns these boundaries to create the final scene boundaries. In contrast, we propose a novel audio-assisted scene segmentation technique that utilizes audio information to remove false boundaries generated from segmentation by visual information alone. The crux of our technique is the new dissimilarity measure based on analysis of statistical properties of audio features and a concept in information theory. The experimental results on two full-length films with a wide range of camera motion and a complex composition of shots demonstrate the effectiveness of our technique compared with that of the visual-audio alignment techniques.
Spectral Features for Automatic Text-Independent Speaker Recognition
, 2003
Cited by 9 (2 self)
The front-end, or feature extractor, is the first component in an automatic speaker recognition system. Feature extraction transforms the raw speech signal into a compact but effective representation that is more stable and discriminative than the original signal. Since the front-end is the first component in the chain, the quality of the later components (speaker modeling and pattern matching) is strongly determined by the quality of the front-end. In other words, classification can be at most as accurate as the features.
Multimodal speaker identification using an adaptive classifier cascade based on modality reliability
- IEEE Transactions on Multimedia
, 2005
Cited by 7 (2 self)
We present a multimodal open-set speaker identification system that integrates information coming from audio, face and lip motion modalities. For fusion of multiple modalities, we propose a new adaptive cascade rule that favors reliable modality combinations through a cascade of classifiers. The order of the classifiers in the cascade is adaptively determined based on the reliability of each modality combination. A novel reliability measure that genuinely fits the open-set speaker identification problem is also proposed to assess the accept or reject decisions of a classifier. A formal framework is developed based on probability of correct decision for analytical comparison of the proposed adaptive rule with other classifier combination rules. The proposed adaptive rule is more robust in the presence of unreliable modalities, and outperforms the hard-level max rule and soft-level weighted summation rule, provided that the employed reliability measure is effective in assessment of classifier decisions. Experimental results that support this assertion are provided.

