Results 1 - 10 of 22
An overview of automatic speaker diarization systems - IEEE TASLP, 2006
"... Abstract—Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/ ..."
Abstract
-
Cited by 100 (2 self)
- Add to MetaCart
(Show Context)
Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization can be used for helping speech recognition, facilitating the searching and indexing of audio archives, and increasing the richness of automatic transcriptions, making them more readable. In this paper, we provide an overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, and discuss their relative merits and limitations. Performance using the different techniques is compared within the framework of the speaker diarization task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real broadcast news systems and their portability to other domains and tasks such as meetings and speaker verification. Index Terms: Speaker diarization, speaker segmentation and clustering.
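As a rough illustration of the segment-then-cluster structure that the surveyed speaker diarization systems share, the sketch below cuts a feature stream into fixed-length segments and merges them agglomeratively. It is a minimal toy, not any particular system from the paper; the uniform 100-frame segments, Euclidean centroid distance, and the 3.0 merge threshold are all illustrative assumptions.

```python
# Minimal segment-then-cluster diarization sketch (illustrative only).
import numpy as np

def diarize(features, seg_len=100, merge_threshold=3.0):
    """features: (n_frames, n_dims) array, e.g. MFCCs.
    Returns one cluster label per segment; each cluster stands for a speaker."""
    # 1. Segmentation: cut the stream into fixed-length segments.
    n_segs = len(features) // seg_len
    segs = [features[i * seg_len:(i + 1) * seg_len] for i in range(n_segs)]
    # 2. Represent each segment by its mean feature vector.
    cents = [s.mean(axis=0) for s in segs]
    clusters = [[i] for i in range(n_segs)]
    # 3. Agglomerative clustering: repeatedly merge the closest pair of
    #    clusters until no pair is closer than the threshold.
    while len(clusters) > 1:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda p: np.linalg.norm(cents[p[0]] - cents[p[1]]))
        if np.linalg.norm(cents[a] - cents[b]) > merge_threshold:
            break
        clusters[a].extend(clusters.pop(b))
        cents[a] = np.vstack([segs[i] for i in clusters[a]]).mean(axis=0)
        cents.pop(b)
    labels = np.empty(n_segs, dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Two synthetic "speakers" with well-separated feature means.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(0, 1, (300, 12)), rng.normal(3, 1, (300, 12))])
print(diarize(feats))  # e.g. [0 0 0 1 1 1]: first half vs. second half
```

Real systems replace each step with something stronger (speech/non-speech detection, BIC change detection, GMM or HMM cluster models, Viterbi resegmentation), but the segment-and-cluster loop is the common skeleton the overview describes.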
Retrieval and browsing of spoken content - IEEE Signal Processing Mag, 2008
"... [A discussion of the technical issues involved in developing information retrieval systems for the spoken word] © IMAGESTATE Ever-increasing computing power and connectivity bandwidth, together with falling storage costs, are resulting in an overwhelming amount of data of various types being produce ..."
Abstract
-
Cited by 25 (2 self)
- Add to MetaCart
(Show Context)
[A discussion of the technical issues involved in developing information retrieval systems for the spoken word] Ever-increasing computing power and connectivity bandwidth, together with falling storage costs, are resulting in an overwhelming amount of data of various types being produced, exchanged, and stored. Consequently, information search and retrieval has emerged as a key application area. Text-based search is the most active area, with applications that range from Web and local network search to searching for personal information residing on one's own hard drive. Speech search has received less attention, perhaps because large collections of spoken material have previously not been available. However, with cheaper storage and increased broadband access, there has been a corresponding increase in the availability of online spoken audio content such as news broadcasts, podcasts, and academic lectures.
Speaker segmentation and clustering - Signal Processing, 2008
"... This survey focuses on two challenging speech processing topics, namely: speaker segmen-tation and speaker clustering. Speaker segmentation aims at nding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-base ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
(Show Context)
This survey focuses on two challenging speech processing topics, namely speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, the advantages and disadvantages of each algorithm are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering.
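As a pocket example of the metric-based family the survey reviews, the sketch below slides two adjacent windows over a feature stream and scores their dissimilarity with a symmetric Kullback-Leibler divergence between diagonal Gaussians; peaks in the resulting curve are candidate speaker change points. The window size, hop, and variance floor are illustrative assumptions, not values from the survey.

```python
# Metric-based change detection sketch: symmetric KL between two
# adjacent sliding windows, each modeled as a diagonal Gaussian.
import numpy as np

def sym_kl_diag(m1, v1, m2, v2):
    """Symmetric KL divergence between two diagonal Gaussians."""
    kl12 = 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1)
    kl21 = 0.5 * np.sum(np.log(v1 / v2) + (v2 + (m2 - m1) ** 2) / v1 - 1)
    return kl12 + kl21

def change_curve(features, win=100, hop=10, floor=1e-6):
    """Score the dissimilarity of [t - win, t) vs. [t, t + win) for each t.
    Local maxima of the returned curve are candidate change points."""
    scores = []
    for t in range(win, len(features) - win + 1, hop):
        left, right = features[t - win:t], features[t:t + win]
        scores.append((t, sym_kl_diag(left.mean(0), left.var(0) + floor,
                                      right.mean(0), right.var(0) + floor)))
    return scores
```

Model-based methods score the same two-window comparison with trained models instead of a divergence, and hybrid methods typically use a metric pass to propose change points and a model-based pass to validate them.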
Computationally efficient and robust BIC-based speaker segmentation - In Proc. ICASSP, 2008
"... Abstract An algorithm for automatic speaker segmentation based on the Bayesian Information Criterion (BIC) is presented. BIC tests are not performed for every window shift (e.g. every milliseconds), as previously, but when a speaker change is most probable to occur. This is done by estimating the n ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
(Show Context)
An algorithm for automatic speaker segmentation based on the Bayesian Information Criterion (BIC) is presented. BIC tests are not performed for every window shift (e.g., every few milliseconds), as in previous approaches, but when a speaker change is most likely to occur. This is done by estimating the next probable change point using a model of utterance durations. The inverse Gaussian is found to fit the distribution of utterance durations best. As a result, fewer BIC tests are needed, making the proposed system less computationally demanding in time and memory, and considerably more efficient with respect to missed speaker change points. A feature selection algorithm based on a branch-and-bound search strategy is applied in order to identify the most efficient features for speaker segmentation. Furthermore, a new theoretical formulation of BIC is derived by applying centering and simultaneous diagonalization. This formulation is considerably more computationally efficient than the standard BIC when the covariance matrices are estimated by estimators other than the usual maximum likelihood ones. Two commonly used pairs of figures of merit are employed and their relationship is established. Computational efficiency is achieved through the speaker utterance modeling, whereas robustness is achieved by feature selection and application of BIC tests at appropriately selected time instants. Experimental results indicate that the proposed modifications yield a superior performance compared to existing approaches.
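The core BIC test the abstract refers to can be stated compactly: for a window of N feature frames split at frame i, compare one full-covariance Gaussian fitted to the whole window against two Gaussians fitted to the halves, minus a complexity penalty. A minimal sketch follows; the feature dimensions, window sizes, and lambda value are illustrative assumptions, and the paper's utterance-duration model (which decides when to run this test) and its centering/diagonalization reformulation are not shown.

```python
# Delta-BIC speaker-change test for a candidate change point, as commonly
# formulated in the BIC-segmentation literature (illustrative sketch).
import numpy as np

def delta_bic(X, i, lam=1.0):
    """Delta-BIC for a change at frame i of feature matrix X (N x d).
    Positive values favour the two-speaker (change) hypothesis."""
    N, d = X.shape

    def half_n_log_det_cov(Z):
        # 0.5 * len(Z) * log|cov(Z)|; slogdet is more stable than log(det()).
        sign, logdet = np.linalg.slogdet(np.cov(Z, rowvar=False, bias=True))
        return 0.5 * len(Z) * logdet

    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(N)
    return (half_n_log_det_cov(X) - half_n_log_det_cov(X[:i])
            - half_n_log_det_cov(X[i:]) - penalty)

# Scan candidate change points in a window with a change at frame 200.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (200, 12)),    # "speaker A" frames
               rng.normal(2, 1.5, (200, 12))])   # "speaker B" frames
scores = {i: delta_bic(X, i) for i in range(50, 351, 10)}
best = max(scores, key=scores.get)
print(best, scores[best] > 0)  # peaks near frame 200 with a positive score
```

The paper's contribution is largely about running this test rarely and cheaply: the inverse Gaussian duration model predicts where the next change is likely, so an exhaustive scan like the one above is replaced by a few well-placed tests.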
Feature compensation in the cepstral domain employing model combination - Speech Communication
"... Abstract In this paper, we present an effective cepstral feature compensation scheme which leverages knowledge of the speech model in order to achieve robust speech recognition. In the proposed scheme, the requirement for a prior noisy speech database in off-line training is eliminated by employing ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
In this paper, we present an effective cepstral feature compensation scheme which leverages knowledge of the speech model in order to achieve robust speech recognition. In the proposed scheme, the requirement for a prior noisy speech database in off-line training is eliminated by employing parallel model combination for the noise-corrupted speech model. Gaussian mixture models of clean speech and noise are used for the model combination. The noisy speech model can be adapted by updating only the noise model. This method has the advantage of reduced computational expense and improved accuracy for model estimation, since it is applied in the cepstral domain. In order to cope with time-varying background noise, a novel interpolation method over multiple models is employed. By sequentially calculating the posterior probability of each environmental model, the compensation procedure can be applied on a frame-by-frame basis. In order to reduce the computational expense of the multiple-model method, a technique of sharing similar Gaussian components is proposed. Acoustically similar components across an inventory of environmental models are selected by the proposed sub-optimal algorithm, which employs the Kullback-Leibler similarity distance. The combined hybrid model, which consists of the selected Gaussian components, is used for noisy speech model sharing. The performance is examined using Aurora2 and speech data for an in-vehicle environment. The proposed feature compensation algorithm is compared with standard methods in the field (e.g., CMN, spectral subtraction, RATZ). The experimental results demonstrate that the proposed feature compensation schemes are very effective in realizing robust speech recognition in adverse noisy environments. The proposed model combination-based feature compensation method is superior to existing model-based feature compensation methods. Of particular interest is that the proposed method shows up to an 11.59% relative WER reduction compared to the ETSI AFE front-end method. The multi-model approach is effective at coping with changing noise conditions in the input speech, producing performance comparable to the matched model condition. Applying the mixture sharing method brings a significant reduction in computational overhead, while maintaining recognition performance at a reasonable level with near real-time operation.
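To make the model-based compensation idea concrete, here is a minimal sketch of the posterior-weighted correction step: the noisy frame's posterior over the mixture components of a noisy-speech GMM weights a per-component clean-vs-noisy mean shift, which is subtracted from the frame. This is a generic MMSE-style sketch under stated assumptions, not the paper's exact algorithm; in particular, the parallel model combination step that derives the noisy model from clean-speech and noise GMMs, the multi-model interpolation, and the KL-based mixture sharing are all omitted.

```python
# Posterior-weighted cepstral feature compensation sketch (illustrative).
import numpy as np

def log_gauss_diag(y, means, variances):
    """Log-density of frame y under K diagonal Gaussians; returns shape (K,)."""
    return -0.5 * np.sum(np.log(2 * np.pi * variances)
                         + (y - means) ** 2 / variances, axis=-1)

def compensate(y, weights, noisy_means, noisy_vars, clean_means):
    """MMSE-style clean-cepstrum estimate for one noisy frame y.

    weights      (K,)   mixture weights of the noisy-speech GMM
    noisy_means  (K, D) means of the noisy-speech GMM
    noisy_vars   (K, D) diagonal variances of the noisy-speech GMM
    clean_means  (K, D) means of the paired clean-speech GMM
    """
    log_post = np.log(weights) + log_gauss_diag(y, noisy_means, noisy_vars)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Remove the expected noisy-minus-clean mean shift from the frame.
    return y - post @ (noisy_means - clean_means)
```

Because the correction is a soft sum over components, the frame-by-frame posterior calculation the abstract mentions falls out naturally: each incoming frame re-weights the environmental models before the shift is applied.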
Radio Oranje: Enhanced Access to a Historical Spoken Word Collection
"... Access to historical audio collections is typically very restricted: content is often only available on physical (analog) media and the metadata is usually limited to keywords, giving access at the level of relatively large fragments, e.g., an entire tape. Many spoken word heritage collections are n ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
Access to historical audio collections is typically very restricted: content is often only available on physical (analog) media and the metadata is usually limited to keywords, giving access at the level of relatively large fragments, e.g., an entire tape. Many spoken word heritage collections are now being digitized, which allows the introduction of more advanced search technology. This paper presents an approach that supports online access and search for recordings of historical speeches. A demonstrator has been built, based on the so-called Radio Oranje collection, which contains radio speeches by the Dutch Queen Wilhelmina that were broadcast during World War II. The audio has been aligned with its original 1940s manual transcriptions to create a time-stamped index that enables the speeches to be searched at the word level. Results are presented together with related photos from an external database.
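The word-level search that the alignment enables amounts to a time-stamped inverted index. A minimal sketch follows, assuming the alignment has already produced (speech id, word, start, end) tuples; the tuple layout and the lowercase normalization are illustrative assumptions, not the paper's implementation.

```python
# Time-stamped inverted index over aligned words (illustrative sketch).
from collections import defaultdict

def build_index(alignments):
    """alignments: iterable of (speech_id, word, start_sec, end_sec) tuples."""
    index = defaultdict(list)
    for speech_id, word, start, end in alignments:
        index[word.lower()].append((speech_id, start, end))
    return index

def search(index, query):
    """All (speech_id, start_sec, end_sec) occurrences of the query word,
    so playback can jump straight to the moment the word is spoken."""
    return index.get(query.lower(), [])

# Hypothetical alignment tuples for two speeches.
idx = build_index([("speech_a", "nederland", 12.4, 13.1),
                   ("speech_b", "nederland", 3.0, 3.6)])
print(search(idx, "Nederland"))  # [('speech_a', 12.4, 13.1), ('speech_b', 3.0, 3.6)]
```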
‘Houston, We have a solution’: Using NASA Apollo Program to advance Speech and Language Processing Technology
"... NASA’s Apollo program stands as one of mankind’s greatest achievements in the 20th century. During a span of 4 years (from 1968 to 1972), a total of 9 lunar missions were launched and 12 astronauts walked on the surface of the moon. It was one the most complex operations executed from scientific, te ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
NASA’s Apollo program stands as one of mankind’s greatest achievements in the 20th century. During a span of 4 years (from 1968 to 1972), a total of 9 lunar missions were launched and 12 astronauts walked on the surface of the moon. It was one of the most complex operations ever executed from scientific, technological, and operational perspectives. In this paper, we describe our recent efforts in gathering and organizing the Apollo program data. It is important to note that the audio content captured during the 7-10 day missions represents the coordinated efforts of hundreds of individuals within NASA Mission Control, resulting in well over 100k hours of data for the entire program. It is our intention to make the material stemming from this effort available to the research community to further research advancements in speech and language processing. In particular, we describe the speech and text aspects of the Apollo data while pointing out its applicability to several classical speech processing and natural language processing problems such as audio processing, speech and speaker recognition, information retrieval, document linking, and a range of other processing tasks which enable knowledge search, retrieval, and understanding. We also highlight some of the outstanding opportunities and challenges associated with this dataset. Finally, we present initial results for speech recognition, document linking, and audio processing systems.
NLP and the humanities: the revival of an old liaison
"... This paper present an overview of some emerging trends in the application of NLP in the domain of the so-called Digital Humanities and discusses the role and nature of metadata, the annotation layer that is so characteristic of documents that play a role in the scholarly practises of the humanities. ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
This paper presents an overview of some emerging trends in the application of NLP in the domain of the so-called Digital Humanities and discusses the role and nature of metadata, the annotation layer that is so characteristic of documents that play a role in the scholarly practices of the humanities. It is explained how metadata are the key to the added value of techniques such as text and link mining, and an outline is given of what measures could be taken to increase the chances of a bright future for the old ties between NLP and the humanities. There is no data like metadata!
Variational Noise Model Composition Through Model Perturbation for Robust Speech Recognition with Time-Varying Background Noise - Speech Communication, 2011
"... Abstract This study proposes a novel model composition method to improve speech recognition performance in time-varying background noise conditions. It is suggested that each element of the cepstral coefficients represents the frequency degree of the changing components in the envelope of the log-s ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
This study proposes a novel model composition method to improve speech recognition performance in time-varying background noise conditions. It is suggested that each element of the cepstral coefficients represents the frequency degree of the changing components in the envelope of the log-spectrum. With this motivation, in the proposed method, variational noise models are formulated by selectively applying perturbation factors to the mean parameters of a basis model, resulting in a collection of noise models that more accurately reflect the natural range of spectral patterns seen in the log-spectral domain. The basis noise model is obtained from the silence segments of the input speech. The perturbation factors are designed separately for changes in the energy level and in the spectral envelope. The proposed variational model composition (VMC) method is employed to generate multiple environmental models for our previously proposed parallel combined Gaussian mixture model (PCGMM) based feature compensation algorithm. The mixture sharing technique is integrated to reduce the computational expense caused by employing the variational models. Experimental results show that the proposed method is considerably more effective at increasing speech recognition performance in time-varying background noise conditions, with +31.31%, +10.65%, and +20.54% average relative improvements in word error rate for speech babble, background music, and real-life in-vehicle noise conditions, respectively, compared to the original basic PCGMM method.
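A minimal sketch of the composition step, under stated assumptions: the basis noise model's mean parameters are perturbed along two axes, an offset on the energy term (here taken to be c0) and a factor on the remaining coefficients standing in for spectral-envelope changes. The specific offsets and scales below are illustrative, not the paper's perturbation factors.

```python
# Variational noise model composition sketch: perturb the basis model's
# means to generate a family of noise models (illustrative values).
import numpy as np

def compose_variational_models(basis_means,
                               energy_offsets=(-2.0, 0.0, 2.0),
                               envelope_scales=(0.9, 1.0, 1.1)):
    """basis_means: (K, D) mean vectors of the basis noise GMM, estimated
    from the silence segments of the input speech. Returns a list of
    perturbed (K, D) mean arrays, one per variational noise model."""
    models = []
    for e in energy_offsets:       # vary c0: the overall log-energy level
        for s in envelope_scales:  # vary c1..cD-1: the log-spectral envelope
            m = basis_means.copy()
            m[:, 0] += e
            m[:, 1:] *= s
            models.append(m)
    return models

basis = np.zeros((4, 13))  # e.g. a 4-mixture, 13-dimensional MFCC noise model
print(len(compose_variational_models(basis)))  # 9 variational noise models
```

Each perturbed model then serves as one environmental model in the PCGMM-style compensation, with the mixture sharing step keeping the cost of evaluating all of them manageable.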
Disclosing Spoken Culture: User Interfaces for Access to Spoken Word Archives
"... Over the past century alone, millions of hours of audiovisual data have been collected with great potential for e.g., new creative productions, research and educational purposes. The actual (re-)use of these collections, however, is severely hindered by their generally limited access. In this paper ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Over the past century alone, millions of hours of audiovisual data have been collected, with great potential for, e.g., new creative productions, research, and educational purposes. The actual (re-)use of these collections, however, is severely hindered by their generally limited accessibility. In this paper, a framework for improved access to spoken content from the cultural heritage domain is proposed, with a focus on online user interface designs that support access to speech archives. The evaluation of the user interface for an instantiation of the framework is presented, and future work on adapting this first prototype to other collections and archives is proposed.