Speech Segregation Based on Sound Localization
- J. Acoust. Soc. Am
, 2003
Cited by 61 (22 self)
We study the cocktail-party effect, which refers to the ability of a listener to attend to a single talker in the presence of adverse acoustical conditions. It has been observed that this ability improves in the presence of binaural cues. In this paper, we explore a technique for speech segregation based on sound localization cues. The auditory masking phenomenon motivates an "ideal" binary mask in which time-frequency regions that correspond to the weak signal are canceled. In our model we estimate this binary mask by observing that systematic changes of the interaural time differences and intensity differences occur as the energy ratio of the original signals is modified. The performance of our model is comparable with results obtained using the ideal binary mask and it shows a large improvement over existing pitch-based algorithms.
Soft Decisions In Missing Data Techniques For Robust Automatic Speech Recognition.
- Proc. ICSLP-2000
, 2000
Cited by 59 (18 self)
In previous work we have developed the theory and demonstrated the promise of the Missing Data approach to robust Automatic Speech Recognition. This technique is based on hard decisions as to whether each time-frequency "pixel" is either reliable or unreliable. In this paper we replace these discrete decisions with soft estimates of the probability that each "pixel" is reliable. We adapt the probability calculation to use these estimates as weighting factors for the complementary reliable/unreliable interpretations for each feature vector component. Experiments using the TIDigits connected digit recognition task demonstrate that this technique affords significant performance improvements at low SNRs. 1. INTRODUCTION In previous work [2, 5, 6] we have developed the theory and demonstrated the promise of the Missing Data approach to robust Automatic Speech Recognition. In this technique, spectral-temporal regions uncontaminated by noise are identified and CDHMM recognition methods are ...
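The weighting idea in this abstract can be sketched for a single feature component. This is only an illustration, not the authors' implementation: the function names are hypothetical, a single Gaussian stands in for one CDHMM mixture component, and the "unreliable" interpretation is assumed to be the Gaussian density averaged over the interval from 0 to the observed value (the speech energy is taken to lie somewhere below the observation). The observed value must be positive.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gaussian_cdf(x, mean, std):
    """Cumulative distribution of a univariate Gaussian at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def soft_component_likelihood(x, mean, std, p_reliable):
    """Blend the reliable and unreliable interpretations of one component.

    p_reliable is the soft estimate that this time-frequency "pixel" is
    reliable; 1.0 or 0.0 recovers the original hard decision.
    """
    # Reliable interpretation: the observation is the speech value itself.
    reliable = gaussian_pdf(x, mean, std)
    # Unreliable interpretation (assumed form): the speech value lies
    # somewhere in [0, x], so average the density over that interval.
    unreliable = (gaussian_cdf(x, mean, std) - gaussian_cdf(0.0, mean, std)) / x
    return p_reliable * reliable + (1.0 - p_reliable) * unreliable


hard_reliable = soft_component_likelihood(2.0, 1.0, 1.0, 1.0)
hard_unreliable = soft_component_likelihood(2.0, 1.0, 1.0, 0.0)
soft = soft_component_likelihood(2.0, 1.0, 1.0, 0.5)
```

With `p_reliable` between 0 and 1 the result always lies between the two hard-decision values, which is what makes the soft estimate a smooth replacement for the discrete mask.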
Monaural speech segregation based on pitch tracking and amplitude modulation
- IEEE Trans. Neural Networks
, 2004
Cited by 55 (23 self)
Speech segregation is an important task of auditory scene analysis (ASA), in which the speech of a certain speaker is separated from other interfering signals. Wang and Brown proposed a multistage neural model for speech segregation, the core of which is a two-layer oscillator network. In this paper, we extend their model by adding further processes based on psychoacoustic evidence to improve the performance. These processes include pitch tracking and grouping based on amplitude modulation (AM). Our model is systematically evaluated and compared with the Wang-Brown model, and it yields significantly better performance.
On ideal binary mask as the computational goal of auditory scene analysis
- in Speech Separation by Humans and Machines
, 2005
Cited by 40 (20 self)
What is the computational goal of auditory scene analysis? This is a key issue to address in the Marrian information-processing framework. It is also an important question for researchers in computational auditory scene analysis (CASA) because it bears directly on how a CASA system should be evaluated. In this chapter I discuss different objectives used in CASA. I suggest as a main CASA goal the use of the ideal time-frequency (T-F) binary mask whose value is one for a T-F unit where the target energy is greater than the interference energy and is zero otherwise. The notion of the ideal binary mask is motivated by the auditory masking phenomenon. Properties of the ideal binary mask are discussed, including their relationship to automatic speech recognition and human speech intelligibility. This CASA goal has led to algorithms that directly estimate the ideal binary mask in monaural and binaural conditions, and these algorithms have substantially advanced the state-of-the-art performance in speech separation.
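The mask definition in this abstract is simple enough to write down directly. The sketch below is illustrative only: the function name and the `[frame][channel]` layout of the energy arrays are assumptions, and it presumes the premixed target and interference energies are available, which is what makes the mask "ideal".

```python
def ideal_binary_mask(target_energy, interference_energy):
    """Return 1 for each T-F unit where target energy exceeds interference
    energy, and 0 otherwise.

    Both arguments are lists of rows indexed [frame][channel]
    (a hypothetical layout chosen for this example).
    """
    return [
        [1 if t > i else 0 for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(target_energy, interference_energy)
    ]


# Toy energies for two frames and two frequency channels.
target = [[4.0, 0.5], [1.0, 2.0]]
interference = [[1.0, 2.0], [1.0, 0.1]]
mask = ideal_binary_mask(target, interference)
```

Note that a unit with equal target and interference energy is assigned zero, since the definition requires the target energy to be strictly greater.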
Decoding Speech In The Presence Of Other Sources
- SPEECH COMMUNICATION
, 2002
Cited by 34 (11 self)
Acoustic interference is arguably the most serious problem facing current speech recognisers. The maturation of statistical pattern recognition techniques has brought very low word error rates when both training and test material consist solely of speech. However, in real-world situations, any speech signal of interest will be mixed with background noises coming from the full range of sources encountered in our acoustic environment. In this paper,
Uncertainty decoding for noise robust speech recognition
- in Proc. Interspeech
, 2004
Cited by 26 (8 self)
This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings
Missing Data Theory, Spectral Subtraction And Signal-To-Noise Estimation For Robust Asr: An Integrated Study
, 1999
Cited by 25 (6 self)
In the missing data approach to robust Automatic Speech Recognition (ASR), time-frequency regions which carry reliable speech information are identified. Recognition is then based on these regions alone. In this paper, we address the problem of identifying reliable regions and propose two criteria to solve this based on negative energy ($|\hat{s}|^2 < 0$) and SNR ($|\hat{s}|^2 < \tfrac{1}{2}|s+n|^2$). These criteria are evaluated on the TIDigits corpus for several noise sources and compared with spectral subtraction. We show that in this task the missing data method performs considerably better than spectral subtraction and the combination of the two techniques outperforms either technique used alone. We report robust performance at 0dB SNR for car noise and 10dB SNR for factory noise. 1. INTRODUCTION In the missing data approach to robust ASR, two problems arise: the identification of the non-reliable time-frequency regions of the speech and recognition techniques to deal with the incomplete data ...
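A rough sketch of how two such criteria could label time-frequency units is given below. It rests on assumptions not stated in the abstract: the speech energy estimate is taken to be the noisy energy minus a noise energy estimate (as in spectral subtraction), the negative energy criterion marks a unit unreliable when that estimate falls below zero, and the SNR criterion marks it unreliable when the estimate is less than half the noisy energy (an estimated local SNR below 0 dB). The function name and inputs are hypothetical.

```python
def reliability_masks(noisy_energy, noise_estimate):
    """Return (negative_energy_mask, snr_mask) for a sequence of T-F units.

    A mask value of 1 means "reliable" and 0 means "unreliable".
    """
    negative_energy_mask = []
    snr_mask = []
    for y, n in zip(noisy_energy, noise_estimate):
        # Speech energy estimate after subtracting the noise estimate.
        s_hat = y - n
        # Negative energy criterion: a negative estimate is unreliable.
        negative_energy_mask.append(0 if s_hat < 0.0 else 1)
        # SNR criterion (assumed 0 dB threshold): unreliable when the
        # speech estimate is less than half of the noisy energy.
        snr_mask.append(0 if s_hat < 0.5 * y else 1)
    return negative_energy_mask, snr_mask


noisy = [10.0, 3.0, 1.0]
noise = [2.0, 2.0, 1.5]
neg_mask, snr_mask = reliability_masks(noisy, noise)
```

In this toy example the SNR criterion rejects the second unit, which the negative energy criterion accepts, showing that under these assumptions the SNR test is the stricter of the two.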
Assessing Local Noise Level Estimation Methods
- SPEECH COMMUNICATION
, 1999
Cited by 20 (0 self)
In this paper, we assess two well-known methods for the local estimation of noise level in frequency subbands and compare them with a new one based on following the lower envelope of the signal energy. Moreover, we introduce, for all three approaches, a new pre-processing algorithm expected to better follow fast modulations of the noise energy. The periodicity of speech is used to update the noise level estimate during voiced parts of speech (without explicit detection of voiced portions). This evaluation is performed on four different kinds of noise (both artificial and real) added to clean speech. The best approach is used for spectral subtraction in a speech recognition experiment and compared with more classical noise-robust features (J-RASTA).
Robust speech recognition using cepstral domain missing data techniques and noisy masks
- in Proceedings of IEEE ICASSP
Cited by 20 (6 self)
Missing Data Techniques (MDT) have been shown to be an effective method for curing the performance degradation of HMM-based speech recognition systems operating on noisy signals. However, a major drawback of the approach is that MDT requires that the acoustic model be expressed as a mixture of diagonal Gaussians in the log-spectral domain, whereas a higher accuracy can be obtained with Gaussian mixtures in the cepstral domain. This paper describes a recognizer based on the recently described cepstral-domain MDT approach using missing data masks computed from the noisy signal. It exploits a novel decision criterion that integrates harmonicity with signal-to-noise ratio and makes minimal assumptions about the noise. The system is shown to exhibit a recognition accuracy that is comparable to the ETSI Advanced Front-End reference.

