Effect of temporal envelope smearing on speech reception. (2011)
Venue: | International Journal of Bioelectromagnetism |
Citations: | 145 - 0 self |
BibTeX
@ARTICLE{Drullman11effectof,
author = {Rob Drullman and Joost M Festen and Reinier Plomp},
title = {Effect of temporal envelope smearing on speech reception.},
journal = {International Journal of Bioelectromagnetism},
year = {2011},
pages = {217--222}
}
Abstract
The effect of smearing the temporal envelope on the speech-reception threshold (SRT) for sentences in noise and on phoneme identification was investigated for normal-hearing listeners. For this purpose, the speech signal was split up into a series of frequency bands (width of [...]

The ear's resolution in both frequency and time is sufficiently high to perceive the essential acoustical features of the various speech sounds. Depending on the speech material, we even have a reserve capacity. This reserve capacity is rather small for isolated phonemes, but large for sentences. For normal-hearing listeners, the speech-reception threshold (SRT) in noise, defined as the speech-to-noise ratio at which 50% of short everyday sentences are reproduced correctly, is about -5 dB. An interesting question is: How critical is the resolution in frequency and [...] What is the effect of reducing the degree to which temporal fluctuations are present in the speech signal? In [...]

The significance of the various modulation frequencies for speech communication can be compared with the significance of the various audio frequencies. For example, in designing channel vocoders, we need to know not only the frequency range (e.g., up to 4 kHz) to be covered by the channels, but also the upper limit of the envelope frequencies required to preserve intelligible speech. Similarly, in applying alternative presentation of speech information to the deaf, we need to know up to which envelope frequency the (tactile, visual) channel must transfer the signal faithfully. The range of modulation frequencies most relevant for speech, as mentioned above, has been determined by means of physical/acoustical measurements, and not by any formal perceptual evaluation. In much the same way, a 25-Hz limit for temporal modulations in up to 100 filter bands was applied in early channel vocoders (cf. Flanagan, 1972). There have been measurements of consonant and vowel intelligibility scores, mostly for diagnostic purposes.
But, as far as we know, the limit for temporal modulations has never been determined explicitly by means of intelligibility tests.

As to phoneme perception, several investigators have studied the information contained in the temporal envelope for consonant recognition in cases of limited spectral cues. With noise stimuli modulated by the amplitude envelope of /aCa/ syllables, Van Tasell et al. (1987) found poor identification scores, with highly variable performance across (untrained) subjects. Several features (voicing, amplitude, and burst) could be derived from the envelope, but for modulation frequencies up to 20 Hz, they accounted for only 19% of the transmitted information. When /aCa/ stimuli are masked by white noise with the same temporal envelope as the speech waveform, Freyman et al. (1991) have shown that nonlinear amplification of the envelope (a 10-dB increase of the consonant portion) has no effect on overall consonant recognition, but it can alter confusion patterns for specific consonant groups. In cases without limited spectral information, Behrens and Blumstein (1988) found that interchanging the amplitude of various voiceless fricatives in CV syllables resulted in few or inconsistent place-of-articulation errors. They concluded that, at least for voiceless fricative noise, compatibility of the spectral properties and of formant transitions dominated the effect of amplitude manipulations. It should be noted that the results of all these studies are based on wideband amplitude envelopes.

With the present perception experiments we investigated the extent to which speech intelligibility depends on the details (fast modulations) in the temporal envelope of the signal. In a first experiment, the intelligibility for sentences in quiet and the SRT for sentences in noise were measured as a function of temporal smearing. In a second experiment, the effects on vowel and consonant identification in nonsense syllables were studied.
We adopted the second method for smearing the temporal envelope. It has the advantage that the fine structure remains intact and that we can control the process of filtering the envelope by selecting the cutoff frequency and the slope of the filter. In this way it is known how the temporal modulation spectrum changes, and the intelligibility can be evaluated as a function of the temporal envelope cutoff frequency. Smearing of the temporal envelope should not be done on the wideband speech signal. It is known that those modulations can be reduced considerably, without affecting the intelligibility in a major way. Since the temporal fluctuations of speech are only partly correlated over frequency (the more two frequency bands are separated, the lower their correlation, cf. Houtgast and Verhage, 1991), the wideband signal does not include all temporal amplitude variations in the different frequency bands. Therefore, the speech signal has to be split up into several frequency bands, so that the temporal envelope of each individual band can be modified.

B. Signal processing

For the signal processing, an analysis-resynthesis scheme for smearing the amplitude envelope of digitized speech was developed. A block diagram of the processing is shown in [...]

The slopes of the low-pass filters were empirically set to approximately -40 dB/oct, so that the modified envelope will not become negative. After filtering, the envelope is upsampled again. The modified band signal is obtained by multiplying the original band (fine structure) by the ratio of the filtered envelope and the original envelope at each corresponding point in time. As a result of the envelope filtering (especially at low cutoff frequencies), parts of the original band signal having low amplitude are amplified in the modified band signal, particularly just before and after periods with a high amplitude.
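The ratio-rescaling step described above can be sketched for a single band as follows. This is only an illustration under stated assumptions, not the authors' implementation: the band envelope is taken as already known (the excerpt does not say how it is extracted), a first-order recursive low-pass stands in for the roughly -40 dB/oct filters, and the names `one_pole_lowpass`, `smear_band`, `modulation_depth`, the `floor` guard, and the synthetic test signal are all hypothetical.

```python
import math


def one_pole_lowpass(x, cutoff_hz, fs_hz):
    """First-order recursive low-pass (an assumed stand-in for the steeper
    envelope filters in the text).

    Each output is a convex combination of the previous output and the new
    input, so a non-negative envelope stays non-negative.
    """
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs_hz)
    state = x[0]
    out = []
    for value in x:
        state += alpha * (value - state)
        out.append(state)
    return out


def smear_band(band, envelope, cutoff_hz, fs_hz, floor=1e-9):
    """Multiply the band by (filtered envelope / original envelope).

    `floor` (an assumption) guards against division by a zero envelope.
    Returns the modified band and the filtered envelope.
    """
    smoothed = one_pole_lowpass(envelope, cutoff_hz, fs_hz)
    modified = [s * f / max(e, floor)
                for s, e, f in zip(band, envelope, smoothed)]
    return modified, smoothed


def modulation_depth(env):
    """(max - min) / (max + min) of an envelope segment."""
    return (max(env) - min(env)) / (max(env) + min(env))


# Hypothetical test signal: a 1-kHz carrier with a 20-Hz amplitude modulation.
fs_hz = 8000
n = fs_hz  # one second of samples
envelope = [1.0 + 0.8 * math.sin(2 * math.pi * 20 * i / fs_hz) for i in range(n)]
carrier = [math.sin(2 * math.pi * 1000 * i / fs_hz) for i in range(n)]
band = [e * c for e, c in zip(envelope, carrier)]

# Smear with a 2-Hz envelope cutoff and compare modulation depths after the
# filter's start-up transient has died out (second half of the signal).
modified, smoothed = smear_band(band, envelope, cutoff_hz=2.0, fs_hz=fs_hz)
print("depth before:", modulation_depth(envelope[n // 2:]))
print("depth after: ", modulation_depth(smoothed[n // 2:]))
```

Because the band is divided by its own envelope before being rescaled, the carrier (fine structure) is left untouched and only the slow amplitude contour changes, which is the property the text attributes to this method.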
These modified parts sometimes contain amplified quantization noise of high frequencies (not belonging in the frequency band), causing sharp, clicking sounds. To eliminate these, the modified band signal is low-pass filtered, using a FIR filter with a cutoff frequency 5% above the upper cutoff frequency of the corresponding bandpass [...]

B. Subjects

Subjects were 36 normal-hearing students and employees of the Free University, whose ages ranged from 18 to 30. All had pure-tone air-conduction thresholds less than 15 dB HL in their preferred ear at octave frequencies from 125 to 4000 Hz and at 6000 Hz. They were divided into three groups of twelve, each group receiving the ten conditions for one of the three processing bandwidths.

C. Procedure

From the ten lists of 13 sentences, six were used in the SRT experiment and four in the SIQ experiment. Lists were presented in a fixed order. The sequence of the conditions was varied according to a digram-balanced Latin square (6×6 for the SRT and 4×4 for the SIQ experiment) to avoid order and list effects. With twelve subjects in a group, each sequence was presented to two subjects in the SRT experiment and to three subjects in the SIQ experiment. For the SIQ experiment, all four lists were presented in quiet, at an average level of 70 dB(A). Every sentence was presented once, after which a subject had to reproduce it as accurately as possible. Subjects were encouraged to re- [...]

The sentences in both the SIQ and the SRT experiment were presented monaurally through earphones (Sony MDR-CD999) at the ear of preference in a soundproof room. Before the actual tests, a list of 13 sentences pronounced by a male speaker was presented, in order to familiarize the subjects with the procedure. For the SIQ experiment this list consisted of sentences in the 1-Hz condition; for the SRT experiment another list in the 16-Hz condition was used. All subjects started with the SIQ experiment, continuing with the SRT experiment after a short break.
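The digram-balanced Latin squares mentioned in the procedure can be illustrated with the sketch below. It uses the standard Williams construction for an even number of conditions; whether the authors built their 6×6 and 4×4 squares this way is not stated, so the construction and the function names `williams_square` and `is_digram_balanced` are assumptions for illustration.

```python
from collections import Counter


def williams_square(n):
    """Return an n x n digram-balanced Latin square for even n.

    The first row interleaves low and high labels (0, 1, n-1, 2, n-2, ...);
    every further row adds 1 modulo n to the previous one.
    """
    if n < 2 or n % 2:
        raise ValueError("this construction needs an even n >= 2")
    first_row = [0]
    low, high = 1, n - 1
    while len(first_row) < n:
        first_row.append(low)
        low += 1
        if len(first_row) < n:
            first_row.append(high)
            high -= 1
    return [[(c + shift) % n for c in first_row] for shift in range(n)]


def is_digram_balanced(square):
    """Check the Latin property and that every ordered pair of distinct
    conditions occurs exactly once as immediate neighbours within a row."""
    n = len(square)
    symbols = set(range(n))
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({row[j] for row in square} == symbols for j in range(n))
    pairs = Counter((row[j], row[j + 1]) for row in square for j in range(n - 1))
    pairs_ok = len(pairs) == n * (n - 1) and all(c == 1 for c in pairs.values())
    return rows_ok and cols_ok and pairs_ok


# Condition orders for the two experiments (rows are presentation sequences).
srt_orders = williams_square(6)
siq_orders = williams_square(4)
for row in siq_orders:
    print(row)

# With twelve subjects per group, each sequence is reused 12 // n times.
print(12 // len(srt_orders), "subjects per SRT sequence")
print(12 // len(siq_orders), "subjects per SIQ sequence")
```

In such a design every condition appears once in each serial position and is immediately preceded once by every other condition, which is what balances first-order carry-over between conditions.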
The entire session lasted about 1 h per subject.

D. Results and discussion

The mean results of the SIQ experiment for the four filtering conditions in the three bandwidths are plotted in [...]

The results of the experiments indicate that the intelligibility increases progressively with low-pass cutoff frequency up to about 16 Hz. In other words, modulation frequencies in the amplitude envelope above 16 Hz (with lower modulation frequencies present) do not really contribute to understanding ordinary sentences. For cutoff frequencies above 4 Hz, the intelligibility appears to be independent of the processing bandwidth; the beneficial effect of a larger bandwidth is only demonstrated for cutoff frequencies below 4 Hz. An explanation for this, considering that the frequency bands are not 100% correlated, could be as follows. For the limit case of 0 Hz, the only information is in the variations of the energy distribution within each processing band. This can be referred to as spectral micro-information. This micro-information is quite useful within the 1-oct bands (and to a lesser extent in the 1/2-oct bands), since it can be resolved by the ear, whereas it cannot be resolved within the 1/4-oct bands. When increasing the envelope cutoff frequency, more and more of the spectral macro-information becomes available (variations in the overall spectral shape), dominating the role of the micro-information. The breaking point appears to be around a cutoff frequency of 4 Hz, perhaps because then the information on the word/syllable structure is sufficiently present. With respect to the infinite peak-clipping question raised in the introduction, 100% compression of the temporal modulations (0-Hz condition) in 1-oct bands still [...]

B. Subjects

Subjects were 24 normal-hearing students of the Free University, whose ages ranged from 19 to 28. All had pure- [...]

IV. GENERAL DISCUSSION

In the experiments described above, we tried to assess the contribution of temporal modulations to intelligibility and identification. The manipulations of the speech signal are quite artificial and are not intended to model reduced temporal resolution by hearing-impaired listeners. Nor do they reflect any disturbances that may occur in communication channels, except maybe for early channel vocoders, as mentioned in the Introduction. The method provides information on the importance of temporal modulations without disturbing the fine structure, as is the case with noisy or reverberated speech.

Filtering of the envelope generally causes low-amplitude regions to be amplified and high-amplitude regions to be attenuated, yielding smaller modulation depths. Rapid changes in amplitude are flattened. As a result of this, short-duration phonemes (stops) will be more affected than long-duration phonemes (vowels, fricatives, vowel-like consonants), as was observed in the second experiment. By decreasing the cutoff frequency of the envelope filter, the amplitude variations in each processing band get smaller, eventually leading to a stationary sound. As noted earlier, for narrow processing bands this means that spectral content will ever more resemble the long-term average spectrum of the utterance. As an illustration, [...]

As far as the reduction of fast modulations is concerned, this corresponds roughly to a situation between our 4- and 8-Hz condition. Gelfand and Silman found that initial consonants are on average less affected than final consonants, whereas our study does not show this difference. In reverberant speech, final consonants are masked by delayed energy of preceding segments, which does not hold for initial consonants. In our processing, however, smearing of the envelope causes segments to integrate with preceding and following segments.
Initial consonants will thus be corrupted by the following vowel, and will therefore also de- [...]

(4) Phoneme identification with nonsense syllables shows that consonants are more affected by temporal smearing than vowels. Stops appear to suffer most, due to their short duration, with confusion patterns depending on the position in the syllable.

ACKNOWLEDGMENTS