It has been observed that a talker in a difficult communication environment usually alters the speaking style to make the speech more intelligible. The resulting speech is known as “clear speech”. Studies have shown that, in comparison to conversational-style speech, it is more intelligible for listeners in noisy backgrounds and for listeners with hearing impairment, children with learning disabilities, and non-native listeners. Increased consonant intensity and duration have been identified as the main contributors to the intelligibility advantage of clear speech. Studies using modification of conversational speech have shown that enhancement of consonant intensity resulted in improved speech intelligibility, while duration modification resulted in only marginal improvements, possibly due to errors in locating the boundaries of the segments to be modified and due to processing-related artifacts. The marginal improvement may also be due to the fact that formants in conversational speech are relatively less targeted, a characteristic that cannot be improved by duration modification.
Increasing the intensity of consonant segments relative to the nearby vowel segments is known as consonant-vowel ratio (CVR) modification. It is reported to be effective in improving perception of consonants, across speakers and vowel contexts, for listeners in noisy backgrounds and for hearing-impaired listeners. The techniques for CVR modification can be broadly classified as manual or automated, depending on the methods used for locating the segments for modification. The manual techniques are useful in investigating the effectiveness of CVR modification in improving speech perception. Results of investigations with such techniques have shown that a significant improvement in speech intelligibility can be achieved by accurate selection and careful modification of perceptually salient segments in conversational speech. Automated techniques for CVR modification, implemented for real-time processing, can be useful for enhancing speech intelligibility in communication devices and hearing aids. To be useful in such applications, the technique should meet the following requirements: (i) the segments for modification should be detected with a high temporal accuracy and a low rate of insertion errors, and without being significantly affected by speaker variability; (ii) modification of speech characteristics should be carried out without introducing perceptible distortions; (iii) the processing should have low computational complexity and memory requirement to enable real-time processing using the processors available in communication devices and hearing aids; and (iv) the signal delay introduced by the processing (processing delay consisting of the algorithmic and computational delays) should not be disruptive for audio-visual speech perception. These requirements are only partly met by the existing systems.
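As a minimal illustration of the CVR modification defined above, the following Python sketch scales a consonant segment whose boundaries are already known (as in the manual techniques) and measures the resulting change in CVR. The function names, the RMS-based definition of the ratio, the synthetic test signal, and the 6 dB gain are assumptions made for illustration; they are not taken from any of the cited studies.

```python
import math

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def cvr_db(signal, consonant, vowel):
    """CVR in dB: consonant RMS level relative to vowel RMS level.

    `consonant` and `vowel` are (start, end) sample-index pairs, end exclusive.
    """
    c = rms(signal[consonant[0]:consonant[1]])
    v = rms(signal[vowel[0]:vowel[1]])
    return 20.0 * math.log10(c / v)

def boost_consonant(signal, consonant, gain_db):
    """Return a copy of `signal` with the consonant segment scaled by gain_db."""
    g = 10.0 ** (gain_db / 20.0)
    start, end = consonant
    return [s * g if start <= n < end else s for n, s in enumerate(signal)]

# Synthetic stand-in for a consonant-vowel syllable (assumed, not real speech):
# a weak segment followed by a strong one.
consonant, vowel = (0, 100), (100, 500)
signal = [(0.1 if n < 100 else 1.0) * math.sin(0.3 * n) for n in range(500)]

before = cvr_db(signal, consonant, vowel)
after = cvr_db(boost_consonant(signal, consonant, 6.0), consonant, vowel)
print(f"CVR before: {before:.2f} dB, after: {after:.2f} dB")
```

The sketch applies the gain as an abrupt step at the segment boundaries. A practical system would likely need a smooth gain transition to satisfy requirement (ii), and an automated technique would additionally have to locate the segment boundaries itself.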
Kates (J. M. Kates, “Speech intelligibility enhancement,” U.S. Pat. No. 4,454,609, 1984) has described a method for enhancement of intelligibility of consonant sounds in communication systems by boosting high-frequency components. The system comprises a bank of band-pass filters and envelope detectors, a controller that sets the gain for each filter channel by comparing its short-time energy with those of the selected reference channels, and a stage that applies these gains to dynamically modify the overall spectral shape. Reference channels are selected for boosting the short-time energy of the high-frequency channels with respect to the low-frequency channels. Thus the method enhances only the sounds characterized by high-frequency release bursts and transitions, and not all transient segments. Further, the use of fixed frequency bands in the processing limits its adaptability to speaker variability.
Terry (A. M. Terry, “Method and apparatus for enhancement of telephonic speech signals,” U.S. Pat. No. 5,737,719, 1998) has described a system for boosting the second formant with respect to the first formant and for modification of the consonant-vowel ratio. Processing uses a bank of Bark-scale based band-pass filters. Short-time band energies are used to get an approximation of the auditory spectrum. Peak-picking is applied to locate the first two formants, and the second formant is enhanced with respect to the first one. Segments having energy levels below those associated with vowels but above those associated with silence are identified as consonantal, and these are amplified. The auditory spectrum is converted to a Fourier spectrum, and an inverse Fourier transform is used to produce the output. Although the method is suitable for real-time processing, errors in formant identification, errors in selecting consonantal segments, and the use of analysis-synthesis, particularly conversion from auditory spectrum to Fourier spectrum and discarding of the phase information, are likely to result in processing-related artifacts. Further, the use of fixed bands in the method limits its adaptability to speech and speaker variability.
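The energy-level rule for selecting consonantal segments described above can be sketched as follows. This is only an illustration of labelling by two energy thresholds and not the implementation in the cited patent; the threshold values (-50 dB and -15 dB), the 6 dB gain, and all function names are hypothetical.

```python
import math

def frame_energy_db(frame, floor=1e-12):
    """Mean-square energy of a frame in dB (floored to avoid log of zero)."""
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, floor))

def label_frame(frame, silence_db, vowel_db):
    """Label a frame by comparing its energy with two assumed thresholds."""
    e = frame_energy_db(frame)
    if e < silence_db:
        return "silence"
    if e < vowel_db:
        return "consonant"  # above silence level but below vowel level
    return "vowel"

def amplify_consonants(frames, silence_db=-50.0, vowel_db=-15.0, gain_db=6.0):
    """Return copies of the frames with the mid-energy frames amplified."""
    g = 10.0 ** (gain_db / 20.0)
    out = []
    for frame in frames:
        if label_frame(frame, silence_db, vowel_db) == "consonant":
            out.append([s * g for s in frame])
        else:
            out.append(list(frame))
    return out

# Three constant-amplitude frames at about -60 dB, -26 dB, and -6 dB.
frames = [[0.001] * 4, [0.05] * 4, [0.5] * 4]
labels = [label_frame(f, -50.0, -15.0) for f in frames]
print(labels)
```

Because the decision rests on energy alone, a weak vowel or a burst of background noise falling between the two thresholds would also be labelled consonantal, which is one way the selection errors noted above can arise.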
Michaelis (P. R. Michaelis, “Method and apparatus for improving the intelligibility of digitally compressed speech,” U.S. Pat. No. 6,889,186B1, 2005) has described a method which involves segmenting input speech into frames, carrying out spectral analysis to identify the type of sound in each frame, and applying a gain based on the type of sound in the frame and in the surrounding frames, to improve speech intelligibility. Frames identified as unvoiced fricatives and plosives are amplified and the preceding voiced frames are attenuated. This method does not address enhancement of voiced stops and fricatives, which may be hard to perceive under adverse listening conditions. Fixed-frame based segmentation may cause short-duration release bursts to get merged with the voiced segments, resulting in errors in classification of frames and thereby limiting the effectiveness of the modification in improving speech intelligibility. Further, the need for classification of the frames increases computational complexity, and the dependence of the gain of a frame on the type of neighbouring frames causes excessive signal delay.
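The dependence of a frame's gain on its neighbours can be illustrated with the sketch below. It assumes that frame labels are already available from a classifier and that only the single voiced frame immediately preceding an amplified frame is attenuated; the label strings, the gain values, and the function name are hypothetical and are not taken from the cited patent.

```python
# Assumed label set for frames to be amplified (hypothetical names).
BOOSTED_TYPES = ("unvoiced_fricative", "plosive")

def frame_gains_db(labels, boost_db=6.0, cut_db=-3.0):
    """Per-frame gains in dB from frame labels.

    Frames of a boosted type are amplified. A voiced frame is attenuated
    when the frame that follows it is of a boosted type.
    """
    gains = [0.0] * len(labels)
    for i, label in enumerate(labels):
        if label in BOOSTED_TYPES:
            gains[i] = boost_db
        elif (label == "voiced"
              and i + 1 < len(labels)
              and labels[i + 1] in BOOSTED_TYPES):
            gains[i] = cut_db  # needs the label of the next frame
    return gains

labels = ["voiced", "voiced", "plosive", "voiced", "unvoiced_fricative"]
gains = frame_gains_db(labels)
print(gains)
```

The gain of frame i cannot be fixed until frame i + 1 has been received and classified, so even this one-frame look-ahead adds at least one frame of delay to the output.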
Vandali et al. (A. E. Vandali, G. M. Clark, “Emphasis of short-duration transient speech features,” U.S. Pat. No. 8,296,154B2, 2012) have described a transient emphasis system for use in auditory prostheses to assist in perception of low-intensity short-duration speech features. The method uses a bank of band-pass filters and envelope detectors. For each filter channel, a running history buffer of the envelope spanning 60 ms with 2.5 ms intervals is used to estimate its second derivative which is used to determine a channel gain function. As the method uses fixed frequency bands, it is not adaptive to speech and speaker variability and it also suffers from a relatively large signal delay.
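A running history buffer of the kind described above can be sketched as follows. The buffer length follows from the stated figures (60 ms at 2.5 ms intervals gives 24 values). The particular second-derivative estimate used here, a second difference of the mean envelope over the early, middle, and late thirds of the buffer, is an assumption made for illustration, and the mapping from this estimate to a channel gain is not shown because the description above does not specify it.

```python
from collections import deque

INTERVAL_MS = 2.5
SPAN_MS = 60.0
BUFFER_LEN = int(SPAN_MS / INTERVAL_MS)  # 24 envelope values per channel

def second_difference(history):
    """Second-difference estimate from the three thirds of a full buffer."""
    h = list(history)
    n = len(h) // 3
    early = sum(h[:n]) / n
    middle = sum(h[n:2 * n]) / n
    late = sum(h[2 * n:3 * n]) / n
    return early - 2.0 * middle + late

def push(history, envelope_value):
    """Append a new envelope value; return the estimate once the buffer is full."""
    history.append(envelope_value)  # deque with maxlen drops the oldest value
    if len(history) < history.maxlen:
        return None
    return second_difference(history)

# One channel: silence, then a short burst, then silence again.
history = deque(maxlen=BUFFER_LEN)
estimates = [push(history, v) for v in [0.0] * 8 + [1.0] * 8 + [0.0] * 8]
print(estimates[-1])
```

With this sign convention, a brief burst centred in the buffer gives a negative estimate, while a constant or linearly changing envelope gives zero. Since the estimate refers to the middle of the buffer, the output lags the newest envelope value by roughly half the 60 ms span, which is consistent with the relatively large signal delay noted above.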
Skowronski et al. (M. D. Skowronski, J. G. Harris, “Applied principles of clear and Lombard speech for automated intelligibility enhancement in noisy environments,” Speech Communication, vol. 48, pp. 549-558, 2006) reported a method for speech intelligibility enhancement based on redistribution of energy in voiced and unvoiced segments. In this method, a measure of spectral flatness derived from the short-time speech spectrum along with a Schmitt trigger based thresholding is used for classifying the segments as voiced or unvoiced. The voiced segments (those corresponding to vowels, semivowels, nasals, voiced plosives, and voiced fricatives) are attenuated and the unvoiced segments are amplified, maintaining the overall energy unaltered. Possible errors in classification and sensitivity of the classification method to additive noise are the limiting factors in its usefulness in enhancing the unvoiced segments. Further, attenuation of the low-energy voiced plosives and fricatives may adversely affect their perception. Colotte et al. (V. Colotte, Y. Laprie, “Automatic enhancement of speech intelligibility,” Proceedings of ICASSP 2000, Istanbul, pp. 1057-1060) have reported a method using a spectral variation function based on mel-cepstral analysis to locate stop and fricative segments and amplify them by 4 dB. In a method reported by Yoo et al. (S. D. Yoo, J. R. Boston, A. Jaroudi, C. C. Li, “Speech signal modification to increase intelligibility in noisy environment,” Journal of the Acoustical Society of America, vol. 122, pp. 1138-1149, 2007), the transient regions of speech are extracted and emphasized using time-varying band-pass filters based on formant tracking. Tantibundhit et al. (C. Tantibundhit, F. Pernkopf, G. Kubin, “Speech enhancement based on joint time-frequency segmentation,” Proceedings of ICASSP 2009, Taipei, pp. 4673-4676) have described a method for speech modification based on wavelet packet decomposition. These methods are computation intensive and introduce significant signal delays.
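The voiced/unvoiced classification and energy redistribution reported by Skowronski et al. can be illustrated with the following sketch. It uses the standard definition of spectral flatness (geometric mean of the power spectrum divided by its arithmetic mean) and a two-threshold hysteresis in the manner of a Schmitt trigger. The threshold values, the function names, and the way the voiced gain is derived from a chosen unvoiced gain are assumptions for illustration and are not taken from the cited paper.

```python
import math

def spectral_flatness(power_spectrum, floor=1e-12):
    """Geometric mean over arithmetic mean of a power spectrum (0 to 1)."""
    p = [max(x, floor) for x in power_spectrum]
    geometric = math.exp(sum(math.log(x) for x in p) / len(p))
    arithmetic = sum(p) / len(p)
    return geometric / arithmetic

def schmitt_trigger(values, low, high, initial=False):
    """Two-threshold hysteresis: True (unvoiced) above `high`, False below `low`."""
    state = initial
    out = []
    for v in values:
        if state and v < low:
            state = False
        elif not state and v > high:
            state = True
        out.append(state)
    return out

def voiced_gain(voiced_energy, unvoiced_energy, unvoiced_gain):
    """Amplitude gain for voiced segments that keeps the total energy unchanged."""
    remaining = voiced_energy + unvoiced_energy * (1.0 - unvoiced_gain ** 2)
    if remaining < 0.0:
        raise ValueError("unvoiced gain too large to preserve total energy")
    return math.sqrt(remaining / voiced_energy)

# Flatness values for successive frames (assumed numbers, not measured data).
flatness = [0.1, 0.5, 0.7, 0.5, 0.2]
unvoiced = schmitt_trigger(flatness, low=0.3, high=0.6)
g_v = voiced_gain(voiced_energy=10.0, unvoiced_energy=1.0, unvoiced_gain=2.0)
print(unvoiced, round(g_v, 4))
```

The hysteresis keeps the decision from toggling when the flatness hovers between the two thresholds, but additive noise raises the flatness of voiced frames, which is one reason the classification is sensitive to noise as noted above.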
In view of the foregoing, there is a need for a new method and system for consonant-vowel ratio modification without introducing perceptible distortions for improving speech intelligibility.