Robustness in the presence of noise is a crucial issue normally addressed in connection with speech recognition, especially when performance in a real-world environment is concerned.
In cases where the noise corrupting the speech is stationary and its characteristics are known in advance, robustness issues can, to a certain extent, be addressed during the training of the speech recognition system. In particular, the acoustic model of the speech recognition system can be trained on a representative collection of noisy data; this approach is known as “multi-style training” and has been shown to reduce the degradation of recognition accuracy in the presence of noise. However, in most applications, the noise corrupting the speech is neither accurately known in advance nor completely stationary. In such cases, on-line compensation algorithms provide better performance than multi-style training.
To date, various efforts have been made in the contexts just described, yet various shortcomings and disadvantages have been observed.
Conventionally, on-line algorithms that aim at enhancing speech corrupted by environmental noise are audio-only approaches; they process the noisy speech signal using audio information only. The Codebook Dependent Cepstral Normalization (CDCN) approach (see Alejandro Acero, “Acoustical and Environmental Robustness in Automatic Speech Recognition”, PhD thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pa. 15213, September 1990) and the SPLICE approach (see Li Deng, Alex Acero, Li Jiang, Jasha Droppo and Xuedong Huang, “High-performance Robust Speech Recognition Using Stereo Training Data”, in the Proceedings of the International Conference on Acoustics Speech and Signal Processing (ICASSP) 2001, May 2001) are examples of non-linear audio-only approaches. In these approaches, a non-linear compensation term is estimated from the observed noisy speech features, using either some a priori information about, or an on-the-fly estimate of, the characteristics of the corrupting noise. The estimated compensation term is then combined with the observed noisy features to produce an estimate of the clean speech features. Since all the available audio information is usually affected by the noise, and since the exact characteristics of the noise are usually not known with accuracy, estimating the compensation term can be a very difficult problem.
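By way of illustration only, the general idea of estimating a compensation term from the noisy features and combining it with them can be sketched as follows. This is a simplified, hypothetical example and not the CDCN or SPLICE algorithm as published: it assumes a scalar feature, a small Gaussian mixture over the noisy feature whose parameters (`priors`, `means`, `variances`) are known, and one additive correction per mixture component (`corrections`). All of these names and values are assumptions made for the sketch.

```python
import math


def gaussian_pdf(y, mean, var):
    """Density of a one-dimensional Gaussian evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def compensate(y, priors, means, variances, corrections):
    """Estimate a clean feature as y + sum_k p(k | y) * r_k.

    y           -- observed noisy feature (scalar, for simplicity)
    priors      -- mixture weights of the assumed noisy-feature model
    means       -- component means of that model
    variances   -- component variances of that model
    corrections -- assumed additive correction r_k for each component
    """
    # Weighted likelihood of each component given the noisy observation.
    weighted = [p * gaussian_pdf(y, m, v)
                for p, m, v in zip(priors, means, variances)]
    total = sum(weighted)
    # Posterior probability of each component. The sketch does not guard
    # against total == 0 (an observation far from every component).
    posteriors = [w / total for w in weighted]
    # The compensation term depends non-linearly on y through the posteriors.
    compensation = sum(post * r for post, r in zip(posteriors, corrections))
    return y + compensation


# Toy model: two well-separated components with different corrections.
PRIORS = [0.5, 0.5]
MEANS = [-5.0, 5.0]
VARIANCES = [1.0, 1.0]
CORRECTIONS = [-1.0, 2.0]

clean_near_second = compensate(5.0, PRIORS, MEANS, VARIANCES, CORRECTIONS)
clean_midway = compensate(0.0, PRIORS, MEANS, VARIANCES, CORRECTIONS)
```

In this toy setting, an observation close to one component receives essentially that component's correction, while an observation midway between the two receives the average of both corrections. The quality of the estimate depends entirely on how well the assumed model and corrections match the actual noise, which is the difficulty noted above.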
On the other hand, the visual data that can be obtained from the mouth area of the speaker's face, and that carry information on the movements of the speaker's lips, can be expected to be relatively unaffected by environmental noise. Audio-visual speech recognition, where both an audio channel and a visual channel are input to the recognition system, has already been demonstrated to outperform traditional audio-only speech recognition in noisy conditions (see C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari and J. Zhou, “Audio-visual speech recognition, final workshop report”, Center for Language and Speech Processing, 2000; see also G. Potamianos, C. Neti and J. Luettin, “Hierarchical discriminant features for audio-visual LVCSR”, Proceedings of ICASSP2001, May 2001). Audio-visual speech recognition is more robust than audio-only speech recognition in the presence of noise because it makes use of visual information that is correlated with the phonetic content of the speech and is relatively unaffected by noise. However, audio-visual speech recognition does not explicitly address the problem of compensating for the effect of noise on speech, i.e., it does not enhance the noisy audio features.
In addition to audio-visual speech recognition, a visual modality is also being investigated as a medium of speech enhancement, where clean audio features are estimated from audio-visual speech when the audio channel is corrupted by noise; see L. Girin, J. L. Schwartz and G. Feng, “Audio-visual enhancement of speech in noise”, Journal of the Acoustical Society of America, vol. 109, no. 6, pp. 3007-3020, 2001, and also R. Goecke, G. Potamianos and C. Neti, “Noisy audio feature enhancement using audio-visual speech data”, Proceedings of ICASSP'02, 2002. In both these works, audio-visual enhancement relies on a training phase with a stereo training database consisting of clean audio features in the first channel and of noisy audio features and visual features in the second channel. The noisy audio data in the second channel are generated by adding noise to the waveform underlying the clean audio features contained in the first channel. The training procedure involves estimating a transfer function between the noisy audio-visual features in the second channel and the clean audio features in the first channel. Girin et al. experiment with a transfer function that is either a linear filter or a non-linear associator. The enhancement provided by either transfer function is assessed on a simplistic task of audio-only speech recognition of a vowel-plosive-vowel test corpus with a single speaker. Girin et al. appear to disclose that enhancing the noisy audio-visual data with the linear filter improves the speech recognition accuracy of the vowels but results in a lower recognition accuracy of the consonants. Girin et al. also set forth that enhancing the noisy audio-visual data with the non-linear transfer function instead of the linear transfer function provides a better improvement on the vowel recognition task but does not provide any clear improvement on the consonant recognition task.
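By way of illustration only, the stereo training procedure described above, in the case where the transfer function is a linear filter, can be sketched as an ordinary least-squares fit. This is a hypothetical sketch and not the implementation of either cited work: the function names (`train_linear_filter`, `enhance`), the feature dimensions and the synthetic data are all assumptions, and for brevity the noise is added directly to the features rather than to the waveform.

```python
import numpy as np


def train_linear_filter(noisy_av, clean_audio):
    """Least-squares fit of (weights, bias) so that
    clean_audio is approximated by noisy_av @ weights + bias.

    noisy_av    -- (frames, audio_dim + visual_dim) noisy audio-visual features
    clean_audio -- (frames, audio_dim) clean audio features (stereo counterpart)
    """
    # Append a constant column so that the bias is estimated jointly.
    ones = np.ones((noisy_av.shape[0], 1))
    design = np.hstack([noisy_av, ones])
    solution, *_ = np.linalg.lstsq(design, clean_audio, rcond=None)
    return solution[:-1], solution[-1]


def enhance(noisy_av, weights, bias):
    """Apply the linear filter to noisy audio-visual features."""
    return noisy_av @ weights + bias


# Synthetic stereo data (assumption: toy dimensions and Gaussian features).
rng = np.random.default_rng(0)
n_frames, audio_dim, visual_dim = 500, 4, 3
clean_audio = rng.normal(size=(n_frames, audio_dim))
# Toy visual features: correlated with the clean audio, unaffected by the noise.
visual = clean_audio[:, :visual_dim] + 0.1 * rng.normal(size=(n_frames, visual_dim))
# Toy noisy audio: clean features plus additive noise (a simplification).
noisy_audio = clean_audio + rng.normal(size=(n_frames, audio_dim))
noisy_av = np.hstack([noisy_audio, visual])

weights, bias = train_linear_filter(noisy_av, clean_audio)
enhanced = enhance(noisy_av, weights, bias)

# Mean squared error against the clean features, measured on the training data.
mse_noisy = float(np.mean((noisy_audio - clean_audio) ** 2))
mse_enhanced = float(np.mean((enhanced - clean_audio) ** 2))
```

On the training data, the least-squares fit cannot do worse than passing the noisy audio features through unchanged, since that mapping is itself one of the candidate linear filters. Whether such a filter generalizes to unseen speech and noise depends on how well the assumption of a linear coupling holds.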
In Goecke et al., supra, the transfer function is a linear filter. The enhancement provided by the transfer function is assessed on an automatic speech recognition task. Goecke et al. set forth that, when the data are decoded with an audio-only speech recognizer, enhancing the noisy audio-visual data with the linear filter results in better speech recognition accuracy than not enhancing the data. Further, Goecke et al. set forth that enhancing the noisy audio-visual data with the linear filter and decoding the enhanced data with an audio-only speech recognizer results in a speech recognition accuracy that is significantly worse than the accuracy obtained by decoding non-enhanced data with an audio-visual speech recognizer. In other words, the performance of audio-visual speech enhancement combined with audio-only speech recognition remains significantly inferior to the performance of audio-visual speech recognition alone.
In conclusion, audio-visual speech enhancement techniques have an advantage over audio-only speech enhancement techniques in that the visual modality provides information that is not affected by environmental noise. However, linear approaches to audio-visual speech enhancement assume a linear coupling between the noisy audio-visual features and the clean audio features. This assumption of linearity is somewhat arbitrary and may not be valid. Non-linear approaches to audio-visual speech enhancement have thus far received very little investigation and have not been reported to be successful even on very simple and controlled speech recognition tasks. More generally, state-of-the-art approaches to audio-visual speech enhancement have so far not provided gains in recognition accuracy over state-of-the-art approaches to audio-visual speech recognition, even though audio-visual speech recognition does not explicitly address the problem of compensating for the effect of noise on speech.
In view of the foregoing, a need has been recognized in connection with improving upon the shortcomings and disadvantages of conventional efforts.