Humans can easily recognize speech in a noisy environment. However, this task is difficult for automatic speech recognition (ASR) systems. One explanation is that the brain has complex acoustic pattern recognition capabilities that are difficult to duplicate in ASR systems. In addition, the human peripheral auditory system has sophisticated signal representations, which can easily distinguish speech from noise. The cognitive processes that are brought to bear on human speech recognition tasks are not well understood and are difficult to emulate.
The human peripheral auditory system has been well studied, and several of its processes are well understood and can be modeled. It may be expected that, by simulating some of these processes within the signal processing schemes used by a speech recognizer, the recognizer's ability to reduce noise may be improved.
The means by which the peripheral auditory system acquires acoustic pressure waves and represents them in a form that can be forwarded to higher levels of the auditory pathway include various processes that are analogous to automatic gain control, critical band analysis, equal-loudness pre-emphasis, two-tone suppression, forward and backward masking, half-wave rectification, and envelope detection.
Some ASR systems use feature representations that model the peripheral auditory system in detail. Those systems perform at about the same level as ASR systems implemented with a Mel filter bank and cepstral analysis. Any additional gains derived from such feature representations are not commensurate with the greatly increased computation required by these models.
A more successful trend in anthropomorphic signal processing for speech recognition has been to model specific auditory phenomena rather than the entire auditory process, for example, by modeling critical band response in the computation of cepstral front ends for ASR. Critical band response is modeled in the signal processing schemes employed by almost all current ASR systems. The PLP features described by Hermansky incorporate equal-loudness pre-emphasis and root compression; see H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” J. Acoust. Soc. Am., vol. 87, pp. 1738-1752, 1990.
The peripheral auditory system exhibits a variety of masking phenomena. Temporal masking is a phenomenon whereby high-energy sounds mask lower-energy sounds that immediately precede or follow them. Simultaneous masking is a phenomenon whereby high-energy frequencies mask out adjacent, concurrent, lower-energy frequencies.
Computational analogues for temporal masking are described by B. Strope and A. Alwan, “A model of dynamic auditory perception and its application to robust word recognition,” IEEE Trans. Speech Audio Processing, vol. 5, pp. 451-464, 1997, and by M. Holmberg, D. Gelbart, and W. Hemmert, “Automatic speech recognition with an adaptation model motivated by auditory processing,” IEEE Trans. Speech Audio Process., vol. 14, no. 1, pp. 43-49, January 2006.
Other techniques compress and filter the effective envelope of the output of a critical-band filter bank; see M. Holmberg, D. Gelbart, and W. Hemmert, “Automatic speech recognition with an adaptation model motivated by auditory processing,” IEEE Trans. Speech Audio Process., vol. 14, no. 1, pp. 43-49, January 2006; J. Tchorz and B. Kollmeier, “A model of auditory perception as front end for automatic speech recognition,” J. Acoust. Soc. Am., vol. 106, pp. 2040-2050, 1999; and H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Trans. Speech and Audio Processing, vol. 2, no. 4, pp. 578-589, 1994. Those techniques have the incidental effect that high-energy sounds partially mask temporally adjacent low-energy acoustic phenomena.
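The compress-and-filter idea can be sketched for a single critical band as follows. This is a minimal illustration only: the filter coefficients follow a commonly cited form of the RASTA band-pass filter, and the pole value, the function names, and the use of a natural logarithm as the compressive nonlinearity are assumptions made for this sketch, not details taken from the cited systems.

```python
import math

def rasta_filter(trajectory, pole=0.98):
    """Band-pass filter one band's compressed energy trajectory (one value per frame).

    Assumed transfer function (causal form):
        H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - pole * z^-1)
    """
    numer = [0.2, 0.1, 0.0, -0.1, -0.2]
    history = [0.0] * len(numer)   # most recent input first
    out, prev = [], 0.0
    for x in trajectory:
        history = [x] + history[:-1]
        prev = pole * prev + sum(b * h for b, h in zip(numer, history))
        out.append(prev)
    return out

def compress_and_filter(band_energies):
    """Log-compress positive band energies, then filter them over time."""
    return rasta_filter([math.log(e) for e in band_energies])

# A constant band energy becomes a constant offset after the logarithm.
# The numerator taps sum to zero, so the filtered value decays toward zero.
steady = compress_and_filter([150.0] * 600)
```

Because the recursive part of the filter retains a decaying memory of earlier frames, a high-energy frame continues to influence the output for the frames that follow it, which is consistent with the incidental temporal masking noted above.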
Two-tone suppression is a nonlinear phenomenon observed in the cochlea. The presence of a first tone suppresses the response to a second tone that is near the first tone in frequency. This effect is likely to involve saturating amplification in the outer hair cells of the cochlea. At the psychoacoustic level, two-tone suppression manifests itself as simultaneous masking. Masking is defined by the American Standards Association (ASA) as the process by which the threshold of audibility for one sound is raised by the presence of another, masking sound.
An analog device for spectral contrast enhancement in hearing aids is described by M. A. Stone and B. C. J. Moore, “Spectral feature enhancement for people with sensorineural hearing impairment: Effects on speech intelligibility and quality,” J. Rehabil. Res. Dev., vol. 29, no. 2, pp. 39-56, 1992.
A digital spectral-contrast-enhancement algorithm can yield a significant improvement in speech perception in noise for hearing-impaired listeners; see T. Baer, B. C. J. Moore, and S. Gatehouse, “Spectral contrast enhancement of speech in noise for listeners with sensorineural hearing impairment: Effects on intelligibility, quality, and response times,” J. Rehabil. Res. Dev., vol. 30, no. 1, pp. 49-72, 1993.
Similarly, a peak-isolation mechanism based on raised-sine cepstral liftering can enhance spectral contrast and benefit ASR; see B. Strope and A. Alwan, “A model of dynamic auditory perception and its application to robust word recognition,” IEEE Trans. Speech Audio Processing, vol. 5, pp. 451-464, 1997, and B. H. Juang, L. R. Rabiner, and J. G. Wilpon, “On the use of bandpass liftering in speech recognition,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 35, pp. 947-954, July 1987.
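As a rough illustration of raised-sine liftering, the sketch below weights cepstral coefficients c[1..L] by w[k] = 1 + (L/2) sin(pi k / L). This window shape is the form commonly associated with bandpass liftering; the exact lifter used in the cited peak-isolation mechanism is not specified here, so the formula, the lifter length of 12, and the function names are assumptions.

```python
import math

def raised_sine_lifter(num_coeffs):
    """Return weights w[k] = 1 + (L/2) * sin(pi * k / L) for k = 1..L."""
    L = num_coeffs
    return [1.0 + (L / 2.0) * math.sin(math.pi * k / L) for k in range(1, L + 1)]

def lifter(cepstrum):
    """Weight cepstral coefficients c[1..L]; c[0] is assumed to be excluded."""
    weights = raised_sine_lifter(len(cepstrum))
    return [w * c for w, c in zip(weights, cepstrum)]

# Hypothetical 12-coefficient lifter: mid-order coefficients are emphasized,
# while the lowest and highest orders are weighted close to 1.
weights = raised_sine_lifter(12)
```

The window de-emphasizes the lowest-order coefficients, which describe the broad spectral tilt, and the highest-order coefficients, which describe fine spectral detail, so that the mid-order coefficients carrying the spectral peaks dominate.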
In general, ASR systems that use such techniques often improve recognition performance in “mismatched” conditions, i.e., when the recognizer has been trained on clean speech but the speech to be recognized is noisy. However, those systems do not improve performance when the training data are similar to the test data, which is the more realistic situation for most applications. Moreover, although those systems can obtain significant improvements for speech that has been corrupted by artificially added digital noise, they fail to deliver similar improvements on genuine noisy speech.
It is well known that the recognition performance obtained on noisy speech with systems that have been trained on noisy speech is generally better than that obtained on denoised noisy speech with systems that have been trained on clean speech; see M. J. Hunt, “Some Experience in In-Car Speech Recognition,” Proc. IEEE/Nokia Workshop on Robust Methods for Speech Recognition in Adverse Conditions, May 25-26, 1999.
A cochlear model with traveling-wave amplification and distributed gain control that exhibits two-tone suppression is described by L. Turicchia and R. Sarpeshkar, “The silicon cochlea: From biology to bionics,” in Biophysics of the Cochlea: From Molecules to Models, A. W. Gummer, Ed. Singapore: World Scientific, 2003, pp. 417-423.
A companding process simply mimics tone-to-tone suppression and masking in the auditory system. Spectral-contrast enhancement results as a consequence, and perception in noise is improved. Other techniques that explicitly enhance spectral contrast in the signal can also improve speech recognition in the presence of noise.
A significant improvement in speech recognition accuracy can be obtained, particularly at very low SNRs, using a digital simulation of the analog implementation of the companding process; see J. Guiness, B. Raj, B. Nielsen, L. Turicchia, and R. Sarpeshkar, “A Companding Front End for Noise-Robust Automatic Speech Recognition,” Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), pp. 249-252, Mar. 18, 2005. Such a design, while suitable for low-power analog VLSI, is inefficient for a real-time recognizer that functions entirely on digitized signals.
A bio-inspired companding process that mimics two-tone suppression in a highly programmable filter-bank architecture is described by L. Turicchia and R. Sarpeshkar, “A bio-inspired companding strategy for spectral enhancement,” IEEE Trans. Speech Audio Proc., vol. 13, no. 2, pp. 243-253, March 2005. The companding process filters an incoming signal with a bank of broad filters, compresses the outputs of the filters according to their estimated instantaneous RMS values, re-filters the compressed signals with a bank of narrow filters, and finally expands them again according to their instantaneous RMS values. This processing retains spectral peaks almost unchanged, whereas frequencies adjacent to spectral peaks are suppressed, resulting in two-tone suppression.
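The four stages described above (broad filter, compression, narrow filter, expansion) can be sketched for a single channel as follows. This is a simplified illustration, not the cited implementation: the second-order band-pass filters, the rectify-and-smooth envelope detector standing in for the instantaneous RMS estimate, the exponents n1 and n2, the Q values, and all function names are assumptions chosen for this sketch.

```python
import math

def bandpass(signal, fs, f0, q):
    """Second-order band-pass filter with unity gain at f0 (direct form I)."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def envelope(signal, fs, tau=0.005, floor=1e-9):
    """Rectify and smooth with a one-pole low-pass; floored to stay positive."""
    coeff = 1.0 - math.exp(-1.0 / (tau * fs))
    env, out = 0.0, []
    for x in signal:
        env += coeff * (abs(x) - env)
        out.append(max(env, floor))
    return out

def compand_channel(signal, fs, f0, n1=0.3, n2=1.0, q_broad=2.0, q_narrow=20.0):
    """Broad filter -> compress -> narrow filter -> expand, for one channel."""
    broad = bandpass(signal, fs, f0, q_broad)
    compressed = [x * e ** (n1 - 1.0) for x, e in zip(broad, envelope(broad, fs))]
    narrow = bandpass(compressed, fs, f0, q_narrow)
    return [x * e ** ((n2 - n1) / n1) for x, e in zip(narrow, envelope(narrow, fs))]

def tone(freq, amp, fs, n):
    return [amp * math.sin(2.0 * math.pi * freq * i / fs) for i in range(n)]

def rms(signal):
    return math.sqrt(sum(x * x for x in signal) / len(signal))

# Channel centered on a weak 1200 Hz tone, with and without a strong 1000 Hz neighbor.
fs, n = 16000, 3200
weak = tone(1200.0, 0.1, fs, n)
strong = tone(1000.0, 1.0, fs, n)
alone = compand_channel(weak, fs, 1200.0)
masked = compand_channel([w + s for w, s in zip(weak, strong)], fs, 1200.0)
rms_alone = rms(alone[n // 2:])    # second half only, to skip the start-up transient
rms_masked = rms(masked[n // 2:])
```

With these assumed settings, an isolated tone at the channel center passes through at roughly its original level, because the compression and expansion gains cancel. When a stronger tone falls inside the broad filter but outside the narrow one, it sets the compression gain but is largely removed before expansion, so the channel output for the weak tone is reduced. A complete system would sum the outputs of many such channels.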
An emergent property of the companding process is that the process enhances spectral contrast and naturally emphasizes spectral channels with a high signal-to-noise ratio (SNR), while suppressing channels with a lower SNR. The companding process significantly improves the intelligibility of the processed signal, both in simulations of cochlear implants and for real cochlear implants; see A. Bhattacharya and F.-G. Zeng, “Companding to improve cochlear implants' speech processing in noise,” 2005 Conference on Implantable Auditory Prostheses, 2005; Y. W. Lee, S. Y. Kwon, Y. S. Ji, S. M. Lee, S. H. Hong, J. S. Lee, and I. Y. Kim, “Speech Enhancement in Noise Environment Using Companding Strategy,” 5th Asia Pacific Symposium on CI and related Sciences, Hong Kong, China, 2005; and P. C. Loizou, K. Kasturi, L. Turicchia, R. Sarpeshkar, M. Dorman, and T. Spahr, “Evaluation of the companding and other strategies for noise reduction in cochlear implant,” 2005 Conference on Implantable Auditory Prostheses, 2005.