1. Field of the Invention
The present invention relates generally to audio processing, and more particularly to transform domain reconstruction of an acoustic signal that can improve the accuracy of automatic speech recognition systems in noisy environments.
2. Description of Related Art
An automatic speech recognition (ASR) system in an audio device can be used to recognize spoken words, or phonemes within the words, in order to identify spoken commands by a user. The ASR system takes an acoustic signal and carries out an analysis to extract speech parameters or “features” of the acoustic signal. These features are then compared to a corresponding set of features of known speech to determine the spoken command. The ASR system typically relies upon recognition models of known speech which have been trained on a speech collection from various speakers.
A specific issue arising in ASR concerns how to adapt the recognition models to different acoustic environments. In particular, the accuracy of the ASR system typically depends on the appropriateness of the recognition models it relies upon. For example, if the ASR system uses recognition models built using speech collected in a quiet environment, using these speech models to perform speech recognition in a noisy environment can result in poor recognition accuracy. One approach to improving recognition accuracy is to retrain the recognition models using new speech collected in the noisy environment. However, to ensure reasonable recognition performance, a large amount of new speech typically needs to be collected. Such an approach is time consuming, and in many instances is not practical.
A noise reduction system in the audio device can reduce background noise to improve voice quality in the acoustic signal from the perspective of a listener. The noise reduction system may extract and track speech characteristics such as pitch and level in the acoustic signal to build speech and noise models. These speech and noise models are used to generate a signal modification that strongly attenuates the parts of the acoustic signal that are dominated by noise, and preserves the parts that are dominated by speech.
Although the noise reduction system can improve voice quality from the perspective of a listener, strongly attenuating parts of the acoustic signal can be problematic for the ASR system. Specifically, after attenuation, the transform domain representation of the acoustic signal may not be similar to that of speech. As a result, the extracted features of the attenuated acoustic signal may not closely match those expected by the recognition models, resulting in possible recognition errors by the ASR system. In some instances, the attenuation may corrupt the extracted features more than the original noise would have, which causes the speech recognition performance of the ASR system to worsen rather than get better.
It is desirable to provide techniques for improving the accuracy of ASR systems in noisy environments.