1. Technical Field
The present invention relates to the fields of signal processing, speech processing, machine learning, and probabilistic methods. More specifically, the invention pertains to fast on-line adaptation of acoustic training models to achieve robust automatic audio recognition in the presence of sound disturbances, such as the disturbances created by changing environmental conditions, deviation of a speaker's accent from the standard language, and deviation of a sound from the standard sound characteristics.
2. Discussion
One recent technological trend is the use of automatic speech recognition (ASR) systems in many commercial and defense applications such as information retrieval, air travel reservations, signal intelligence for surveillance, voice activated command and control systems, and automatic translation. However, in all of these applications robustness is a primary issue, since the ASR system's performance degrades easily if any interference signals are present, if the ASR testing environment is significantly different from the training environment, if a speaker deviates from the standard language pronunciation, or if a non-native speaker uses the system. Furthermore, ASR systems must perform recognition in real time so that users can use the system comfortably and be satisfied with the results.
The ASR system's performance degrades even further as spoken dialogue information retrieval applications become more popular among mobile users in automobiles using cellular telephones. Because background noise and other interfering signals are typically present when a mobile system is used, speech recognition accuracy drops significantly unless the system is trained explicitly on the specific noisy speech signal for each environment. The same situation applies to noisy sound signals distorted by changing environments, which need to be recognized and classified by an automatic sound recognition system: sound recognition accuracy drops significantly unless the system is trained explicitly on the specific noisy sound signal for each environment. Since it is very hard to know a priori the environment in which mobile platforms are going to be used, the number of interfering signals that are present, and who will be using the system (a standard speaker or a non-standard speaker, and if non-standard, from which regional accent or which mother tongue), it is not practical to train recognizers for the appropriate range of typical noisy environments and/or non-standard speakers.
Therefore, it is imperative that automatic audio recognition systems be robust to mismatches between training and testing environments. The mismatches in general correspond to variations in background acoustic environment, non-standard speakers, and channel deviations. Several techniques have been developed to address the robustness issues in ASR. These techniques can be broadly classified into front-end processing and speaker/environment adaptation, as discussed in “A maximum likelihood approach to stochastic matching for robust speech recognition,” IEEE Trans. on Speech and Audio Processing, vol. 4, pp. 190-202, May 1996, by A. Sankar and C-H. Lee. The front-end processing techniques mainly try to remove the noise from an acoustic input signal (cleaning up the input signal) prior to attempting to recognize the acoustic signal. These techniques do not work well with constantly changing environmental conditions, or with speakers who deviate from the standard language pronunciation, since in the latter case it is not noise that is deforming the input signal.
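By way of illustration only, the front-end "cleaning" idea described above can be sketched with magnitude spectral subtraction, one commonly used noise-removal method. The text above does not name a particular front-end technique, so the choice of method, the function name spectral_subtract, the floor parameter, and all numeric values below are assumptions made for the example.

```python
def spectral_subtract(noisy_mag, noise_mag, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from one frame.

    noisy_mag and noise_mag are equal-length lists of non-negative
    magnitudes, one value per frequency bin (hypothetical inputs).
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    cleaned = []
    for y, n in zip(noisy_mag, noise_mag):
        # Keep a small fraction of the noisy magnitude as a floor so that
        # over-subtraction never produces a negative magnitude.
        cleaned.append(max(y - n, floor * y))
    return cleaned


# Illustrative values only: one four-bin frame and a fixed noise estimate.
frame = [1.0, 0.5, 2.0, 0.25]
noise = [0.25, 0.5, 0.5, 0.5]
cleaned = spectral_subtract(frame, noise)
```

The sketch relies on a fixed noise estimate, so it is only as good as that estimate: when the environment changes the estimate becomes stale, and when the mismatch comes from a speaker's pronunciation there is no additive noise term to subtract at all.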
On the other hand, the adaptation techniques conceptually correspond to projecting the trained models to the testing environment. These techniques work well with changing environmental conditions or with non-standard speakers only if the system has a large number of training models that closely resemble the changing environmental conditions, the deviation of a speaker from the standard language, and the deviation of a sound from the standard sound characteristics. This kind of projection can be performed in signal space, in feature space, or in model space.
Most of the state-of-the-art adaptation techniques developed to date perform the projection in the model space and are based on linear or piece-wise linear transformations. These techniques are computationally expensive, need separate adaptation training data that models the current environment and sound input, and are not applicable to derivatives of cepstral coefficients. All of these factors slow down the adaptation process and, therefore, prevent the prior art techniques from achieving adaptation in real time.
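By way of illustration only, a linear model-space transformation of the kind described above can be sketched as an affine map applied to the mean vectors of the trained models. The sketch assumes that each trained model is summarized by a mean vector and that a single transform (A, b) is shared by all models; the names adapt_means, A, and b and all numeric values are hypothetical. Estimating A and b from separate adaptation data, which is the computationally expensive step, is not shown.

```python
def adapt_means(means, A, b):
    """Apply the affine map mu' = A * mu + b to every mean vector.

    means: list of mean vectors (each a list of floats)
    A:     square matrix given as a list of rows
    b:     bias vector with one entry per row of A
    """
    adapted = []
    for mu in means:
        new_mu = []
        for row, bias in zip(A, b):
            # Dot product of one matrix row with the mean, plus the bias.
            new_mu.append(sum(a * m for a, m in zip(row, mu)) + bias)
        adapted.append(new_mu)
    return adapted


# Illustrative values only: two 2-dimensional means and one shared transform.
means = [[1.0, 2.0], [0.0, -1.0]]
A = [[1.0, 0.5], [0.0, 2.0]]
b = [0.5, -1.0]
adapted = adapt_means(means, A, b)
```

Applying a known transform is inexpensive, as the sketch shows; the cost noted above arises from estimating the transform, which requires adaptation data matching the current environment and sound input.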
Thus, artisans are faced with two competing goals: first, maintaining speech recognition accuracy, or robustness; and second, achieving that accuracy in real time. For the foregoing reasons, there is a great need for fast on-line adaptation of acoustic training models to achieve robust automatic audio recognition in the presence of sound disturbances, such as the disturbances generated by changing environmental conditions, deviations of a speaker from the standard language, and deviations of an input sound from the standard sound characteristics.
The following references are presented for further background information:
[1] A. Sankar and C-H. Lee, “A maximum likelihood approach to stochastic matching for robust speech recognition”, IEEE Transactions on Speech and Audio Processing, vol. 4, pp. 190-202, May 1996.
[2] R. C. Rose, E. M. Hofstetter and D. A. Reynolds, “Integrated models of signal and background with application to speaker identification in noise”, IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 245-257, April 1994.
[3] Hui Jiang and Li Deng, “A robust compensation strategy for extraneous acoustic variations in spontaneous speech recognition”, IEEE Transactions on Speech and Audio Processing, vol. 10, no. 1, pp. 9-17, January 2002.
[4] J. McDonough, T. Schaaf, and A. Waibel, “On maximum mutual information speaker-adapted training”, ICASSP 2002, vol. 1, pp. 601-604, 2002.
[5] Bowen Zhou and J. Hansen, “Rapid speaker adaptation using multi-stream structural maximum likelihood eigenspace mapping”, ICASSP 2002, vol. 4, pp. 4166-4169, 2002.
[6] J.-T. Chien, “Online unsupervised learning of hidden Markov models for adaptive speech recognition”, IEE Proceedings - Vision, Image and Signal Processing, vol. 148, no. 5, pp. 315-324, October 2001.
[7] Shaojun Wang and Yunxin Zhao, “Online Bayesian tree-structured transformation of HMMs with optimal model selection for speaker adaptation”, IEEE Transactions on Speech and Audio Processing, vol. 9, no. 6, September 2001.