This invention relates to automatic speech recognition and more particularly to automatic speech recognition methods and systems for interacting with telephone callers across a network.
In telephony, especially mobile telephony, speech signals are often degraded by the presence of acoustic background noise as well as by system-introduced interference. Such degradations have an adverse effect on both the perceived quality and the intelligibility of speech, as well as on the performance of speech processing applications in the network. To improve the perceived speech quality, noise reduction algorithms are implemented in cellular handsets, often in conjunction with network echo cancellers, as described in an article by E. J. Diethorn, “A subband noise-reduction method for enhancing speech in telephony and teleconferencing,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1997. The most common methods for noise reduction assume that acoustic noise and speech are picked up by one microphone. These methods are mostly based on spectral magnitude subtraction, where the short-term spectral amplitude of noise is estimated during speech pauses and subtracted from the noisy microphone signal, as shown in the article by J. S. Lim and A. V. Oppenheim, “Enhancement and bandwidth compression of noisy speech,” Proc. IEEE, Vol. 67, pp. 1586-1604, 1979. The spectral magnitude subtraction method inherently implies the use of a voice activity detector (VAD) to determine at every frame whether there is speech present in that frame, such as found in U.S. Pat. No. 5,956,675, which is hereby incorporated by reference, and a related article by A. R. Setlur and R. A. Sukkar, entitled “Recognition-based word counting for reliable barge-in and early endpoint detection in continuous speech recognition,” Proc. ICSLP, pp. 823-826, 1998. The performance of these methods depends a great deal on the efficacy of the VAD.
Even though about 12 to 18 dB of noise reduction can be achieved in real-world settings, spectral subtraction can produce musical tones and other artifacts which further degrade speech recognition performance as indicated in an article by C. Mokbel and G. Chollet, “Automatic word recognition in cars,” IEEE Transactions on Speech and Audio Processing, Vol. 3, No. 5, pp. 346-356, 1995.
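The spectral magnitude subtraction described above can be illustrated with a minimal single-frame sketch. It is not taken from the cited articles: the function names, the naive DFT, and the `floor` parameter are assumptions made for illustration, and the frames passed to `estimate_noise_magnitude` are assumed to have already been labeled as speech pauses by a VAD.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each time sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def estimate_noise_magnitude(noise_frames):
    """Average the magnitude spectrum over frames assumed to be speech pauses."""
    spectra = [[abs(c) for c in dft(frame)] for frame in noise_frames]
    bins = len(spectra[0])
    return [sum(s[k] for s in spectra) / len(spectra) for k in range(bins)]


def spectral_subtract(frame, noise_mag, floor=0.0):
    """Subtract the noise magnitude in each bin and keep the noisy phase.

    Magnitudes that would go negative are clamped to `floor` (a hypothetical
    parameter); clamping of this kind is one commonly cited source of
    musical-tone artifacts.
    """
    cleaned = []
    for coeff, noise in zip(dft(frame), noise_mag):
        magnitude = max(abs(coeff) - noise, floor)
        cleaned.append(cmath.rect(magnitude, cmath.phase(coeff)))
    return idft(cleaned)
```

In a practical system the frames would be windowed and overlapped and an FFT would replace the naive transform; the sketch only shows the estimate-during-pauses and subtract steps.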
When the interference causing the degradation is an echo of the system announcement, then echo cancellers are able to reduce this type of echo by up to about 25 dB and such echo cancellers generate very few artifacts. However, if the echo is loud and the incoming speech is quiet, the residual echo energy following cancellation still begins to approach the level of the incoming speech. Such echoes sometimes cause false triggering of automatic speech recognition, especially in systems that allow users to interrupt prompts with spoken input. It is desirable to reduce or remove any such false triggers and the speech recognition errors they cause. Even when these echoes do not cause automatic speech recognition errors, such echoes do interfere with the recognition of valid input, and it is desirable to reduce any such interference.
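The level arithmetic behind this observation can be sketched as follows. The function names and the example levels are hypothetical; only the figure of about 25 dB of cancellation comes from the text above.

```python
def residual_echo_db(echo_db, cancellation_db=25.0):
    """Echo level remaining after the canceller removes cancellation_db of it."""
    return echo_db - cancellation_db


def speech_to_residual_echo_db(speech_db, echo_db, cancellation_db=25.0):
    """Margin, in dB, by which the caller's speech exceeds the residual echo."""
    return speech_db - residual_echo_db(echo_db, cancellation_db)
```

For example, under the assumed levels of a loud echo at -10 dB and quiet speech at -35 dB, `speech_to_residual_echo_db(-35.0, -10.0)` gives a margin of 0 dB, meaning the residual echo is as strong as the speech the recognizer is trying to detect.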
Briefly stated, in accordance with one aspect of the invention, the aforementioned shortcomings are addressed and an advance in the art is achieved by providing a method for preventing a false triggering error from an echo of an audible prompt in an interactive automatic speech recognition system which uses a plurality of hidden Markov models of the system's vocabulary, with each of the hidden Markov models corresponding to a phrase that is at least one word long. The method includes the steps of receiving an input which has signals that correspond to a caller's speech and an echo of the audible prompt of the interactive automatic speech response system; and using a hidden Markov model of the audible prompt's echo along with the plurality of hidden Markov models of the system's vocabulary in the automatic speech recognition system to match the input when an energy of the echo of the audible prompt is at most the same order of magnitude as the energy of the signals that correspond to the caller's speech, instead of falsely triggering a match to one of the plurality of hidden Markov models of the vocabulary.
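One way to picture the method is as a competition between the echo model and the vocabulary models for the same input. The following is a minimal sketch under stated assumptions, not the recognizer of the invention: it uses one-dimensional features, left-to-right hidden Markov models with a single Gaussian per state, a fixed self-transition probability, and hypothetical names (`viterbi_score`, `recognize`, the example model parameters).

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_score(obs, states, stay_prob=0.6):
    """Best-path log-likelihood of obs under a left-to-right HMM.

    `states` is a list of (mean, variance) pairs. The path must start in the
    first state and end in the last one.
    """
    log_stay = math.log(stay_prob)
    log_next = math.log(1.0 - stay_prob)
    neg_inf = float("-inf")
    scores = [neg_inf] * len(states)
    scores[0] = log_gauss(obs[0], *states[0])
    for x in obs[1:]:
        updated = []
        for j, (mean, var) in enumerate(states):
            stay = scores[j] + log_stay
            move = scores[j - 1] + log_next if j > 0 else neg_inf
            updated.append(max(stay, move) + log_gauss(x, mean, var))
        scores = updated
    return scores[-1]


def recognize(obs, vocabulary_models, echo_model):
    """Return the best vocabulary phrase, or None if the echo model wins."""
    best_word, best_score = None, viterbi_score(obs, echo_model)
    for word, states in vocabulary_models.items():
        score = viterbi_score(obs, states)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Hypothetical models over a one-dimensional, energy-like feature.
VOCABULARY = {
    "yes": [(5.0, 1.0), (8.0, 1.0), (4.0, 1.0)],
    "no": [(6.0, 1.0), (3.0, 1.0)],
}
PROMPT_ECHO = [(1.0, 1.0)]
```

With these assumed models, an input resembling residual echo, such as `[1.1, 0.9, 1.2, 1.0]`, is matched by the echo model and `recognize` returns `None` rather than triggering on a vocabulary phrase, while an input such as `[5.1, 7.9, 8.2, 4.1]` is still matched to a vocabulary model.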
In accordance with another aspect of the invention, the aforementioned shortcomings are addressed and an advance in the art is achieved by a speech recognition system for connection to a telephone network and telephone equipment of a caller that introduce an echo of a prompt played by the speech recognition system, the speech recognition system reducing false triggering by the echo. The speech recognition system includes a network interface unit connecting the speech recognition system to a telephone line of the telephone network. When the network interface unit receives a call from a caller via said telephone network, a play-prompt unit plays a prompt via the network interface unit to the caller to prompt a response from the caller. At the same time, also in response to the call and in response to the playing of the prompt, a network echo canceller partially cancels the echo of the prompt that is present in the call received by the network interface unit. The echo canceller is connected to an automatic speech recognizer and sends the input from the caller along with the partially cancelled echo of the prompt to the automatic speech recognizer. The automatic speech recognizer has a prompt echo model that prevents it from falsely triggering on the partially cancelled echo, so that the automatic speech recognizer correctly recognizes the caller's response.