1. Technical Field
The present invention relates generally to speech recognition. More particularly, the present invention relates to a method and system for spotting words in a speech signal, the method and system being able to dynamically compensate for background noise and channel effects.
2. Discussion
Speech recognition is rapidly growing in popularity and has proven to be quite useful in a number of applications. For example, home appliances and electronics, cellular telephones, and other mobile consumer electronics are all areas in which speech recognition has blossomed. With this increase in attention, however, certain limitations in conventional speech recognition techniques have become apparent.
One particular limitation relates to end point detection. End point detection involves the automatic segmentation of a speech signal into speech and non-speech segments. After segmentation, some form of pattern matching is typically conducted in order to provide a recognition result. A particular concern, however, relates to background (or additive) noise and channel (or convolutional) noise. For example, it is well documented that certain applications involve relatively predictable background noise (e.g., car navigation), whereas other applications involve highly unpredictable background noise (e.g., cellular telephones). While the above end point detection approach is often acceptable for low noise or predictable noise environments, noisy or unpredictable backgrounds are difficult to handle for a number of reasons. One reason is that the ability to distinguish between speech and non-speech deteriorates as the signal-to-noise ratio (SNR) diminishes. Furthermore, subsequent pattern matching becomes more difficult due to distortions (i.e., spectral masking effect) introduced by unexpected background noise.
With regard to channel noise, it is known that the channel effect can be different depending upon the signal transmission/conversion devices used. For example, an audio signal is very likely to be altered differently by a personal computer (PC) microphone versus a telephone channel. It is also known that the noise type, noise level, and channel all define an environment. Thus, unpredictable channel noise can cause many of the background noise problems discussed above. Simply put, automatic segmentation in terms of speech and non-speech rapidly becomes unreliable when dealing with unpredictable channels, medium to high noise levels or non-stationary backgrounds. Under those conditions, automatic end point detectors can make mistakes, such as triggering on a portion without speech or adding a noise segment at the beginning and/or end of the speech portion.
Another concern with regard to traditional end point detection is the predictability of the behavior of the end-user (or speaker). For example, it may be desirable to recognize the command “cancel” in the phrase “cancel that”, or recognize the command “yes” in the phrase “uh . . . yes”. Such irrelevant words and hesitations can cause significant difficulties in the recognition process. Furthermore, the alternative of forcing the user to follow a rigid speaking style greatly reduces the naturalness and desirability of a system. The end point detection approach is therefore generally unable to ignore irrelevant words and hesitations uttered by the speaker.
Although a technique commonly known as word spotting has evolved to address the above user predictability concerns, all conventional word spotting techniques still have their shortcomings with regard to compensating for background noise. For example, some systems require one or several background models, and use a competition scheme between the word models and the background models to assist with the triggering decision. This approach is described in U.S. Pat. No. 5,425,129 to Garman et al., incorporated herein by reference. Other systems, such as that described in U.S. Pat. No. 6,029,130 to Ariyoshi, incorporated herein by reference, combine word spotting with end point detection to help locate the interesting portion of the speech signal. Others use non-keyword or garbage models to deal with background noise. Yet another approach includes discriminative training where the scores of other words are used to help increase the detection confidence, as described in U.S. Pat. No. 5,710,864 to Juange et al., incorporated herein by reference.
All of the above word spotting techniques are based on the assumption that the word matching score (representing an absolute likelihood that the word is in the speech signal) is the deciding recognition factor regardless of the background environment. Thus, the word with the best score is considered detected as long as the corresponding score exceeds a given threshold value. Although the above assumption generally holds in the case of high SNR, it fails in the case of low SNR, where the intelligibility of a word can be greatly impacted by the spectral characteristics of the noise. The reduction in intelligibility is due to the noise masking effect, which can either hide or de-emphasize some of the relevant information characterizing a word. The effect varies from one word to another, which makes the score comparison between words quite difficult and unreliable. It is therefore desirable to provide a method and system for spotting words in a speech signal that dynamically compensates for channel noise and background noise on a per-word basis.
The above and other objectives are provided by a method for spotting words in a speech signal in accordance with the present invention. The method includes the step of generating a first recognition score based on the speech signal and a lexicon entry for a first word. The first recognition score tracks an absolute likelihood that the first word is in the speech signal. A first background score is estimated based on the first recognition score. In the preferred embodiment, the first background score is defined by an average value for the first recognition score. The method further provides for calculating a first confidence score based on a matching ratio between a first minimum recognition value and the first background score. The first confidence score therefore tracks a noise-corrected likelihood that the first word is in the speech signal. The above process can be implemented for any number of words (i.e., a second, third and fourth word, etc.). Thus, the present invention acknowledges that the relationship between recognition scores of words is noise-type and noise-level dependent. As such, the present invention provides a level of reliability that is unachievable through conventional approaches.
Further in accordance with the present invention, a method for calculating a word spotting confidence score for a given word is provided. The method provides for dividing a minimum value of a speech recognition score by an average value of the speech recognition score over a predetermined period of time such that a matching ratio results. The average value defines an estimated background score. The method further provides for normalizing the matching ratio, where the normalized matching ratio defines the confidence score.
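By way of illustration only, the confidence score calculation described above may be sketched as follows. The sketch assumes that the recognition score behaves as a non-negative cost (a lower value indicating a better match), which is why the minimum value is taken as the best match. The normalization shown (one minus the matching ratio, clipped to the range 0 to 1) is an assumption made for this example, since the form of the normalization is not specified above. The function name confidence_score and the parameter eps are hypothetical.

```python
from statistics import fmean


def confidence_score(scores, eps=1e-12):
    """Return a confidence score for one word.

    scores: recognition scores for the word, one per frame, collected
            over the predetermined period of time. Assumed to be
            non-negative costs (lower means a better match).
    eps:    hypothetical guard against division by zero.
    """
    if not scores:
        raise ValueError("at least one recognition score is required")

    # The average value defines the estimated background score.
    background = fmean(scores)

    # The minimum value is the best match observed in the window.
    minimum = min(scores)

    # Matching ratio: minimum recognition value over background score.
    ratio = minimum / max(background, eps)

    # ASSUMED normalization: 1 - ratio, clipped to [0, 1], so that a
    # deep dip below the background yields a confidence near 1.
    return max(0.0, min(1.0, 1.0 - ratio))
```

Under these assumptions, a window in which the score never departs from its average yields a confidence of zero, while a pronounced dip relative to the average yields a confidence approaching one, independently of the absolute level of the scores.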
In another aspect of the invention, a word spotting system includes a speech recognizer and a spotting module. The speech recognizer generates recognition scores based on a speech signal and lexicon entries for a plurality of words. The recognition scores track absolute likelihoods that the words are in the speech signal. The spotting module estimates background scores based on the recognition scores. The spotting module further calculates confidence scores on a frame-by-frame basis based on matching ratios between minimum recognition scores and the background scores. The confidence scores therefore track noise-corrected likelihoods that the words are in the speech signal.
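The frame-by-frame operation of such a spotting module may be sketched as shown below. This is a minimal sketch under stated assumptions, not a definitive implementation: the speech recognizer is represented only by the per-frame scores it would supply, the scores are assumed to be non-negative costs (lower means a better match), the background score is taken as the average over a sliding window of hypothetical length window, and the normalization (one minus the matching ratio, clipped to the range 0 to 1) is an assumption. The class name SpottingModule and the method name update are hypothetical.

```python
from collections import deque


class SpottingModule:
    """Tracks per-word recognition scores and emits confidence scores."""

    def __init__(self, words, window=50):
        # One bounded history of recognition scores per lexicon word.
        # The window length is a hypothetical tuning parameter.
        self.history = {word: deque(maxlen=window) for word in words}

    def update(self, frame_scores):
        """Process one frame.

        frame_scores: dict mapping each word to the recognition score
                      produced by the recognizer for the current frame.
        Returns a dict mapping each word to its confidence score.
        """
        confidences = {}
        for word, score in frame_scores.items():
            hist = self.history[word]
            hist.append(score)

            # Background score: average recognition score in the window.
            background = sum(hist) / len(hist)

            # Minimum recognition score in the window (best match).
            minimum = min(hist)

            # Matching ratio, computed separately for each word.
            ratio = minimum / background if background > 0 else 1.0

            # ASSUMED normalization to the range [0, 1].
            confidences[word] = max(0.0, min(1.0, 1.0 - ratio))
        return confidences
```

Because each word is compared against its own background score rather than against a single global threshold, the resulting confidence scores can be compared between words even when noise affects the words' raw recognition scores differently.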
It is to be understood that both the foregoing general description and the following detailed description are merely exemplary of the invention, and are intended to provide an overview or framework for understanding the nature and character of the invention as it is claimed. The accompanying drawings are included to provide a further understanding of the invention, and are incorporated in and constitute part of this specification. The drawings illustrate various features and embodiments of the invention, and together with the description serve to explain the principles and operation of the invention.