1. Field of the Invention
The present invention relates generally to the field of speech recognition and, more particularly, speech recognition in noisy environments.
2. Related Art
Automatic speech recognition (“ASR”) refers to the ability to convert speech signals into words, or put another way, the ability of a machine to recognize human voice. ASR systems are generally categorized into three types: speaker-independent ASR, speaker-dependent ASR and speaker-verification ASR. Speaker-independent ASR can recognize a group of words from any speaker and allows any speaker to use the available vocabularies after having been trained for a standard vocabulary. Speaker-dependent ASR, on the other hand, can identify a vocabulary of words from a specific speaker after having been trained for an individual user. Training usually requires the individual to say words or phrases one or more times to train the system. A typical application is voice dialing, where a caller says a phrase such as “call home” or a name from the caller's directory and the phone number is dialed automatically. Speaker-verification ASR can identify a speaker's identity by matching the speaker's voice to a previously stored pattern. Typically, speaker-verification ASR allows the speaker to choose any word/phrase in any language as the speaker's verification word/phrase, i.e., a spoken password. The speaker may select a verification word/phrase at the beginning of an enrollment procedure during which the speaker-verification ASR is trained and speaker parameters are generated. Once the speaker's identity is stored, the speaker-verification ASR is able to verify whether a claimant is who he/she claims to be. Based on such verification, the speaker-verification ASR may grant or deny the claimant's access or request.
Detecting when actual speech activity contained in an input speech signal begins and ends is a basic problem for all ASR systems, and it is well-recognized that proper detection is crucial for good speech recognition accuracy. This detection process is referred to as endpointing. FIG. 1 shows a block diagram of a conventional energy-based endpointing system integrated widely in current speech recognition systems. Endpoint detection system 100 illustrated in FIG. 1 comprises endpointer 102, feature extraction module 104 and recognition system 106.
Continuing with FIG. 1, endpoint detection system 100 utilizes a conventional energy-based algorithm to determine whether an input speech signal, such as speech signal 101, contains actual speech activity. Endpoint detection system 100, which receives speech signal 101 on a frame-by-frame basis, determines the beginning and/or end of speech activity by processing each frame of speech signal 101 and measuring the energy of each frame. By comparing the measured energy of each frame against a preset threshold energy value, endpoint detection system 100 determines whether an input frame has sufficient energy to be classified as speech. The preset threshold energy value can be based on, for instance, an experimentally determined difference in energy between background/silence and actual speech activity. If the energy value of the input frame is below the threshold energy value, endpointer 102 classifies the contents of the frame as background/silence or “non-speech.” On the other hand, if the energy value of the input frame is equal to, or greater than, the threshold energy value, endpointer 102 classifies the contents of the frame as actual speech activity. Endpointer 102 would then signal feature extraction module 104 to extract speech characteristics from the frame. A common means for extracting speech characteristics is to determine a feature set such as a cepstral feature set, as is known in the art. The cepstral feature set can then be sent to recognition system 106, which processes the information it receives from feature extraction module 104 in order to “recognize” the speech contained in the input frame.
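By way of illustration only, the frame-by-frame energy comparison described above may be sketched as follows. This is a minimal sketch and not the implementation of endpointer 102: the function names, the use of mean-square amplitude as the energy measure, the non-overlapping framing and the numeric values are all assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple


def frame_energy(frame: Sequence[float]) -> float:
    """Mean-square energy of one frame (an assumed energy measure)."""
    if not frame:
        return 0.0
    return sum(sample * sample for sample in frame) / len(frame)


def classify_frames(signal: Sequence[float], frame_len: int,
                    threshold: float) -> List[bool]:
    """Label each non-overlapping frame: True if energy >= threshold (speech),
    False otherwise (background/silence). A trailing partial frame is dropped."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    labels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        labels.append(frame_energy(frame) >= threshold)
    return labels


def find_endpoints(labels: Sequence[bool]) -> Optional[Tuple[int, int]]:
    """Return (first, last) indices of frames labeled speech, or None."""
    speech = [i for i, is_speech in enumerate(labels) if is_speech]
    if not speech:
        return None
    return speech[0], speech[-1]


# Hypothetical signal: quiet, loud, loud, quiet (four samples per frame).
signal = [0.01] * 4 + [0.5] * 8 + [0.01] * 4
labels = classify_frames(signal, frame_len=4, threshold=0.1)
endpoints = find_endpoints(labels)
```

In this sketch, only the frames labeled as speech would be passed on for feature extraction; every other frame is discarded on the basis of energy alone.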
Referring now to FIG. 2, graph 200 illustrates the endpointing outcome from a conventional endpoint detection system such as endpoint detection system 100 in FIG. 1. In graph 200, the energy of the input speech signal (axis 202) is plotted against the cepstral distance (axis 204). Esilence point 206 on axis 202 represents the energy value of background/silence. As an example, Esilence can be determined experimentally by measuring the energy value of background/silence or non-speech in different conditions, such as in a moving vehicle or in a typical office, and averaging the values. Esilence+K point 208 represents the preset threshold energy value utilized by the endpointer, such as endpointer 102 in FIG. 1, to classify whether an input speech signal contains actual speech activity. The value K therefore represents the difference in the level of energy between background/silence, i.e., Esilence, and the energy value of what the endpointer is programmed to classify as speech.
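As a hedged illustration of how the threshold Esilence+K might be derived, the following sketch averages background energy measurements taken in different conditions and adds an offset K. The function names, the condition labels, the choice to average within each condition before averaging across conditions, and all numeric values are assumptions made for the example and are not taken from the conventional system.

```python
from statistics import mean
from typing import Dict, List


def estimate_e_silence(measurements: Dict[str, List[float]]) -> float:
    """Average the background energy within each condition, then average
    the per-condition values (one assumed way of 'averaging the values')."""
    per_condition = [mean(values) for values in measurements.values()]
    return mean(per_condition)


def speech_threshold(e_silence: float, k: float) -> float:
    """Preset threshold energy value: Esilence + K."""
    return e_silence + k


# Hypothetical background energy measurements per condition.
background = {
    "moving_vehicle": [0.04, 0.06],
    "office": [0.01, 0.03],
}
e_silence = estimate_e_silence(background)
threshold = speech_threshold(e_silence, k=0.1)
```

Because the resulting threshold is fixed in advance, it does not adapt if the actual background level differs from the conditions that were measured.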
It is seen in graph 200 of FIG. 2 that an energy-based algorithm produces an “all-or-nothing” outcome: if the energy of an input frame is below the threshold level, i.e. Esilence+K, the frame is grouped as part of silence region 210. Conversely, if the energy value of an input frame is equal to or greater than Esilence+K, it is classified as speech and grouped in speech region 212. Graph 200 shows that the classification of speech utilizing only an energy-based algorithm disregards the spectral characteristics of the speech signal. As a result, a frame which exhibits spectral characteristics similar to actual speech activity may be falsely rejected as non-speech if its energy value is too low. At the same time, a frame which has spectral characteristics very different from actual speech activity may be mistakenly classified as speech simply because it has high energy. It is recalled that with a conventional endpoint detection system such as endpoint detection system 100 in FIG. 1, only frames classified by the endpointer as speech are subsequently exposed to the recognition system for further processing. Thus, when actual speech activity is mistakenly classified by the endpointer as silence or non-speech, or when non-speech activity is erroneously grouped with speech, speech recognition accuracy is significantly diminished.
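The two kinds of misclassification described above can be made concrete with a small sketch. The frames, their energies and the threshold are hypothetical; the sketch only counts how often an energy-only decision disagrees with a known label, and it does not model spectral characteristics.

```python
from typing import List, Tuple


def count_errors(frames: List[Tuple[bool, float]],
                 threshold: float) -> Tuple[int, int]:
    """frames holds (is_actually_speech, energy) pairs.
    Returns (false_rejections, false_acceptances) for an energy-only rule."""
    false_rejections = 0
    false_acceptances = 0
    for is_speech, energy in frames:
        classified_as_speech = energy >= threshold
        if is_speech and not classified_as_speech:
            false_rejections += 1      # e.g. a low-energy fricative
        elif classified_as_speech and not is_speech:
            false_acceptances += 1     # e.g. a high-energy click or pop
    return false_rejections, false_acceptances


# Hypothetical frames: (ground truth, measured energy).
frames = [
    (True, 0.02),   # low-energy speech sound, below threshold
    (True, 0.40),   # voiced speech, above threshold
    (False, 0.90),  # loud click, above threshold
    (False, 0.01),  # quiet background, below threshold
]
errors = count_errors(frames, threshold=0.1)
```

With these assumed values, one speech frame is rejected and one non-speech frame is accepted, even though half of the frames are classified correctly.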
Another disadvantage of the conventional energy-based endpoint detection algorithm, such as the one utilized by endpoint detection system 100, is that it has little or no immunity to background noise. In the presence of background noise, the conventional endpointer often fails to determine the accurate endpoints of a speech utterance by either (1) missing the leading or trailing low-energy sounds such as fricatives, (2) classifying clicks, pops and background noises as part of speech, or (3) falsely classifying background/silence noise as speech while missing the actual speech. Such errors lead to high false rejection rates, and reflect negatively on the overall performance of the ASR system.
Thus, there is a strong need in the art for a new and improved endpoint detection system that is capable of handling background noise. It is also desired that the computational requirements of the endpoint detection system be kept to a minimum. It is further desired that the endpoint detection system be able to detect the beginning and end of speech in real time.