According to one approach to speech recognition, a speech portion must be detected in the input voice data and separated from it. The speech portion generally includes words that are uttered by a human. In one example based upon endpoint detection, the speech portion is processed so as to extract predetermined characteristics based upon parametric spectral analyses such as a linear predictive coding (LPC) melcepstrum. The selected speech portion, or series of frames, is compared to a predetermined set of standard patterns or templates in order to determine a distance or similarity between them. Speech is thus recognized based upon the similarity.
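The template comparison described above can be sketched as follows. This is a minimal illustration, not the method of any cited reference: it assumes that the input and every template have already been reduced to the same number of feature frames, and the names `sequence_distance` and `recognize` as well as the toy two-dimensional feature values are hypothetical.

```python
import math

def sequence_distance(a, b):
    """Mean Euclidean distance between two equal-length sequences of feature vectors."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same number of frames")
    return sum(math.dist(x, y) for x, y in zip(a, b)) / len(a)

def recognize(features, templates):
    """Return the label of the template closest to the input feature sequence."""
    return min(templates, key=lambda word: sequence_distance(features, templates[word]))

# Toy templates (assumed values); real systems would store cepstral vectors.
templates = {
    "yes": [[1.0, 0.0], [2.0, 1.0]],
    "no": [[0.0, 1.0], [0.0, 2.0]],
}
result = recognize([[1.1, 0.1], [1.9, 0.9]], templates)
```

The smaller the distance, the greater the similarity, so the label with the minimum distance is taken as the recognition result.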
The above-described process critically depends upon the accurate detection and separation of the speech portion or words. However, the input voice data often includes noise, such as overlapping background noise, in addition to the human speech. Human speech itself also varies, even when the same words are uttered, because of undesirable noises such as mouth clicks, dialects, and individual differences. For these and other reasons, it has been difficult to correctly isolate speech elements in order to recognize human speech.
One prior art approach includes endpoint detection as disclosed in "Fundamentals of Speech Recognition," L. Rabiner and B. H. Juang (1993). In general, in order to determine endpoints, an input speech signal is first processed and feature measurements are made. Then, the speech-detection method is applied to locate and define the speech events. Lastly, the isolated speech elements are compared against the speech templates or standard speech patterns. In other words, a start and an end of each speech element are determined prior to the pattern matching step. Although this approach is functional when the input speech lacks background noise or contains relatively minor non-speech elements, the accuracy of speech recognition based upon the above-described explicit endpoint detection deteriorates with a high level of background noise. Background noise causes the start or the end of a speech event to be defined erroneously.
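The sensitivity of explicit endpoint detection to noise can be illustrated with a simple energy-threshold detector. This is a sketch under stated assumptions, not the procedure of the cited reference: the frame length, the -30 dB threshold, the synthetic sample values, and the function names `frame_log_energy` and `detect_endpoints` are all hypothetical.

```python
import math

def frame_log_energy(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's power in dB."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-10))  # small offset avoids log(0)
    return energies

def detect_endpoints(energies, threshold_db):
    """Return (first, last) indices of frames above the threshold, or None if none."""
    active = [i for i, e in enumerate(energies) if e > threshold_db]
    if not active:
        return None
    return active[0], active[-1]

# Synthetic input (assumed values): quiet background, a loud "word", quiet again.
clean = [0.001] * 40 + [0.5] * 30 + [0.001] * 30
# Same input with a loud noise burst in the second frame.
noisy = list(clean)
noisy[10:20] = [0.5] * 10

clean_endpoints = detect_endpoints(frame_log_energy(clean, 10), -30.0)
noisy_endpoints = detect_endpoints(frame_log_energy(noisy, 10), -30.0)
```

With the clean input the detected speech event spans frames 4 through 6. With the noise burst added, the detected start moves to frame 1, so the segment passed to pattern matching begins with non-speech frames.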
In order to address the above-described problem, another prior art approach includes a word spotting technique as disclosed in "A Robust Speech Recognition System Using Word-Spotting With Noise Immunity Learning," Takebayashi, et al., pgs. 905-908, IEEE, ICASSP (1991). In general, word spotting does not rely upon a particular pair of speech event boundaries. In other words, in a pure word spotting approach, all possible beginnings and endings are implicitly selected and are considered for the pattern-matching and recognition-decision process. For example, a continuous dynamic programming matching technique (DP matching) continuously adjusts input data in the time domain to enhance matching results, as described in "Digital Voice Processing," Furui (1995). In the word spotting approach, although the common background noise problem is substantially reduced, certain background sounds may be confused with certain speech sounds, such as a nasal sound, when a characteristic value such as melcepstrum is used for recognition. Furthermore, since a large number of, or all, possible endpoint candidates are examined, the amount of calculation is burdensome and degrades performance.
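The idea of matching without fixed endpoints can be sketched with a subsequence form of DP matching. This is a simplified stand-in for the continuous DP matching mentioned above, not the algorithm of the cited references: it uses one-dimensional features, an absolute-difference local distance, and a symmetric step pattern, and the name `spot_word` and the sample values are hypothetical.

```python
def spot_word(template, stream):
    """Find where in the stream the template matches best, with free start and end.

    Returns (normalized_cost, end_index), where end_index is the stream frame
    at which the best-matching occurrence of the template ends.
    """
    n, m = len(template), len(stream)
    inf = float("inf")
    # Row 0 is all zeros: a match may begin at any stream frame at no cost.
    prev = [0.0] * (m + 1)
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            d = abs(template[i - 1] - stream[j - 1])  # local frame distance
            # Allow the path to stretch or compress along the time axis.
            cur[j] = d + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    # The last row holds the cost of a full-template match ending at each frame.
    best_end = min(range(1, m + 1), key=lambda j: prev[j])
    return prev[best_end] / n, best_end - 1

# Toy example (assumed values): the second template frame is held for two frames.
template = [1.0, 3.0, 2.0]
stream = [0.0, 0.1, 1.0, 3.0, 3.0, 2.0, 0.0]
cost, end = spot_word(template, stream)
```

Every stream frame is treated as a candidate ending, which is why no explicit endpoint decision is needed. The cost is that the table has one entry per template frame per input frame for every template, which reflects the computational burden noted above.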
In addition to the above-described spectral analyses, the energy level of the input voice data is combined with the spectral information to improve the accuracy. The energy level appears as power or gain in the speech spectral representation. The energy information has been incorporated into every spectral value or every frame, as discussed in "Fundamentals of Speech Recognition," L. Rabiner and B. H. Juang (1993).
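One way to picture the incorporation of energy into every frame is to append a log-energy term to each frame's spectral feature vector. The sketch below assumes this particular form, which is only one of several possible ways of combining the two; the name `add_energy_term` and the sample and cepstral values are hypothetical.

```python
import math

def add_energy_term(frames, cepstra):
    """Append each frame's log energy to that frame's spectral feature vector."""
    augmented = []
    for frame, cep in zip(frames, cepstra):
        energy = sum(x * x for x in frame)
        augmented.append(list(cep) + [math.log(energy + 1e-10)])  # offset avoids log(0)
    return augmented

# Toy data (assumed values): two frames of samples and their cepstral vectors.
frames = [[0.5, -0.5], [1.0, 1.0]]
cepstra = [[0.1, 0.2], [0.3, 0.4]]
loud = add_energy_term(frames, cepstra)

# The same utterance at half the amplitude, e.g. a softer speaker.
half = [[0.5 * x for x in frame] for frame in frames]
soft = add_energy_term(half, cepstra)
energy_shift = [a[-1] - b[-1] for a, b in zip(loud, soft)]
```

In this sketch, halving the amplitude leaves the cepstral values untouched but shifts the energy term of every frame by the same amount, ln 4, so a distance summed over frames picks up that offset once per frame.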
Despite the above-described use of the energy information, the accuracy of speech recognition leaves much to be desired. The energy level is generally not an accurate indicator, since the energy level as a characteristic value varies among individuals and over time. In fact, the incorporation of the energy information into every frame tends to cause a large degree of error by accumulating inaccurate energy information. A problem in word spotting occurs when the energy level of the speech input is relatively low and the spectral information of the background noise resembles speech.