1. Field of the Invention
The present invention relates to a speech segment detection method, and to a speech recognition system and method in which the speech segment detection method is utilized. Further, the present invention relates to a computer-readable medium storing program code instructions that cause a processor to carry out the speech segment detection method.
2. Description of the Related Art
Speech recognition by machine has proven an extremely difficult task. One complicating factor is that, unlike written text, no clear spacing exists between spoken words; speakers typically utter full phrases or sentences without pause. Further, acoustic variability in the speech signal typically precludes an unambiguous mapping to a sequence of words or subword units, such as pronunciations of consonants and vowels. One major source of variability in speech is coarticulation, or the tendency for the acoustic characteristics of a given speech sound, or phone, to differ depending upon the phonetic context in which it is produced.
Speech recognizers can be categorized by the speaking styles, vocabularies, and language models that they accommodate. Isolated word recognizers require speakers to insert brief pauses between individual words. Continuous speech recognizers operate on fluent speech, but typically employ strict language models, or grammars, to limit the number of allowable word sequences. Wordspotters operate on fluent speech as input. However, rather than providing full transcription, wordspotters selectively locate relevant words or phrases in an utterance. Wordspotting is useful both in information-retrieval tasks based on keyword indexing and as an alternative to isolated word recognition in voice command applications.
In principle, the wordspotting technique does not require detecting a speech segment in the input speech signal. However, in practical applications, there are some cases in which the detection of speech segments prior to the recognition process is needed to determine word recognition timing or determine a selected range of the input speech signal to be recognized. If wordspotting is applied, in such cases, to the entire range of the input speech without detecting the speech segments, the processing load will be significantly increased, which is detrimental to quickly obtaining the results of recognition. Hence, the detection of speech segments in the input speech signal is very useful for practical applications of speech recognition.
For example, Japanese Laid-Open Patent Application No. 1-244497 discloses one type of speech segment detection method. In this detection method, an average noise power is calculated over some frames of an input signal immediately following the starting time of a speech segment detection process, and a speech segment in the input speech signal is detected by comparison with a threshold level that is varied according to the average noise power.
However, the conventional method in the above publication has a problem in effectively detecting the speech segment when a relatively large noise (e.g., a key-depressing sound) occurs immediately after the speech segment detection process is started. A waveform of the input speech signal in such a condition is shown in FIG. 11. In the case of the waveform shown in FIG. 11, it is difficult for the conventional method to accurately detect a start-point of the speech segment (such as the one indicated by the arrow "A" in FIG. 11) or an end point of the speech segment (such as the one indicated by the arrow "B" in FIG. 11), because the average noise power, calculated with the relatively large noise included, yields an excessively large threshold level (indicated by the dotted line in FIG. 11).
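The problem described above may be illustrated by the following minimal sketch. It is not the implementation of the cited publication: the frame length, the number of noise-estimation frames, the rule "threshold = factor × average noise power", and all function and variable names are assumptions introduced solely for illustration, and the signals are synthetic.

```python
def frame_powers(samples, frame_len):
    """Return the mean squared amplitude of each complete frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_segment_adaptive(samples, frame_len=4, noise_frames=2, factor=4.0):
    """Return (first, last) frame indices judged to be speech, or None.

    Assumption: threshold = factor * average power of the first
    noise_frames frames. The publication's exact rule is not stated here.
    """
    powers = frame_powers(samples, frame_len)
    if noise_frames < 1 or len(powers) < noise_frames:
        raise ValueError("signal is shorter than the noise-estimation window")
    noise_power = sum(powers[:noise_frames]) / noise_frames
    threshold = factor * noise_power
    # Frames after the noise-estimation window whose power exceeds the threshold.
    speech = [i for i in range(noise_frames, len(powers)) if powers[i] > threshold]
    return (speech[0], speech[-1]) if speech else None


# Synthetic frames (hypothetical data): quiet background, speech, key click.
QUIET = [0.1, -0.1, 0.1, -0.1]
SPEECH = [1.0, -1.0, 1.0, -1.0]
CLICK = [5.0, -5.0, 5.0, -5.0]

clean = QUIET * 2 + SPEECH * 3 + QUIET * 2
noisy = CLICK + QUIET + SPEECH * 3 + QUIET * 2

print(detect_segment_adaptive(clean))  # (2, 4): the speech frames are found
print(detect_segment_adaptive(noisy))  # None: the click inflates the threshold
```

In the second call, the key click falls inside the noise-estimation window, so the estimated noise power, and hence the threshold, exceeds the power of every speech frame and no segment is detected.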
Japanese Laid-Open Patent Application No. 9-050288 discloses another speech segment detection method. In this speech segment detection method, a portion of an input speech signal in which the amplitude of the input speech signal exceeds a predetermined threshold level is detected as being a start-point of a speech segment contained in the input speech signal. Another portion of the input speech signal in which the amplitude is less than the threshold level is detected as being an end point of the speech segment. In this manner, the speech segment in the input speech signal is identified based on the start-point and the end point.
However, the conventional method in the above publication likewise fails to eliminate the above-described problem. In the case of the waveform shown in FIG. 11, it is difficult for the conventional method to accurately detect a start-point or an end point of the speech segment when a relatively large noise occurs immediately after the speech segment detection process is started.
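The fixed-threshold scheme and its sensitivity to an early noise burst may likewise be sketched as follows. This is a hedged illustration rather than the method of the cited publication: the use of per-frame peak amplitudes, the frame length, the threshold value, and all names are assumptions, and the signals are synthetic.

```python
def detect_segment_fixed(samples, threshold, frame_len=4):
    """Return (start_frame, end_frame), (start_frame, None), or None.

    Assumption: amplitude is measured as the peak absolute value per frame.
    The start-point is the first frame whose peak exceeds the threshold;
    the end point is the next frame whose peak falls below it.
    """
    peaks = [
        max(abs(x) for x in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    start = None
    for i, peak in enumerate(peaks):
        if start is None:
            if peak > threshold:
                start = i
        elif peak < threshold:
            return start, i
    return (start, None) if start is not None else None


# Synthetic frames (hypothetical data): quiet background, speech, key click.
QUIET = [0.1, -0.1, 0.1, -0.1]
SPEECH = [1.0, -1.0, 1.0, -1.0]
CLICK = [5.0, -5.0, 5.0, -5.0]

clean = QUIET * 2 + SPEECH * 3 + QUIET * 2
noisy = CLICK + QUIET + SPEECH * 3 + QUIET * 2

print(detect_segment_fixed(clean, threshold=0.5))  # (2, 5): speech spans frames 2 to 4
print(detect_segment_fixed(noisy, threshold=0.5))  # (0, 1): the click is taken as the segment
```

In the second call, the click alone exceeds the threshold, so the detected start-point and end point bracket the noise rather than the speech that follows it.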