1. Field of the Invention
The present invention relates to a speech-recognition device.
2. Description of the Related Art
One well-known method of speech recognition is based on speech-frame detection. This scheme determines the start and the end of a speech frame to be recognized by using information such as the power of the speech, and performs a recognition process based on information obtained from the speech frame.
FIG. 1 is a flowchart of a method of recognizing speech based on speech-frame detection. In this method, a recognition process is started (step S1), and a speech frame is detected as a speaker speaks (step S2). Speech information obtained from the speech frame is matched against dictionary patterns (step S3), and the recognition object (a word in the dictionary) that exhibits the highest similarity is output as a recognition result (step S4). At step S2, the beginning of a speech frame can be easily detected based on power information. The end of a speech frame, however, is detected only when silence continues for more than a predetermined time period. This measure is taken to ensure that the silence before a plosive consonant and the silence of a double consonant are not mistaken for the end of the speech.
The period of silence used for detecting the end of a speech frame, however, is generally as long as about 250 msec to 350 msec because of the need to differentiate the silence of a double consonant. In this scheme, therefore, a recognition result is not available until 250 msec to 350 msec after the completion of speech input. This makes the recognition system slow in response. If the period of silence for detecting the end of a speech frame is shortened for the sake of faster response, an erroneous recognition result may be obtained, because the silence of a double consonant is mistaken for the end of the speech and a result is output before the speech is actually complete.
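The trade-off described above can be illustrated with a minimal sketch. It is not taken from any particular device: the 10 msec analysis period, the function name detect_speech_frame, the power threshold, and the synthetic power sequence are all assumptions made for illustration. The sketch declares the end of a speech frame only after the power has stayed below the threshold for end_silence_ms.

```python
from typing import List, Optional, Tuple

FRAME_MS = 10  # assumed analysis period per power sample (hypothetical value)


def detect_speech_frame(
    power: List[float],
    power_threshold: float,
    end_silence_ms: int,
) -> Optional[Tuple[int, int, int]]:
    """Return (start, end, decided_at) as sample indices, or None.

    start      : first sample whose power exceeds the threshold
    end        : last sample above the threshold before the closing silence
    decided_at : sample at which the end could actually be declared
    """
    needed = end_silence_ms // FRAME_MS  # silent samples required to declare the end
    start = None
    last_voiced = None
    silent_run = 0
    for i, p in enumerate(power):
        if p > power_threshold:
            if start is None:
                start = i          # beginning detected from power alone
            last_voiced = i
            silent_run = 0         # any voiced sample resets the silence count
        elif start is not None:
            silent_run += 1
            if silent_run >= needed:
                return start, last_voiced, i
    return None


# Synthetic utterance: 200 msec of speech, a 150 msec silence standing in for
# a double consonant, 200 msec of speech, then trailing silence.
power = [0.0] * 5 + [1.0] * 20 + [0.0] * 15 + [1.0] * 20 + [0.0] * 40

long_wait = detect_speech_frame(power, 0.5, end_silence_ms=300)
short_wait = detect_speech_frame(power, 0.5, end_silence_ms=100)
print(long_wait)   # (5, 59, 89): whole word found, decided 300 msec after it ends
print(short_wait)  # (5, 24, 34): cut off at the double-consonant silence
```

With the 300 msec setting the whole utterance is captured, but the decision arrives 30 samples (300 msec) after the last voiced sample. With the 100 msec setting the decision arrives sooner, but only the part of the word before the internal silence is passed on to matching.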
A speaker often utters redundant sounds irrelevant to recognition, such as "ah" and "oh". Since matching against the dictionary starts at the beginning of a speech frame when the speech frame is subjected to a recognition process, such redundant sounds lower the similarity measures and lead to an erroneous recognition result.
A word spotting scheme is designed to counter the drawbacks described above. FIG. 2 is a flowchart of a process of the word spotting scheme. In this scheme, a recognition process is started (step S11), and speech information is matched against a dictionary as a speaker speaks, without detecting a speech frame (step S12). Then, a check is made as to whether a detected similarity measure exceeds a predetermined threshold value (step S13). If it does not, the procedure goes back to step S12 to continue matching the speech information against the dictionary. If the similarity measure exceeds the threshold at step S13, the recognition object corresponding to this similarity measure is output as a recognition result (step S14). The word spotting scheme does not require detection of a speech frame, so that it facilitates implementation of a system having a faster response. Also, because matching is not tied to the beginning of a speech frame, redundant sounds are effectively excluded from the speech before recognition results are output, thereby providing a better recognition result.
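The loop of steps S12 to S14 can be sketched as follows. This is only an illustrative toy under stated assumptions: features are single numbers, each dictionary entry is a fixed-length template, and the similarity function (the reciprocal of one plus the mean absolute difference) is a stand-in chosen for brevity, not the matching method of any actual word spotting system. The names spot_word and similarity are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def similarity(window: List[float], template: List[float]) -> float:
    """Toy similarity in (0, 1]: 1 / (1 + mean absolute difference)."""
    diff = sum(abs(a - b) for a, b in zip(window, template)) / len(template)
    return 1.0 / (1.0 + diff)


def spot_word(
    stream: Iterable[float],
    dictionary: Dict[str, List[float]],
    threshold: float,
) -> Optional[Tuple[str, int, float]]:
    """Match continuously, with no speech-frame detection.

    Returns (word, time_index, score) as soon as a score exceeds the
    threshold, or None if the stream ends first.
    """
    history: List[float] = []
    for t, feature in enumerate(stream):
        history.append(feature)
        best_word, best_score = None, 0.0
        # Step S12: match the most recent input against every dictionary entry.
        for word, template in dictionary.items():
            if len(history) < len(template):
                continue
            score = similarity(history[-len(template):], template)
            if score > best_score:
                best_word, best_score = word, score
        # Step S13: compare the best similarity with the threshold.
        if best_word is not None and best_score > threshold:
            return best_word, t, best_score  # Step S14: output the result.
    return None


dictionary = {"yes": [1.0, 3.0, 2.0], "no": [4.0, 4.0, 1.0]}
# The first two values stand in for a redundant sound such as "ah".
stream = [9.0, 9.0, 1.0, 3.0, 2.0, 0.0]

result = spot_word(stream, dictionary, threshold=0.9)
print(result)  # ('yes', 4, 1.0)
```

The result is output at the sample where the word ends, without waiting for a trailing silence, and the leading redundant values do not prevent the match because the comparison window slides past them.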
The word spotting scheme has its own drawback, as described in the following. In the word spotting scheme, no speech frame is detected, and matching against a dictionary is conducted continuously. If a result of the matching exceeds a threshold, a recognition result is output. Otherwise, the matching process is continued. Since the matching process is kept running regardless of the speaker's action, a recognition result may be output even when the speaker is not voicing a word to be recognized. This phenomenon is called fountaining. For example, fountaining is observed when the speaker is not talking to the recognition device but is talking with someone nearby.
A method of implementing the word spotting scheme can be found, for example, in "Method of Recognizing Word Speech Using a State Transition Model of Continuous Time Control Type", Journal of the Institute of Electronics, Information and Communication Engineers, vol. J72-D-II, No.11, pp.1769-1777 (1989). According to the method disclosed in this document, data indicative of a time length is attached to each phoneme in a dictionary or codebook. As a result, improved recognition performance is obtained while the amount of computation is reduced. In this method, however, a dictionary of words to be recognized is compiled by connecting phonemes using an average time length of each phoneme. Because of this, a long word in the dictionary may not correspond to the actually spoken word in terms of the time length of the word. This is because speakers have a psychological tendency to speak a shorter word and a longer word in about the same length of time. Further, when the speaker is excited, the speech may become faster and the voice may be raised. If the speech-recognition device uses the time length as a parameter, the time length of a given speaker's speech may then differ considerably from the time length stored in a standard dictionary. In such situations, the speech-recognition device may experience a degradation in matched similarity measures and may suffer a drop in recognition performance.
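The time-length discrepancy can be made concrete with a small arithmetic sketch. All values here are invented for illustration: the average phoneme durations, the two example phoneme sequences, and the 330 msec spoken duration are assumptions, not figures from the cited document. The sketch builds a word's dictionary time length by summing average phoneme time lengths and compares it with a spoken time length.

```python
from typing import Dict, List

# Hypothetical average time length of each phoneme, in msec.
AVG_PHONEME_MS: Dict[str, int] = {
    "a": 90, "k": 60, "s": 80, "i": 70, "t": 50, "o": 85,
}


def dictionary_duration_ms(phonemes: List[str]) -> int:
    """Word time length obtained by connecting average phoneme time lengths."""
    return sum(AVG_PHONEME_MS[p] for p in phonemes)


def duration_ratio(phonemes: List[str], spoken_ms: float) -> float:
    """Spoken time length divided by the dictionary time length.

    A ratio far from 1.0 means the stored time length fits the utterance poorly.
    """
    return spoken_ms / dictionary_duration_ms(phonemes)


short_word = ["a", "k", "i"]                       # 3 phonemes
long_word = ["a", "k", "i", "t", "a", "s", "i"]    # 7 phonemes

print(dictionary_duration_ms(short_word))  # 220
print(dictionary_duration_ms(long_word))   # 510

# Suppose the speaker utters both words in roughly the same time, 330 msec.
print(duration_ratio(short_word, 330))            # 1.5
print(round(duration_ratio(long_word, 330), 2))   # 0.65
```

Because the dictionary time length grows in proportion to the number of phonemes while the spoken time length does not, the ratio drifts away from 1.0 in opposite directions for the short and the long word.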
In this manner, the related-art speech-recognition device compiles the words of a dictionary by connecting phonemes using an average time length of each phoneme. Because of this, there may be a discrepancy in time length between a word in the dictionary and the actually spoken word, resulting in a degradation in recognition performance.
Accordingly, there is a need for a speech-recognition device which can enhance recognition performance by updating time-length parameters in a standard dictionary in accordance with the time length of an actually spoken word.