1. Field of the Invention
The present invention relates to a voice recognition system and a voice processing system for recognizing a speech vocabulary sequence from a user's voice and accepting it as an input to the system. In particular, the present invention relates to a voice recognition system and a voice processing system in which a self-repair utterance (also referred to as a speech-repair utterance) immediately after a misstatement can be inputted and recognized appropriately. It should be noted that the voice recognition system is also referred to as an auto speech recognition system or a speech recognition system in some cases.
2. Description of Related Art
Conventionally, as a means of accepting an input from a user, a voice recognition system that utilizes a voice recognition engine so as to recognize a speech vocabulary sequence from a user's voice and accept it as an input to the system has been known. Such a voice recognition system has already been commercialized in, for example, information delivery systems including a voice portal and a car navigation system.
FIG. 10 illustrates an exemplary configuration of a conventional voice recognition system. A voice recognition system 90 shown in FIG. 10 includes a signal processing unit 91, a voice section detecting unit 92, a decoder 93, a grammar storing unit 94, an acoustic model storing unit 95, a vocabulary dictionary storing unit 96 and a result output unit 97.
When a voice uttered by a user is inputted, the voice section detecting unit 92 detects a voice section in the inputted voice. Specifically, the voice section detecting unit 92 estimates a background noise level from the power of the voice in each fixed time period (i.e., each frame), which is called frame power information. If the frame power of a frame is larger than the estimated background noise level and the difference between them is larger than a preset threshold value, the voice section detecting unit 92 determines that this frame belongs to a voice section. After detecting a voice section, if the next voice section is not found within a predetermined time period T (a speech ending detection time period), the voice section detecting unit 92 determines that the input voice has ended. If the next voice section begins within the time period T, the detected voice section is extended to include it. In this manner, the voice section detecting unit 92 determines one voice section.
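The frame-power rule described above can be illustrated with a minimal sketch. It is not the implementation of the voice section detecting unit 92. The function name `detect_voice_sections`, its parameters, and the way the background noise level is estimated (here, the mean power of the first few frames) are assumptions made only for illustration, and the time period T is expressed as a number of frames, `end_frames`.

```python
from typing import List, Tuple


def detect_voice_sections(
    frame_power: List[float],
    threshold: float,
    end_frames: int,
    noise_init_frames: int = 3,
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, of detected voice sections."""
    # Assumption: the background noise level is the mean power of the
    # first few frames. The source does not say how it is estimated.
    head = frame_power[:noise_init_frames]
    noise = sum(head) / max(1, len(head))

    sections: List[Tuple[int, int]] = []
    start = None        # first voiced frame of the section currently open
    last_voiced = None  # most recent voiced frame of that section

    for i, power in enumerate(frame_power):
        # A frame is voiced if its power exceeds the noise level and the
        # difference is larger than the preset threshold.
        voiced = power > noise and (power - noise) > threshold
        if voiced:
            if start is None:
                start = i
            # A voiced frame arriving within T extends the open section.
            last_voiced = i
        elif start is not None and i - last_voiced >= end_frames:
            # No voiced frame for T (end_frames) frames: the input has ended.
            sections.append((start, last_voiced + 1))
            start = None

    if start is not None:
        # The input stopped before T elapsed; close the open section.
        sections.append((start, last_voiced + 1))
    return sections
```

In this sketch, a short pause of fewer than `end_frames` frames is absorbed into the surrounding voice section, while a longer pause closes the section.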
The signal processing unit 91 receives information on the individual voice sections determined by the voice section detecting unit 92 and converts the voice corresponding to each voice section into features. The decoder 93 compares the features calculated by the signal processing unit 91 with an acoustic model acquired from the acoustic model storing unit 95, which models what kind of features each phoneme tends to produce, thus calculating a phoneme score for each frame. Furthermore, based on the calculated scores, the decoder 93 hypothesizes word sequences (sentences) according to a grammar stored in the grammar storing unit 94 and recognition vocabulary information stored in the vocabulary dictionary storing unit 96, thus calculating a score of each word sequence. The decoder 93 sends the obtained word sequences and scores to the result output unit 97. When the decoder 93 finishes processing all the features inputted from the signal processing unit 91, it notifies the result output unit 97 that the processing has finished. Upon being notified by the decoder 93 that the processing has finished, the result output unit 97 outputs the word sequence having the best score calculated in the decoder 93 as a recognition result candidate.
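The scoring flow described above can be sketched in simplified form. The following is an illustrative stand-in, not the decoder 93 itself: the per-frame phoneme scores are assumed to be log-scores supplied as dictionaries, the grammar is assumed to be a plain list of permitted word sequences, the vocabulary dictionary is assumed to map each word to a phoneme sequence, and a simple dynamic-programming alignment replaces whatever search a practical decoder would use. All names (`alignment_score`, `recognize`, `lexicon`) are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

NEG_INF = float("-inf")


def alignment_score(
    frame_scores: Sequence[Dict[str, float]], phonemes: Sequence[str]
) -> float:
    """Best total log-score of aligning the frames to the phonemes in order.

    Each phoneme must cover at least one frame, and phonemes cannot be skipped.
    """
    if not phonemes or len(frame_scores) < len(phonemes):
        return NEG_INF
    # best[j]: best score so far with the current frame assigned to phonemes[j]
    best = [NEG_INF] * len(phonemes)
    best[0] = frame_scores[0].get(phonemes[0], NEG_INF)
    for frame in frame_scores[1:]:
        new = [NEG_INF] * len(phonemes)
        for j, ph in enumerate(phonemes):
            stay = best[j]                               # remain in the same phoneme
            advance = best[j - 1] if j > 0 else NEG_INF  # move on from the previous one
            prev = max(stay, advance)
            if prev > NEG_INF:
                new[j] = prev + frame.get(ph, NEG_INF)
        best = new
    # The last frame must be assigned to the last phoneme.
    return best[-1]


def recognize(
    frame_scores: Sequence[Dict[str, float]],
    grammar: Sequence[Sequence[str]],
    lexicon: Dict[str, List[str]],
) -> Tuple[Optional[List[str]], float]:
    """Score every word sequence the grammar permits and return the best one."""
    best_words: Optional[List[str]] = None
    best_score = NEG_INF
    for words in grammar:
        # Expand the word sequence into phonemes using the vocabulary dictionary.
        phonemes = [ph for w in words for ph in lexicon[w]]
        score = alignment_score(frame_scores, phonemes)
        if score > best_score:
            best_words, best_score = list(words), score
    return best_words, best_score
```

As a usage example under the same assumptions, a two-word vocabulary with `lexicon = {"go": ["g", "o"], "no": ["n", "o"]}` and `grammar = [["go"], ["no"]]` would return whichever word aligns to the supplied frame scores with the higher total, which corresponds to the recognition result candidate passed to the result output unit 97.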
As described above, the conventional voice recognition system can recognize a speech vocabulary sequence from a user's voice.
In many cases, a natural human utterance contains unwanted sounds such as misstatements, hesitations or coughing. In a conversation between humans, the listener can accurately recognize a self-repair utterance that follows a misstatement or hesitation and can ignore unwanted sounds. For voice recognition systems as well, several techniques have conventionally been suggested for recognizing such a self-repair utterance correctly or ignoring such unwanted sounds.
For example, a voice recognition apparatus has been suggested that performs voice recognition segment by segment and changes how a following segment is handled according to the recognition result of the preceding segments in a sentence, thereby improving the degree of freedom allowed in a speaker's speech and the voice recognition rate (see JP 7(1995)-230293 A).