The present invention relates to an improved speech recognition device, and in particular to a speech recognition device capable of reliably recognizing the voice of a specific speaker whose distinctive voice features have been analyzed and registered in advance as a pattern.
Considerable research has been conducted into speech recognition technology, and simple speech recognition devices have been developed which recognize a limited vocabulary of words by comparing each new utterance with the voice data of previous utterances, the distinctive features of which have been registered and recorded. Such devices are approaching practical application.
A typical example of this type of speech recognition device is shown in FIG. 1. In FIG. 1, when a voice enters a microphone 1, the voice signal passes from the microphone 1 into an amplifier 2, in which it is amplified. The amplified signal is then resolved by a frequency spectrum analyzer 3 into, for example, 16 consecutive frequency bands; the band outputs are sampled by a subsequent switch 4 at, for example, 100 Hz; and each sample is converted by an A/D converter 5 into a digital value of, for example, 8 bits. The output of the A/D converter 5 is entered into a voice interval (voice section) detecting apparatus 6, which determines the initial timing and terminal timing of the voice interval. The time-sectioned bit pattern from the voice interval detecting apparatus 6 is compressed into, say, 240 bits in a code compression apparatus 9. A recognition decision section 7, into which a reference pattern of, say, 240 bits is entered from a reference pattern memory 8, compares the input pattern with the reference pattern. When a voice is to be registered, the variations between repeated utterances are evaluated in an evaluation apparatus 10, an average reference pattern is drawn up, and this reference pattern is entered into the reference pattern memory 8. That is to say, the person whose voice is to be recognized speaks the words, etc. which are to be recognized; each utterance is converted into a pattern through the channel described above; this trial is repeated many times through the main circuit; and the reference pattern stored in the reference pattern memory 8 is thereby formed.
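By way of illustration only, the signal path of FIG. 1 downstream of the A/D converter can be sketched in software as follows. The source does not disclose how the apparatus 6, 9, 10 and the section 7 operate internally, so everything below is an assumption made for the sketch: an energy threshold for voice interval detection, a 16-band by 15-segment binarization to reach 240 bits, a per-bit majority vote for the average reference pattern, and a Hamming-distance comparison. All function names and the synthetic test utterances are hypothetical.

```python
NUM_BANDS = 16      # frequency bands resolved by the spectrum analyzer 3
NUM_SEGMENTS = 15   # assumed: 16 bands x 15 time segments = 240 bits


def detect_voice_interval(frames, threshold):
    """Return (start, end) frame indices of the voice interval, or None.

    Assumption: a frame is voiced when the sum of its band levels
    exceeds `threshold`.
    """
    voiced = [i for i, frame in enumerate(frames) if sum(frame) > threshold]
    if not voiced:
        return None
    return voiced[0], voiced[-1]


def compress_to_bits(frames):
    """Time-normalize frames into NUM_SEGMENTS and emit one bit per band.

    Assumption: a bit is 1 when the band's mean level in the segment
    is above the mean over all bands in that segment.
    """
    bits = []
    n = len(frames)
    for s in range(NUM_SEGMENTS):
        lo = s * n // NUM_SEGMENTS
        hi = max(lo + 1, (s + 1) * n // NUM_SEGMENTS)
        segment = frames[lo:hi]
        means = [sum(f[b] for f in segment) / len(segment)
                 for b in range(NUM_BANDS)]
        overall = sum(means) / NUM_BANDS
        bits.extend(1 if m > overall else 0 for m in means)
    return bits


def extract_pattern(frames, threshold):
    """Detect the voice interval and compress it to a 240-bit pattern."""
    interval = detect_voice_interval(frames, threshold)
    if interval is None:
        raise ValueError("no voice interval detected")
    start, end = interval
    return compress_to_bits(frames[start:end + 1])


def average_reference(patterns):
    """Per-bit majority vote over repeated utterances (assumed averaging)."""
    count = len(patterns)
    return [1 if 2 * sum(p[i] for p in patterns) > count else 0
            for i in range(len(patterns[0]))]


def register(utterances, threshold):
    """Build one reference pattern from several utterances of a word."""
    return average_reference([extract_pattern(u, threshold) for u in utterances])


def recognize(pattern, references):
    """Return the code number of the closest reference (Hamming distance)."""
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(references, key=lambda code: distance(pattern, references[code]))


def make_utterance(peak_band, voiced_frames=30, pad=5):
    """Synthetic 8-bit frames: silence, a tone in one band, silence."""
    silent = [[0] * NUM_BANDS for _ in range(pad)]
    voiced = [[200 if b == peak_band else 10 for b in range(NUM_BANDS)]
              for _ in range(voiced_frames)]
    return silent + voiced + silent


if __name__ == "__main__":
    threshold = 100
    references = {
        1: register([make_utterance(3), make_utterance(3, 28)], threshold),
        2: register([make_utterance(9), make_utterance(9, 34)], threshold),
    }
    unknown = extract_pattern(make_utterance(9, 33), threshold)
    print("selected reference pattern code number:", recognize(unknown, references))
```

In this sketch the value returned by `recognize` plays the role of the code number set in the output register. Note that `register` stores whatever pattern the detection step yields, with no check on whether the voice interval was detected correctly.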
The recognition decision section 7 compares the output pattern from the code compression apparatus 9 with the reference patterns and decides to which pattern the input voice belongs. That decision, for instance the code number of the selected reference pattern, is set in an output register 11, and the recognition is completed. In this type of conventional speech recognition apparatus, the voice to be registered is simply analyzed, its distinctive features are extracted, and the pattern of the extracted features is stored in the reference pattern memory with no additional processing. Therefore, even if the speaker speaks properly, extraneous noise may be mixed in, and apart from the speaker's intuitive judgement there is no way of determining whether or not the apparatus has correctly detected the voice intervals. In other words, there is the problem that, for waveforms such as those shown in FIG. 2(a) to (d), a correct pattern of the distinctive features of the voice being registered may not be obtained.
The wide, U-shaped bands under the patterns shown in FIG. 2 indicate the voice sections detected by the apparatus. Case (a) is an example in which the voice section is detected correctly but noise is mixed into the voice waveform; case (b) is an example in which the detected voice section is shortened because the voice intermission (section b, a pause between words) is too long; case (c) is an example in which the leading end of the speech is weak, so that the leading end of the voice section is cut off; and case (d) is an example in which the trailing end of the voice section is cut off because the trailing end of the speech is husky.
In cases such as these, whether or not the distinctive pattern has been registered correctly can only be judged intuitively by the speaker himself.
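The detection failures of cases (b), (c) and (d) can be reproduced with a simple threshold detector, sketched below. This is not the detector of the conventional apparatus, whose internal logic the source does not describe. It is a hypothetical one that starts the voice section at the first frame whose energy exceeds `threshold` and closes it once more than `max_pause` consecutive quiet frames have been seen. The energy envelopes and parameter values are invented for illustration.

```python
def detect_section(energy, threshold, max_pause):
    """Return (start, end) of the first detected voice section, or None.

    Hypothetical rule: the section opens at the first frame above
    `threshold` and closes at the last voiced frame seen before a run
    of more than `max_pause` quiet frames.
    """
    start = None
    end = None
    quiet = 0
    for i, level in enumerate(energy):
        if level > threshold:
            if start is None:
                start = i
            end = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > max_pause:
                break
    return None if start is None else (start, end)


# Invented per-frame energy envelopes; the true speech spans are noted.
normal = [0] * 3 + [50] * 10 + [0] * 3                    # speech: frames 3..12
long_pause = [0] * 3 + [50] * 5 + [0] * 6 + [50] * 5 + [0] * 3  # speech: 3..18
weak_start = [0] * 3 + [8] * 4 + [50] * 6 + [0] * 3       # speech: frames 3..12
husky_end = [0] * 3 + [50] * 6 + [8] * 4 + [0] * 3        # speech: frames 3..12

if __name__ == "__main__":
    for name, envelope in [("normal", normal),
                           ("case (b) long pause", long_pause),
                           ("case (c) weak leading end", weak_start),
                           ("case (d) husky trailing end", husky_end)]:
        print(name, detect_section(envelope, threshold=10, max_pause=4))
```

In each failing case the detector still returns a plausible-looking section, so nothing in its output signals that part of the utterance was lost.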