This invention relates to a speech recognition system.
A speech recognition system for automatically recognizing words spoken by human beings is believed to be very effective as a novel means for supplying various data and commands as voice inputs from human beings to electronic digital computers and other apparatus, herein called either "controlled devices" or, as usual, utilization devices. For instance, a speech recognition system for recognizing spoken numerals is capable of supplying numerical data from slips, tickets, and the like to a controlled device connected thereto. This makes it possible to provide a novel and effective mode of supplying input data to the controlled device from remote locations because speech signals are readily transmitted through inexpensive telephone channels. A speech recognition system for recognizing the spoken commands by which an operator controls various controlled devices enables the operator to control the devices by voice alone and to use his hands and feet for other purposes, thereby enabling him to carry out a plurality of jobs simultaneously and to make full use of his capability.
A speech recognition system hitherto developed, however, is liable to operate incorrectly and to misrecognize the voice inputs when ambient noises are present and/or when utterance or pronunciation of the voice inputs is ambiguous. It is therefore necessary, in cases where errors in the inputs to a controlled device are strictly forbidden, to provide a speech recognition system with facilities for displaying a result of recognition for confirmation by the operator as soon as a voice input comes to an end and is ultimately decided by the system to be a certain sequence of vowels and consonants. With such facilities, the operator cyclically repeats the steps of utterance of a voice input and confirmation of the recognition result. So long as no misrecognition is found during the confirmation step, the operator carries out the input operation by pronouncing successive voice inputs. When an error is found, the operator pronounces the same voice input again, with supply of the incorrect recognition result to the controlled device suspended until it is confirmed that the misrecognition has been corrected.
In order to proceed with the input operation with such a speech recognition system at a high speed, it is indispensable in the first place to make the speech recognition system rapidly display the result of recognition for confirmation by the operator. The problem here is that several hundreds of milliseconds are necessary after actual termination of utterance for the system to decide the result of recognition. More particularly, termination of utterance is detected in almost all speech recognition systems available at present by watching amplitude levels of the voice inputs. It is therefore inappropriate to decide in haste that an instant at which the amplitude level falls to zero (or, in practice, to a sufficiently low level) is an end of the voice input of a word. Determination of the end becomes possible only after the amplitude level has remained at zero for a predetermined period of time, such as 250 milliseconds.
Let the utterance be for a numeral "6" (/roku/ in Japanese). A break or pause in a sense is interposed between /ro/ and /ku/, at which the amplitude level falls to zero (such a pause being hereafter called a "pause interval in a word"). If an instant at which the amplitude level falls to zero were decided to be an end of utterance of a word, then /ro/ would be understood to be a complete word and would possibly be misrecognized as another numeral "5" (/go/ in Japanese). It is therefore mandatory for avoidance of such troubles to correctly judge whether an interval in which the amplitude level is left at zero is a pause interval in a word or a true end of a word, namely, an "end interval" either following a word or between two consecutive words. A pause interval in a word is usually shorter than about 200 milliseconds. From this fact, it is possible to conclude that a zero amplitude level interval equal to or shorter than a predetermined period of about 250 milliseconds is a pause interval in a word and that a zero amplitude level interval longer than the predetermined period is an end interval of a word. As an eventual result, it has been infeasible to display the recognition result before a lapse of the predetermined period of time after termination of utterance.
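The conventional end decision described above may be illustrated by the following minimal sketch, which is given only by way of example and is not part of any system described herein. The function name find_word_ends, the frame period of 10 milliseconds, the low-level threshold of 0.05, and the idealized amplitude sequence are all assumptions introduced for illustration; only the predetermined period of about 250 milliseconds is taken from the description above.

```python
from typing import List, Tuple


def find_word_ends(
    levels: List[float],
    frame_ms: int = 10,       # assumed frame period (hypothetical)
    low_level: float = 0.05,  # assumed "sufficiently low" amplitude level
    hold_ms: int = 250,       # predetermined period from the description
) -> List[Tuple[int, int]]:
    """Return (utterance_end_frame, decision_frame) for each detected word.

    A low-level run lasting hold_ms or less is treated as a pause interval
    in a word; a longer run is treated as an end interval of a word.
    """
    hold_frames = hold_ms // frame_ms
    decisions: List[Tuple[int, int]] = []
    silent_run = 0
    in_word = False
    for i, level in enumerate(levels):
        if level > low_level:
            # Voiced frame: a word is in progress and any pause is cancelled.
            in_word = True
            silent_run = 0
        elif in_word:
            silent_run += 1
            if silent_run > hold_frames:
                # The low level has lasted longer than the predetermined
                # period, so the end of the word is decided only now.
                decisions.append((i - silent_run + 1, i))
                in_word = False
                silent_run = 0
    return decisions


# Idealized /roku/: 200 ms of /ro/, a 150 ms pause, 200 ms of /ku/, silence.
roku = [1.0] * 20 + [0.0] * 15 + [1.0] * 20 + [0.0] * 40

conventional = find_word_ends(roku)
hasty = find_word_ends(roku, hold_ms=0)
```

With the predetermined period of 250 milliseconds, the pause between /ro/ and /ku/ is correctly ignored, but the end of the word is decided 26 frames (260 milliseconds under the assumed frame period) after the utterance actually stops. With the period set to zero, the same input is split into two words, which corresponds to the misrecognition of /ro/ described above.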
In view of the facts described hereinabove, it has been impossible with a conventional speech recognition system for an operator to know the result of recognition until several hundreds of milliseconds have been wasted after termination of utterance, or to pronounce the next voice input before the recognition result is displayed and confirmed to be correct. A conventional speech recognition system has therefore been incapable of supplying inputs by voice to controlled devices at a high speed.
In order to smoothly carry out, with a speech recognition system of the type described, the steps of utterance of voice inputs, confirmation of the results of recognition, and correction, if necessary, of incorrect results of recognition, it is indispensable in the second place that the correction should be accomplished before the incorrect recognition result is undesirably supplied to a controlled device. If the recognition result were supplied to the device no later than the instant at which the result is displayed, confirmation and correction would be next to impossible. It is therefore desirable to provide a sufficient interval of time for the confirmation and correction and yet to keep the high speed of the input operation.