This invention relates to speech recognition techniques and, more particularly, to improved apparatus and method for recognizing words that are spoken at speeds that approach the speed of "continuous" speech. An improved apparatus and method of speech feature array correlation is also set forth. The subject matter of this invention relates to subject matter set forth in the copending U.S. Appl. Ser. No. 138,646, entitled "Speech Recognition Apparatus And Method", filed of even date herewith and assigned to the same assignee as the present application.
There have been previously developed various equipments that recognize limited vocabularies of spoken words by analysis of acoustic events. Typically, such equipments are useful in "voice command" applications wherein, upon recognizing particular words, the equipment produces electrical signals which control the operation of a companion system. For example, a voice command could be used to control a conveyor belt to move in a specified manner or may control a computer to perform specified calculations.
Previous efforts to develop automatic methods of speech recognition, while attaining some success, led to the realization of the exceedingly complex nature of speech communication. Normal speech has a high information content with considerable variability from speaker to speaker and some variability even in the same word when spoken by the same individual. Therefore, a "perfect" recognition scheme is unattainable since the nature of the speech signal to be recognized cannot be precisely defined. As a result, the preferred past schemes have been empirical approaches which have yielded at least a reasonable level of confidence, from a statistical standpoint, that a particular spoken word corresponded to a selected one of a limited machine vocabulary. The desirability of such schemes is thus not determinable by theoretical examination, but rather by a straightforward measure of recognition accuracy over an extended period of operation.
For various reasons, most prior art systems have been found unsuitable for practical applications. One of the prime reasons has been the sheer complexity of equipments that attempted to make an overly rigorous analysis of received speech signals. In addition to the expense and appurtenant unreliability, such systems have a tendency to establish highly complicated and restrictive recognition criteria that may reject normal variations of the system vocabulary words. Conversely, some equipments suffer from establishing recognition criteria that are too easily met and result in the improper acceptance of extraneous words not included in the preselected vocabulary of the equipment.
In U.S. Pat. No. 4,069,393, assigned to the same assignee as the present application, there is disclosed an apparatus which receives spoken input "training" words and a subsequent spoken input "command" word and generates a correlation function that is indicative of the resemblance of the command word to each training word. A feature extraction means processes received input words and generates digital feature output signals on particular ones of a number of feature output lines, the particular ones depending on the characteristic features of the word being spoken. The status of the feature signals which occur during each training word is stored as a time normalized matrix or array. Subsequently, the status of the feature signals which occur during a command word is also stored as a time normalized array. The command word array is then compared, member by member, with each training word array and a correlation figure is generated for each comparison. If a sufficiently high correlation is found between the command word array and a particular training word array, the command word is deemed to correspond to the particular training word. Existing versions of this type of system have been found to operate most satisfactorily in applications where command words are spoken in "isolation"; i.e., where there are very distinct pauses (e.g., of the order of hundreds of milliseconds) between words, the pauses defining the word boundaries. Generally, circuitry is provided which senses the onset of speech after a pause and which then senses the next substantial absence of speech. These occurrences are considered the boundaries of a word and the feature events which occur between these boundaries are used to form the array referred to above.
While this type of system has achieved commercial acceptance, the informational entry rate is necessarily limited, and the speakers must be trained to provide the required relatively long pauses between words, or else recognition errors will occur at an unacceptable rate.
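The member-by-member array comparison described above can be illustrated with a minimal sketch. It is not the circuitry or scoring of U.S. Pat. No. 4,069,393: the function names (`time_normalize`, `correlate`, `recognize`), the rule that a feature is marked present in a time slot if it occurred in any sample falling in that slot, and the use of a simple fraction of agreeing members as the correlation figure are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Rows are time slots, columns are feature lines; each member is 0 or 1.
FeatureArray = List[List[int]]


def time_normalize(samples: FeatureArray, slots: int) -> FeatureArray:
    """Map a variable-length run of feature samples onto a fixed number of time slots."""
    n = len(samples)
    width = len(samples[0])
    out = []
    for s in range(slots):
        lo = s * n // slots
        hi = max(lo + 1, (s + 1) * n // slots)
        segment = samples[lo:hi]
        # Assumed rule: a feature is present in a slot if any sample in the slot had it.
        out.append([int(any(row[f] for row in segment)) for f in range(width)])
    return out


def correlate(a: FeatureArray, b: FeatureArray) -> float:
    """Assumed correlation figure: fraction of array members that agree."""
    total = sum(len(row) for row in a)
    agree = sum(x == y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return agree / total


def recognize(
    command: FeatureArray, training: Dict[str, FeatureArray], threshold: float
) -> Tuple[Optional[str], float]:
    """Return the best-matching training word, or None if its score is below threshold."""
    scores = {word: correlate(command, arr) for word, arr in training.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] >= threshold else None), scores[best]
```

Because both arrays are normalized to the same number of time slots before comparison, the sketch depends on the word boundaries being known in advance, which is the reason such systems require pauses between words.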
In U.S. Pat. No. 3,883,850, assigned to the same assignee as the present application, there is described a type of system that has been employed in the past with limited success to recognize the occurrence of words during continuous or connected speech. The technique utilized is a sequential analysis of phonetic events. A sequential logic "chain" is provided for each word to be recognized. Each chain includes a number of logic stages, one stage being provided for each phonetic event of the word to be recognized. The logic stages are configured in a series arrangement and selectively enabled in such a manner that they are sequentially activated when a particular sequence of phonetic events (or features) occurs. As a simplified example, the word "red" can be expressed by the phonetic sequence /r/→/ε/→/d/. Accordingly, a logic chain employed to recognize the word red would have three logic stages coupled in series, the first stage being enabled by the sensing of an /r/ sound, the second stage being enabled by the sensing of an /ε/ sound and the third stage being enabled by the sensing of a /d/ sound. Of course, the second and third stages would each also require the prior stage to have been enabled as a precondition. When the last stage is enabled, the system indicates that the word red has been spoken since the phonemes /r/, /ε/, and /d/ are known to have occurred in the listed order. As explained in the above-referenced patents, the system typically requires that the phonemes occur within certain time constraints and provides for a logic chain to be "reset" (i.e., start over from scratch in looking for its vocabulary word) upon occurrence of certain acoustic features which would indicate a strong improbability that the sought vocabulary word is being uttered.
It will be appreciated that the sequential logic type of system as described has the capability of recognizing vocabulary words among continuous speech, even when there is no discernible pause before or after the word is spoken. This is because the system is designed to sense the occurrence of a particular sequence of phonemes and no word boundaries need occur to isolate the word so an analysis can be made. Notwithstanding this advantage, it has been found that the described type of sequential logic system has certain deficiencies in the present state of the art. In general terms, speech recognition systems sometimes establish overly restrictive recognition criteria, and this is often the case with the sequential logic type of system. Specifically, if the sequential logic system requires a certain restrictive sequence of phonemes for recognition, the absence of even a single phoneme from the prescribed sequence will prevent a recognition indication. In many cases such restriction causes a sought word to go unrecognized since contextual effects can easily cause even the same speaker to extraneously insert or omit a phoneme (or, more precisely, a phonetic feature) when uttering the same word on different occasions. This type of error lowers the system's recognition rate. The recognition rate can obviously be raised by relaxing the recognition criteria and allowing various alternative sequences to trigger recognition indications. However, such relaxation is found to increase the occurrence of "false alarms"; i.e., false triggerings of recognition indications by words (or phonetic sequences in adjacent words) that are similar to a word being sought.
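The sequential logic chain for the word "red" can be sketched in software as a small state machine. This is a simplified illustration rather than the logic of U.S. Pat. No. 3,883,850: the class name `LogicChain`, the helper `spot`, the string labels `"r"`, `"eh"` (standing for /ε/), and `"d"`, and the example reset feature `"s"` are hypothetical, and the time constraints between phonemes are omitted.

```python
from typing import Iterable, List


class LogicChain:
    """One chain per vocabulary word; one stage per expected phonetic event."""

    def __init__(self, word: str, phonemes: Iterable[str], reset_on: Iterable[str] = ()):
        self.word = word
        self.phonemes = list(phonemes)
        self.reset_on = set(reset_on)  # features that make the word improbable
        self.stage = 0                 # number of stages enabled so far

    def feed(self, event: str) -> bool:
        """Process one phonetic event; return True when the last stage is enabled."""
        if event in self.reset_on:
            self.stage = 0  # start over from scratch
            return False
        # A stage is enabled only if its event occurs and all prior stages are enabled.
        if event == self.phonemes[self.stage]:
            self.stage += 1
            if self.stage == len(self.phonemes):
                self.stage = 0
                return True
        return False


def spot(chain: LogicChain, events: Iterable[str]) -> List[int]:
    """Return the indices in the event stream at which the chain's word is recognized."""
    return [i for i, event in enumerate(events) if chain.feed(event)]


red = LogicChain("red", ["r", "eh", "d"], reset_on={"s"})
print(spot(red, ["dh", "r", "eh", "d", "k"]))  # word found with no pause around it
```

Feeding a fresh chain the stream `["r", "d"]` yields no recognition, which mirrors the deficiency noted above: the omission of a single expected phoneme prevents the last stage from ever being enabled.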
In U.S. Pat. Nos. 3,943,295 and 4,107,460, assigned to the same assignee as the present invention, there are disclosed improved techniques of recognizing one or more words from among continuous speech wherein a sequential type of analysis is employed to determine the boundaries of a command word candidate, and then the speech features which occur between the boundaries are correlated against stored speech features, using an array comparison of the type mentioned above in conjunction with "isolated" speech recognition systems. In the latter patent, the sequential processing includes comparing feature subsets of received speech with stored feature subsets of vocabulary words in order to determine the boundaries of command word candidates. While the techniques of these patents are promising, the type of processing used therein has not, in the present state of the art, been successfully employed for continuous speech recognition of significant vocabulary sizes.
It has been proposed that continuous speech could be processed by considering each speech sample (i.e., sample of speech features taken at regular time intervals) of a length of speech as a possible start or end point of an individual vocabulary word. All possible words of the system's vocabulary are correlated against each group of speech samples. In other words, each vocabulary word is correlated against the speech samples comprising each possible word position within the length of speech being processed. Once this has been done, the correlation scores can be used to select an optimum sequence of vocabulary words which best matches the continuous speech being processed. A problem, however, with such a rigorous approach is the cost and/or processing time involved when the number of possible start and end points becomes large, as will occur for continuous lengths of speech that include a sequence of only a few words.
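The cost of the exhaustive approach can be made concrete with a short sketch. The function names `count_correlations` and `best_sequence`, the caller-supplied `score` function, and the use of dynamic programming to pick the optimum word sequence are assumptions for illustration; the text above says only that the correlation scores are used to select an optimum sequence, without stating how.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def count_correlations(num_samples: int, vocab_size: int) -> int:
    """Number of (start, end, word) correlations when every sample may start or end a word."""
    spans = num_samples * (num_samples + 1) // 2  # all contiguous groups of samples
    return spans * vocab_size


def best_sequence(
    samples: Sequence,
    vocab: Sequence[str],
    score: Callable[[str, Sequence], float],
) -> Tuple[float, List[str]]:
    """Pick the word sequence with the highest total score over all segmentations.

    best[e] holds the best (total score, words) covering samples[:e].
    """
    n = len(samples)
    best: List[Optional[Tuple[float, List[str]]]] = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            prev_score, prev_words = best[start]
            for word in vocab:
                total = prev_score + score(word, samples[start:end])
                if best[end] is None or total > best[end][0]:
                    best[end] = (total, prev_words + [word])
    return best[n]


# 100 samples and a 20-word vocabulary already require 101,000 correlations.
print(count_correlations(100, 20))
```

The number of candidate spans grows with the square of the number of samples, so even a short utterance sampled at regular intervals produces a large number of array correlations per vocabulary word.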
It is among the objects of the invention to provide the following:
(a) A practical apparatus and method for recognizing strings of words that are spoken at a rate faster than was possible in prior "isolated" word recognition systems and which approaches the speed of continuous speech.
(b) An apparatus having defined operational units, with associated memory, that perform on a priority basis that renders practical the processing of speech that is continuous or almost continuous in nature.
(c) An improved apparatus and method of speech feature array correlation that has application both in "isolated" word types of speech recognition systems and in "continuous", or almost continuous, speech recognition systems.