This invention relates to speech recognition apparatus and, more particularly, to an apparatus for recognizing the occurrence of a specific word or words from among continuous speech.
There have been previously developed various equipments that attempt to recognize limited vocabularies of spoken words by analysis of acoustic events. Typically, such equipments are envisioned as being useful in "voice command" applications wherein, upon recognizing particular words, the equipment produces electrical signals which control the operation of a companion system. For example, a voice command could be used to control a conveyor belt to move in a specified manner or may control a computer to perform specified calculations.
Previous efforts to develop automatic methods of speech recognition have had limited success and have led to the realization of the exceedingly complex nature of speech communication. Normal speech has a high information content with considerable variability from speaker to speaker and some variability even in the same word when spoken by the same individual. Therefore, a "perfect" recognition scheme is unattainable since the nature of the speech signal to be recognized cannot be precisely defined. As a result, the preferred past schemes have been empirical approaches which have yielded at least a reasonable level of confidence, from a statistical standpoint, that a particular spoken word corresponded to a selected one of a limited machine vocabulary. The desirability of such schemes is thus not determinable by theoretical examination, but rather by a straightforward measure of recognition accuracy over an extended period of operation.
For various reasons, most prior art systems have been found unsuitable for practical applications. One of the prime reasons has been the sheer complexity of equipments that attempted to make an overly rigorous analysis of received speech signals. In addition to the expense and appurtenant unreliability, such systems have a tendency to establish highly complicated and restrictive recognition criteria that may reject normal variations of the system vocabulary words. Conversely, some equipments suffer from establishing recognition criteria that are too easily met and result in the improper acceptance of extraneous words not included in the preselected vocabulary of the equipment.
In the copending application Ser. No. 531,543, filed Dec. 11, 1974, and assigned to the same assignee as the present application, there is disclosed an apparatus which receives spoken input "training" words and a subsequent spoken input "command" word and generates a correlation function that is indicative of the resemblance of the command word to each training word. A feature extraction means processes received input words and generates digital feature output signals on particular ones of a number of feature output lines, the particular ones depending on the characteristic features of the word being spoken. The status of the feature signals which occur during each training word is stored as a normalized time-dependent matrix. Subsequently, the status of the feature signals which occur during a command word is also stored as a normalized time-dependent matrix. The command word matrix is then compared, member by member, with each training word matrix and a correlation figure is generated for each comparison. If a sufficiently high correlation is found between the command word matrix and a particular training word matrix, the command word is deemed to correspond to the particular training word. This type of system has been found to operate most satisfactorily in applications where command words are spoken in "isolation"; i.e., where there are discernible pauses between words, the pauses defining the word boundaries. Generally, circuitry is provided which senses the onset of speech after a pause and which then senses the next substantial absence of speech. These occurrences are considered the boundaries of a word and the feature events which occur between these boundaries are used to form the matrix referred to above.
Clearly, any system wherein distinct pauses are required to determine word boundaries will necessarily have severely limited capability for recognizing words from among natural continuous speech since there are often few or no discernible pauses between words in natural speech.
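By way of illustration only, the member-by-member matrix comparison described above can be sketched as follows. This is a minimal sketch and not the circuitry of the referenced application: the function names (`normalize_time`, `correlation`, `recognize`), the choice of four time slots, the rule that a feature is "on" in a slot if it was on in any frame of that slot, and the 0.8 threshold are all assumptions made for the example.

```python
def normalize_time(frames, slots):
    """Map a variable-length list of binary feature frames onto a fixed
    number of time slots (hypothetical normalization rule)."""
    n = len(frames)
    width = len(frames[0])
    matrix = []
    for s in range(slots):
        start = s * n // slots
        end = max(start + 1, (s + 1) * n // slots)
        chunk = frames[start:end]
        # A feature is marked present in a slot if any frame in the slot had it.
        matrix.append([int(any(f[i] for f in chunk)) for i in range(width)])
    return matrix


def correlation(a, b):
    """Fraction of matrix members on which the two matrices agree."""
    total = sum(len(row) for row in a)
    matches = sum(x == y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return matches / total


def recognize(command_frames, training, slots=4, threshold=0.8):
    """Compare the command matrix with every training matrix and return
    (best_word_or_None, best_score)."""
    cmd = normalize_time(command_frames, slots)
    best_word, best_score = None, 0.0
    for word, frames in training.items():
        score = correlation(cmd, normalize_time(frames, slots))
        if score > best_score:
            best_word, best_score = word, score
    return (best_word if best_score >= threshold else None), best_score


# Two features per frame; the command is a slower utterance of "go".
training = {
    "go": [[1, 0], [1, 0], [0, 1], [0, 1]],
    "stop": [[0, 1], [0, 1], [1, 0], [1, 0]],
}
command = [[1, 0]] * 4 + [[0, 1]] * 4
word, score = recognize(command, training)
```

Because both matrices are reduced to the same number of time slots before comparison, a word spoken more slowly than its training utterance can still correlate highly with it. The sketch presupposes that the frames handed to it already lie between known word boundaries.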
In the U.S. Pat. No. 3,883,850 assigned to the same assignee as the present application, there is described a type of system that has been employed in the past with some success to recognize the occurrence of words during continuous or connected speech. The technique utilized is a sequential analysis of phonetic events. A sequential logic "chain" is provided for each word to be recognized. Each chain includes a number of logic stages, one stage being provided for each phonetic event of the word to be recognized. The logic stages are configured in a series arrangement and selectively enabled in such a manner that they are sequentially activated when a particular sequence of phonetic events (or features) occurs. As a simplified example, the word "red" can be expressed by the phonetic sequence /r/→/ε/→/d/. Accordingly, a logic chain employed to recognize the word red would have three logic stages coupled in series, the first stage being enabled by the sensing of an /r/ sound, the second stage being enabled by the sensing of an /ε/ sound and the third stage being enabled by the sensing of a /d/ sound. Of course, the second and third stages would each also require the prior stage to have been enabled as a precondition. When the last stage is enabled, the system indicates that the word red has been spoken since the phonemes /r/, /ε/, and /d/ are known to have occurred in the listed order. As explained in the above-referenced patent, the system typically requires that the phonemes occur within certain time constraints and provides for a logic chain to be "reset" (i.e., start over from scratch in looking for its vocabulary word) upon occurrence of certain acoustic features which would indicate a strong improbability that the sought vocabulary word is being uttered.
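The behavior of such a sequential logic chain can be illustrated in software. The following is a minimal sketch under stated assumptions, not the hardware of the referenced patent: the class name `LogicChain`, the `max_gap` time constraint (counted in intervening features), and the use of /s/ as a reset feature are hypothetical, and the label "eh" stands in for /ε/.

```python
class LogicChain:
    """Series of stages, each enabled by one phonetic feature, in order."""

    def __init__(self, word, sequence, reset_features=(), max_gap=5):
        self.word = word
        self.sequence = list(sequence)
        self.reset_features = set(reset_features)  # features that rule the word out
        self.max_gap = max_gap  # assumed limit on features between stages
        self.reset()

    def reset(self):
        """Start over from scratch in looking for the vocabulary word."""
        self.stage = 0
        self.idle = 0

    def step(self, feature):
        """Feed one sensed feature; return True when the last stage is enabled."""
        if feature in self.reset_features:
            self.reset()
            return False
        if feature == self.sequence[self.stage]:
            # The next stage is enabled only because all prior stages were.
            self.stage += 1
            self.idle = 0
            if self.stage == len(self.sequence):
                self.reset()
                return True
            return False
        if self.stage > 0:
            # Partway through the word: enforce the time constraint.
            self.idle += 1
            if self.idle > self.max_gap:
                self.reset()
        return False


def spot(chain, stream):
    """Return the stream positions at which the chain signals its word."""
    chain.reset()
    return [i for i, feature in enumerate(stream) if chain.step(feature)]


red = LogicChain("red", ["r", "eh", "d"], reset_features={"s"}, max_gap=2)
hits = spot(red, ["a", "r", "eh", "d", "a"])
```

In this example `hits` marks the position of the /d/ that completes the word, and no pause before or after the word is needed. A stream such as r, s, eh, d produces no indication because the assumed reset feature returns the chain to its first stage.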
It will be appreciated that the sequential logic type of system as described has a capability of recognizing vocabulary words among continuous speech, even when there is no discernible pause before or after the word is spoken. This is because the system is designed to sense the occurrence of a particular sequence of phonemes and no word boundaries need occur to isolate the word so an analysis can be made. Notwithstanding this advantage, it has been found that the described type of sequential logic system has some recognition deficiencies that leave room for improvement. As alluded to above in general terms, speech recognition systems sometimes establish overly restrictive recognition criteria, and this is often the case with the sequential logic type of system. Specifically, if the sequential logic system requires a certain restrictive sequence of phonemes for recognition, the absence of even a single phoneme from the prescribed sequence will prevent a recognition indication. In many cases such restriction causes a sought word to go unrecognized since contextual effects can easily cause even the same speaker to extraneously insert or omit a phoneme (or, more precisely, a phonetic feature) when uttering the same word on different occasions. This type of error lowers the system's recognition rate. The recognition rate can obviously be raised by relaxing the recognition criteria and allowing various alternative sequences to trigger recognition indications. However, such relaxation is found to increase the occurrence of "false alarms"; i.e., false triggerings of recognition indications by words (or phonetic sequences in adjacent words) that are similar to a word being sought.
In the U.S. Pat. No. 3,943,295 assigned to the same assignee as the present invention, there is disclosed a speech recognition apparatus which is capable of recognizing words from among continuous speech and which exhibits a relatively high recognition rate and a relatively low false alarm rate. In that invention, means are provided for generating feature signals which depend on the features of an input word being spoken. The feature signals are processed to determine the time interval of occurrence of a predetermined sequence of features. Further means are provided for comparing the feature signals which occur during the determined time interval with a stored set of features that are expected to occur characteristically during the command word to determine the degree of correlation therebetween. In other words, a sequential type of analysis is performed initially to determine the boundaries of a command word during continuous speech and, once determined, the speech features which occur between the boundaries are correlated as against a stored set of features. The present invention is of the general type set forth in the U.S. Pat. No. 3,943,295, but is an improvement thereon. In a disclosed embodiment in the patent, the sequential processing of feature signals is performed using a sequential logic chain having a plurality of sequential logic units which are sequentially activated when signals appear on logic input terminals of the sequential logic units. The present invention includes, inter alia, an improved version of the sequential processing technique of the described system.
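The two-stage approach summarized above, in which a sequential analysis first fixes the boundaries of a candidate word and the features between those boundaries are then correlated against a stored set, can be sketched as follows. This is an illustrative sketch only and does not reproduce the disclosed embodiment: the function names, the representation of each frame as a set of active feature labels, the slot-based time normalization, and the 0.8 threshold are assumptions, and "eh" again stands in for /ε/.

```python
def find_interval(frames, sequence):
    """Stage 1: return (start, end) frame indices spanning the first in-order
    occurrence of the feature sequence, or None if it never completes."""
    stage, start = 0, None
    for t, active in enumerate(frames):
        if sequence[stage] in active:
            if stage == 0:
                start = t
            stage += 1
            if stage == len(sequence):
                return start, t
    return None


def slot_features(frames, slots):
    """Time-normalize frames into a fixed number of slots (union per slot)."""
    n = len(frames)
    out = []
    for s in range(slots):
        lo = s * n // slots
        hi = max(lo + 1, (s + 1) * n // slots)
        out.append(set().union(*frames[lo:hi]))
    return out


def correlate(observed, template, features):
    """Stage 2: fraction of (slot, feature) members on which the observed
    features agree with the stored set."""
    agree = sum(
        (f in o) == (f in t)
        for o, t in zip(observed, template)
        for f in features
    )
    return agree / (len(template) * len(features))


def spot_word(frames, sequence, template, features, threshold=0.8):
    """Return (start, end, score) if the word is accepted, else None."""
    interval = find_interval(frames, sequence)
    if interval is None:
        return None
    start, end = interval
    observed = slot_features(frames[start:end + 1], len(template))
    score = correlate(observed, template, features)
    return (start, end, score) if score >= threshold else None


features = ["r", "eh", "d", "voiced"]
template = [{"r", "voiced"}, {"eh", "voiced"}, {"d", "voiced"}]
speech = [{"s"}, {"r", "voiced"}, {"r", "voiced"}, {"eh", "voiced"},
          {"eh", "voiced"}, {"d", "voiced"}, {"s"}]
result = spot_word(speech, ["r", "eh", "d"], template, features)
```

The design point illustrated is that the sequence test can be kept loose, since it only proposes boundaries, while the correlation stage rejects candidates whose overall feature pattern differs from the stored set. In the example, an utterance containing /r/, /ε/, and /d/ in order but lacking the assumed "voiced" feature throughout would be located by the first stage and rejected by the second.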
It is an object of the present invention to provide a speech recognition apparatus which is capable of recognizing words from among continuous speech and which exhibits a relatively high recognition rate and a relatively low false alarm rate.