This invention relates to speech recognition and, more particularly, to an apparatus and method for receiving spoken input "training" (or vocabulary) words, and subsequently recognizing a spoken input command word as being one of the training words.
There have been previously developed various equipments that recognize limited vocabularies of spoken words by analysis of acoustic events. Typically, such equipments are useful in "voice command" applications wherein, upon recognizing particular words, the equipment produces electrical signals which control the operation of a companion system. For example, a voice command could be used to control a conveyor belt to move in a specified manner or may control a computer to perform specified operations.
Previous efforts to develop automatic methods of speech recognition, while attaining varying levels of success, led to the realization of the exceedingly complex nature of speech communication. Normal speech has a high information content with considerable variability from speaker to speaker and some variability even in the same word when spoken by the same individual. Therefore, a "perfect" recognition scheme is unattainable since the nature of the speech signal to be recognized cannot be precisely defined. As a result, the preferred past schemes have been empirical approaches which have yielded at least a reasonable level of confidence, from a statistical standpoint, that a particular spoken word corresponded to a selected one of a limited machine vocabulary. The desirability of such schemes is thus not determinable by theoretical examination, but rather by a straightforward measure of recognition accuracy over an extended period of operation.
For various reasons, most prior art systems have been found unsuitable for practical applications. One of the prime reasons has been the sheer complexity of equipments that attempted to make an overly rigorous analysis of received speech signals. In addition to the expense and appurtenant unreliability, such systems have a tendency to establish highly complicated and restrictive recognition criteria that may reject normal variations of the system vocabulary words. Conversely, some equipments suffer from establishing recognition criteria that are too easily met and result in the improper acceptance of extraneous words not included in the preselected vocabulary of the equipment.
In U.S. Pat. No. 4,069,393, assigned to the same assignee as the parent application, there is disclosed an apparatus which receives spoken input "training" words and a subsequent spoken input "command" word and generates a correlation figure that is indicative of the resemblance of the command word to each training word. A feature extraction means processes received input words and generates digital feature output signals on particular ones of a number of feature output lines, the particular ones depending on the characteristic features of the word being spoken. The status of the feature signals which occur during each training word is stored as a time normalized matrix or array. Subsequently, the status of the feature signals which occur during a command word is also stored as a time normalized array. The command word array is then compared, member by member, with each training word array and a correlation figure is generated for each comparison. If a sufficiently high correlation is found between the command word array and a particular training word array, the command word is deemed to correspond to the particular training word. Existing versions of this type of system have been found to operate most satisfactorily (although not exclusively) in applications where command words are spoken in "isolation"; i.e., where there are distinct pauses (e.g., of the order of hundreds of milliseconds) between words, the pauses defining the word boundaries. Generally, circuitry is provided which senses the onset of speech after a pause and which then senses the next substantial absence of speech. These occurrences are considered the boundaries of a word, and the feature events which occur between these boundaries are used to form the array referred to above.
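By way of illustration only, the array-based scheme just described may be sketched in Python as follows. The sketch is not taken from the referenced patent: the function names (`time_normalize`, `correlation`, `recognize`), the use of a logical OR to merge frames into a time slot, the use of the fraction of matching members as the correlation figure, and the threshold value of 0.8 are all assumptions chosen for clarity.

```python
from typing import Dict, List, Optional, Tuple

Array = List[List[int]]  # rows are time slots, columns are feature lines (0 or 1)


def time_normalize(frames: Array, slots: int) -> Array:
    """Map a variable number of feature frames onto a fixed number of time slots.

    Assumption: a feature is marked present in a slot if it was present in
    any frame assigned to that slot (logical OR).
    """
    n = len(frames)
    if n == 0 or slots <= 0:
        raise ValueError("need at least one frame and one slot")
    width = len(frames[0])
    out = []
    for s in range(slots):
        start = s * n // slots
        end = max(start + 1, (s + 1) * n // slots)  # always cover >= 1 frame
        out.append([int(any(f[k] for f in frames[start:end])) for k in range(width)])
    return out


def correlation(a: Array, b: Array) -> float:
    """Member-by-member comparison: fraction of array members that agree."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("arrays must have the same shape")
    total = len(a) * len(a[0])
    matches = sum(x == y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return matches / total


def recognize(command: Array, training: Dict[str, Array],
              threshold: float = 0.8) -> Tuple[Optional[str], float]:
    """Return the best-matching training word, or None if below threshold."""
    best_word, best_score = None, 0.0
    for word, array in training.items():
        score = correlation(command, array)
        if score > best_score:
            best_word, best_score = word, score
    return (best_word if best_score >= threshold else None), best_score


# Two hypothetical training words, each with two feature lines.
training = {
    "start": time_normalize([[1, 0], [1, 0], [0, 1], [0, 1]], slots=2),
    "stop": time_normalize([[0, 1], [0, 1], [1, 0], [1, 0]], slots=2),
}
# A command word spoken more slowly (six frames instead of four).
command = time_normalize([[1, 0]] * 3 + [[0, 1]] * 3, slots=2)
word, score = recognize(command, training)
```

In this sketch the time normalization is what allows a word spoken over six frames to be compared directly against a training word captured over four frames, since both reduce to arrays of the same dimensions.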
The just described type of speech recognition apparatus has found useful commercial application and can operate with relatively high recognition accuracy, especially when sufficient processing capability is provided to obtain a fairly rigorous analysis of spoken words at a relatively high sampling rate, and when sophisticated correlation techniques are employed. Applicants have noted, however, that speech recognition techniques which employ comparisons of time-dependent arrays can sometimes be subject to degradation of recognition accuracy when a particular word is spoken in a different manner at different times, even by the same speaker. As described in the above-referenced U.S. Pat. No. 4,069,393, this problem can be alleviated somewhat by employing time-normalization of feature arrays and by utilizing time-shifted comparisons of arrays as well as non-time-shifted array comparisons. However, there is still substantial room for improvement of recognition accuracy, especially in systems intended to be relatively inexpensive and therefore intended to operate with limited memory and processing capability. Further, improvement is particularly necessary in systems wherein the training words and command words are not necessarily spoken by the same person.
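The time-shifted comparison mentioned above may be pictured with the following Python sketch, which is offered only as an assumed illustration and does not reproduce the procedure of the referenced patent. The names `shift_array`, `match_fraction` and `best_shifted_score`, the zero-filling of slots vacated by a shift, and the shift range of one slot in either direction are all hypothetical choices.

```python
from typing import List

Array = List[List[int]]  # rows are time slots, columns are feature lines (0 or 1)


def shift_array(arr: Array, shift: int) -> Array:
    """Shift an array later (positive) or earlier (negative) in time.

    Assumption: slots vacated by the shift are filled with zeros,
    i.e. no feature is considered present there.
    """
    slots, width = len(arr), len(arr[0])
    return [list(arr[t - shift]) if 0 <= t - shift < slots else [0] * width
            for t in range(slots)]


def match_fraction(a: Array, b: Array) -> float:
    """Fraction of array members that agree, compared member by member."""
    total = len(a) * len(a[0])
    return sum(x == y for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / total


def best_shifted_score(command: Array, training: Array, max_shift: int = 1) -> float:
    """Best score over the non-shifted comparison and all shifted comparisons."""
    return max(match_fraction(shift_array(command, s), training)
               for s in range(-max_shift, max_shift + 1))


# Hypothetical training array, and the same word with its features
# occurring one time slot earlier.
training = [[0, 0], [1, 0], [0, 1], [0, 0]]
command = [[1, 0], [0, 1], [0, 0], [0, 0]]

unshifted = match_fraction(command, training)
shifted = best_shifted_score(command, training)
```

Here the non-shifted comparison agrees on only half of the array members, whereas the comparison with the command array shifted one slot later agrees on all of them, which illustrates how shifted comparisons can compensate for small timing differences between utterances.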
It is an object of the present invention to provide an apparatus and method which result in improved recognition accuracy without undue increases in the cost or complexity of the recognition system.