The present invention relates to a speaker-independent speech recognizer, that is, to a machine capable of automatically recognizing and decoding speech from an unknown human speaker.
There are many applications where it would be highly desirable to have such a speaker-independent speech recognizer configured for a small vocabulary. For example, such a word recognizer would be extremely useful for automotive controls and video games. If even a very small control vocabulary were available, many non-critical automotive control functions which frequently require the driver to remove his eyes from the road could be done by direct voice inputs. Control of a car's radio or sound system could be usefully accomplished in this manner. The more sophisticated monitoring and computational functions available in some cars could also be more efficiently met with a voice query/voice output system. For example, if a driver could say "fuel", and have his dashboard reply verbally "seven gallons--refuel within 160 miles," this would be a very convenient feature in automotive control design. Similarly, an arcade video game could be designed to accept a limited set of verbal inputs such as "shoot", "pull up", "dive", "left", and "right". These applications, like many others, are extremely cost sensitive.
Thus, to provide a word recognizer for the large body of applications of this type, it is, first, not necessary that the recognizer be able to recognize a very large vocabulary. A small vocabulary, e.g. 6 to 20 words, can be extremely useful for many applications. Second, it is not necessary that a word recognizer for such applications be able to recognize words embedded in connected speech. Recognition of isolated words is quite sufficient for many simple command applications. Third, in many such applications, substitution errors are much more undesirable than rejection errors. For example, if a consumer is making purchases from a voice-selected vending machine, it is much more desirable to have the machine reply "input not understood" than to have the machine issue the wrong item.
Thus, it is an object of the present invention to provide a low cost word recognizer system which has a very low rate of substitution errors.
It is highly desirable to have such word recognizer systems operate with a low computational load. In many attractive applications, a modest error rate can easily be tolerated (e.g. 85% accurate recognition), but the cost requirements are stringent. Thus, it would be highly desirable to have a word recognizer which could be implemented with an ordinary cheap 8-bit microcomputer, together with cheap analog chips, but without requiring any high-speed chips or dedicated processors. Of course, it is always possible to do speaker-independent word recognition using a minicomputer or a mainframe, but such an implementation has no practical relevance to most of the desirable applications, since most of the applications are cost-sensitive.
Thus, it is an object of the present invention to provide a speaker-independent word recognizer which can be implemented with an ordinary 8-bit microcomputer, and does not require any high-speed or special-function processing chips.
It is a further object of the present invention to provide a speaker-independent word recognizer for a limited vocabulary which can be implemented using an 8-bit microcomputer and analog chips.
A further problem in speaker-independent recognition has been the preparation of an appropriate set of templates. Any one speaker, or any set of speakers with a common regional accent, may pronounce a certain word consistently with certain features which will not be replicated in the general population. That is, the reference templates for speaker-independent vocabulary must not specify any feature of a word which is not a strictly necessary feature. It is always possible to prepare a set of reference templates using empirical optimization, but this can be immensely time consuming, and also precludes the possibility of user-generation of reference templates in the field.
Thus, it is a further object of the present invention to provide a speech recognizer, for which the preparation of reference templates requires minimal empirical input from trained researchers.
It is a further object of the present invention to provide a method for preparing vocabulary templates for a speaker-independent word recognizer which can be implemented by minimally skilled users.
A further difficulty in preparing a speaker-independent word recognizer for cost-sensitive systems is memory requirements. That is, it is highly desirable in many systems where small microcomputers are to be used not to tie up too much program memory with the word recognition algorithm and templates. In particular, in many applications for portable devices (e.g. a calculator or watch which can receive spoken commands), the power requirements of unswitched memory impose a critical constraint. Since speech vocabulary templates must be saved during power-off periods, the amount of memory (CMOS or nonvolatile) required for speech reference templates is a very important cost parameter.
Thus, it is a further object of the present invention to provide a speaker-independent word recognizer which has absolutely minimal memory requirements for storing reference templates.
A further problem in any word recognizer, which is most particularly important in a speaker-independent word recognizer, is that speakers will typically vary, not only in their average rate of speech, but in their timing of the syllables within a given word. Since this information is not normally used by human listeners in making a word recognition decision, it will typically vary substantially among the speech patterns of different speakers. It is therefore necessary that a speaker-independent word recognizer be insensitive to a reasonable range of variation in the average rate and localized timing of human speech.
It is therefore a further object of the present invention to provide a speaker-independent word recognizer which is reasonably insensitive both to average rate and to localized variations in timing of human speech.
It is a further object of the present invention to provide a speech recognition system which is reasonably insensitive both to average rate and to localized variations in timing of human speech, which can be implemented using a simple microcomputer with no expensive custom parts required.
A further characteristic which it would be desirable to implement in a speaker-independent word recognition system is the capability for vocabulary change. Thus, for example, in a calculator which can be addressed by spoken commands, it would be desirable to have the set of spoken commands be variable with different modules (for example), or to be user variable as user-customized software is loaded into the calculator.
However, to accomplish this, it is desirable that the reference template set preparation be based on reasonably simple exclusion algorithms, so that a reasonably unskilled user can prepare a new template set. It is also necessary that the template set be addressable, so that templates can be downloaded and substituted.
It should also be noted that the capability to change templates is sensitive to the memory space required for each template. That is, if the memory templates can be stored reasonably compactly, then a mask location can be used to indicate which subset of all possible stored templates corresponds to the currently active vocabulary. Thus, for example, in an automotive control system, a master vocabulary might contain only a set of words indicating various areas of control functions, such as "radio", "wipers", "engine", "computer", etc. After any one of these function areas has been selected, a new localized set of reference templates would then be used for each particular function area. Each localized set of reference templates would have to include one command to return to the master template set, but otherwise could be fully customized. Thus, a localized set of commands for radio control could include such commands as "FM", "AM", "higher", "lower", "frequency", "volume", etc.
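The mask-location idea described above can be illustrated with a minimal sketch, written here in Python for readability rather than for an 8-bit target. The template list, the word "return" as the command that leads back to the master set, and the function names are all hypothetical; the text does not specify how the mask location is laid out.

```python
# Hypothetical store of all templates held in memory (names are illustrative only).
TEMPLATE_NAMES = ["radio", "wipers", "engine", "computer",
                  "FM", "AM", "higher", "lower", "return"]


def vocabulary_mask(active_words):
    """Build a bit mask with one bit per stored template (assumed layout)."""
    mask = 0
    for word in active_words:
        mask |= 1 << TEMPLATE_NAMES.index(word)
    return mask


def active_templates(mask):
    """List the templates enabled by the mask, in storage order."""
    return [name for i, name in enumerate(TEMPLATE_NAMES) if (mask >> i) & 1]


# Master vocabulary: function areas only.
MASTER = vocabulary_mask(["radio", "wipers", "engine", "computer"])
# Localized vocabulary for radio control, including a command to go back.
RADIO = vocabulary_mask(["FM", "AM", "higher", "lower", "return"])
```

Under these assumptions, switching vocabularies amounts to overwriting a single mask word, and only templates whose bit is set would be compared against the input.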
Thus, it is a further object of the present invention to provide a speaker-independent recognizer which functions on a limited vocabulary, but in which the vocabulary set can be easily changed.
It is a further object of the present invention to provide a speaker-independent recognizer which functions on a limited vocabulary, but in which the vocabulary set can be easily changed, which can be implemented using simple commercially available microcomputer parts.
A further desirable option in speaker-independent word recognizer systems is the capability to function in a speaker-dependent mode. That is, in such applications as automobile controls or speech-controlled calculators, it is necessary that the systems be shipped from the factory with a capability to immediately receive speech input. However, many such devices will typically be used only by a limited set of users. Thus, it is desirable to be able to adapt the template set of a speaker-independent device to be optimized for a particular user or group of users. Such re-optimization could be used to increase the vocabulary size or lower the error rate in service, but requires that the process of modifying templates be reasonably simple.
Thus, it is a further object of the present invention to provide a speaker-independent word recognizer which can be re-optimized easily to operate in a speaker-dependent mode, for a specific speaker or for a limited group of speakers.
It is a further object of the present invention to provide a speaker-independent word recognizer, which can be easily re-optimized to operate in a speaker-dependent mode for a specific speaker or for a limited group of speakers, and which can be economically configured using a simple microcomputer and simple analog parts.
Speaker-independent word recognition is performed, based on a small acoustically distinct vocabulary, with minimal hardware requirements. After a simple preconditioning filter, the zero crossing intervals of the input speech are measured and sorted by duration, to provide a rough measure of the frequency distribution within each input frame. The distribution of zero crossing intervals is transformed into a binary feature vector, which is compared with each reference template using a modified Hamming distance measure. A dynamic time warping algorithm is used to permit recognition of various speaker rates, and to economize on the reference template storage requirements. A mask vector for each reference template is used to ignore insignificant (speaker-dependent) features of the words detected.
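As an illustration of the front end summarized above, the following Python sketch measures zero crossing intervals in one frame, sorts them into duration bins, and reduces the resulting distribution to a binary feature vector. The bin edges, the count threshold, and the rule "set a bit when a bin's count reaches the threshold" are assumptions made for the example; the summary does not state how the distribution is binarized.

```python
def zero_crossing_intervals(samples):
    """Return the number of samples between successive sign changes."""
    intervals = []
    last_crossing = None
    for i in range(1, len(samples)):
        if (samples[i - 1] < 0) != (samples[i] < 0):
            if last_crossing is not None:
                intervals.append(i - last_crossing)
            last_crossing = i
    return intervals


def binary_feature_vector(intervals, bin_edges, threshold):
    """Histogram intervals by duration, then set one bit per bin.

    bin_edges and threshold are hypothetical tuning values.
    """
    counts = [0] * (len(bin_edges) - 1)
    for duration in intervals:
        for b in range(len(counts)):
            if bin_edges[b] <= duration < bin_edges[b + 1]:
                counts[b] += 1
                break
    return [1 if c >= threshold else 0 for c in counts]


# A frame dominated by short intervals (high-frequency energy).
frame = [1, 1, -1, -1, 1, 1, -1, -1, 1, 1]
vector = binary_feature_vector(zero_crossing_intervals(frame),
                               bin_edges=[1, 3, 6, 12], threshold=2)
```

Short intervals correspond to high-frequency content and long intervals to low-frequency content, which is why the sorted interval counts serve as a rough frequency distribution for the frame.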
To achieve these and other objects of the invention, the present invention comprises:
A word recognizer, comprising:
input means for receiving an analog speech input signal;
a signal conditioner connected to said input means, said signal conditioner measuring said input signal to provide a sequence of feature vectors at predetermined frame intervals;
distance measurement means, connected to said signal conditioner, for comparing each said feature vector with each of a plurality of reference vectors, said reference vectors being organized in sequences, each said reference vector sequence corresponding to a word which can be recognized, to provide a distance measure with respect to each of said reference vectors for each successive one of said feature vectors;
means for recognizing words in accordance with said distance measures between each said sequence of said reference vectors and successive ones in said sequence of feature vectors;
each said sequence of reference vectors having a mask vector associated therewith, said distance measurement means ignoring elements of each said reference vector which are designated as "don't care" bits by said corresponding mask vector.
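The masked comparison recited above can be sketched as follows. This is only one plausible reading of a Hamming distance that ignores "don't care" bits: feature and reference vectors are packed into integers, and a mask bit of 1 is assumed to mean "significant" while 0 means "don't care". The polarity of the mask and the function name are assumptions, not taken from the text.

```python
def masked_hamming(feature, reference, mask):
    """Count differing bits, skipping positions the mask marks as don't care.

    Assumed convention: mask bit 1 = compare this bit, 0 = ignore it.
    """
    return bin((feature ^ reference) & mask).count("1")


feature = 0b1011
reference = 0b0010
full = masked_hamming(feature, reference, 0b1111)    # every bit compared
masked = masked_hamming(feature, reference, 0b0111)  # top bit ignored
```

With the top bit masked out, a feature that varies from speaker to speaker in that position no longer contributes to the distance.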
According to a further embodiment of the present invention, the present invention comprises:
A method for recognizing speech, comprising the steps of:
receiving an analog input signal corresponding to speech;
conditioning and measuring said input signal to provide a sequence of feature vectors at predetermined frame intervals;
comparing each said feature vector with each of a plurality of reference vectors, said reference vectors being organized in sequences, each said reference vector sequence corresponding to a word which can be recognized, to provide a distance measure with respect to each of said reference vectors for each successive one of said feature vectors; and
recognizing words in accordance with said distance measures between each said sequence of said reference vectors and successive ones in said sequence of feature vectors;
each said sequence of reference vectors having a mask vector stored therewith, said distance measure ignoring elements of each said reference vector which are designated as "don't care" bits by said mask vector.
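The recognizing step can be illustrated with a textbook dynamic time warping recurrence over masked bit distances, combined with a rejection threshold so that a poor match is rejected rather than substituted. This is a generic sketch, not the specific warping algorithm of the invention; the step pattern, the threshold `reject_above`, the mask polarity (1 = significant bit), and the toy templates are all assumptions.

```python
def masked_hamming(a, b, mask):
    """Differing bits between a and b, ignoring bits where mask is 0."""
    return bin((a ^ b) & mask).count("1")


def dtw_distance(features, template, mask):
    """Accumulated distance along the best time alignment (generic DTW)."""
    inf = float("inf")
    n, m = len(features), len(template)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = masked_hamming(features[i - 1], template[j - 1], mask)
            cost[i][j] = d + min(cost[i - 1][j],      # input frame repeats a template frame
                                 cost[i][j - 1],      # template frame skipped ahead
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(features, vocabulary, reject_above):
    """Return the closest word, or None when no template is close enough."""
    best_word, best = None, float("inf")
    for word, (template, mask) in vocabulary.items():
        d = dtw_distance(features, template, mask)
        if d < best:
            best_word, best = word, d
    return best_word if best <= reject_above else None


# Toy two-frame templates, each paired with its mask vector.
VOCABULARY = {"left": ([0b1100, 0b0011], 0b1111),
              "right": ([0b0101, 0b1010], 0b1111)}
slow_left = [0b1100, 0b1100, 0b0011]  # first frame held longer
```

In this toy case `recognize(slow_left, VOCABULARY, reject_above=2)` selects "left" despite the stretched first frame, while an input equally far from both templates falls above the threshold and is rejected.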