The present invention generally relates to a word recognition method and system for recognizing acoustically distinct voice commands of simple words which is speaker-independent, wherein the operator is not required to train the word recognition system. More particularly, the present invention is directed to a word recognition method and system, wherein a limited number of spoken words may be identified by comparison of feature vectors forming acoustic descriptions thereof and based upon the zero-crossing rate and energy measurements of an input analog speech signal with reference feature vectors included as components of a plurality of reference templates respectively representative of the limited number of words contained within the memory storage of a microprocessor or microcomputer. Generally, the word recognition method and system disclosed herein is of the type disclosed and claimed in copending U.S. application Ser. No. 484,730 filed Apr. 13, 1983 by Rajasekaran et al, now U.S. Pat. No. 4,712,242 issued Dec. 8, 1987; and copending U.S. application Ser. No. 484,820 filed Apr. 13, 1983 by Rajasekaran et al.
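By way of illustration only, the comparison of zero-crossing-rate and energy feature vectors against stored reference templates might be sketched as follows. This is a minimal sketch and not the method of the referenced applications: the function names, the fixed frame length, the use of non-overlapping frames, and the Euclidean distance measure are all assumptions introduced here for the example.

```python
import math

def frame_features(samples, frame_len):
    """Return one (zero_crossing_rate, energy) pair per full frame.

    Assumption: fixed-length, non-overlapping frames of digitized speech.
    """
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Count sign changes between adjacent samples in the frame.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        zcr = crossings / (frame_len - 1)
        # Mean squared amplitude serves as the energy measurement.
        energy = sum(x * x for x in frame) / frame_len
        features.append((zcr, energy))
    return features

def distance(features, template):
    """Sum of per-frame Euclidean distances (hypothetical metric).

    zip() stops at the shorter sequence, so equal lengths are assumed.
    """
    return sum(math.dist(f, t) for f, t in zip(features, template))

def recognize(features, templates):
    """Return the vocabulary word whose reference template is closest."""
    return min(templates, key=lambda word: distance(features, templates[word]))
```

In such a sketch, `templates` would be a small dictionary mapping each vocabulary word to its list of reference feature vectors, consistent with the limited vocabulary contemplated herein.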
Many highly desirable applications exist where a speaker-independent speech recognition system limited in its recognition capability to a vocabulary of a small number of words could be extremely useful. For example, such a word recognition system could perform a worthwhile function in certain toys, games and other low-end consumer products. Automotive controls are another area in which such a word recognition system could have a desirable impact. In the latter respect, many non-critical automotive control functions which normally require the driver to frequently remove his eyes from the road over which the vehicle is traveling could be accomplished by direct voice inputs by the driver. Thus, a car radio or sound system could be turned "on" and "off" in this manner through simple voice inputs by the driver. More sophisticated monitoring and computational functions as available in some cars could also be accomplished by a word recognition system as incorporated into an electronic device having speech synthesis capability, for example. In this respect, the driver of the car could say "fuel" (recognizable as a key word within the limited vocabulary of the voice recognition system), which would elicit the audible reply from the dashboard as synthesized speech "seven gallons--refuel within 160 miles".
A word recognition system based upon a vocabulary of a limited number of words could also be incorporated into the operation of a video game, wherein the video game would be designed to accept a limited number of verbal inputs, such as "shoot", "pull up", "dive", "left", and "right" to which the characters or objects in the video game would respond during the performance of the video game, in lieu of hand controls or in addition thereto.
The use of a word recognition system in applications of the type described hereinabove renders it unnecessary to equip the word recognition system with sizable memory storage to accommodate a large vocabulary of words. A small vocabulary of words, e.g. 2 to 20 words, if recognizable from virtually any human speaker by the word recognition system can be employed in a highly practical manner for effecting desired system responses as based upon such word recognition. To this end, a word recognition system for such applications dealing with a limited number of words is not required to recognize a word embedded in connected speech, since the recognition of isolated words as simple commands spoken by an operator of the word recognition system is adequate to effect the operation of functional components associated with the word recognition system.
It would further be desirable to combine a word recognition system having a limited vocabulary of words with a speech capability, such as via speech synthesis, wherein simple speaker-independent recognition of a small number of acoustically distinct words is performed in conjunction with speech synthesis from a common memory, preferably on a single semiconductor chip, such as a microprocessor or a microcomputer. In this instance, the word recognition system is capable of operating with a low computational load and can tolerate a modest rate (e.g. 85% accurate word recognition) of acceptable responses without excessive manufacturing costs. Thus, it would be highly desirable to provide a word recognition system as implemented with a four-bit or eight-bit microprocessor or microcomputer, together with relatively inexpensive analog circuitry, wherein the microprocessor or microcomputer is provided with an adequate on-board random access memory, i.e. RAM. Such an implementation would not require high speed integrated circuit chips or dedicated signal processors. Of course, a minicomputer or a main frame computer could be employed to do speaker-independent word recognition, but the expense of such an endeavor has no practical relevance to the types of applications as described herein which are cost-sensitive.
Speaker-independent word recognition by its very nature presents problems in determining the speech data content of an appropriate set of reference templates representative of the respective words. In this respect, different speakers with different regional accents must be accommodated within the general identifying characteristics of the word acoustic description as defined by the reference templates of the respective words. It may often happen that one speaker or a set of speakers with a common regional accent could consistently pronounce a certain word with certain sound characteristics not duplicated by speakers grouped in the general population category. Thus, the reference templates representative of the words comprising the limited vocabulary in a speaker-independent word recognition system should not specify any feature of a word which is not a strictly necessary feature. While it is always presumably possible to prepare a set of reference templates representative of the limited number of words included in the vocabulary of a word recognition system via empirical optimization, such a procedure is extremely time-consuming and is probably prohibitive of the generation of such reference templates by a user in the field.
The cost-sensitive nature of a word recognition system as envisioned for the types of applications described herein has particular relevance to memory requirements. Thus, in many systems where small microcomputers are to be used, the amount of program memory which might be allocated to word recognition techniques and reference templates is generally restricted, because it is not desirable to tie up too much program memory with the word recognition function. In particular, in many applications for portable devices (e.g. a calculator or watch which can receive spoken commands), a critical constraint may be imposed by the power requirements of memory storage. In this respect, the reference templates representative of words and comprising the limited vocabulary of such devices must be saved during power-off periods. Thus, the amount of memory (CMOS or non-volatile) required for reference templates representative of words in such portable devices is an important cost consideration.
A further problem to be dealt with by any word recognition system which is speaker-independent has to do with the variance between individual speakers using the word recognition system, not only in their average rate of speech, but in their timing of syllables within a given word. Since such information is not normally used by human listeners in making a word recognition decision, it will typically vary substantially among the speech patterns of different speakers.