Current automatic speech recognition (ASR) systems fall into two general functional categories. The first category is designated as connected speech recognition (CSR) systems and the second as word spotting systems. The function of a CSR system is to determine which of a closed set of valid phrases has been spoken, assuming that the input speech is one of these phrases. Word spotting systems, on the other hand, assume the input signal to be a sequence of random sounds interspersed with occasional vocabulary words or keywords. A word spotter detects the occurrence of these keywords.
The recognition methods currently employed in CSR and word spotting systems frequently cause word or phrase utterances to be reported when they were not actually spoken. In the prior art there are essentially two main methods of word spotting. The first is referred to as the keyword scoring (KS) method. The principle of this method was developed and described by J. F. Bridle in an article entitled "An Efficient Elastic-Template Method for Detecting Given Words in Running Speech", published by the British Acoustical Society in the spring meeting, pages 1-4, April 1973. That article is incorporated herein by reference and discusses the derivation of elastic templates from a parametric representation of spoken examples of the keywords to be detected.
Briefly, keyword templates are "dragged across" the input speech, producing match scores at every input frame. Each match score measures the distance or dissimilarity between the keyword template and the input speech ending at that frame. The keyword with the lowest match score is hypothesized as having been spoken. The hypothesis is accepted or rejected by comparing the match score with a threshold value. The accuracy of the KS method is improved by a technique called "bias removal", which makes the threshold value a function of the keyword and the speaker.
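The keyword scoring procedure just described can be illustrated by the following minimal sketch. It is not the method of the Bridle article; it assumes that frames are fixed-length feature vectors, that the local distance is Euclidean, and that the elastic match is a simple dynamic time warping with a free starting frame. The names match_scores and spot_keywords, and the per-keyword threshold table, are hypothetical and introduced only for illustration.

```python
import numpy as np

def match_scores(template, speech):
    """For every input frame t, return the lowest accumulated distance of any
    warping of `template` onto a stretch of `speech` that ends at frame t.
    Assumption: Euclidean local distance, simple three-way DTW recursion."""
    n_t, n_s = len(template), len(speech)
    # local[i, t]: distance between template frame i and input frame t
    local = np.linalg.norm(template[:, None, :] - speech[None, :, :], axis=2)
    acc = np.full((n_t, n_s), np.inf)
    acc[0, :] = local[0, :]  # the match may begin at any input frame
    for i in range(1, n_t):
        for t in range(n_s):
            prev = [acc[i - 1, t]]
            if t > 0:
                prev += [acc[i - 1, t - 1], acc[i, t - 1]]
            acc[i, t] = local[i, t] + min(prev)
    # normalize by template length so keywords of different length are comparable
    return acc[-1, :] / n_t

def spot_keywords(templates, thresholds, speech):
    """At each frame, hypothesize the keyword with the lowest match score and
    accept it only if the score does not exceed that keyword's threshold.
    `thresholds` is a hypothetical per-keyword table (a stand-in for the
    keyword- and speaker-dependent thresholds mentioned in the text)."""
    scores = {name: match_scores(tpl, speech) for name, tpl in templates.items()}
    detections = []
    for t in range(len(speech)):
        best = min(scores, key=lambda name: scores[name][t])
        if scores[best][t] <= thresholds[best]:
            detections.append((t, best))
    return detections

# Toy usage: a three-frame keyword embedded in unrelated frames.
template = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
speech = np.array([[5.0, 5.0], [5.0, 5.0], [0.0, 0.0],
                   [1.0, 1.0], [2.0, 2.0], [5.0, 5.0]])
detections = spot_keywords({"yes": template}, {"yes": 0.1}, speech)
```

Because only keyword templates take part in the comparison, the decision in this sketch rests entirely on the absolute match score and its threshold.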
The second technique can generally be defined as the CSR method because it is implemented using a modified CSR algorithm. This method is described in U.S. patent application Ser. No. 655,958, entitled "Keyword Recognition System and Method Using Template-Concatenation Model", filed on Sept. 28, 1984 for A. L. Higgins et al. 1-1-1 and assigned to the assignee herein. That application describes a CSR method which uses both "keyword templates" and "filler templates". The technique finds the concatenation or string of templates that most closely matches the incoming speech without making any distinction between "keyword templates" and "filler templates". The system then reports the occurrence of a keyword whenever the template for that keyword appears in the best matching template string. The modification to the CSR algorithm involves a concatenation penalty that biases the system in favor of longer templates, that is, the "keyword templates". Essentially, that system employs a method that detects the occurrence of keywords in continuously spoken speech and evaluates both the keyword hypothesis and the alternative hypothesis that the observed speech is not a keyword.
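The template-concatenation idea can be sketched as follows. This is a simplified illustration and not the algorithm of the referenced application: it assumes each template is matched against a stretch of input of exactly its own length (no time warping), that the local distance is Euclidean, and that the concatenation penalty is a fixed cost added once per template in the string. The names best_template_string and reported_keywords are hypothetical.

```python
import numpy as np

def best_template_string(templates, speech, penalty):
    """Find the string of templates that tiles `speech` at the lowest total
    cost, where each template used adds a fixed `penalty`. Keyword and filler
    templates are treated identically. Assumption: no time warping."""
    n = len(speech)
    cost = np.full(n + 1, np.inf)
    cost[0] = 0.0
    back = [None] * (n + 1)  # back[end] = (start, template name)
    for end in range(1, n + 1):
        for name, tpl in templates.items():
            start = end - len(tpl)
            if start < 0 or not np.isfinite(cost[start]):
                continue
            dist = np.linalg.norm(tpl - speech[start:end], axis=1).sum()
            total = cost[start] + dist + penalty
            if total < cost[end]:
                cost[end] = total
                back[end] = (start, name)
    if back[n] is None:
        raise ValueError("no concatenation of templates covers the input")
    # trace back from the last frame to recover the best string
    string, end = [], n
    while end > 0:
        start, name = back[end]
        string.append((name, start, end))
        end = start
    return string[::-1]

def reported_keywords(string, keyword_names):
    """Report a keyword whenever its template appears in the best string."""
    return [seg for seg in string if seg[0] in keyword_names]

# Toy usage: one three-frame keyword template and two one-frame fillers.
templates = {
    "kw": np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]),
    "filler_hi": np.array([[5.0, 5.0]]),
    "filler_lo": np.array([[1.0, 1.0]]),
}
speech = np.array([[5.0, 5.0], [0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [5.0, 5.0]])
string = best_template_string(templates, speech, penalty=1.0)
hits = reported_keywords(string, {"kw"})
```

Since the penalty is charged per template, covering a stretch of speech with one long template costs fewer penalties than covering it with several short ones, which is how the bias toward the longer keyword templates arises in this sketch.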
A general language model is used to evaluate the latter hypothesis. According to the model described in the application, arbitrary utterances of the language are approximated by concatenations of a set of filler templates. The system allows for automatic detection of the occurrence of keywords in unrestricted natural speech. The system can be trained by a particular speaker or can function independently of the speaker.
With regard to the above-described techniques, the primary disadvantage of the KS method is that it uses templates only for the keywords. Thus it views all speech from the perspective of keyword templates. A human listener, on the other hand, correctly classifies highly distorted or atypical speech, evidently using models of other speech sounds to tell whether an unknown sound that is far from the target sound is actually closer to other sounds. Because of this limitation, the KS method is highly sensitive to channel conditions, noise and the speaker's voice. The CSR method briefly described above takes a step towards alleviating this problem by using filler templates, which are intended to model all speech sounds. For a keyword to appear in the best matching template string, so as to enable it to be detected and reported, the incoming speech must be closer to the keyword template than to any concatenation of filler templates.
Thus keyword matches are judged in relation to matches to other speech sounds. The main shortcoming of the CSR method is that it does not treat keyword templates separately from filler templates in the matching process. In terms of hypothesis testing, it does not explicitly separate the keyword hypothesis from the null hypothesis. A specific problem, therefore, is that keyword matches are not compared with filler template matches over exactly the same intervals. This diminishes the statistical power and therefore the performance of the method. A second shortcoming is that the method does not allow the operating point, or tradeoff between false acceptance and false rejection errors, to be controlled separately for each keyword.
It is therefore an object of the present invention to provide an apparatus and a method which maintain the distinction between keyword templates and filler templates in both the matching and decision procedures.
It is a further object to provide a system and method which compare keyword matches with filler matches over exactly the same intervals of the input speech, thus eliminating the above-noted problems associated with prior art devices.
It is a further object of the present invention to provide separate parametric control of the operating point for each keyword thus providing greater accuracy and control of an automatic speech recognition system.