The present invention relates to a speech recognition method and more particularly to a method for recognizing, in real time, one or more keywords in a continuous audio signal.
Various speech recognition systems have been proposed heretofore to recognize isolated utterances by comparing an unknown isolated audio signal, suitably processed, with one or more previously prepared representations of the known keywords. In this context, "keyword" is used to mean a connected group of phonemes and sounds and may be, for example, a portion of a syllable, a word, a phrase, etc. While many systems have met with limited success, one system, in particular, has been employed successfully, in commercial applications, to recognize isolated keywords. That system operates substantially in accordance with the method described in U.S. Pat. No. 4,038,503, granted July 26, 1977, assigned to the assignee of this application, and provides a successful method for recognizing one of a restricted vocabulary of keywords provided that the boundaries of the unknown audio signal data are either silence or background noise as measured by the recognition system. That system relies upon the presumption that the interval during which the unknown audio signal occurs is well defined and contains a single utterance.
In a continuous audio signal, such as continuous conversational speech (of which the isolated word is one aspect), wherein the keyword boundaries are not known or marked a priori, several methods have been devised to segment the incoming audio data, that is, to determine the boundaries of linguistic units, such as phonemes, syllables, words, sentences, etc., prior to initiation of a keyword recognition process. These prior continuous speech systems, however, have achieved only limited success, in part because a satisfactory segmenting process has not been found. Other substantial problems still exist; for example, only limited vocabularies can be consistently recognized with a low false alarm rate, the recognition accuracy is highly sensitive to the differences between voice characteristics of different talkers, and the systems are highly sensitive to distortion in the audio signals being analyzed, such as typically occurs, for example, in audio signals transmitted over ordinary telephone communications apparatus. Thus, even though continuous speech is easily discernible and understood by the human observer, machine recognition of even a limited vocabulary of keywords in a continuous audio signal has yet to achieve major success.
A speech analysis system which is effective in recognizing keywords in continuous speech is described and claimed in copending application Ser. No. 901,001, filed Apr. 27, 1978, entitled Continuous Speech Recognition Method. That system employs a method in which each keyword is characterized by a template consisting of an ordered sequence of one or more target patterns, and each target pattern represents a plurality of short-term keyword power spectra spaced apart in time. Together, the target patterns cover all important acoustical events in the keyword. The invention claimed in U.S. Ser. No. 901,001 features a frequency analysis method comprising the steps of repeatedly evaluating a set of parameters determining a short-term power spectrum of the audio signal at each of a plurality of equal-duration sampling intervals, thereby generating a continuous time-ordered sequence of short-term audio power spectrum frames; and repeatedly selecting, from the sequence of short-term power spectrum frames, one first frame and at least one later-occurring frame to form a multi-frame spectral pattern. The method further features the steps of comparing, preferably using a likelihood statistic, each thus formed multi-frame pattern with each first target pattern of each keyword template; and deciding whether each multi-frame pattern corresponds to one of the first target patterns of the keyword templates. For each multi-frame pattern which, according to the deciding step, corresponds to a first target pattern of a potential candidate keyword, the method features selecting later-occurring frames to form later-occurring multi-frame patterns.
The method then features the steps of deciding in a similar manner whether the later multi-frame patterns correspond respectively to successive target patterns of the potential candidate keyword, and identifying a candidate keyword when a selected sequence of multi-frame patterns corresponds respectively to the target patterns of a keyword template, designated the selected keyword template.
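The sequence of steps summarized above (short-term power spectrum frames, multi-frame patterns, and successive comparison against the target patterns of a keyword template) can be illustrated with a minimal sketch. It is not the implementation of application Ser. No. 901,001: the function names, the naive discrete Fourier transform used to obtain each frame, the squared-distance threshold standing in for the preferred likelihood statistic, and the `max_gap` search window are all assumptions made for illustration.

```python
import cmath
import math


def power_spectrum_frames(samples, frame_len, n_bins):
    """One short-term power spectrum per equal-duration interval (naive DFT, assumed)."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        window = samples[start:start + frame_len]
        spectrum = []
        for k in range(n_bins):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(window))
            spectrum.append(abs(coeff) ** 2 / frame_len)
        frames.append(spectrum)
    return frames


def multi_frame_pattern(frames, first, spacing, count):
    """Join one first frame with count-1 later frames spaced `spacing` frames apart."""
    indices = [first + i * spacing for i in range(count)]
    if indices[-1] >= len(frames):
        return None  # not enough later frames yet
    return [value for i in indices for value in frames[i]]


def distance(pattern, target):
    """Squared distance; a stand-in (assumption) for a likelihood statistic."""
    return sum((p - t) ** 2 for p, t in zip(pattern, target))


def matches_template(frames, template, start, spacing, count, threshold, max_gap):
    """Check whether the target patterns of `template` occur in order from `start`."""
    pos = start
    for t_index, target in enumerate(template):
        # The first target must match at `start`; later targets may occur
        # up to `max_gap` frames after the previously matched pattern.
        candidates = [pos] if t_index == 0 else range(pos + 1, pos + 1 + max_gap)
        for cand in candidates:
            pattern = multi_frame_pattern(frames, cand, spacing, count)
            if pattern is not None and distance(pattern, target) <= threshold:
                pos = cand
                break
        else:
            return False  # no pattern corresponded to this target
    return True


def spot_keyword(frames, template, spacing=1, count=2, threshold=0.1, max_gap=3):
    """Return the first frame index at which the keyword template matches, else None."""
    for start in range(len(frames)):
        if matches_template(frames, template, start, spacing, count, threshold, max_gap):
            return start
    return None


# Hypothetical two-bin frames and a template with two target patterns.
example_frames = [[0, 0], [1, 0], [0, 1], [5, 5], [2, 0], [0, 2], [0, 0]]
example_template = [[1, 0, 0, 1], [2, 0, 0, 2]]
keyword_start = spot_keyword(example_frames, example_template)
```

In this sketch a candidate keyword is reported only when every target pattern of the template, in order, finds a corresponding multi-frame pattern; a failure at any target abandons that starting frame and the search moves on to the next one.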
Although the method claimed in copending application Ser. No. 901,001 is significantly more effective in recognizing keywords in continuous speech than the prior art systems, it still falls short of the desired goals.
A principal object of the present invention is therefore a speech recognition method having improved effectiveness in recognizing keywords in a continuous, unmarked audio signal. Other objects of the invention are a method which is relatively insensitive to phase and amplitude distortion of the unknown audio input signal data, a method which is relatively insensitive to variations in the articulation rate of the unknown audio input signals, a method which will respond equally well to different speakers and hence different voice characteristics, a method which is reliable, and a method which will operate in real time. Yet another object of the invention is a method which reduces the dimensionality of the unknown input signal.