The present invention relates to a speech recognition method and apparatus, and more particularly to a method of and apparatus for recognizing, in real time, keywords in a continuous audio signal.
Various speech recognition systems have been proposed heretofore to recognize isolated utterances by comparing an unknown isolated audio signal, suitably processed, with one or more previously prepared representations of known keywords. In this context, "keyword" is used to mean a connected group of phonemes and sounds and may be, for example, a portion of a syllable, a word, a word string, a phrase, etc. While many systems have met with limited success, one system in particular has been employed successfully, in commercial applications, to recognize isolated keywords. This system operates substantially in accordance with the method described in U.S. Pat. No. 4,038,503, granted July 26, 1977, assigned to the assignee of this application, and provides a successful method for recognizing one of a restricted vocabulary of keywords, provided that the boundaries of the unknown audio signal data are either silence or background noise as measured by the recognition system. That system relies upon the presumption that the interval during which the unknown audio signal occurs is well defined and contains a single keyword utterance.
In a continuous audio signal, such as continuous conversational speech, wherein the keyword boundaries are not a priori known or marked, several methods have been devised to segment the incoming audio data, that is, to determine the boundaries of linguistic units, such as phonemes, syllables, words, sentences, etc., prior to initiation of a keyword recognition process. These prior continuous speech systems, however, have achieved only a limited success in part because a satisfactory segmenting process has not been found. Other substantial problems still exist: for example, only limited vocabularies can be consistently recognized with a low false alarm rate; the recognition accuracy is highly sensitive to the differences between voice characteristics of different talkers; and the systems are highly sensitive to distortion in the audio signals being analyzed, such as typically occurs, for example, in audio signals transmitted over ordinary telephone communications apparatus.
U.S. applications Ser. Nos. 901,001; 901,005; and 901,006, all filed Apr. 27, 1978, and now U.S. Pat. Nos. 4,227,176; 4,241,329; and 4,227,177, respectively, describe commercially acceptable and effective procedures for successfully recognizing, in real time, keywords in continuous speech. The general methods described in these patents are presently in commercial use and have been proved, both experimentally and in practical field testing, to provide high reliability and a low error rate in a speaker-independent environment. Nevertheless, even these systems, and the concept upon which they were developed, while at the forefront of present-day technology, have shortcomings in both false-alarm rate and speaker-independent performance.
The continuous speech recognition methods described in the above-identified U.S. patents are directed primarily to an "open vocabulary" environment wherein one of a plurality of keywords in continuous speech is recognized or spotted. An "open vocabulary" is one where not all of the incoming vocabulary is known to the apparatus. In a particular application, a continuous word string can be recognized wherein the result of the recognition process is the identity of each of the individual word elements of the continuous word string. A continuous word string in this context is a plurality of recognizable elements (a "closed vocabulary") which are bounded by silence. This is related, for example, to the commercial equipment noted above with respect to the isolated word application, in which the boundaries are a priori known. Here, however, the boundaries (silence) are unknown and must be determined by the recognition system itself. In addition, the elements being examined are no longer single word elements but a plurality of elements "strung" together to form the word string.
While various methods and apparatus have been suggested in the art for recognizing continuous speech, less attention has been focused upon automatic training of the apparatus to generate the parameters necessary for accurate speech recognition. Furthermore, the methods and apparatus for determining silence in earlier apparatus, and the use of grammatical syntax in such earlier apparatus, while generally sufficient for their needs, have left much room for improvement.
Therefore, a principal object of the present invention is a speech recognition method and apparatus having improved effectiveness in training the apparatus for generating new recognition patterns. Other objects of the invention are a method and apparatus which effectively recognize silence in unknown audio input signal data, which employ grammatical syntax in the recognition process, which will respond equally well to different speakers and hence different voice characteristics, which are reliable and have a lower false-alarm rate, and which will operate in real time.