Appendices 1, 2, and 3 have been submitted with the application for entry and availability in the application file, but for convenience, have not been submitted for publication. The appendices are available on microfiche. There are 15 microfiche and a total of 731 frames.
This application is related to U.S. application Ser. No. 308,891, for "Speech Recognition Method and Apparatus", filed Oct. 5, 1981, in the name of Stephen L. Moshier and assigned to the assignee of this application.
The present invention relates to a speech recognition method and apparatus, and more particularly to a method of and apparatus for recognizing, in real time, word strings in a continuous audio signal.
Various speech recognition systems have been proposed heretofore to recognize isolated utterances by comparing an unknown isolated audio signal, suitably processed, with one or more previously prepared representations of known keywords. In this context, "keyword" is used to mean a connected group of phonemes and sounds and may be, for example, a portion of a syllable, a word, a phrase, etc. While many systems have met with limited success, one system in particular has been employed successfully in commercial applications to recognize isolated keywords. This system operates substantially in accordance with the method described in U.S. Pat. No. 4,038,503, granted July 26, 1977, assigned to the assignee of this application, and provides a successful method for recognizing one of a restricted vocabulary of keywords, provided that the boundaries of the unknown audio signal data are either silence or background noise as measured by the recognition system. That system relies upon the presumption that the interval during which the unknown audio signal occurs is well defined and contains a single keyword utterance.
In a continuous audio signal, such as continuous conversational speech, wherein the keyword boundaries are not known or marked a priori, several methods have been devised to segment the incoming audio data, that is, to determine the boundaries of linguistic units, such as phonemes, syllables, words, sentences, etc., prior to initiation of a keyword recognition process. These prior continuous speech systems, however, have achieved only limited success, in part because a satisfactory segmenting process has not been found. Other substantial problems still exist: for example, only limited vocabularies can be consistently recognized with a low false-alarm rate; the recognition accuracy is highly sensitive to the differences between voice characteristics of different talkers; and the systems are highly sensitive to distortion in the audio signals being analyzed, such as typically occurs, for example, in audio signals transmitted over ordinary telephone communications apparatus.
U.S. applications Ser. Nos. 901,001; 901,005; and 901,006, all filed Apr. 27, 1978, and now U.S. Pat. Nos. 4,227,176; 4,241,329; and 4,227,177, respectively, describe commercially acceptable and effective procedures for successfully recognizing, in real time, keywords in continuous speech. The general methods described in these patents are presently in commercial use and have been proved, both experimentally and in practical field testing, to provide high reliability and a low error rate in a speaker-independent environment. Nevertheless, even these systems, and the concept upon which they were developed, while at the forefront of present-day technology, have shortcomings in both false-alarm rate and speaker-independent performance.
The continuous speech recognition methods described in the above-identified U.S. patents are directed primarily to recognizing or spotting one of a plurality of keywords in continuous speech. In other applications, a continuous word string can be recognized, wherein the result of the recognition process is the identity of each of the individual word elements of the continuous word string. A continuous word string in this context is a plurality of recognizable elements which are bounded by silence. This is related, for example, to the commercial equipment noted above with respect to the isolated word application, in which the boundaries are known a priori. Here, however, the boundaries (silence) are unknown and must be determined by the recognition system itself. In addition, the elements being examined are no longer keyword elements but a plurality of elements "strung" together to form the word string. Various methods and apparatus have been suggested in the art for recognizing continuous word strings. These apparatus and methods, however, again have various shortcomings, for example, in false-alarm rate, speaker-independent performance, and real-time operation.
Therefore, a principal object of the present invention is a speech recognition method and apparatus having improved effectiveness in recognizing continuous word strings in a continuous, unmarked audio signal. Other objects of the invention are a method and apparatus which are relatively insensitive to phase and amplitude distortion of the unknown audio input signal data, which are relatively insensitive to variations in the articulation rate of the unknown audio input signals, which will respond equally well to different speakers and hence different voice characteristics, which are reliable and have a lower false-alarm rate, and which will operate in real time.