It is generally recognized that speech recognition is the most difficult of the three generic kinds of problems in the speech processing area. Speech coding is by now the most widely commercialized type of speech processing, and some commercial equipment for speech synthesis also exists.
Nevertheless, the speech recognition problem has been intractable to a large degree. Most recognition systems have been restricted in their ability to recognize speech from different speakers or to recognize more than an extremely limited vocabulary in an extremely focused or task-directed environment.
It has also been widely recognized that it would be desirable to have a speech recognition system that was capable of continuous speech recognition.
In recent years several word-based continuous speech recognition systems have been built. One such system is described by S. E. Levinson and L. R. Rabiner, "A Task-Oriented Conversational Mode Speech Understanding System", Speech and Speaker Recognition, M. R. Schroeder, Ed., Karger, Basel, Switzerland, pp. 149-96, 1985. That system and other similar systems recently developed are word-based in the first instance. While these systems have all been successful in accurately recognizing speech in certain restricted ways, there is reason to believe that their use of words as the fundamental acoustic patterns precludes the possibility of relaxing the constraints under which they presently operate so that they can accept fluent discourse of many speakers over large vocabularies.
An often suggested alternative to the word-based approach is the so-called acoustic/phonetic method, in which relatively few short-duration phonetic units, out of which all words can be constructed, are defined by their measurable acoustic properties. Generally speaking, speech recognition based on this method should occur in three stages. First, the speech signal should be segmented into its constituent phonetic units, which are then classified on the basis of their measured acoustic features. Second, the phonetic units should be combined to form words, using in part a lexicon which describes all vocabulary words in terms of the selected phonetic units. Third, the words should be combined to form sentences in accordance with some specification of grammar.
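The three stages just described can be illustrated by the following minimal sketch. It is an illustration under stated assumptions and not the organization of any particular system: the lexicon, the word-pair grammar, the single scalar feature per phonetic unit, and all function names are hypothetical, and the signal is assumed to have been segmented already so that one feature value is supplied per phonetic unit.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical lexicon: each vocabulary word spelled as a sequence of phonetic units.
LEXICON: Dict[str, Tuple[str, ...]] = {
    "show": ("sh", "ow"),
    "me": ("m", "iy"),
    "rocks": ("r", "aa", "k", "s"),
}

# Hypothetical grammar: the set of permitted adjacent word pairs,
# with "<s>" and "</s>" marking sentence start and end.
GRAMMAR: Set[Tuple[str, str]] = {
    ("<s>", "show"), ("show", "me"), ("me", "rocks"), ("rocks", "</s>"),
}

# Hypothetical acoustic prototypes: one scalar feature value per phonetic unit.
PROTOTYPES: Dict[str, float] = {
    "sh": 0.9, "ow": 0.2, "m": 0.4, "iy": 0.1,
    "r": 0.5, "aa": 0.3, "k": 0.7, "s": 0.8,
}


def classify_segments(features: List[float]) -> List[str]:
    """Stage 1: label each pre-segmented feature with its nearest prototype."""
    return [min(PROTOTYPES, key=lambda u: abs(PROTOTYPES[u] - f)) for f in features]


def units_to_words(units: List[str]) -> Optional[List[str]]:
    """Stage 2: split the unit sequence into lexicon words (depth-first search)."""
    if not units:
        return []
    for word, spelling in LEXICON.items():
        n = len(spelling)
        if tuple(units[:n]) == spelling:
            rest = units_to_words(units[n:])
            if rest is not None:
                return [word] + rest
    return None  # no sequence of lexicon words spells these units


def is_grammatical(words: List[str]) -> bool:
    """Stage 3: accept only word sequences whose adjacent pairs are all permitted."""
    path = ["<s>"] + words + ["</s>"]
    return all(pair in GRAMMAR for pair in zip(path, path[1:]))


def recognize(features: List[float]) -> Optional[List[str]]:
    """Run the three stages in order; return None if any stage fails."""
    units = classify_segments(features)
    words = units_to_words(units)
    if words is None or not is_grammatical(words):
        return None
    return words
```

In this sketch each stage commits to a single answer before the next stage begins, so an error in classification cannot be corrected by the lexicon or the grammar. That strictly sequential arrangement is a simplification chosen for brevity.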
Several quite different embodiments of this basic methodology can be found in the literature, although a diligent search is needed to identify the less obvious forms that some of the above components take. See, for example, the article by W. A. Woods, "Motivation and Overview of SPEECHLIS: An Experimental Prototype for Speech Understanding Research", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-23, No. 1, February 1975, pp. 2-10. All such systems are very complex, and the basic components outlined above are always present, though sometimes in disguised form. Because these components are not fully capable with respect to all syntax and all vocabulary, such systems are very restricted in their use (e.g., to task-oriented applications such as retrieval of lunar rock sample information).
Accordingly, it is an object of this invention to provide an organization of such systems which is powerful in its ability to encompass all grammar and vocabulary and all speakers and is methodically organized so that it can be readily expanded.
It is a further object of this invention to provide an organization of such a system that handles durational variations in speech with facility and without the complexity of the so-called dynamic time-warping techniques.