The present invention relates to speech recognition systems and, more particularly, to a system for recognizing an utterance as one of a plurality of reference utterances, and to a method therefor.
In communication, data processing and control systems, it is often desirable to utilize speech as direct input for data, commands, or other information. Speech input arrangements may be utilized to record transactions, to record and request information, to control machine tools, or to permit a person to interact with data processing and control equipment without diverting attention from other activity. Because of the complex nature of speech, its considerable variability from speaker to speaker and variability even for a particular speaker, it is difficult to attain perfect recognition of speech segments.
One type of previously known speech recognition system converts an input speech signal into a sequence of phonetically based features. The derived features, generally obtained from a spectral analysis of speech segments, are compared to a stored set of reference features corresponding to the speech segment or word to be recognized. If an input speech segment meets prescribed recognition criteria, the segment is accepted as the reference speech segment. Otherwise, it is rejected. The reliability of the recognition system is thus highly dependent on the prescribed set of reference features and on the recognition criteria.
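The accept-or-reject decision described above can be illustrated with a minimal sketch. The Euclidean distance measure, the single threshold used as the recognition criterion, and the names `euclidean_distance` and `recognize` are assumptions made for illustration only; the prior art systems referred to do not necessarily use them.

```python
import math

def euclidean_distance(a, b):
    """Distance between two equal-length feature vectors (assumed measure)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(features, references, threshold):
    """Return the label of the closest reference, or None if rejected.

    `references` maps a word label to its stored reference feature vector.
    `threshold` stands in for the "prescribed recognition criteria"; it is
    a hypothetical single-number criterion.
    """
    best_label, best_dist = None, math.inf
    for label, ref in references.items():
        d = euclidean_distance(features, ref)
        if d < best_dist:
            best_label, best_dist = label, d
    # Accept only if the best match satisfies the criterion; otherwise reject.
    return best_label if best_dist <= threshold else None
```

In such a sketch, a looser threshold accepts more utterances at the cost of more false matches, which reflects the stated dependence of reliability on the reference features and the recognition criteria.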
Another type of speech recognition system, disclosed in the article "Minimum Prediction Residual Principle Applied to Speech Recognition," by Fumitada Itakura, IEEE Transactions on Acoustics, Speech, and Signal Processing, February 1975, pages 67-72, does not rely on a prescribed set of spectrally derived phonetic features. Instead, it obtains a sequence of vectors representative of the linear prediction characteristics of a speech signal and compares these vectors with a corresponding sequence of reference vectors representative of the linear prediction characteristics of a previous utterance of an identified speech segment or word. As is well known in the art, linear prediction characteristics include combinations of a large number of speech features and thus can provide improved recognition over arrangements in which only a limited number of selected spectrally derived phonetic features are used.
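A comparison of linear prediction characteristics for a single frame may be sketched as follows. The sketch uses the log ratio of prediction residual energies, a measure commonly associated with the minimum prediction residual principle; it is not reproduced from the cited article. The function names, the autocorrelation method of analysis, and the prediction order are assumptions, and the alignment of a whole sequence of frames against a reference sequence is omitted.

```python
import math

def autocorrelation(frame, order):
    """Autocorrelation values r[0..order] of one speech frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve for the LPC polynomial a = [1, a1, ..., ap] and residual energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # updated residual energy
    return a, err

def residual_energy(a, r):
    """Quadratic form a^T R a, with R the Toeplitz matrix built from r."""
    p = len(a)
    return sum(a[i] * a[j] * r[abs(i - j)] for i in range(p) for j in range(p))

def itakura_distance(ref_lpc, test_frame):
    """Log ratio of residual energies: reference predictor vs. optimal one."""
    order = len(ref_lpc) - 1
    r = autocorrelation(test_frame, order)
    test_lpc, _ = levinson_durbin(r, order)
    return math.log(residual_energy(ref_lpc, r) / residual_energy(test_lpc, r))
```

Because the predictor derived from the test frame minimizes the residual energy for that frame, the distance is zero when the reference predictor matches it and positive otherwise.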
The prior art systems mentioned above require an A-D converter to digitize the input speech signal, the digitized quantities being stored for subsequent processing by a digital computer or processor. The amount of storage required for the digitized quantities, while dependent upon the sampling rate, can be extremely large. Therefore, there exists a need for a speech recognition system which eliminates the plurality of spectral filters and the bulky and costly A-D converters, reduces the memory requirements of the prior art systems while maintaining a high degree of speech recognition capability, and is more readily implementable in VLSI technology.
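The dependence of storage on sampling rate noted above can be made concrete with a short calculation. The function name `storage_bytes` is hypothetical, and the sampling rate, sample width, and duration are to be supplied by the reader; no particular values are taken from the prior art systems.

```python
import math

def storage_bytes(sample_rate_hz, bits_per_sample, duration_s):
    """Bytes needed to hold the digitized samples of one utterance.

    All three arguments are assumed inputs: the A-D sampling rate,
    the width of each digitized quantity, and the utterance length.
    """
    total_bits = sample_rate_hz * bits_per_sample * duration_s
    return math.ceil(total_bits / 8)
```

The requirement grows linearly with each of the three quantities, so doubling the sampling rate doubles the storage needed for the same utterance.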