Our invention relates to speech recognition and more particularly to an arrangement for recognizing prescribed speech segments in continuous speech.
In communication, data processing and control systems, it is often desirable to utilize speech as direct input for data, commands, or other information. Speech input arrangements may be utilized to record transactions, to record and request telephone call information, to control machine tools, or to permit a person to interact with data processing and control equipment without diverting his attention from other activity. Because of the complex nature of speech, its considerable variability from speaker to speaker and variability even for a particular speaker, it is difficult to attain perfect recognition of speech segments.
One type of priorly known speech recognition system converts an input speech signal into a sequence of phonetically based features. The derived features, generally obtained from a spectral analysis of speech segments, are compared to a stored set of reference features corresponding to the speech segment or word to be recognized. If an input speech segment meets prescribed recognition criteria, the segment is accepted as the reference speech segment. Otherwise, it is rejected. The reliability of the recognition system is thus highly dependent on the prescribed set of reference features and on the recognition criteria. Where the set of reference features is obtained from the same speaker and the word to be recognized is spoken in isolation, the speech recognition system is relatively simple and may be highly accurate.
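By way of illustration only, the accept or reject decision described above may be sketched as follows. The sketch assumes that each speech segment has already been reduced to an equal-length sequence of numeric feature vectors, and it uses a mean Euclidean distance and a fixed threshold as the recognition criterion. The function names, the distance measure, and the threshold are hypothetical and are not taken from any particular prior system.

```python
import math


def feature_distance(input_frames, reference_frames):
    """Mean Euclidean distance between two equal-length sequences of
    feature vectors (assumed measure, for illustration only)."""
    if not input_frames or len(input_frames) != len(reference_frames):
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0.0
    for x, r in zip(input_frames, reference_frames):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(x, r)))
    return total / len(input_frames)


def accept_segment(input_frames, reference_frames, threshold):
    """Accept the input segment as the reference segment when the
    distance does not exceed the (hypothetical) threshold."""
    return feature_distance(input_frames, reference_frames) <= threshold
```

In such a scheme the outcome depends directly on the stored reference features and on the chosen threshold, which is the dependence noted above.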
Another type of speech recognition system, disclosed in the article "Minimum Prediction Residual Principle Applied to Speech Recognition," by Fumitada Itakura in the IEEE Transactions on Acoustics, Speech, and Signal Processing, February 1975, pages 67-72, does not rely on a prescribed set of spectrally derived phonetic features. Instead, it obtains a sequence of vectors representative of the linear prediction characteristics of a speech signal and compares these linear prediction characteristic vectors with a corresponding sequence of reference vectors representative of the linear prediction characteristics of a previous utterance of an identified speech segment or word. As is well known in the art, linear prediction characteristics include combinations of a large number of speech features and thus can provide an improved recognition over arrangements in which only a limited number of selected spectrally derived phonetic features are used.
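By way of illustration only, a comparison of linear prediction characteristics for a single frame may be sketched as follows. The sketch derives prediction coefficients by the autocorrelation method with the Levinson-Durbin recursion and compares an input frame with a reference coefficient vector by the logarithm of the ratio of prediction residual energies, a measure commonly associated with the minimum prediction residual principle. The function names, the predictor order, and the frame format are assumptions made for illustration; time alignment between the input sequence and the reference sequence is omitted.

```python
import math


def autocorrelation(frame, order):
    """Autocorrelation values r[0..order] of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve for prediction-error filter coefficients a = [1, a1..ap]
    and the final residual energy from autocorrelation values r."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def residual_energy(a, r):
    """Quadratic form a^T R a, with R the Toeplitz matrix built from r."""
    p = len(a)
    return sum(a[i] * a[j] * r[abs(i - j)]
               for i in range(p) for j in range(p))


def lpc_distance(input_frame, reference_lpc):
    """Log ratio of the residual obtained with the reference predictor
    to the residual obtained with the input frame's own predictor."""
    order = len(reference_lpc) - 1
    r = autocorrelation(input_frame, order)
    a_in, _ = levinson_durbin(r, order)
    return math.log(residual_energy(reference_lpc, r) /
                    residual_energy(a_in, r))
```

Because the input frame's own predictor minimizes the residual energy for that frame, the measure is zero when the reference predictor matches it and grows as the two predictors diverge.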
The linear prediction recognition system of Itakura requires that the same speaker provide the reference speech segment as well as the speech segment to be identified and also requires that the speech segment be spoken in isolation. In continuous speech, however, the prediction characteristics of each segment are dependent on the preceding and following speech segments. Therefore, the successful recognition of an identified speech segment or word in a continuous speech sequence is limited. The technique of Itakura further requires the use of the prediction characteristics of the entire speech segment for recognition. It has been found, however, that the use of the unvoiced region prediction parameters for speech segment recognition severely limits recognition accuracy.
It is an object of the invention to provide an improved speech recognition arrangement for recognizing speech segments in continuous speech on the basis of linear prediction characteristics of prescribed regions of a continuous speech signal.