In communication, data processing and similar systems, it is often advantageous to simplify interfacing between system users and processing equipment by means of audio facilities. Speech input and synthesized voice response may then be used for inquiries, commands and for exchange of data and other information. Speech based interface apparatus also permits communication with processing apparatus from remote locations without hand-operated terminals and allows a user to perform other functions at the same time.
Speech patterns, as is well known in the art, are relatively complex and exhibit a high degree of variability among speakers. These factors have made it difficult to devise accurate automatic speech recognition equipment. Acceptable results have been achieved in special situations restricted to constrained vocabularies and to particular individuals. Expanding the number of speakers or the vocabulary to be recognized, however, reduces accuracy to a level that is unacceptable for practical use.
Speech recognition arrangements are generally adapted to transform an unknown input speech pattern into a sequence of acoustic features. Such features may be based on a spectral or a linear predictive analysis of the pattern. The feature sequence generated for the pattern is then compared to a set of previously stored acoustic feature patterns representative of reference utterances for a selected vocabulary. As a result of the comparison, the input speech pattern is identified as the closest corresponding reference pattern. The accuracy of the recognition process is therefore highly dependent on the selected features and the predetermined criteria controlling the comparison.
While many comparison techniques have been suggested for recognition, the most successful ones take into account variations in speech rate and articulation. One such technique, dynamic programming, has been employed to determine an optimum alignment between acoustic feature patterns in the speech recognition process. Advantageously, dynamic time warping of patterns in accordance with dynamic programming principles mitigates the effects of differences in speech rate and articulation. Signal processing systems for recognition based on dynamic time warping, such as that disclosed in U.S. Pat. No. 3,816,722, are relatively complex, must operate at high speeds to provide results in real time, and require very large digital storage facilities. Further, the recognition processing time is a function of the size of the reference vocabulary and the number of reference feature patterns needed for each vocabulary item. Consequently, speaker-dependent recognition for vocabularies of the order of 50 words is difficult to achieve in real time.
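By way of illustration only, the dynamic time warping comparison described above may be sketched as follows. The sketch is not taken from any of the cited patents. The function names, the Euclidean frame distance, the unconstrained step pattern, and the toy one-dimensional feature vectors are all assumptions made for the example.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(test, ref):
    """Cumulative distance along the best alignment of two feature sequences."""
    n, m = len(test), len(ref)
    inf = float("inf")
    # cost[i][j] holds the best cumulative distance aligning test[:i] with ref[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(test[i - 1], ref[j - 1])
            # Allow a frame of either sequence to repeat, or both to advance.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(test, references):
    """Return the label of the reference pattern closest to the test pattern."""
    return min(references, key=lambda label: dtw_distance(test, references[label]))

# Hypothetical one-dimensional features: "yes" rises, "no" falls.
references = {
    "yes": [[1.0], [2.0], [3.0]],
    "no": [[3.0], [2.0], [1.0]],
}
# A slower rendition of "yes": the first and last frames are held longer.
slow_yes = [[1.0], [1.0], [2.0], [3.0], [3.0]]
result = recognize(slow_yes, references)
```

In this sketch one warping operation, with a cost proportional to the product of the two sequence lengths, is carried out for every stored reference pattern. The processing time therefore grows with both the vocabulary size and the number of reference patterns per word, which is the burden noted above.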
Several techniques have been suggested to improve speech recognition processing time. One arrangement disclosed in U.S. Pat. No. 4,256,924 issued to H. Sakoe on Mar. 17, 1981 utilizes a set of standard vectors to represent the reference speech pattern acoustic features so that the complexity of the dynamic time warping as well as the time delay required are reduced. It is necessary, however, to perform a dynamic time warping operation for each reference pattern in the vocabulary set. An alternative scheme described in the article, "Discrete Utterance Speech Recognition Without Time Normalization" by John E. Shore and David Burton, ICASSP Proceedings, May 1, 1982, pp. 907-910, performs a vector quantization to produce a code book for each reference word in a vocabulary set. A signal representative of the similarity between the sequence of input speech pattern features and the sequence of code book features for each word in the reference vocabulary is then generated and the best matching reference is selected. While the vector quantized code book recognition eliminates the delays in dynamic time warp processing, the accuracy obtained is subject to errors due to variations in speech rate and articulation. It is an object of the invention to provide improved speech recognition having high accuracy and reduced processing time requirements.
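For comparison, recognition with per-word vector quantized code books of the general kind described above may be sketched as follows. This is a simplified illustration and not the procedure of the cited article. The code books are assumed to have been generated beforehand, a Euclidean distance stands in for whatever distortion measure a practical system would use, and all names and values are hypothetical.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed distortion measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def codebook_distortion(frames, codebook):
    """Average distance from each input frame to its nearest code book vector."""
    total = sum(min(frame_distance(f, c) for c in codebook) for f in frames)
    return total / len(frames)

def recognize_vq(frames, codebooks):
    """Return the word whose code book represents the input with least distortion."""
    return min(codebooks, key=lambda word: codebook_distortion(frames, codebooks[word]))

# Hypothetical code books, one per vocabulary word, assumed already trained.
codebooks = {
    "yes": [[1.0], [3.0]],
    "no": [[10.0], [12.0]],
}
frames = [[1.1], [2.9], [3.0]]
result = recognize_vq(frames, codebooks)

# No time alignment is performed: reversing the frame order leaves the score unchanged.
forward = codebook_distortion(frames, codebooks["yes"])
backward = codebook_distortion(list(reversed(frames)), codebooks["yes"])
```

Each input frame is scored independently of its position in the utterance, so no warping path need be computed. In this sketch the score carries no information about the temporal order of the frames.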