In various types of communication and data processing systems, it is advantageous to use voice interface arrangements for inquiries, commands and exchange of data and other information. The complexity of speech patterns and variations therein among speakers, however, makes it difficult to construct satisfactory automatic speech recognition equipment. Acceptable results have been obtained in special applications restricted to particular individuals and constrained vocabularies. The limited speed and accuracy of automatic speech recognizers, however, have so far precluded wider utilization.
In general, an automatic speech recognition arrangement is adapted to transform an unknown speech pattern into a frame sequence of prescribed acoustic features. These acoustic features are then compared to previously stored sets of acoustic features representative of identified reference patterns. The unknown speech pattern is identified as the closest matching reference pattern. The accuracy of speech recognition is highly dependent on the features that are selected and the criteria used in the comparisons. Acoustic features may be obtained from a spectral, linear predictive or another type of analysis of a speech pattern over periods of 5 to 20 milliseconds, and the speech pattern features may comprise time frame sequences of spectral distributions or linear prediction coefficients. For an utterance of a single word, the number of time frames may range from 30 to 70 and there may be 10 to 15 spectral distributions or prediction coefficients per frame.
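The comparison step described above may be illustrated by the following minimal sketch, which is not the arrangement of any cited patent. It assumes that each pattern is a list of frames, that each frame is a list of numeric features, and that patterns of different lengths are brought to a common length by simple linear resampling. The function names `resample`, `pattern_distance` and `recognize`, the frame count `n` and the Euclidean frame distance are all assumptions chosen for illustration.

```python
import math


def resample(frames, n):
    """Pick n frames at evenly spaced positions (a crude time normalization)."""
    if n == 1:
        return [frames[0]]
    last = len(frames) - 1
    return [frames[round(i * last / (n - 1))] for i in range(n)]


def pattern_distance(a, b, n=30):
    """Mean Euclidean distance between two patterns after resampling both to n frames."""
    ra, rb = resample(a, n), resample(b, n)
    return sum(math.dist(x, y) for x, y in zip(ra, rb)) / n


def recognize(unknown, references):
    """Return the label of the stored reference pattern closest to the unknown pattern."""
    return min(references, key=lambda label: pattern_distance(unknown, references[label]))


# Hypothetical data: two reference words with 12 features per frame.
references = {
    "yes": [[0.0] * 12 for _ in range(40)],
    "no": [[1.0] * 12 for _ in range(50)],
}
unknown = [[0.9] * 12 for _ in range(35)]
result = recognize(unknown, references)
```

In this sketch every stored reference is compared with the unknown pattern frame by frame, so the amount of processing grows with both the vocabulary size and the number of frames per pattern.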
Where a large vocabulary of reference patterns is used, the storage requirements for the reference pattern features and the extended signal processing needed for comparisons of acoustic features result in complex data processing equipment and long delays in pattern identification. It has been recognized that a reduction of the number of feature signals results in an improvement in the cost and speed of recognition. It is difficult, however, to reduce the number of acoustic features without affecting the accuracy of recognition.
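The scale of the storage requirement may be estimated with a short calculation. The sketch below uses the frame and coefficient ranges stated above for a single-word utterance. The vocabulary size of 1000 words and the function names are assumptions introduced only for illustration.

```python
def features_per_word(frames, coeffs_per_frame):
    """Number of feature values stored for one reference word."""
    return frames * coeffs_per_frame


def vocabulary_features(words, frames, coeffs_per_frame):
    """Number of feature values stored for a whole vocabulary of reference words."""
    return words * features_per_word(frames, coeffs_per_frame)


low = features_per_word(30, 10)    # 300 values per word
high = features_per_word(70, 15)   # 1050 values per word

# Assumed vocabulary of 1000 words at the upper end of the stated ranges.
total = vocabulary_features(1000, 70, 15)  # 1050000 values
```

Because each stored value also takes part in the comparison with the unknown pattern, a reduction in the number of features lowers both the storage and the signal processing required.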
U.S. Pat. No. 4,038,503 discloses an arrangement that modifies the time scale of a speech pattern as a function of the changes in spectral distributions and selects representative spectral features for the speech frames. In this way, the number of spectral features is reduced, i.e., by a factor of eight. With respect to linear prediction analysis, U.S. Pat. No. 4,282,403, issued to H. Sakoe on Aug. 4, 1981, describes a method of quantizing prediction parameters to reduce the storage requirements for reference patterns. These techniques are useful in selecting already formed acoustic features for speech frames but do not reduce the number of frames in recognition processing. It is an object of the invention to provide improved speech recognition having both reduced storage and signal processing requirements.