My invention relates to pattern recognition and more particularly to arrangements for automatically recognizing speech patterns.
Speech recognition systems used in communication, data processing and control systems generally compare the acoustic features of an input speech pattern with previously stored acoustic features of identified reference patterns. The input pattern is recognized when its acoustic features match the stored features of a reference pattern according to predetermined recognition criteria. The variability of speech patterns from speaker to speaker, however, limits recognition accuracy and makes it highly dependent on the selected acoustic features and recognition criteria. Best results are obtained when the reference features and the input speech features are derived from the same individual and the patterns are spoken with distinct pauses between words.
Identification of single word patterns may be accomplished by comparing the word pattern to each of a set of reference word patterns. A multiword pattern must be recognized as one of many combinations of reference vocabulary words. Consequently, automatic recognition of such multiword patterns is more difficult than automatic recognition of single word patterns. Recognition of multiword speech patterns may be performed by comparing the sequence of input speech features with every possible combination of reference word feature signal patterns. This method, however, requires time-consuming testing of all possible reference word pattern combinations and exhaustive searching for the best matching combination. As is well known, the number of possible combinations increases exponentially with the number of words in a speech pattern. As a result, it is generally impractical to perform the testing and searching even for a speech pattern of a limited number of words.
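The growth in the number of candidate combinations can be made concrete with a short calculation. The following is a minimal sketch only, assuming a vocabulary of a fixed number of reference words and word strings in which any word may repeat; the function name `count_word_strings` and the ten-word digit vocabulary are illustrative assumptions and do not appear in the cited art.

```python
def count_word_strings(vocabulary_size: int, max_words: int) -> int:
    """Count the distinct reference word strings of 1..max_words words.

    Assumes every vocabulary word may appear at every position, so there are
    vocabulary_size ** n strings of exactly n words.
    """
    return sum(vocabulary_size ** n for n in range(1, max_words + 1))


if __name__ == "__main__":
    # Hypothetical ten-word vocabulary, e.g. the spoken digits.
    for n_words in (1, 3, 5, 7):
        print(n_words, count_word_strings(10, n_words))
```

Under these assumptions, a ten-word vocabulary and strings of up to seven words already give 11,111,110 candidate strings, each of which would have to be compared against the input speech pattern in an exhaustive search.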
U.S. Pat. Nos. 4,049,913 and 4,059,725 disclose speech recognition systems in which the similarity between individual reference word feature patterns and the features of all possible intervals of an input speech pattern is calculated. Partial recognition results are derived for each reference word feature pattern from the similarity measures. Both the partial similarity measures and partial recognition results are stored in a table. The recognized results are extracted from the table to provide the reference word series corresponding to the speech pattern. All possible partial pattern combinations from the table are selected and the selected pattern with maximum similarity is chosen. Since all possible partial pattern combinations are required, the signal processing is highly complex and uneconomical for many applications.
The number of possible reference sequences in recognition processing may be reduced by utilizing semantic and syntactic restrictions. U.S. Pat. No. 4,156,868 issued to S. E. Levinson, May 29, 1979, discloses a recognition arrangement based on syntactic analysis in which an input speech pattern is compared only to syntactically possible reference patterns. The syntactic restrictions significantly reduce the recognition processing. There are, however, many situations where the recognition of sequences of unrelated spoken words, such as a series of spoken digits in credit card or telephone numbers, is important. For such unrestricted vocabularies, recognition is not improved by resorting to syntactic or semantic constraints.
U.S. Pat. No. 4,400,788, issued Aug. 28, 1983 to C. S. Myers et al. and assigned to the same assignee, discloses a dynamic time warping recognition arrangement in which a set of word levels is defined for the input speech pattern and the successive segments of the input speech pattern are assigned to each word level. At each successive level, the assigned segment feature signals and the feature signals of each reference word are time registered to generate time registration endframe and time registration similarity signals for the reference words in the level. Reference word strings are selected responsive to the time registration endframe and similarity signals by backtracking through the recorded time registration paths. While the arrangement obviates the need to consider all possible reference word combinations in recognition processing, it is necessary to first store the entire input speech pattern feature signal sequence and to repeatedly sequence through speech pattern feature signal frames. The multiple passes through the input speech pattern time frames extend the recognition processing and make "real time" speech recognition difficult to achieve.
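The time registration of a speech segment with a single reference word may be illustrated by a basic dynamic time warping computation. The following is a minimal sketch only and is not the arrangement of the cited patent: it assumes one scalar feature per frame, an absolute-difference local distance, and a simple three-way step rule, and it does not implement word levels or backtracking. The name `dtw_distance` is a hypothetical helper.

```python
import math


def dtw_distance(segment, reference):
    """Return the cumulative distance of the best time registration path
    aligning all frames of `segment` with all frames of `reference`.

    Both arguments are sequences of scalar features (an assumption made
    for brevity; real systems use feature vectors per frame).
    """
    n, m = len(segment), len(reference)
    # d[i][j]: best cumulative distance aligning the first i input frames
    # with the first j reference frames.
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(segment[i - 1] - reference[j - 1])
            # Allowed predecessors: repeat a reference frame, repeat an
            # input frame, or advance both sequences by one frame.
            d[i][j] = local + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

In this sketch a smaller returned value corresponds to a greater similarity between the segment and the reference word. Note that the full table is computed only after the whole segment is available, which reflects the storage requirement discussed above.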
The article "An Algorithm for Connected Word Recognition", by Bridle, Brown and Chamberlain, appearing in the Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 1982, pp. 899-902, discloses a dynamic time warping recognition method in which time registration paths are formed for all reference words in a single pass through the input speech pattern frames. The single pass is performed as the feature signals are generated for the sequence of input speech pattern frames. Consequently, real time operation is more easily realizable. If, however, two different registration paths representing different length word strings converge during the single pass, only one registration path can be retained. The other converging path is discarded from consideration, even though it could have resulted in a better match by the end of the registration process. The elimination of such potential candidates lowers the accuracy of the recognition arrangement.
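The single pass processing and the loss of converging paths may be illustrated as follows. This is a minimal sketch only, written under stated assumptions, and is not the method of the cited article: it assumes one scalar feature per frame, an absolute-difference local distance, and a step rule that either repeats a reference frame or advances by one. The name `one_pass_recognize` and the toy reference words are hypothetical.

```python
import math


def one_pass_recognize(frames, references):
    """Process input frames one at a time and return (word_string, distance).

    frames: iterable of scalar features; references: dict mapping each
    reference word to its list of scalar feature frames.
    """
    # cost[w][j]: best cumulative distance of a path now at frame j of word w.
    cost = {w: [math.inf] * len(r) for w, r in references.items()}
    # hist[w][j]: words completed on that path before the current word began.
    hist = {w: [()] * len(r) for w, r in references.items()}
    # Best path ending a word at the previous frame (empty string at start).
    best_end_cost, best_end_hist = 0.0, ()
    for x in frames:
        new_cost, new_hist = {}, {}
        for w, ref in references.items():
            c, h = [math.inf] * len(ref), [()] * len(ref)
            for j, r in enumerate(ref):
                cands = [(cost[w][j], hist[w][j])]  # repeat reference frame j
                if j > 0:
                    cands.append((cost[w][j - 1], hist[w][j - 1]))  # advance
                else:
                    # Word entry: only the single best word-ending path is
                    # available here, whatever its number of words.
                    cands.append((best_end_cost, best_end_hist))
                prev_cost, prev_hist = min(cands, key=lambda t: t[0])
                c[j] = abs(x - r) + prev_cost
                h[j] = prev_hist
            new_cost[w], new_hist[w] = c, h
        cost, hist = new_cost, new_hist
        # Converging word-ending paths: retain one, discard the others.
        end_word = min(references, key=lambda w: cost[w][-1])
        best_end_cost = cost[end_word][-1]
        best_end_hist = hist[end_word][-1] + (end_word,)
    return list(best_end_hist), best_end_cost
```

Because each input frame is consumed once and only the previous frame's scores are kept, the sketch can run as the frames arrive. The single `best_end_cost` entry per frame is where paths representing word strings of different lengths converge and all but one are dropped.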
It is an object of the invention to provide improved automatic speech recognition having higher accuracy and real time response.