This invention relates to a system for automatically recognizing continuous speech composed of a plurality of words spoken continuously.
Various methods have hitherto been tried for voice recognition. A simple pattern matching method which is both available and effective will be described below. This method measures the degree of dissimilarity (hereinafter called "similarity measure") between a reference pattern (hereinafter called "reference word pattern") prepared for each word to be recognized and an inputted unknown voice pattern (hereinafter called "input pattern"), and recognizes the input pattern as the word whose reference word pattern gives the minimum similarity measure.
A continuous speech recognition system operating according to the above-mentioned pattern matching method has been proposed in U.S. Pat. No. 4,059,725. This system operates by matching the whole input pattern against a reference pattern of a continuous voice (hereinafter called "reference continuous voice pattern") obtained by connecting several reference word patterns in every order. The recognition is performed by specifying the number and order of the reference word patterns so that the whole similarity measure will be minimized. In practice, this minimization is divided into two stages, one minimizing at the word unit level and the other minimizing at the whole pattern level, and each minimization is carried out by dynamic programming (matching using dynamic programming being called "DP matching" hereinafter).
In minimization at a word unit level, the system divides the input pattern at every conceivable word unit and then performs DP matching with the reference word pattern for all of them. Assuming here that the length of the input pattern is M and that the number of reference word patterns is V, DP matching will be required M.V times.
A technique for reducing the above number of DP matchings to Lmax.V, with one word being one digit and the maximum available digit number being Lmax, has been proposed by Cory S. Myers and Lawrence R. Rabiner. Reference is made to their paper "A Level Building Dynamic Time Warping Algorithm for Connected Word Recognition", IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-29, No. 2, APRIL 1981, pp. 284-297. According to this technique, the similarity measure between the input pattern, given as a time series of feature vectors, and the reference continuous voice pattern, given as every combination of a plurality of words, each comprising a time series of feature vectors, is obtained as follows. A time point m of the input pattern and a time point n of the reference continuous voice pattern are made to correspond with each other by a well-known optimum monotonically increasing nonlinear function (hereinafter called "time normalized function") n=n(m), and the accumulated value of the distance d(m, n) between the feature vectors at the time points thus made to correspond along the time normalized function is taken as the minimum similarity measure. The minimum value of the whole similarity measure obtainable along all matching routes passing through a given point is given generally by the sum of the minimum partial similarity measure from the start to the given point and the minimum partial similarity measure from the given point to the end. Therefore, by regarding an end point on each digit of the reference continuous voice pattern as the foregoing given point, a minimum partial similarity measure may be obtained on each digit, and the minimum whole similarity measure may be obtained by summing the minimum partial similarity measures for all digits.
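The accumulation of d(m, n) along the time normalized function can be illustrated with a minimal sketch. The function name `dp_match`, the caller-supplied distance function `dist`, and the local path rule (n advances by 0, 1 or 2 for each step of m) are assumptions made for illustration only and are not taken from the cited paper or patent.

```python
import math


def dp_match(input_pattern, reference_pattern, dist):
    """Return the minimum accumulated distance between two patterns.

    g[m][n] holds the smallest sum of d(m, n) over all monotonic paths
    that start at (0, 0) and reach (m, n).  Assumed local path rule:
    for each input frame m, the reference index n advances by 0, 1 or 2.
    """
    M, N = len(input_pattern), len(reference_pattern)
    INF = math.inf
    g = [[INF] * N for _ in range(M)]
    g[0][0] = dist(input_pattern[0], reference_pattern[0])
    for m in range(1, M):
        for n in range(N):
            # Best predecessor on the previous input frame.
            best = g[m - 1][n]
            if n >= 1:
                best = min(best, g[m - 1][n - 1])
            if n >= 2:
                best = min(best, g[m - 1][n - 2])
            if best < INF:
                g[m][n] = best + dist(input_pattern[m], reference_pattern[n])
    # The path must end at the last frame of both patterns.
    return g[M - 1][N - 1]
```

For instance, with scalar features and an absolute-difference distance, the input [1, 1, 2, 3] matches the reference [1, 2, 3] with an accumulated distance of 0, because the repeated first frame is absorbed by the time normalized function.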
Namely, each reference word pattern on the first digit of the reference continuous voice pattern is first matched with the input pattern to obtain a minimum value of the similarity measure, and the result then serves as an initial value for the second digit, on which each reference word pattern is in turn matched with the input pattern. After matching as far as the Lmax-th digit, the minimum value of the similarity measure on each digit at the end point M of the input pattern is obtained, and the optimum digit number L is determined from these values. A recognition category for each digit is then obtained successively by tracing the matching path backward from the end point on the L-th digit.
Since minimization is effected on each digit of the reference continuous voice pattern according to the technique given by Myers et al., as described above, the number of DP matching processes for each digit is equal to the number V of reference words, and the total number of DP matching processes is thereby reduced to Lmax.V.
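The digit-by-digit procedure described above can be sketched as follows. This is a simplified illustration, not the implementation of Myers et al.: the names `dp_level` and `level_building`, the dictionary of reference word patterns, and the local path rule (the reference index advances by 0, 1 or 2 per input frame) are all assumptions.

```python
import math


def dp_level(prev, x, ref, dist):
    """One DP matching: append the word `ref` to the best scores `prev`.

    prev[m] is the best score of the preceding digits consuming m frames.
    Returns (score, start), both indexed by the number of frames consumed.
    """
    M, N = len(x), len(ref)
    INF = math.inf
    score, start = [INF] * (M + 1), [None] * (M + 1)
    g = [(INF, None)] * N  # (accumulated distance, frame where the word began)
    for i in range(M):
        new = [(INF, None)] * N
        for n in range(N):
            cands = [g[n]]
            if n >= 1:
                cands.append(g[n - 1])
            if n >= 2:
                cands.append(g[n - 2])
            if n == 0:
                cands.append((prev[i], i))  # initial value from previous digit
            best = min(cands, key=lambda c: c[0])
            if best[0] < INF:
                new[n] = (best[0] + dist(x[i], ref[n]), best[1])
        g = new
        score[i + 1], start[i + 1] = g[N - 1]
    return score, start


def level_building(x, refs, max_levels, dist):
    """Return (recognized words, minimum whole measure, number of DP matchings)."""
    M, INF = len(x), math.inf
    prev = [0.0] + [INF] * M
    levels, matchings = [], 0
    for _ in range(max_levels):
        best, word, start = [INF] * (M + 1), [None] * (M + 1), [None] * (M + 1)
        for name, ref in refs.items():  # V DP matchings on each digit
            s, st = dp_level(prev, x, ref, dist)
            matchings += 1
            for m in range(M + 1):
                if s[m] < best[m]:
                    best[m], word[m], start[m] = s[m], name, st[m]
        levels.append((best, word, start))
        prev = best
    # Optimum digit number: the level with the smallest measure at end point M.
    L = min(range(max_levels), key=lambda l: levels[l][0][M])
    total = levels[L][0][M]
    if total == INF:
        return None, INF, matchings
    words, end = [], M
    for l in range(L, -1, -1):  # trace the matching path backward
        _, word, start = levels[l]
        words.append(word[end])
        end = start[end]
    return words[::-1], total, matchings
```

With two hypothetical scalar reference word patterns, "one" = [1, 2, 3] and "two" = [7, 8, 9], the input [1, 2, 3, 7, 8, 9] with Lmax = 3 is recognized as the two-digit sequence "one", "two", and the sketch performs Lmax.V = 6 DP matchings.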
In a speech recognition system, the recognition response time is the time from the detection of the end of speech to the output of a recognized result. According to the technique by Myers et al., matching on the first digit is commenced after the input pattern necessary for matching on the first digit has been obtained, and the matching then proceeds successively up to the Lmax-th digit to obtain a recognized result. In other words, the calculation for DP matching proceeds in the direction of the input pattern axis in the above technique, so that a major part of the calculation cannot be commenced until the input voice comes to an end. For example, assume that the upper boundary of the range for matching (the well-known matching window) is specified by a straight line of inclination 2, that the lower boundary is specified by a straight line of inclination 1/2, and that the maximum length of a reference pattern is double the average length of a reference pattern. The calculation for the matching process then has proceeded through only 1/4 of the digits at the time point when the input voice comes to an end, and the calculation for the remaining 3/4 of the digits is left for processing after the voice has been uttered. The processing time for these 3/4 of the digits constitutes the recognition response time, a large time lag that is problematic for real-time operation. To settle the problem, a complicated and expensive high-speed processor capable of parallel processing and pipeline processing would be required.
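One possible reading of the 1/4 figure is sketched below. It is an interpretation, not a derivation stated in the source: matching on a digit is assumed to start only once every input frame that the digit could reach inside the matching window is available, and the input is assumed to consist of words of average length. The function name and parameters are hypothetical.

```python
def digits_ready_at_end_of_speech(num_digits, avg_len,
                                  upper_slope=2, max_to_avg=2):
    """Count the digits whose matching can finish when the speech ends.

    Assumptions: each reference word may be as long as max_to_avg * avg_len,
    and the window lets the input run upper_slope times as long as the
    reference, so digit l may need upper_slope * l * max_len input frames.
    """
    max_len = max_to_avg * avg_len
    input_len = num_digits * avg_len  # spoken words of average length
    ready = 0
    for level in range(1, num_digits + 1):
        frames_needed = upper_slope * level * max_len
        if frames_needed <= input_len:
            ready = level
    return ready
```

Under these assumptions, an utterance of 8 digits with an average word length of 20 frames gives 160 input frames, while digit l may need 80 × l frames, so only 2 of the 8 digits (1/4) can be completed by the end of the speech.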