This invention relates to a continuous speech recognition system, and more particularly to an improvement thereof for reducing false recognition caused by an unnatural matching path.
A continuous speech recognition system automatically recognizes speech consisting of two or more continuously spoken words. A pattern matching method for continuous speech recognition has been proposed in U.S. Pat. No. 4,059,725. This method connects a plurality of reference word patterns in every possible order to obtain reference patterns of continuous voice with two or more reference words (hereinafter called "reference continuous voice patterns") and matches the reference continuous voice patterns with the whole input pattern. Recognition is performed by specifying the number and order of the reference word patterns included in the reference continuous voice pattern matched with the input pattern so that the whole similarity measure is minimized. In practice, this minimization is divided into two stages: the first is minimization at word units (hereinafter referred to as "digits"), which correspond to the levels of the reference words constituting a reference continuous voice pattern, and the second is minimization as a whole. Each minimization is carried out by dynamic programming (matching using dynamic programming being called "DP matching" hereinafter).
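As an illustration of the definition above, the following minimal Python sketch concatenates reference word patterns in every order and matches each concatenation against the whole input by DP matching. It is a naive exhaustive search written only to make the definition concrete, not the two-stage procedure of the cited patent. The function names `dtw` and `recognize_exhaustive`, the scalar features, the absolute-difference local distance, and the unconstrained local path are all assumptions made for the example.

```python
import itertools
import math


def dtw(a, b):
    """DP matching distance between two sequences of scalar features."""
    inf = math.inf
    # g[i][j]: minimum accumulated distance matching a[:i] against b[:j]
    g = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    g[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])  # assumed local distance
            g[i][j] = d + min(g[i - 1][j], g[i][j - 1], g[i - 1][j - 1])
    return g[len(a)][len(b)]


def recognize_exhaustive(inp, refs, max_words):
    """Try every ordered concatenation of up to max_words reference words."""
    best = (math.inf, ())
    for n in range(1, max_words + 1):
        for names in itertools.product(refs, repeat=n):
            # Build one "reference continuous voice pattern"
            pattern = [v for name in names for v in refs[name]]
            dist = dtw(inp, pattern)
            if dist < best[0]:
                best = (dist, names)
    return best


if __name__ == "__main__":
    refs = {"a": [1, 1], "b": [5, 5]}  # hypothetical reference word patterns
    print(recognize_exhaustive([1, 1, 5, 5, 5, 1], refs, max_words=3))
```

With N reference words and up to K digits, this search runs DP matching on the order of N + N^2 + ... + N^K concatenated patterns, which is why the number and order of the words are in practice determined by staged minimization instead.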
A technique to reduce the number of DP matching operations has been proposed by Cory S. Myers and Lawrence R. Rabiner. Reference is made to the paper "A Level Building Dynamic Time Warping Algorithm for Connected Word Recognition", IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-29, No. 2, APRIL 1981, pp. 284-297. According to this method (called the LB method hereinafter), the similarity measure between the input pattern, given as a time series of feature vectors, and the reference continuous voice patterns, also given as time series of feature vectors, is obtained. The reference continuous voice patterns consist of every connected combination of a plurality of reference word patterns. In the minimization stage at digits, the minimum value of all similarity measures for a certain digit (a certain word unit) obtainable along all matching paths passing through a certain point is given generally by the sum of the minimum value of the partial similarity measures from the start point for that digit to the certain point and the minimum value of the partial similarity measures from the certain point to the end point. Now, if the end point for that digit is regarded as this "certain point", the minimum value of the similarity measures for two consecutive digits (that digit and the next digit) is given by the sum of the minimum value of the similarity measures for that digit, i.e. from its start point to its end point (=the certain point in this case), and the minimum value of the similarity measures for the next digit, i.e. from its start point (=the certain point) to its end point. Thus, the minimum whole similarity measure is obtained by summing the minimum similarity measures for all digits.
Namely, the possible reference word patterns for the first digit of the reference continuous voice pattern are first matched with the input pattern to obtain the minimum value of the similarity measure for the first digit. The result then serves as the initial value for the second digit, for which the reference word patterns are matched with the input pattern in turn. After matching has been carried out as far as the final permitted digit, the minimum value of the similarity measure for each digit at the end point of the input pattern is obtained, and the optimum number of digits is thereby determined. The recognition category for each digit is obtained successively by tracing the matching path backward from the point of the similarity measure for the optimum digit.
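The digit-by-digit procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of the cited paper as published: features are scalars, the local distance is an absolute difference, the local path is unconstrained, no matching window is applied, and the name `level_building` and the example reference patterns are hypothetical.

```python
import math


def level_building(inp, refs, max_levels):
    """Return (minimum distance, word sequence) for a scalar input pattern."""
    inf = math.inf
    n = len(inp)
    # prev_best[i]: best distance of the previous digits after i input frames
    prev_best = [0.0] + [inf] * n
    levels = []
    for _ in range(max_levels):
        best = [inf] * (n + 1)    # best distance of this digit ending at i
        word = [None] * (n + 1)   # word giving that distance
        start = [None] * (n + 1)  # input frame where the previous digit ended
        for name, ref in refs.items():
            m = len(ref)
            prev_cost = [inf] * (m + 1)
            prev_org = [None] * (m + 1)
            for i in range(1, n + 1):
                cur_cost = [inf] * (m + 1)
                cur_org = [None] * (m + 1)
                for j in range(1, m + 1):
                    d = abs(inp[i - 1] - ref[j - 1])
                    cands = [(prev_cost[j], prev_org[j]),         # input advances
                             (cur_cost[j - 1], cur_org[j - 1])]   # reference advances
                    if j == 1:
                        # enter the word: previous digit ended at frame i - 1
                        cands.append((prev_best[i - 1], i - 1))
                    else:
                        cands.append((prev_cost[j - 1], prev_org[j - 1]))
                    c, o = min(cands, key=lambda t: t[0])
                    if c < inf:
                        cur_cost[j], cur_org[j] = c + d, o
                if cur_cost[m] < best[i]:
                    best[i], word[i], start[i] = cur_cost[m], name, cur_org[m]
                prev_cost, prev_org = cur_cost, cur_org
        levels.append((best, word, start))
        prev_best = best  # result of this digit is the initial value of the next
    # Optimum number of digits: smallest distance at the end of the input
    top = min(range(len(levels)), key=lambda k: levels[k][0][n])
    words, i = [], n
    for best, word, start in reversed(levels[:top + 1]):  # back tracking
        words.append(word[i])
        i = start[i]
    return levels[top][0][n], words[::-1]


if __name__ == "__main__":
    refs = {"a": [1, 1, 1], "b": [5, 5, 5]}
    print(level_building([1, 1, 5, 5, 5, 1], refs, max_levels=3))
```

Each reference word is matched once per digit, so the number of DP matching operations grows with the number of words times the number of digits rather than with the number of word sequences. The word sequence only becomes available in the final back-tracking loop, after all digits have been processed.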
To reduce the number of calculations in the DP matching method and to avoid false recognition caused by taking an unnatural matching path, a matching window limiting the matching path is generally imposed as a global constraint. The matching window is given by two straight lines U(i) and L(i) of fixed inclination extending from the origin (the starting time point of the input pattern and the reference pattern) or by a parallelogram whose vertices are the starting and ending points.
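A parallelogram window of this kind can be sketched as below. The inclinations 2 and 1/2 are assumed here purely for illustration (the text specifies only that the inclination is fixed), and the names `window_bounds` and `in_window` are hypothetical. Frames are numbered from 1, with the starting point at (1, 1) and the ending point at (I, J).

```python
def window_bounds(i, I, J, slope=2.0):
    """Lower and upper reference-axis bounds L(i), U(i) at input frame i."""
    # Lines through the starting point (1, 1) and the ending point (I, J)
    upper = min(1 + slope * (i - 1), J + (i - I) / slope)
    lower = max(1 + (i - 1) / slope, J + slope * (i - I))
    return lower, upper


def in_window(i, j, I, J, slope=2.0):
    """True if time point (i, j) lies inside the matching window."""
    lower, upper = window_bounds(i, I, J, slope)
    return lower <= j <= upper


if __name__ == "__main__":
    I, J = 10, 10  # hypothetical pattern lengths in frames
    inside = sum(in_window(i, j, I, J)
                 for i in range(1, I + 1) for j in range(1, J + 1))
    print(f"{inside} of {I * J} time points lie inside the window")
```

Only the time points for which `in_window` holds need a similarity calculation, and nearly horizontal or nearly vertical paths fall outside the window.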
The matching window can be applied to the DP matching method as it is. It is not directly applicable to the LB method, however, since the starting point differs for each digit. Therefore, in the above-mentioned paper by Myers et al., U(i) and L(i) are given by the following expressions. ##EQU1##
Here, $\phi(x)$ is the total length of the reference patterns of the words recognized up to the (x-1)th digit (the length of the concatenated super reference pattern). Before the similarity measure at each time point (i, j) is calculated, it is determined whether the time point (i, j) lies within the matching window given by expressions (1) and (2), and the calculation is conducted only for time points inside the matching window. However, since the recognition result up to the (x-1)th digit is obtained only through the decision processing (back tracking) conducted after the operation up to the final digit is completed, $\phi(x)$ is unknown in the course of the operation. Accordingly, a large value must be set for $\phi(x)$ and, generally, the length of the reference pattern of the longest word prepared for each digit is inevitably assigned. Namely, $\phi(x)$ for the x-th digit is expressed by
$$\phi(x) = x \cdot J_{\max} \tag{3}$$
where $J_{\max}$ is the length of the longest of the plurality of reference patterns. As a result, a larger value must be set for $\phi(x)$ at a higher digit. The difference between $\phi(x)$ and the true total length of the reference patterns therefore accumulates as the digit approaches the final digit, loosening the restriction imposed by the matching window. Accordingly, the matching window no longer functions as a global constraint, and false recognition caused by taking an unnatural matching path takes place. Especially when numerals are uttered continuously without any restriction on the number of digits, $\phi(x)$ becomes larger as the number of digits increases, and the aforesaid drawback becomes more pronounced.
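The accumulation of this difference can be seen in a small worked example. The frame counts below are invented for illustration, the function name `phi_slack` is hypothetical, and the true total length is taken here as the summed length of the words through the x-th digit so that it is directly comparable with expression (3).

```python
def phi_slack(word_lengths, j_max):
    """Rows of (digit x, bound x * j_max, true total length, difference)."""
    rows = []
    total = 0
    for x, length in enumerate(word_lengths, start=1):
        total += length          # true length of the concatenated patterns
        bound = x * j_max        # value assigned by expression (3)
        rows.append((x, bound, total, bound - total))
    return rows


if __name__ == "__main__":
    # Hypothetical case: the longest reference pattern has 50 frames, while
    # the words actually spoken have 20, 20, 35 and 20 frames.
    for row in phi_slack([20, 20, 35, 20], j_max=50):
        print(row)
```

In this hypothetical case the difference grows from 30 frames at the first digit to 105 frames at the fourth, so the window bound for the fourth digit is more than twice the true total length of 95 frames.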
The following is an example of false recognition due to an unnatural matching path. When a certain sound element is uttered continuously across two words, the continuously uttered sound section may be recognized as a single sound element, and a sound element or a word may therefore be omitted from the recognition result. In this case, the matching path extends almost horizontally, in the direction of the time axis of the input pattern. This means that the conventional loose window restriction allows such an unnatural matching path. Conversely, it can also happen that what was originally a single sound element is matched with a section of the reference pattern comprising several continuously spoken sound elements (i.e. a sound element or a word is inserted). In this case, the matching path is nearly parallel to the time axis of the reference pattern, again leading to false recognition.