The present invention relates to a speech recognition apparatus for performing a two-stage dynamic programming matching operation.
Generally speaking, a speech recognition apparatus extracts a characteristic parameter of the input speech and performs pattern matching between that characteristic parameter and pattern data stored in advance, thereby selecting the stored pattern data most similar to the input speech pattern as the recognition result.
A speech frequency spectrum is generally used as the characteristic parameter of the aforementioned speech input data, and the frequency spectrum can be obtained by using a bank of band-pass filters or a fast Fourier transform. The frequency spectrum obtained by these methods is compared with the patterns registered in advance, thereby detecting the degree of resemblance between the input speech and each registered standard pattern. Then, the standard pattern nearest to the input speech data is output as the recognition result.
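As a rough illustration of obtaining a frequency spectrum from one frame of samples, the following sketch computes a magnitude spectrum with a direct discrete Fourier transform, which a fast Fourier transform evaluates more efficiently. The function name `magnitude_spectrum` and the choice of returning only the non-negative frequency bins are assumptions made for illustration; they are not taken from the apparatus described here.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Return |X[k]| for k = 0 .. N/2 using a direct DFT (hypothetical helper)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        # X[k] = sum over t of x[t] * exp(-2*pi*i*k*t / N)
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        spectrum.append(abs(coefficient))
    return spectrum


# One frame holding a single cosine cycle: the energy lands in bin 1.
frame = [math.cos(2 * math.pi * t / 8) for t in range(8)]
spectrum = magnitude_spectrum(frame)
```

A sequence of such spectra, one per frame, would form the kind of pattern that is compared against the registered standard patterns.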
When the aforementioned input data is compared with a registered pattern, the time axes of the two do not always correspond one to one; they vary depending on the words that come before and after the particular speech to be detected, or on the length of a long vowel. As a method of obtaining the degree of resemblance between patterns whose time axes do not correspond one to one, namely, as a method of performing pattern matching, a dynamic programming (hereinafter abbreviated to DP) method is employed. Generally, the degree of resemblance between two patterns is expressed as a distance.
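The DP matching idea can be sketched as follows. This is a minimal illustration only: the city-block frame distance, the three permitted path steps, and the names `frame_distance` and `dp_distance` are assumptions, not details given in this description.

```python
def frame_distance(a, b):
    """City-block distance between two feature vectors (an assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dp_distance(input_pattern, standard_pattern):
    """Minimum accumulated distance over all monotonic time alignments."""
    n, m = len(input_pattern), len(standard_pattern)
    inf = float("inf")
    # g[i][j]: best accumulated distance aligning the first i input frames
    # with the first j standard frames.
    g = [[inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(input_pattern[i - 1], standard_pattern[j - 1])
            # Allow a frame to be repeated on either axis or both to advance.
            g[i][j] = d + min(g[i - 1][j], g[i][j - 1], g[i - 1][j - 1])
    return g[n][m]


# The same word spoken with one stretched frame still matches at distance 0.
short_utterance = [[1], [2], [3]]
long_utterance = [[1], [2], [2], [3]]
stretched_distance = dp_distance(short_utterance, long_utterance)
```

The doubly nested loop visits every pair of frames, which is why the operation time grows with the number of characteristic points in the two patterns.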
The DP method has the disadvantage that, because it searches for the minimum distance between the characteristic points of the two patterns, its operation time becomes longer as the number of characteristic points increases.
In order to remove the aforementioned defect of the DP method, the following method is conventionally used. First, the pattern itself is subjected to linear expansion and compression, and candidate standard patterns are preliminarily selected by linear matching. Then, from among the standard patterns thus selected, the standard pattern of minimum distance is obtained using the DP method.
The linear matching method operates faster than the DP method, and thus the aforementioned method is faster than performing DP matching against all of the standard patterns. However, because this method performs a linear compression of the pattern, it is disadvantageous in that the nonlinear compression of the time axis, which is the characteristic feature of the DP method, is weakened. Furthermore, for words of longer duration, the linear matching used for the preliminary selection causes more recognition errors than the DP method, and may therefore fail to select the desired standard pattern.
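The preliminary selection by linear matching might look roughly like the sketch below. The nearest-frame resampling, the city-block distance, the number of candidates kept, and the names `linear_resample`, `linear_distance`, and `preselect` are all assumptions made for illustration; the conventional method is not specified at this level of detail here.

```python
def linear_resample(pattern, length):
    """Linearly stretch or shrink a frame sequence to `length` frames
    by picking the corresponding source frame (no interpolation)."""
    n = len(pattern)
    return [pattern[min(n - 1, (i * n) // length)] for i in range(length)]


def linear_distance(input_pattern, standard_pattern):
    """Frame-by-frame distance after linear time-axis adjustment."""
    resampled = linear_resample(input_pattern, len(standard_pattern))
    return sum(
        sum(abs(x - y) for x, y in zip(a, b))
        for a, b in zip(resampled, standard_pattern)
    )


def preselect(input_pattern, standard_patterns, keep):
    """Return the names of the `keep` nearest standard patterns."""
    ranked = sorted(
        standard_patterns,
        key=lambda name: linear_distance(input_pattern, standard_patterns[name]),
    )
    return ranked[:keep]


standards = {
    "a": [[1], [1], [5], [5]],
    "b": [[9], [9], [9], [9]],
    "c": [[1], [5]],
}
candidates = preselect([[1], [5]], standards, keep=2)
```

Only the surviving candidates would then be passed to the slower DP matching. Because each input frame is mapped to a fixed position on the standard pattern, an utterance stretched unevenly in time can be ranked poorly at this stage, which is the weakness noted above.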
On the other hand, a speech recognition apparatus detects the maximum value of the speech data in each predetermined period (one frame) and normalizes the data within that frame period. The normalization allows the number of effective bits to be reduced (i.e., compressed) and shortens the time required for the recognition operation.
Where 8-bit data is compressed to 4-bit data for the operation, the maximum value of the input 8-bit data is first obtained, and each data value is divided by that maximum value. The normalization is thus conducted, causing the maximum value to become "1" and the other data to become less than "1". By multiplying the normalized data by $2^4$, for example, the required number of bits, namely 4 bits, is obtained. Multiplying by $2^4$ is equivalent to shifting the normalized data four bits toward the upper digits.
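A minimal sketch of this compression step is given below. The integer arithmetic, the clamping of the peak value to 15 so that it fits in 4 bits, the handling of an all-zero frame, and the name `normalize_frame_to_4bit` are assumptions added for illustration and are not stated in the description.

```python
def normalize_frame_to_4bit(frame):
    """Scale 8-bit values in one frame to 4-bit values relative to the peak."""
    peak = max(frame)
    if peak == 0:
        # Assumed handling: an all-zero frame stays all zero.
        return [0] * len(frame)
    # (value / peak) * 2**4 computed as a 4-bit left shift followed by an
    # integer division; the peak itself would give 16, so clamp to 15.
    return [min(15, (value << 4) // peak) for value in frame]


compressed = normalize_frame_to_4bit([200, 100, 50, 0])
```

Each output value depends only on the ratio of the input value to the frame maximum, not on the absolute size of that maximum.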
The normalization process described above normalizes the data using the maximum value of the input data; alternatively, the sum of the input data may be used for the normalization.
Supposing that the input data is A(n) and that the normalized result is a(n), the normalization method using the maximum value is expressed as follows: $$a(n) = \frac{A(n)}{A_{\max}} \tag{1}$$
The normalization method using the sum value is expressed as follows: $$a(n) = \frac{A(n)}{\sum A(n)} \tag{2}$$
where $A_{\max}$ designates the maximum value within one frame period.
The normalization methods expressed by equations (1) and (2) above have the drawback that the normalized result, namely a(n) in equations (1) and (2), does not reflect the level (power) information of the speech at all.
In the speech recognition apparatus, a frame normalized by a low-level value, for example a frame of silence or of a consonant, is treated in the same way as a frame in which sound is actually produced. Thus, the level information of the respective frames disappears, resulting in erroneous recognition.
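The loss of level information can be seen in a small numerical sketch. The sample frames and the function names `normalize_by_max` and `normalize_by_sum` are hypothetical; the two functions simply apply the maximum-value and sum-value normalizations to one frame of non-zero data.

```python
def normalize_by_max(frame):
    """Divide every value in the frame by the frame maximum."""
    peak = max(frame)
    return [value / peak for value in frame]


def normalize_by_sum(frame):
    """Divide every value in the frame by the frame sum."""
    total = sum(frame)
    return [value / total for value in frame]


# Two frames with the same spectral shape but very different levels.
loud_frame = [200, 100, 50, 50]
quiet_frame = [8, 4, 2, 2]

same_after_max = normalize_by_max(loud_frame) == normalize_by_max(quiet_frame)
same_after_sum = normalize_by_sum(loud_frame) == normalize_by_sum(quiet_frame)
```

Both comparisons evaluate to true for these frames: after either normalization, the loud frame and the quiet frame are indistinguishable, so a later matching stage has no way to tell them apart by level.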