The present invention generally relates to pattern matching systems, and more particularly to a pattern matching system for speech recognition.
At present, speech recognition is generally performed by pattern matching. In a pattern matching system, standard patterns are registered in advance, and an unknown speech pattern which is input is collated with the registered standard patterns to find the registered standard pattern which is the same as or most closely resembles the unknown speech pattern. This registered standard pattern is output as the recognition result.
FIG. 1 is a diagram for explaining an example of a conventional pattern matching system. FIG. 1(A) shows an input pattern of an input speech which is pronounced "tu", and FIG. 1(B) shows a standard pattern which corresponds to "tu". The pattern matching system compares the input pattern with the standard pattern and obtains a degree of similarity (resemblance) of the standard pattern with respect to the input pattern.
There are basically two methods of collating the patterns, which differ in how they deal with variations in the length of speech, as described in Niimi, "Speech Recognition", Kyouritsu Publishing Co., for example. A first method carries out the time normalization of the pattern non-linearly, and will hereinafter be referred to as a non-linear matching method. Dynamic programming (DP) matching, which is sometimes also referred to as dynamic time warping, is a typical non-linear matching method. On the other hand, a second method carries out the time normalization of the pattern linearly, and will hereinafter be referred to as a linear matching method.
The non-linear matching method requires a large number of operations when compared to the linear matching method. For this reason, it is desirable to use the linear matching method if a sufficiently high matching accuracy can be obtained thereby.
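The cost of the non-linear matching method can be illustrated with a minimal sketch of DP matching. The sketch assumes each frame is a single number and uses the absolute difference as the local distance. Both are simplifications for illustration, and the name dtw_distance is hypothetical. For patterns of n and m frames, the table filled below requires on the order of n × m local distance evaluations, whereas a linear method compares the frames in a single pass.

```python
def dtw_distance(a, b):
    """Accumulated distance of the best non-linear alignment of a and b.

    a, b: lists of scalar frames (an assumption; real systems use
    feature vectors per frame).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allow a frame of either pattern to be stretched, or both to advance.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]
```

In this sketch, a pattern and a time-stretched copy of it (for example [1, 2, 3] and [1, 2, 2, 3]) align with zero accumulated distance, which is the property that makes the non-linear method robust to variations in speaking rate.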
The linear matching method may be categorized into two types, that is, a first type which matches the length of one of two patterns which are collated to the length of the other by time normalization, and a second type which first converts the length of all of the patterns to a predetermined length by time normalization. The first type requires the time normalization process every time two patterns are collated. On the other hand, the second type also converts the length of the standard patterns to the predetermined length when registering the standard patterns, and once the length of the unknown speech pattern is converted into the predetermined length, there is no need to carry out calculations associated with the time normalization when collating the unknown speech pattern with the registered standard patterns. Hence, the number of operations required when collating the unknown speech pattern with the registered standard patterns can be reduced compared to the first type.
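The second type of linear matching method can be sketched as follows. The sketch assumes scalar frames, assumes that time normalization averages the frames falling in each of the equal time segments, and assumes a sum of absolute differences as the measure of dissimilarity. None of these details is specified above, and the names normalize_length, distance and recognize are hypothetical.

```python
def normalize_length(pattern, target_len=8):
    """Linearly convert a pattern to target_len frames.

    Each output frame is the average of the input frames in one of
    target_len equal time segments (an assumed normalization rule).
    """
    n = len(pattern)
    out = []
    for k in range(target_len):
        start = k * n // target_len
        # Guarantee a non-empty segment when the pattern is shorter than target_len.
        end = max((k + 1) * n // target_len, start + 1)
        segment = pattern[start:end]
        out.append(sum(segment) / len(segment))
    return out


def distance(p, q):
    """Sum of absolute frame differences between two equal-length patterns."""
    return sum(abs(x - y) for x, y in zip(p, q))


def recognize(unknown, registry, target_len=8):
    """Return the name of the registered pattern closest to the unknown pattern.

    registry maps a name to a standard pattern that was already
    normalized to target_len frames at registration time.
    """
    u = normalize_length(unknown, target_len)  # normalized once only
    return min(registry, key=lambda name: distance(u, registry[name]))
```

Because the standard patterns in registry are stored at the predetermined length, the unknown pattern is normalized once, and each collation then reduces to a fixed-length comparison.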
However, the problems described below exist in the conventional pattern matching system employing the second type of linear matching method.
For example, a speech pattern shown in portion B of FIG. 2 is obtained when a speech "utumuku" is sampled at a sampling period of 10 ms to 20 ms. Short words are generally 50 ms to 600 ms long, while long words are generally in the range of 1.5 s. Hence, the above described sampling, taking a sampling period of 10 ms, will result in 5 to 60 samples for the short words and approximately 150 samples for the long words, and the number of samples is in most cases converted into 8 or 16 samples by time normalization.
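The sample counts above follow from dividing the word duration by the sampling period. The short sketch below works through that arithmetic for a 10 ms period and shows how many input samples are merged into each output sample when the pattern is converted into 8 samples. The function names are hypothetical.

```python
def sample_count(duration_ms, period_ms):
    """Number of samples obtained from a word of the given duration."""
    return duration_ms // period_ms


def samples_per_output(n_samples, target_len):
    """Average number of input samples merged into one normalized sample."""
    return n_samples / target_len


short_word = sample_count(600, 10)    # 600 ms word, 10 ms period
long_word = sample_count(1500, 10)    # 1.5 s word, 10 ms period
print(short_word, samples_per_output(short_word, 8))
print(long_word, samples_per_output(long_word, 8))
```

For the long word, each of the 8 normalized samples covers nearly 19 input samples, which is longer than a typical short consonant.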
Hence, when the word "tu" is taken as an example of a short word and the word "utumuku" is taken as an example of a long word, 50 samples are obtained for the word "tu" while approximately 120 samples are obtained for the word "utumuku". When the 50 samples of the word "tu" are converted into 8 samples by the time normalization, the samples for "t" are converted into 1 sample and the samples for "u" are converted into approximately 7 samples. On the other hand, when the 120 samples of the word "utumuku" are converted into 8 samples, the consonants "t", "m" and "k" virtually do not appear in the converted pattern, as may be seen from portion A of FIG. 2. In other words, when the time normalization is carried out, the consonants are preserved for the short words, but the consonants are not preserved and only the vowels remain for the long words. As a result, the long word must be recognized using only the vowels. Therefore, there is a problem in that words having the same arrangement of vowels cannot be distinguished from each other, and in an extreme case, the word "utumuku" may be recognized as the word "u" because the patterns of the two words become approximately the same after the time normalization of the samples.
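The loss of consonants can be reproduced with a small sketch. Each sample is represented here only by the label of the sound it belongs to, and each normalized sample takes the label that dominates its time segment, as a stand-in for averaging real feature values. The durations assigned to each sound (6 samples per consonant) are illustrative assumptions and are not taken from FIG. 2. The names expand and normalize_labels are hypothetical.

```python
from collections import Counter


def expand(segments):
    """Turn [(label, count), ...] into a flat list of per-sample labels."""
    return [label for label, count in segments for _ in range(count)]


def normalize_labels(frames, target_len=8):
    """Linearly convert a labelled pattern to target_len samples.

    Each output sample takes the most frequent label in its time segment.
    """
    n = len(frames)
    out = []
    for k in range(target_len):
        start = k * n // target_len
        end = max((k + 1) * n // target_len, start + 1)
        out.append(Counter(frames[start:end]).most_common(1)[0][0])
    return out


# Assumed durations: 50 samples for "tu", 120 samples for "utumuku".
tu = expand([("t", 6), ("u", 44)])
utumuku = expand([("u", 24), ("t", 6), ("u", 24), ("m", 6),
                  ("u", 24), ("k", 6), ("u", 30)])
u = expand([("u", 40)])

tu_norm = "".join(normalize_labels(tu))
utumuku_norm = "".join(normalize_labels(utumuku))
u_norm = "".join(normalize_labels(u))
print(tu_norm, utumuku_norm, u_norm)
```

Under these assumed durations, the normalized "tu" keeps one "t" sample followed by seven "u" samples, while the normalized "utumuku" consists of "u" samples only and is therefore identical to the normalized "u".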