This invention relates to a speech recognition system, more particularly to a speech recognition system employing pattern matching technology.
Pattern matching is a standard method of speech recognition. In pattern matching, an input speech segment is analyzed from its startpoint to its endpoint at fixed intervals called frames to extract the features of each frame. A common example of such analysis is bandpass filter analysis using a bank of filters with differing center frequencies, which analyze the input signal into numbered channels. The result is a speech pattern--a time series of feature vectors--which can be compared with a set of reference patterns belonging to the categories of speech the system recognizes. Each category has one or more reference patterns considered typical of it. The similarity is calculated between the input speech pattern and each reference pattern, and the name of the category to which the most closely-matching reference pattern belongs is the result of the recognition process.
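The matching scheme described above can be sketched as follows. This is a hypothetical illustration, not the claimed system: patterns are assumed to be NumPy arrays of shape (frames, channels), similarity is taken as negative Euclidean distance, and the function and variable names are invented for the example.

```python
import numpy as np

def similarity(input_pattern, reference_pattern):
    # Simple frame-by-frame similarity (negative Euclidean distance);
    # assumes both patterns already have the same number of frames.
    return -np.linalg.norm(input_pattern - reference_pattern)

def recognize(input_pattern, references):
    # references: dict mapping a category name to a list of one or more
    # reference patterns considered typical of that category.
    best_category, best_score = None, -np.inf
    for category, patterns in references.items():
        for ref in patterns:
            score = similarity(input_pattern, ref)
            if score > best_score:
                best_category, best_score = category, score
    # The category of the most closely matching reference pattern is
    # the result of the recognition process.
    return best_category
```

In practice the feature vectors would come from the bandpass filter bank analysis described above, one vector per frame.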
The following are two examples of pattern matching algorithms.
The first example is called the linear matching algorithm. As described in Oki Kenkyu Kaihatsu No. 118, Vol. 49, pp. 53-58, the input speech pattern is first subjected to linear expansion or compression on the time axis to absorb differences in the speaking rate; then it is matched against the reference patterns.
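The linear expansion or compression step can be sketched as resampling every pattern onto a fixed number of frames before matching. This is an illustrative assumption about the cited algorithm, not a reproduction of it; the function name and interpolation choice are the example's own.

```python
import numpy as np

def linear_normalize(pattern, target_frames):
    # pattern: array of shape (frames, channels). Linearly expand or
    # compress the time axis by interpolating each channel onto
    # target_frames evenly spaced points, absorbing overall
    # differences in speaking rate.
    frames, channels = pattern.shape
    old_t = np.linspace(0.0, 1.0, frames)
    new_t = np.linspace(0.0, 1.0, target_frames)
    return np.stack(
        [np.interp(new_t, old_t, pattern[:, c]) for c in range(channels)],
        axis=1,
    )
```

After normalization, input and reference patterns share a common length and can be compared frame by frame.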
The second example is a nonlinear matching algorithm known as the DP matching algorithm. As set forth in Japanese Patent Application Publication No. 23941/1975, it uses dynamic programming to align the input speech pattern with the reference patterns by a nonlinear "warping" of the time axis, thereby obtaining optimal compensation for distortion resulting from factors such as variations in speaking rate.
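The nonlinear alignment can be sketched with a generic dynamic-programming recurrence of the kind used in DP (dynamic time warping) matching. This is a minimal textbook form assuming a symmetric step pattern and Euclidean frame distance, not the exact algorithm of the cited publication.

```python
import numpy as np

def dp_distance(a, b):
    # a: (n, channels), b: (m, channels). Accumulate the minimal total
    # frame-to-frame distance over all monotonic alignments of the two
    # time axes; the choice of predecessor at each cell realizes the
    # nonlinear "warping" of the time axis.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Because every cell considers three predecessors, the table has O(nm) entries, which hints at why a hardware realization of this search requires substantial circuitry.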
Thus pattern matching algorithms are methods that evaluate, in terms of similarity, the difference between an input speech pattern and reference patterns (that were generated in the same way as the input pattern), and select the category to which the most similar reference pattern belongs. The underlying assumption is that if the input pattern and a reference pattern belong to the same category, they will have a high similarity, while if they belong to different categories, they will show a lower similarity.
Due to factors such as differences between individual speakers and differences in environmental conditions, however, the speaking rate varies in many ways, so two patterns belonging to the same category do not necessarily score high in similarity. Moreover, when the speaking rate varies, vowels tend to shrink or expand by large amounts while consonants do not, so linear expansion or compression of the time axis does not match an input speech pattern to the reference patterns very well. Specifically, the vowels in the input pattern fail to align with the vowels in the reference pattern, lowering the similarity score.
To cope with such variations, in the linear matching algorithm given above as the first example of the prior art, multiple reference patterns are provided for each category. This creates a problem of memory space, however, because each of the many reference patterns must be stored.
The DP matching algorithm, the second example of the prior art given above, is considered one solution to the problem of multiple reference patterns in the linear matching algorithm. It gets by with a small number of reference patterns by using dynamic programming to perform nonlinear expansion and compression of the speech, but the process of determining the optimal alignment between the input pattern and the reference patterns is complex and requires a large amount of circuitry, leading to problems of device size.