The present invention relates to a speech recognition system which recognizes speech by comparing the feature vector of an unknown speech with the feature vector of a reference speech stored in a dictionary, and in particular relates to such a system which recognizes speech spoken at a variable speed.
In this specification, a feature vector means a plurality of speech features at a sampling point, and a feature vector system means the sequence of feature vectors in a predetermined duration.
FIG. 1 shows a block diagram of a device for producing a feature vector system of unknown speech. In the figure, an analog unknown speech applied to an input terminal IN is applied to a plurality of narrow bandpass filters BPF.sub.1 through BPF.sub.n. The number n is, for instance, 16, and the center frequencies of the bandpass filters are in the range from 250 Hz to 5 kHz. Each bandpass filter detects a particular spectrum component of the unknown speech. The outputs of the bandpass filters are applied to the lowpass filters LPF through the rectifiers REC. The cutoff frequency of the lowpass filters is, for instance, 50 Hz, for removing the influence of the pitch, which has a period of about 10 mS. The outputs of the lowpass filters are multiplexed by the multiplexer MPX, and the output of that multiplexer is applied to the analog-to-digital converter A/D, which converts the signal to digital form. Next, the feature vector producing system VEC scans the output of the converter A/D every 10 mS, and provides a feature vector having 16 elements every 10 mS. Therefore, if the speech length is 300 mS, 480 (=16.times.30) vector elements are obtained. Finally, the detector DET detects the speech duration in which a speech is actually spoken, and normalizes the feature of the speech source. The output of the detector DET is a feature vector system of the unknown speech, having 16.times.(T/10) elements, where T is the speech length in mS. The feature vector system of the unknown speech at the output terminal OUT is compared with the feature vector systems of the reference speeches, and the unknown speech is recognized to be the same as the reference speech which provides the minimum distance between the unknown speech and the reference speech.
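The element count and the comparison step described above can be illustrated by the following minimal sketch. It is not taken from the specification: the function names, the dictionary layout and the use of a summed Euclidean distance between frames are assumptions made only for illustration, and the sketch presumes that the unknown speech and each reference speech already have the same number of frames.

```python
import math

FRAME_PERIOD_MS = 10   # one feature vector every 10 mS, as in the text
NUM_CHANNELS = 16      # n = 16 bandpass channels, as in the text


def element_count(speech_length_ms, num_channels=NUM_CHANNELS,
                  frame_period_ms=FRAME_PERIOD_MS):
    """Number of vector elements: channels x (T / frame period)."""
    frames = speech_length_ms // frame_period_ms
    return num_channels * frames


def distance(system_a, system_b):
    """Sum of per-frame Euclidean distances (an assumed measure).

    Both feature vector systems must have the same number of frames.
    """
    if len(system_a) != len(system_b):
        raise ValueError("speech lengths must be equal before comparison")
    total = 0.0
    for frame_a, frame_b in zip(system_a, system_b):
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(frame_a, frame_b)))
    return total


def recognize(unknown, dictionary):
    """Return the label of the reference speech nearest to the unknown."""
    return min(dictionary, key=lambda label: distance(unknown, dictionary[label]))
```

For a speech length of 300 mS, `element_count(300)` gives the 480 elements stated above. The length check in `distance` reflects the requirement that the two speech lengths be equal before comparison.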
In comparing an unknown speech with a reference speech, the speech length of the former must be the same as that of the latter. FIG. 2 shows a format of speech characterized by a sequence of feature vectors. Each feature vector lies along a predetermined time T on the vertical axis and is characterized by 16 channels ranging from 250 Hz to 5,000 Hz along the horizontal axis.
The curve of FIG. 2 is obtained by plotting the formant at the detector DET for the 16 channels every 10 mS.
In recognizing a speech, the speech length must be normalized so that the length T.sub.1 of the unknown speech is the same as the length T.sub.2 of the reference speech.
A prior system for normalizing a speech length is a linear method, in which an element of an unknown speech is made to correspond to an element of a reference speech by multiplying by a predetermined coefficient. In the example of FIG. 2, supposing that the elements t.sub.1 and t.sub.2 of the unknown speech correspond to the elements t.sub.1 ' and t.sub.2 ' of the reference speech, the relations t.sub.1 =t.sub.1 '.times.(T.sub.n /T.sub.m) and t.sub.2 =t.sub.2 '.times.(T.sub.n /T.sub.m) are satisfied in the linear method. However, the prior linear method has the disadvantage that the recognition performance is poor, because all the elements are expanded or shortened linearly without considering the features of the speech.
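The linear relation above may be sketched as follows. The sketch assumes that T.sub.n is the number of frames of the unknown speech and T.sub.m the number of frames of the reference speech, which the text does not state explicitly. The function name is hypothetical, and the nearest lower frame is taken instead of interpolating between frames.

```python
def linear_normalize(unknown, ref_len):
    """Resample `unknown` (a list of frames) to `ref_len` frames.

    For each reference index t', the unknown index is
    t = t' x (T_n / T_m), truncated to an integer, where T_n is
    len(unknown) and T_m is ref_len (assumed meanings of the symbols).
    """
    n = len(unknown)
    if n == 0 or ref_len <= 0:
        raise ValueError("both speech lengths must be positive")
    # Integer arithmetic keeps every index inside 0 .. n - 1.
    return [unknown[(t_ref * n) // ref_len] for t_ref in range(ref_len)]
```

Every frame is moved by the same ratio regardless of its content, which is the behaviour criticized above: steady portions and transitions are expanded or shortened by the same amount.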
Another prior system for normalizing a speech length is a dynamic programming system, which is disclosed in, for instance, the Japanese patent publication 50-19227. In a dynamic programming system, the coefficient by which the time t.sub.1 of an unknown element is multiplied is not constant but variable, and many sampling points of the unknown speech (for instance more than 30%) correspond to all the sampling points of a reference speech. The calculation process for that conversion of the sampling points is very complicated. Further, the prior dynamic programming system has the disadvantage that the recognition performance is not good, because the conversion of the sampling points is performed not only for the speech elements but also for the coupling elements between speech elements. Such a coupling element is called co-articulation.
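For orientation only, a generic dynamic time warping recurrence of the textbook kind is sketched below. It is not the method of the cited Japanese publication, whose details are not given here. The Euclidean frame distance, the three permitted steps and all names are assumptions.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(unknown, reference):
    """Accumulated distance along the best non-linear time alignment."""
    n, m = len(unknown), len(reference)
    if n == 0 or m == 0:
        raise ValueError("both feature vector systems must be non-empty")
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning the first i frames
    # of the unknown speech with the first j frames of the reference.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(unknown[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat a reference frame
                                 cost[i][j - 1],      # repeat an unknown frame
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

The double loop visits every pair of sampling points, which suggests why the calculation is regarded as complicated. The alignment also warps every frame alike, including those lying in the coupling portions between speech elements.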