The present invention relates to a speech recognition system and, in particular, to a system which improves recognition performance by modifying the distance between an input speech and a reference speech through weighting of some elements of the reference speech.
A prior speech recognition system is shown in FIG. 1, in which the reference numeral 11 is an input terminal for accepting an input speech signal to be recognized, 12 is a frequency analyzer, 13 is a detector for detecting the start point and the end point of the speech to be recognized, 14 is a start signal of a speech, 15 is an end signal of a speech, 16 is a spectrum converter, 17 is a distance calculation means between a reference speech and an input speech, and 18 is a decision circuit.
The frequency analyzer 12 is shown in FIG. 2, in which an input speech signal 21 to be recognized is applied to a plurality of bandpass filters 23-1 through 23-n through the pre-amplifier 22. The center frequencies of those bandpass filters are, preferably, in the range between 200 Hz and 6000 Hz, and the interval between any two adjacent center frequencies is equal on a logarithmic scale. The outputs of those bandpass filters 23-1 through 23-n are applied to the multiplexer 26 through the rectifiers 24-1 through 24-n, and the lowpass filters 25-1 through 25-n. The output of the multiplexer 26 is applied to an analog-digital converter 27, which provides a digital output at every predetermined interval, which is called a sampling period. The output of the converter 27 is applied to the output terminal 29 through the logarithmic converter 28. According to the preferred embodiment, the number of the bandpass filters is 16, and the sampling period is 10 msec.
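The logarithmic spacing of the center frequencies can be illustrated by a minimal sketch. It assumes, for illustration only, that the lowest and highest center frequencies sit exactly at 200 Hz and 6000 Hz; the text above only states that the center frequencies lie within that range. The function name `log_spaced_centers` is hypothetical.

```python
def log_spaced_centers(f_low=200.0, f_high=6000.0, n=16):
    """Return n center frequencies equally spaced on a logarithmic scale.

    Assumption (not stated in the text): the first and last filters are
    centered exactly on f_low and f_high.
    """
    # Equal spacing on a log scale means a constant ratio between neighbours.
    ratio = (f_high / f_low) ** (1.0 / (n - 1))
    return [f_low * ratio ** k for k in range(n)]


centers = log_spaced_centers()
for channel, freq in enumerate(centers, start=1):
    print(f"channel {channel:2d}: {freq:7.1f} Hz")
```

With 16 channels, each center frequency is a fixed multiple of the previous one, so the filters are packed more densely at low frequencies than at high frequencies.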
The output of the frequency analyzer 12 is applied to the start-end detector 13, and the spectrum converter 16.
The start-end detector 13 detects the start point and the end point of the speech to be recognized, and the detected timing of the start point and the end point is applied to the distance calculation means 17 as the start signal 14 and the end signal 15. That detector 13 is implemented by calculating the average level of the outputs of the lowpass filters 25-1 through 25-n for every sampling period, providing the start timing when that average level exceeds the predetermined value, and providing the end timing when that average value becomes lower than that predetermined value.
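The threshold rule used by the detector 13 can be sketched as follows. This is only an illustration: the function name `detect_start_end`, the data layout (one list of channel levels per sampling period), and the behavior when the level never drops below the threshold are assumptions, not details given in the text.

```python
def detect_start_end(frames, threshold):
    """Return (start_index, end_index) of the speech, or None if no start is found.

    frames: list of sampling periods, each a list of channel levels
            (the lowpass filter outputs for that period).
    Assumption: if the level never falls below the threshold again,
    the end index is taken to be len(frames).
    """
    start = None
    for t, frame in enumerate(frames):
        level = sum(frame) / len(frame)   # average level over all channels
        if start is None:
            if level > threshold:         # start timing: level exceeds threshold
                start = t
        elif level < threshold:           # end timing: level drops below threshold
            return start, t
    if start is None:
        return None
    return start, len(frames)


frames = [[0, 0], [1, 1], [5, 5], [6, 4], [1, 1], [0, 0]]
print(detect_start_end(frames, threshold=2))
```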
The spectrum converter 16 normalizes the speech power and the speech source characteristics so that both weak speech and loud speech can be recognized. The spectrum converter is explained in accordance with FIG. 4A and FIG. 4B.
In FIG. 4A, the horizontal axis shows the frequency, that is, the channel number, which corresponds to the position of the bandpass filter (and the associated rectifier and lowpass filter) in FIG. 2, and the vertical axis shows the power of that channel. The curve (a) shows the case where the speech is loud, and the curve (b) shows the case where the speech is weak. The curves (a) and (b) are approximated by the straight lines (P) and (Q), respectively, which are obtained by the method of least squares. Then, the difference between the curve a (or b) and the line P (or Q) is obtained, and that difference is the converted spectrum. The converted spectrum therefore has a sign: it is positive in the regions (a1), (a3), (b1) and (b3), and negative in the regions (a2) and (b2). The converted spectrum is independent of the strength of the speech.
The calculation for that conversion is as follows.
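Suppose that, at some sampling time, the output of the i-th channel of the frequency analyzer 12 is $x_i$ (i is in the range between 1 and n, and in a preferred embodiment n=16). Then the converted data $\hat{x}_i$ is expressed as follows.

$$\hat{x}_i = x_i - (Ai + B) \tag{1}$$

where A and B are determined by the least squares fit line P or Q of FIG. 4A, and are obtained by the equations below.

$$A = \frac{N \sum_{i=1}^{N} i\,x_i - \sum_{i=1}^{N} i \, \sum_{i=1}^{N} x_i}{N \sum_{i=1}^{N} i^2 - \left(\sum_{i=1}^{N} i\right)^2} \tag{2}$$

$$B = \frac{\sum_{i=1}^{N} i^2 \, \sum_{i=1}^{N} x_i - \sum_{i=1}^{N} i \, \sum_{i=1}^{N} i\,x_i}{N \sum_{i=1}^{N} i^2 - \left(\sum_{i=1}^{N} i\right)^2} \tag{3}$$

In the equations (2) and (3), since the number N of data (here, the number of channels n) is constant, $\sum_{i=1}^{N} i$ and $\sum_{i=1}^{N} i^2$ are constant, and therefore, the denominator of the equations (2) and (3) is constant. Therefore, by putting

$$C_1 = \sum_{i=1}^{N} i, \qquad C_2 = \sum_{i=1}^{N} i^2$$

the equations (2) and (3) are expressed as follows.

$$A = \frac{N \sum_{i=1}^{N} i\,x_i - C_1 \sum_{i=1}^{N} x_i}{C_3} \tag{4}$$

$$B = \frac{C_2 \sum_{i=1}^{N} x_i - C_1 \sum_{i=1}^{N} i\,x_i}{C_3} \tag{5}$$

where

$$C_3 = N C_2 - C_1^2$$

As apparent from the equations (4) and (5), the values A and B are obtained by calculating $\sum_{i=1}^{N} i\,x_i$ and $\sum_{i=1}^{N} x_i$, and further, the converted spectrum $\hat{x}_i$ is obtained by using the equation (1).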
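The least squares conversion can be checked numerically with a short sketch. It assumes the standard least squares formulas for a line fitted over the channel indices 1 to N, with the constants taken as C1 = Σi, C2 = Σi² and C3 = N·C2 − C1²; the function name `convert_spectrum` is hypothetical. The example also assumes, for illustration, that a louder utterance appears as a constant offset on the logarithmic channel outputs.

```python
def convert_spectrum(x):
    """Subtract the least squares line A*i + B from the channel outputs x_1..x_N.

    x: list of N channel values for one sampling time.
    Assumed constants: C1 = sum(i), C2 = sum(i^2), C3 = N*C2 - C1^2.
    """
    n = len(x)
    idx = range(1, n + 1)                 # channel numbers i = 1..N
    c1 = sum(idx)
    c2 = sum(i * i for i in idx)
    c3 = n * c2 - c1 * c1                 # common (constant) denominator
    s_ix = sum(i * xi for i, xi in zip(idx, x))   # sum of i * x_i
    s_x = sum(x)                                  # sum of x_i
    a = (n * s_ix - c1 * s_x) / c3        # slope of the fitted line
    b = (c2 * s_x - c1 * s_ix) / c3       # intercept of the fitted line
    return [xi - (a * i + b) for i, xi in zip(idx, x)]


weak = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
loud = [v + 10.0 for v in weak]           # same shape, higher level (assumed offset)
converted_weak = convert_spectrum(weak)
converted_loud = convert_spectrum(loud)
print(converted_weak)
print(converted_loud)
```

Under these assumptions the two printed spectra agree up to rounding, because the constant offset is absorbed entirely by B, and the converted values take both signs because they are residuals about the fitted line.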
FIG. 4B is a block diagram of a spectrum converter 16 for providing the converted spectrum $\hat{x}_i$ from the input data $x_i$.
The input data $x_i$ from the frequency analyzer 12 is applied to the input terminal 31 and then to the multiplier 33, which provides the product of $x_i$ and i, where i is generated by the counter 32 in synchronization with the input data. The adder 34 and the register 35 accumulate the output of the multiplier 33, and then, the register 35 provides the value $\sum_{i=1}^{N} i\,x_i$. Similarly, the adder 36 and the register 37 accumulate the value $x_i$, and then, the register 37 provides the value $\sum_{i=1}^{N} x_i$.
The selector 38 selects one of the constants N and $C_1$, and the selector 39 selects one of the constants $C_1$ and $C_2$. The selected constants are applied to the multipliers 40 and 41, respectively.
When the selectors 38 and 39 select N and $C_1$, respectively, the multiplier 40 provides the product $N \sum_{i=1}^{N} i\,x_i$, and the multiplier 41 provides the value $C_1 \sum_{i=1}^{N} x_i$. Then, the subtract-divider 42 provides the ratio of the difference between the outputs of the multipliers 40 and 41 to the constant $C_3$, and that ratio is:

$$\frac{N \sum_{i=1}^{N} i\,x_i - C_1 \sum_{i=1}^{N} x_i}{C_3}$$

That ratio is equal to the value A of the equation (4). The value A is stored in the register 43.
Similarly, when the selectors 38 and 39 select $C_1$ and $C_2$, respectively, the subtract-divider 44 provides the ratio of the difference between the outputs of the multipliers 41 and 40 to the constant $C_3$. That ratio is equal to the value B of the equation (5) as follows.

$$\frac{C_2 \sum_{i=1}^{N} x_i - C_1 \sum_{i=1}^{N} i\,x_i}{C_3}$$

The value B is stored in the register 45.
Then, the multiplier 47 provides the product of A and i, where i is generated by the counter 46 in synchronization with the input data. The adder 48 provides the sum of the outputs of the multiplier 47 and the register 45, and that sum is Ai+B.
Finally, the subtractor 50 subtracts the output Ai+B of the adder 48 from the input data $x_i$, which is supplied through the delay circuit 49, and thus provides the value:

$$\hat{x}_i = x_i - (Ai + B)$$
The delay circuit 49 compensates for the time needed to calculate Ai+B, so that the subtractor 50 receives the values Ai+B and $x_i$ in synchronization.
Thus, the output 51 provides the converted spectrum $\hat{x}_i$ of the equation (1), and that converted spectrum is the difference between the original spectrum $x_i$ and the least squares fit line as described in FIG. 4A.
The distance calculation means 17 of FIG. 1 is shown in FIG. 3. In FIG. 3, the reference numeral 14 is the speech start signal provided by the detector 13, 15 is the speech end signal provided by the detector 13, 103 is the input data from the spectrum converter 16, 104 is a memory control circuit for controlling the input memory 105, 105 is an input memory for storing the input data from the line 103 between the start and end of the speech, 106 is a reference memory control circuit, and 107 is a reference memory which stores the reference speech information. The reference numeral 108 is a distance calculator, 109 is an adder, and 110 is a register. The output of the register 110 is applied to the decision circuit 18 of FIG. 1.
The input memory 105 stores the input data which is the converted spectrum of the input speech, between the start of the speech and the end of the speech. The converted spectrum is applied to that input memory 105 through the control circuit 104. The input memory stores the input data of all the channels (see FIG. 4A) for every sampling time. It should be noted that each of those data has a sign (positive or negative), and an absolute value.
When all the input data is stored in the input memory 105, the distance calculator 108 calculates the distance between the input data and each reference data. Either a linear time warping method or a dynamic time warping method can be used for the distance calculation. For simplicity, it is assumed here that the input speech and each reference speech are linearly warped to M frames (M=32). FIG. 3 is an example of a known distance calculator.
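Linear warping to a fixed number of frames can be sketched as below. The text does not say how the warping is carried out, so the nearest-frame index mapping used here is an assumption chosen for simplicity, and the function name `linear_warp` is hypothetical.

```python
def linear_warp(frames, m=32):
    """Resample a sequence of frames to exactly m frames.

    Assumption: each output frame is the input frame nearest to a linearly
    scaled time position (no interpolation between frames).
    """
    t = len(frames)
    if t == 0:
        raise ValueError("at least one input frame is required")
    if m == 1:
        return [frames[0]]
    # Map output index k = 0..m-1 linearly onto input index 0..t-1.
    return [frames[round(k * (t - 1) / (m - 1))] for k in range(m)]


short_utterance = list(range(10))    # stand-in for 10 spectral frames
long_utterance = list(range(100))    # stand-in for 100 spectral frames
print(len(linear_warp(short_utterance)), len(linear_warp(long_utterance)))
```

Both a short and a long utterance end up with the same number of frames, so they can be compared frame by frame against a reference of the same length.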
The distance calculator 108 reads out each element of the input data from the input memory 105 through the control circuit 104, and each element of the reference data from the reference memory 107 through the control circuit 106. Then, the distance calculator 108 calculates the absolute value of the difference between each element of the input data and the corresponding element of the selected reference data. The distance calculated by the calculator 108 is accumulated by the adder 109 and the register 110 for all the elements of the selected category. Therefore, the equation for the calculation in 108, 109 and 110 is shown below.

$$D = \sum_{L=1}^{M} \sum_{i=1}^{N} \left| R(i, L) - I(i, L) \right|$$

where i shows the channel, L shows the sampling number, M shows the number of linearly warped sampling points, N is the number of channels, R is the reference data, and I is the input data. It should be noted that the register 110 is cleared when the distance calculation begins.
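The accumulation performed by the calculator 108, the adder 109 and the register 110 amounts to a sum of absolute differences over all channels and all frames. A minimal sketch, assuming both patterns are stored as M frames of N channel values each (the function name `city_block_distance` is hypothetical):

```python
def city_block_distance(reference, input_data):
    """Sum of |R - I| over all channels of all frames.

    reference, input_data: lists of M frames, each a list of N signed values.
    """
    total = 0.0                                  # register 110 is cleared first
    for ref_frame, in_frame in zip(reference, input_data):
        for r, v in zip(ref_frame, in_frame):
            total += abs(r - v)                  # calculator 108, adder 109
    return total


reference = [[1.0, -2.0], [0.0, 3.0]]
input_data = [[0.0, -1.0], [2.0, 3.0]]
print(city_block_distance(reference, input_data))
```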
The result of the distance calculation is applied to the decision circuit 18, which compares the distances between the input data and each of the reference categories, and determines that the input speech belongs to the reference category which gives the smallest distance.
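The decision rule is a minimum-distance choice over the stored categories. The sketch below is an illustration only; the function name `classify`, the dictionary layout of the reference memory, and the category labels are hypothetical.

```python
def classify(input_data, references):
    """Return the category whose reference pattern is closest to the input.

    input_data: list of frames, each a list of channel values.
    references: dict mapping a category label to a pattern of the same shape.
    """
    def distance(pattern):
        # Sum of absolute differences over all channels of all frames.
        return sum(abs(r - v)
                   for ref_frame, in_frame in zip(pattern, input_data)
                   for r, v in zip(ref_frame, in_frame))

    return min(references, key=lambda label: distance(references[label]))


references = {
    "yes": [[1.0, -1.0], [0.5, -0.5]],
    "no": [[-2.0, 2.0], [-1.0, 1.0]],
}
print(classify([[0.9, -0.8], [0.4, -0.6]], references))
```

When two reference patterns are nearly equidistant from the input, a small change in the utterance is enough to flip the choice.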
However, the prior speech recognition system has the disadvantage that the result of the recognition is sometimes in error, or the recognition is even impossible. That disadvantage comes from the fact that speech differs from speaker to speaker, and even the speech of a particular speaker changes with each utterance. Therefore, an error occurs when the converted spectrum of one category is similar to the converted spectrum of another category.