The present invention relates to a speech recognizer.
In response to recent developments in speech recognition technology, speech recognizers are almost ready to be put to practical use in various fields. However, in order to achieve such practical use of speech recognizers, many problems must yet be solved.
In actual use, the operating states in which the speech recognizers are used vary, thus causing variations in the voice patterns of speech to be recognized. For example, when background noise in the vicinity of the speech recognizers becomes significant, a speaker must speak loudly, thereby resulting in variations in voice patterns. Such variations in voice patterns caused by a speaker attempting to overcome a noisy environment are called the "Lombard effect". The voice patterns also vary when the speech recognizers are used for a long time and the speaker becomes tired. Furthermore, when the speakers themselves are changed, the voice patterns vary.
Therefore, if a state in which a speech recognizer has learned a reference voice pattern is different from a state in which the speech recognizer is used, a serious problem arises in that the reference voice pattern cannot account for the above mentioned variations in voice patterns, thereby resulting in erroneous speech recognition.
In order to solve this problem, a countermeasure is employed in which the speech recognizer is made to learn all foreseeable voice pattern variations in advance. However, this approach is not practical because the learning time and capacity of the speech recognizer must be increased enormously and the operator must perform extremely troublesome operations.
Thus, in recent years, a method has been proposed in which voice pattern variations are calculated each time the speech recognizer is used and analytical conditions are changed in accordance with the voice pattern variations at the time of analyzing characteristic parameters of voice pattern variations. By employing this method, voice pattern variations can be accounted for with a shorter learning time, a smaller speech recognizer capacity and a lesser burden on the operator.
Hereinbelow, a known speech recognizer is described with reference to FIG. 1. The known speech recognizer is of a registration type in which the reference voice pattern is made by inputting the voice of a user. As one example in which voice patterns at the time of storing a reference voice pattern and at the time of voice pattern recognition differ from each other, a case where noise in the surrounding environment varies is adopted. In FIG. 1, the known speech recognizer includes a signal input terminal 1, a power measuring portion 20, analyzers 21 and 23, a vowel deciding portion 22, a matching portion 8, an output terminal 9 for outputting a recognition result, a buffer 10 for storing the reference voice pattern and switches 24, 25 and 26.
The known speech recognizer of the above described arrangement is operated as follows. Initially, at the time of storing the reference voice pattern, environmental noise in the vicinity of the speech recognizer, immediately before the input of the reference voice signal, is input to the signal input terminal 1 and the power level of the environmental noise is calculated by the power measuring portion 20. If the power level of the environmental noise exceeds a predetermined threshold value P1, the environment is regarded as being unsuitable for storing of the reference voice pattern and thus, registration of the reference voice pattern is suspended. On the contrary, if the power level of the environmental noise is not more than the threshold value P1, a reference voice pattern signal is input to the signal input terminal 1 and is fed to the analyzer 21 where a characteristic parameter is calculated. At this time, the input signal is passed through a filter F1 expressed by the following equation (i).
$$F_1(Z) = 1 - 0.9375\,Z^{-1} \tag{i}$$
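The noise check and the filter F1 of equation (i) can be illustrated by the following minimal sketch. It is not taken from the known speech recognizer itself: the function names, the use of mean squared amplitude as the power level, and the zero initial condition of the filter are assumptions made only for illustration.

```python
def pre_emphasize(samples, coeff=0.9375):
    """Apply the first-order filter 1 - coeff * Z^-1 to a list of samples."""
    out = []
    prev = 0.0  # assumption: the sample preceding the first one is zero
    for x in samples:
        out.append(x - coeff * prev)
        prev = x
    return out


def mean_power(samples):
    """Mean squared amplitude, used here as a stand-in for the power level."""
    return sum(x * x for x in samples) / len(samples)


def may_register(noise_samples, p1):
    """Registration proceeds only if the noise power does not exceed P1."""
    return mean_power(noise_samples) <= p1
```

For a constant input, every output sample after the first shrinks to 1 - 0.9375 = 0.0625 of the input, which shows that the filter suppresses the low-frequency band and thereby emphasizes the high-frequency band in relative terms.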
In the equation (i), character Z denotes the Z-function of the FFT (fast Fourier transformation) X(f) for transforming a time function into a frequency function. Assuming that character t denotes time and character f denotes frequency, the FFT X(f) of a time function x(t) is given by:
$$X(f) = \int_{-\infty}^{\infty} x(t)\,\exp(-j 2\pi f t)\,dt$$
where exp(-j2πft) is expressed by the Z-function Z, i.e. $Z \equiv \exp(-j 2\pi f t)$.
After a high-frequency band of the input signal has been emphasized by the filter F1, the input signal is analyzed. If the LPC cepstrum method is carried out in the analyzer 21, a predetermined number of LPC cepstral coefficients are calculated as characteristic parameters. When the power level of the voice pattern exceeds a detection threshold value within a predetermined voice pattern interval, the corresponding characteristic parameter is regarded as the reference voice pattern to be stored in the buffer 10. The above described processing starting from the input of the reference voice pattern signal is performed for all words to be recognized and the registration process is then complete.
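One common way of obtaining LPC cepstral coefficients from a frame is sketched below. The description above does not state how the analyzer 21 performs the LPC analysis, so the autocorrelation method, the Levinson-Durbin recursion and all function names are assumptions chosen for illustration.

```python
def autocorrelation(frame, max_lag):
    """r[k] = sum of frame[n] * frame[n + k] for k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order]; returns (a, error)."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum of the all-pole model 1 / (1 - sum a[k] Z^-k)."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


def lpc_cepstrum(frame, order, n_ceps):
    """Characteristic parameters of one frame (hypothetical helper)."""
    r = autocorrelation(frame, order)
    if r[0] == 0.0:  # silent frame: no model can be fitted
        return [0.0] * n_ceps
    a, _ = levinson_durbin(r, order)
    return lpc_to_cepstrum(a, n_ceps)
```

For a first-order model with coefficient a, the recursion reproduces the closed form c_n = a^n / n, which offers a simple check of the conversion step.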
Subsequently, at the time of speech recognition, the power level of environmental noise is measured in the same manner as in the reference voice pattern registration process and then, a voice signal is input via the signal input terminal 1. If the power level of the environmental noise is not more than the threshold value P1, a characteristic parameter of the input voice signal is calculated using the analyzer 21 in the same manner as in the registration process, and the thus calculated characteristic parameter is transmitted to the matching portion 8. At the matching portion 8, the variation between each of the reference voice patterns and the input voice pattern is calculated and the word exhibiting the minimum variation is output as a recognition result from the output terminal 9.
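The selection of the word exhibiting the minimum variation can be sketched as follows. The measure of variation used by the matching portion 8 is not specified above; this sketch assumes a dynamic time warping distance over Euclidean frame distances, and the names dtw_distance and recognize are hypothetical.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two characteristic parameter vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(seq_a, seq_b):
    """Accumulated distance along the best time alignment of two patterns."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_distance(seq_a[i - 1], seq_b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def recognize(references, input_pattern):
    """Return the registered word whose pattern is closest to the input."""
    return min(references,
               key=lambda word: dtw_distance(references[word], input_pattern))
```

Here references plays the role of the buffer 10, mapping each registered word to its sequence of characteristic parameter vectors.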
On the other hand, if the power level of the environmental noise exceeds the threshold value P1, the power level of the input voice signal is calculated for each frame by the power measuring portion 20 and then, the power level of the environmental noise and the power of the input voice signal are fed to the vowel deciding portion 22. At the vowel deciding portion 22, a vowel determination is made based on the following conditions (a) and (b).
(a) The signal level is higher than a sum of the noise level and a constant C.
(b) Five or more consecutive frames satisfy the above condition (a).
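Conditions (a) and (b) can be read as a run-length test on the per-frame signal level, as in the following sketch. This is one possible reading only: it assumes that every frame belonging to a run of at least five consecutive frames satisfying condition (a) is treated as a vowel, and the function and parameter names are hypothetical.

```python
def vowel_frames(frame_levels, noise_level, c, min_run=5):
    """Return one flag per frame: True if the frame is judged a vowel."""
    # Condition (a): the signal level exceeds the noise level plus C.
    above = [level > noise_level + c for level in frame_levels]
    is_vowel = [False] * len(above)
    start = None
    # The trailing False is a sentinel that closes a run ending at the
    # last frame.
    for i, flag in enumerate(above + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            # Condition (b): the run must span min_run or more frames.
            if i - start >= min_run:
                for j in range(start, i):
                    is_vowel[j] = True
            start = None
    return is_vowel
```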
It is determined that a frame which satisfies the conditions (a) and (b) is a vowel. If a frame is not determined to be a vowel, the input signal is fed to the analyzer 21, a high-frequency band of the frame is emphasized using the filter expressed by the above equation (i) and the characteristic parameter is calculated in the same manner as in the case of the reference voice pattern registration process. On the other hand, if a frame is determined to be a vowel, the input signal is fed to the analyzer 23, where a high-frequency band of the frame is emphasized by a filter F2 expressed by the following equation (ii).
$$F_2(Z) = 1 - 0.6375\,Z^{-1} \tag{ii}$$
The emphasis of the high-frequency band of the frame by the filter F2 is less than that of the filter F1 and the tilt of the equation (ii) is milder than that of the equation (i). When environmental noise becomes large, the voice state of a speaker changes such that a high-frequency band of the input voice signal becomes intense. Therefore, the tilt of a filter for emphasizing a high-frequency band in a noisy environment is required to be milder than that in a less noisy environment. After the input voice signal has been passed through the filter F2, the characteristic parameter thereof is calculated in the same manner as in the reference voice pattern registration process.
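The difference in tilt between the filters F1 and F2 can be checked numerically, as in the following sketch. It assumes that the filters are evaluated on the unit circle at a normalized angular frequency omega between 0 and pi, and it takes the ratio of the gain at pi to the gain at 0 as a measure of tilt; the function names and this measure are illustrative choices, not part of the known speech recognizer.

```python
import cmath
import math


def gain(coeff, omega):
    """Magnitude of 1 - coeff * Z^-1 with Z^-1 = exp(-j * omega)."""
    return abs(1 - coeff * cmath.exp(-1j * omega))


def tilt_db(coeff):
    """Gain at the highest frequency relative to the gain at 0, in dB."""
    return 20 * math.log10(gain(coeff, math.pi) / gain(coeff, 0.0))
```

Under these assumptions, tilt_db(0.9375) is roughly 29.8 dB for the filter F1 and tilt_db(0.6375) is roughly 13.1 dB for the filter F2, which is consistent with the milder tilt of the filter F2.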
The calculated characteristic parameter is fed to the matching portion 8 and the recognition result is generated from the output terminal 9 in the same manner as in the case where the power level of the environmental noise is not more than the threshold value P1.
The switch 24 is changed over to the vowel deciding portion 22 when the power level of the environmental noise exceeds the threshold value P1, and to the analyzer 21 when the power level is not more than the threshold value P1. When no voice signal is being input, the switch 24 is in an OFF state. The switch 26 is changed over to the analyzer 23 when the frame is determined to be a vowel, and to the analyzer 21 when it is not. Meanwhile, the switch 25 is changed over to the buffer 10 during the reference voice pattern registration process and to the matching portion 8 during the voice recognition process.
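The combined effect of the switches 24, 25 and 26 can be summarized by the following sketch, in which each switch is modeled as a simple selection function. The function names and the returned labels are hypothetical and serve only to restate the routing described above.

```python
def select_analyzer(noise_power, p1, frame_is_vowel):
    """Model of switches 24 and 26: choose the analyzer for one frame."""
    if noise_power <= p1:
        # Switch 24 bypasses the vowel deciding portion 22.
        return "analyzer 21"
    # Switch 24 routes to the vowel deciding portion 22; switch 26 then
    # selects the analyzer according to the vowel decision.
    return "analyzer 23" if frame_is_vowel else "analyzer 21"


def select_destination(registering):
    """Model of switch 25: where the characteristic parameter is sent."""
    return "buffer 10" if registering else "matching portion 8"
```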
In the above described known speech recognizer, changes in spectral tilt due to variations of voice patterns are initially compensated for, and then the parameter used for recognition of a voice signal is analyzed. Thus, the known speech recognizer suffers drawbacks in that, since the contents of the compensation are not accurately incorporated into the parameter through the analysis processing, the compensation efficiency is reduced and in some cases, the compensation does not contribute to an improvement of the recognition rate at all.
Furthermore, the known speech recognizer is disadvantageous in that, although it is possible to compensate for changes in spectral tilt, it is not possible to compensate for changes in the resonance frequency characteristic of a vocal sound (referred to as "formant frequency", hereinbelow) caused by variations of voice patterns, thereby resulting in a lowering of the recognition rate.