1. Field of the Invention
The present invention relates to a speech recognition apparatus and, in particular, to technologies for equalizing environmental differences between input speech and a reference pattern so as to improve robustness to the environment.
2. Description of the Related Art
It is known that, when speech is recognized, the speech recognition ratio deteriorates if the environment in which the input speech is generated differs from that of the reference pattern speech. The major factors that lower the speech recognition ratio include additive noise and channel distortion. The former is, for example, background noise that mixes with the speech of a speaker and is additive in the spectral domain. The latter arises from, for example, the transmission characteristics of microphones, telephone lines, and so forth, and distorts the spectrum multiplicatively.
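The two kinds of distortion can be illustrated with a minimal sketch. It assumes a simple per-bin model in which the channel gain multiplies the clean spectrum and the noise is then added; the function name `observe` and all numeric values are hypothetical and are not taken from the specification.

```python
import math


def observe(clean, channel_gain, noise):
    """Per-bin observed spectrum under the assumed model: gain * clean + noise."""
    return [h * s + n for s, h, n in zip(clean, channel_gain, noise)]


# Hypothetical three-bin spectra, chosen only for illustration.
clean = [4.0, 8.0, 2.0]
channel_gain = [0.5, 0.25, 2.0]
noise = [1.0, 1.0, 1.0]

# Channel distortion alone: multiplicative on the spectrum,
# so it appears as a fixed per-bin offset in the log spectrum.
channel_only = observe(clean, channel_gain, [0.0, 0.0, 0.0])
log_shift = [math.log(y) - math.log(s) for y, s in zip(channel_only, clean)]
expected_shift = [math.log(h) for h in channel_gain]

# Channel distortion plus additive noise.
noisy = observe(clean, channel_gain, noise)
```

Under this assumed model, subtracting a noise estimate can undo only the additive term; the per-bin gain remains in the result.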
A technique that suppresses additive noise, such as background noise mixed with speech, is known as spectral subtraction. It is described in, for example, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction", by S. F. Boll, IEEE Trans. on ASSP, Vol. ASSP-27, No. 2, 1979 (hereinafter referred to as reference [1]). A speech recognition apparatus using the spectral subtraction technique is constructed, for example, as shown in FIG. 2.
In FIG. 2, reference numeral 21 is a spectrum calculating portion. Input speech that is mixed with noise is sent to the spectrum calculating portion 21, which transforms the input speech into a time sequence of spectra. Reference numeral 22 is a noise estimating portion that estimates the spectrum of the noise component mixed with the time sequence of the spectra of the input speech signal, using only the noise spectra (that is, the portions of the input speech signal that do not contain speech). Reference numeral 23 is a noise suppressing portion that subtracts the spectrum of the noise, which is estimated by the noise estimating portion 22, from the entire time sequence of the spectra of the input speech signal, which is obtained by the spectrum calculating portion 21. Reference numeral 24 is a feature vector transforming portion that transforms the time sequence of the spectra of the speech, in which the noise has been suppressed by the noise suppressing portion 23, into a time sequence of feature vectors that is used for recognizing the speech.
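The roles of the noise estimating portion 22 and the noise suppressing portion 23 can be sketched as follows. This is a minimal illustration, not the implementation of reference [1]: it assumes that the noise-only frames are already identified, that the noise estimate is a plain average of those frames, and that negative results are clamped to a floor. The function names, the `floor` parameter, and the toy spectra are hypothetical.

```python
def estimate_noise(spectra, noise_frame_indices):
    """Average the frames assumed to contain only noise (role of portion 22)."""
    frames = [spectra[i] for i in noise_frame_indices]
    n_bins = len(frames[0])
    return [sum(f[k] for f in frames) / len(frames) for k in range(n_bins)]


def suppress_noise(spectra, noise, floor=0.0):
    """Subtract the noise estimate from every frame (role of portion 23).

    Clamping at `floor` is an assumption made here so that no bin goes negative.
    """
    return [[max(s - n, floor) for s, n in zip(frame, noise)] for frame in spectra]


# Toy spectra: frames 0 and 1 are noise only, frames 2 and 3 carry speech.
noisy = [
    [1.0, 2.0, 1.0],
    [3.0, 2.0, 1.0],
    [7.0, 2.5, 5.0],
    [4.0, 1.0, 9.0],
]
noise = estimate_noise(noisy, [0, 1])
suppressed = suppress_noise(noisy, noise)
```

Because the estimate is taken from the input itself, it can follow a noise component that differs from one input speech to the next.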
Reference numeral 26 is a matching portion. The matching portion 26 calculates the similarity between a time sequence of feature vectors of noise-free speech of a standard speaker, registered as a reference pattern 25, and the time sequence of the feature vectors of the input speech. The similarity is calculated by a technique that obtains a time alignment, such as the DP matching technique or the HMM (Hidden Markov Modelling) technique. The matching portion 26 outputs, as a recognition result, the dictionary alternative with the highest similarity. Such a speech recognition apparatus can precisely suppress the additive noise and provides a high recognition ratio even if the noise component varies for every input speech.
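The matching step can be sketched with a basic form of DP matching (dynamic time warping). The sketch assumes a Euclidean distance between feature vectors and an unconstrained three-way step pattern, and it treats the smallest accumulated distance as the highest similarity. The function names, the dictionary words, and the feature values are hypothetical.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two feature vectors (an assumed local measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def dtw_distance(a, b):
    """Accumulated distance of the best time alignment between sequences a and b."""
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            best_prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = frame_distance(a[i - 1], b[j - 1]) + best_prev
    return cost[len(a)][len(b)]


def recognize(input_features, reference_patterns):
    """Return the dictionary word whose reference pattern aligns most closely."""
    return min(
        reference_patterns,
        key=lambda word: dtw_distance(input_features, reference_patterns[word]),
    )


# Hypothetical two-word dictionary with two-dimensional feature vectors.
references = {
    "yes": [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]],
    "no": [[2.0, 2.0], [1.0, 0.0], [0.0, 0.0]],
}
# The input repeats the first frame of "yes", as a slower utterance might.
spoken = [[0.0, 1.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
result = recognize(spoken, references)
```

The time alignment absorbs the repeated frame, so the slower utterance still matches its reference pattern with zero accumulated distance.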
In addition, a construction as shown in FIG. 3 is known for preventing the recognition ratio from being lowered by the channel distortion. In FIG. 3, reference numeral 32 is a reference pattern. The reference pattern 32 is formed in the following manner. Speech of a standard speaker is collected by a microphone with the same characteristics as the microphone that collects the input speech. The collected speech is passed through a channel with the same characteristics as the channel through which the input speech passes. The resultant speech is analyzed by the same process as that performed by an analyzing portion 31, and the analyzed speech is registered. The analyzing portion 31 transforms the input speech into a time sequence of feature vectors.
Reference numeral 33 is a matching portion. The matching portion 33 calculates the similarity between the time sequence of the feature vectors of the speech of the standard speaker registered in the reference pattern 32 and the time sequence of the feature vectors of the input speech, using the technique that obtains the time alignment and calculates the similarity, and outputs as a recognition result the dictionary alternative with the highest similarity. With such a speech recognition apparatus, when the microphone and the signal transmission line to be used at recognition time are known and can be used for collecting the training speech, the channel distortion of the reference pattern, which is due to the characteristics of the microphone and the transmission characteristics of the channel, can be matched with that of the input speech. Thus, the speech recognition apparatus can precisely recognize speech without being influenced by the channel distortion.
With the construction shown in FIG. 3, a speech recognition apparatus that deals with the additive noise can also be provided. In this case, the reference pattern 32 is formed by collecting speech of the standard speaker in an environment whose background noise matches the background noise of the input speech, analyzing the collected speech by the same process as that of the analyzing portion 31, and registering the resultant feature vectors. When the background noise that occurs at recognition time is known and can be used for collecting the training speech, the additive noise of the reference pattern can be matched with that of the input speech. Thus, the speech recognition apparatus can precisely recognize speech without being influenced by the additive noise.
In addition, when the reference pattern 25 used in the conventional speech recognition apparatus using the spectral subtraction technique shown in FIG. 2 is substituted with the reference pattern 32, whose channel distortion is matched with that of the input speech, a speech recognition apparatus that deals with both the additive noise and the channel distortion can be provided.
However, the conventional speech recognition apparatus using the spectral subtraction technique has not dealt with the channel distortion due to the microphone and the channel transmission characteristics at all. Thus, when the channel distortion of the input speech is different from that of the speech of the reference pattern, the speech recognition ratio is seriously degraded.
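The drawback can be made concrete with a small numeric sketch. It assumes the same per-bin model as spectral subtraction itself (the channel gain multiplies the spectrum, then the noise is added) and an exact noise estimate. The function name and all values are hypothetical.

```python
def spectral_subtract(observed, noise_estimate):
    """Subtract a noise estimate per bin, clamping at zero (an assumed floor)."""
    return [max(y - n, 0.0) for y, n in zip(observed, noise_estimate)]


# Hypothetical reference spectrum, unknown channel gain, and additive noise.
reference = [4.0, 8.0, 2.0]
channel_gain = [0.5, 0.25, 2.0]
noise = [1.0, 1.0, 1.0]

observed = [h * s + n for s, h, n in zip(reference, channel_gain, noise)]

# Even with a perfect noise estimate, the channel gain remains in the result.
restored = spectral_subtract(observed, noise)
mismatch = [r - s for r, s in zip(restored, reference)]
```

Here `restored` equals the channel-distorted spectrum rather than `reference`, so the input and the reference pattern still disagree in every bin.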
In the speech recognition apparatus in which the channel distortion of the reference pattern is matched with that of the input speech, the speech of the standard speaker should be collected through a microphone and a transmission line that have the same characteristics as those through which the input speech is collected. However, when speech received through a telephone is recognized, the microphone and the transmission line in use vary for each input speech. In addition, the microphone and the transmission line through which each input speech is collected are unpredictable. With such a microphone and transmission line, it is therefore impossible to collect training speech and generate a reference pattern, and such a speech recognition apparatus cannot be provided. This problem has not been solved by the speech recognition apparatus in which the additive noise of the reference pattern is matched with that of the input speech.
In addition, even when the reference pattern 25 of the speech recognition apparatus using the spectral subtraction technique is substituted with the reference pattern 32, whose channel distortion is matched with that of the input speech, the problem remains unsolved if the channel distortion is unknown when the training speech is collected.