1. Field of the Invention
The invention relates to a technique for performing speech recognition by using a feature of a speech time series, such as a cepstrum or the like.
The invention also relates to a technique for the removal of convolution distortion, such as line characteristics or the like.
The invention further relates to a technique for enabling an instantaneous or successive adaptation to noise.
2. Related Background Art
When speech recognition is performed in a real environment, problems can be caused by convolution distortion, which arises from line characteristics such as those of a microphone or a telephone line, and by additive noise such as internal noise. Among the methods of coping with the distortion of the line characteristics, a Cepstrum Mean Subtraction (CMS) method has been proposed. The CMS method is disclosed in detail in, for example, Rahim, et al., "Signal Bias Removal for Robust Telephone Based Speech Recognition in Adverse Environments", Proc. of ICASSP'94, 1994.
The CMS method is a method of compensating for the distortion of the line characteristics. According to such a method, on the basis of information extracted from input speech, the line distortion is corrected on the input time series side or on the model side, such as an HMM or the like, thereby adapting the recognition to the input environment. Thus, even if the line characteristics fluctuate, it is possible to flexibly cope with such a situation.
The CMS method is a method of compensating for convolution distortion (line distortion), which arises from a convolution with an impulse response. A long-time spectrum of input speech is subtracted from the input speech and a long-time spectrum of the speech used in the model formation is subtracted from a model, thereby normalizing a difference of the line characteristics. The normalizing process is generally performed in a logarithm spectrum region or a cepstrum region. A convolution in the time domain becomes a product in the spectrum and hence a sum once the logarithm is taken. Since the convolution distortion therefore appears as an additive distortion in those two regions, the convolution distortion can be compensated for by a subtraction. A method of performing such a process in the cepstrum region is called a CMS method.
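The subtraction described above can be illustrated with a minimal sketch. It assumes that the cepstra are already available as lists of coefficients per frame and that the line distortion is a fixed offset added to every frame. The function names and the numeric values are hypothetical and are chosen only for illustration.

```python
from typing import List

Frame = List[float]


def cepstral_mean(frames: List[Frame]) -> Frame:
    """Long-time mean (CM) of each cepstral coefficient over all frames."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def cms(frames: List[Frame]) -> List[Frame]:
    """Subtract the cepstral mean from every frame."""
    mean = cepstral_mean(frames)
    return [[c - m for c, m in zip(f, mean)] for f in frames]


# Hypothetical clean cepstra: 3 frames, 2 coefficients each.
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]

# Assumed line distortion: additive and constant in the cepstrum region.
channel = [0.5, -1.5]
distorted = [[c + h for c, h in zip(f, channel)] for f in clean]

# After CMS the constant offset cancels, so both results coincide.
normalized_clean = cms(clean)
normalized_distorted = cms(distorted)
```

In this sketch the same subtraction is applied to the undistorted and the distorted series, which mirrors the subtraction on the model side and on the input side respectively.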
By using the CMS method as mentioned above, it is possible to cope with the distortion of the line characteristics due to the influence of the microphone, telephone line characteristics, or the like. In the case of using the CMS method, however, in order to compute a cepstrum long-time mean (CM) from the speech inputted as a recognition target, the user has to wait for completion of the input of the speech. The recognizing process is performed after the CM has been obtained, namely, after the end of the speech input. Therefore, a recognition algorithm cannot operate synchronously with the speech input. It is, consequently, impossible to perform a real-time process according to the conventional method.
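The latency problem can be seen in a small sketch. The values and the helper name are hypothetical. The sketch shows only that the CM known partway through an utterance differs from the CM of the whole utterance, so the conventional subtraction cannot be finalized before the input ends.

```python
from typing import List


def mean_vector(frames: List[List[float]]) -> List[float]:
    """Mean of each cepstral coefficient over the given frames."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]


# Hypothetical utterance: 4 frames, 2 cepstral coefficients each.
utterance = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0], [6.0, 2.0]]

# CM of the whole utterance: available only after the last frame arrives.
full_cm = mean_vector(utterance)

# CM computable after only the first two frames have been received.
partial_cm = mean_vector(utterance[:2])

print("CM after 2 frames:", partial_cm)
print("CM after all frames:", full_cm)
```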