1. Field of the Invention
The present invention relates to a speech recognition method. More specifically, the present invention relates to a speech recognition method in which automatic speech recognition by a machine such as an electronic computer is effected by using a distance or probability between an input speech spectrum time sequence and a template speech spectrum time sequence or its statistical model.
2. Description of the Background Art
Basically, in automatic speech recognition by an electronic computer or the like, the speech is converted to a spectrum time sequence and then recognized. The cepstrum is often used as a feature parameter representing the spectrum. The cepstrum is defined as the inverse Fourier transform of the logarithmic spectrum. In the following, the logarithmic spectrum will be referred to simply as the spectrum.
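By way of illustration only, the definition of the cepstrum given above may be sketched as follows. The function name `real_cepstrum`, the FFT length, the small constant `eps` guarding the logarithm, and the synthetic input frame are all assumptions introduced for this example and are not taken from the cited literature.

```python
import numpy as np

def real_cepstrum(frame, n_fft=512, eps=1e-10):
    """Real cepstrum of one windowed frame (illustrative sketch).

    The cepstrum is taken as the inverse Fourier transform of the
    logarithmic magnitude spectrum. `eps` is a hypothetical guard
    against log(0).
    """
    spectrum = np.fft.rfft(frame, n=n_fft)          # one-sided spectrum
    log_spectrum = np.log(np.abs(spectrum) + eps)   # logarithmic spectrum
    return np.fft.irfft(log_spectrum, n=n_fft)      # back to the quefrency domain

# Synthetic stand-in for a speech frame (assumption: 400 samples, Hamming window).
rng = np.random.default_rng(0)
frame = np.hamming(400) * rng.standard_normal(400)
cep = real_cepstrum(frame)
```

In practice only the low-order cepstral coefficients would typically be retained as the feature parameter, since they describe the smooth envelope of the spectrum.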
Recently, it has been reported that the reliability of speech recognition can be improved if a change of the spectrum in time or on the frequency axis is used as a feature together with the spectrum. Proposed features include the "delta cepstrum" utilizing the time change of the spectrum [Sadaoki Furui: "Speaker-Independent Isolated Word Recognition Using Dynamic Features of Speech Spectrum," IEEE Trans., ASSP-34, No. 1, pp. 52-59, (Feb., 1986)]; the "spectral slope" utilizing the frequency change of the spectrum [D. H. Klatt: "Prediction of Perceived Phonetic Distance from Critical-Band Spectra: A First Step," Proc. ICASSP82 (International Conference on Acoustics Speech and Signal Processing), pp. 1278-1281, (May, 1982); Brian A. Hanson and Hisashi Wakita: "Spectral Slope Distance Measures with Linear Prediction Analysis for Word Recognition in Noise," IEEE Trans. ASSP-35, No. 7, pp. 968-973, (Jul., 1987)]; and the "spectral movement function" capturing the movement of formants [Kiyoaki Aikawa and Sadaoki Furui: "Spectral Movement Function and its Application to Speech Recognition," Proc. ICASSP88, pp. 223-226, (Apr., 1988)].
The "delta cepstrum" is based on a time-derivative of the logarithmic spectrum time sequence and is calculated by a time filter which does not depend on frequency. The "spectral slope" is based on a frequency-derivative of the logarithmic spectrum and is calculated by a frequency filter which does not depend on time. The "spectral movement function" is based on a time-frequency-derivative of the logarithmic spectrum and is calculated by operations of both the time filter and the frequency filter. In each case, the frequency filter is constant regardless of time, and the time filter is the same for every frequency. The time filter addresses fluctuation of the spectrum on the time axis, while the frequency filter addresses fluctuation of the spectrum on the frequency axis.
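The distinction between a fixed time filter and a fixed frequency filter may be illustrated by the following minimal sketch. It assumes a logarithmic spectrum stored as an array `S[t, f]` and uses a simple least-squares regression filter as the derivative estimator. The function name `regression_filter`, the window half-width, and the synthetic test spectrum are assumptions of this example; the cited works define their own filters, which are not reproduced here.

```python
import numpy as np

def regression_filter(x, axis, half_width=2):
    """Least-squares slope of x along `axis` over 2*half_width+1 points.

    The same weights are applied at every position on the other axis,
    so the filter is fixed (it does not adapt to time or frequency).
    """
    k = np.arange(-half_width, half_width + 1)
    weights = k / np.sum(k ** 2)
    pad = [(0, 0)] * x.ndim
    pad[axis] = (half_width, half_width)
    padded = np.pad(x, pad, mode="edge")  # hypothetical edge handling
    return np.apply_along_axis(
        lambda v: np.correlate(v, weights, mode="valid"), axis, padded)

# Synthetic log spectrum S[t, f] with known derivatives (illustration only).
t = np.arange(20)[:, None]
f = np.arange(16)[None, :]
S = 2.0 * t + 3.0 * f + 0.5 * t * f

time_derivative = regression_filter(S, axis=0)       # time filter only
frequency_derivative = regression_filter(S, axis=1)  # frequency filter only
time_frequency_derivative = regression_filter(time_derivative, axis=1)  # both
```

Because the weights never change with position, neither filter can make its frequency response depend on time, which is the limitation discussed in the following paragraph.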
However, the feature extraction mechanism of the human auditory system is considered to be different from any of these filters. The human auditory system exhibits a masking effect. In a two-dimensional spectrum on the time-frequency plane, a speech signal of a certain frequency at a certain time point is masked, in other words inhibited, by a speech signal which is close to it in time and in frequency. When the speech at a certain time point masks speech succeeding it in time, the effect is referred to as forward masking. Forward masking can be considered to store the spectral shape of a preceding time point, and it can therefore be assumed that a dynamic feature not included in the preceding speech is extracted by this effect. According to an auditory-psychological study, the frequency pattern of forward masking becomes smoother as the time interval between the masking sound and the masked sound (masker-signal time-interval) becomes longer [Eiichi Miyasaka, "Spatio-Temporal Characteristics of Masking of Brief Test-Tone Pulses by a Tone-Burst with Abrupt Switching Transients," J. Acoust. Soc. Jpn, Vol. 39, No. 9, pp. 614-623, 1983 (in Japanese)]. The masked speech is the effective speech perceived by the human auditory system. This signal processing mechanism cannot be realized by a fixed frequency filter which is not dependent on time. In order to implement it, it is necessary to use a set of frequency filters whose characteristics change with time. The characteristics of this set of filters as spectrum smoothing filters change depending on the time interval from reception of the speech serving as the masker, so that the operation related to frequency is dependent on time. A mechanism for extracting feature parameters that takes such auditory characteristics into consideration has not yet been reported.
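Purely as a hypothetical illustration of what "a set of frequency filters the characteristics of which change dependent on time" could look like, the sketch below smooths each preceding spectrum along the frequency axis with a Gaussian whose width grows with the masker-signal time interval, and subtracts the result from the current spectrum. The Gaussian shape, the subtraction, the attenuation with lag, and every parameter (`max_lag`, `sigma0`, `sigma_step`, `decay`) are assumptions made for this example; they are not taken from the cited study or from any reported method.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Normalized Gaussian smoothing kernel (assumed filter shape)."""
    k = np.arange(-radius, radius + 1)
    w = np.exp(-0.5 * (k / sigma) ** 2)
    return w / w.sum()

def smooth_frequency(spectrum, sigma):
    """Smooth one spectrum along the frequency axis."""
    radius = int(np.ceil(3 * sigma))
    padded = np.pad(spectrum, radius, mode="edge")
    return np.convolve(padded, gaussian_kernel(sigma, radius), mode="valid")

def masked_spectrum(S, max_lag=3, sigma0=1.0, sigma_step=1.0, decay=0.5):
    """Subtract lag-dependent smoothed copies of past frames from S[t, f].

    Hypothetical model: the longer the interval (lag) to the masker,
    the smoother (larger sigma) and weaker (decay ** lag) its pattern.
    """
    out = S.astype(float).copy()
    for t in range(S.shape[0]):
        for lag in range(1, min(max_lag, t) + 1):
            sigma = sigma0 + sigma_step * (lag - 1)  # smoother for longer intervals
            gain = decay ** lag                      # assumed attenuation with lag
            out[t] -= gain * smooth_frequency(S[t - lag], sigma)
    return out

# A flat spectrum makes the lag-dependent gains easy to inspect.
flat = np.ones((5, 8))
masked = masked_spectrum(flat)

# The smoothing filter itself becomes broader as sigma grows.
impulse = np.zeros(33)
impulse[16] = 1.0
narrow = smooth_frequency(impulse, 1.0)
broad = smooth_frequency(impulse, 3.0)
```

The point of the sketch is only that the frequency-axis operation applied to a preceding spectrum depends on how long ago it was received, which a single fixed frequency filter cannot express.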