It is generally known that the performance of an automatic speech recognition apparatus degrades markedly in an environment with a long reverberation time. For this reason, it is desirable to remove the reverberation contained in observed speech as a preprocessing step. Accordingly, various conventional dereverberation methods have been proposed, as described below.
A first conventional dereverberation method subtracts, in the speech power spectrum domain, the speech power spectrum of a previous frame multiplied by a coefficient. The method is based on the general property that the sound power of reverberation attenuates exponentially. See Nakamura, Takiguchi and Shikano, “Study on Reverberation Compensation in Short-Time Spectral Analysis,” Lecture Paper Collection of the Acoustical Society of Japan, 3-6-11, pp. 103-104, March 1998. In this method, reverberation is eliminated by subtracting, from the speech power spectrum of the current frame, the speech power spectrum of the frame immediately before the current frame (or of the several previous frames), multiplied by a coefficient. Note that a “frame” means the window over which a Fourier transform is computed to obtain a speech power spectrum.
Although this method itself does not require a large amount of computation, determining the coefficient is a problem because the coefficient depends on the reverberation characteristics of the room. For this reason, a method has been proposed that determines the coefficient by means of a Hidden Markov Model (HMM) and an Expectation Maximization (EM) algorithm, using an acoustic model. See Japanese Patent Application Laid-open Publication No. 2004-347761. However, since this method requires “supervised training,” in which the correct text is given at the time of learning, the preparatory “adaptation” is a burden on the user. Additionally, this method has the disadvantage that the iterative computations of the EM algorithm incur a high computation cost.
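The subtraction performed by the first method can be sketched as follows. This is a minimal illustration, not the implementation of the cited references: the function name `subtract_previous_frame`, the use of a single fixed coefficient, the use of the observed (not already processed) previous frame, and the flooring of negative results are all assumptions made for the example.

```python
import numpy as np

def subtract_previous_frame(power_spec, coeff, floor=0.0):
    """Subtract the scaled previous-frame power spectrum from each frame.

    power_spec: array of shape (num_frames, num_bins), observed power spectra.
    coeff:      hypothetical scalar coefficient; in practice it depends on
                the reverberation characteristics of the room.
    floor:      assumed lower bound, since a power value cannot be negative.
    """
    power_spec = np.asarray(power_spec, dtype=float)
    out = power_spec.copy()
    # The first frame has no predecessor and is left unchanged.
    out[1:] -= coeff * power_spec[:-1]
    return np.maximum(out, floor)
```

The operation is a single multiply-subtract per frequency bin, which is why the computation amount is small; the difficulty described above lies entirely in choosing `coeff` for an unknown room.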
A second conventional dereverberation method uses an inverse filter. Provided that the environment in which the automatic speech recognition apparatus is used is known, a filter for dereverberation can be formed by finding the transfer function of the room in advance and then finding its inverse filter. See Emura and Kataoka (NTT Laboratory), “Regarding Blind Dereverberation from Multi-channel Speech Signals,” Proceedings of the Acoustical Society of Japan Spring Meeting (March 2006).
When the automatic speech recognition apparatus is an embedded apparatus, however, equipping it with the plural microphones needed for multi-channel processing is not realistic. Additionally, designing an inverse filter is often difficult in practice because the impulse response measured or determined as the propagation characteristics is not minimum-phase in some cases.
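The minimum-phase condition can be checked numerically. The sketch below assumes that the room response is available as a finite impulse response; the helper name `is_minimum_phase` and its tolerance are hypothetical choices made for illustration. A causal, stable inverse of such a response exists only when every zero of its transfer function lies strictly inside the unit circle.

```python
import numpy as np

def is_minimum_phase(impulse_response, tol=1e-9):
    """Return True if all zeros of the FIR response lie inside the unit circle."""
    h = np.asarray(impulse_response, dtype=float)
    if h.size == 0 or h[0] == 0.0:
        # A leading zero is a pure delay, which has no causal inverse.
        return False
    # H(z) = h[0] + h[1] z^-1 + ... has its zeros at the roots of the
    # polynomial h[0] z^N + h[1] z^(N-1) + ... + h[N].
    zeros = np.roots(h)
    return bool(np.all(np.abs(zeros) < 1.0 - tol))
```

For example, `is_minimum_phase([1.0, 0.5])` is true because the only zero is at -0.5, whereas `is_minimum_phase([1.0, 2.0])` is false because the zero at -2 would become an unstable pole of the inverse filter.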
A third conventional dereverberation method forms a transfer function by regarding comb filter outputs as the original sound. A method is disclosed in which a transfer function is determined by regarding speech in a segment having a harmonic structure as the original sound without reverberation, and by regarding speech in a segment having no harmonic structure as reverberation. In this method, the processing is repeated in order to enhance performance. See Nakatani, T., and Miyoshi, M., “Blind Dereverberation of Single Channel Speech Signal Based on Harmonic Structure,” Proc. ICASSP-2003, vol. 1, pp. 92-95 (April 2003).
As preprocessing for automatic speech recognition, this method is considered to involve fundamental problems: for example, the existence of consonants is disregarded, and fluctuation of F0 (the fundamental frequency) is presupposed. Additionally, the cost of computing the comb filter is large.
A fourth conventional dereverberation method shapes a power envelope by using a reverberation time. A method is disclosed in which the power envelope of a speech waveform is reshaped into a steeper form by using the reverberation time of the room as a parameter. See Hirobayashi, Nomura, Koike, and Tohyama, “Speech Waveform Recovery from a Reverberant Speech Signal Using Inverse Filtering of the Power Envelope Transfer Function,” The IEICE Transactions Vol. J81-A, No. 10 (October 1998).
This method presupposes that the reverberation time of the room is available in advance as prior knowledge, or that the reverberation time of the room can be determined by another method.
A fifth conventional dereverberation method uses multi-step linear prediction. A method is disclosed in which the spectrum of a late reverberation component is subtracted from observed speech by whitening the observed speech in advance, performing linear prediction delayed by D samples in the time domain, and regarding the resulting prediction component as the late reverberation component. See Kinoshita, Nakatani and Miyoshi (NTT Laboratory), “Study on Single Channel Dereverberation Method Using Multi-step Linear Prediction,” Proc. of the Acoustical Society of Japan Spring Meeting (March 2006).
This method has the problem that its computation cost is high because it uses a filter having a long tap length corresponding to the reverberation time (D=5000 taps in the example of Kinoshita, Nakatani and Miyoshi (NTT Laboratory), “Study on Single Channel Dereverberation Method Using Multi-step Linear Prediction,” Proc. of the Acoustical Society of Japan Spring Meeting (March 2006)). Additionally, in principle, a linear prediction component delayed by D samples is not completely equal to the reverberation component. In addition, the linear prediction component is not expected to become zero in a part consisting of a prolonged vowel sound, even in an environment without reverberation. Consequently, the spectrum subtraction may cause not only dereverberation but also degradation of the original sound. In the experiment shown in the document, this side effect in the environment without reverberation is considered to be avoided by also using speech processed in the same manner for training the acoustic model.
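The delayed prediction step, and the reason it can remove part of the original sound, can be illustrated with a small least-squares sketch. It is not the implementation of the cited document: the whitening and spectrum subtraction steps are omitted, and the function name `delayed_linear_prediction`, the least-squares fit, and the tiny filter order are assumptions made for the example.

```python
import numpy as np

def delayed_linear_prediction(x, delay, order):
    """Predict x[n] from x[n-delay], ..., x[n-delay-order+1] by least squares.

    Returns the prediction coefficients and the predicted signal (zero where
    no prediction is available).
    """
    if delay < 1 or order < 1:
        raise ValueError("delay and order must be positive")
    x = np.asarray(x, dtype=float)
    start = delay + order - 1  # first index with a full set of past samples
    # Row for time n holds x[n-delay], x[n-delay-1], ..., x[n-delay-order+1].
    rows = [x[n - delay - order + 1 : n - delay + 1][::-1]
            for n in range(start, len(x))]
    past = np.array(rows)
    coeffs, *_ = np.linalg.lstsq(past, x[start:], rcond=None)
    predicted = np.zeros_like(x)
    predicted[start:] = past @ coeffs
    return coeffs, predicted

# A sustained sinusoid, standing in for a prolonged vowel with no reverberation.
tone = np.sin(0.3 * np.arange(400))
_, tone_prediction = delayed_linear_prediction(tone, delay=10, order=2)
```

Here `tone_prediction` reproduces `tone` almost exactly even though the signal contains no reverberation, so treating the prediction component as late reverberation and subtracting it would remove the original sound itself. The cost issue is also visible: the least-squares problem grows with `order`, which in the cited example is in the thousands of taps.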
As described above, the conventional dereverberation methods require either a large amount of computation or prior knowledge (such as the reverberation time of a room). If a large amount of computation is required, it is impossible in practice to implement any of these methods in an embedded automatic speech recognition apparatus, which must operate with limited CPU resources and respond in real time. Additionally, after an automatic speech recognition apparatus is delivered to a user, prior knowledge such as the reverberation time of the room cannot be utilized.