In the subsequent description, symbols such as “˜” should be printed above a letter but will be placed after the letter because of the limitation of text notation. These symbols are printed in the correct positions in formulae, however.
If an acoustic signal is picked up in a noisy environment, that acoustic signal includes the sound to be picked up (hereinafter also referred to as “desired sound”) on which noise is superimposed. If the desired sound is speech, the clarity of speech contained in the observed acoustic signal would be lowered greatly because of the superimposed noise. This would make it difficult to extract the properties of the desired sound, significantly lowering the recognition rate of automatic speech recognition (hereinafter also referred to simply as “speech recognition”) systems. If a noise estimation technology is used to estimate noise, and the estimated noise is eliminated by some method, the clarity of speech and the speech recognition rate can be improved. Improved minima-controlled recursive averaging (IMCRA hereinafter) in Non-patent literature 1 is a known conventional noise estimation technology.
Prior to a description of IMCRA, an observed acoustic signal model used in the noise estimation technology will be described. In general speech enhancement, an observed acoustic signal (hereinafter referred to briefly as “observed signal”) y_n observed at time n includes a desired sound component and a noise component. Signals corresponding to the desired sound component and the noise component are respectively referred to as a desired signal and a noise signal and are respectively denoted by x_n and v_n. One purpose of speech enhancement processing is to restore the desired signal x_n on the basis of the observed signal y_n. Letting the signals after short-term Fourier transformation of the signals y_n, x_n, and v_n be Y_{k,t}, X_{k,t}, and V_{k,t}, where k is a frequency index having values of 1, 2, . . . , K (K is the total number of frequency bands), the observed signal in the current frame t is expressed as follows.
$$Y_{k,t} = X_{k,t} + V_{k,t} \tag{1}$$
In the subsequent description, it is assumed that this processing is performed in each frequency band, and for simplicity, the frequency index k will be omitted. The desired signal and the noise signal are assumed to follow zero-mean complex Gaussian distributions with variance σ_x^2 and variance σ_v^2 respectively.
The observed signal has a segment where the desired sound is present (“speech segment” hereinafter) and a segment where the desired sound is absent (“non-speech segment” hereinafter), and the segments can be expressed as follows with a latent variable H having two values H_1 and H_0.
$$Y_t = \begin{cases} X_t + V_t & \text{if } H = H_1 \\ V_t & \text{if } H = H_0 \end{cases} \tag{2}$$
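The two-hypothesis signal model above can be illustrated with a short simulation. The following is only a sketch: it assumes that the desired signal and the noise signal are statistically independent, and the function names, variance values, and sample count are hypothetical choices made for illustration. Under these assumptions the mean power of the observed signal should be close to the noise variance in a non-speech segment and close to the sum of the two variances in a speech segment.

```python
import random


def sample_complex_gaussian(rng, variance):
    """Draw one zero-mean complex Gaussian sample with the given variance.

    The real and imaginary parts each carry half of the total variance.
    """
    scale = (variance / 2.0) ** 0.5
    return complex(rng.gauss(0.0, scale), rng.gauss(0.0, scale))


def sample_observation(rng, speech_present, speech_var, noise_var):
    """Draw Y = X + V when speech is present (H1), or Y = V otherwise (H0)."""
    noise = sample_complex_gaussian(rng, noise_var)
    if speech_present:
        return sample_complex_gaussian(rng, speech_var) + noise
    return noise


def mean_power(samples):
    """Average of |Y|^2 over a list of complex samples."""
    return sum(abs(y) ** 2 for y in samples) / len(samples)


# Hypothetical values chosen only for this illustration.
SPEECH_VAR = 4.0
NOISE_VAR = 1.0
rng = random.Random(0)  # fixed seed so the run is repeatable

non_speech = [sample_observation(rng, False, SPEECH_VAR, NOISE_VAR) for _ in range(20000)]
speech = [sample_observation(rng, True, SPEECH_VAR, NOISE_VAR) for _ in range(20000)]

print("mean power under H0:", mean_power(non_speech))
print("mean power under H1:", mean_power(speech))
```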
The conventional method will be explained next with the variables described above.
IMCRA will be described with reference to FIG. 1. In a conventional noise estimation apparatus 90, first a minimum tracking noise estimation unit 91 obtains a minimum value in a given time segment of the power spectrum of the observed signal to estimate a characteristic (power spectrum) of the noise signal (refer to Non-patent literature 2).
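The idea of taking a minimum value over a given time segment can be sketched as follows for one frequency band. This is a simplified illustration, not the method of Non-patent literature 2, which is understood to include additional steps such as smoothing and bias compensation that are omitted here. The function name, the window length, and the sample values are hypothetical.

```python
def track_minimum_power(power, window):
    """Return, for each frame, the minimum power over the last `window` frames.

    `power` is a list of per-frame power values |Y_t|^2 for one frequency band.
    Frames near the start use as many past frames as are available.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    minima = []
    for t in range(len(power)):
        start = max(0, t - window + 1)
        minima.append(min(power[start:t + 1]))
    return minima


# Hypothetical power values for six consecutive frames.
observed_power = [4.0, 1.0, 9.0, 16.0, 2.0, 25.0]
noise_floor = track_minimum_power(observed_power, window=3)
print(noise_floor)
```

A short window lets the estimate follow changes in the noise more quickly, at the cost of being more easily biased upward when the window contains only speech frames.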
Then, a non-speech prior probability estimation unit 92 obtains the ratio of the power spectrum of the observed signal to the power spectrum of the estimated noise signal and calculates a non-speech prior probability, determining that the segment is a non-speech segment if the ratio is smaller than a given threshold.
A non-speech posterior probability estimation unit 93 next calculates a non-speech posterior probability p(H_0|Y_i; θ˜_i^IMCRA) (1 or 0), assuming that the complex spectra of the observed signal and the noise signal after short-term Fourier transformation follow Gaussian distributions. The non-speech posterior probability estimation unit 93 further obtains a corrected non-speech posterior probability β_{0,i}^IMCRA from the calculated non-speech posterior probability p(H_0|Y_i; θ˜_i^IMCRA) and an appropriately predetermined weighting factor α.
$$\beta_{0,i}^{\mathrm{IMCRA}} = (1-\alpha)\, p(H_0 \mid Y_i; \tilde{\theta}_i^{\mathrm{IMCRA}}) \tag{3}$$
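One way to picture a posterior probability under the Gaussian assumption is a direct application of Bayes' rule to the two hypotheses. The sketch below is not taken from Non-patent literature 1. It assumes that the desired signal and the noise signal are independent, so that the observed spectrum has variance σ_v^2 under H_0 and σ_x^2 + σ_v^2 under H_1, and it treats the non-speech prior probability as a given input. All function and parameter names are hypothetical.

```python
import math


def complex_gaussian_pdf(y, variance):
    """Density of a zero-mean complex Gaussian with the given variance at y."""
    return math.exp(-abs(y) ** 2 / variance) / (math.pi * variance)


def non_speech_posterior(y, noise_var, speech_var, prior_h0):
    """Posterior probability of H0 (speech absent) given one observed coefficient y.

    Assumes Y ~ CN(0, noise_var) under H0 and
    Y ~ CN(0, noise_var + speech_var) under H1 (independence assumption).
    """
    weighted_h0 = prior_h0 * complex_gaussian_pdf(y, noise_var)
    weighted_h1 = (1.0 - prior_h0) * complex_gaussian_pdf(y, noise_var + speech_var)
    return weighted_h0 / (weighted_h0 + weighted_h1)


# A low-power observation is more likely noise only than a high-power one.
quiet = non_speech_posterior(0.1 + 0.0j, noise_var=1.0, speech_var=10.0, prior_h0=0.5)
loud = non_speech_posterior(5.0 + 0.0j, noise_var=1.0, speech_var=10.0, prior_h0=0.5)
print(quiet, loud)
```

With a prior probability of exactly 1 or 0, this expression returns 1 or 0 regardless of the observation, since the other hypothesis receives zero weight.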
A noise estimation unit 94 then estimates a variance σ_{v,i}^2 of the noise signal in the current frame i by using the obtained non-speech posterior probability β_{0,i}^IMCRA, the power spectrum |Y_i|^2 of the observed signal in the current frame, and the estimated variance σ_{v,i-1}^2 of the noise signal in the frame i−1 immediately preceding the current frame i.
$$\sigma_{v,i}^2 = (1-\beta_{0,i}^{\mathrm{IMCRA}})\,\sigma_{v,i-1}^2 + \beta_{0,i}^{\mathrm{IMCRA}}\,|Y_i|^2 \tag{4}$$
By successively updating the estimated variance σ_{v,i}^2 of the noise signal, varying characteristics of non-stationary noise can be followed and estimated.
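The successive update of expressions (3) and (4) can be sketched for one frequency band as follows. This is a minimal illustration that takes the non-speech posterior probability of each frame as a given input. The function names, the initial variance, and the sample values are hypothetical.

```python
def corrected_posterior(p_h0, alpha):
    """Expression (3): scale the non-speech posterior by (1 - alpha)."""
    return (1.0 - alpha) * p_h0


def update_noise_variance(prev_var, y, p_h0, alpha):
    """Expression (4): blend the previous estimate with the current power |Y_i|^2."""
    beta = corrected_posterior(p_h0, alpha)
    return (1.0 - beta) * prev_var + beta * abs(y) ** 2


def track_noise_variance(frames, posteriors, alpha, initial_var):
    """Apply the update frame by frame and return the estimate after each frame."""
    variance = initial_var
    history = []
    for y, p_h0 in zip(frames, posteriors):
        variance = update_noise_variance(variance, y, p_h0, alpha)
        history.append(variance)
    return history


# Hypothetical input: the first frame is judged non-speech, the second speech.
frames = [2.0 + 0.0j, 3.0 + 4.0j]
posteriors = [1.0, 0.0]
estimates = track_noise_variance(frames, posteriors, alpha=0.5, initial_var=1.0)
print(estimates)
```

When the posterior probability is 0, the estimate is carried over unchanged, so the noise variance is updated only in frames judged likely to be non-speech. The weighting factor α limits how far a single frame can move the estimate.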