Noise suppression devices for suppressing nonobjective signals such as noises mixed into speech signals are known, one of which has been disclosed in, for example, Japanese Patent Application Laid-Open No. 7-306695. The noise suppression device as disclosed by this Japanese application is based on what is called the spectral subtraction method, wherein noises are suppressed over an amplitude spectrum, as suggested by Steven F. Boll, “Suppression of Acoustic Noise in Speech using Spectral Subtraction,” IEEE Trans. ASSP, Vol. ASSP-27, No. 2, April 1979.
FIG. 1 is a block diagram showing a configuration of a conventional noise suppression device disclosed in the above-identified Japanese application. In the figure, reference numeral 111 denotes an input terminal; 112, a framing/windowing circuit; 113, an FFT circuit; 114, a frequency division circuit; 115, a noise estimation circuit; 116, a speech estimation circuit; 117, a Pr(Sp) calculating circuit; 118, a Pr(Sp|Y) calculating circuit; 119, a maximum likelihood filter; 120, a soft decision suppression circuit; 121, a filter processing circuit; 122, a band conversion circuit; 123, a spectrum correction circuit; 124, an IFFT circuit; 125, an overlap-and-add circuit; and 126 denotes an output terminal.
FIG. 2 is a block diagram showing a configuration of the noise estimation circuit 115 in the conventional noise suppression device. In the figure, reference numeral 115A denotes an RMS calculating circuit; 115B, a relative energy calculating circuit; 115C, a minimum RMS calculating circuit; and 115D denotes a maximum signal calculating circuit.
The operation will be explained below.
An input signal y[t] containing a speech component and a noise component is supplied to the input terminal 111. The input signal y[t], which is a digital signal having the sampling frequency of FS, is fed to the framing/windowing circuit 112 where it is divided into frames each having a length equal to FL samples, for example 160 samples, and windowing is performed prior to the subsequent FFT processing.
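The framing and windowing step may be sketched as follows. This is only an illustration under stated assumptions: the text gives the frame length FL = 160 samples, but the frame advance (HOP) and the window shape (a Hann window here) are not stated and are chosen hypothetically.

```python
import math

FL = 160   # frame length in samples (example value given in the text)
HOP = 80   # frame advance in samples; an assumption, not stated in the text


def hann(n):
    """Periodic Hann window of length n (the window shape is an assumption)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def frame_and_window(y, frame_len=FL, hop=HOP):
    """Split y into overlapping frames and apply the window to each frame."""
    win = hann(frame_len)
    frames = []
    for start in range(0, len(y) - frame_len + 1, hop):
        segment = y[start:start + frame_len]
        frames.append([s * w for s, w in zip(segment, win)])
    return frames


frames = frame_and_window([1.0] * 400)
```

Each windowed frame would then be zero-padded as needed and passed to the FFT processing.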
The FFT circuit 113 performs 256-point FFT processing to produce frequency spectral amplitude values, which are divided by the frequency division circuit 114 into, for example, 18 bands.
The noise estimation circuit 115 distinguishes the noise in the input signal y[t] from the speech and detects a frame which is estimated to be the noise. The operation of the noise estimation circuit 115 is explained below by referring to FIG. 2.
In FIG. 2, the input signal y[t] is fed to a root-mean-square value (RMS) calculating circuit 115A where short-term RMS values are calculated on the frame basis. The short-term RMS values are supplied to the relative energy calculating circuit 115B, the minimum RMS calculating circuit 115C, the maximum signal calculating circuit 115D and the noise spectrum estimating circuit 115E. The noise spectrum estimating circuit 115E is fed with outputs of the relative energy calculating circuit 115B, the minimum RMS calculating circuit 115C and the maximum signal calculating circuit 115D, while being fed with an output of the frequency division circuit 114.
The RMS calculating circuit 115A calculates an RMS value RMS[k] for each frame according to equation (1). The relative energy calculating circuit 115B calculates the relative energy dB_rel[k] of the current frame with respect to the decay energy (decay time 0.65 second) carried over from the previous frame.
$$
\begin{aligned}
\mathrm{RMS}[k] &= \sqrt{\sum_{t=1}^{FL} y^{2}[t]} \\
\mathrm{dB\_rel}[k] &= 10\log_{10}\left(\mathrm{E\_dec}[k] / E[k]\right) \\
E[k] &= \sum y^{2}[t] \\
\mathrm{E\_dec}[k] &= \max\left(E[k],\ \exp\left(-\frac{FL}{0.65 \cdot FS}\right)\mathrm{E\_dec}[k-1]\right)
\end{aligned}
\tag{1}
$$
The minimum RMS calculating circuit 115C calculates the current frame's minimum noise RMS value MinNoise_short and a long-term minimum noise RMS value MinNoise_long which is updated every 0.6 second so as to evaluate the background noise level. The long-term minimum noise RMS value MinNoise_long is used alternatively when the minimum noise RMS value MinNoise_short cannot track or follow sharp changes in the noise level.
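The frame-level quantities of equation (1), from which the minimum-noise tracking starts, may be sketched as follows. The sampling frequency FS = 8000 Hz is an assumed value, the exponent is read as FL/(0.65·FS), and the handling of an all-zero frame is a choice made for this sketch only.

```python
import math

FL = 160      # frame length in samples (example value given in the text)
FS = 8000     # sampling frequency in Hz; an assumed value
DECAY = 0.65  # decay time in seconds (given in the text)


def frame_energy(frame):
    """E[k]: sum of squared samples of one frame."""
    return sum(s * s for s in frame)


def frame_rms(frame):
    """RMS[k] as written in equation (1): square root of the summed squares."""
    return math.sqrt(frame_energy(frame))


def relative_energy_db(frames):
    """dB_rel[k] for each frame, tracking the decaying energy E_dec[k]."""
    decay = math.exp(-FL / (DECAY * FS))
    e_dec = 0.0
    result = []
    for frame in frames:
        e = frame_energy(frame)
        e_dec = max(e, decay * e_dec)
        if e > 0.0:
            result.append(10.0 * math.log10(e_dec / e))
        else:
            result.append(float("inf"))  # silent frame; a choice for this sketch
    return result


db_rel = relative_energy_db([[1.0] * FL, [0.1] * FL])
```

A frame that is as loud as the decayed peak yields 0 dB, while a quiet frame that follows a loud one yields a large positive value.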
The maximum signal calculating circuit 115D calculates the current frame's maximum signal RMS value MaxSignal_short, and a long-term maximum signal RMS value MaxSignal_long which is updated, for example, every 0.4 second. The long-term maximum signal RMS value MaxSignal_long is used alternatively when the current frame's maximum signal RMS value MaxSignal_short cannot follow sharp changes in the signal level. The maximum SNR value MaxSNR of the current frame signal may be estimated by employing the short-term maximum signal RMS value MaxSignal_short and the short-term minimum noise RMS value MinNoise_short. In addition, using the maximum SNR value MaxSNR, a normalized parameter NR_level in a range from 0 to 1 indicating the relative noise level is calculated.
Then, the noise spectrum estimation circuit 115E determines whether the mode of the current frame is speech or noise by using the values calculated by the relative energy calculating circuit 115B, the minimum RMS calculating circuit 115C and the maximum signal calculating circuit 115D. If the current frame is determined to be noise, the time averaged estimated value of the noise spectrum N[w, k] is updated by the signal spectrum Y[w, k] of the current frame, where w denotes the band number of the bands produced through the band division.
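The update of the time averaged noise spectrum may be illustrated as follows. The text does not give the averaging rule, so this sketch assumes a simple exponential average with a hypothetical smoothing coefficient BETA; the speech/noise decision is taken as an input.

```python
BETA = 0.9  # hypothetical smoothing coefficient; not given in the text


def update_noise_estimate(n_prev, y_frame, is_noise, beta=BETA):
    """Return N[w, k] from N[w, k-1] and Y[w, k]; only noise frames update it."""
    if not is_noise:
        return list(n_prev)
    return [beta * n + (1.0 - beta) * y for n, y in zip(n_prev, y_frame)]


held = update_noise_estimate([1.0, 2.0], [3.0, 4.0], is_noise=False)
updated = update_noise_estimate([1.0, 2.0], [3.0, 4.0], is_noise=True, beta=0.5)
```

Because the estimate is an average over past noise frames, it lags behind the spectrum of any individual frame.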
The speech estimation circuit 116 in FIG. 1 calculates the SN ratio in each of the frequency bands w produced through the band division. First, a rough estimated value S′[w, k] of the speech spectrum is calculated in accordance with the following equation (2) by assuming a noise-free condition (clean condition). The rough estimated value S′[w, k] of the speech spectrum may be employed for calculating the probability Pr(Sp|Y) to be explained later. ρ in the equation (2) is a predetermined constant and is set to, for example, 1.0.
$$
S'[w,k] = \sqrt{\max\left(0,\ Y[w,k]^{2} - \rho\,N[w,k]^{2}\right)} \tag{2}
$$
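As a minimal sketch of equation (2), the rough speech estimate can be computed per band as below. The function name and the use of plain Python lists are choices made for illustration.

```python
import math

RHO = 1.0  # example value of the constant given in the text


def rough_speech_estimate(y_spec, n_spec, rho=RHO):
    """S'[w, k] per band: power subtraction floored at zero, then square root."""
    return [math.sqrt(max(0.0, y * y - rho * n * n))
            for y, n in zip(y_spec, n_spec)]


s_rough = rough_speech_estimate([5.0, 1.0], [3.0, 2.0])
```

A band whose amplitude is below the noise estimate is clamped to zero instead of producing a negative power.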
Then, using the above described speech spectral rough estimated value S′[w, k] and the speech spectral estimated value S[w, k−1] of the immediately preceding frame, the speech estimation circuit 116 calculates the current frame's speech spectrum estimated value S[w, k]. Using the calculated speech spectrum estimated value S[w, k] and the noise spectrum estimated value N[w, k] fed from the noise spectrum estimation circuit 115E, the subband-based SN ratio SNR[w, k] is calculated in accordance with the following equation:
$$
\mathrm{SNR}[w,k] = 20\log_{10}\left(\frac{0.2\,S[w-1,k] + 0.6\,S[w,k] + 0.2\,S[w+1,k]}{0.2\,N[w-1,k] + 0.6\,N[w,k] + 0.2\,N[w+1,k]}\right) \tag{3}
$$
Then, to cope with a wide range of the noise/speech level, a variable value SN ratio SNR_new[w, k] is calculated in accordance with the following equation (4) by use of the SN ratio SNR[w, k] of each of the subbands. MIN_SNR( ) in equation (4) is a function to determine the minimum value of SNR_new[w, k], and the argument snr is a synonym for the subband SN ratio SNR[w, k].
$$
\begin{aligned}
\mathrm{SNR\_new}[w,k] &= \max\left(\mathrm{MIN\_SNR}(\mathrm{SNR}[w,k]),\ S'[w,k]/N[w,k]\right) \\
\mathrm{MIN\_SNR}(snr) &=
\begin{cases}
3 & snr < 10 \\
3 - \dfrac{snr - 10}{35} \times 1.5 & 10 \le snr \le 45 \\
1.5 & \text{else}
\end{cases}
\end{aligned}
\tag{4}
$$
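Equations (3) and (4) may be sketched together as follows. The treatment of the lowest and highest bands (the edge band is reused as its own missing neighbour) and the small floor EPS that guards the logarithm are assumptions of this sketch; the text specifies neither.

```python
import math

EPS = 1e-12  # guard against log of zero; an assumption of this sketch


def subband_snr(s_spec, n_spec):
    """Equation (3): SNR in dB per band with 0.2/0.6/0.2 weights over neighbours."""
    last = len(s_spec) - 1
    snr = []
    for w in range(len(s_spec)):
        lo, hi = max(w - 1, 0), min(w + 1, last)  # edge handling is assumed
        num = 0.2 * s_spec[lo] + 0.6 * s_spec[w] + 0.2 * s_spec[hi]
        den = 0.2 * n_spec[lo] + 0.6 * n_spec[w] + 0.2 * n_spec[hi]
        snr.append(20.0 * math.log10(max(num, EPS) / max(den, EPS)))
    return snr


def min_snr(snr):
    """Equation (4): lower limit, falling from 3 to 1.5 as snr goes from 10 to 45."""
    if snr < 10.0:
        return 3.0
    if snr <= 45.0:
        return 3.0 - (snr - 10.0) / 35.0 * 1.5
    return 1.5


def snr_new(snr, s_rough, n):
    """Equation (4): the larger of the lower limit and the ratio S'/N."""
    return max(min_snr(snr), s_rough / n)


snr_db = subband_snr([10.0, 10.0, 10.0], [1.0, 1.0, 1.0])
```

The piecewise function is continuous at both break points: it equals 3 at snr = 10 and 1.5 at snr = 45.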
The value SNR_new[w, k] obtained above is an instantaneous subband SN ratio which limits the minimum value of the subband SN ratio in the current frame. For a speech portion signal having a high SN ratio on the whole, this SNR_new[w, k] allows the minimum value taken by the subband SN ratio to decrease to 1.5 (dB). Meanwhile, the subband SN ratio cannot be lowered below 3 (dB) for a noise portion signal having a low instantaneous SN ratio.
The Pr(Sp) calculating circuit 117 calculates a probability Pr(Sp), which indicates the probability that speech is present in the input signal under the assumption of a noise-free condition. This probability Pr(Sp) is calculated using the NR_level parameter obtained by the maximum signal calculating circuit 115D.
The Pr(Sp|Y) calculating circuit 118 calculates a probability Pr(Sp|Y) which indicates the probability that speech is present in the actual input signal y[t] having noise mixed thereinto. This probability Pr(Sp|Y) is calculated by using the probability Pr(Sp) supplied from the Pr(Sp) calculating circuit 117 and the subband SN ratio SNR_new[w, k] obtained in accordance with the equation (4). In the calculation of the probability Pr(Sp|Y), the probability Pr(H1|Y)[w, k] means the probability of a speech event H1 in each of the subbands w of the spectrum amplitude signal Y[w, k], wherein the speech event H1 is the event that, in a case where the input signal y[t] of the current frame is a sum of the speech signal s[t] and the noise signal n[t], the speech signal s[t] exists therein. As the SNR_new[w, k] increases, for example, the probability Pr(H1|Y)[w, k] approaches 1.0.
In the maximum likelihood filter 119, using the spectral amplitude signal Y[w, k] from the frequency division circuit 114 and the noise spectral amplitude signal N[w, k] from the noise estimation circuit 115, the noise-removed spectral signal H[w, k] is calculated by removing the noise signal N from the spectral amplitude signal Y in accordance with the following equation (5):
$$
H[w,k] =
\begin{cases}
\alpha + (1 - \alpha)\cdot\sqrt{Y^{2} - N^{2}}\,/\,Y ; & Y > 0 \ \text{and} \ Y \ge N \\
\alpha ; & \text{else}
\end{cases}
\tag{5}
$$
In the soft decision suppression circuit 120, using the noise-removed spectral signal H[w, k] from the maximum likelihood filter 119 and the probability Pr(H1|Y)[w, k] from the Pr(Sp|Y) calculating circuit 118, spectral amplitude suppression in accordance with the following equation (6) is given to the noise-removed spectral signal H[w, k] so as to output a spectral suppressed signal Hs[w, k] on the subband basis. MIN_GAIN in the equation (6) is a predetermined constant meaning the minimum gain and is set to, for example, 0.1 (−15 dB). According to the equation (6), the amplitude suppression given to the noise-removed spectral signal H[w, k] is lightened when the speech signal presence probability Pr(H1|Y)[w, k] is close to 1.0. Meanwhile, when the probability Pr(H1|Y)[w, k] is close to 0.0, the noise-removed spectral signal H[w, k] is amplitude-suppressed to the minimum gain MIN_GAIN.
$$
Hs[w,k] = \Pr(H1|Y)[w,k] \cdot H[w,k] + \left(1 - \Pr(H1|Y)[w,k]\right)\cdot \mathrm{MIN\_GAIN} \tag{6}
$$
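The maximum likelihood filter of equation (5) followed by the soft decision of equation (6) may be sketched per subband as follows. The value of ALPHA is hypothetical, since the text gives no value for α, and the speech presence probability is simply passed in, because its calculation is not given in closed form.

```python
import math

ALPHA = 0.7     # hypothetical value; the text does not specify alpha
MIN_GAIN = 0.1  # example value given in the text


def ml_filter(y, n, alpha=ALPHA):
    """Equation (5): H[w, k] for one subband amplitude y and noise estimate n."""
    if y > 0.0 and y >= n:
        return alpha + (1.0 - alpha) * math.sqrt(y * y - n * n) / y
    return alpha


def soft_decision(h, p_speech, min_gain=MIN_GAIN):
    """Equation (6): blend H with MIN_GAIN according to Pr(H1|Y)."""
    return p_speech * h + (1.0 - p_speech) * min_gain


h = ml_filter(5.0, 3.0, alpha=0.5)
hs_speech = soft_decision(0.9, 1.0)
hs_noise = soft_decision(0.9, 0.0)
```

With a probability of 1.0 the filter output passes unchanged, and with a probability of 0.0 the result falls to MIN_GAIN regardless of H.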
In the filter processing circuit 121, the spectral suppressed signal Hs[w, k] from the soft decision suppression circuit 120 is smoothed along both the frequency axis and the time axis in order to reduce the perceivable discontinuities in the spectral suppressed signal Hs[w, k]. In the band conversion circuit 122, the smoothed signals fed from the filter processing circuit 121 are converted to extended bands through interpolation.
In the spectrum correction circuit 123, the imaginary part of the FFT coefficients of the input signal obtained at the FFT circuit 113 and the real part of the FFT coefficients obtained at the band conversion circuit 122 are multiplied by the output signal of the frequency division circuit 114 to carry out spectrum correction.
The IFFT circuit 124 executes inverse FFT processing on the signal obtained at the spectrum correction circuit 123. The overlap-and-add circuit 125 executes overlap processing on each frame's boundary portion of the IFFT output signal for each frame. The noise-reduced signal is output from the output terminal 126.
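The overlap-and-add step may be sketched as follows. The frame advance (hop) is an assumption, since the text does not state by how many samples successive frames are shifted.

```python
def overlap_add(frames, hop):
    """Sum equal-length frames into one signal, each shifted by hop samples."""
    if not frames:
        return []
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        start = i * hop
        for j, sample in enumerate(frame):
            out[start + j] += sample
    return out


signal = overlap_add([[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]], hop=2)
```

Samples in the region where two frames overlap are summed, which smooths the transition at frame boundaries.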
As described so far, the conventional noise suppression device is configured in such a way that even when the noise/speech level of the input signal changes, the amount of noise suppression can be optimized in response to the subband SN ratios. For a speech signal portion having a high SN ratio as a whole, for example, since the minimum value of each subband SN ratio is set to a low value, it is possible to reduce the amount of amplitude suppression in low SN ratio subbands and therefore prevent low level speech signals from being suppressed. In addition, for a noise portion signal having a low SN ratio as a whole, since the minimum value of each subband SN ratio is set to a high value, it is possible to give sufficient amplitude suppression to low SN ratio subbands and therefore suppress perceivable noise.
In the conventional noise suppression device configured as described above, the amount of noise suppression should be uniform along the frequency axis over the whole band so as not to cause residual noise. However, since the estimated noise spectrum of the current frame is obtained by averaging past noise spectra, the estimated noise spectrum may not equal the actual noise spectrum. This results in errors in the estimated subband SN ratios, making it impossible to give a uniform amount of noise suppression along the frequency axis over the whole band.
In practice, if a noise frame has high power spectral components in a specific subband, this subband is regarded as speech having a high SN ratio and is therefore not given sufficient noise suppression. This makes the suppression characteristics non-uniform over the whole band and results in residual noise. Because the conventional method performs control depending on the estimated noise spectrum and the estimated subband SN ratios, appropriate noise suppression is impossible if the estimated noise spectrum is not correct.
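The non-uniform suppression described above can be illustrated numerically with the gain of equation (5) alone. The band values and ALPHA below are hypothetical and are chosen only to show the effect: the averaged noise estimate is flat, while the actual noise frame is momentarily strong in one band.

```python
import math

ALPHA = 0.5  # hypothetical value; the text does not specify alpha


def ml_gain(y, n, alpha=ALPHA):
    """Gain of equation (5) for one subband amplitude y and noise estimate n."""
    if y > 0.0 and y >= n:
        return alpha + (1.0 - alpha) * math.sqrt(y * y - n * n) / y
    return alpha


noise_estimate = [1.0, 1.0, 1.0, 1.0]  # time averaged estimate (hypothetical)
noise_frame = [1.0, 1.0, 4.0, 1.0]     # actual noise frame, strong in band 2

gains = [ml_gain(y, n) for y, n in zip(noise_frame, noise_estimate)]
```

Band 2 receives a gain close to 1 while the other bands receive ALPHA, so the strong noise component passes almost unsuppressed and remains as residual noise.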
The present invention is directed to the above-mentioned problem, and it is an object of the present invention to provide a noise suppression device which reduces residual noise in noise frames in a simple way and is free from quality deterioration in noisy environments regardless of noise level fluctuations.