1. Field of the Invention
The present invention relates to a method and apparatus for suppressing noise in a noisy speech signal.
2. Description of the Related Art
Noise suppression is a technique that estimates, using a frequency-domain signal, the power spectrum of the noise component contained in an input noisy speech signal and subtracts the estimated power spectrum from the noisy speech signal. Because the noise component is estimated continuously, the technique is also useful for suppressing nonstationary noise. A noise suppressor of this type is described in Japanese Patent Publication 2002-204175. FIG. 1 illustrates the noise suppressor of this patent publication. As illustrated, samples of a noisy speech signal are supplied to a frame decomposition and windowing circuit 1, which divides the signal into frames of K/2 samples, where K is an even number. The frames are multiplied by a window function w(t). Windowing the nth frame of the noisy speech signal yn(t) (t=0, 1, . . . , (K/2)−1) produces a signal ȳn(t)=w(t)yn(t). For real-valued signals, symmetrical window functions are used. The window function is designed so that, when the noise suppression coefficient is 1, the input and output signals coincide with each other (i.e., w(t)+w(t+K/2)=1). When two consecutive frames are windowed in this way, the well-known Hanning window w(t) is used:
$$
w(t) = \begin{cases} 0.5 + 0.5\cos\left(\dfrac{\pi (t - K/2)}{K/2}\right), & 0 \le t < K \\ 0, & \text{otherwise} \end{cases}
$$

The windowed speech frame ȳn(t) is supplied to a Fourier transform converter 2, where the speech frame is converted to a vector of K frequency spectral speech components Yn=(Yn(0), Yn(1), . . . , Yn(K−1)). This vector of spectral speech components is separated into a vector of K phase components arg Yn=(arg Yn(0), arg Yn(1), . . . , arg Yn(K−1)) and a vector of K amplitude components |Yn|=(|Yn(0)|, |Yn(1)|, . . . , |Yn(K−1)|). The former is supplied to a multiplier 11 and the latter is fed to a squaring circuit 3, where the K amplitude spectral speech components are individually squared in K multipliers 3_0 to 3_{K-1}. The squared values |Yn|²=(|Yn(0)|², |Yn(1)|², . . . , |Yn(K−1)|²) represent the power spectrum of the noisy speech. The outputs of the squaring circuit 3 are supplied to a power spectrum weighting circuit 4 (FIG. 2) where weighting is performed on the K frequency spectral speech components.
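As an illustration of the windowing and power-spectrum steps described above, the following is a minimal sketch in Python. It is not taken from the cited publication: the function names `hanning` and `power_spectrum`, the frame length K=8 and the naive DFT are assumptions chosen only to keep the example self-contained.

```python
import cmath
import math

def hanning(K):
    """Window from the text: 0.5 + 0.5*cos(pi*(t - K/2)/(K/2)) for 0 <= t < K."""
    return [0.5 + 0.5 * math.cos(math.pi * (t - K / 2) / (K / 2)) for t in range(K)]

def power_spectrum(frame):
    """Naive O(K^2) DFT of one windowed frame; returns (power, phase) lists."""
    K = len(frame)
    spectrum = [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / K) for t in range(K))
        for k in range(K)
    ]
    power = [abs(Y) ** 2 for Y in spectrum]    # |Yn(k)|^2
    phase = [cmath.phase(Y) for Y in spectrum]  # arg Yn(k)
    return power, phase

K = 8  # hypothetical frame length (must be even)
w = hanning(K)
# Overlap condition from the text: w(t) + w(t + K/2) = 1
overlap_sums = [w[t] + w[t + K // 2] for t in range(K // 2)]
# Windowing a constant signal of ones, then transforming it
power, phase = power_spectrum([w[t] * 1.0 for t in range(K)])
```

The list `overlap_sums` checks the condition w(t)+w(t+K/2)=1 numerically for the chosen K.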
In FIG. 2, this power spectrum weighting is achieved first by calculating spectral signal-to-noise ratios using an array of dividers 41_0 to 41_{K-1} to divide the K speech power components |Yn|² by a vector of K noise power spectral components λn-1, which were estimated during the previous frame in a noise estimation circuit 5 and stored in a memory 42, producing a vector of SNR values γ̂n=|Yn|²/λn-1. These SNR values are then subjected to nonlinear processing through a vector of nonlinear weighting circuits 43_0 to 43_{K-1}, each having a nonlinear function of the form:
$$
f_2 = \begin{cases} 1, & f_1 \le a \\ \dfrac{f_1 - b}{a - b}, & a < f_1 < b \\ 0, & b < f_1 \end{cases}
$$

where "a" and "b" are arbitrary real numbers. Each nonlinear weighting circuit 43 produces a weight value that equals 1 when the input SNR value is smaller than "a" and 0 when it is larger than "b", and that decreases linearly from 1 to 0 as the SNR value increases from "a" to "b". Finally, the K input spectral speech power components |Yn|² are multiplied respectively by the K weighting factors using a spectral multiplier 44 to produce a vector of weighted power spectral speech components. This vector of weighted power spectral speech components is supplied to a noise estimation circuit 5 (FIG. 3), to which the spectral power speech components |Yn|² are also supplied from the squaring circuit 3. The purpose of the nonlinear weighting by the circuits 43 is to reduce the adverse effect of the voiced components of the noisy speech power spectrum on the estimation of its noise components.
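The weighting described above can be sketched as follows. This is only an illustrative sketch: the names `nonlinear_weight` and `weight_power_spectrum` and the default values a=2.0 and b=10.0 are assumptions, and the value at exactly f1=b, which the piecewise definition leaves open, is taken here as 0.

```python
def nonlinear_weight(snr, a, b):
    """Piecewise-linear weight f2 as a function of the SNR value f1 (assumes a < b)."""
    if snr <= a:
        return 1.0
    if snr >= b:  # assumption: the boundary f1 == b is treated as weight 0
        return 0.0
    return (snr - b) / (a - b)

def weight_power_spectrum(power, prev_noise, a=2.0, b=10.0):
    """Weight each |Yn(k)|^2 by f2 of its SNR against the previous noise estimate.

    a and b are hypothetical values, not taken from the source.
    """
    return [
        nonlinear_weight(p / lam, a, b) * p  # SNR = |Yn(k)|^2 / lambda_{n-1}(k)
        for p, lam in zip(power, prev_noise)
    ]

weighted = weight_power_spectrum([1.0, 6.0, 12.0], [1.0, 1.0, 1.0])
```

With these assumed thresholds, a component at an SNR of 6 is weighted by 0.5, while a component at an SNR of 12 is removed from the noise estimate entirely.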
In FIG. 3, the K weighted spectral power speech components from the power spectrum weighting circuit 4 and the K non-weighted spectral power speech components from the squaring circuit 3 are respectively processed through noise calculators 50_0 to 50_{K-1}. In each noise calculator 50, the weighted component is passed through a gate 54 of a register update decision circuit 51 to a shift register 55 when the gate 54 is turned ON in response to a "1" from an OR gate 511, so that the shift register 55 is updated with a new spectral component. This shift-register update occurs when an initial period detector 512 supplies a "1" to the OR gate 511 during the initial start-up time of the noise suppressor, or when the magnitude of the non-weighted power spectral component is low, indicating a speech-absent signal or a low-level voiced signal. In the latter case, a comparator 515 supplies a "1" to the OR gate 511 after comparison with a decision threshold that was stored in a memory 514 during the previous frame interval by a threshold calculator 513. A sample counter 59 increments its count value in response to a logical-1 output from the OR gate 511 to determine the number of weighted power spectral components stored in the shift register 55 during each frame interval. The counter is reset to zero when the count value becomes equal to the length of the shift register 55. The output of the counter 59 is compared in a minimum selector 57 with the length of the shift register 55, and the minimum selector 57 selects the smaller of the two as a value M. The total sum of the M components Bn,0(k), Bn,1(k), . . . , Bn,M−1(k) stored in the shift register 55 during a frame "n" is calculated by an adder 56 and divided by the value M in a division circuit 58 to produce an output λn(k) as follows:
$$
\lambda_n(k) = \frac{1}{M} \sum_{m=0}^{M-1} B_{n,m}(k)
$$
Since the output of sample counter 59 increases monotonically from the instant the noise suppressor is started, the division operation proceeds using initially the sample counter output. As the process continues, the sample counter 59 increases its output and eventually becomes higher than the register length, whereupon the division operation proceeds using the register length as a divisor. When the register length is used, the division output λn represents an average power of the total sum of the weighted power spectral speech components. The quotient value λn of the division operation is supplied to the threshold calculator 513, which multiplies the input value by a predetermined number or by a high-order polynomial or non-linear function, to produce a decision threshold to be used in the comparator 515 during the next frame. The quotient λn is the estimated noise that is supplied as a feedback signal to the power spectrum weighting circuit 4 and stored in its memory 42 to update the weighted power spectral noise components for the next frame.
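The per-component noise estimation described above can be sketched for a single frequency index k as follows. This is a simplified sketch under stated assumptions: the class name `BinNoiseEstimator`, the register length, the length of the initial period, and a threshold formed by multiplying λn by a constant factor are all hypothetical choices. A bounded `collections.deque` stands in for the shift register 55, so its current length plays the role of the value M.

```python
from collections import deque

class BinNoiseEstimator:
    """Moving-average noise estimate for one frequency index k (illustrative only)."""

    def __init__(self, register_length=4, initial_frames=2, threshold_factor=2.0):
        self.register = deque(maxlen=register_length)  # stands in for shift register 55
        self.initial_frames = initial_frames           # hypothetical start-up period
        self.threshold_factor = threshold_factor       # hypothetical constant multiplier
        self.threshold = 0.0                           # decision threshold from previous frame
        self.frames_seen = 0

    def update(self, weighted_power, raw_power):
        """Return lambda_n(k) after processing one frame."""
        in_initial_period = self.frames_seen < self.initial_frames
        self.frames_seen += 1
        # OR of "initial period" and "non-weighted power below the stored threshold"
        if in_initial_period or raw_power < self.threshold:
            self.register.append(weighted_power)
        m = len(self.register)  # M = min(number of stored components, register length)
        if m == 0:
            return 0.0          # nothing stored yet (only if initial_frames is 0)
        noise = sum(self.register) / m
        self.threshold = self.threshold_factor * noise  # used during the next frame
        return noise

est = BinNoiseEstimator(register_length=2, initial_frames=1, threshold_factor=2.0)
history = [est.update(1.0, 1.0), est.update(100.0, 100.0),
           est.update(3.0, 1.5), est.update(5.0, 3.0)]
```

In this run the second frame is loud, so it is not stored and the estimate stays unchanged, while the last two frames fall below the threshold and replace the oldest entries in the register.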
Returning to FIG. 1, in an a-posteriori SNR (signal-to-noise ratio) calculator 6, the speech power spectral components |Yn|² of the squaring circuit 3 are respectively divided by the estimated noise power spectral components λn of the noise estimation circuit 5 to produce a vector of a-posteriori SNR values γn, which are in turn supplied to an a-priori SNR estimation circuit 7 (FIG. 4).
In FIG. 4, the a-posteriori SNR values γn are each summed with "−1" in adders 70, producing a vector of {γn(0)−1}, {γn(1)−1}, . . . , {γn(K−1)−1}, which is restricted in range in a range restriction circuit 71 using maximum selectors 71_0 to 71_{K-1}. The maximum selectors compare their input with zero and select the greater of the two according to the relation P[x]=x if x>0 and P[x]=0 if x≤0, and deliver outputs P[γn(k)−1] to multiply-and-add circuits 77_0 to 77_{K-1}. The a-posteriori SNR values γn(k) from the a-posteriori SNR calculator 6 are also stored in a memory 72 for a frame interval and then supplied to a multiplier 75 as a vector of previous-frame a-posteriori SNR values γn-1(0) to γn-1(K−1). These previous-frame a-posteriori SNR values are multiplied by a vector of squared corrected noise suppression coefficients of the previous frame, Ḡn-1², supplied from a squaring circuit 74, and the resulting vector γn-1Ḡn-1² is supplied to the multiply-and-add circuits 77_0 to 77_{K-1} as a vector of estimated SNR values of the previous frame. To generate Ḡn-1², a vector of corrected noise suppression coefficients Ḡn is received from a noise suppression coefficients corrector 9, stored in a memory 73 for a frame interval, and squared in the squaring circuit 74. In each multiply-and-add circuit 77, the input signal P[γn(k)−1] from the corresponding maximum selector 71 is multiplied in a multiplier 771 by a factor (1−α) (where α is a weight value), and the previous-frame estimated SNR value γn-1(k)Ḡn-1²(k) from the multiplier 75 is multiplied in a multiplier 772 by the weight value α and summed with the output of the multiplier 771 to produce an estimated a-priori SNR value:

$$
\hat{\xi}_n = \alpha \gamma_{n-1} \bar{G}_{n-1}^2 + (1 - \alpha) P[\gamma_n - 1], \quad \text{where } \bar{G}_{-1}^2 \gamma_{-1} = 1
$$

The estimated a-priori SNR values ξ̂n(0) to ξ̂n(K−1) are supplied to a noise suppression coefficients calculator 8 (FIG. 5) and the noise suppression coefficients corrector 9 (FIG. 6).
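The a-priori SNR estimate described above can be sketched as follows. The function name `a_priori_snr` and the default weight alpha=0.98 are assumptions, since the source does not give a value for α. For the first frame, passing 1.0 for both the previous a-posteriori SNR and the previous coefficient reproduces the stated initial condition.

```python
def a_priori_snr(gamma, prev_gamma, prev_coeff, alpha=0.98):
    """Weighted sum of the previous-frame estimate and the range-restricted current SNR.

    gamma      -- a-posteriori SNR values of the current frame
    prev_gamma -- a-posteriori SNR values of the previous frame
    prev_coeff -- corrected noise suppression coefficients of the previous frame
    alpha      -- weight value (hypothetical default)
    """
    return [
        alpha * pg * c * c + (1.0 - alpha) * max(g - 1.0, 0.0)  # P[x] = max(x, 0)
        for g, pg, c in zip(gamma, prev_gamma, prev_coeff)
    ]

xi_hat = a_priori_snr([3.0, 0.5], [2.0, 4.0], [0.5, 1.0], alpha=0.5)
```

In the second component the current a-posteriori SNR is below 1, so the range restriction sets its contribution to zero and only the previous-frame term remains.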
In FIG. 5, in addition to the estimated a-priori SNR vector ξ̂n=(ξ̂n(0), ξ̂n(1), . . . , ξ̂n(K−1)) from the a-priori SNR estimation circuit 7, the noise suppression coefficients calculator 8 receives the a-posteriori SNR vector γn=(γn(0), γn(1), . . . , γn(K−1)) from the a-posteriori SNR calculator 6. The noise suppression coefficients calculator 8 includes an MMSE-STSA (Minimum Mean Square Error Short Time Spectral Amplitude) gain function value calculator 81 and a GLR (Generalized Likelihood Ratio) calculator 82. For each spectral component, the MMSE-STSA gain function calculator 81 uses the a-posteriori SNR values γn, the a-priori SNR values ξ̂n and a speech absence probability "q" to calculate an MMSE-STSA gain function Gn as follows:
$$
G_n = \frac{\sqrt{\pi}}{2} \frac{\sqrt{\nu_n}}{\gamma_n} \exp\left(-\frac{\nu_n}{2}\right) \left[ (1 + \nu_n) I_0\left(\frac{\nu_n}{2}\right) + \nu_n I_1\left(\frac{\nu_n}{2}\right) \right]
$$

where I0(z) is the zero-order modified Bessel function, I1(z) is the first-order modified Bessel function, νn=(ηnγn)/(1+ηn), and ηn=ξ̂n/(1−q). Using the same values of a-posteriori and a-priori SNR and speech absence probability as those used in the calculator 81, the GLR calculator 82 calculates a vector of K generalized likelihood ratios Λn as follows:
$$
\Lambda_n = \frac{1 - q}{q} \cdot \frac{\exp(\nu_n)}{1 + \eta_n}
$$

The gain function Gn and the GLR value Λn are used in a calculation circuit 83 to provide a noise suppression coefficients corrector 9 (FIG. 6) with a vector of noise suppression coefficients Ḡn given by:
$$
\bar{G}_n = \frac{\Lambda_n}{\Lambda_n + 1} G_n
$$
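The three formulas above can be combined for one spectral component as in the following sketch. The names `bessel_i` and `suppression_coefficient`, the default q=0.2 and the use of a truncated power series for the modified Bessel functions are assumptions made to keep the example free of external libraries. The series is adequate only for moderate arguments, and a practical implementation would more likely use a numerical library.

```python
import math

def bessel_i(order, z, terms=60):
    """Truncated power series for the modified Bessel function I_order(z), order 0 or 1."""
    return sum(
        (z / 2.0) ** (2 * m + order) / (math.factorial(m) * math.factorial(m + order))
        for m in range(terms)
    )

def suppression_coefficient(gamma, xi_hat, q=0.2):
    """Noise suppression coefficient for one component (q is a hypothetical value)."""
    eta = xi_hat / (1.0 - q)
    v = eta * gamma / (1.0 + eta)
    # MMSE-STSA gain function Gn
    gain = (
        (math.sqrt(math.pi) / 2.0)
        * (math.sqrt(v) / gamma)
        * math.exp(-v / 2.0)
        * ((1.0 + v) * bessel_i(0, v / 2.0) + v * bessel_i(1, v / 2.0))
    )
    # Generalized likelihood ratio
    glr = (1.0 - q) / q * math.exp(v) / (1.0 + eta)
    # Gain weighted by GLR / (GLR + 1)
    return glr / (glr + 1.0) * gain

high_snr = suppression_coefficient(gamma=5.0, xi_hat=4.0)
low_snr = suppression_coefficient(gamma=5.0, xi_hat=0.1)
```

For the same a-posteriori SNR, the component with the larger a-priori SNR receives a coefficient closer to 1 and is therefore attenuated less.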
In FIG. 6, the noise suppression coefficients Ḡn and the a-priori SNR values ξ̂n are supplied to noise suppression coefficient correction circuits 91_0 to 91_{K-1}. Each a-priori SNR value is compared in a comparator 911 with a threshold value to produce a control signal for a selector 912, through which the noise suppression coefficient is selectively coupled to a maximum selector 914 either via a multiplier 913 or via a through-connection, depending on the magnitude of the a-priori SNR value relative to the threshold value. When the a-priori SNR value is lower than the threshold value, the selector 912 is switched to the lower position, coupling the noise suppression coefficient to the multiplier 913, where it is scaled by a correction value. Otherwise, the selector 912 is switched to the upper position, coupling the noise suppression coefficient directly to the maximum selector 914. The maximum selector 914 compares the input signal with a lower limit value of correction and delivers the greater of the two to a multiplier 10.
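The correction logic above amounts to a conditional scaling followed by a floor, which can be sketched as follows. The function name `correct_coefficient` and all three default values are hypothetical, since the source does not state the threshold, the correction value or the lower limit.

```python
def correct_coefficient(coeff, xi_hat, snr_threshold=1.0, correction=0.5, lower_limit=0.1):
    """Scale the coefficient when the a-priori SNR is low, then apply a lower limit.

    snr_threshold, correction and lower_limit are hypothetical values.
    """
    if xi_hat < snr_threshold:    # selector routes the coefficient through the multiplier
        coeff = coeff * correction
    return max(coeff, lower_limit)  # maximum selector against the lower limit

corrected = [
    correct_coefficient(0.8, 2.0),  # SNR above threshold: passed through
    correct_coefficient(0.8, 0.5),  # SNR below threshold: scaled
    correct_coefficient(0.1, 0.5),  # scaled value falls below the lower limit
]
```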
Returning to FIG. 1, the multiplier 10 multiplies the corrected noise suppression coefficients Ḡn by the speech amplitude spectral components |Yn| supplied from the Fourier transform converter 2 to produce enhanced speech amplitude spectral components |X̄n|=Ḡn|Yn|. The latter are multiplied by the phase components arg Yn in a multiplier 11 to produce enhanced speech spectral components X̄n=|X̄n|arg Yn. An inverse Fourier transform is performed on the enhanced speech components in an inverse Fourier transform converter 12 to produce a speech frame containing a series of K time-domain components x̄n(t), where t=0, 1, . . . , K−1. K/2 time-domain components from each of two successive speech frames are combined in a frame synthesis circuit 13 into enhanced speech samples of the form x̂n(t)=x̄n-1(t+K/2)+x̄n(t).
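The recombination and frame synthesis steps can be sketched as follows. The function names `enhanced_bin` and `overlap_add` are hypothetical, and the sketch assumes that multiplying by the phase component means applying the phase as a unit-magnitude complex factor, which is the usual reading but is not spelled out in the text.

```python
import cmath

def enhanced_bin(coeff, amplitude, phase):
    """Enhanced spectral component: scaled amplitude recombined with the original phase.

    Assumption: the phase is applied as exp(j * phase).
    """
    return cmath.rect(coeff * amplitude, phase)

def overlap_add(prev_frame, cur_frame):
    """Combine the second half of the previous frame with the first half of the current one."""
    half = len(cur_frame) // 2  # K/2
    return [prev_frame[t + half] + cur_frame[t] for t in range(half)]

component = enhanced_bin(0.5, 2.0, 0.0)
samples = overlap_add([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```

Each call to `overlap_add` yields K/2 output samples, so consecutive frames advance the output by half a frame length.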
However, the noise suppression coefficients of the prior art noise suppressor are calculated using the same algorithm without distinction between speech sections and noise sections. As a result, speech distortions can occur in speech sections, while suppression in noise sections is insufficient.