FIG. 1 shows a block diagram of common sources of speech degradation. As may be appreciated, speech from the desired speaker (block 10) is degraded by environmental noise, namely voices of other speakers nearby (block 20) and background noise (block 30), and by communication channel noise and distortion (blocks 40 and 50). Noise reduction techniques (block 60) for automatic speech recognition (block 70) can reduce the (nearly stationary) background noise and the channel noise, whereas non-stationary noise and interfering voices are much more difficult to eliminate.
FIG. 2 shows a block diagram of an automatic speech recognition system. As may be appreciated, the noisy speech to be recognized is inputted to a short-time spectral analysis (windowed FFT) block 100, which outputs short-time spectra that are in turn inputted to a noise reduction block 110. The de-noised short-time spectra are inputted to a RASTA-PLP Front-End 120, which outputs the total energy of the speech signal, the cepstral coefficients, and the first and second derivatives of the total energy and of the cepstral coefficients, which are all inputted to an automatic speech recognition block 130.
RASTA-PLP Front-End 120 implements a technique known as “RelAtive SpecTrAl Technique”, which is an improvement of the traditional PLP (Perceptual Linear Prediction) method and consists in a special filtering of the different frequency channels of a PLP analyzer. This filtering is done to make speech analysis less sensitive to the slowly changing or steady-state factors in speech. The RASTA method replaces the conventional critical-band short-term spectrum in PLP and introduces a less sensitive spectral estimate. For a more detailed description of RASTA processing, reference may be made to H. Hermansky and N. Morgan, RASTA Processing of Speech, IEEE Transactions on Speech and Audio Processing, Vol. 2 No. 4, October 1994.
The noise reduction block 110 performs an environmental noise estimate 112 based on the short-time spectra and then an environmental noise reduction 114 based on the short-time spectra and the estimated noise, by using either a so-called “Spectral Subtraction Technique” or a so-called “Spectral Attenuation Technique”.
The aforementioned techniques will be described in detail hereinafter by denoting the power spectrum of the noisy speech by $|Y_k(m)|^2$, the power spectrum of the clean speech by $|X_k(m)|^2$, the power spectrum of the additive noise by $|D_k(m)|^2$, and the estimate of a quantity by the symbol “^”, wherein k indexes the spectral lines of the spectra and m indexes the time windows within which the noisy speech is processed for noise reduction.
The Spectral Subtraction Technique is described in N. Virag, Single Channel Speech Enhancement Based on Masking Properties of the Human Auditory System, IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 2, March 1999, which deals with the problem of noise reduction for speech recognition and discloses the use of a noise overestimation or oversubtraction factor and a spectral flooring factor.
In particular, the Spectral Subtraction Technique is based on the principle of reducing the noise by subtracting an estimate $|\hat{D}_k(m)|^2$ of the power spectrum of the additive noise from the power spectrum $|Y_k(m)|^2$ of the noisy speech, thus obtaining an estimate $|\hat{X}_k(m)|^2$ of the power spectrum of the clean speech:
$$
|\hat{X}_k(m)|^2 =
\begin{cases}
|Y_k(m)|^2 - \alpha(m)\,|\hat{D}_k(m)|^2 & \text{if } |Y_k(m)|^2 - \alpha(m)\,|\hat{D}_k(m)|^2 > \beta(m)\,|Y_k(m)|^2 \\
\beta(m)\,|Y_k(m)|^2 & \text{otherwise}
\end{cases}
\tag{1}
$$
wherein α(m) is the noise overestimation factor and β(m) is the spectral flooring factor.
In particular, the residual noise spectrum consists of peaks and valleys with random occurrences, and the overestimation factor α(m) and the spectral flooring factor β(m) have been introduced to reduce the spectral excursions.
In detail, the overestimation factor α(m) has been introduced to “overestimate” the noise spectrum, i.e., an overestimate of the noise is subtracted over the whole spectrum, whereas the spectral flooring factor β(m) prevents the spectral lines of the estimate $|\hat{X}_k(m)|^2$ of the power spectrum of the clean speech from descending below a lower bound $\beta(m)\,|Y_k(m)|^2$, thereby “filling in” the deep valleys surrounding narrow peaks in the enhanced spectrum. In fact, occasional negative estimates of the enhanced power spectrum can occur, and in such cases the negative spectral lines are floored to zero or to some minimal value (floor). Reducing the spectral excursions of the noise peaks, as compared to when the negative components are set to zero, reduces the amount of musical noise. Essentially, by reinserting the broadband noise (noise floor), the remnants of the noise peaks are “masked” by neighboring components of comparable magnitude.
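By way of illustration only, formula (1) may be sketched in Python for a single time window m. The function name, the list-based representation of the spectra, and the numeric values of alpha and beta are assumptions made for this example and are not taken from the cited reference.

```python
def spectral_subtraction(noisy_power, noise_power_est, alpha, beta):
    """Apply formula (1) to one time window m.

    noisy_power     -- |Y_k(m)|^2, one value per spectral line k
    noise_power_est -- estimated |D_k(m)|^2, same length
    alpha           -- noise overestimation factor alpha(m)
    beta            -- spectral flooring factor beta(m)
    """
    if len(noisy_power) != len(noise_power_est):
        raise ValueError("spectra must have the same number of spectral lines")
    clean_power_est = []
    for y2, d2 in zip(noisy_power, noise_power_est):
        subtracted = y2 - alpha * d2   # oversubtracted noise estimate
        floor = beta * y2              # lower bound beta(m) * |Y_k(m)|^2
        clean_power_est.append(subtracted if subtracted > floor else floor)
    return clean_power_est


# Hypothetical three-line spectrum with a flat noise estimate.
frame = [4.0, 1.0, 0.5]
noise = [1.0, 1.0, 1.0]
enhanced = spectral_subtraction(frame, noise, alpha=2.0, beta=0.1)
print(enhanced)
```

In this example only the first spectral line exceeds the floor after subtraction; the other two lines are replaced by the floor value rather than being set to zero, which is the "filling in" of valleys described above.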
A variant of this technique is known as the “Wiener Spectral Subtraction Technique”, which is similar to the previous one but is derived from optimal filter theory. The estimate $|\hat{X}_k(m)|^2$ of the power spectrum of the clean speech is the following:
$$
|\hat{X}_k(m)|^2 =
\begin{cases}
\dfrac{\left[\,|Y_k(m)|^2 - \alpha(m)\,|\hat{D}_k(m)|^2\,\right]^2}{|Y_k(m)|^2} & \text{if } |Y_k(m)|^2 - \alpha(m)\,|\hat{D}_k(m)|^2 > \beta(m)\,|Y_k(m)|^2 \\
\beta(m)\,|Y_k(m)|^2 & \text{otherwise}
\end{cases}
\tag{2}
$$
An improvement to the Spectral Subtraction Techniques is disclosed in V. Schless, F. Class, SNR-Dependent flooring and noise overestimation for joint application of spectral subtraction and model combination, ICSLP 1998, which proposes to make the noise overestimation factor α(m) and the spectral flooring factor β(m) functions of the global signal-to-noise ratio SNR(m).
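As a rough sketch, the Wiener variant of formula (2) may be written in Python as follows. The function name and the sample values are hypothetical. The factors alpha and beta are passed in per time window, so either fixed values or SNR-dependent values could be supplied by the caller; the rule that maps SNR(m) to those values is not reproduced here.

```python
def wiener_spectral_subtraction(noisy_power, noise_power_est, alpha, beta):
    """Apply formula (2) to one time window m.

    Assumes alpha >= 0 and beta >= 0, so that a spectral line with
    |Y_k(m)|^2 == 0 always takes the floor branch and no division by
    zero occurs.
    """
    if len(noisy_power) != len(noise_power_est):
        raise ValueError("spectra must have the same number of spectral lines")
    clean_power_est = []
    for y2, d2 in zip(noisy_power, noise_power_est):
        subtracted = y2 - alpha * d2
        floor = beta * y2
        if subtracted > floor:
            # Squared difference divided by the noisy power spectrum.
            clean_power_est.append(subtracted * subtracted / y2)
        else:
            clean_power_est.append(floor)
    return clean_power_est


# Hypothetical three-line spectrum with a flat noise estimate.
wiener_enhanced = wiener_spectral_subtraction(
    [4.0, 1.0, 0.5], [1.0, 1.0, 1.0], alpha=2.0, beta=0.1
)
print(wiener_enhanced)
```

Compared with formula (1), the first branch attenuates the surviving spectral lines further, while the flooring branch is unchanged.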
The Spectral Attenuation Technique, instead, is based on the principle of suppressing the noise by applying a suppression rule, i.e., a non-negative real-valued gain $G_k(m)$, to each spectral line k of the magnitude spectrum $|Y_k(m)|$ of the noisy speech, in order to compute an estimate $|\hat{X}_k(m)|$ of the magnitude spectrum of the clean speech according to the following formula: $|\hat{X}_k(m)| = G_k(m)\,|Y_k(m)|$.
Many suppression rules have been proposed, and probably one of the most important is the so-called Ephraim-Malah spectral attenuation log rule, which is described in Y. Ephraim and D. Malah, Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-33, No. 2, pp. 443-445, 1985.
The Ephraim-Malah gain $G_k(m)$ is defined as:
$$
G_k(m) = \frac{\xi_k(m)}{1 + \xi_k(m)}\,\exp\left(\frac{1}{2}\int_{v_k(m)}^{\infty}\frac{e^{-t}}{t}\,dt\right)
\tag{3}
$$
where:
$\xi_k(m)$ is a so-called a priori signal-to-noise ratio relating to the k-th spectral line and is defined as follows:
$$
\xi_k(m) = \frac{|X_k(m)|^2}{|D_k(m)|^2}
\tag{4}
$$
$v_k(m)$ is defined as:
$$
v_k(m) = \frac{\xi_k(m)}{1 + \xi_k(m)}\,\gamma_k(m)
\tag{5}
$$
$\gamma_k(m)$ is a so-called a posteriori signal-to-noise ratio relating to the k-th spectral line and is defined as follows:
$$
\gamma_k(m) = \frac{|Y_k(m)|^2}{|D_k(m)|^2}
\tag{6}
$$
Computation of the a posteriori signal-to-noise ratio $\gamma_k(m)$ requires knowledge of the power spectrum $|D_k(m)|^2$ of the additive noise, which is not available. An estimate $|\hat{D}_k(m)|^2$ of the power spectrum of the additive noise can be obtained with a noise estimate as described in H. G. Hirsch, C. Ehrlicher, Noise Estimation Techniques for Robust Speech Recognition, ICASSP 1995, pp. 153-156.
Thus, an estimate $\hat{\gamma}_k(m)$ of the a posteriori signal-to-noise ratio may be computed as follows:
$$
\hat{\gamma}_k(m) = \frac{|Y_k(m)|^2}{|\hat{D}_k(m)|^2}
\tag{7}
$$
Computation of the a priori signal-to-noise ratio $\xi_k(m)$ requires knowledge of the power spectrum $|X_k(m)|^2$ of the clean speech, which is not available. An estimate $\hat{\xi}_k(m)$ of the a priori signal-to-noise ratio can be computed by using a decision-directed approach as described in Y. Ephraim and D. Malah, Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 6, pp. 1109-1121, 1984, as follows:
$$
\hat{\xi}_k(m) = \eta(m)\,\frac{|\hat{X}_k(m-1)|^2}{|\hat{D}_k(m-1)|^2} + \left[1 - \eta(m)\right]\max\left[0,\ \hat{\gamma}_k(m) - 1\right], \qquad \eta(m) \in [0, 1)
\tag{8}
$$
where η(m) is a weighting coefficient for appropriately weighting the two terms in the formula.
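Formulas (7) and (8) may be illustrated by the following minimal Python sketch. The function names, the list-based spectra, and the value of eta are assumptions made for this example only; the noise power estimates are assumed to be strictly positive so that the divisions are defined.

```python
def a_posteriori_snr(noisy_power, noise_power_est):
    """Formula (7): estimated a posteriori SNR for each spectral line k."""
    return [y2 / d2 for y2, d2 in zip(noisy_power, noise_power_est)]


def a_priori_snr(prev_clean_power_est, prev_noise_power_est, gamma_est, eta):
    """Formula (8): decision-directed estimate of the a priori SNR.

    prev_clean_power_est -- estimated |X_k(m-1)|^2 from the previous window
    prev_noise_power_est -- estimated |D_k(m-1)|^2 from the previous window
    gamma_est            -- estimated a posteriori SNR of the current window
    eta                  -- weighting coefficient eta(m), with 0 <= eta < 1
    """
    if not 0.0 <= eta < 1.0:
        raise ValueError("eta must lie in [0, 1)")
    return [
        eta * x2 / d2 + (1.0 - eta) * max(0.0, g - 1.0)
        for x2, d2, g in zip(prev_clean_power_est, prev_noise_power_est, gamma_est)
    ]


# Hypothetical two-line spectra for windows m-1 and m.
gamma_hat = a_posteriori_snr([5.0, 0.5], [1.0, 1.0])
xi_hat = a_priori_snr([2.0, 0.0], [1.0, 1.0], gamma_hat, eta=0.5)
print(gamma_hat, xi_hat)
```

The first term carries over the result of the previous time window, while the second term uses only the current window; the max operation keeps the second term non-negative when the noisy power falls below the noise estimate.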
The Ephraim-Malah gain $G_k(m)$ may then be computed as a function of the estimate $\hat{\xi}_k(m)$ of the a priori signal-to-noise ratio and of the estimate $\hat{\gamma}_k(m)$ of the a posteriori signal-to-noise ratio according to formula (3).
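The computation of the gain of formula (3) from the two estimates may be sketched as follows. The integral in formula (3) is the exponential integral E1, taken here with the quantity of formula (5) as its lower limit, as in the Ephraim-Malah log-spectral amplitude rule. E1 is evaluated below with a power series for small arguments and a continued fraction for larger ones; this numerical scheme, the function names, the zero-gain convention for non-positive inputs, and the sample values are all assumptions of this example, not part of the cited references.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp_integral_e1(x):
    """Return E1(x), the integral of exp(-t)/t from x to infinity, for x > 0."""
    if x <= 0.0:
        raise ValueError("E1(x) is only evaluated here for x > 0")
    if x <= 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{n>=1} (-x)^n / (n * n!)
        total = -EULER_GAMMA - math.log(x)
        term = 1.0  # holds (-x)^n / n!
        for n in range(1, 100):
            term *= -x / n
            delta = -term / n
            total += delta
            if abs(delta) < 1e-16:
                break
        return total
    # Continued fraction evaluated with the modified Lentz method.
    b = x + 1.0
    c = 1.0e300
    d = 1.0 / b
    h = d
    for i in range(1, 200):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(-x)


def ephraim_malah_log_gain(xi, gamma):
    """Gain of formula (3) for one spectral line, from the SNR estimates."""
    if xi <= 0.0 or gamma <= 0.0:
        return 0.0  # assumed convention: the gain tends to zero in this limit
    ratio = xi / (1.0 + xi)
    v = ratio * gamma  # formula (5)
    return ratio * math.exp(0.5 * exp_integral_e1(v))


# Hypothetical estimates for two spectral lines of one time window.
noisy_magnitude = [2.0, 1.0]
xi_est = [1.0, 100.0]
gamma_est = [2.0, 101.0]
gains = [ephraim_malah_log_gain(x, g) for x, g in zip(xi_est, gamma_est)]
clean_magnitude_est = [g * y for g, y in zip(gains, noisy_magnitude)]
print(gains, clean_magnitude_est)
```

For a high a priori signal-to-noise ratio the exponential term approaches one and the gain approaches ξ/(1+ξ), so that spectral lines dominated by speech are left almost unchanged.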
An application of the Spectral Attenuation Technique is disclosed in US-A-2002/0002455, which relates to a speech enhancement system receiving noisy speech characterized by a spectral amplitude spanning a plurality of frequency bins and producing enhanced speech by modifying the spectral amplitude of the noisy speech without affecting the phase thereof. In particular, the speech enhancement system includes a core estimator that applies to the noisy speech one of a first set of gains for each frequency bin; a noise adaptation module that segments the noisy speech into noise-only and signal-containing frames, maintains a current estimate of the noise spectrum and an estimate of the probability of signal absence in each frequency bin; and a signal-to-noise ratio estimator that measures a posteriori signal-to-noise ratio and estimates a priori signal-to-noise ratio based on the noise estimate. Each one of the first set of gains is based on a priori signal-to-noise ratio, as well as the probability of signal absence in each bin and a level of aggression of the speech enhancement. A soft decision module computes a second set of gains that is based on a posteriori signal-to-noise ratio and a priori signal-to-noise ratio, and the probability of signal absence in each frequency bin.
Another application of the Spectral Attenuation Techniques is disclosed in WO-A-01/52242, which relates to a multi-band spectral subtraction scheme which can be applied to a variety of speech communication systems, such as hearing aids, public address systems, teleconference systems, voice control systems, or speaker phones, and which comprises a multi band filter architecture, noise and signal power detection, and gain function for noise reduction. The gain function for noise reduction consists of a gain scale function and a maximum attenuation function providing a predetermined amount of gain as a function of signal-to-noise ratio and noise. The gain scale function is a three-segment piecewise linear function, and the three piecewise linear sections of the gain scale function include a first section providing maximum expansion up to a first knee point for maximum noise reduction, a second section providing less expansion up to a second knee point for less noise reduction, and a third section providing minimum or no expansion for input signals with high signal-to-noise ratio to minimize distortion. The maximum attenuation function can either be a constant or equal to the estimated noise envelope. When used in hearing aid applications, the noise reduction gain function is combined with the hearing loss compensation gain function inherent to hearing aid processing.
Automatic speech recognition performed by using the known noise reduction methods described above is affected by some technical problems which prevent it from being really effective. In particular, the Spectral Subtraction Technique and the Wiener Spectral Subtraction Technique are affected by the so-called “musical noise”, which is introduced in the estimate $|\hat{X}_k(m)|^2$ of the power spectrum of the clean speech by the aforementioned flooring, according to which negative values are set to a flooring value $\beta(m)\,|Y_k(m)|^2$ in order to avoid the occurrence of negative subtraction results. In particular, the flooring introduces discontinuities in the spectrum that are perceived as annoying musical noise and degrade the performance of an automatic speech recognition system.
The Spectral Attenuation Technique implementing the Ephraim-Malah attenuation rule is a very good technique for so-called speech enhancement, i.e., noise reduction for a human listener, but it introduces some spectral distortion on voice parts that is acceptable for humans but very critical for an automatic speech recognition system.