1. Field of the Invention
The present invention is generally in the field of speech coding. In particular, the present invention is in the field of noise suppression for speech coding purposes.
2. Background Art
Today, noise reduction has become the subject of many research projects in various technical fields. In recent years, due to the tremendous demand and growth in the areas of digital telephony, the Internet and cellular telephones, there has been an intense focus on the quality of audio signals, especially the reduction of noise in speech signals. The goal of an ideal noise suppressor system or method is to reduce the noise level without distorting the speech signal, and in effect, reduce the stress on the listener and increase intelligibility of the speech signal.
Technically, there are many different ways to perform the noise reduction. One noise reduction technique that has gained ground among the experts in the field is a noise reduction system based on the principles of spectral weighting. Spectral weighting means that different spectral regions of the mixed signal of speech and noise are attenuated or modified with different gain factors. The goal is to achieve a speech signal that contains less noise than the original speech signal. At the same time, however, the speech quality must remain substantially intact with a minimal distortion of the original speech. Another important design consideration is that the residual noise, i.e. the noise remaining in the processed signal, must not sound unnatural.
Typically, the spectral weighting technique is performed in the frequency domain using the well-known Fourier transform. To explain the principles of spectral weighting in simple terms, a clean speech signal is denoted with s(k), a noise signal is denoted with n(k), and an original speech signal is denoted with o(k), which may be formulated as o(k)=s(k)+n(k). Taking the Fourier transform of this equation leads to O(f)=S(f)+N(f). At this step, the actual spectral weighting may be performed by multiplying the spectrum O(f) with a real weighting function W(f)≧0. As a result, P(f)=W(f)O(f), and the processed signal p(k) is obtained by transforming P(f) back into the time domain. Below, a more elaborate system 100, including a conventional noise suppression module 106, is discussed. The conventional noise suppression module 106 of the speech pre-processing system 100 is that of the Telecommunication Industry Association Interim Standard 127 (“IS-127”), which is known as Enhanced Variable Rate Coder (“EVRC”). The IS-127 specification is hereby fully incorporated by reference in the present application.
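The spectral weighting principle described above, P(f)=W(f)O(f) followed by an inverse transform, can be illustrated with a minimal Python sketch. The function names, the naive O(N²) transform, and the requirement that the weights be symmetric (so that the output stays real) are assumptions made for illustration only; they are not taken from IS-127.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (illustrative, not optimized)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def spectral_weighting(o, w):
    """Return p(k), the time-domain signal obtained from P(f) = W(f) * O(f).

    o: samples of the noisy signal o(k) = s(k) + n(k)
    w: real, non-negative weights W(f), one per DFT bin. For a real output the
       weights should be symmetric, w[k] == w[N - k] (assumption of this sketch).
    """
    if len(o) != len(w):
        raise ValueError("o and w must have the same length")
    if any(weight < 0 for weight in w):
        raise ValueError("weights must be non-negative")
    spectrum_o = dft(o)
    spectrum_p = [weight * value for weight, value in zip(w, spectrum_o)]
    # The imaginary part is only rounding error when the weights are symmetric.
    return [value.real for value in idft(spectrum_p)]
```

For instance, if the noise occupies only a few bins, setting the weights of those bins to zero and all other weights to one removes the noise while leaving the remaining spectrum untouched. In practice the weights are rarely this clean, which is why the choice of W(f) is the central design problem.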
As stated above, FIG. 1a illustrates a conventional speech pre-processing system 100, which includes a noise suppression module 106. After reading and buffering samples of the input speech 101 for a given speech frame, an input speech signal 101 enters the speech pre-processing system 100. The input speech signal 101 samples are then analyzed by a silence enhancement module 102 to determine whether the speech frame is pure silence, in other words, whether only silence noise is present. Next, the silence enhanced input speech signal 103 is filtered by the high-pass filter module 104 to condition the input speech 101 against excessive low frequency content that degrades the voice quality.
The high-pass filtered speech signal 105 is then routed to a noise suppression module 106. The noise suppression module 106 performs a noise attenuation of the environmental noise in order to improve the estimation of speech parameters.
The noise suppression module 106 performs noise processing in the frequency domain by adjusting the level of the frequency response of each frequency band, which results in a substantial reduction in background noise. The noise suppression module 106 is aimed at improving the signal-to-noise ratio (“SNR”) of the input speech signal 101 prior to the speech encoding process. Although the speech frame size is 20 ms, the noise suppression module 106 frame size is 10 ms. Therefore, the following procedures must be executed two times per 20 ms speech frame. For the purpose of the following description, the current 10 ms frame of the high-pass filtered speech signal 105 is denoted m.
As shown, the high-pass filtered speech signal 105, denoted {Shp(n)}, enters the first stage of the noise suppression module 106, i.e. Frequency Domain Conversion stage 110. At the Frequency Domain Conversion stage 110, Shp(n) is windowed using a smoothed trapezoid window, in which the first D samples of the input frame buffer {d(m)} are overlapped from the last D samples of the previous frame, where this overlap is described as: d(m,n)=d(m−1,L+n); 0≦n<D, where m is the current frame, n is the sample index to the buffer {d(m)}, L=80 is the frame length, and D=24 is the overlap or delay in samples. The remaining samples of the input buffer {d(m)} are then pre-emphasized at the Frequency Domain Conversion stage 110 to increase the high to low frequency ratio with a pre-emphasis factor ζp=−0.8 according to the following: d(m,D+n)=Shp(n)+ζpShp(n−1); 0≦n<L. This results in the input buffer containing L+D=104 samples in which the first D samples are the pre-emphasized overlap from the previous frame, and the following L samples are pre-emphasized input from the current frame m.
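The buffer construction described above can be sketched in Python as follows. The constants L, D and ζp come from the text; the function name and the prev_sample argument, which stands in for Shp(−1) carried over from the previous frame, are assumptions of this sketch rather than details given in the source.

```python
L = 80          # frame length in samples (from the text)
D = 24          # overlap (delay) in samples (from the text)
ZETA_P = -0.8   # pre-emphasis factor (from the text)


def build_input_buffer(prev_buffer, s_hp, prev_sample=0.0):
    """Form the input buffer d(m) of L + D samples for the current frame.

    prev_buffer: d(m-1), the L + D samples of the previous frame's buffer
    s_hp:        the L high-pass filtered samples Shp(n) of the current frame
    prev_sample: Shp(-1), the last high-pass filtered sample of the previous
                 frame (hypothetical parameter; the text does not say how
                 this value is carried between frames)
    """
    if len(prev_buffer) != L + D:
        raise ValueError("prev_buffer must hold L + D samples")
    if len(s_hp) != L:
        raise ValueError("s_hp must hold L samples")

    d = [0.0] * (L + D)

    # Overlap: d(m, n) = d(m-1, L + n) for 0 <= n < D
    for n in range(D):
        d[n] = prev_buffer[L + n]

    # Pre-emphasis: d(m, D + n) = Shp(n) + zeta_p * Shp(n - 1) for 0 <= n < L
    for n in range(L):
        previous = s_hp[n - 1] if n > 0 else prev_sample
        d[D + n] = s_hp[n] + ZETA_P * previous

    return d
```

Because the overlap copies samples from a buffer that was already pre-emphasized, the first D samples need no further filtering, which matches the description of the 104-sample buffer.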
Next, a smoothed trapezoidal window is applied to the input buffer {d(m)} to form a Discrete Fourier Transform (“DFT”) data buffer {g(n)}, defined as:

$$
g(n)=\begin{cases}
d(m,n)\,\sin^{2}\!\big(\pi(n+0.5)/2D\big); & 0\le n<D,\\
d(m,n); & D\le n<L,\\
d(m,n)\,\sin^{2}\!\big(\pi(n-L+D+0.5)/2D\big); & L\le n<D+L,\\
0; & D+L\le n<M,
\end{cases}
$$

where M=128 is the DFT sequence length. At this point, a transformation of g(n) to the frequency domain is performed using the DFT to obtain G(k). A transformation technique, such as a 64-point complex Fast Fourier Transform (“FFT”), may be used to convert the time domain data buffer g(n) to the frequency domain data buffer spectrum G(k). Thereafter, G(k) is used to compute noise reduction parameters for the remaining blocks, as explained below.
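The smoothed trapezoidal window can be sketched in Python as shown below. This sketch reads the argument of the sine as π(n+0.5)/(2D) and assumes that the falling taper covers L≦n<D+L, which is the only range that makes the four pieces contiguous; both readings are interpretations, and the function name is hypothetical.

```python
import math

L = 80    # frame length in samples (from the text)
D = 24    # overlap in samples (from the text)
M = 128   # DFT sequence length (from the text)


def smoothed_trapezoid_window(d):
    """Apply the smoothed trapezoidal window to d(m) and zero-pad to M samples."""
    if len(d) != L + D:
        raise ValueError("d must hold L + D samples")

    g = [0.0] * M  # samples D + L .. M - 1 stay at zero (zero padding)
    for n in range(L + D):
        if n < D:
            # Rising sin^2 taper over the overlap region
            taper = math.sin(math.pi * (n + 0.5) / (2 * D)) ** 2
        elif n < L:
            # Flat top of the trapezoid
            taper = 1.0
        else:
            # Falling sin^2 taper (range L <= n < D + L assumed in this sketch)
            taper = math.sin(math.pi * (n - L + D + 0.5) / (2 * D)) ** 2
        g[n] = d[n] * taper
    return g
```

With this reading, the rising taper at index n and the falling taper at index L+n sum to one, so consecutive windowed frames overlap-add to unity gain across the D shared samples.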
The frequency domain data buffer spectrum G(k) resulting from the Frequency Domain Conversion stage 110 is used to estimate channel energy Ech(m) for the current frame m at Channel Energy Estimator stage 115. At this stage, the 64-point energy bands are computed from the FFT results of stage 110, and are quantized into 16 bands (or channels). The quantization is used to combine low, mid, and high frequency components and to simplify the internal computation of the algorithm. Also, in order to maintain accuracy, the quantization uses a small step size for low frequency ranges, increases the step size for higher frequencies, and uses the largest step size for the highest frequency ranges.
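A rough Python sketch of grouping 64 spectral bins into 16 channels of increasing width follows. The band edges in BAND_LOW and BAND_HIGH are illustrative values chosen to grow with frequency as the text describes; they are not taken from the text above and should be checked against the IS-127 tables before use. Any frame-to-frame smoothing of the channel energies is also omitted.

```python
# Illustrative (hypothetical) band edges, inclusive, in DFT bin indices.
# Narrow bands at low frequencies, wider bands at high frequencies.
BAND_LOW = [2, 4, 6, 8, 10, 12, 14, 17, 20, 23, 27, 31, 36, 42, 49, 56]
BAND_HIGH = [3, 5, 7, 9, 11, 13, 16, 19, 22, 26, 30, 35, 41, 48, 55, 63]


def channel_energies(spectrum):
    """Return 16 channel energies as the mean of |G(k)|^2 over each band.

    spectrum: sequence of at least 64 complex bins G(k)
    """
    if len(spectrum) < 64:
        raise ValueError("spectrum must hold at least 64 bins")

    energies = []
    for low, high in zip(BAND_LOW, BAND_HIGH):
        power = sum(abs(spectrum[k]) ** 2 for k in range(low, high + 1))
        energies.append(power / (high - low + 1))
    return energies
```

Averaging rather than summing within each band keeps the channel energies comparable despite the different band widths, which is a design choice of this sketch.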
Next, at Channel SNR Estimator stage 120, quantized 16-channel SNR indices σq(i) are estimated using the channel energy Ech(m) from the Channel Energy Estimator stage 115, and the current channel noise energy estimate En(m) from Background Noise Estimator 140, which continuously tracks the input spectrum G(k). In order to avoid undervaluing and overvaluing of the SNR, the final SNR result is also quantized at the Channel SNR Estimator 120. Then, a sum of voice metrics v(m) is determined at Voice Metric Calculation stage 130 based upon the estimated quantized channel SNR indices σq(i) from the Channel SNR Estimator stage 120. This process involves mapping each of the sixteen quantized channel SNR indices σq(i) through a predetermined voice metric table and summing the results. The higher the SNR, the higher the voice metric sum v(m). Because the value of the voice metric v(m) is also quantized, the maximum and the minimum values are always ascertainable.
Thereafter, at Spectral Deviation Estimator stage 125, changes from speech to noise and vice versa are detected, which can be used to indicate the presence of speech activity or of a noise frame. In particular, a log power spectrum Edb(m,i) is estimated based upon the estimated channel energy Ech(m), from the Channel Energy Estimator stage 115, for each of the sixteen channels. Then, an estimated spectral deviation ΔE(m) between the current frame power spectrum Edb(m) and an average long-term power spectral estimate Ēdb(m) is determined. The estimated spectral deviation ΔE(m) is simply a sum of the differences between the current frame power spectrum Edb(m) and the average long-term power spectral estimate Ēdb(m) at each of the sixteen channels. In addition, a total channel energy estimate Etot(m) for the current frame is determined by taking the logarithm of the sum of the estimated channel energy Ech(m) at each frame. Thereafter, an exponential windowing factor α(m) is determined as a function of the total channel energy Etot(m), and the result of that determination is limited to a range determined by predetermined upper and lower limits αH and αL, respectively. Then, an average long-term power spectral estimate for the subsequent frame Ēdb(m+1,i) is updated using the exponential windowing factor α(m), the log power spectrum Edb(m), and the average long-term power spectral estimate for the current frame Ēdb(m).
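The quantities computed at the Spectral Deviation Estimator stage can be sketched in Python as follows. Several details are assumptions of this sketch and are not stated in the text: the 10·log10 scaling, the use of absolute differences in the deviation sum, the linear mapping from Etot(m) to α(m), the first-order recursive form of the long-term update, and all four constants.

```python
import math

# Hypothetical constants; the text only says that alpha is limited to
# predetermined upper and lower limits.
ALPHA_H = 0.99
ALPHA_L = 0.50
E_H = 50.0
E_L = 30.0


def log_power_spectrum(e_ch):
    """Edb(m, i): channel energies expressed in dB (scaling assumed)."""
    return [10.0 * math.log10(e) for e in e_ch]


def spectral_deviation(e_db, e_db_avg):
    """Delta_E(m): sum over channels of |Edb - long-term average| (abs assumed)."""
    return sum(abs(cur - avg) for cur, avg in zip(e_db, e_db_avg))


def total_channel_energy(e_ch):
    """Etot(m): logarithm of the summed channel energies, in dB (scaling assumed)."""
    return 10.0 * math.log10(sum(e_ch))


def windowing_factor(e_tot):
    """alpha(m): assumed linear function of Etot, limited to [ALPHA_L, ALPHA_H]."""
    alpha = ALPHA_H - (ALPHA_H - ALPHA_L) / (E_H - E_L) * (E_H - e_tot)
    return max(ALPHA_L, min(ALPHA_H, alpha))


def update_long_term_spectrum(e_db, e_db_avg, alpha):
    """Long-term estimate for frame m + 1 (exponential average, form assumed)."""
    return [alpha * avg + (1.0 - alpha) * cur
            for cur, avg in zip(e_db, e_db_avg)]
```

Under these assumptions a loud frame yields α(m) close to αH, so the long-term estimate changes slowly, while a quiet frame lets the estimate follow the current spectrum more quickly.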
With the above variables determined at the Spectral Deviation Estimator stage 125, the noise estimate is updated at Noise Update Decision stage 135. At this stage 135, a noise frame indicator update_flag indicating the presence of a noise frame can be determined by utilizing the voice metrics v(m) from the Voice Metric Calculation stage 130, and the total channel energy Etot(m) and the spectral deviation ΔE(m) from the Spectral Deviation Estimator stage 125. Using these three pre-computed values coupled with a simple delay decision mechanism, the noise frame indicator update_flag is ascertained. The delay decision is implemented using counters and a hysteresis process to avoid any sudden changes in the noise to non-noise frame detection. The pseudo-code demonstrating the logic for updating the noise estimate is set forth in the above-incorporated IS-127 specification and shown in FIG. 1b.
Now, having updated the background noise at the Noise Update Decision stage 135, it is determined at Channel SNR Modifier stage 145 whether channel SNR modification is necessary and whether to modify the appropriate channel SNR indices σq(i). In some instances, it is necessary to modify the SNR value to avoid classifying a noise frame as speech. This error may stem from a distorted frequency spectrum. By analyzing the mid and high frequency bands at the Channel SNR Modifier stage 145, the pre-computed SNR can be modified if it is determined that a high probability of error exists in the processed signal. This process is set forth in the above-incorporated IS-127 specification, as shown in FIG. 1c.
Referring to FIG. 1c, the quantized channel SNR indices σq(i) determined at the Channel SNR Estimator 120 are checked to determine whether they are greater than or equal to a predetermined channel SNR index threshold level, i.e. INDEX_THLD, which is set at 12, and an index counter counts the indices that meet this threshold. Thereafter, if it is determined that the index counter is less than a predetermined index counter threshold level (INDEX_CNT_THLD=5), a channel SNR modification flag is set to indicate that the channel SNR must be modified, and the channel SNR indices σq(i) are modified to obtain modified channel SNR indices σ′q(i). Otherwise, the channel SNR modification flag is reset to indicate that the modification is not necessary, and the modified channel SNR indices are not changed from the original values, i.e. σ′q(i)=σq(i).
Now, when the channel SNR modification flag is set, if the voice metric sum v(m) determined at the Voice Metric Calculation stage 130 is determined to be less than or equal to a predetermined metric threshold level, i.e. METRIC_THLD=45, or if the channel SNR indices σq(i) are less than or equal to a predetermined setback threshold level, i.e. SETBACK_THLD=12, the modified channel SNR indices σ′q(i) are set to one. Else, the modified channel SNR indices σ′q(i) are not changed from the original values, i.e. σ′q(i)=σq(i). In the following segment, in order to limit the modified channel SNR indices σ′q(i) to an SNR threshold level σth, it is first determined whether the modified channel SNR indices σ′q(i) are less than the SNR threshold level σth. If so, the threshold limited and modified channel SNR indices σ″q(i) are set to the threshold level σth, i.e. σ″q(i)=σth. Else, the SNR indices σ″q(i) are not changed, i.e. σ″q(i)=σ′q(i).
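The channel SNR modification and threshold limiting described above can be sketched in Python as follows. The four named thresholds come from the text. The value of σth, the first channel included in the index count (the text only refers to the mid and high frequency bands), and the reading that the setback test applies only when the modification flag is set are assumptions of this sketch.

```python
INDEX_THLD = 12       # channel SNR index threshold (from the text)
INDEX_CNT_THLD = 5    # index counter threshold (from the text)
METRIC_THLD = 45      # voice metric threshold (from the text)
SETBACK_THLD = 12     # setback threshold (from the text)


def modify_channel_snr(sigma_q, voice_metric_sum, sigma_th=6,
                       first_counted_channel=5):
    """Return (modify_flag, sigma_lim) for one frame.

    sigma_q:               quantized channel SNR indices, one per channel
    voice_metric_sum:      v(m) for the current frame
    sigma_th:              SNR threshold level (hypothetical default)
    first_counted_channel: first channel included in the count
                           (hypothetical default)
    """
    # Count mid and high band channels whose index reaches INDEX_THLD.
    index_cnt = sum(1 for s in sigma_q[first_counted_channel:]
                    if s >= INDEX_THLD)
    modify_flag = index_cnt < INDEX_CNT_THLD

    if modify_flag:
        # Set back weak channels, or every channel when v(m) is low.
        sigma_mod = [1 if (voice_metric_sum <= METRIC_THLD or s <= SETBACK_THLD)
                     else s
                     for s in sigma_q]
    else:
        sigma_mod = list(sigma_q)

    # Threshold limiting: indices below sigma_th are raised to sigma_th.
    sigma_lim = [max(s, sigma_th) for s in sigma_mod]
    return modify_flag, sigma_lim
```

Raising every index to at least σth means that a channel judged to be noise always receives the same baseline gain in the following stage, rather than an arbitrarily small one.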
Turning back to FIG. 1a, the threshold limited, modified channel SNR indices σ″q(i) are provided to the Channel Gain Calculation stage 150 to determine an overall gain factor γn for the current frame based upon a pre-set minimum overall gain γmin, a noise floor energy Efloor, and the estimated noise spectrum of the previous frame En(m−1). Next, the channel gain in the dB domain, i.e. γdb(i), is determined based on the following equation: γdb(i)=μg(σ″q(i)−σth)+γn; 0≦i<Nc, where the gain slope μg is a constant factor, set to 0.39. In the following stage, the channel gain γdb(i) is converted from the dB domain to linear channel gains, i.e. γch(i), by taking the inverse logarithm of base 10, i.e. γch(i)=min{1, 10^(γdb(i)/20)}. Therefore, for a given channel, γch has a value less than or equal to one, but greater than zero, i.e. 0<γch(i)≦1. The gain γch should be higher or closer to 1.0 to preserve the speech quality for strong voiced areas and, on the other hand, the gain γch should be lower or closer to zero to suppress noise in noisy areas. Next, the linear channel gains γch(i) are applied to the G(k) signal by a gain modifier 155, producing a noise-reduced signal spectrum H(k). Finally, the H(k) signal is converted back into the time domain at Time Domain Conversion stage 160, resulting in a noise-reduced signal S′(n) in the time domain.
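The conversion from SNR indices to linear channel gains can be sketched in Python as follows. The gain slope of 0.39 and the two formulas come from the text. The overall gain factor γn is taken as an input because its formula is not given above, and the default value of σth is hypothetical.

```python
MU_G = 0.39  # gain slope (from the text)


def channel_gains(sigma_lim, gamma_n, sigma_th=6):
    """Return the linear channel gains gamma_ch(i), each in (0, 1].

    sigma_lim: threshold limited, modified channel SNR indices
    gamma_n:   overall gain factor for the frame in dB (computed elsewhere)
    sigma_th:  SNR threshold level (hypothetical default)
    """
    gains = []
    for s in sigma_lim:
        # gamma_db(i) = mu_g * (sigma''_q(i) - sigma_th) + gamma_n
        gamma_db = MU_G * (s - sigma_th) + gamma_n
        # gamma_ch(i) = min{1, 10^(gamma_db(i) / 20)}
        gains.append(min(1.0, 10.0 ** (gamma_db / 20.0)))
    return gains
```

A channel whose index sits at σth receives exactly the overall gain γn, and the gain then rises by 0.39 dB per index step until it saturates at unity, so the SNR index is the only per-channel quantity that shapes the gain.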
The above-described conventional approach, however, is a simplistic approach to noise suppression, which only considers one dynamic parameter, i.e. the dynamic change in the SNR value, in determining the channel gains γch(i). This simplistic approach introduces various drawbacks, which may in turn cause a degradation in the perceptual quality of the voice signal that is more audible than the noise signal. The shortcomings and inaccuracies of the conventional system 100, which are due to its sole reliance on the SNR value, stem from the facts that the SNR calculation is merely an estimate of the signal-to-noise ratio, and that the SNR value is only an average, which by definition may be more or less than the true SNR value for specific areas of each channel. As a result of its sole reliance on the SNR value, the conventional approach improperly alters the voiced areas of the speech, and thus causes degradation in the voice quality.
Accordingly, there is an intense need in the art for a new and improved approach to noise suppression that can overcome the shortcomings in the conventional approach and produce a noise-reduced speech signal with a superior voice quality.