In general, various methods for enhancing the quality of speech have been proposed. The spectral subtraction method (SSM) is a representative one of these methods. The SSM is explained with reference to FIG. 1 as follows.
The SSM is a method of directly estimating a short-time spectral magnitude. In the SSM, the input is modeled as speech to which noise, represented by an uncorrelated random variable, is added. This model is expressed by Formula 1 as follows.
$$y[n] = s[n] + d[n]$$  [Formula 1]
In Formula 1, y[n] is the input speech, s[n] is the speech itself, and d[n] is the noise. Furthermore, d[n] is assumed to be uncorrelated with s[n]. Hence, the power spectral density is found according to Formula 2 as follows.
$$S_y(e^{j\omega}) = S_s(e^{j\omega}) + S_d(e^{j\omega})$$  [Formula 2]
In Formula 2, $S_y(e^{j\omega})$ is represented by Formula 3 via a short-time Discrete-Time Fourier Transform (DTFT).
$$S_y(e^{j\omega}) = |Y(e^{j\omega})|^2$$  [Formula 3]
The phase must be known in order to find the spectrum of the speech frame itself. However, it has been shown that using the phase of the noisy speech, i.e., the speech actually mixed with noise, in place of the phase of the speech frame makes no large difference (D. L. Wang and J. S. Lim, “The unimportance of phase in speech enhancement,” IEEE Trans. on Acoust., Speech, and Signal Processing, vol. ASSP-30, pp. 679-681, 1982).
When the phase of the speech frame is determined using the phase of the noisy speech, the short-time DTFT to be sought can be found by Formula 4.
$$\hat{S}(e^{j\omega}) = |S_y(e^{j\omega}) - \hat{S}_d(e^{j\omega})|^{1/2} e^{j\varphi_y(\omega)}$$  [Formula 4]
$S_y(e^{j\omega})$ in Formula 4 is found from Formula 2, and $\varphi_y(\omega)$ is the phase of the noisy speech. Therefore, the estimate ŝ[n] to be sought is found from Formula 4. $\hat{S}_d(e^{j\omega})$ is estimated from the noise during intervals in which there is no speech.
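The steps of Formulas 3 and 4 can be illustrated with a minimal sketch for a single frame. The function names `spectral_subtract_frame` and `estimate_noise_psd` are hypothetical, a DFT (NumPy's `rfft`) stands in for the short-time DTFT, and windowing and overlap-add are omitted; this is an illustration under those assumptions, not the implementation of FIG. 1.

```python
import numpy as np

def estimate_noise_psd(noise_frames):
    """Average |D|^2 over speech-free frames (rows) to estimate S_d.

    Assumes the noise is stationary over these frames.
    """
    spectra = np.fft.rfft(noise_frames, axis=-1)
    return np.mean(np.abs(spectra) ** 2, axis=0)

def spectral_subtract_frame(noisy_frame, noise_psd):
    """Apply Formula 4 to one frame and return the time-domain estimate."""
    Y = np.fft.rfft(noisy_frame)
    S_y = np.abs(Y) ** 2                      # Formula 3
    phase = np.angle(Y)                       # phase of the noisy speech
    magnitude = np.sqrt(np.abs(S_y - noise_psd))
    S_hat = magnitude * np.exp(1j * phase)    # Formula 4
    return np.fft.irfft(S_hat, n=len(noisy_frame))
```

With a zero noise estimate the frame passes through unchanged, which is a quick sanity check that the noisy phase is reused correctly.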
Another of the speech quality enhancing methods, the Adaptive Line Enhancer (ALE), is explained with reference to FIG. 2 as follows. Because the ALE evolved from a scheme using an adaptive filter, the use of a general adaptive filter is explained first.
When the adaptive filter is used, inputs are received from two microphones: noisy speech is received as the input of one microphone and pure noise as the input of the other. A transfer function arises between the two inputs due to the distance between the two microphones and the like. The adaptive filter removes the effect of this transfer function to attain clean speech.
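The two-microphone arrangement can be sketched as follows. The text does not name an adaptation rule, so the least-mean-squares (LMS) update used here is an assumption, and the function name `lms_noise_canceller` and its parameters are hypothetical.

```python
import numpy as np

def lms_noise_canceller(primary, reference, num_taps=8, step=0.005):
    """Two-microphone noise cancellation with an LMS filter (assumed rule).

    primary:   noisy speech from the first microphone
    reference: pure noise from the second microphone
    Returns the cleaned signal and the final filter weights.
    """
    w = np.zeros(num_taps)
    cleaned = np.zeros(len(primary))
    for n in range(len(primary)):
        # Most recent reference samples, newest first, zero-padded at start.
        x = np.zeros(num_taps)
        k = min(num_taps, n + 1)
        x[:k] = reference[n::-1][:k]
        noise_estimate = w @ x              # noise as seen at the primary mic
        e = primary[n] - noise_estimate     # error doubles as the speech estimate
        w += 2 * step * e * x               # LMS weight update
        cleaned[n] = e
    return cleaned, w
```

In this sketch the weights converge toward the transfer function between the two microphones, so the filtered reference approximates the noise in the primary input and subtracting it leaves the speech.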
The method using the adaptive filter is very effective in some cases and has been used successfully in practice. Yet, the method requires installation of a pair of microphones. Also, there is a structural difficulty in deciding how far apart the two microphones should be spaced. Hence, it is difficult to apply the method to user equipment such as a mobile terminal.
The ALE is an improvement of the method employing the adaptive filter. It is a scheme for performing adaptive filtering on signals s[n] and d[n] obtained from the same microphone, with a delay equivalent to a pitch period placed between the signals. Here, the pitch period corresponds to the period of a voiced speech part of a speech signal.
For voiced speech, a periodic impulse train excites the vocal tract. Hence, the ALE is considerably effective on voiced speech. For unvoiced speech, however, the speech is distorted.
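A single-microphone ALE can be sketched by feeding the adaptive filter a copy of the input delayed by one pitch period. As before, the LMS update is an assumption (the text does not name the adaptation rule), the pitch period is taken as already known, and `adaptive_line_enhancer` is a hypothetical name.

```python
import numpy as np

def adaptive_line_enhancer(y, pitch_period, num_taps=32, step=0.002):
    """Predict y[n] from samples delayed by at least one pitch period.

    The periodic (voiced) part of y is predictable from the delayed samples,
    while uncorrelated noise is not, so the prediction is the enhanced output.
    """
    y = np.asarray(y, dtype=float)
    w = np.zeros(num_taps)
    enhanced = np.zeros(len(y))
    for n in range(pitch_period + num_taps - 1, len(y)):
        start = n - pitch_period
        # Delayed samples y[n - T0], y[n - T0 - 1], ..., newest first.
        x = y[start - num_taps + 1 : start + 1][::-1]
        enhanced[n] = w @ x
        e = y[n] - enhanced[n]      # prediction error
        w += 2 * step * e * x       # LMS weight update (assumed rule)
    return enhanced
```

Because only components that repeat after one pitch period are predictable, an input with no periodicity tends to be suppressed along with the noise in this sketch, which is consistent with the poor behavior on unvoiced speech noted above.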
Another of the speech quality enhancing methods, a scheme using an adaptive comb filter, is explained as follows. Like the ALE, the scheme using the adaptive comb filter is more effective on voiced speech.
For voiced speech, the excitation signal is a periodic signal. The Fourier Transform of an impulse train is again an impulse train in the frequency domain. Hence, for voiced speech, peaks appear periodically at integer multiples of the pitch frequency. The contour of the overall spectrum is, of course, shaped by the resonances of the vocal tract, called formants.
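The harmonic structure described above can be checked numerically with a short sketch. The function name `impulse_train_spectrum` is hypothetical, a DFT over a whole number of periods stands in for the Fourier Transform, and frequencies are expressed in cycles per sample.

```python
import numpy as np

def impulse_train_spectrum(pitch_period, num_periods):
    """Return (frequencies, magnitudes) of a periodic unit impulse train."""
    n_samples = pitch_period * num_periods
    x = np.zeros(n_samples)
    x[::pitch_period] = 1.0                # one impulse per pitch period
    magnitudes = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(n_samples)     # cycles per sample
    return freqs, magnitudes
```

For a pitch period of 20 samples, the only nonzero bins fall at 0, 1/20, 2/20, and so on, i.e., at multiples of the pitch frequency; the formant envelope of real speech would scale the heights of these peaks.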
When the noisy speech is represented by y[n], the speech by s[n], and the estimate of the speech from which the noise is removed by ŝ[n], the speech enhanced by the adaptive comb filter is expressed by Formula 5.
$$\hat{s}[n] = \sum_{i=-L}^{L} c_i \, y[n - iT_0]$$  [Formula 5]
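Formula 5 can be evaluated directly, as in the following minimal sketch. Here `pitch_period` plays the role of the pitch period in samples and `coeffs` holds the 2L+1 comb filter coefficients in order from index -L to L. The function name `comb_filter` is hypothetical, the coefficients are fixed rather than adapted, and samples outside the signal are treated as zero; all of these are assumptions made for illustration.

```python
import numpy as np

def comb_filter(y, pitch_period, coeffs):
    """Evaluate Formula 5 with fixed coefficients c_{-L}, ..., c_{L}.

    Assumes L * pitch_period is smaller than len(y).
    """
    y = np.asarray(y, dtype=float)
    L = (len(coeffs) - 1) // 2
    out = np.zeros(len(y))
    for i in range(-L, L + 1):
        shift = i * pitch_period
        c = coeffs[i + L]
        if shift >= 0:
            # out[n] += c * y[n - shift] for n >= shift
            out[shift:] += c * y[:len(y) - shift]
        else:
            # out[n] += c * y[n + |shift|] for n < len(y) - |shift|
            out[:shift] += c * y[-shift:]
    return out
```

With coefficients that sum to one, a signal that repeats exactly every pitch period passes through unchanged away from the edges, while non-periodic noise is averaged over 2L+1 shifted copies.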
In Formula 5, $T_0$ represents the extracted pitch period and $c_i$ represents a comb filter coefficient. A small value (1 to 6) is generally used for L. Since noise is generally not periodic, the adaptive comb filter is effective in removing the noise.
However, the related art speech quality enhancing methods have the following problems or disadvantages.
First, in the SSM, $\hat{S}_d(e^{j\omega})$ is estimated from the noise when there is no speech. However, $\hat{S}_d(e^{j\omega})$ cannot be measured reliably. Namely, $\hat{S}_d(e^{j\omega})$ can be estimated only if the noise d[n] is assumed to be a stationary signal. Even if the noise is actually stationary, variation of its spectrum over time cannot be avoided. In particular, for a mobile terminal or the like, $\hat{S}_d(e^{j\omega})$ cannot be measured reliably because the surrounding environment keeps changing.
Second, the ALE and the scheme using the adaptive comb filter show excellent performance on voiced speech. However, these schemes are applicable to a voiced signal only. If either scheme is applied to an unvoiced signal, which can result from even a slight error in the voiced/unvoiced (V/UV) decision, performance is reduced.
Third, for certain speech, a voiced characteristic appears in the low-frequency band or an unvoiced characteristic appears in the high-frequency band, whereby the performance of the ALE is degraded.