Since speaker phones allow easy communication among a plurality of people and can also provide a hands-free structure, speaker phones are included as an essential component in various communication devices. Recently, communication devices for video telephony have become popular due to the development of wireless communication technology. As communication devices capable of reproducing multimedia data and media reproduction devices such as portable multimedia players (PMPs) and MP3 players have become popular, local-area wireless communication devices such as Bluetooth devices have also become popular. Furthermore, hearing aids for people with impaired hearing have been developed and provided. Such speaker phones, hearing aids, communication devices for video telephony, and Bluetooth devices include a noisy speech signal processing apparatus for recognizing speech data in a noisy speech signal, i.e., a speech signal including noise, or for extracting an enhanced speech signal from the noisy speech signal by removing or weakening background noise.
The performance of the noisy speech signal processing apparatus decisively influences the performance of a speech-based application apparatus including the noisy speech signal processing apparatus, because the background noise almost always contaminates a speech signal and thus can greatly reduce the performance of a speech-based application apparatus such as a speech codec, a cellular phone, or a speech recognition device. Thus, research has been actively conducted on methods of efficiently processing a noisy speech signal by minimizing the influence of the background noise.
Speech recognition generally refers to a process of transforming an acoustic signal obtained by a microphone or a telephone into a word, a set of words, or a sentence. A first step for increasing the accuracy of the speech recognition is to efficiently extract a speech component, i.e., an acoustic signal, from a noisy speech signal input through a single channel. In order to extract only the speech component from the noisy speech signal, a method of processing the noisy speech signal by, for example, determining which of the noise and speech components is dominant in the noisy speech signal or accurately determining a noise state, should be efficiently performed.
Also, in order to improve sound quality of the noisy speech signal input through a single channel, only the noise component should be weakened or removed without damaging the speech component. Thus, the method of processing the noisy speech signal input through a single channel basically includes a noise estimation method of accurately determining the noise state of the noisy speech signal and calculating the noise component in the noisy speech signal by using the determined noise state. An estimated noise signal is used to weaken or remove the noise component from the noisy speech signal.
Various methods for improving sound quality by using the estimated noise signal exist. One of the methods is a spectral subtraction (SS) method. The SS method subtracts a spectrum of the estimated noise signal from a spectrum of the noisy speech signal, thereby obtaining an enhanced speech signal by weakening or removing noise from the noisy speech signal.
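The subtraction described above can be sketched for a single frame as follows. This is a minimal illustration only, assuming that magnitude spectra are given as plain lists of per-bin values and that negative results are clamped to a floor; the function name `spectral_subtraction` and the `floor` parameter are hypothetical and are not taken from any particular implementation.

```python
def spectral_subtraction(noisy_mag, noise_mag, floor=0.0):
    """Subtract an estimated noise magnitude spectrum from a noisy one, bin by bin.

    noisy_mag: magnitude spectrum of the noisy speech signal (one frame)
    noise_mag: magnitude spectrum of the estimated noise signal (same bins)
    floor:     assumed lower bound so that no bin becomes negative
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    # Clamp at `floor` because a magnitude cannot be negative.
    return [max(y - n, floor) for y, n in zip(noisy_mag, noise_mag)]


noisy = [5.0, 3.0, 1.0, 0.5]
noise = [1.0, 1.0, 1.0, 1.0]
print(spectral_subtraction(noisy, noise))  # [4.0, 2.0, 0.0, 0.0]
```

In this sketch the last bin is clamped to zero because the estimated noise exceeds the observed magnitude, which shows why the quality of the result depends directly on the quality of the noise estimate.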
A noisy speech signal processing apparatus using the SS method depends above all on accurate noise estimation, and the noise state should be accurately determined in order to accurately estimate the noise. However, it is difficult to determine the noise state of the noisy speech signal in real time and to accurately estimate the noise of the noisy speech signal in real time. In particular, if the noisy speech signal is contaminated in various non-stationary environments, it is very hard to determine the noise state, to accurately estimate the noise, or to obtain the enhanced speech signal by using the determined noise state and the estimated noise signal.
If the noise is inaccurately estimated, two side effects may arise. First, the estimated noise can be smaller than the actual noise. In this case, annoying residual noise or residual musical noise can be detected in the resulting speech signal. Second, the estimated noise can be larger than the actual noise. In this case, speech distortion can occur due to excessive SS.
A large number of methods have been suggested in order to determine the noise state and to accurately estimate the noise of the noisy speech signal. One of the methods is a voice activity detection (VAD)-based noise estimation method. According to the VAD-based noise estimation method, the noise state is determined and the noise is estimated by using statistical data obtained in a plurality of previous noise frames or a long previous frame. A noise frame refers to a silent frame or a speech-absent frame which does not include the speech component, or to a noise dominant frame where the noise component is overwhelmingly dominant in comparison to the speech component.
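As a rough illustration of the idea of estimating noise from previous noise frames, the following sketch classifies each frame with a simple energy threshold and averages the magnitude spectra of the most recent frames classified as noise. The energy-threshold decision, the class name `VadNoiseEstimator`, and the parameters `energy_threshold` and `history_frames` are assumptions made for this example; an actual VAD may use a quite different decision rule.

```python
from collections import deque


class VadNoiseEstimator:
    """Hypothetical VAD-based noise estimator using an energy threshold."""

    def __init__(self, num_bins, energy_threshold, history_frames=10):
        self.energy_threshold = energy_threshold
        # Only the most recent `history_frames` noise frames are kept.
        self.noise_frames = deque(maxlen=history_frames)
        self.estimate = [0.0] * num_bins

    def update(self, mag):
        """Process one magnitude spectrum; return (is_noise_frame, noise_estimate)."""
        energy = sum(m * m for m in mag)
        is_noise = energy < self.energy_threshold
        if is_noise:
            # Update the estimate only in frames judged to be noise frames.
            self.noise_frames.append(list(mag))
            count = len(self.noise_frames)
            self.estimate = [
                sum(frame[k] for frame in self.noise_frames) / count
                for k in range(len(mag))
            ]
        return is_noise, self.estimate


estimator = VadNoiseEstimator(num_bins=2, energy_threshold=5.0)
print(estimator.update([1.0, 1.0]))  # (True, [1.0, 1.0])
print(estimator.update([3.0, 1.0]))  # (False, [1.0, 1.0])
print(estimator.update([2.0, 0.0]))  # (True, [1.5, 0.5])
```

Note that the estimate is frozen while frames are classified as speech, so in this sketch any change in the noise during speech goes unnoticed until the next noise frame arrives.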
The VAD-based noise estimation method performs well when the noise does not vary greatly over time. However, for example, if the background noise is non-stationary or level-varying, if a signal to noise ratio (SNR) is low, or if a speech signal has a weak energy, the VAD-based noise estimation method cannot easily obtain reliable data regarding the noise state or a current noise level. Also, the VAD-based noise estimation method has a high computational cost.
In order to solve the above problems of the VAD-based noise estimation method, various new methods have been suggested. One well-known method is a recursive average (RA)-based weighted average (WA) method. The RA-based WA method estimates the noise in the frequency domain and continuously updates the estimated noise, without performing VAD. According to the RA-based WA method, the noise is estimated as a weighted average of a magnitude spectrum of the noisy speech signal in a current frame and the magnitude spectrum of the noise estimated in a previous frame, with a fixed forgetting factor used as the weight. However, since the fixed forgetting factor is used, the RA-based WA method cannot reflect noise variations in various noise environments or a non-stationary noise environment and thus cannot accurately estimate the noise.
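A recursive average with a fixed forgetting factor can be sketched as follows. The update form, estimate = alpha × previous estimate + (1 − alpha) × current magnitude, is one common way to write such a weighted average and is assumed here; the function name `recursive_average_update` and the value alpha = 0.9 are illustrative only.

```python
def recursive_average_update(prev_noise, noisy_mag, alpha=0.9):
    """One recursive-average step with a fixed forgetting factor `alpha`.

    prev_noise: noise magnitude spectrum estimated in the previous frame
    noisy_mag:  magnitude spectrum of the noisy speech signal in the current frame
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * n + (1.0 - alpha) * y for n, y in zip(prev_noise, noisy_mag)]


# Single-bin example: the true noise level jumps from 1.0 to 4.0.
estimate = [1.0]
for _ in range(5):
    estimate = recursive_average_update(estimate, [4.0], alpha=0.9)
print(round(estimate[0], 5))  # 2.22853
```

After five frames at the new level the estimate has covered well under half of the jump, which illustrates why a fixed forgetting factor has difficulty following rapid noise variations.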
Another noise estimation method suggested in order to cope with the problems of the VAD-based noise estimation method is a method using a minimum statistics (MS) algorithm. According to the MS algorithm, a minimum value of a smoothed power spectrum of the noisy speech signal is traced through a search window and the noise is estimated by multiplying the traced minimum value by a compensation constant. Here, the search window covers the frames of about the most recent 1.5 seconds. In spite of its generally excellent performance, since data of a long previous frame corresponding to the length of the search window is continuously required, the MS algorithm requires a large-capacity memory and cannot rapidly trace noise level variations in a noise dominant signal that is mostly occupied by a noise component. Also, since data regarding the estimated noise of a previous frame is basically used, the MS algorithm cannot obtain a reliable result when a noise level greatly varies or when a noise environment changes.
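The minimum tracing described above can be sketched as follows. This is a simplified illustration and not the full MS algorithm: the first-order smoothing, the class name `MinStatTracker`, and the parameters `window_frames`, `smoothing`, and `compensation` are assumptions for this example. With a hypothetical frame shift of 10 ms, a search window of about 1.5 seconds would correspond to `window_frames=150`.

```python
from collections import deque


class MinStatTracker:
    """Simplified minimum-statistics noise tracker (illustrative sketch)."""

    def __init__(self, num_bins, window_frames, smoothing=0.8, compensation=1.5):
        self.smoothing = smoothing
        self.compensation = compensation
        self.smoothed = None
        # One history buffer per frequency bin; memory grows with the window length.
        self.history = [deque(maxlen=window_frames) for _ in range(num_bins)]

    def update(self, power_spectrum):
        """Process one power spectrum; return the per-bin noise power estimate."""
        if self.smoothed is None:
            self.smoothed = list(power_spectrum)
        else:
            a = self.smoothing
            self.smoothed = [
                a * s + (1.0 - a) * p
                for s, p in zip(self.smoothed, power_spectrum)
            ]
        estimate = []
        for hist, s in zip(self.history, self.smoothed):
            hist.append(s)
            # Minimum over the search window, scaled by the compensation constant.
            estimate.append(self.compensation * min(hist))
        return estimate


# Smoothing and compensation are disabled here so the window effect is easy to follow.
tracker = MinStatTracker(num_bins=1, window_frames=3, smoothing=0.0, compensation=1.0)
print([tracker.update([p])[0] for p in [4.0, 2.0, 5.0, 6.0, 7.0]])
# [4.0, 2.0, 2.0, 2.0, 5.0]
```

In this example the input power rises from the third frame onward, but the estimate stays at the old minimum until that minimum leaves the search window, which illustrates both the tracking delay and the per-bin history that must be stored.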
In order to solve the above problems of the MS algorithm, various corrected MS algorithms have been suggested. The two most common characteristics of the corrected MS algorithms are as follows. First, the corrected MS algorithms use a VAD method of continuously verifying whether a current frame or a frequency bin, which is a target to be considered, includes a speech component or is a silent sub-band. Second, the corrected MS algorithms use an RA-based noise estimator.
However, although such corrected MS algorithms can solve the problems of the MS algorithm to a certain degree, for example, a problem of time delay of noise estimation and a problem of inaccurate noise estimation in a non-stationary environment, they cannot completely solve those problems. This is because the MS algorithm and the corrected MS algorithms intrinsically use the same method, i.e., a method of estimating noise of a current frame by reflecting and using an estimated noise signal of a plurality of previous noise frames or a long previous frame, and thus require a large-capacity memory and a large amount of calculation.
Thus, the MS algorithm and the corrected MS algorithms cannot rapidly and accurately estimate background noise whose level varies greatly, in a variable noise environment or in a noise dominant frame. Furthermore, the VAD-based noise estimation method, the MS algorithm, and the corrected MS algorithms not only require a large-capacity memory in order to determine the noise state but also incur a high computational cost.