Field of the Invention
The present invention relates to a noise estimator and a noise estimating method which are applied, for instance, to a noise suppressor or a speech enhancer that suppresses noise added to speech by frequency-domain processing.
Description of the Background Art
Because noise is present throughout natural environments, sounds observed in practice generally include noise from various sources. To enhance the speech in input signals consisting of speech and noise, various noise suppression methods have been developed. Almost all of these methods first estimate the noise to be suppressed and then suppress the noise included in the input signals. The invention relates to noise estimation, and in particular to estimating the power of the noise in the frequency domain.
The simplest conventional noise estimating method averages input spectra within speech absent periods. However, this method needs the speech absent periods to be estimated in advance. Techniques for estimating speech active periods, such as voice activity detection (VAD), are being actively researched, but a perfect VAD has not yet been achieved. An error in estimating the speech active periods causes speech to be included in the estimated noise. As a result, the enhanced speech is distorted and residual noise remains. Moreover, because this method estimates the noise only in the noise periods, the estimate cannot follow variations in the noise during a long speech active period.
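The averaging described above can be sketched for a single sub-band as follows. This is only an illustrative sketch: the function name, the per-frame boolean flags standing in for a VAD decision, and the sample values are all assumptions and do not come from any cited reference.

```python
def average_noise_power(input_powers, speech_active):
    """Average the input power over frames flagged as speech-absent.

    input_powers: per-frame power values for one sub-band.
    speech_active: per-frame booleans from a hypothetical VAD.
    Returns None when no speech-absent frame is available.
    """
    noise_frames = [p for p, active in zip(input_powers, speech_active)
                    if not active]
    if not noise_frames:
        return None
    return sum(noise_frames) / len(noise_frames)


# Hypothetical data: frames 2 and 3 contain speech.
powers = [1.0, 1.2, 9.0, 8.5, 0.8]
flags_correct = [False, False, True, True, False]
flags_missed = [False, False, False, True, False]  # VAD misses frame 2

print(average_noise_power(powers, flags_correct))
print(average_noise_power(powers, flags_missed))
```

With the second set of flags, the missed speech frame is averaged into the noise estimate and inflates it, which is the VAD-error problem noted above.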
By contrast, other noise estimating methods that estimate the noise continuously, even in the speech active periods, have been developed, for example, as described in Rainer Martin, “Spectral Subtraction Based on Minimum Statistics”, in Proceedings of 7th European Signal Processing Conference, 1994, pp. 1182-1185, and in Mehrez Souden et al., “Noise Power Spectral Density Tracking: A Maximum Likelihood Perspective”, IEEE Signal Processing Letters, Vol. 19, No. 8, August 2012, pp. 495-498, as well as in U.S. Pat. No. 7,590,528 B1 to Kato et al. The configuration and operation of a conventional noise suppressor applying the methods taught by Martin, Souden et al., and Kato et al. will be briefly described below.
The conventional noise suppressor includes a sub-band divider for dividing an input signal into sub-band input signals, as many sub-band processors as there are sub-band input signals for processing the respective sub-band input signals (for example, when the input signal is divided into 256 sub-band input signals, the noise suppressor includes 256 sub-band processors), and a signal reconstructor for reconstructing a temporal waveform from the sub-band enhanced signals output by the sub-band processors.
The sub-band divider divides an input signal into K sub-bands (e.g. K = 256) by any sub-band division method, such as a filter bank, or any frequency analysis method, such as the Fourier transform, and transmits the resultant K sub-band input signals to the respective sub-band processors. A digital signal such as the input signal may be processed sample by sample or, if necessary, frame by frame, e.g. at 10-millisecond intervals. Hereinafter, the words “signal” and “component” may be omitted when this specification refers to various signals and components.
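As one possible illustration of frame-by-frame frequency analysis, the sketch below cuts a signal into frames and computes the power in each of K sub-bands with a direct discrete Fourier transform. The function names, the small frame length (K = 8 instead of 256), and the use of non-overlapping, unwindowed frames are simplifying assumptions made for this example only.

```python
import cmath
import math


def split_frames(samples, frame_len):
    """Cut the signal into consecutive non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def subband_powers(frame):
    """Return |X_k|^2 for k = 0..K-1 using a direct O(K^2) DFT."""
    k_total = len(frame)
    powers = []
    for k in range(k_total):
        x_k = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / k_total)
                  for n in range(k_total))
        powers.append(abs(x_k) ** 2)
    return powers


# Hypothetical input: a sine completing exactly 2 cycles per 8-sample frame,
# so its energy falls into sub-bands 2 and 6 (the mirrored bin).
signal = [math.sin(2 * math.pi * 2 * n / 8) for n in range(16)]
for frame in split_frames(signal, 8):
    print([round(p, 3) for p in subband_powers(frame)])
```

A practical divider would more likely use a fast Fourier transform or a filter bank; the direct transform is used here only to keep the sketch self-contained.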
The sub-band processors operate on respectively different sub-bands, but the processing is essentially the same in every sub-band. Each sub-band processor includes a sub-band noise estimator and a noise suppressor. The sub-band noise estimator estimates the noise power of its sub-band and transmits the resultant sub-band noise power to the noise suppressor. The noise suppressor enhances the speech component in the sub-band input signal on the basis of the sub-band input signal and the sub-band noise power, and transmits the resultant sub-band enhanced signal to the signal reconstructor.
The signal reconstructor reconstructs a temporal waveform from the sub-band enhanced signals by a signal reconstruction method corresponding to the sub-band division method or frequency analysis method used in the sub-band divider, and outputs the resultant enhanced signal.
Now, the conventional noise estimating methods carried out in the sub-band noise estimator will be described in detail. The sub-band noise estimator implements, for example, a method taught by Martin, Souden et al., or Kato et al. In the following, for simplicity, the sub-band input signal power and the sub-band noise power are called the “input power” and the “noise power”, respectively, and the sub-band number is omitted.
The noise estimating method taught by Martin is based on the observation that a peak of the input power along the time axis indicates the presence of the target speech, whereas the valleys of the input power along the time axis are useful for estimating the smoothed noise power. For instance, the minimum value of the input power over the interval from a predetermined time (T seconds) before the present up to the present is taken as a first estimated value of the noise power. However, the first estimated value is biased and tends to be smaller than the true noise power. This bias is estimated on the basis of the expected value of the first estimated value. By correcting the first estimated value with the resultant bias estimate, a second (final) estimated value of the noise power is obtained.
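The two-step estimate described above (sliding minimum, then bias correction) can be sketched as follows. This is a simplified illustration, not Martin's algorithm as published: the input power is not smoothed, and the bias is corrected with a fixed, hypothetical factor `bias_correction` instead of being estimated from the expected value of the first estimate. The function and parameter names are assumptions.

```python
from collections import deque


def min_tracking_noise(input_powers, window_len, bias_correction=1.5):
    """Sliding-minimum noise estimate for one sub-band.

    window_len: number of frames covering the look-back time T.
    bias_correction: hypothetical constant factor (> 1) that compensates
        the downward bias of the minimum.
    """
    window = deque(maxlen=window_len)  # keeps only the last window_len frames
    estimates = []
    for p in input_powers:
        window.append(p)
        first_estimate = min(window)   # first estimate, biased low
        estimates.append(bias_correction * first_estimate)
    return estimates


# Hypothetical input: the noise power steps from 1.0 to 4.0 at frame 5.
powers = [1.0] * 5 + [4.0] * 6
print(min_tracking_noise(powers, window_len=4, bias_correction=1.0))
```

In this example the estimate stays at the old level until every pre-step value has left the window, and only then jumps to the new level.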
The noise estimating method taught by Souden et al. is based on the hypothesis that the complex spectra of the target speech and of the noise both follow zero-mean complex normal distributions, and takes the maximum likelihood (ML) estimate of the variance of the complex spectrum of the noise as the estimated value of the noise power. Under this hypothesis, the complex spectrum of the input signal follows a zero-mean complex normal distribution whose variance is the sum of the variances of the complex spectra of the speech and the noise. The method introduces a hidden variable indicating whether the present input is a degraded signal (speech plus noise) or noise only, and applies an online expectation-maximization (EM) algorithm with a forgetting coefficient. In this way, the ML estimate of the variance of the complex spectrum of the noise can be calculated.
In the noise estimating method taught by Kato et al., the input power is multiplied by a suitable weight coefficient, the resultant weighted input power is stored for a predetermined time (T seconds), and the average of the stored weighted input powers is taken as the estimated value of the noise power. The weight coefficient is calculated from the a posteriori signal-to-noise ratio (SNR), which is obtained by dividing the present input power by the previous estimated value of the noise power. For instance, the weight coefficient is set to 1 when the a posteriori SNR is equal to or less than a predetermined value G1, and is set inversely proportional to the a posteriori SNR when the a posteriori SNR is greater than G1. Moreover, the weight coefficient is set to zero when the a posteriori SNR is greater than another predetermined value G2. When the weight coefficient is zero, the weighted input power is not stored.
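The weighting rule and the averaging described above can be sketched as follows. Several details are assumptions made only for illustration: the proportionality constant is chosen as G1 so that the weight is continuous at G1, the buffer keeps the most recent `buffer_len` stored values rather than tracking elapsed time, and the threshold values 2.0 and 10.0 are hypothetical.

```python
from collections import deque


def weight(snr_post, g1, g2):
    """Weight coefficient as a function of the a posteriori SNR."""
    if snr_post > g2:
        return 0.0           # treated as speech: contributes nothing
    if snr_post <= g1:
        return 1.0           # treated as noise: used as-is
    return g1 / snr_post     # inversely proportional (constant G1 is assumed)


def weighted_noise_estimate(input_powers, initial_noise, buffer_len,
                            g1=2.0, g2=10.0):
    """Average of stored weighted input powers for one sub-band.

    initial_noise must be positive; it seeds the first a posteriori SNR.
    """
    stored = deque(maxlen=buffer_len)
    noise = initial_noise
    estimates = []
    for p in input_powers:
        # Present input power divided by the previous noise estimate.
        snr_post = p / max(noise, 1e-12)
        w = weight(snr_post, g1, g2)
        if w > 0.0:          # a zero-weight frame is not stored
            stored.append(w * p)
        if stored:
            noise = sum(stored) / len(stored)
        estimates.append(noise)
    return estimates


# Hypothetical input: one loud frame (50.0) among noise-like frames.
print(weighted_noise_estimate([1.0, 1.0, 50.0, 1.0],
                              initial_noise=1.0, buffer_len=3))
```

In this example the loud frame exceeds G2, receives a weight of zero, and therefore does not disturb the estimate.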
However, the conventional noise estimating methods have the problems described below. With the noise estimating method taught by Martin, unpleasant residual noise is left by the subsequent noise suppression stage when the noise increases rapidly. Specifically, the estimated value of the noise power stays small for a predetermined time after the noise begins to increase, and then increases rapidly once the predetermined time has elapsed. If this estimated value is used for noise suppression, the residual noise increases rapidly at the moment the noise increases and then decreases rapidly after the predetermined time. This rapid variation in the volume of the residual noise is unpleasant to listeners.
With the noise estimating method taught by Souden et al., the noise power is overestimated or underestimated when the noise level varies. The online EM algorithm used in this method has a trade-off between the speed of convergence and the stability of the ML estimate. When the forgetting coefficient is increased, the stability improves but the convergence slows down. Conversely, when the forgetting coefficient is decreased, the convergence speeds up but the stability deteriorates. As a result, whether the forgetting coefficient is increased or decreased, the estimated value of the noise power is inaccurate. In the subsequent noise suppression stage, the distortion of the enhanced speech and the residual noise both increase.
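The trade-off governed by a forgetting coefficient can be illustrated with a much simpler estimator than the online EM algorithm. The sketch below is a plain first-order recursive average, not the method of Souden et al.; it is used only because it shows the same qualitative behavior. The function name and the sample values are assumptions.

```python
def recursive_average(input_powers, forgetting, initial=0.0):
    """First-order recursive average with forgetting coefficient in [0, 1).

    A larger coefficient keeps more of the past estimate (smoother, slower);
    a smaller one follows the input more closely (faster, less stable).
    """
    estimate = initial
    estimates = []
    for p in input_powers:
        estimate = forgetting * estimate + (1.0 - forgetting) * p
        estimates.append(estimate)
    return estimates


# Hypothetical input: the noise power steps from 1.0 to 4.0.
step = [4.0] * 10
slow = recursive_average(step, forgetting=0.9, initial=1.0)
fast = recursive_average(step, forgetting=0.5, initial=1.0)
print(round(slow[-1], 3), round(fast[-1], 3))

# Hypothetical input: a stationary level of 2.0 with alternating fluctuation.
noisy = [1.0, 3.0] * 10
smooth = recursive_average(noisy, forgetting=0.9, initial=2.0)
jumpy = recursive_average(noisy, forgetting=0.5, initial=2.0)
print(round(max(smooth) - min(smooth), 3), round(max(jumpy) - min(jumpy), 3))
```

After the step, the estimate with the larger coefficient is still far from the new level while the one with the smaller coefficient has nearly converged; on the fluctuating input, the ordering of the two is reversed with respect to the spread of the estimate.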
With the noise estimating method taught by Kato et al., the estimated value of the noise power is relatively unlikely to follow the speech by mistake or to become unstable by following non-stationary noise. Moreover, this method can follow variations in the noise relatively quickly. However, in a noise period that follows a succession of speech active periods in which the weight coefficient does not become zero, the estimated value of the noise power decreases rapidly approximately T seconds after the switch from the speech active periods to the noise period. If this estimated value is used in the subsequent noise suppression stage, the enhanced signal sounds unnatural, because the residual noise increases rapidly in the noise period.
As mentioned above, the conventional noise estimating methods have the problems that the estimated value of the noise power becomes unstable and varies rapidly.