The present invention relates to noise cancellation and reduction and, more specifically, to noise cancellation and reduction using spectral subtraction.
Ambient noise added to speech degrades the performance of speech processing algorithms. Such processing algorithms may include dictation, voice activation, voice compression and other systems. In such systems, it is desired to reduce the noise and improve the signal to noise ratio (S/N ratio) without effecting the speech and its characteristics.
Near field noise canceling microphones provide a satisfactory solution but require that the microphone in the proximity of the voice source (e.g., mouth). In many cases, this is achieved by mounting the microphone on a boom of a headset which situates the microphone at the end of a boom proximate the mouth of the wearer. However, the headset has proven to be either uncomfortable to wear or too restricting for operation in, for example, an automobile.
Microphone array technology in general, and adaptive beamforming arrays in particular, handle severe directional noises in the most efficient way. These systems map the noise field and create nulls towards the noise sources. The number of nulls is limited by the number of microphone elements and processing power. Such arrays have the benefit of hands-free operation without the necessity of a headset.
However, when the noise sources are diffused, the performance of the adaptive system will be reduced to the performance of a regular delay and sum microphone array, which is not always satisfactory. This is the case where the environment is quite reverberant, such as when the noises are strongly reflected from the walls of a room and reach the array from an infinite number of directions. Such is also the case in a car environment for some of the noises radiated from the car chassis.
The spectral subtraction technique provides a solution to further reduce the noise by estimating the noise magnitude spectrum of the polluted signal. The technique estimates the magnitude spectral level of the noise by measuring it during non-speech time intervals detected by a voice switch, and then subtracting the noise magnitude spectrum from the signal. This method, described in detail in Suppression of Acoustic Noise in Speech Using Spectral Subtraction, (Steven F Boll, IEEE ASSP-27 NO.2 April, 1979), achieves good results for stationary diffused noises that are not correlated with the speech signal. The spectral subtraction method, however, creates artifacts, sometimes described as musical noise, that may reduce the performance of the speech algorithm (such as vocoders or voice activation) if the spectral subtraction is uncontrolled. In addition, the spectral subtraction method assumes erroneously that the voice switch accurately detects the presence of speech and locates the non-speech time intervals. This assumption is reasonable for off-line systems but difficult to achieve or obtain in real time systems.
More particularly, the noise magnitude spectrum is estimated by performing an FFT of 256 points of the non-speech time intervals and computing the energy of each frequency bin. The FFT is performed after the time domain signal is multiplied by a shading window (Hanning or other) with an overlap of 50%. The energy of each frequency bin is averaged with neighboring FFT time frames. The number of frames is not determined but depends on the stability of the noise. For a stationary noise, it is preferred that many frames are averaged to obtain better noise estimation. For a non-stationary noise, a long averaging may be harmful. Problematically, there is no means to know a-priori whether the noise is stationary or non-stationary.
Assuming the noise magnitude spectrum estimation is calculated, the input signal is multiplied by a shading window (Hanning or other), an FFT is performed (256 points or other) with an overlap of 50% and the magnitude of each bin is averaged over 2-3 FFT frames. The noise magnitude spectrum is then subtracted from the signal magnitude. If the result is negative, the value is replaced by a zero (Half Wave Rectification). It is recommended, however, to further reduce the residual noise present during non-speech intervals by replacing low values with a minimum value (or zero) or by attenuating the residual noise by 30 dB. The resulting output is the noise free magnitude spectrum.
The spectral complex data is reconstructed by applying the phase information of the relevant bin of the signal""s FFT with the noise free magnitude. An IFFT process is then performed on the complex data to obtain the noise free time domain data. The time domain results are overlapped and summed with the previous frame""s results to compensate for the overlap process of the FFT.
There are several problems associated with the system described. First, the system assumes that there is a prior knowledge of the speech and non-speech time intervals. A voice switch is not practical to detect those periods. Theoretically, a voice switch detects the presence of the speech by measuring the energy level and comparing it to a threshold. If the threshold is too high, there is a risk that some voice time intervals might be regarded as a non-speech time interval and the system will regard voice information as noise. The result is voice distortion, especially in poor signal to noise ratio cases. If, on the other hand, the threshold is too low, there is a risk that the non-speech intervals will be too short especially in poor signal to noise ratio cases and in cases where the voice is continuous with little intermission.
Another problem is that the magnitude calculation of the FFT result is quite complex. This involves square and square root calculations which are very expensive in terms of computation load. Yet another problem is the association of the phase information to the noise free magnitude spectrum in order to obtain the information for the IFFT. This process requires the calculation of the phase, the storage of the information, and applying the information to the magnitude dataxe2x80x94all are expensive in terms of computation and memory requirements. Another problem is the estimation of the noise spectral magnitude. The FFT process is a poor and unstable estimator of energy. The averaging-over-time of frames contributes insufficiently to the stability. Shortening the length of the FFT results in a wider bandwidth of each bin and better stability but reduces the performance of the system. Averaging-over-time, moreover, smears the data and, for this reason, cannot be extended to more than a few frames. This means that the noise estimation process proposed is not sufficiently stable.
It is therefore an object of this invention to provide a spectral subtraction system that has a simple, yet efficient mechanism, to estimate the noise magnitude spectrum even in poor signal-to-noise ratio situations and in continuous fast speech cases.
It is another object of this invention to provide an efficient mechanism that can perform the magnitude estimation with little cost, and will overcome the problem of phase association.
It is yet another object of this invention to provide a stable mechanism to estimate the noise spectral magnitude without the smearing of the data.
In accordance with the foregoing objectives, the present invention provides a system that correctly determines the non-speech segments of the audio signal thereby preventing erroneous processing of the noise canceling signal during the speech segments. In the preferred embodiment, the present invention obviates the need for a voice switch by precisely determining the non-speech segments using a separate threshold detector for each frequency bin. The threshold detector precisely detects the positions of the noise elements, even within continuous speech segments, by determining whether frequency spectrum elements, or bins, of the input signal are within a threshold set according to a minimum value of the frequency spectrum elements over a preset period of time. More precisely, current and future minimum values of the frequency spectrum elements. Thus, for each syllable, the energy of the noise elements is determined by a separate threshold determination without examination of the overall signal energy thereby providing good and stable estimation of the noise. In addition, the system preferably sets the threshold continuously and resets the threshold within a predetermined period of time of, for example, five seconds.
In order to reduce complex calculations, it is preferred in the present invention to obtain an estimate of the magnitude of the input audio signal using a multiplying combination of the real and imaginary parts of the input in accordance with, for example, the higher and the lower values of the real and imaginary parts of the signal. In order to further reduce instability of the spectral estimation, a two-dimensional (2D) smoothing process is applied to the signal estimation. A two-step smoothing function using first neighboring frequency bins in each time frame then applying an exponential time average effecting an average over time for each frequency bin produces excellent results.
In order to reduce the complexity of determining the phase of the frequency bins during subtraction to thereby align the phases of the subtracting elements, the present invention applies a filter multiplication to effect the subtraction. The filter function, a Weiner filter function for example, or an approximation of the Weiner filter is multiplied by the complex data of the frequency domain audio signal. The filter function may effect a full-wave rectification, or a half-wave rectification for otherwise negative results of the subtraction process or simple subtraction. It will be appreciated that, since the noise elements are determined within continuous speech segments, the noise estimation is accurate and it may be canceled from the audio signal continuously providing excellent noise cancellation characteristics.
The present invention also provides a residual noise reduction process for reducing the residual noise remaining after noise cancellation. The residual noise is reduced by zeroing the non-speech segments, e.g., within the continuous speech, or decaying the non-speech segments. A voice switch may be used or another threshold detector which detects the non-speech segments in the time-domain.
The present invention is applicable with various noise canceling systems including, but not limited to, those systems described in the U.S. patent applications incorporated herein by reference. The present invention, for example, is applicable with the adaptive beamforming array. In addition, the present invention may be embodied as a computer program for driving a computer processor either installed as application software or as hardware.