Common noise reduction algorithms make assumptions about the type of noise present in a noisy signal. The Wiener filter, for example, uses the mean squared error (MSE) as its cost function, an objective distance measure, and minimizes the distance between the desired signal and the filtered signal. The MSE, however, does not account for human perception of signal quality. Also, filtering algorithms are usually applied to each frequency bin independently. Thus, all types of signals are treated equally. This allows for good noise reduction performance under many different circumstances.
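For reference, the MSE criterion and the resulting Wiener weighting can be written per frequency bin as follows. This is the textbook form under the assumption that speech and noise are uncorrelated. The symbols S (desired speech spectrum), Y (noisy spectrum), H (filter weight), Φ_ss and Φ_nn (speech and noise power spectral densities) and ξ (signal-to-noise ratio) are notation introduced here for illustration, with k as the frame index and μ as the frequency bin index.

```latex
\begin{align}
J(k,\mu) &= \mathrm{E}\left\{ \left| S(k,\mu) - H(k,\mu)\, Y(k,\mu) \right|^2 \right\}, \\
H_{\mathrm{W}}(k,\mu) &= \frac{\Phi_{ss}(k,\mu)}{\Phi_{ss}(k,\mu) + \Phi_{nn}(k,\mu)}
  = \frac{\xi(k,\mu)}{1 + \xi(k,\mu)},
\qquad \xi(k,\mu) = \frac{\Phi_{ss}(k,\mu)}{\Phi_{nn}(k,\mu)}.
\end{align}
```

Each weight depends only on the SNR in its own bin (k, μ), which is why this criterion treats every frequency bin independently.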
However, mobile communication situations in an automobile environment are special in that they contain speech as their desired signal. The noise present while driving is mainly characterized by noise levels that increase toward lower frequencies. Speech signal processing starts with an input audio signal from a speech-sensing microphone. The microphone signal represents a composite of multiple different sound sources. Except for the speech component, all of the other sound source components in the microphone signal act as undesirable noise that complicates the processing of the speech component. Separating the desired speech component from the noise components has been especially difficult in moderate to high noise settings, such as within the cabin of an automobile traveling at highway speeds, when multiple persons are speaking simultaneously, or in the presence of audio content.
In speech signal processing, the microphone signal is usually first segmented into overlapping blocks of appropriate size and a window function is applied. Each windowed signal block is then transformed into the frequency domain using a fast Fourier transform (FFT) to produce noisy short-term spectra. In order to reduce the undesirable noise components while keeping the speech signal as natural as possible, SNR-dependent (SNR: signal-to-noise ratio) weighting coefficients are computed and applied to the short-term spectra. However, conventional methods use an SNR-dependent weighting rule which operates on each frequency bin independently and which does not take into account the characteristics of the actual speech sound being processed.
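The conventional processing described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not a specific prior-art implementation: the function names, the Hann window, the frame length of 256 with a hop of 128, the Wiener-style rule with a crude SNR estimate, and the gain floor of 0.1 are all choices made here for the example.

```python
import numpy as np

def analysis_filter_bank(y, frame_len=256, hop=128):
    """Segment y into overlapping blocks, window each block, and FFT it.

    Returns Y with shape (n_frames, frame_len // 2 + 1): one short-term
    spectrum per frame index k, one column per frequency bin index mu.
    """
    window = np.hanning(frame_len)  # assumed window choice
    n_frames = 1 + (len(y) - frame_len) // hop
    Y = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    for k in range(n_frames):
        block = y[k * hop : k * hop + frame_len]
        Y[k] = np.fft.rfft(window * block)
    return Y

def snr_weights(Y, noise_psd, floor=0.1):
    """Wiener-style SNR-dependent weights, computed per bin in isolation.

    noise_psd is an assumed, externally supplied noise power estimate
    per frequency bin (shape: n_bins).
    """
    snr_post = np.abs(Y) ** 2 / noise_psd      # noisy power over noise power
    snr_est = np.maximum(snr_post - 1.0, 0.0)  # crude SNR estimate, clipped at 0
    H = snr_est / (1.0 + snr_est)              # Wiener-style gain in [0, 1)
    return np.maximum(H, floor)                # floor limits over-suppression

# Usage: a pure tone centred on bin 8, with a flat noise estimate.
t = np.arange(1024)
y = np.sin(2 * np.pi * 8 * t / 256)
Y = analysis_filter_bank(y)
H = snr_weights(Y, noise_psd=np.ones(Y.shape[1]))
```

Note that `snr_weights` looks only at the power in each bin and never at neighbouring bins or at the kind of speech sound in the frame, which is the limitation the paragraph above points out.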
FIG. 1 shows a typical arrangement for noise reduction of speech signals. An analysis filter bank 102 receives the microphone signal y(i) from microphone 101. The signal y(i) includes both a speech component s(i) and a noise component n(i) received by the microphone. The parameter i is the sample index, which identifies the time period for the sample of the microphone signal y. The analysis filter bank 102 converts each block of time-domain microphone samples into a frequency-domain frame by applying an FFT. The analysis filter bank 102 separates the spectral coefficients into frequency bins. As noted in the figure, the frequency-domain representation of the microphone signal is Y(k,μ), where k represents the frame index and μ represents the frequency bin index. The frequency-domain representation of the microphone signal is provided to a noise reduction filter 103. Signal-to-noise ratio weighting coefficients are calculated in the noise reduction filter, resulting in the filter coefficients H(k,μ), and the filter coefficients and the frequency-domain representation are multiplied, resulting in a noise-reduced signal Ŝ(k,μ). The noise-reduced frequency-domain signals are collected in the synthesis filter bank for all frequencies of a frame, and the frame is passed through an inverse transform (e.g., an inverse FFT).
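The signal chain of FIG. 1 can be sketched end to end as follows. This is an illustrative sketch under assumptions, not the arrangement's actual implementation: the function name `noise_reduction_chain`, the `gain_rule` callback standing in for noise reduction filter 103, the square-root Hann window, and the overlap-add synthesis with 50% overlap are all choices made here.

```python
import numpy as np

def noise_reduction_chain(y, gain_rule, frame_len=256, hop=128):
    """Analysis filter bank -> per-bin weighting -> synthesis filter bank.

    gain_rule is a hypothetical callback mapping one noisy spectrum
    Y(k, mu) to weights H(k, mu) (a scalar or one value per bin).
    """
    # Periodic square-root Hann window, applied at analysis and synthesis.
    # Its square sums to 1 across frames when hop == frame_len // 2.
    n = np.arange(frame_len)
    window = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * n / frame_len))

    s_hat = np.zeros(len(y))
    for start in range(0, len(y) - frame_len + 1, hop):
        block = y[start:start + frame_len]
        Y = np.fft.rfft(window * block)              # analysis: Y(k, mu)
        H = gain_rule(Y)                             # weights:  H(k, mu)
        S_hat = H * Y                                # noise-reduced spectrum
        frame = np.fft.irfft(S_hat, n=frame_len)     # inverse FFT of the frame
        s_hat[start:start + frame_len] += window * frame  # overlap-add
    return s_hat

# Usage: with all weights equal to 1 the chain passes the signal through.
rng = np.random.default_rng(0)
y = rng.standard_normal(1024)
out = noise_reduction_chain(y, lambda Y: 1.0)
```

With unit weights, the samples covered by two overlapping frames are reconstructed exactly, so any change in the output comes from the weighting rule alone. The first and last half-frames are covered by only one frame and are attenuated by the window.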