In speech signal processing for automatic speech recognition (ASR) and hands-free speech communication, a microphone signal is usually first segmented into overlapping blocks of appropriate size, and a window function is applied to each block. The speech signal processing can be performed in the time domain and/or in the frequency domain: time domain processing operates directly on the speech waveform, while frequency domain processing operates on spectral representations of the speech signal.
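The segmentation and windowing step can be sketched as follows. This is a minimal illustration, not the text's specific implementation; the block length L=512, frame shift R=128, and the choice of a Hann window are illustrative assumptions.

```python
import numpy as np

def segment(x, L=512, R=128):
    """Split a signal into overlapping blocks of length L with frame shift R,
    applying a window function to each block (illustrative parameters)."""
    w = np.hanning(L)                       # window sequence w(i), assumed Hann
    n_frames = 1 + (len(x) - L) // R        # number of complete frames
    return np.stack([x[k * R : k * R + L] * w for k in range(n_frames)])

x = np.random.randn(16000)                  # e.g. 1 s of audio at 16 kHz
frames = segment(x)
print(frames.shape)                         # (122, 512)
```

With R < L the blocks overlap, which later allows the processed frames to be recombined smoothly by overlap-add.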
Operations in the frequency domain rely on the short-term Fourier transform (STFT). In this process, the sequence of sampled amplitude values x, as a function of the sample index i, is shifted by R samples per frame, multiplied by a window sequence w, and then transformed to the frequency domain by a discrete Fourier transform. This step is called analysis, and its realization is often referred to as an analysis filter bank:
$$X(k, \mu) = \sum_{i=0}^{L-1} x(i + Rk)\, w(i)\, e^{-j \frac{2\pi \mu i}{N}}$$
for each sample index i, frame index k, frequency bin μ, frame shift R, window length L, and DFT size N. After processing in the frequency domain, the resulting spectrum is transformed to the time domain again by an inverse STFT. In analogy to the previous step, this one is called synthesis, and its implementation is referred to as the synthesis filter bank.
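The analysis and synthesis filter banks above can be sketched directly from the formula. This is a generic weighted overlap-add scheme, not the text's particular filter bank; the Hann window and the normalization by the summed squared window are assumptions, and the reconstruction is only valid where the overlapping windows cover the signal.

```python
import numpy as np

def analysis(x, w, R, N):
    """Analysis filter bank:
    X(k, mu) = sum_{i=0}^{L-1} x(i + R*k) * w(i) * exp(-j*2*pi*mu*i/N)."""
    L = len(w)
    K = 1 + (len(x) - L) // R               # number of frames k
    return np.stack([np.fft.fft(x[k * R : k * R + L] * w, n=N)
                     for k in range(K)])

def synthesis(X, w, R):
    """Synthesis filter bank: inverse DFT per frame, then weighted
    overlap-add with normalization by the summed squared window."""
    L = len(w)
    K = X.shape[0]
    y = np.zeros((K - 1) * R + L)
    wsum = np.zeros_like(y)
    for k in range(K):
        y[k * R : k * R + L] += np.real(np.fft.ifft(X[k])[:L]) * w
        wsum[k * R : k * R + L] += w ** 2
    return y / np.maximum(wsum, 1e-12)      # guard against zero overlap at edges
```

Without any modification of the spectra, this analysis/synthesis pair reconstructs the input signal (up to edge effects in the first and last blocks, where fewer windows overlap).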
Frequency domain processing yields noisy short-term spectra. In order to reduce the undesirable noise components while keeping the speech signal as natural as possible, SNR-dependent (SNR: signal-to-noise ratio) weighting coefficients are computed and applied to these spectra. Common noise reduction algorithms make assumptions about the type of noise present in a noisy signal. The Wiener filter, for example, uses the mean squared error (MSE) cost function as an objective distance measure and minimizes the distance between the desired and the filtered signal. The MSE, however, does not account for human perception of signal quality. Moreover, filtering algorithms are usually applied to each frequency bin independently, so all types of signals are treated equally. This allows for good noise reduction performance under many different circumstances.
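An SNR-dependent weighting rule of the Wiener type can be sketched per frequency bin as follows. This is a minimal textbook-style sketch, not the text's specific rule: the noise power estimate `noise_psd` (e.g., obtained during speech pauses) is assumed given, the SNR is estimated by the simple maximum-likelihood approach, and the spectral floor `g_min` is an illustrative assumption to limit musical noise.

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, g_min=0.1):
    """Per-bin Wiener weighting coefficients G = SNR / (1 + SNR), with an
    SNR estimate from the noisy and noise power spectra and a spectral floor."""
    # ML estimate of the SNR per bin, clipped at zero (assumption)
    snr = np.maximum(noisy_psd / np.maximum(noise_psd, 1e-12) - 1.0, 0.0)
    gain = snr / (1.0 + snr)                # Wiener weighting rule
    return np.maximum(gain, g_min)          # floor to avoid over-attenuation

# The filtered spectrum of one frame would then be S_hat = gain * X[k]
```

Because the gain is computed for every bin independently of the signal type, the same rule is applied whether a bin contains speech, music, or noise, which is exactly the behavior the surrounding text discusses.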
However, mobile communication situations in an automobile environment are special in that their desired signal is speech. The noise present while driving is characterized mainly by levels that increase toward lower frequencies. Speech signal processing starts with an input audio signal from a speech-sensing microphone. The microphone signal represents a composite of multiple sound sources; apart from the speech component, all other sound source components act as undesirable noise that complicates the processing of the speech component.
Separating the desired speech component from the noise components has been especially difficult in moderate to high noise settings, such as within the cabin of an automobile traveling at highway speeds, when multiple persons are speaking simultaneously, or in the presence of other audio content. In high noise conditions, often only a low output quality is achieved: speech components are frequently distorted to such an extent that they are masked by the background noise. Standard noise suppression rules classify these parts as noise and, as a consequence, apply maximum attenuation.