An automatic speech processing engine, including, but not limited to, an automatic speech recognition (ASR) engine, may be used in an audio device to recognize spoken words, or phonemes within the words, in order to identify commands spoken by a user. Conventional automatic speech processing is sensitive to noise present in audio signals that include user speech. Various noise reduction or noise suppression pre-processing techniques may offer significant benefits to the operation of an automatic speech processing engine. For example, a modified frequency-domain representation of an audio signal may be used to compute speech-recognition features without having to perform any transformation to the time domain. In other examples, automatic speech processing techniques may be performed in the frequency domain and may include applying a real, positive gain mask to the frequency-domain representation of the audio signal before converting the signal back to a time-domain signal, which may then be fed to the automatic speech processing engine.
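As an illustration only, applying a real, positive gain mask to the frequency-domain representation of one audio frame, and then converting back to the time domain, might be sketched as follows. The function name `apply_gain_mask`, the use of a single unwindowed frame, and the choice of NumPy's real FFT are assumptions made for this sketch and are not taken from the description above.

```python
import numpy as np


def apply_gain_mask(frame, gain):
    """Scale each frequency bin of one audio frame by a real, positive gain.

    frame: 1-D array of time-domain samples (hypothetical single frame).
    gain:  1-D array with one real, non-negative gain per rfft bin,
           i.e. len(frame) // 2 + 1 values.
    """
    frame = np.asarray(frame, dtype=float)
    gain = np.asarray(gain, dtype=float)
    if gain.shape[0] != frame.shape[0] // 2 + 1:
        raise ValueError("gain must have one value per rfft bin")
    if np.any(gain < 0.0):
        raise ValueError("gain mask must be real and non-negative")

    # Frequency-domain representation of the frame.
    spectrum = np.fft.rfft(frame)
    # A real gain changes only the magnitude of each bin; phase is kept.
    masked = spectrum * gain
    # Convert back to a time-domain signal of the original length.
    return np.fft.irfft(masked, n=frame.shape[0])


# Example: attenuate every bin of a 512-sample test tone by half.
t = np.arange(512)
tone = np.sin(2.0 * np.pi * 8.0 * t / 512.0)
half = np.full(512 // 2 + 1, 0.5)
quieter = apply_gain_mask(tone, half)
```

A practical system would typically process overlapping, windowed frames and recombine them by overlap-add; that framing is omitted here to keep the sketch short.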
The gain mask may be computed to attenuate the audio signal such that background noise is decreased or eliminated to an extent, while the desired speech is preserved to an extent. Conventional noise suppression techniques may include dynamic noise power estimation to derive a local signal-to-noise ratio (SNR), which may then be used to derive the gain mask using either a formula (e.g., spectral subtraction, Wiener filter, and the like) or a data-driven approach (e.g., table lookup). The gain mask obtained in this manner may not be an optimal mask because the estimated SNR is often inaccurate; as a result, the reconstructed time-domain signal may be very different from the clean speech signal.
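A minimal sketch of the conventional formula-based approach is given below, assuming per-bin power estimates are already available. The function names, the power-subtraction SNR estimate, and the `floor` parameter (used here to keep the mask strictly positive) are illustrative assumptions, not details from the description. The Wiener gain is taken as SNR / (1 + SNR), and the power spectral subtraction gain as its square root.

```python
import numpy as np


def local_snr(noisy_power, noise_power, eps=1e-12):
    """Estimate per-bin SNR from noisy-signal power and estimated noise power.

    Assumption for this sketch: speech power is approximated by subtracting
    the noise estimate from the noisy power, clipped at zero.
    """
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    speech_power = np.maximum(noisy_power - noise_power, 0.0)
    return speech_power / (noise_power + eps)


def wiener_gain(snr, floor=0.05):
    """Wiener-filter gain mask, limited below by a small positive floor."""
    snr = np.asarray(snr, dtype=float)
    return np.maximum(snr / (1.0 + snr), floor)


def spectral_subtraction_gain(snr, floor=0.05):
    """Power spectral subtraction gain mask, limited below by a floor."""
    snr = np.asarray(snr, dtype=float)
    return np.maximum(np.sqrt(snr / (1.0 + snr)), floor)


# Example with three hypothetical frequency bins.
noisy_power = np.array([4.0, 1.0, 0.5])
noise_power = np.array([1.0, 1.0, 1.0])
snr = local_snr(noisy_power, noise_power)
mask_wiener = wiener_gain(snr)
mask_subtraction = spectral_subtraction_gain(snr)
```

Because both masks depend entirely on the noise power estimate, any error in that estimate carries directly into the gain, which is the weakness noted above.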