In speech enhancement, the goal is to obtain “enhanced speech,” a processed version of the noisy speech that is closer, in a certain sense, to the underlying true “clean speech” or “target speech.”
Note that clean speech is assumed to be available only during training, not during real-world use of the system. For training, clean speech can be obtained with a close-talking microphone, while the noisy speech is recorded at the same time with a far-field microphone. Alternatively, given separate clean speech signals and noise signals, one can add the signals together to obtain noisy speech signals, and the resulting clean and noisy pairs can be used for training.
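The second way of building training pairs, adding a noise signal to a clean signal, can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function name `mix_at_snr`, the choice to control the mixture by a target signal-to-noise ratio, and the synthetic sine and white-noise stand-ins are all assumptions introduced here.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` to a target SNR relative to `clean`, then add the two.

    Hypothetical helper: assumes `noise` is at least as long as `clean`
    and that both are sampled at the same rate.
    """
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise power to clean_power / 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return clean + scaled_noise, scaled_noise

# Synthetic stand-ins (assumptions): a sine for speech, white noise for noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)
noise = rng.standard_normal(16000)

noisy, scaled_noise = mix_at_snr(clean, noise, snr_db=5.0)

# (noisy, clean) is one training pair; verify the SNR that was achieved.
measured_snr_db = 10.0 * np.log10(np.mean(clean ** 2) / np.mean(scaled_noise ** 2))
```

Varying `snr_db` and the noise recording across examples would yield a range of noisy conditions for the same clean signal.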
Speech enhancement and speech recognition can be considered different but related problems. A good speech enhancement system can certainly be used as an input module to a speech recognition system. Conversely, speech recognition might be used to improve speech enhancement, because recognition incorporates additional information. However, it is not clear how to jointly construct a multi-task recurrent neural network system for both the enhancement and recognition tasks.
In this document, we refer to speech enhancement as the problem of obtaining “enhanced speech” from “noisy speech.” The term speech separation, on the other hand, refers to separating “target speech” from background signals, where a background signal can be any non-speech audio signal or even non-target speech that is not of interest. Our use of the term speech enhancement also encompasses speech separation, since we consider the combination of all background signals as noise.
In speech separation and speech enhancement applications, processing is usually done in a short-time Fourier transform (STFT) domain. The STFT yields a complex-valued spectro-temporal (or time-frequency) representation of the signal. The STFT of the observed noisy signal can be written as the sum of the STFT of the target speech signal and the STFT of the noise signal. The STFTs of the signals are complex-valued, and the summation is in the complex domain. However, in conventional methods, the phase is ignored and it is assumed that the magnitude of the STFT of the observed signal equals the sum of the magnitudes of the STFTs of the target audio and the noise signals. This is a crude assumption, because it holds exactly only when the two components have the same phase. Hence, the focus in the prior art has been on predicting the magnitude of the “target speech” given a noisy speech signal as input. During reconstruction of the time-domain enhanced signal from its STFT, the phase of the noisy signal is used as the estimated phase of the enhanced speech's STFT. This is usually justified by stating that the minimum mean square error (MMSE) estimate of the enhanced speech's phase is the noisy signal's phase.
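The relationships described above can be checked numerically. The sketch below is illustrative only: the framed-FFT `stft` helper, its frame length and hop size, the synthetic signals, and the use of the true clean magnitude as a stand-in for a predicted magnitude are all assumptions introduced here. It shows that the STFTs add in the complex domain, that the magnitudes generally do not, and how a conventional estimate pairs a magnitude with the noisy phase.

```python
import numpy as np

def stft(x, frame_len=512, hop=128):
    """Minimal STFT (hypothetical helper): Hann-windowed frames, one-sided FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, frame_len // 2 + 1)

# Synthetic stand-ins (assumptions): a sine for target speech, white noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
s = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)
n = 0.3 * rng.standard_normal(16000)
y = s + n

S, N, Y = stft(s), stft(n), stft(y)

# The STFT is linear, so the complex spectra add exactly (up to rounding).
additive_in_complex_domain = np.allclose(Y, S + N)

# The magnitudes do not add: |S| + |N| - |Y| is non-negative (triangle
# inequality) and is zero only where S and N share the same phase.
magnitude_gap = np.abs(S) + np.abs(N) - np.abs(Y)

# Conventional reconstruction: combine a magnitude estimate with the noisy
# phase. The oracle magnitude |S| stands in for a predicted magnitude here.
estimated_magnitude = np.abs(S)
X_hat = estimated_magnitude * np.exp(1j * np.angle(Y))
```

An inverse STFT with overlap-add, not shown here, would then convert `X_hat` back to a time-domain enhanced signal.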