In conventional noise cancellation or conventional audio signal enhancement, the goal is to obtain an “enhanced audio signal” which is a processed version of a noisy audio signal that is closer in a certain sense to an underlying true “clean audio signal” or “target audio signal” of interest. In particular, in the case of speech processing, the goal of “speech enhancement” is to obtain “enhanced speech” which is a processed version of a noisy speech signal that is closer in a certain sense to the underlying true “clean speech” or “target speech”.
Note that clean speech is conventionally assumed to be available only during training, and not during real-world use of the system. For training, clean speech can be obtained with a close-talking microphone, while the noisy speech is recorded at the same time with a far-field microphone. Alternatively, given separate clean speech signals and noise signals, one can add the signals together to obtain noisy speech signals, and the resulting clean and noisy pairs can be used together for training.
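The second way of building training pairs, adding a noise signal to a clean signal, can be sketched as follows. This is only an illustrative sketch: the function name `mix_at_snr`, the choice to scale the noise to a target signal-to-noise ratio (SNR) in decibels, and the synthetic sine wave standing in for clean speech are all assumptions that do not come from the text above, which only states that the signals are added together.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Hypothetical helper: scale `noise` to the requested SNR, then add it to `clean`.

    Returns the noisy mixture and the scaled noise that was added.
    """
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(clean)]  # trim noise to the clean length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain such that clean_power / (gain**2 * noise_power) == 10**(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return clean + scaled_noise, scaled_noise

# Synthetic stand-ins (assumptions): a sine wave for "clean speech", white noise for "noise".
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)

noisy, scaled = mix_at_snr(clean, noise, snr_db=5.0)

# The (clean, noisy) pair is what a training procedure would consume.
measured_snr_db = 10 * np.log10(np.mean(clean ** 2) / np.mean(scaled ** 2))
```

Scaling the noise before adding it is one common way to cover a range of noise conditions during training; a plain sum without scaling would equally match the description above.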
In conventional speech enhancement applications, speech processing is usually performed on a set of features of the input signals, such as short-time Fourier transform (STFT) features. The STFT yields a complex-valued spectro-temporal (or time-frequency) representation of a signal, also referred to here as a spectrogram. Because the STFT is linear, the STFT of the observed noisy signal can be written as the sum of the STFT of the target speech signal and the STFT of the noise signal, where the summation is in the complex domain. However, conventional methods ignore the phase and focus on predicting the magnitude of the “target speech” given a noisy speech signal as input. During reconstruction of the time-domain enhanced signal from its STFT, the phase of the noisy signal is typically used as the estimated phase of the enhanced speech's STFT. Combining the noisy phase with an estimate of the magnitude of the target speech yields a complex spectrogram, namely the product of the estimated magnitude and the noisy phase. In general, the time-domain signal reconstructed from this complex spectrogram by inverse STFT has a magnitude spectrogram (the magnitude part of its STFT) that differs from the estimated magnitude of the target speech from which one intended to reconstruct the signal. In this case, the complex spectrogram consisting of the product of the estimated magnitude and the noisy phase is said to be inconsistent.
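The two properties described above, additivity of the STFT in the complex domain and the inconsistency that arises from pairing an estimated magnitude with the noisy phase, can be checked numerically. The following is a minimal sketch under stated assumptions: the helper functions `stft` and `istft`, the Hann window, the frame length and hop size, the synthetic sine wave standing in for target speech, and the use of the true clean magnitude as the "estimate" are all illustrative choices that do not come from the text.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Hypothetical helper: complex STFT with a periodic Hann window, shape (frames, bins)."""
    win = np.hanning(n_fft + 1)[:-1]
    starts = range(0, len(x) - n_fft + 1, hop)
    return np.array([np.fft.rfft(win * x[s:s + n_fft]) for s in starts])

def istft(spec, n_fft=256, hop=64):
    """Hypothetical helper: inverse STFT by windowed, normalized overlap-add."""
    win = np.hanning(n_fft + 1)[:-1]
    length = n_fft + hop * (spec.shape[0] - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(spec):
        s = i * hop
        out[s:s + n_fft] += win * np.fft.irfft(frame, n_fft)
        norm[s:s + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-12)  # guard samples where the window sum is zero

# Synthetic stand-ins (assumptions): a sine wave as target speech, white noise as noise.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
clean = np.sin(2 * np.pi * 440 * t)
noise = rng.standard_normal(t.size)
noisy = clean + noise

S, N, Y = stft(clean), stft(noise), stft(noisy)

# 1) Additivity: the noisy STFT equals the complex sum of target and noise STFTs.
additivity_error = np.max(np.abs(Y - (S + N)))

# 2) Estimated magnitude (here the true clean magnitude) combined with the noisy phase.
est_mag = np.abs(S)
mixed = est_mag * np.exp(1j * np.angle(Y))
resynth = istft(mixed)
resynth_mag = np.abs(stft(resynth))
inconsistency = np.linalg.norm(resynth_mag - est_mag) / np.linalg.norm(est_mag)

# Control: a spectrogram that is the STFT of a real signal survives the round trip.
control = np.linalg.norm(np.abs(stft(istft(S))) - np.abs(S)) / np.linalg.norm(np.abs(S))

print(f"additivity error:            {additivity_error:.2e}")
print(f"relative magnitude mismatch: {inconsistency:.2e}")
print(f"control (consistent input):  {control:.2e}")
```

In this sketch the control value stays at the level of floating-point rounding, while the magnitude mismatch for the spectrogram built from the noisy phase is clearly non-zero, even though the magnitude estimate itself is exact. The mismatch therefore comes from the phase alone, which is the inconsistency described above.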
Accordingly, there is a need for improved speech processing methods that overcome the limitations of conventional speech enhancement applications.