In some conventional speech separation and speech enhancement applications, processing is performed in a time-frequency representation such as the short-time Fourier transform (STFT) domain. The STFT yields a complex-valued spectro-temporal (or time-frequency) representation of a signal. Because the STFT is a linear transform, the STFT of the observed noisy signal can be written as the sum of the STFT of the target speech signal and the STFT of the noise signal. These STFTs are complex-valued, and the summation is carried out in the complex domain.
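As an illustration of the additive relationship described above, the following is a minimal sketch, not a production implementation. It assumes a naive Hann-windowed STFT computed with a direct DFT; the function name `stft`, the frame length, the hop size, and the toy signals are all hypothetical choices made for this example.

```python
import cmath
import math

def stft(x, frame_len=8, hop=4):
    """Naive STFT: Hann-windowed frames, each transformed with a direct DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(frame_len)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len) for n in range(frame_len))
            for k in range(frame_len)
        ])
    return frames

# Toy signals (illustrative values only, not real speech or noise).
speech = [math.sin(2 * math.pi * 0.1 * t) for t in range(32)]
noise = [0.3 * math.cos(2 * math.pi * 0.37 * t + 1.0) for t in range(32)]
noisy = [s + n for s, n in zip(speech, noise)]

S, N, Y = stft(speech), stft(noise), stft(noisy)

# Largest deviation between STFT(speech + noise) and STFT(speech) + STFT(noise),
# measured over every complex time-frequency bin.
max_err = max(
    abs(Y[t][k] - (S[t][k] + N[t][k]))
    for t in range(len(Y))
    for k in range(len(Y[0]))
)
print(max_err)  # expected to be at floating-point rounding level
```

The comparison is made on complex values, which is the sense in which the summation holds.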
However, most of these conventional speech separation and speech enhancement applications perform separation only on the magnitude in the time-frequency (T-F) domain and directly use the mixture phase for time-domain re-synthesis, largely because the phase appears highly random and is difficult to enhance. It is well known that this approach incurs a phase inconsistency problem, especially in speech processing, where there is typically at least 50% overlap between consecutive frames. This overlap makes the STFT representation of a speech signal highly redundant. As a result, the enhanced STFT representation obtained by combining the estimated magnitude with the mixture phase is generally not a consistent STFT, meaning that there is no guarantee that any time-domain signal has that STFT representation.
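The inconsistency described above can be checked numerically. The sketch below is a hedged illustration under stated assumptions: a Hann window with 50% overlap, a least-squares overlap-add inverse, and a hypothetical frame-dependent gain standing in for an estimated magnitude. The helper names `stft`, `istft`, and `max_dist` are introduced only for this example. A representation is treated as consistent when applying the inverse STFT followed by the STFT returns it unchanged.

```python
import cmath
import math

FRAME_LEN, HOP = 8, 4  # 50% overlap between consecutive frames
WINDOW = [0.5 - 0.5 * math.cos(2 * math.pi * n / FRAME_LEN) for n in range(FRAME_LEN)]

def stft(x):
    """Hann-windowed frames, each transformed with a direct DFT."""
    frames = []
    for start in range(0, len(x) - FRAME_LEN + 1, HOP):
        seg = [x[start + n] * WINDOW[n] for n in range(FRAME_LEN)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / FRAME_LEN) for n in range(FRAME_LEN))
            for k in range(FRAME_LEN)
        ])
    return frames

def istft(frames, length):
    """Least-squares inverse: windowed overlap-add divided by the summed squared window."""
    out = [0.0] * length
    norm = [0.0] * length
    for t, spec in enumerate(frames):
        for n in range(FRAME_LEN):
            # Inverse DFT of one frame; the real part is kept because the signal is real.
            sample = sum(
                spec[k] * cmath.exp(2j * math.pi * k * n / FRAME_LEN) for k in range(FRAME_LEN)
            ).real / FRAME_LEN
            out[t * HOP + n] += WINDOW[n] * sample
            norm[t * HOP + n] += WINDOW[n] ** 2
    # Samples where the window sum is zero cannot be recovered and are set to 0.
    return [o / w if w > 1e-12 else 0.0 for o, w in zip(out, norm)]

def max_dist(a_frames, b_frames):
    """Largest absolute difference between two STFTs over all bins."""
    return max(abs(a - b) for ra, rb in zip(a_frames, b_frames) for a, b in zip(ra, rb))

# Toy mixture (illustrative values only).
mixture = [math.sin(2 * math.pi * 0.1 * t) + 0.3 * math.cos(2 * math.pi * 0.37 * t)
           for t in range(32)]
Y = stft(mixture)

# Hypothetical magnitude estimate: scale |Y| by a frame-dependent gain, keep the mixture phase.
gains = [1.0 if t % 2 == 0 else 0.2 for t in range(len(Y))]
X_hat = [[cmath.rect(g * abs(v), cmath.phase(v)) for v in row] for g, row in zip(gains, Y)]

# The STFT of an actual signal survives the round trip; the modified one does not.
consistent_err = max_dist(stft(istft(Y, len(mixture))), Y)
inconsistent_err = max_dist(stft(istft(X_hat, len(mixture))), X_hat)
print(consistent_err, inconsistent_err)
```

Because overlapping frames share samples, changing the magnitude of one frame independently of its neighbors produces frames that disagree on the shared samples, so the round trip cannot reproduce the modified representation.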
In other words, these conventional methods ignore the phase and assume that the magnitude of the STFT of the observed signal equals the sum of the magnitudes of the STFTs of the target audio and the noise signals, which is a crude approximation. Hence, the focus in conventional speech separation and speech enhancement applications has been on magnitude prediction of the “target speech” given a noisy speech signal as input, or on magnitude prediction of the “target sources” given a mixture of audio sources as input. During reconstruction of the time-domain enhanced signal from its STFT, these conventional applications use the phase of the noisy signal as the estimated phase of the enhanced speech's STFT.
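A single time-frequency bin is enough to see why the magnitude-sum assumption is crude and why reusing the noisy phase limits the result. The sketch below uses illustrative values chosen for this example, not measurements; the variable names are hypothetical.

```python
import cmath

# One time-frequency bin with illustrative values (not taken from any real recording).
target = cmath.rect(1.0, 0.0)   # target speech: magnitude 1, phase 0 rad
noise = cmath.rect(1.0, 2.0)    # noise: magnitude 1, phase 2 rad
mixture = target + noise        # the STFT is additive in the complex domain

# Magnitudes are not additive: |target + noise| <= |target| + |noise|,
# with equality only when the two phases coincide.
magnitude_gap = (abs(target) + abs(noise)) - abs(mixture)

# Conventional re-synthesis: pair a magnitude estimate with the mixture phase.
# Even an oracle magnitude (the true |target|) leaves an error caused by the phase.
estimate = cmath.rect(abs(target), cmath.phase(mixture))
phase_error = abs(cmath.phase(estimate) - cmath.phase(target))
residual = abs(estimate - target)

print(magnitude_gap, phase_error, residual)
```

In this example the estimate has exactly the right magnitude, yet it still differs from the target because its phase is that of the mixture.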
Accordingly, there is a need to improve speech separation and speech enhancement applications using an end-to-end approach for single-channel speaker-independent multi-speaker speech separation.