For audio conference calls and for applications requiring automatic speech recognition (ASR), speech enhancement algorithms are generally employed to improve the quality of the service. While high background noise can reduce the intelligibility of the conversation in an audio call, interfering noise can drastically degrade the accuracy of automatic speech recognition.
Among the many approaches proposed to improve recognition, multichannel speech enhancement based on beamforming or demixing has been shown to be promising because of its inherent ability to adapt to environmental conditions and to suppress non-stationary noise signals. Nevertheless, the performance of multichannel processing is often limited by the number of observed mixtures and by reverberation, which reduces the separability between target speech and noise in the spatial domain.
Various single-channel methods based on supervised machine learning have also been proposed. For example, non-negative matrix factorization and neural networks have been shown to be the most promising approaches to data-dependent supervised single-channel speech enhancement. Whereas unsupervised spatial processing makes few assumptions regarding the spectral statistics of the speech and noise sources, supervised processing requires prior training on similar noise conditions in order to learn the latent invariant spectro-temporal factors composing the mixture in its time-frequency representation. The advantage of the former is that it does not require any specific knowledge of the source statistics and exploits only the spatial diversity of the mixture, which is intrinsically related to the position of each source in space. The supervised methods, by contrast, do not rely on the spatial distribution and are therefore able to separate speech in diffuse noise, where the spatial distribution of the noise highly overlaps that of the target speech.
One of the main limitations of data-based enhancement is the assumption that the machine learning system learns invariant factors from the training data that will also be observed at test time. However, spatial information is by definition not invariant, since it is related to the positions of the acoustic sources, which may vary over time.
The use of a deep neural network (DNN) for source enhancement has been proposed in the literature, for example in: Jonathan Le Roux, John R. Hershey, Felix Weninger, “Deep NMF for Speech Separation,” in Proc. ICASSP 2015 International Conference on Acoustics, Speech, and Signal Processing, April 2015; Huang, Po-Sen, et al., “Deep learning for monaural speech separation,” Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014; Weninger, Felix, et al., “Discriminatively trained recurrent neural networks for single channel speech separation,” Signal and Information Processing (GlobalSIP), 2014 IEEE Global Conference on. IEEE, 2014; and Liu, Ding, Paris Smaragdis, and Minje Kim, “Experiments on deep learning for speech denoising,” Proceedings of the annual conference of the International Speech Communication Association (INTERSPEECH), 2014.
However, this literature focuses on learning discriminative spectral structures to identify and extract speech from noise. The neural network training (for either DNNs or recurrent networks) is carried out by minimizing the error between the predicted and ideal (oracle) time-frequency masks or, alternatively, by minimizing the error between the reconstructed masked speech and the clean reference. The general assumption is that during training the DNN will encode information about the speech and noise that is invariant across datasets and can therefore be used to predict the correct gains at test time.
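The two training criteria described above can be illustrated with a minimal sketch. It is not taken from any of the cited works; it assumes magnitude spectrograms, a ratio-type oracle mask, a mean-squared-error criterion, and that speech and noise magnitudes simply add in the mixture. All function and variable names are hypothetical.

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """Oracle mask in [0, 1], computed from separately known speech and noise magnitudes."""
    return speech_mag / (speech_mag + noise_mag + eps)

def mask_approximation_loss(predicted_mask, oracle_mask):
    """First criterion: mean squared error between predicted and oracle masks."""
    return float(np.mean((predicted_mask - oracle_mask) ** 2))

def signal_approximation_loss(predicted_mask, mixture_mag, speech_mag):
    """Second criterion: mean squared error between masked mixture and clean speech."""
    return float(np.mean((predicted_mask * mixture_mag - speech_mag) ** 2))

# Toy magnitude spectrograms: 4 frequency bins x 6 frames (random placeholders, not real audio).
rng = np.random.default_rng(0)
speech_mag = rng.random((4, 6))
noise_mag = rng.random((4, 6))
mixture_mag = speech_mag + noise_mag  # simplifying assumption: magnitudes are additive

oracle = ideal_ratio_mask(speech_mag, noise_mag)

# Stand-in for a network output: the oracle mask perturbed by a small error.
predicted = np.clip(oracle + 0.1 * rng.standard_normal(oracle.shape), 0.0, 1.0)

loss_ma = mask_approximation_loss(predicted, oracle)
loss_sa = signal_approximation_loss(predicted, mixture_mag, speech_mag)
print(f"mask approximation loss:   {loss_ma:.6f}")
print(f"signal approximation loss: {loss_sa:.6f}")
```

Under these assumptions both losses vanish (up to the `eps` regularization) when the predicted mask equals the oracle mask, which is why either criterion can serve as the training target.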
Nevertheless, such “black-box” approaches have practical limitations in real-world applications. First, the ability of the network to discriminate speech from noise is intrinsically determined by the nature of the noise. If the noise is itself speech, its spectro-temporal representation will be highly correlated with that of the target speech, and the enhancement task is by definition ambiguous. Because the two classes are not separable in the feature domain, a general network cannot be trained to discriminate between them effectively, except by overfitting the training data, which has no practical use. Second, generalizing to unseen noise conditions requires a massive data collection and a very large network to encode all possible noise variations. Unfortunately, resource constraints can render such approaches impractical for real-world low-footprint and real-time systems.
Moreover, despite the various techniques proposed in the literature, large networks are more prone to overfitting the training data without learning useful invariant transformations. Also, for commercial applications, the actual target speech may depend on specific needs, which could be set on the fly by a configuration script. For example, a system might be configured to extract a single speaker who is located in a particular spatial region or who has a specific identity (e.g., determined by speaker identification), while cancelling any other type of noise, including other interfering speakers. In another modality, the system might be configured to extract all the speech and cancel only non-speech noise (e.g., for a multispeaker conference call scenario). Thus, different application modalities can contradict each other, and a single trained network cannot be used to accomplish both tasks.
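The conflict between the two modalities can be made concrete with a small sketch. It is illustrative only: it assumes magnitude spectrograms that add in the mixture and a ratio-type oracle mask, and the mode names `single_speaker` and `all_speech`, like every other identifier, are hypothetical.

```python
import numpy as np

def oracle_mask(target_mag, mixture_mag, eps=1e-8):
    """Ratio mask that would recover the chosen target from the mixture."""
    return np.clip(target_mag / (mixture_mag + eps), 0.0, 1.0)

# Toy magnitude spectrograms: 4 frequency bins x 6 frames (random placeholders).
rng = np.random.default_rng(1)
speaker_a = rng.random((4, 6)) + 0.1  # speaker the first modality should keep
speaker_b = rng.random((4, 6)) + 0.1  # interfering speaker
noise = rng.random((4, 6)) + 0.1      # non-speech noise
mixture = speaker_a + speaker_b + noise  # simplifying assumption: magnitudes are additive

# Each modality defines a different target for the SAME observed mixture.
targets = {
    "single_speaker": speaker_a,            # cancel noise and the other speaker
    "all_speech": speaker_a + speaker_b,    # cancel only the non-speech noise
}
masks = {mode: oracle_mask(target, mixture) for mode, target in targets.items()}

# Mean absolute difference between the two oracle masks.
gap = float(np.mean(np.abs(masks["all_speech"] - masks["single_speaker"])))
print(f"mean absolute mask difference between modalities: {gap:.4f}")
```

Because the input mixture is identical while the oracle masks differ, a network that maps the mixture alone to a mask cannot match both targets unless the desired modality is supplied to it in some other way.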