Audio source separation is concerned with decomposing an audio mixture into its constituent sound sources. It has a wide range of applications in audio/speech enhancement, post-production, 3D audio, etc. Blind Source Separation (BSS) assumes that the audio source separation is performed without information about the sources, the mixture, and/or the mixing process generating the mixture. Informed Source Separation (ISS), on the other hand, allows the audio source separation to be performed with guidance from some auxiliary information.
Most existing approaches for supervised audio source separation are example-based methods. A prerequisite for such approaches is to acquire beforehand some audio samples similar to the target audio sources, which is normally cumbersome and not always possible. When audio examples are not available beforehand, simple text queries can alternatively be used to search for audio files. This text-query-based approach for audio source separation is easier and more efficient for a user, since the user only needs to listen to the audio mixture and provide, for instance, words describing what they want to separate. However, while a text-query-based approach is described in [XII], so far there is no practical solution that is able to deal efficiently with noisy or non-representative retrieved examples.
For example-based audio source separation, single-channel source separation is an underdetermined problem and thus among the most challenging ones. Several algorithms take into account pre-learned spectral characteristics of individual sound sources in order to separate them from the audio mixture. To achieve this, preliminary training data must be acquired to learn the spectral characteristics of the individual target sources. A class of supervised algorithms has been proposed based on non-negative matrix factorization (NMF) [I, II, III] or its probabilistic formulation known as probabilistic latent component analysis (PLCA) [IV, V]. Nevertheless, when the training data are unavailable or not representative enough of the audio sources, the above methods become inapplicable without supplementary information about the sources. Such supplementary information includes, for example, “hummed” sounds that mimic the ones in the mixture [V], or text transcriptions of the corresponding audio mixture [VI].
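By way of illustration only, the supervised NMF principle described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any of the cited references: it assumes magnitude spectrograms stored as non-negative NumPy arrays (frequency bins by time frames), a Euclidean cost with multiplicative updates, and soft-mask reconstruction. The names `nmf`, `separate`, `EPS`, and the synthetic data are hypothetical and introduced only for this example.

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero (illustrative choice)


def nmf(V, rank, n_iter=200, W=None, seed=0):
    """Approximate V by W @ H using multiplicative updates (Euclidean cost).

    If W is supplied, it is treated as pre-learned and kept fixed, and only
    the activations H are estimated; `rank` is then ignored.
    """
    rng = np.random.default_rng(seed)
    fixed_W = W is not None
    if not fixed_W:
        W = rng.random((V.shape[0], rank)) + EPS
    H = rng.random((W.shape[1], V.shape[1])) + EPS
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + EPS)
        if not fixed_W:
            W *= (V @ H.T) / (W @ H @ H.T + EPS)
    return W, H


def separate(V_mix, bases):
    """Split a mixture spectrogram using one pre-learned basis matrix per source."""
    W = np.hstack(bases)                      # concatenate the source dictionaries
    _, H = nmf(V_mix, W.shape[1], W=W)        # estimate activations only
    total = W @ H + EPS
    estimates, start = [], 0
    for Wj in bases:
        k = Wj.shape[1]
        Vj = Wj @ H[start:start + k]          # model of source j
        estimates.append(V_mix * Vj / total)  # soft mask applied to the mixture
        start += k
    return estimates


# Synthetic stand-ins for training examples of two sources (assumed data).
rng = np.random.default_rng(1)
V_train_1 = rng.random((16, 2)) @ rng.random((2, 30))
V_train_2 = rng.random((16, 2)) @ rng.random((2, 30))

# Training step: learn the spectral characteristics of each source.
B1, _ = nmf(V_train_1, rank=2)
B2, _ = nmf(V_train_2, rank=2)

# Separation step: decompose the mixture with the learned bases held fixed.
V_mix = V_train_1 + V_train_2
est_1, est_2 = separate(V_mix, [B1, B2])
```

The sketch makes the dependence on training data explicit: if `B1` or `B2` were learned from examples that do not represent the actual sources, the fixed dictionary could not model the mixture well and the resulting masks would be unreliable.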
User-guided approaches based on NMF for audio source separation have been proposed recently [VII], whereby an overall audio source separation process might comprise several interactive separation steps. These approaches allow end-users to manually annotate information about the activity of each sound source. The annotated information is used, instead of the above-mentioned training data, to guide the source separation process. In addition, the user is able to review the separation result and correct the errors thereof by annotating the spectrogram displays of intermediate separation results during the separation process.
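One possible way of using such activity annotations, given purely as an illustrative sketch and not necessarily the scheme of [VII], is to force the NMF activations of a source to zero in the time frames where the user marked that source as inactive. Because multiplicative updates preserve zeros, the constraint holds throughout the iterations. The function name `annotated_nmf`, the boolean `activity` array, and the random spectrogram below are assumptions made for this example.

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero (illustrative choice)


def annotated_nmf(V, activity, k_per_source=2, n_iter=200, seed=0):
    """NMF guided by user annotations of source activity.

    V        : non-negative spectrogram, shape (n_bins, n_frames)
    activity : boolean array, shape (n_sources, n_frames); True where the
               user marked the source as active (hypothetical input format)
    """
    rng = np.random.default_rng(seed)
    n_sources, n_frames = activity.shape
    K = n_sources * k_per_source
    W = rng.random((V.shape[0], K)) + EPS
    # Each source owns k_per_source consecutive rows of H.
    mask = np.repeat(activity.astype(float), k_per_source, axis=0)
    H = (rng.random((K, n_frames)) + EPS) * mask  # zero where inactive
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + EPS)      # zeros in H stay zero
        W *= (V @ H.T) / (W @ H @ H.T + EPS)
    return W, H


# Assumed annotation: source 0 active in frames 0-24, source 1 in frames 15-39.
activity = np.zeros((2, 40), dtype=bool)
activity[0, :25] = True
activity[1, 15:] = True

V = np.random.default_rng(2).random((16, 40))     # stand-in spectrogram
W, H = annotated_nmf(V, activity)
```

In this sketch the frames where only one source is annotated as active play the role of the training data, since only that source's components can explain the spectrogram there.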
However, the above user-guided and interactive approaches require the user to have some minimum knowledge about audio source spectrograms and audio signal processing in order to manually specify characteristics of the audio sources and thus interact with the separation process. In other words, the optional interaction with, and intervention in, the audio source separation is neither easy nor practical for an end-user. In addition, the annotation process is time-consuming even for a professional operator.