Traditionally, audio content in a multi-channel format (e.g., stereo, 5.1, 7.1, and the like) is created by mixing different audio signals in a studio, or is generated by recording acoustic signals simultaneously in a real environment. The mixed audio signal or content may include a number of different sources. Source separation is the task of identifying information about each of the sources in order to reconstruct the audio content, for example, by means of a mono signal and metadata including spatial information, spectral information, and the like.
When an auditory scene is recorded using one or more microphones, it is preferred that the sound-source-dependent information be separated so that it is suitable for use in a variety of subsequent audio processing tasks. Examples include spatial audio coding, remixing/re-authoring, 3D sound analysis and synthesis, and signal enhancement/noise suppression for a variety of purposes (e.g., automatic speech recognition). Therefore, successful source separation can provide improved versatility and better performance. When no prior information about the sources involved in the capturing process is available (for instance, the properties of the recording devices, the acoustic properties of the room, and the like), the separation process is called blind source separation (BSS).
Conventionally, statistical models such as the Gaussian Mixture Model (GMM) and Non-negative Matrix Factorization (NMF) have been widely applied to realize source separation. However, these algorithms (e.g., those based on the GMM or NMF model) only converge to a stationary point of the objective function, which is not necessarily a global optimum. Accordingly, these algorithms are sensitive to parameter initialization in the following respects: 1) the final result depends strongly on the parameter initialization; 2) the convergence speed varies significantly depending on the parameter initialization; and 3) the algorithms cannot identify the actual number of source signals, so they usually require prior information such as the number of sources, the spectral bases, and the like. In a conventional system, original source information is used for oracle initializations, which is not practical for most real-world applications because such information is usually not available. Moreover, in some applications, training data may be required. However, difficulties arise in practice because source models learned from training data tend to perform poorly in realistic cases, since there is generally a mismatch between the models and the actual properties of the sources in the mix.
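By way of illustration only, the sensitivity to parameter initialization noted above may be sketched with a small NMF example. The sketch assumes the widely used multiplicative update rules for a Euclidean (Frobenius) objective, which is only one of many NMF variants and is not asserted to be the algorithm of any particular conventional system. The function name `nmf_multiplicative`, its parameters, and the synthetic matrix `V` standing in for a magnitude spectrogram are hypothetical and chosen for this example.

```python
import numpy as np


def nmf_multiplicative(V, rank, n_iter=200, seed=0, eps=1e-12):
    """Approximate a non-negative matrix V (frequency x time) as W @ H.

    Hypothetical helper: multiplicative updates for the Frobenius
    objective, started from random non-negative factors drawn with `seed`.
    Returns the factors and the reconstruction error after each iteration.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, rank)) + eps   # spectral bases (random start)
    H = rng.random((rank, n_time)) + eps   # activations (random start)
    errors = []
    for _ in range(n_iter):
        # eps keeps the ratios finite and the factors strictly positive
        H *= (W.T @ V + eps) / (W.T @ W @ H + eps)
        W *= (V @ H.T + eps) / (W @ H @ H.T + eps)
        errors.append(float(np.linalg.norm(V - W @ H)))
    return W, H, errors


# Synthetic stand-in for a magnitude spectrogram built from 3 sources
# (an assumption of this sketch; real content gives no such guarantee).
data_rng = np.random.default_rng(42)
V = data_rng.random((20, 3)) @ data_rng.random((3, 50))

# The same data and the same rank, with only the random start changed.
for seed in (0, 1):
    W, H, errors = nmf_multiplicative(V, rank=3, seed=seed)
    print(f"seed={seed}: error after 1 iteration={errors[0]:.4f}, "
          f"after {len(errors)} iterations={errors[-1]:.4f}")
```

In this sketch, `rank` must be supplied by the caller, which mirrors the need for prior information such as the number of sources. Running the loop with different seeds generally yields different factors `W` and `H` and different error trajectories, since each run starts from a different point of the same non-convex objective.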
In view of the foregoing, there is a need in the art for a solution for separating sources from audio content without requiring any prior information.