In the field of digital signal processing, audio source separation is the problem of recovering the original component signals from a combined signal into which several audio signals have been mixed. Audio source separation has many practical applications including, but not limited to, speech enhancement, speech recognition, speech denoising, voice recognition, audio post-production and remastering, spatial audio upmixing, and other audio functions. Denoising includes separating noise from speech, removing background music from speech, and removing bleed from other instruments. In the context of music production, audio source separation is sometimes referred to as “unmixing” or “de-mixing”.
Traditional audio equalizers have historically been used to emphasize or deemphasize certain content present in an audio signal. They work by boosting or attenuating certain frequency bands present in the audio signal. Their main limitation, however, is that they cannot discriminate between or separate two signals that have frequencies in common. Thus, what is needed is the ability to separate mixed signals that have frequencies in common.
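The equalizer limitation can be illustrated with a minimal sketch, assuming an idealized equalizer modeled as a gain applied to one band of the FFT spectrum. The function name `band_gain_eq`, the 440 Hz test tones, and the sample rate are hypothetical choices made only for this illustration. Because the gain depends only on frequency, two sources that occupy the same band are scaled identically, so their relative balance is unchanged.

```python
import numpy as np

def band_gain_eq(signal, sample_rate, low_hz, high_hz, gain):
    """Idealized equalizer: scale every FFT bin between low_hz and high_hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    in_band = (freqs >= low_hz) & (freqs <= high_hz)
    spectrum[in_band] *= gain
    return np.fft.irfft(spectrum, n=len(signal))

# Two hypothetical sources that share the same 440 Hz component.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
voice = np.sin(2 * np.pi * 440 * t)
guitar = 0.5 * np.sin(2 * np.pi * 440 * t + 1.0)
mix = voice + guitar

# Attenuate the shared band in the mix and in each source on its own.
eq_mix = band_gain_eq(mix, sample_rate, 400, 480, 0.1)
eq_voice = band_gain_eq(voice, sample_rate, 400, 480, 0.1)
eq_guitar = band_gain_eq(guitar, sample_rate, 400, 480, 0.1)

# The voice-to-guitar level ratio is the same before and after equalization.
ratio_before = np.std(voice) / np.std(guitar)
ratio_after = np.std(eq_voice) / np.std(eq_guitar)
```

In this sketch the equalized mix is simply a quieter copy of the original mix: neither source has been isolated from the other.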
Several approaches have been proposed over the years to solve the audio source separation problem; however, these approaches are inadequate.
Beamforming is a technique for source separation that uses a microphone array to listen in a particular direction, capturing desired signals while minimizing interfering ones. A shortcoming of the existing beamforming techniques is that they require multiple versions of the same recording to be captured with multiple microphones.
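As a rough illustration of the beamforming idea, the following sketch implements a basic delay-and-sum beamformer, which is only one of many beamforming variants. The four-microphone geometry, the integer sample delays, and all names are assumptions made for this example. Note that the function cannot run at all without one channel per microphone, which reflects the shortcoming described above.

```python
import numpy as np

def delay_and_sum(mic_signals, steering_delays):
    """Advance each channel by its steering delay (in samples) and average.

    np.roll wraps around at the edges, which is acceptable for this
    circular toy simulation but not for real recordings.
    """
    aligned = [np.roll(channel, -delay)
               for channel, delay in zip(mic_signals, steering_delays)]
    return np.mean(aligned, axis=0)

rng = np.random.default_rng(0)
num_samples = 4000
sample_index = np.arange(num_samples)
target = np.sin(2 * np.pi * 0.01 * sample_index)   # desired source
interferer = rng.standard_normal(num_samples)       # source from another direction

# Hypothetical arrival delays (in samples) at each of four microphones.
target_delays = [0, 2, 4, 6]
interferer_delays = [0, 5, 10, 15]
mic_signals = [np.roll(target, dt) + np.roll(interferer, di)
               for dt, di in zip(target_delays, interferer_delays)]

# Steer toward the target: its copies add coherently, the interferer's do not.
beamformed = delay_and_sum(mic_signals, target_delays)
single_mic_error = np.mean((mic_signals[0] - target) ** 2)
beamformed_error = np.mean((beamformed - target) ** 2)
```

With these assumed delays, the interferer is averaged over four misaligned copies, so its power in the beamformed output is well below its power at any single microphone.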
Adaptive signal processing is a technique for source separation that filters unwanted parts of a signal by self-adjusting parameters of the filter. A shortcoming of the adaptive signal processing technique is that it requires prior knowledge about the statistics of the interfering signal. For example, in order to perform adaptive signal processing on a combined signal, it would be necessary to have prior knowledge regarding the types of expected noise in the combined signal.
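A minimal sketch of this idea is a least-mean-squares (LMS) noise canceller, shown here as one common adaptive filter rather than as the only form the technique can take. In this example the required side information about the interference takes the form of a separate noise reference channel; that channel, the two-tap noise path, the step size, and all names are assumptions made for illustration.

```python
import numpy as np

def lms_noise_canceller(primary, noise_reference, num_taps=4, step_size=0.005):
    """Adapt FIR weights so the filtered reference cancels noise in primary."""
    weights = np.zeros(num_taps)
    cleaned = np.zeros(len(primary))
    for n in range(num_taps - 1, len(primary)):
        # Most recent reference samples, newest first.
        window = noise_reference[n - num_taps + 1:n + 1][::-1]
        noise_estimate = weights @ window
        cleaned[n] = primary[n] - noise_estimate
        # LMS update: move the weights along the error-times-input direction.
        weights += 2.0 * step_size * cleaned[n] * window
    return cleaned, weights

rng = np.random.default_rng(0)
num_samples = 20000
sample_index = np.arange(num_samples)
speech = np.sin(2 * np.pi * 0.01 * sample_index)    # stand-in for the wanted signal
noise_reference = rng.standard_normal(num_samples)  # assumed to be available
# Hypothetical noise path: the reference reaches the primary input filtered.
noise = 0.8 * noise_reference + 0.3 * np.roll(noise_reference, 1)
primary = speech + noise

cleaned, weights = lms_noise_canceller(primary, noise_reference)

# Compare residual noise power after the filter has had time to converge.
tail = slice(num_samples // 2, None)
noise_power_before = np.mean((primary[tail] - speech[tail]) ** 2)
noise_power_after = np.mean((cleaned[tail] - speech[tail]) ** 2)
```

The filter only works here because `noise_reference` is correlated with the interference and uncorrelated with the wanted signal. Without such knowledge of the interferer there is nothing for the weights to adapt against.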
Independent component analysis (ICA), a widely used form of blind source separation (BSS), is a technique for source separation that uses a measure of statistical independence of the component signals in the combined signal to identify and separate the sources. This technique does not require prior knowledge about the sound sources, except for their mutual statistical independence. A shortcoming of this approach is that it assumes, among other things, that the sources have been linearly mixed into a combined signal. However, in the context of professionally produced music, processes like mastering apply several nonlinearities to a combined signal. Another shortcoming is that independent component analysis cannot separate or discriminate arbitrary sources (e.g., different types of musical instruments, such as drums, bass, piano) without prior knowledge about the sources or their semantics. Another shortcoming in the context of professionally produced music is that, in most cases, the sources are not fully statistically independent. Overall, in practice, this approach performs very poorly in the context of professional music recordings.
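The independence-based approach can be sketched for the simplest case of two sources and two linear mixtures. The example below whitens the mixtures and then searches for the rotation that maximizes the total absolute excess kurtosis, a simple non-Gaussianity measure; this grid search is a simplified stand-in chosen for clarity, not a production algorithm such as FastICA. The mixing matrix, the source distributions, and all names are assumptions for illustration, and the sketch relies on exactly the conditions questioned above: linear mixing and statistically independent sources.

```python
import numpy as np

def whiten(mixtures):
    """Center the mixtures and decorrelate them to unit variance."""
    centered = mixtures - mixtures.mean(axis=1, keepdims=True)
    covariance = centered @ centered.T / centered.shape[1]
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    return (eigenvectors / np.sqrt(eigenvalues)).T @ centered

def excess_kurtosis(signal):
    return np.mean(signal ** 4) / np.mean(signal ** 2) ** 2 - 3.0

def separate_two_sources(mixtures, num_angles=360):
    """Pick the rotation of the whitened data with the least Gaussian outputs."""
    whitened = whiten(mixtures)
    best_score, best_estimate = -np.inf, whitened
    for theta in np.linspace(0.0, np.pi / 2, num_angles, endpoint=False):
        rotation = np.array([[np.cos(theta), np.sin(theta)],
                             [-np.sin(theta), np.cos(theta)]])
        estimate = rotation @ whitened
        score = sum(abs(excess_kurtosis(row)) for row in estimate)
        if score > best_score:
            best_score, best_estimate = score, estimate
    return best_estimate

rng = np.random.default_rng(0)
num_samples = 20000
# Two independent, non-Gaussian sources (hypothetical stand-ins for audio).
sources = np.vstack([rng.uniform(-1.0, 1.0, num_samples),
                     rng.laplace(size=num_samples)])
mixing_matrix = np.array([[1.0, 0.6],
                          [0.4, 1.0]])   # assumed linear, instantaneous mixing
mixtures = mixing_matrix @ sources

estimates = separate_two_sources(mixtures)
# Recovery is only defined up to order, sign, and scale, so compare by
# absolute correlation between each true source and each estimate.
correlation = np.abs(np.corrcoef(np.vstack([sources, estimates]))[:2, 2:])
```

The output carries no labels: nothing in the procedure indicates which estimate corresponds to which kind of source, consistent with the limitation regarding source semantics noted above.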
Classical denoising and enhancement is a technique for source separation that uses Wiener filtering and spectral subtraction to separate the audio sources. A shortcoming of this approach is that it assumes prior knowledge of the spectral properties of the original signal and the noise.
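The dependence on prior spectral knowledge is visible in even a minimal sketch. The example below applies a Wiener-style gain to the whole-signal spectrum, estimating the clean power by power spectral subtraction. It assumes stationary white noise whose variance is known in advance; that assumption, the single-frame processing, and all names are simplifications made for illustration.

```python
import numpy as np

def wiener_denoise(noisy, noise_power_per_bin, gain_floor=0.0):
    """Apply a Wiener-style gain computed from a known noise power spectrum."""
    spectrum = np.fft.rfft(noisy)
    noisy_power = np.abs(spectrum) ** 2
    # Spectral-subtraction estimate of the clean power, clipped at zero.
    clean_power = np.maximum(noisy_power - noise_power_per_bin, 0.0)
    gain = np.maximum(clean_power / np.maximum(noisy_power, 1e-12), gain_floor)
    return np.fft.irfft(gain * spectrum, n=len(noisy))

rng = np.random.default_rng(0)
num_samples = 4096
noise_std = 0.5                                       # assumed known in advance
sample_index = np.arange(num_samples)
clean = np.sin(2 * np.pi * 64 * sample_index / num_samples)
noisy = clean + noise_std * rng.standard_normal(num_samples)

# White noise of variance s^2 has expected power N * s^2 in every bin of an
# unnormalized length-N FFT.
noise_power_per_bin = num_samples * noise_std ** 2
denoised = wiener_denoise(noisy, noise_power_per_bin)

error_before = np.mean((noisy - clean) ** 2)
error_after = np.mean((denoised - clean) ** 2)
```

The quality of the result depends directly on `noise_power_per_bin`: if the supplied noise spectrum does not match the actual interference, the gain either leaves noise in place or removes parts of the wanted signal.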
Non-Negative Matrix Factorization (NMF) is a popular technique for processing audio, images, and text. This technique has been widely used by itself or in combination with other techniques for audio source separation. A shortcoming of existing NMF techniques in the context of source separation is that they rely on a fixed set of spectral templates describing the spectral characteristics of the underlying sources in a given combined signal. Because commercial music has so much variety in terms of sounds, the effectiveness of this approach is highly dependent on the degree to which these default spectral templates represent the spectral characteristics of the underlying sources in a combined signal. In summary, the existing approach to source separation using NMF requires the arbitrary selection of a number of spectral templates, and it does not use information from the input combined signal to adapt these spectral templates to better match the underlying sources, which is the key to accurate discrimination and reconstruction of the sources. What is needed are improved techniques for audio source separation that dynamically adapt to the combined signal with real-time performance.
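The fixed-template limitation described above can be sketched as follows. The example holds a template matrix W fixed and estimates only the nonnegative activations H with the standard multiplicative update for the Euclidean cost, then splits the input with soft masks. The tiny four-bin spectrogram, both template sets, and all names are hypothetical values chosen for illustration; the sketch shows the existing fixed-template approach, not the improved technique called for above.

```python
import numpy as np

def fit_activations(spectrogram, templates, num_iterations=200, eps=1e-12):
    """Estimate nonnegative activations H for fixed templates W (V ~ W @ H)."""
    activations = np.ones((templates.shape[1], spectrogram.shape[1]))
    for _ in range(num_iterations):
        numerator = templates.T @ spectrogram
        denominator = templates.T @ templates @ activations + eps
        activations *= numerator / denominator   # multiplicative update
    return activations

def soft_mask_sources(spectrogram, templates, activations, eps=1e-12):
    """Split the spectrogram in proportion to each template's modeled share."""
    model = templates @ activations + eps
    return [spectrogram * np.outer(templates[:, k], activations[k]) / model
            for k in range(templates.shape[1])]

def relative_error(spectrogram, templates, activations):
    residual = spectrogram - templates @ activations
    return np.linalg.norm(residual) / np.linalg.norm(spectrogram)

rng = np.random.default_rng(0)
# Hypothetical spectra (4 frequency bins) of the two sources actually present.
true_templates = np.array([[1.0, 0.0],
                           [0.5, 0.2],
                           [0.1, 1.0],
                           [0.0, 0.6]])
true_activations = rng.uniform(0.2, 1.0, size=(2, 50))
spectrogram = true_templates @ true_activations   # magnitude spectrogram V

# Default templates that do not describe the sources in this recording.
mismatched_templates = np.array([[0.2, 0.0],
                                 [1.0, 0.0],
                                 [0.0, 0.3],
                                 [0.0, 1.0]])

matched_h = fit_activations(spectrogram, true_templates)
mismatched_h = fit_activations(spectrogram, mismatched_templates)
matched_error = relative_error(spectrogram, true_templates, matched_h)
mismatched_error = relative_error(spectrogram, mismatched_templates, mismatched_h)
matched_sources = soft_mask_sources(spectrogram, true_templates, matched_h)
```

When the templates happen to match the sources, the model fits the input closely; with the mismatched set, the fit degrades because the templates are never updated from the input signal.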
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
While each of the figures illustrates a particular embodiment for purposes of illustrating a clear example, other embodiments may omit, add to, reorder, and/or modify any of the elements shown in the figures.