Most musical signals, for example as might be found in a recording, comprise a plurality of individual sound sources including both instrumental and vocal sources. These sources are typically combined into a two-channel stereo recording with a Left and a Right Signal.
There are several applications where it would be advantageous if the original sound sources could be individually extracted from the Left and Right Signals. Traditionally, one area where a form of sound source separation has been used is in the field of karaoke entertainment. In karaoke, a singer performs live in front of an audience with background music. One of the challenges of this activity is obtaining the background music, i.e. removing the original singer's voice and retaining only the instruments, so that the amateur singer's voice can replace that of the original singer and be superimposed on the backing track. One way in which this can be achieved uses a stereo recording and the assumption (usually true) that the voice is panned in the centre (i.e. that the voice was recorded in mono and added to the Left and Right channels with equal level). In such cases, the voice content may be significantly reduced by subtracting the Left channel from the Right channel, resulting in a mono recording from which the voice is nearly absent. It will be appreciated that the voice signal is not completely removed: because stereo reverberation is usually added after the mix, a faint reverberated version of the voice remains in the difference signal. There are, however, several drawbacks to this technique, including that the output signal is always monophonic. It also does not facilitate the separation of individual instruments from the original recording.
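The centre-cancellation approach described above can be illustrated with a minimal sketch. The function name remove_centre, the test tones and the gain values are hypothetical and chosen only for illustration; the sketch assumes an idealised mix with no stereo reverberation, so the centre-panned voice cancels exactly.

```python
import numpy as np

def remove_centre(left, right):
    """Subtract the Left channel from the Right channel.

    Content added to both channels with equal level (centre-panned)
    cancels; the result is always a single (mono) signal.
    """
    return np.asarray(right, dtype=float) - np.asarray(left, dtype=float)

# Hypothetical test signals: a centre-panned "voice" and an
# off-centre "guitar" (gains 0.9 Left, 0.2 Right are assumptions).
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
voice = np.sin(2 * np.pi * 220 * t)
guitar = np.sin(2 * np.pi * 330 * t)

left = voice + 0.9 * guitar
right = voice + 0.2 * guitar

difference = remove_centre(left, right)
# The voice is gone; what remains is the guitar scaled by (0.2 - 0.9).
```

The sketch also shows both drawbacks noted above: the output is a single mono signal, and every source that is not centre-panned remains in it, scaled only by the difference of its channel gains, so individual instruments are not separated.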
U.S. Pat. No. 6,405,163 describes a process for removing centrally panned voice in stereo recordings. The described process utilizes frequency domain techniques to calculate a frequency-dependent gain factor based on the difference between the frequency-domain spectra of the stereo channels. The described process also provides for the limited separation of a centrally panned voice component from other centrally panned sources, e.g. drums, using typical frequency characteristics of voice. A drawback of the system is that it is limited to the extraction of centrally panned voice in a stereo recording.
Another known technique is that of DUET (Degenerate Unmixing and Estimation Technique) described inter alia in A. Jourjine, S. Rickard and O. Yilmaz, “Blind Separation of Disjoint Orthogonal Signals: Demixing N Sources from 2 Mixtures”, Proc. ICASSP 2000, Istanbul, Turkey; A. Jourjine, S. Rickard and O. Yilmaz, “Blind Separation of Disjoint Orthogonal Sources”, Technical Report SCR-98-TR-657, Siemens Corporate Research, 755 College Road East, Princeton, N.J., September 1999; and S. Rickard, R. Balan and J. Rosca, “Real-Time Time-Frequency Based Blind Separation”, presented at the ICA2001 Conference, 2001, San Diego, Calif. DUET is an algorithm capable of separating, from two mixtures, N sources which meet the condition known as “W-Disjoint Orthogonality” (further information about which can be found in S. Rickard and O. Yilmaz, “On the Approximate W-Disjoint Orthogonality of Speech”, IEEE International Conference on Acoustics, Speech and Signal Processing, Florida, USA, May 2002, vol. 3, pp. 3049-3052). This condition effectively means that the sources do not significantly overlap in the time and frequency domain. Speech generally approximates this condition and so DUET is suitable for the separation of one person's speech from multiple simultaneous speakers. Musical signals, however, do not adhere to the W-Disjoint Orthogonality condition. As such, DUET is not suitable for the separation of musical instruments.
The present invention is directed at conventional studio based stereo recordings. Studio based stereo recordings account for the majority of popular music recordings. Studio recordings are (usually) made by first recording N sources to N independent audio tracks; the independent audio tracks are then electrically summed and distributed across two channels using a mixing console. Image localisation, referring to the apparent location of a particular instrument/vocalist in the stereo field, is achieved by using a panoramic potentiometer (pan pot). This device allows a single sound source to be divided into two channels with continuously variable intensity ratios. By using this technique, a single source may be virtually positioned at any point between the speakers. The localisation is achieved by creating an Interaural Intensity Difference (IID), which is a well known phenomenon. The pan pot was devised to simulate IIDs by attenuating the source signal fed to one reproduction channel, causing it to be localised more in the opposite channel. This means that for any single source in such a recording, the phase of the source is coherent between the Left and Right channels, and only its intensity differs.
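The mixing model described above, in which each source is fed to both channels with its own pair of gains, can be sketched as follows. The constant-power pan law used in pan_gains is an assumption made for illustration (real mixing consoles use a variety of pan laws), and the function names and test signals are hypothetical.

```python
import numpy as np

def pan_gains(position):
    """Return (left_gain, right_gain) for a pan position in [0, 1].

    0.0 is hard Left, 0.5 is centre, 1.0 is hard Right.
    A constant-power (sine/cosine) law is assumed here.
    """
    angle = position * np.pi / 2
    return np.cos(angle), np.sin(angle)

def mix(sources, positions):
    """Sum N mono sources into Left and Right channels.

    Each source is only scaled, never delayed, so it is phase
    coherent between the two channels and differs only in intensity.
    """
    left = np.zeros(len(sources[0]))
    right = np.zeros(len(sources[0]))
    for source, position in zip(sources, positions):
        left_gain, right_gain = pan_gains(position)
        left += left_gain * source
        right += right_gain * source
    return left, right

# Hypothetical example: two tones panned to different positions.
t = np.arange(8000) / 8000
bass = np.sin(2 * np.pi * 110 * t)
flute = np.sin(2 * np.pi * 880 * t)
left, right = mix([bass, flute], [0.25, 0.8])

# A single source on its own yields channels that are scaled copies.
solo_left, solo_right = mix([bass], [0.25])
```

Because each source contributes the same waveform to both channels, scaled by a fixed gain pair, the ratio of its Left and Right intensities is constant over time and frequency. It is this property of studio recordings that intensity-based analysis methods rely upon.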
C. Avendano, “Frequency-Domain Source Identification and Manipulation in Stereo Mixes for Enhancement, Suppression and Re-Panning Applications” IEEE WASPAA'03 describes a method which is directed at studio based recordings. The method uses a similarity measure between the Short-time Fourier Transforms of the Left and Right input signals to identify time-frequency regions occupied by each source based on the panning coefficient assigned to it during the mix. Time-frequency components are then clustered based on a given panning coefficient, and re-synthesised.
The Avendano method assumes that the mixing model is linear, which is the case for “studio” or “artificial” recordings which, as discussed above, account for a large percentage of commercial recordings since the advent of multi-track recording. The method attempts to identify a source based on its lateral placement within the stereo mix. The method describes a cross channel metric referred to as the “panning index” which is a measure of the lateral displacement of a source in the recording. The problem with the panning index is that it returns only positive values, which leads to “lateral ambiguity”, meaning that the lateral direction of the source is unknown, i.e. a source panned 60 degrees Left will give the same similarity measure as a source panned 60 degrees Right. To address this shortcoming, the Avendano paper proposes the use of a partial similarity measure and a difference function.
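The lateral ambiguity described above can be illustrated with a short sketch. The formulas below are one commonly quoted form of a normalised cross-channel similarity and of a difference of partial similarities; they are given here as assumptions for illustration and the exact definitions and sign conventions should be taken from the Avendano paper itself. The function names, bin values and gains are hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero in silent bins

def similarity(left_bins, right_bins):
    """Symmetric similarity between Left and Right STFT bins (assumed form)."""
    cross = np.abs(left_bins * np.conj(right_bins))
    energy = np.abs(left_bins) ** 2 + np.abs(right_bins) ** 2
    return 2 * cross / (energy + EPS)

def partial_difference(left_bins, right_bins):
    """Difference of per-channel partial similarities (assumed form).

    Its sign changes when the channels are swapped, which is what
    allows the two sides to be told apart.
    """
    cross = np.abs(left_bins * np.conj(right_bins))
    partial_left = cross / (np.abs(left_bins) ** 2 + EPS)
    partial_right = cross / (np.abs(right_bins) ** 2 + EPS)
    return partial_left - partial_right

# Hypothetical STFT bins of one source, panned with gains 0.9 and 0.3.
source_bins = np.array([1 + 2j, 0.5 - 1j, -3 + 0.2j])
strong, weak = 0.9, 0.3

# Same source panned towards the Left, then mirrored towards the Right.
psi_left = similarity(strong * source_bins, weak * source_bins)
psi_right = similarity(weak * source_bins, strong * source_bins)

delta_left = partial_difference(strong * source_bins, weak * source_bins)
delta_right = partial_difference(weak * source_bins, strong * source_bins)
```

In this sketch psi_left and psi_right are identical, so the similarity alone cannot indicate on which side the source lies, whereas delta_left and delta_right have opposite signs.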
Despite the solutions provided, a significant problem with this approach is that a single time-frequency bin is considered as belonging to either a source on the Left or a source on the Right, depending on its relative magnitude. This means that a source panned hard Left will interfere considerably with a source panned hard Right. Furthermore, the technique uses a masking method in which the original STFT bin magnitudes are used in the re-synthesis, which will cause significant interference from any other signal whose frequencies overlap with those of the source of interest.
Accordingly, there is a need for an alternative method of stereo analysis, which facilitates sound source separation, and which overcomes at least some of the previously described problems.