Traditionally, audio content is created and stored in channel-based formats. In a channel-based format, the audio content is usually represented, stored, conveyed and distributed by the vehicle of the channel. As used herein, the term “audio channel” or “channel” refers to audio content that usually has a predefined physical location. For example, stereo, surround 5.1, surround 7.1 and the like are all channel-based formats for audio content. Each channel corresponds to a fixed-position physical speaker. When multi-channel content is played back, multiple speakers create a live and immersive sound field around a listener. Recently, several conventional multi-channel systems have been extended to support a new format that includes both channels and audio objects. As used herein, the term “audio object” or “object” refers to an individual audio element that exists for a defined duration of time in a sound field. For example, audio objects may represent dialogue, gunshots, thunder, and the like. These objects are usually used by mixers to create desired sound effects. Each object has a position in the sound field. For example, dialogue is usually located at the center front, and the sound of thunder usually emanates from overhead. A human listener's perception of an object's position results from multiple speakers playing the audio signals of the same object. For example, when an object is played by a front-left speaker and a front-right speaker at similar energy levels, the listener perceives a phantom image at the center front.
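By way of illustration only, the phantom-image effect described above can be sketched with a constant-power pan law; the specific speaker angles (±30 degrees, a common stereo layout) and the sine/cosine law are assumptions for this sketch, not the panning algorithm of any particular system:

```python
import math

def pan_gains(angle_deg, left_deg=-30.0, right_deg=30.0):
    """Constant-power (sine/cosine) panning between two speakers.

    Maps a desired source angle to left/right gains so that the total
    power g_left**2 + g_right**2 stays at 1. The speaker angles are an
    illustrative stereo layout, not a requirement of any system.
    """
    # Normalize the source angle to [0, 1] across the speaker pair.
    t = (angle_deg - left_deg) / (right_deg - left_deg)
    t = min(max(t, 0.0), 1.0)
    g_left = math.cos(t * math.pi / 2)
    g_right = math.sin(t * math.pi / 2)
    return g_left, g_right

# A source panned to 0 degrees receives equal gains in both speakers,
# so the listener perceives a phantom image at the center front.
gl, gr = pan_gains(0.0)
```

In this sketch, equal gains at similar energy levels produce the center-front phantom image discussed above, while the constant-power constraint keeps perceived loudness steady as the source moves between the speakers.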
As mentioned above, when content is created in a channel-based format, the perceptual experience is usually optimized by mixers for a specific playback setting. When played back on a different setting, however, the listening experience can degrade due to the mismatch between playback settings. One example of such degradation is that the perceived positions of an object may change. Thus, the channel-based format adapts poorly to a variety of speaker playback configurations. Another aspect of this inefficiency lies in binaural rendering, in which the channel-based format can only use a limited number of head-related transfer functions (HRTFs) specific to the speaker positions; for other positions, interpolation of HRTFs is used, which degrades the binaural listening experience.
One potential way to address this issue is to recover the original sources (or objects), including their positions and clean mono waveforms, from channel-based representations, and then use the positions as metadata to steer the panning algorithm of a speaker playback device to re-render the objects on the fly and create a sound image similar to the original one. In a binaural rendering setting (instead of using a limited number of HRTFs), the positions could be used to choose the most appropriate HRTFs to further enhance the listening experience.
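As one hedged sketch of the HRTF selection step just described, an estimated object position could be matched against a grid of measured HRTF directions by angular distance; the (azimuth, elevation) representation and the 30-degree measurement grid below are made-up assumptions for illustration:

```python
import math

def nearest_hrtf(position, hrtf_positions):
    """Return the index of the measured HRTF direction closest to an
    object's estimated direction, rather than interpolating between
    HRTFs fixed at speaker positions.

    `position` and each entry of `hrtf_positions` are (azimuth,
    elevation) pairs in degrees; the grid used below is hypothetical.
    """
    def angular_distance(a, b):
        # Great-circle distance between two directions on the sphere.
        az1, el1 = (math.radians(x) for x in a)
        az2, el2 = (math.radians(x) for x in b)
        cos_d = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
        return math.acos(max(-1.0, min(1.0, cos_d)))

    return min(range(len(hrtf_positions)),
               key=lambda i: angular_distance(position, hrtf_positions[i]))

# Hypothetical measurement grid: every 30 degrees of azimuth at ear level.
grid = [(float(az), 0.0) for az in range(0, 360, 30)]
idx = nearest_hrtf((40.0, 5.0), grid)  # nearest grid direction is (30, 0)
```

This is only one way the recovered position metadata might drive HRTF choice; a practical renderer could equally interpolate among the nearest measured directions.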
However, an object in a channel-based representation, which is to be rendered with metadata, is not always clean. It may be mixed simultaneously with other objects within some channels. For example, in order to implement an artistic intention, a mixer could place two objects simultaneously in front of a listener, one object appearing between the center and the front left and the other at some position between the center and the front right. This could make the center front channel contain two objects. If no source separation techniques were used, these two objects would be regarded as a single object, which would make their position estimates incorrect.
Thus, in order to obtain a clean object and estimate its position, source separation techniques are needed to separate the object from its multi-channel mixture and produce a clean multi-channel or mono representation. In the above-mentioned example, the source separation component should split the single multi-channel input into, for example, two multi-channel or mono outputs, each containing only one clean object.
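The splitting just described can be illustrated with a deliberately simplified toy case: if each of two objects were panned into a stereo mixture with known, fixed gains, the mixture would be linear and could be inverted exactly. This is an assumption made for the sketch; a real source separation component must estimate the mixing (or a time-frequency mask) from the audio itself:

```python
import numpy as np

def unmix_two_objects(left, right, gains):
    """Recover two clean mono objects from a stereo mixture.

    Toy sketch only: assumes the 2x2 mixing matrix `gains` (one column
    per object) is known and time-invariant, so the linear mixture can
    be inverted exactly. Real separators must estimate this blindly.
    """
    mixture = np.stack([left, right])          # shape (2, n_samples)
    sources = np.linalg.solve(gains, mixture)  # invert the mixing
    return sources[0], sources[1]

# Two hypothetical objects, each panned so that both leak into both
# channels (as in the center-front overlap example above).
obj_a = np.sin(np.linspace(0.0, 10.0, 1000))
obj_b = np.sign(np.sin(np.linspace(0.0, 7.0, 1000)))
gains = np.array([[0.9, 0.3],   # left-channel gains for A, B
                  [0.3, 0.9]])  # right-channel gains for A, B
left = gains[0, 0] * obj_a + gains[0, 1] * obj_b
right = gains[1, 0] * obj_a + gains[1, 1] * obj_b
est_a, est_b = unmix_two_objects(left, right, gains)
```

Each recovered signal then corresponds to one clean mono object, to which the position-estimation and re-rendering steps described above could be applied.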