Many audio processing algorithms modify audio signals in either the temporal domain or the spectral domain. Such algorithms are developed to improve the overall quality of audio signals and thus enhance the user's playback experience. By way of example, existing processing algorithms include a surround virtualizer, a dialog enhancer, a volume leveler, a dynamic equalizer and the like.
The surround virtualizer can be used to render a multi-channel audio signal over a stereo device such as a headphone, because it creates a virtual surround effect for the stereo device. The dialog enhancer aims at enhancing dialog in order to improve the clarity and intelligibility of human voices. The volume leveler aims at modifying an audio signal so as to make the loudness of the audio content more consistent over time; it may lower the output sound level of a very loud object at one time and raise the output sound level of a whispered object at another time. The dynamic equalizer provides a way to automatically adjust the equalization gains in each frequency band in order to keep the spectral balance consistent with a desired timbre or tone.
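By way of illustration only, the behavior described for the volume leveler can be sketched as a block-wise gain that pulls each block toward a target level. The function name `level_blocks`, the RMS level measure, and the parameters `block_size`, `target_rms` and `max_gain` are assumptions made for this sketch; a practical leveler would typically use a perceptual loudness measure and smooth the gain over time.

```python
import math
from typing import List


def level_blocks(samples: List[float], block_size: int,
                 target_rms: float = 0.1, max_gain: float = 4.0) -> List[float]:
    """Scale each block toward target_rms (hypothetical, unsmoothed leveler)."""
    out: List[float] = []
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        # Root-mean-square level of the block, used here as a crude loudness proxy.
        rms = math.sqrt(sum(s * s for s in block) / len(block))
        # Loud blocks get gain < 1, quiet blocks get gain > 1 (capped at max_gain);
        # silent blocks are passed through unchanged.
        gain = 1.0 if rms == 0.0 else min(max_gain, target_rms / rms)
        out.extend(s * gain for s in block)
    return out
```

In this sketch a loud block is attenuated to the target level, while a very quiet block is boosted only up to the `max_gain` cap so that background noise is not amplified without limit.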
Traditionally, existing audio processing algorithms are developed for processing channel-based audio signals such as stereo, 5.1 and 7.1 surround signals. Because a sound field is constructed by a number of endpoints, such as front left, front right, center, surround left, surround right and even height loudspeakers, the sound field can be defined by all of the endpoints. A channel-based audio signal can therefore be spatially rendered in the sound field. The input audio channels are first down-mixed into a number of submixes, such as front, center and surround submixes, in order to reduce the computational complexity of the subsequent audio processing algorithms. In this context, the sound field can be divided into several coverage zones in relation to the endpoint arrangement, and each submix represents a sum of the components of the audio signal in relation to a particular coverage zone. An audio signal is typically processed and rendered as a channel-based audio signal, meaning that metadata associated with the position, velocity, size and the like of an audio object is absent from the audio signal.
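By way of illustration only, the down-mixing of input channels into submixes can be sketched as a per-zone sum of channels. The channel names, the zone grouping in `SUBMIX_ZONES` and the function `downmix_to_submixes` are assumptions made for this sketch (the low-frequency channel is simply omitted); the text above does not fix a particular mapping or set of down-mix coefficients.

```python
from typing import Dict, List

# Hypothetical grouping of 5.1 channels into coverage zones.
SUBMIX_ZONES: Dict[str, List[str]] = {
    "front": ["L", "R"],
    "center": ["C"],
    "surround": ["Ls", "Rs"],
}


def downmix_to_submixes(channels: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Sum, sample by sample, the channels that belong to each coverage zone."""
    n = len(next(iter(channels.values())))
    submixes: Dict[str, List[float]] = {}
    for zone, names in SUBMIX_ZONES.items():
        mix = [0.0] * n
        for name in names:
            # A channel missing from the input contributes silence.
            for i, sample in enumerate(channels.get(name, [0.0] * n)):
                mix[i] += sample
        submixes[zone] = mix
    return submixes
```

Five input channels are reduced here to three submixes, so each subsequent processing algorithm runs on fewer signals; the individual channel contributions within a zone can no longer be separated after the sum.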
Recently, more and more object-based audio content is being created, which may include audio objects and metadata associated with the audio objects. Audio content of this kind provides a better 3D immersive audio experience than traditional channel-based audio content through more flexible rendering of the audio objects. At playback time, a rendering algorithm may, for example, render the audio objects to an immersive speaker layout including speakers all around as well as above the listener.
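By way of illustration only, an audio object and its metadata can be sketched as a small record, with a renderer that consults the metadata at playback time. The `AudioObject` fields, the coordinate convention (x from -1 at the left to +1 at the right) and the constant-power stereo panner are assumptions made for this sketch; an immersive renderer would pan across many speakers, including height speakers, rather than two.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AudioObject:
    """Hypothetical audio object: a mono signal plus positional metadata."""
    samples: List[float]
    position: Tuple[float, float, float]  # (x, y, z); only x is used below
    velocity: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    size: float = 0.0


def pan_gains(obj: AudioObject) -> Tuple[float, float]:
    """Constant-power left/right gains derived from the object's x position."""
    x = max(-1.0, min(1.0, obj.position[0]))
    angle = (x + 1.0) * math.pi / 4.0  # maps x in [-1, 1] to [0, pi/2]
    return math.cos(angle), math.sin(angle)


def render_stereo(objects: List[AudioObject]) -> Tuple[List[float], List[float]]:
    """Mix all objects into left and right channels using their metadata."""
    n = max((len(o.samples) for o in objects), default=0)
    left, right = [0.0] * n, [0.0] * n
    for obj in objects:
        gain_l, gain_r = pan_gains(obj)
        for i, sample in enumerate(obj.samples):
            left[i] += gain_l * sample
            right[i] += gain_r * sample
    return left, right
```

Note that the output of `render_stereo` contains only channel samples: the position, velocity and size of each object are consumed during rendering and cannot be recovered from the rendered channels.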
However, when the typical audio processing algorithms mentioned above are used, the object-based audio signals need to be first rendered as channel-based audio signals in order to be down-mixed into submixes for audio processing. This means that the metadata associated with these object-based audio signals is discarded, and the resulting rendering is thus compromised in terms of playback performance.
In view of the foregoing, there is a need in the art for a solution for processing and rendering the object-based audio signals without discarding their metadata.