Traditionally, audio content is created and stored in channel-based formats. As used herein, the term “audio channel” or “channel” refers to audio content that usually has a predefined physical location. For example, stereo, surround 5.1, surround 7.1 and the like are all channel-based formats for audio content. Recently, with developments in the multimedia industry, three-dimensional (3D) movies and television content have become increasingly popular in cinemas and homes. In order to create more immersive sound fields and to control discrete audio elements accurately, irrespective of specific playback speaker configurations, many conventional multichannel systems have been extended to support a new format that includes both channels and audio objects.
As used herein, the term “audio object” refers to an individual audio element that exists for a defined duration in time in the sound field. An audio object may be dynamic or static. For example, audio objects may be dialogue, gunshots, thunder, etc. As important elements of the content, audio objects are usually used by mixers to create desired sound effects.
Conventionally, audio content or an audio signal in a multi-channel format includes separate signals for at least two channels. For example, there can be five different signals included in a surround 5.1 speaker system. Each of the separate audio signals is used for driving its corresponding speaker, positioned in a stage defined by all of the physical speakers. Since the energy allocated to each channel for a single audio object is distinct, the speakers or transducers may be driven differently and reproduce the same audio object at different loudness levels, which results in a particular position being perceived by a listener in the stage. In addition, the audio signal in a multi-channel format may itself include an inter-channel correlation coefficient (ICC) represented, for example, in the form of differences in phase and amplitude among channels. The information on the energy allocation and the ICC of a particular audio object may allow the plurality of speakers to represent the audio object such that its position and size can be perceived by the listener.
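By way of illustration only, the two quantities mentioned above can be sketched in a few lines of Python. The sketch assumes that the energy allocation is the fraction of total signal energy carried by each channel, and that the ICC can be approximated by a normalized zero-lag cross-correlation between two channels; neither formula is prescribed by the text above, and the function names, the test tone and the gains 0.8 and 0.6 are hypothetical.

```python
import math


def channel_energy(signal):
    """Sum of squared samples for one channel."""
    return sum(x * x for x in signal)


def energy_allocation(channels):
    """Fraction of the total energy carried by each channel."""
    energies = [channel_energy(ch) for ch in channels]
    total = sum(energies)
    if total == 0.0:
        return [0.0 for _ in channels]
    return [e / total for e in energies]


def inter_channel_correlation(a, b):
    """Assumed ICC: normalized zero-lag cross-correlation in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("channels must have the same length")
    denom = math.sqrt(channel_energy(a) * channel_energy(b))
    if denom == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / denom


# Hypothetical example: one 440 Hz tone sent to two channels with
# different gains, so that more energy goes to the left channel.
n_samples = 480
sample_rate = 48000
source = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(n_samples)]
left = [0.8 * s for s in source]
right = [0.6 * s for s in source]

allocation = energy_allocation([left, right])
icc = inter_channel_correlation(left, right)
print(allocation, icc)
```

In this toy case the left channel carries the larger share of the energy, which would tend to pull the perceived position toward the left speaker, while the two channels are fully correlated because they are scaled copies of the same signal. Under the same assumptions, a less correlated pair of channels would yield an ICC closer to zero.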
Presently, a particular audio signal in a multi-channel format adapted for a certain multi-channel surround system needs to be rendered by professionals. That is, it is rendered in a studio using panning tools, and its properties (e.g., positions and sizes of different audio objects) can only be tailored in the studio for a specific format (e.g., a fixed number of channels corresponding to a fixed playback setting). As such, the properties cannot be manipulated once they have been created. As a result, if one would like to play well-rendered audio content in 5.1 format on a 7.1 speaker system or an ordinary stereo system, the interpretation by such a playback system is not optimized. Also, properties such as the positions and sizes of the audio objects may not be reproduced precisely by the speakers. In other words, when the audio content is created in a multi-channel format, the listening experience perceived by listeners is optimized by mixers for a specific playback setting. When the audio content is played on a different playback setting, the performance may degrade due to the mismatch between playback settings, for example as a change in the perceived position of an audio object.
In view of the foregoing, there is a need in the art for a solution for generating metadata containing the properties of an audio object.