Traditionally, audio content is created and stored in channel-based formats. As used herein, the term “audio channel” or “channel” refers to audio content that usually has a predefined physical location. For example, stereo, surround 5.1, 7.1 and the like are channel-based formats of audio content. Recently, several conventional multichannel systems have been extended to support a new format that includes both channels and audio objects. As used herein, the term “audio object” or “object” refers to an individual audio element that exists for a defined duration of time in the sound field. An audio object may be dynamic or static. For example, audio objects may be humans, animals or any other elements serving as sound sources. Audio objects and channels may be sent separately, and then used by a reproduction system on the fly to recreate the artistic intent adaptively based on the configuration of the playback devices. As an example, in a format known as “adaptive audio content,” there may be one or more audio objects and one or more “channel beds,” which are channels to be reproduced in predefined, fixed locations.
Object-based audio content represents a significant improvement over traditional channel-based audio content. That is, object-based audio content creates a more immersive sound field and controls discrete audio elements accurately, irrespective of specific configurations of the playback devices. For example, cinema sound tracks may comprise many different sound elements corresponding to the images on the screen, dialog, noises, and sound effects that emanate from different places on the screen and combine with background music and ambient effects to create the overall auditory experience.
However, the large number of audio signals (channel beds and audio objects) in object-based audio content poses new challenges for coding and distribution of the audio content. It would be appreciated that in many cases, such as distribution via Blu-ray Disc, broadcast (cable, satellite and terrestrial), mobile networks, over-the-top (OTT) or the Internet, the bandwidth and/or other resources available for transmitting and processing all the channel beds, audio objects and relevant information may be limited. Although audio coding and compression technologies may be applied to reduce the amount of information to be processed, they do not work in some cases, especially for complex scenes and for networks with very limited bandwidth, such as mobile networks. Moreover, audio coding/compression technologies are only capable of reducing the bit rate by considering the redundancy within a mono channel or channel pairs. That is, various types of spatial redundancy (e.g., the spatial position overlap and spatial masking effect among the audio objects) are not taken into account in the object-based audio content.
Clustering has been proposed to process audio objects such that each resulting cluster may represent one or more audio objects. That is, a clustering process is applied to the audio objects to make use of spatial redundancy and thereby further reduce the resource requirements. Usually, a cluster may contain or combine several audio objects that are proximate enough to each other (the channel beds may be processed as special audio objects with predefined positions). Generally speaking, several fundamental criteria should be taken into account in audio object clustering. For example, the spatial characteristics of the original content should be accurately characterized and modeled in order to maintain the overall spatial perception. Moreover, audible artifacts or any other issues or challenges for the subsequent processes should be avoided in the clustering process. Currently, audio object clustering is performed on the basis of individual frames. For example, centroids of the clustering are separately determined for each frame, without considering variations of the audio objects over time. As a result, the inter-frame stability of the clustering process is relatively low, which is likely to introduce the risk of audible artifacts when rendering the audio object clusters.
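By way of illustration only, the frame-by-frame approach described above may be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular existing system: the use of plain k-means, the seeding from the first k objects, and the names cluster_frame and cluster_per_frame are all hypothetical and chosen for illustration. Each frame is represented as a list of object positions, and at least k objects per frame are assumed.

```python
from math import dist


def cluster_frame(positions, k, iters=10):
    """Plain k-means on one frame's object positions (hypothetical helper).

    Assumes len(positions) >= k. Returns (centroids, labels).
    """
    # Naive seeding (an assumption): use the first k objects as centroids.
    centroids = [tuple(p) for p in positions[:k]]
    labels = [0] * len(positions)
    for _ in range(iters):
        # Assign each object to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c]))
            for p in positions
        ]
        # Recompute each centroid as the mean of its member positions.
        for c in range(k):
            members = [p for p, lab in zip(positions, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels


def cluster_per_frame(frames, k):
    """Cluster every frame from scratch; no state carries over between frames."""
    return [cluster_frame(positions, k) for positions in frames]
```

Because cluster_per_frame shares nothing between consecutive calls, the centroids and the object-to-cluster assignments of one frame place no constraint on those of the next, which is the source of the low inter-frame stability noted above.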
In view of the foregoing, there is a need in the art for a solution enabling more stable clustering of audio objects.