In the MPEG-4 standard ISO/IEC 14496, in particular in Part 1 (Systems), an audio/video (AV) scene can be composed from several audio, video and synthetic 2D/3D objects. These objects can be coded with different MPEG-4 format coding types and can be transmitted as binary compressed data in a multiplexed bitstream comprising multiple substreams. A substream is also referred to as an Elementary Stream (ES), and can be accessed through a descriptor. An ES can contain AV data, or can be a so-called Object Description (OD) stream, which contains configuration information necessary for decoding the AV substreams. The process of synthesizing a single scene from the component objects is called composition, and means mixing multiple individual AV objects, e.g. a presentation of a video with related audio and text, after reconstruction of packets and separate decoding of their respective ES.
The composition of a scene is described in a dedicated ES called the ‘Scene Description Stream’, which contains a scene description consisting of an encoded tree of nodes called Binary Format for Scenes (BIFS). ‘Node’ means a processing step or unit used in the MPEG-4 standard, e.g. an interface that buffers data or carries out time synchronization between a decoder and subsequent processing units. Nodes can have attributes, referred to as fields, and other information attached. A leaf node in the BIFS tree corresponds to elementary AV data by pointing to an OD within the OD stream, which in turn contains an ES descriptor pointing to AV data in an ES. Intermediate nodes, or scene description nodes, group this material to form AV objects, and perform e.g. grouping and transformation on such AV objects.
In a receiver, the configuration substreams are extracted and used to set up the required AV decoders. The AV substreams are decoded separately into objects, and the received composition instructions are used to prepare a single presentation from the decoded AV objects. This final presentation, or scene, is then played back.
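The chain of references described above (leaf node, then OD, then ES descriptor, then ES) can be illustrated with a minimal sketch. It is not taken from the standard: the class names, the field names (`od_id`, `es_id`, `stream_type`) and the identifier values are all hypothetical, chosen only to show how a receiver might resolve the leaves of a scene tree to the elementary streams it has to decode.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ESDescriptor:
    es_id: int        # hypothetical identifier of one elementary stream
    stream_type: str  # illustrative label such as "audio" or "video"


@dataclass
class ObjectDescriptor:
    od_id: int
    es_descriptors: List[ESDescriptor]


@dataclass
class SceneNode:
    name: str
    fields: Dict[str, object] = field(default_factory=dict)   # node attributes
    children: List["SceneNode"] = field(default_factory=list)
    od_id: Optional[int] = None  # set only on leaf nodes that point to an OD


def resolve_streams(node: SceneNode,
                    od_stream: Dict[int, ObjectDescriptor]) -> List[Tuple[str, int]]:
    """Walk the tree depth-first and collect (leaf name, ES id) pairs."""
    found = []
    if node.od_id is not None:
        od = od_stream[node.od_id]
        found.extend((node.name, esd.es_id) for esd in od.es_descriptors)
    for child in node.children:
        found.extend(resolve_streams(child, od_stream))
    return found


# Assumed example data: two ODs, each describing one elementary stream.
od_stream = {
    10: ObjectDescriptor(10, [ESDescriptor(101, "video")]),
    11: ObjectDescriptor(11, [ESDescriptor(102, "audio")]),
}

# An intermediate grouping node with two leaves.
scene = SceneNode("group", children=[
    SceneNode("video_leaf", od_id=10),
    SceneNode("audio_leaf", od_id=11),
])

streams = resolve_streams(scene, od_stream)
```

In this sketch the intermediate node carries no OD reference of its own; it only groups the leaves, which mirrors the role of scene description nodes in the text.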
According to the MPEG-4 standard, audio content can only be stored in the ‘audioBuffer’ node or in the ‘mediaBuffer’ node. Both nodes are able to store a single data block at a time. When storing another data block, the previously stored data block is overwritten.
The ‘audioBuffer’ node can only be loaded with data from the audio substream when the node is created, or when the ‘length’ field is changed. This means that the audio buffer can only be loaded with one continuous block of audio data. The allocated memory matches the specified amount of data. Further, it may happen that the timing of loading data samples is not exact, due to the timing model of the BIFS decoder.
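The loading behaviour described above can be pictured with a toy model. This is only a sketch under stated assumptions, not the node's normative semantics: the class `SingleBlockAudioBuffer`, its method `set_length`, and the idea that each reload continues from the current position in the substream are hypothetical simplifications. It shows the two points made in the text: a load happens only on creation or on a change of ‘length’, and each load replaces the previously stored block.

```python
class SingleBlockAudioBuffer:
    """Toy model: holds one contiguous block, reloaded only on
    creation or when the length value changes (assumed behaviour)."""

    def __init__(self, source, length):
        self._source = source   # hypothetical: decoded samples of the audio substream
        self._pos = 0           # assumption: reloads continue from this position
        self._length = 0
        self.block = []
        self.set_length(length)  # creation triggers the first load

    def set_length(self, length):
        """Return True if a reload took place, False otherwise."""
        if length == self._length:
            return False        # unchanged length: nothing is loaded
        self._length = length
        # Storage is sized to the requested amount; the old block is overwritten.
        self.block = self._source[self._pos:self._pos + length]
        self._pos += len(self.block)
        return True


stream = list(range(10))                 # stand-in for decoded audio samples
buf = SingleBlockAudioBuffer(stream, 4)
first = list(buf.block)                  # copy of the block loaded at creation
reloaded_same = buf.set_length(4)        # same length: no reload
reloaded_new = buf.set_length(3)         # new length: block is replaced
```

After the last call the first block is no longer held by the buffer, which is why a single node of this kind cannot keep several separate audio sections at once.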
For loading more than one audio sample, it is possible to build up an MPEG-4 scene using multiple ‘audioBuffer’ nodes. However, it is then difficult to handle the complexity of the scene and to synchronize the data stored in the different ‘audioBuffer’ nodes. Additionally, a new stream has to be opened for each piece of information.