In the MPEG-4 standard ISO/IEC 14496, in particular in Part 1 (Systems), an audio/video (AV) scene can be composed from several audio, video and synthetic 2D/3D objects. These objects can be coded with different MPEG-4 coding formats and transmitted as binary compressed data in a multiplexed bitstream comprising audio, video and other substreams, e.g. text to be displayed. A substream is also referred to as an Elementary Stream (ES) and can be accessed through a descriptor. A scene is generally understood as an audio-visual space that a user can interact with.
The process of synthesizing a single scene from the component objects is called composition. It means mixing multiple individual AV objects, e.g. a presentation of a video with related audio and text, after reconstruction of packets and separate decoding of their respective ES. User interaction, terminal capability and terminal configuration may be used when determining how to compose a scene at a receiver terminal. The bitstream defined in the mentioned MPEG-4 standard contains an ES called ‘Scene Description Stream’, which is a set of general instructions for the composition of scenes, and further contains other substreams, so-called Object Description Streams, which contain the configuration information necessary for decoding the AV substreams. In a receiver, the configuration substreams are extracted and used to set up the required AV decoders. The AV substreams are then decoded separately into objects, and the received composition instructions are used to prepare a single presentation from the decoded AV objects. This final presentation, or scene, which is no longer under the full control of the broadcaster or content provider because composition is terminal-dependent, is then played back.
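The receiver-side steps described above can be illustrated with a minimal sketch. It is not the decoding procedure of the standard: the class `ElementaryStream`, the stream labels "scene", "od" and "av", the textual payloads and the function `compose_scene` are all hypothetical names chosen for illustration, and real substreams carry binary compressed data.

```python
from dataclasses import dataclass


@dataclass
class ElementaryStream:
    es_id: int
    kind: str      # hypothetical labels: "scene", "od" or "av"
    payload: str   # stand-in for binary compressed data


def compose_scene(bitstream):
    """Toy version of the receiver steps: extract, configure, decode, compose."""
    # 1. Separate the multiplexed bitstream into its substreams.
    scene_es = [es for es in bitstream if es.kind == "scene"]
    od_es = [es for es in bitstream if es.kind == "od"]
    av_es = [es for es in bitstream if es.kind == "av"]

    # 2. Use the configuration substreams to set up one decoder per AV stream.
    #    Assumed payload format: "<target ES id>:<codec name>".
    decoders = {}
    for es in od_es:
        target_id, codec = es.payload.split(":")
        decoders[int(target_id)] = lambda data, codec=codec: f"{codec}({data})"

    # 3. Decode each AV substream separately into an object.
    objects = {es.es_id: decoders[es.es_id](es.payload) for es in av_es}

    # 4. Apply the composition instructions to build a single presentation.
    #    Assumed payload format: comma-separated ES ids in presentation order.
    presentation = []
    for es in scene_es:
        for es_id in es.payload.split(","):
            presentation.append(objects[int(es_id)])
    return presentation


example_bitstream = [
    ElementaryStream(1, "scene", "10,11"),
    ElementaryStream(2, "od", "10:video"),
    ElementaryStream(3, "od", "11:audio"),
    ElementaryStream(10, "av", "frames"),
    ElementaryStream(11, "av", "samples"),
]
print(compose_scene(example_bitstream))
```

The sketch keeps decoding and composition as separate steps, mirroring the description: decoders are configured only from the configuration substreams, and the presentation order comes only from the composition instructions.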
In ISO/IEC 14496-1:2002, which is the current version of the MPEG-4 Systems standard, a hierarchical model for presenting AV scenes is described, using a parametric approach. An InitialObjectDescriptor (IOD) contains descriptors for the Scene Description Stream and a dedicated OD Stream. The Scene Description Stream contains a scene description consisting of an encoded tree of nodes. ‘Node’ means a processing step or unit used in the MPEG-4 standard, e.g. an interface carrying out time synchronisation between a decoder and subsequent processing units. Nodes can have attributes, referred to as fields, and other information attached. A leaf node in this tree corresponds either to elementary AV data, by pointing to an OD within the OD stream, which in turn contains an ES Descriptor pointing to AV data in an ES, or to a graphical 2D/3D synthetic object, e.g. a cube, a curve or text. Intermediate nodes, or scene description nodes, group this material to form AV objects and perform operations such as grouping and transformation on such AV objects.
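The chain of references from a leaf node through an OD to an ES can be sketched as follows. This is an illustrative model only: the classes `ESDescriptor`, `ObjectDescriptor` and `Node`, the function `resolve_streams`, the node names and the numeric identifiers are assumptions made for the example and do not reproduce the binary syntax of the standard.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ESDescriptor:
    es_id: int                      # points to AV data in an ES


@dataclass
class ObjectDescriptor:
    od_id: int
    es_descriptors: list            # list of ESDescriptor


@dataclass
class Node:
    name: str
    fields: dict = field(default_factory=dict)     # node attributes
    children: list = field(default_factory=list)   # empty for leaf nodes
    od_id: Optional[int] = None     # set only on leaves that reference AV data


def resolve_streams(node, od_stream):
    """Collect the ES ids reachable from a node, in depth-first order."""
    found = []
    if node.od_id is not None:
        od = od_stream[node.od_id]
        found.extend(d.es_id for d in od.es_descriptors)
    for child in node.children:
        found.extend(resolve_streams(child, od_stream))
    return found


# Hypothetical OD stream, indexed by OD id.
od_stream = {
    1: ObjectDescriptor(1, [ESDescriptor(101)]),
    2: ObjectDescriptor(2, [ESDescriptor(102)]),
}

# Hypothetical scene tree: intermediate nodes group leaves; the "cube" leaf
# is a synthetic object and therefore references no OD.
scene = Node("group", children=[
    Node("transform", fields={"scale": 2.0}, children=[Node("video", od_id=1)]),
    Node("audio", od_id=2),
    Node("cube"),
])

print(resolve_streams(scene, od_stream))
```

Keeping the OD reference optional on `Node` reflects the two kinds of leaf described above: leaves backed by elementary AV data and leaves that are purely synthetic.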
Text to be displayed is contained in a Scene Description ES. The reproduction of text is described with the FontStyle node. In the FontStyle node semantics, the family field allows the content creator to select the font that the terminal uses to display, or render, the text. If the font is available on the client platform, it is used to render the given text strings. Otherwise, a default font has to be used. In contrast to many other nodes of the Scene Description, the FontStyle node fields are static, i.e. they cannot be modified by updates.
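The font selection behaviour described above amounts to a lookup with a fallback, which may be sketched as follows. The sketch assumes that the family field holds an ordered list of preferred family names; the class `FontStyle` shown here, the function `select_font` and the font names are hypothetical, and the frozen dataclass is only a loose analogy for fields that cannot be modified by updates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)       # frozen: fields cannot be reassigned after creation
class FontStyle:
    family: tuple             # assumed: preferred family names, best first


def select_font(style, installed_fonts, default_font):
    """Return the first requested family available on the terminal,
    or the terminal's default font if none is available."""
    for name in style.family:
        if name in installed_fonts:
            return name
    return default_font


installed = {"Helvetica", "Times"}            # hypothetical client platform
requested = FontStyle(family=("Garamond", "Times"))
print(select_font(requested, installed, default_font="Helvetica"))
```

When none of the requested families is installed, the function returns the default font, so the rendered appearance then depends on the terminal and not on the content creator.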