The present invention is concerned with an audio data processor according to claim 1, a method for processing audio data according to claim 14 and a computer program according to claim 15 for performing the method of processing audio data.
In home Consumer Electronics (CE) installations, functionality is spread over several devices connected via standardized interfaces. Further, high-quality equipment is often not built into a single device; instead, sophisticated single devices are available (consider set-top boxes, TV sets, AVR receivers). These devices communicate via standardized interfaces (such as HDMI).
While a first device extracts the desired streams and offers all interfaces to the user, a second device often performs decoding in “slave mode” without any interface to the user. When it comes to user interaction and control of the decoder, it is essential in this scenario to convey this user information from device #1 to device #2.
For instance, as shown in FIG. 9, a television program is often received by a first device such as a set-top box, which selects the appropriate transmission channel and extracts relevant elementary streams containing the desired coded essence. These extracted streams may be fed to a second device such as an Audio-Video-Receiver for reproduction. The transmission between these two devices may be accomplished by transmitting either a decoded/decompressed representation (PCM audio) or an encoded representation, the latter especially if bandwidth restrictions apply on the interconnection line used.
Further, since the selection of desired streams and, optionally, user interaction are accomplished in device #1 (e.g. the set-top box), in most cases only this device offers a control interface to the user. The second device (e.g. the A/V receiver) only provides a configuration interface, which is usually accessed by the user only once when setting up the system, and acts in “slave mode” during normal operation.
Modern audio codec schemes not only support the encoding of audio signals, but also provide means for user interactivity to adapt the audio play-out and rendering to the listener's preferences. The audio data stream consists of a number of encoded audio signals, e.g. channel signals or audio objects, and accompanying meta-data information that describes how these audio signals form an audio scene that is rendered to loudspeakers.
Examples for audio objects are:
- dialogue in different languages,
- additional dialogue like audio description, or
- music and effects background.
Examples for meta-data information are:
- the default volume level of each object signal (i.e. how loud it has to be mixed into the mixed signal for loudspeaker presentation),
- the default spatial position (i.e. where it has to be rendered),
- information on whether user interaction is allowed for a specific object,
- information on how the user is allowed to interact, e.g. minimum/maximum volume levels or restrictions on the positions the user may re-pan the objects to, or
- classification and/or description of audio objects.
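The per-object meta-data described above can be pictured as a small record attached to each encoded object, together with a rule that clamps user requests to the permitted range. The following is an illustrative sketch only; the field names (`default_gain_db`, `min_gain_db`, etc.) are hypothetical and not taken from any codec standard.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ObjectMetadata:
    """Illustrative per-object meta-data; field names are assumptions."""
    object_id: int
    description: str            # e.g. "dialogue (English)", "audio description"
    default_gain_db: float      # default volume level in the loudspeaker mix
    default_azimuth_deg: float  # default spatial position
    interactive: bool           # whether user interaction is allowed at all
    min_gain_db: Optional[float] = None  # lower bound for user volume changes
    max_gain_db: Optional[float] = None  # upper bound for user volume changes

def clamp_user_gain(meta: ObjectMetadata, requested_db: float) -> float:
    """Restrict a user-requested gain to the range the meta-data permits."""
    if not meta.interactive:
        return meta.default_gain_db
    lo = meta.min_gain_db if meta.min_gain_db is not None else requested_db
    hi = meta.max_gain_db if meta.max_gain_db is not None else requested_db
    return max(lo, min(hi, requested_db))
```

In this sketch, a non-interactive object always keeps its default level, while an interactive one may be adjusted only within the signalled minimum/maximum bounds.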
To accomplish this user interactivity, audio decoders/renderers (e.g. device #2) need to provide an additional (input or interaction) interface for the control information of the desired user interaction.
Alternatively, when decoding and rendering are implemented in device #2 and not in device #1, it may also be desirable to implement user control for audio object selection and manipulation in device #1 and to feed this data to device #2.
However, transmission of such data is restricted because existing standardized connections do not support the transmission of user control data and/or renderer information.
Alternatively, the selection of streams and the user interaction as described above for device #1, and the decoding as described above for device #2, may be performed by two separate functional components contained within the same device. The same restrictions on the data transmission between both components then apply: only one interface for coded data and user interaction data is available, advantageously the interaction interface of device #1, while a second interface for user interaction data, i.e. an interface usually provided by device #2, can be omitted. Even though both device #1 and device #2 are contained or implemented within the same (hardware) device, this leads to the same situation as described for the case of separate devices #1 and #2.
In order to accomplish the described use case and to overcome the above-described limitations, it is proposed to embed the user control information data, or interaction data in general, into the encoded audio data stream.
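One way to picture this embedding is as tagged packets multiplexed into the single existing audio interconnection: device #1 appends the interaction data after each encoded frame, and device #2 demultiplexes the packets again before decoding. The packet layout below (one-byte type, two-byte length) and the type identifiers are assumptions for illustration, not the syntax of any real bitstream.

```python
import struct

# Hypothetical packet type identifiers; a real codec would define these
# in its bitstream syntax.
PACKET_TYPE_AUDIO_FRAME = 0x01
PACKET_TYPE_USER_INTERACTION = 0x7F

def wrap_packet(packet_type: int, payload: bytes) -> bytes:
    """Prefix a payload with a one-byte type and a two-byte length field."""
    return struct.pack(">BH", packet_type, len(payload)) + payload

def embed_interaction_data(encoded_frame: bytes, interaction: bytes) -> bytes:
    """Device #1 side: multiplex the user's interaction data with the
    encoded audio frame so both travel over the one available interface."""
    return (wrap_packet(PACKET_TYPE_AUDIO_FRAME, encoded_frame)
            + wrap_packet(PACKET_TYPE_USER_INTERACTION, interaction))

def parse_packets(stream: bytes):
    """Device #2 side: split the received stream back into (type, payload)
    packets, recovering the interaction data next to the audio frame."""
    pos = 0
    while pos < len(stream):
        ptype, length = struct.unpack_from(">BH", stream, pos)
        pos += 3
        yield ptype, stream[pos:pos + length]
        pos += length
```

Because the interaction data rides inside the same stream, device #2 can act on it without any additional user interface of its own, matching the “slave mode” operation described above.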