In the art of audio processing, audio transmission and audio storage, there is an increasing desire to handle multi-channel content in order to improve the hearing impression. The usage of multi-channel audio content brings along significant improvements for the user. For example, a 3-dimensional hearing impression can be obtained, which brings along improved user satisfaction in entertainment applications. However, multi-channel audio content is also useful in professional environments, for example, in telephone conferencing applications, because the speaker intelligibility can be improved by using a multi-channel audio playback.
However, it is also desirable to have a good trade-off between audio quality and bitrate requirements in order to avoid excessive resource consumption in low-cost or professional multi-channel applications.
Parametric techniques for the bitrate-efficient transmission and/or storage of audio scenes containing multiple audio objects have recently been proposed. For example, a binaural cue coding, which is described, for example, in reference [1], and a parametric joint-coding of audio sources, which is described, for example, in reference [2], have been proposed. Also, an MPEG spatial audio object coding (SAOC) has been proposed, which is described, for example, in references [3] and [4]. MPEG spatial audio object coding is currently under standardization, and described in non-pre-published reference [5].
These techniques aim at perceptually reconstructing the desired output scene rather than at obtaining a waveform match.
However, in combination with user interactivity at the receiving side, such techniques may lead to a low audio quality of the output audio signals if extreme object rendering is performed. This is described, for example, in reference [6].
In the following, such systems will be described, and it should be noted that the basic concepts also apply to the embodiments of the invention.
FIG. 8 shows a system overview of such a system (here: MPEG SAOC). The MPEG SAOC system 800 shown in FIG. 8 comprises an SAOC encoder 810 and an SAOC decoder 820. The SAOC encoder 810 receives a plurality of object signals x1 to xN, which may be represented, for example, as time-domain signals or as time-frequency-domain signals (for example, in the form of a set of transform coefficients of a Fourier-type transform, or in the form of QMF subband signals). The SAOC encoder 810 typically also receives downmix coefficients d1 to dN, which are associated with the object signals x1 to xN. Separate sets of downmix coefficients may be available for each channel of the downmix signal. The SAOC encoder 810 is typically configured to obtain a channel of the downmix signal by combining the object signals x1 to xN in accordance with the associated downmix coefficients d1 to dN. Typically, there are fewer downmix channels than object signals x1 to xN. In order to allow (at least approximately) for a separation (or separate treatment) of the object signals at the side of the SAOC decoder 820, the SAOC encoder 810 provides both the one or more downmix signals (designated as downmix channels) 812 and a side information 814. The side information 814 describes characteristics of the object signals x1 to xN, in order to allow for a decoder-sided object-specific processing.
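The encoder-side combination of the object signals in accordance with the downmix coefficients can be sketched as a matrix-vector operation. The following is a minimal illustration only (the function name `downmix` and the signal shapes are assumptions for this sketch, not the normative SAOC encoder processing):

```python
import numpy as np

# Illustrative sketch: each downmix channel is obtained as a weighted sum of
# the object signals x1..xN, with weights given by the downmix coefficients.
def downmix(objects: np.ndarray, d: np.ndarray) -> np.ndarray:
    """objects: (N, num_samples) object signals x1..xN.
    d: (num_downmix_channels, N) downmix coefficients.
    Returns: (num_downmix_channels, num_samples) downmix signal(s)."""
    return d @ objects

objects = np.array([[1.0, 0.0, -1.0],
                    [0.5, 0.5, 0.5]])   # N = 2 objects, 3 samples each
d = np.array([[0.7, 0.7]])              # mono downmix: coefficients d1, d2
print(downmix(objects, d))              # [[ 1.05  0.35 -0.35]]
```

For a mono downmix, as in this sketch, `d` degenerates to a single row of coefficients d1 to dN.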
The SAOC decoder 820 is configured to receive both the one or more downmix signals 812 and the side information 814. Also, the SAOC decoder 820 is typically configured to receive a user interaction information and/or a user control information 822, which describes a desired rendering setup. For example, the user interaction information/user control information 822 may describe a speaker setup and the desired spatial placement of the objects which provide the object signals x1 to xN.
The SAOC decoder 820 is configured to provide, for example, a plurality of decoded upmix channel signals ŷ1 to ŷM. The upmix channel signals may for example be associated with individual speakers of a multi-speaker rendering arrangement. The SAOC decoder 820 may, for example, comprise an object separator 820a, which is configured to reconstruct, at least approximately, the object signals x1 to xN on the basis of the one or more downmix signals 812 and the side information 814, thereby obtaining reconstructed object signals 820b. However, the reconstructed object signals 820b may deviate somewhat from the original object signals x1 to xN, for example, because the side information 814 is not quite sufficient for a perfect reconstruction due to the bitrate constraints. The SAOC decoder 820 may further comprise a mixer 820c, which may be configured to receive the reconstructed object signals 820b and the user interaction information/user control information 822, and to provide, on the basis thereof, the upmix channel signals ŷ1 to ŷM. The mixer 820c may be configured to use the user interaction information/user control information 822 to determine the contribution of the individual reconstructed object signals 820b to the upmix channel signals ŷ1 to ŷM. The user interaction information/user control information 822 may, for example, comprise rendering parameters (also designated as rendering coefficients), which determine the contribution of the individual reconstructed object signals 820b to the upmix channel signals ŷ1 to ŷM.
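The mixing performed by the mixer can likewise be sketched as the application of a rendering matrix, whose entries are the rendering coefficients, to the reconstructed object signals. The names `mix`, `R` and `x_hat` below are assumptions for this illustration only:

```python
import numpy as np

# Illustrative sketch: the M upmix channel signals are obtained by applying
# an (M x N) rendering matrix R, taken from the user control information,
# to the N reconstructed object signals.
def mix(reconstructed: np.ndarray, R: np.ndarray) -> np.ndarray:
    """reconstructed: (N, num_samples); R: (M, N); returns (M, num_samples)."""
    return R @ reconstructed

x_hat = np.array([[1.0, 2.0],
                  [3.0, 4.0]])     # N = 2 reconstructed object signals
R = np.array([[0.8, 0.2],          # object contributions to channel 1
              [0.2, 0.8]])         # object contributions to channel 2
y_hat = mix(x_hat, R)
print(y_hat)                       # [[1.4 2.4] [2.6 3.6]]
```

Changing an entry of `R` changes the contribution of the corresponding reconstructed object to the corresponding upmix channel, which is precisely the decoder-sided interactivity discussed below.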
However, it should be noted that in many embodiments, the object separation, which is indicated by the object separator 820a in FIG. 8, and the mixing, which is indicated by the mixer 820c in FIG. 8, are performed in a single step. For this purpose, overall parameters may be computed which describe a direct mapping of the one or more downmix signals 812 onto the upmix channel signals ŷ1 to ŷM. These parameters may be computed on the basis of the side information and the user interaction information/user control information 822.
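The equivalence of the two-step processing and the single-step processing can be illustrated as a composition of matrices. In the sketch below, `G` stands for a separation matrix derived from the side information and `R` for the rendering matrix; both names and the concrete values are assumptions for this illustration:

```python
import numpy as np

# Illustrative sketch: instead of first estimating the objects with a
# separation matrix G and then mixing with a rendering matrix R, an overall
# matrix W = R @ G can be applied directly to the downmix signals.
G = np.array([[0.6],
              [0.4]])              # (N x 1): mono downmix -> N = 2 objects
R = np.array([[1.0, 0.0],
              [0.5, 0.5]])         # (M x N): N objects -> M = 2 channels
W = R @ G                          # (M x 1): direct downmix-to-upmix mapping

dmx = np.array([[2.0, -2.0]])      # mono downmix signal, 2 samples
two_step = R @ (G @ dmx)           # separate, then mix
one_step = W @ dmx                 # single transcoding step
print(np.allclose(two_step, one_step))   # True
```

Since `W` has only M rows, the single-step formulation avoids ever materializing the N reconstructed object signals, which is the source of the complexity reduction mentioned below.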
Taking reference now to FIGS. 9a, 9b and 9c, different apparatus for obtaining an upmix signal representation on the basis of a downmix signal representation and object-related side information will be described. FIG. 9a shows a block schematic diagram of an MPEG SAOC system 900 comprising an SAOC decoder 920. The SAOC decoder 920 comprises, as separate functional blocks, an object decoder 922 and a mixer/renderer 926. The object decoder 922 provides a plurality of reconstructed object signals 924 in dependence on the downmix signal representation (for example, in the form of one or more downmix signals represented in the time domain or in the time-frequency-domain) and object-related side information (for example, in the form of object meta data). The mixer/renderer 926 receives the reconstructed object signals 924 associated with a plurality of N objects and provides, on the basis thereof, one or more upmix channel signals 928. In the SAOC decoder 920, the extraction of the object signals 924 is performed separately from the mixing/rendering, which allows for a separation of the object decoding functionality from the mixing/rendering functionality but brings along a relatively high computational complexity.
Taking reference now to FIG. 9b, another MPEG SAOC system 930 will be briefly discussed, which comprises an SAOC decoder 950. The SAOC decoder 950 provides a plurality of upmix channel signals 958 in dependence on a downmix signal representation (for example, in the form of one or more downmix signals) and an object-related side information (for example, in the form of object meta data). The SAOC decoder 950 comprises a combined object decoder and mixer/renderer, which is configured to obtain the upmix channel signals 958 in a joint mixing process without a separation of the object decoding and the mixing/rendering, wherein the parameters for said joint upmix process are dependent both on the object-related side information and the rendering information. The joint upmix process depends also on the downmix information, which is considered to be part of the object-related side information.
To summarize the above, the provision of the upmix channel signals 928, 958 can be performed in a one step process or a two step process.
Taking reference now to FIG. 9c, an MPEG SAOC system 960 will be described. The SAOC system 960 comprises an SAOC to MPEG Surround transcoder 980, rather than an SAOC decoder.
The SAOC to MPEG Surround transcoder comprises a side information transcoder 982, which is configured to receive the object-related side information (for example, in the form of object meta data) and, optionally, information on the one or more downmix signals and the rendering information. The side information transcoder is also configured to provide an MPEG Surround side information (for example, in the form of an MPEG Surround bitstream) on the basis of the received data. Accordingly, the side information transcoder 982 is configured to transform an object-related (parametric) side information, which is received from the object encoder, into a channel-related (parametric) side information, taking into consideration the rendering information and, optionally, the information about the content of the one or more downmix signals.
Optionally, the SAOC to MPEG Surround transcoder 980 may be configured to manipulate the one or more downmix signals, described, for example, by the downmix signal representation, using a downmix signal manipulator 986, to obtain a manipulated downmix signal representation 988. However, the downmix signal manipulator 986 may be omitted, such that the output downmix signal representation 988 of the SAOC to MPEG Surround transcoder 980 is identical to the input downmix signal representation of the SAOC to MPEG Surround transcoder. The downmix signal manipulator 986 may, for example, be used if the channel-related MPEG Surround side information 984 would not allow for the provision of a desired hearing impression on the basis of the input downmix signal representation of the SAOC to MPEG Surround transcoder 980, which may be the case in some rendering constellations.
Accordingly, the SAOC to MPEG Surround transcoder 980 provides the downmix signal representation 988 and the MPEG Surround bitstream 984 such that a plurality of upmix channel signals, which represent the audio objects in accordance with the rendering information input to the SAOC to MPEG Surround transcoder 980, can be generated using an MPEG Surround decoder which receives the MPEG Surround bitstream 984 and the downmix signal representation 988.
To summarize the above, different concepts for decoding SAOC-encoded audio signals can be used. In some cases, a SAOC decoder is used, which provides upmix channel signals (for example, upmix channel signals 928, 958) in dependence on the downmix signal representation and the object-related parametric side information. Examples for this concept can be seen in FIGS. 9a and 9b. Alternatively, the SAOC-encoded audio information may be transcoded to obtain a downmix signal representation (for example, a downmix signal representation 988) and a channel-related side information (for example, the channel-related MPEG Surround bitstream 984), which can be used by an MPEG Surround decoder to provide the desired upmix channel signals.
In the MPEG SAOC system 800, a system overview of which is given in FIG. 8, the general processing is carried out in a frequency selective way and can be described as follows within each frequency band:
- N input audio object signals x1 to xN are downmixed as part of the SAOC encoder processing. For a mono downmix, the downmix coefficients are denoted by d1 to dN. In addition, the SAOC encoder 810 extracts side information 814 describing the characteristics of the input audio objects. For MPEG SAOC, the relations of the object powers with respect to each other are the most basic form of such side information.
- The downmix signal (or signals) 812 and the side information 814 are transmitted and/or stored. To this end, the downmix audio signal may be compressed using well-known perceptual audio coders such as MPEG-1 Layer II or III (also known as ".mp3"), MPEG Advanced Audio Coding (AAC), or any other audio coder.
- On the receiving end, the SAOC decoder 820 conceptually tries to restore the original object signals ("object separation") using the transmitted side information 814 (and, naturally, the one or more downmix signals 812). These approximated object signals (also designated as reconstructed object signals 820b) are then mixed into a target scene represented by M audio output channels (which may, for example, be represented by the upmix channel signals ŷ1 to ŷM) using a rendering matrix. For a mono output, the rendering matrix coefficients are given by r1 to rN.
- Effectively, the separation of the object signals is rarely executed (or even never executed), since both the separation step (indicated by the object separator 820a) and the mixing step (indicated by the mixer 820c) are combined into a single transcoding step, which often results in an enormous reduction in computational complexity.
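The most basic form of side information mentioned above, the relations of the object powers with respect to each other, can be sketched as follows within one frequency band. The helper name `relative_object_powers` and the normalization to the strongest object are assumptions for this illustration; the normative SAOC side-information syntax differs:

```python
import numpy as np

# Illustrative sketch: per frequency band, describe each object's power
# relative to the power of the strongest object in that band.
def relative_object_powers(band_signals: np.ndarray) -> np.ndarray:
    """band_signals: (N, num_samples) object signals within one band.
    Returns each object's mean power divided by the maximum object power."""
    powers = np.mean(band_signals ** 2, axis=1)
    return powers / np.max(powers)

band = np.array([[2.0, 2.0, 2.0],    # object 1: power 4
                 [1.0, 1.0, 1.0]])   # object 2: power 1
print(relative_object_powers(band))  # [1.   0.25]
```

Transmitting such relative quantities per band, rather than the object waveforms themselves, is what keeps the side-information rate low.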
It has been found that such a scheme is tremendously efficient, both in terms of transmission bitrate (only a few downmix channels plus some side information need to be transmitted instead of N discrete object audio signals or a discrete system) and computational complexity (the processing complexity relates mainly to the number of output channels rather than to the number of audio objects). Further advantages for the user on the receiving end include the freedom of choosing a rendering setup of his/her choice (mono, stereo, surround, virtualized headphone playback, and so on) and the feature of user interactivity: the rendering matrix, and thus the output scene, can be set and changed interactively by the user at will, according to personal preference or other criteria. For example, it is possible to locate the talkers from one group together in one spatial area to maximize discrimination from the other remaining talkers. This interactivity is achieved by providing a decoder user interface:
For each transmitted sound object, its relative level and (for non-mono rendering) spatial position of rendering can be adjusted. This may happen in real-time as the user changes the position of the associated graphical user interface (GUI) sliders (for example: object level=+5 dB, object position=−30 deg).
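A level adjustment such as "object level=+5 dB" would typically be translated into a linear scaling of the corresponding rendering coefficient. The helper below is a hypothetical illustration of that mapping (the 20·log10 amplitude convention is assumed; the actual GUI-to-coefficient mapping is implementation-specific):

```python
# Illustrative sketch: map a GUI level slider value in dB to a linear gain
# that scales the object's rendering coefficient (amplitude convention).
def level_to_gain(level_db: float) -> float:
    return 10.0 ** (level_db / 20.0)

print(round(level_to_gain(5.0), 3))   # 1.778  (the "+5 dB" example above)
print(level_to_gain(0.0))             # 1.0    (slider at neutral position)
```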
However, it has been found that the decoder-sided choice of parameters for the provision of the upmix signal representation (e.g. the upmix channel signals ŷ1 to ŷM) brings along audible degradations in some cases.