Recently, parametric techniques for the bitrate-efficient transmission/storage of audio scenes containing multiple audio objects have been proposed in the field of audio coding [BCC, JSC, SAOC, SAOC1, SAOC2] and informed source separation [ISS1, ISS2, ISS3, ISS4, ISS5, ISS6]. These techniques aim at reconstructing a desired output audio scene or audio source object based on additional side information describing the transmitted/stored audio scene and/or source objects in the audio scene. This reconstruction takes place in the decoder using a parametric informed source separation scheme.
Here, we will focus mainly on the operation of the MPEG Spatial Audio Object Coding (SAOC) [SAOC], but the same principles hold also for other systems. The main operations of an SAOC system are illustrated in FIG. 5. Without loss of generality, in order to improve readability of equations, for all introduced variables the indices denoting time and frequency dependency are omitted in this document, unless otherwise stated. The system receives N input audio objects S1, . . . , SN and instructions on how these objects should be mixed, e.g., in the form of a downmixing matrix D. The input objects can be represented as a matrix S of size N×NSamples. The encoder extracts parametric and possibly also waveform-based side information describing the objects. In SAOC, the side information consists mainly of the relative object energy information, parameterized with Object Level Differences (OLDs), and of information on the correlations between the objects, parameterized with Inter-Object Correlations (IOCs). The optional waveform-based side information in SAOC describes the reconstruction error of the parametric model. In addition to extracting this side information, the encoder provides a downmix signal X1, . . . , XM with M channels, created using the information within the downmixing matrix D of size M×N. The downmix signals can be represented as a matrix X of size M×NSamples with the following relationship to the input objects: X=DS. Normally, the relationship M<N holds, but this is not a strict requirement. The downmix signals and the side information are transmitted or stored, e.g., with the help of an audio codec such as MPEG-2/4 AAC. The SAOC decoder receives the downmix signals and the side information, as well as additional rendering information, often in the form of a rendering matrix M of size K×N, describing how the output Y1, . . . , YK with K channels is related to the original input objects.
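The downmix relationship X=DS can be illustrated with a minimal sketch. The signal values, the sizes (N=3, M=2, NSamples=4), and the helper name matmul are assumptions chosen for illustration only; plain Python lists are used so that the example has no dependencies.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Hypothetical toy scene: N = 3 objects with NSamples = 4 samples each.
S = [
    [1.0, 2.0, 3.0, 4.0],    # object 1
    [0.5, 0.5, 0.5, 0.5],    # object 2
    [0.0, 1.0, 0.0, -1.0],   # object 3
]

# Downmixing matrix D of size M x N (here M = 2 downmix channels).
D = [
    [1.0, 0.5, 0.0],   # channel 1: object 1 plus half of object 2
    [0.0, 0.5, 1.0],   # channel 2: half of object 2 plus object 3
]

# Downmix X = D S of size M x NSamples.
X = matmul(D, S)
```

Here M<N, so the downmix carries fewer channels than there are objects, which is the usual case mentioned above.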
The main operational blocks of an SAOC decoder are depicted in FIG. 6 and will be briefly discussed in the following. First, the side information is decoded and interpreted appropriately. The (Virtual) Object Separation block uses the side information and attempts to (virtually) reconstruct the input audio objects. The operation is termed “virtual” because it is usually not necessary to reconstruct the objects explicitly; instead, the subsequent rendering stage can be combined with this step. The (virtual) object reconstructions Ŝ1, . . . , ŜN may still contain reconstruction errors. The (virtual) object reconstructions can be represented as a matrix Ŝ of size N×NSamples. The system receives the rendering information from outside, e.g., from user interaction. In the context of SAOC, the rendering information is described as a rendering matrix M defining the way the object reconstructions Ŝ1, . . . , ŜN should be combined to produce the output signals Y1, . . . , YK. The output signals can be represented as a matrix Y of size K×NSamples, obtained by applying the rendering matrix M to the reconstructed objects Ŝ through Y=MŜ.
The (virtual) object separation in SAOC operates mainly by using parametric side information to determine un-mixing coefficients, which it then applies to the downmix signals to obtain the (virtual) object reconstructions. Note that the perceptual quality obtained this way may be insufficient for some applications. For this reason, SAOC also provides an enhanced quality mode for up to four original input audio objects. These objects, referred to as Enhanced Audio Objects (EAOs), are associated with time-domain correction signals minimizing the difference between the (virtual) object reconstructions and the original input audio objects. An EAO can be reconstructed with very small waveform differences from the original input audio object.
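The parametric part of the (virtual) object separation and its combination with the rendering can be sketched as follows. The text above does not give the formula for the un-mixing coefficients; the sketch assumes a commonly used least-squares form, G = E Dᵀ (D E Dᵀ)⁻¹, with an object covariance model E whose entries are sqrt(OLDi·OLDj)·IOCij. It is further restricted to real-valued signals and a mono downmix (M=1), so that the matrix inverse reduces to a scalar division. The names G, E, unmixing_matrix, and M_render, and all numeric values, are illustrative assumptions.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def unmixing_matrix(old, ioc, D):
    """Assumed least-squares un-mixing G = E D^T (D E D^T)^-1, mono downmix only."""
    n = len(old)
    # Object covariance model E built from the parametric side information.
    E = [[math.sqrt(old[i] * old[j]) * ioc[i][j] for j in range(n)]
         for i in range(n)]
    Dt = [[d] for d in D[0]]        # D^T, size N x 1
    EDt = matmul(E, Dt)             # E D^T, size N x 1
    denom = matmul(D, EDt)[0][0]    # scalar D E D^T (valid because M = 1)
    return [[row[0] / denom] for row in EDt]

# Hypothetical side information for N = 2 uncorrelated objects.
old = [1.0, 0.25]
ioc = [[1.0, 0.0], [0.0, 1.0]]
D = [[1.0, 1.0]]                    # mono downmix, M = 1
X = [[0.5, -1.0, 2.0]]              # received downmix, size M x NSamples

G = unmixing_matrix(old, ioc, D)    # size N x M
S_hat = matmul(G, X)                # explicit (virtual) object reconstructions

M_render = [[1.0, 0.5]]             # rendering matrix, K = 1 output channel
Y_two_step = matmul(M_render, S_hat)             # Y = M (G X)
Y_combined = matmul(matmul(M_render, G), X)      # Y = (M G) X, objects never formed
```

Both routes give the same output, which is why the objects need not be reconstructed explicitly: the decoder can apply the product of the rendering matrix and the un-mixing coefficients directly to the downmix.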
One main property of an SAOC system is that the downmix signals X1, . . . , XM can be designed in such a way that they can be listened to and form a semantically meaningful audio scene. This allows users without a receiver capable of decoding the SAOC information to still enjoy the main audio content without the possible SAOC enhancements. For example, it would be possible to apply an SAOC system as described above within radio or TV broadcast in a backward compatible way. It would be practically impossible to replace all deployed receivers merely to add some non-critical functionality. The SAOC side information is normally rather compact and can be embedded within the downmix signal transport stream. The legacy receivers simply ignore the SAOC side information and output the downmix signals, and the receivers including an SAOC decoder can decode the side information and provide some additional functionality.
However, especially in the broadcast use case, the downmix signal produced by the SAOC encoder will be further post-processed by the broadcast station for aesthetic or technical reasons before being transmitted. It is possible that the sound engineer wants to adjust the audio scene to better fit an artistic vision, or that the signal is manipulated to match the trademark sound image of the broadcaster, or that the signal has to be manipulated to comply with technical regulations, such as the recommendations and regulations regarding audio loudness. When the downmix signal is manipulated, the signal flow diagram of FIG. 5 is changed into the one seen in FIG. 7. Here, it is assumed that the downmix manipulation, i.e., the downmix mastering, applies some function ƒ(⋅) to each of the downmix signals Xi, 1≤i≤M, resulting in the manipulated downmix signals ƒ(Xi), 1≤i≤M. It is also possible that the actually transmitted downmix signals do not stem from the ones produced by the SAOC encoder, but are provided from outside as a whole; this situation is included in the discussion as being also a manipulation of the encoder-created downmix.
The manipulation of the downmix signals may cause problems in the (virtual) object separation of the SAOC decoder, as the downmix signals in the decoder may no longer match the model transmitted through the side information. In particular, the waveform side information of the prediction error transmitted for the EAOs is very sensitive to waveform alterations in the downmix signals.
It should be noted that the MPEG SAOC [SAOC] is defined for a maximum of two downmix signals and one or two output signals, i.e., 1≤M≤2 and 1≤K≤2. However, the dimensions are here extended to a general case, as this extension is rather trivial and helps the description.
It has been proposed in [PDG, SAOC] to route the manipulated downmix signals also to the SAOC encoder, extract some additional side information, and use this side information in the decoder to reduce the differences between the downmix signals complying with the SAOC mixing model and the manipulated downmix signals available in the decoder. The basic idea of the routing is illustrated in FIG. 8a with the additional feedback connection from the downmix manipulation into the SAOC encoder. The current MPEG standard for SAOC [SAOC] includes parts of the proposal [PDG] mainly focusing on the parametric compensation. The estimation of the compensation parameters is not described here, but the reader is referred to the informative Annex D.8 of the MPEG SAOC standard [SAOC].
The correction side information is packed into the side information stream and transmitted and/or stored alongside. The SAOC decoder decodes the side information and uses the downmix modification side information to compensate for the manipulations before the main SAOC processing. This is illustrated in FIG. 8b. The MPEG SAOC standard defines the compensation side information to consist of gain factors for each downmix signal. These are denoted with PDGi wherein 1≤i≤M is the downmix signal index. The individual signal parameters can be collected into a matrix
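$$W_{\mathrm{PDG}} = \begin{pmatrix} \mathrm{PDG}_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \mathrm{PDG}_M \end{pmatrix}.$$
When the manipulated downmix signals are denoted with the matrix Xpostprocessed, the compensated downmix signals to be used in the main SAOC processing can be obtained with $X = W_{\mathrm{PDG}} X_{\mathrm{postprocessed}}$.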
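The gain compensation can be sketched as below. Because the matrix of PDG gains is diagonal, applying it amounts to scaling each downmix channel by its own gain. The estimator estimate_pdg is a hypothetical energy-matching rule introduced only for this illustration; it is not the estimation procedure of Annex D.8, which is not described here. The mastering is modeled as a fixed gain per channel, which is an assumption that makes the compensation exact; for a real mastering chain the compensated signals would only approximate the encoder-created downmix.

```python
import math

def estimate_pdg(x_ref, x_post):
    """Hypothetical estimator: gain that matches the energy of x_post to x_ref."""
    e_ref = sum(v * v for v in x_ref)
    e_post = sum(v * v for v in x_post)
    return math.sqrt(e_ref / e_post) if e_post > 0.0 else 1.0

def compensate(pdg, X_post):
    """Apply the diagonal gain matrix: scale channel i by pdg[i]."""
    return [[g * v for v in channel] for g, channel in zip(pdg, X_post)]

# Encoder-created downmix with M = 2 channels (toy values).
X_enc = [[1.0, -2.0, 0.5],
         [0.25, 0.5, -1.0]]

# Toy mastering: a fixed gain per channel stands in for the function f.
X_post = [[2.0 * v for v in X_enc[0]],
          [0.5 * v for v in X_enc[1]]]

# Encoder side: estimate one gain per downmix channel.
pdg = [estimate_pdg(ref, post) for ref, post in zip(X_enc, X_post)]

# Decoder side: compensate the received, manipulated downmix.
X_comp = compensate(pdg, X_post)
```

In this toy case the estimated gains are the reciprocals of the mastering gains, so the compensated downmix coincides with the encoder-created one.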
In [PDG] it is also proposed to include waveform residual signals describing the difference between the parametrically compensated manipulated downmix signals and the downmix signals created by the SAOC encoder. These, however, are not a part of the MPEG SAOC standard [SAOC].
The benefit of the compensation is that the downmix signals received by the SAOC (virtual) object separation block are closer to the downmix signals produced by the SAOC encoder and match the transmitted side information better. Often, this leads to reduced artifacts in the (virtual) object reconstructions.
The downmix signals used by the (virtual) object separation approximate the un-manipulated downmix signals created in the SAOC encoder. As a result, the output after the rendering will approximate the result that would be obtained by applying the often user-defined rendering instructions to the original input audio objects. If the rendering information is defined to be identical or very close to the downmixing information, in other words, M≈D, the output signals will resemble the encoder-created downmix signals: Y≈X. Since the downmix signal manipulation may take place for well-grounded reasons, it may be desirable that the output instead resembles the manipulated downmix: Y≈ƒ(X).
Let us illustrate this with a more concrete example from the potential application of dialog enhancement in broadcast.
The original input audio objects S consist of a (possibly multi-channel) background signal, e.g., the audience and ambient noise in a sports broadcast, and a (possibly multi-channel) foreground signal, e.g., the commentator.
The downmix signal X contains a mixture of the background and the foreground.
The downmix signal is manipulated by ƒ(X), which in a real-world case consists of, e.g., a multiband equalizer, a dynamic range compressor, and a limiter (any manipulation done here is later referred to as “mastering”).
In the decoder, the rendering information is similar to the downmixing information. The only difference is that the relative level balance between the background and the foreground signals can be adjusted by the end-user. In other words, the user can attenuate the audience noise to make the commentator more audible, e.g., for improved intelligibility. As an opposite example, the end-user may attenuate the commentator to be able to focus more on the acoustic scene of the event.
If no compensation of the downmix manipulation is used, the (virtual) object reconstructions may contain artifacts caused by the differences between the real properties of the received downmix signals and the properties transmitted as the side information.
If compensation of the downmix manipulation is used, the output will have the mastering removed. Even in the case when the end-user does not modify the mixing balance, the default downmix signal (i.e., the output from receivers not capable of decoding the SAOC side information) and the rendered output will differ, possibly quite considerably.
In the end, the broadcaster then has the following sub-optimal options:
accept the SAOC artifacts from the mismatch between the downmix signals and the side information;
do not include any advanced dialog enhancement functionality; and/or
lose the mastering alterations of the output signal.
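The last option can be made concrete with a small sketch of the dialog enhancement example. All values are assumptions: the mastering is reduced to a single gain, the compensation gain is taken to be its exact reciprocal, and the un-mixing coefficients G are hypothetical numbers rather than values derived from real side information.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

D = [[1.0, 1.0]]        # mono downmix of background and foreground
G = [[0.8], [0.2]]      # hypothetical un-mixing coefficients, size N x M
M_render = D            # end-user leaves the mixing balance untouched

X = [[0.5, -1.0, 2.0]]  # encoder-created downmix (toy values)

mastering_gain = 0.5    # toy stand-in for the mastering function f
X_post = [[mastering_gain * v for v in X[0]]]   # what legacy receivers output

pdg = 1.0 / mastering_gain                      # idealized compensation gain
X_comp = [[pdg * v for v in X_post[0]]]         # compensated downmix

# Rendered output of the SAOC decoder with compensation enabled.
Y = matmul(matmul(M_render, G), X_comp)
```

Under these assumptions the rendered output Y equals the un-mastered downmix X rather than the mastered signal ƒ(X) that legacy receivers play, although the end-user has not changed the balance at all.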