The present invention relates to audio signal processing and, in particular, to a decoder, an encoder, a system, methods and a computer program for spatial audio object coding employing hidden objects for signal mixture manipulation.
Audio signal processing is becoming increasingly important. Recently, parametric techniques for bitrate-efficient transmission and/or storage of audio scenes containing multiple audio objects have been proposed in the field of audio coding [BCC, JSC, SAOC, SAOC1, SAOC2] and, moreover, in the field of informed source separation [ISS1, ISS2, ISS3, ISS4, ISS5, ISS6]. These techniques aim at reconstructing a desired output audio scene or a desired audio source object on the basis of additional side information describing the transmitted and/or stored audio scene and/or the audio source objects in the audio scene.
FIG. 11 depicts a system according to the state of the art illustrating the example of MPEG SAOC (MPEG=Moving Picture Experts Group; SAOC=Spatial Audio Object Coding). In particular, FIG. 11 illustrates an MPEG SAOC system overview.
According to the state of the art, general processing is often carried out in a frequency selective way and can, for example, be described as follows within each frequency band:
N input audio object signals s1 . . . sN are mixed down to P channels x1 . . . xP as part of the processing of a mixer 912 of a state-of-the-art SAOC encoder 910. A downmix matrix may be employed comprising the elements d1,1, . . . , dN,P. In addition, a side information estimator 914 of the SAOC encoder 910 extracts side information describing the characteristics of the input audio objects. For MPEG SAOC, the relations of the object powers with respect to each other are a basic form of such side information.
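By way of illustration only, the encoder-side steps described above may be sketched as follows for a single frequency band. The function names `downmix` and `relative_object_powers` are hypothetical, the indexing `d[n][p]` (gain of object n in downmix channel p) is an assumption derived from the element names d1,1 . . . dN,P, and normalizing the object powers to the strongest object is one assumed way of expressing the relations of the object powers with respect to each other. The sketch is not the normative MPEG SAOC procedure.

```python
def downmix(objects, d):
    """Mix N object signals down to P channels.

    objects: list of N equal-length sample lists (s_1 .. s_N).
    d: N x P list of lists; d[n][p] is the assumed gain of object n
       in downmix channel p.
    Returns a list of P sample lists (x_1 .. x_P).
    """
    num_objects = len(objects)
    num_channels = len(d[0])
    num_samples = len(objects[0])
    return [
        [
            sum(d[n][p] * objects[n][t] for n in range(num_objects))
            for t in range(num_samples)
        ]
        for p in range(num_channels)
    ]


def relative_object_powers(objects):
    """Return each object's power relative to the strongest object.

    This normalization is an assumption made for illustration; it yields
    values in the range 0..1.
    """
    powers = [sum(v * v for v in s) for s in objects]
    strongest = max(powers)
    if strongest == 0.0:
        # All objects are silent in this band.
        return [0.0 for _ in powers]
    return [p / strongest for p in powers]
```

For example, two objects mixed into two channels with `d = [[1.0, 0.0], [0.5, 0.5]]` place the first object only in the first channel and spread the second object equally over both channels.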
Subsequently, downmix signal(s) and side information may be transmitted and/or stored. To this end, the downmix audio signal may be encoded, e.g. compressed, by a state-of-the-art perceptual audio coder 920, such as an MPEG-1 Layer II or III (also known as mp3) audio coder or an MPEG Advanced Audio Coding (AAC) audio coder, etc.
On the receiving end, the encoded signals may, at first, be decoded, e.g., by a state-of-the-art perceptual audio decoder 940, such as an MPEG-1 Layer II or III audio decoder or an MPEG Advanced Audio Coding (AAC) audio decoder, etc.
Then, a state-of-the-art SAOC decoder 950 conceptually tries to restore the original object signals, e.g., by conducting “object separation”, from the (decoded) downmix signals using the transmitted side information which, e.g., may have been generated by a side information estimator 914 of a SAOC encoder 910, as explained above. For the purpose of restoring the original object signals by conducting object separation, the SAOC decoder 950 comprises an object separator 952, e.g. a virtual object separator.
The object separator 952 may then provide the approximated object signals ŝ1, . . . , ŝN to a renderer 954 of the SAOC decoder 950, wherein the renderer 954 then mixes the approximated object signals ŝ1, . . . , ŝN into a target scene represented by M audio output channels ŷ1, . . . , ŷM, for example, by employing a rendering matrix. The coefficients r1,1 . . . rN,M in FIG. 11 may, e.g., indicate some of the coefficients of the rendering matrix. The desired target scene may, in a special case, be the rendering of only one source signal out of the mixture (source separation scenario), but may also be any other arbitrary acoustic scene.
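By way of illustration only, the decoder-side steps described above may be sketched as follows for a single frequency band and, as a simplifying assumption, a single downmix channel (P = 1). The function names `separate_mono` and `render` are hypothetical. The separation step uses a Wiener-style weighting that assumes mutually uncorrelated objects; it stands in for the object separator and is not the normative MPEG SAOC algorithm. The indexing `r[n][m]` (gain of object n in output channel m) is an assumption derived from the coefficient names r1,1 . . . rN,M.

```python
def separate_mono(x, gains, powers):
    """Approximate N object signals from one downmix channel.

    x: downmix samples.
    gains: downmix gain of each object (one value per object).
    powers: object powers from the side information; only their
            ratios matter, so relative powers suffice.
    Assumes uncorrelated objects (Wiener-style estimate).
    """
    mix_power = sum(g * g * p for g, p in zip(gains, powers))
    if mix_power == 0.0:
        # Nothing to separate in a silent band.
        return [[0.0] * len(x) for _ in gains]
    return [
        [(g * p / mix_power) * v for v in x]
        for g, p in zip(gains, powers)
    ]


def render(estimates, r):
    """Mix N approximated objects into M output channels.

    r: N x M list of lists; r[n][m] is the assumed gain of object n
       in output channel m.
    """
    num_objects = len(estimates)
    num_outputs = len(r[0])
    num_samples = len(estimates[0])
    return [
        [
            sum(r[n][m] * estimates[n][t] for n in range(num_objects))
            for t in range(num_samples)
        ]
        for m in range(num_outputs)
    ]
```

The source separation scenario mentioned above corresponds to a rendering matrix with a single output channel that passes one object and mutes all others, for example `r = [[1.0], [0.0]]` for two objects.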
However, the processing according to the state of the art has several drawbacks:
The state-of-the-art systems are restricted to processing of audio source signals only. Signal processing in the encoder and the decoder is carried out under the assumption that no further signal processing is applied to the mixture signals or to the original source object signals. The performance of such systems decreases if this assumption no longer holds.
A prominent example that violates this assumption is the use of an audio coder in the processing chain to reduce the amount of data to be stored and/or transmitted for efficiently carrying the downmix signals. The signal compression perceptually alters the downmix signals. This has the effect that the performance of the object separator in the decoding system decreases and thus the perceived quality of the rendered target scene decreases as well [ISS5, ISS6].