Audio coding is the domain of compression that deals with exploiting redundancy and irrelevancy in audio signals.
In MPEG USAC (see, e.g., [3]), joint stereo coding of two channels is performed using complex prediction, MPS 2-1-2 or unified stereo with band-limited or full-band residual signals. MPEG surround (see, e.g., [4]) hierarchically combines One-To-Two (OTT) and Two-To-Three (TTT) boxes for joint coding of multichannel audio with or without transmission of residual signals.
In MPEG-H, Quad Channel Elements hierarchically apply MPS 2-1-2 stereo boxes followed by complex prediction/MS stereo boxes, building a fixed 4×4 remixing tree (see, e.g., [1]).
AC-4 (see, e.g., [6]) introduces new 3-, 4- and 5-channel elements that allow for remixing transmitted channels via a transmitted mix matrix and subsequent joint stereo coding information. Further, prior publications suggest using orthogonal transforms such as the Karhunen-Loeve Transform (KLT) for enhanced multichannel audio coding (see, e.g., [7]).
For example, in the 3D audio context, loudspeaker channels are distributed in several height layers, resulting in horizontal and vertical channel pairs. Joint coding of only two channels, as defined in USAC, is not sufficient to consider the spatial and perceptual relations between channels. MPEG Surround is applied in an additional pre-/postprocessing step, and residual signals are transmitted individually without the possibility of joint stereo coding, e.g., to exploit dependencies between left and right vertical residual signals. In AC-4, dedicated N-channel elements are introduced that allow for efficient encoding of joint coding parameters, but they fail for generic speaker setups with more channels, as proposed for new immersive playback scenarios (7.1+4, 22.2). The MPEG-H Quad Channel Element is also restricted to only 4 channels and cannot be dynamically applied to arbitrary channels, but only to a pre-configured and fixed number of channels.
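The idea behind a KLT-based approach can be illustrated, for the two-channel case only, by a rotation that decorrelates a channel pair. The following is a minimal sketch and not the method of [7]: the function names `klt_angle` and `rotate` are hypothetical, the signals are assumed to be zero-mean, and the rotation angle is computed from the second-order statistics of a single frame.

```python
import math

def klt_angle(ch1, ch2):
    """Rotation angle that decorrelates two channels (zero-mean assumed)."""
    e11 = sum(a * a for a in ch1)                # energy of channel 1
    e22 = sum(b * b for b in ch2)                # energy of channel 2
    e12 = sum(a * b for a, b in zip(ch1, ch2))   # cross term
    return 0.5 * math.atan2(2.0 * e12, e11 - e22)

def rotate(ch1, ch2, theta):
    """Apply the 2x2 rotation [[c, s], [-s, c]] sample by sample."""
    c, s = math.cos(theta), math.sin(theta)
    out1 = [c * a + s * b for a, b in zip(ch1, ch2)]
    out2 = [-s * a + c * b for a, b in zip(ch1, ch2)]
    return out1, out2

# Two identical channels: all energy should end up in the first output.
left = [1.0, 2.0, 3.0, 4.0]
right = [1.0, 2.0, 3.0, 4.0]
theta = klt_angle(left, right)
principal, residual = rotate(left, right, theta)

# The inverse transform is the rotation by the negated angle.
rec_left, rec_right = rotate(principal, residual, -theta)
```

For fully correlated channels, as in this example, the second output carries (numerically) no energy, which is what makes such a transform attractive for joint coding.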
The MPEG-H Multichannel Coding Tool allows the creation of an arbitrary tree of discretely coded stereo boxes, i.e. jointly coded channel pairs, see [2].
A problem that often arises in audio signal coding is caused by quantization, e.g., spectral quantization, which may result in spectral holes. For example, when the spectral lines within a particular frequency band have relatively low values before quantization, quantization on the encoder side may set all spectral values in that band to zero. On the decoder side, this may lead to undesired spectral holes.
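How such a spectral hole arises can be shown with a small numerical sketch. It assumes a plain uniform quantizer and a hypothetical band layout (`band_offsets`); real codecs use non-uniform quantization and per-band scale factors, so this only illustrates the effect.

```python
def quantize(spectrum, step):
    """Uniform quantizer: map each coefficient to the nearest integer index."""
    return [round(x / step) for x in spectrum]

def zeroed_bands(indices, band_offsets):
    """Return the bands in which every quantized value is zero."""
    holes = []
    for b in range(len(band_offsets) - 1):
        lo, hi = band_offsets[b], band_offsets[b + 1]
        if all(q == 0 for q in indices[lo:hi]):
            holes.append(b)
    return holes

# Band 0 holds strong coefficients, band 1 only weak ones.
spectrum = [4.1, -3.2, 2.7, 3.9, 0.3, -0.2, 0.4, -0.1]
band_offsets = [0, 4, 8]   # band 0 = lines 0..3, band 1 = lines 4..7

indices = quantize(spectrum, step=1.0)
holes = zeroed_bands(indices, band_offsets)
# indices -> [4, -3, 3, 4, 0, 0, 0, 0]; holes -> [1]
```

Band 1 is non-zero before quantization but is entirely zero afterwards, so a decoder without a filling tool reconstructs silence in that band.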
Modern frequency-domain speech/audio coding systems such as the Opus/CELT codec of the IETF [9], MPEG-4 (HE-)AAC [10] or, in particular, MPEG-D xHE-AAC (USAC) [11], offer means to code audio frames using either one long transform (a long block) or eight sequential short transforms (short blocks), depending on the temporal stationarity of the signal. In addition, for low-bitrate coding these schemes provide tools to reconstruct frequency coefficients of a channel using pseudorandom noise or lower-frequency coefficients of the same channel. In xHE-AAC, these tools are known as noise filling and spectral band replication, respectively.
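The principle of noise filling can be sketched as follows. This is a simplified illustration, not the xHE-AAC procedure: the function `noise_fill`, its `noise_level` parameter, and the seeded generator are assumptions made for the example. Coefficients quantized to zero are replaced by values of a fixed magnitude with a pseudorandom sign, while all other coefficients are dequantized normally.

```python
import random

def noise_fill(indices, step, noise_level, seed=0):
    """Dequantize; replace zero-quantized lines by +/- noise_level."""
    rng = random.Random(seed)   # seeded so the example is reproducible
    out = []
    for q in indices:
        if q == 0:
            sign = 1.0 if rng.random() < 0.5 else -1.0
            out.append(sign * noise_level)
        else:
            out.append(q * step)
    return out

# The last two lines were quantized to zero and are filled with noise.
filled = noise_fill([4, -3, 0, 0], step=1.0, noise_level=0.25)
```

The filled lines have the intended magnitude but no relation to the original waveform, which is why this works better for noise-like than for tonal content.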
However, for very tonal or transient stereophonic input, noise filling and/or spectral band replication alone limit the achievable coding quality at very low bitrates, mostly since too many spectral coefficients of both channels need to be transmitted explicitly.
MPEG-H Stereo Filling is a parametric tool which relies on the use of a previous frame's downmix to improve the filling of spectral holes caused by quantization in the frequency domain. Like noise filling, Stereo Filling operates directly in the MDCT domain of the MPEG-H core coder, see [1], [5], [8].
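The basic idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the normative MPEG-H procedure: the helper names `downmix` and `stereo_fill_band` are hypothetical, the downmix is taken as the plain average of the two channels, and the target band energy is assumed to be known to the decoder. A band that was quantized to zero is filled with the corresponding coefficients of the previous frame's downmix, scaled to the target energy.

```python
import math

def downmix(left, right):
    """Simple downmix of two decoded spectra (assumed: plain average)."""
    return [0.5 * (l + r) for l, r in zip(left, right)]

def stereo_fill_band(decoded, prev_dmx, lo, hi, target_energy):
    """Fill band [lo, hi) from the previous downmix if it is all zero."""
    if any(v != 0.0 for v in decoded[lo:hi]):
        return list(decoded)            # band is not empty: nothing to fill
    src = prev_dmx[lo:hi]
    src_energy = sum(v * v for v in src)
    if src_energy == 0.0:
        return list(decoded)            # no usable source material
    gain = math.sqrt(target_energy / src_energy)
    return decoded[:lo] + [gain * v for v in src] + decoded[hi:]

# Previous frame: decoded left/right spectra of the channel pair.
prev_dmx = downmix([2.0, 0.0, 1.0, 1.0], [0.0, 0.0, 3.0, -1.0])

# Current frame: lines 2..3 were quantized to zero.
decoded = [5.0, -4.0, 0.0, 0.0]
filled = stereo_fill_band(decoded, prev_dmx, lo=2, hi=4, target_energy=1.0)
```

Unlike pseudorandom noise, the filled coefficients inherit the spectral structure of the previous frame, which is the motivation for using a downmix as the filling source.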
However, the use of MPEG Surround and Stereo Filling in MPEG-H is restricted to fixed channel pair elements and therefore cannot exploit time-variant inter-channel dependencies.
The Multichannel Coding Tool (MCT) in MPEG-H allows adapting to varying inter-channel dependencies but, due to the use of single channel elements in typical operating configurations, does not allow Stereo Filling. The conventional technology does not disclose perceptually optimal ways to generate a previous frame's downmix in the case of time-variant, arbitrary jointly coded channel pairs. Using noise filling as a substitute for Stereo Filling in combination with the MCT to fill spectral holes would lead to noise artifacts, especially for tonal signals.