Audio signal processing is becoming increasingly important. Challenges arise because modern perceptual audio codecs are expected to deliver satisfactory audio quality at increasingly low bit rates. In addition, the permissible latency is often very low, e.g. for bi-directional communication applications or distributed gaming.
Modern audio codecs, such as USAC (Unified Speech and Audio Coding), often switch between time-domain predictive coding and transform-domain coding; nevertheless, music content is still predominantly coded in the transform domain. At low bit rates, e.g. <14 kbit/s, tonal components in music items are often audibly degraded when coded by transform coders, which makes the task of coding audio at sufficient quality even more challenging.
Additionally, low-delay constraints generally lead to a sub-optimal frequency response of the transform coder's filter bank (due to low-delay optimized window shape and/or transform length) and therefore further compromise the perceptual quality of such codecs.
The classic psychoacoustic model defines prerequisites for transparency with respect to quantization noise. At high bit rates, this relates to a perceptually adapted optimal time/frequency distribution of quantization noise that obeys the human auditory masking levels. At low bit rates, however, transparency cannot be reached. Therefore, a strategy that relaxes the masking level requirements may be employed at low bit rates.
State-of-the-art codecs already exist for music content, in particular transform coders based on the Modified Discrete Cosine Transform (MDCT), which quantize and transmit spectral coefficients in the frequency domain. However, at very low data rates, only very few spectral lines of each time frame can be coded with the bits available for that frame. As a consequence, temporal modulation artifacts and so-called warbling artifacts are inevitably introduced into the coded signal.
Most prominently, these types of artifacts are perceived in quasi-stationary tonal components. This happens especially if, due to delay constraints, a transform window shape has to be chosen that induces significant crosstalk between adjacent spectral coefficients (spectral broadening) due to the well-known leakage effect. Nevertheless, usually only one or a few of these adjacent spectral coefficients remain non-zero after the coarse quantization by the low-bit-rate coder.
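The leakage effect and the subsequent loss of coefficients can be illustrated with a minimal sketch. It is not taken from any of the cited codecs: the sine window, the frame size M = 64, the tone position between bins 10 and 11, and the two quantizer step sizes are all assumptions chosen purely for illustration. A stationary tone spreads over several MDCT coefficients, and a coarse uniform quantizer keeps only the strongest one or two of them.

```python
import math

def sine_window(length):
    # Sine window (an assumption; low-delay codecs may use other shapes).
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]

def mdct(frame, window):
    # Direct O(M^2) MDCT of a frame of 2*M samples, returning M coefficients.
    m = len(frame) // 2
    n0 = 0.5 + m / 2.0
    coeffs = []
    for k in range(m):
        acc = 0.0
        for n in range(2 * m):
            acc += window[n] * frame[n] * math.cos(
                math.pi / m * (n + n0) * (k + 0.5))
        coeffs.append(acc)
    return coeffs

def quantize(coeffs, step):
    # Uniform scalar quantizer: index = round(value / step).
    return [round(c / step) for c in coeffs]

def count_nonzero(values):
    return sum(1 for v in values if v != 0)

M = 64            # hypothetical number of MDCT coefficients per frame
TONE_BIN = 10.3   # hypothetical tone lying between two bin centres

omega = math.pi / M * (TONE_BIN + 0.5)
frame = [math.cos(omega * n) for n in range(2 * M)]

coeffs = mdct(frame, sine_window(2 * M))
peak = max(abs(c) for c in coeffs)

# Fine step: leakage into neighbouring coefficients survives quantization.
fine = quantize(coeffs, 0.01 * peak)
# Coarse step (stand-in for a very low bit rate): only the peak region survives.
coarse = quantize(coeffs, 0.8 * peak)

print("non-zero coefficients, fine step:  ", count_nonzero(fine))
print("non-zero coefficients, coarse step:", count_nonzero(coarse))
```

Because the MDCT coefficients of a stationary tone change from frame to frame with the tone's phase, the few surviving coefficients fluctuate over time, which is consistent with the temporal modulation described above.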
As stated above, in conventional technology, according to one approach, transform coders are employed. Contemporary high-compression-ratio audio codecs that are well-suited for coding of music content all rely on transform coding. The most prominent examples are MPEG-2/4 Advanced Audio Coding (AAC) and MPEG-D Unified Speech and Audio Coding (USAC). USAC has a switched core consisting of an Algebraic Code Excited Linear Prediction (ACELP) module plus a Transform Coded Excitation (TCX) module (see [5]) intended mainly for speech coding and, alternatively, AAC mainly intended for coding of music. Like AAC, TCX is also a transform-based coding method. At low bit rate settings, these coding schemes are prone to exhibit warbling artifacts, especially if the underlying coding schemes are based on the Modified Discrete Cosine Transform (MDCT) (see [1]).
For music reproduction, transform coders are the preferred technique for audio data compression. However, at low bit rates, traditional transform coders exhibit strong warbling and roughness artifacts. Most of the artifacts originate from tonal spectral components that are coded too sparsely. This happens especially if these components are spectrally smeared by a suboptimal spectral transfer function (leakage effect) that is mainly designed to meet strict delay constraints.
According to another approach in conventional technology, the coding schemes are fully parametric for transients, sinusoids and noise. In particular, for medium and low bit rates, fully parametric audio codecs have been standardized, the most prominent of which are MPEG-4 Part 3, Subpart 7 Harmonic and Individual Lines plus Noise (HILN) (see [2]) and MPEG-4 Part 3, Subpart 8 SinuSoidal Coding (SSC) (see [3]). Parametric coders, however, suffer from an unpleasantly artificial sound and, with increasing bit rate, do not scale well towards perceptual transparency.
A further approach provides hybrid waveform and parametric coding. In [4], a hybrid of transform-based waveform coding and MPEG-4 SSC (sinusoidal part) is proposed. In an iterative process, sinusoids are extracted and subtracted from the signal to form a residual signal to be coded by transform coding techniques. The extracted sinusoids are coded by a set of parameters and transmitted alongside the residual. In [6], a hybrid coding approach is provided that codes sinusoids and residual separately. In [7], on the so-called Constrained Energy Lapped Transform (CELT) codec/Ghost webpage, the idea of utilizing a bank of oscillators for hybrid coding is described.
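The iterative extract-and-subtract idea can be sketched as follows. This is a generic greedy least-squares procedure in the spirit of matching pursuit, not the specific method of [4]; the function names, the restriction of candidate frequencies to a DFT grid, and the test signal are assumptions made for illustration only.

```python
import math

def fit_sinusoid(signal, omega):
    # Least-squares fit of a*cos(omega*n) + b*sin(omega*n) to the signal.
    # Returns (a, b, energy removed from the signal by subtracting the fit).
    cc = ss = cs = xc = xs = 0.0
    for n, x in enumerate(signal):
        c, s = math.cos(omega * n), math.sin(omega * n)
        cc += c * c
        ss += s * s
        cs += c * s
        xc += x * c
        xs += x * s
    det = cc * ss - cs * cs
    a = (xc * ss - xs * cs) / det
    b = (xs * cc - xc * cs) / det
    return a, b, a * xc + b * xs

def extract_sinusoids(signal, count):
    # Greedily extract `count` sinusoids; return their parameters and the
    # residual that a transform coder would then have to code.
    n_samples = len(signal)
    residual = list(signal)
    params = []
    for _ in range(count):
        best = None
        # Candidate frequencies on a DFT grid, excluding DC and Nyquist.
        for k in range(1, n_samples // 2):
            omega = 2.0 * math.pi * k / n_samples
            a, b, gain = fit_sinusoid(residual, omega)
            if best is None or gain > best[3]:
                best = (omega, a, b, gain)
        omega, a, b, _ = best
        residual = [x - a * math.cos(omega * n) - b * math.sin(omega * n)
                    for n, x in enumerate(residual)]
        # Parametric description: frequency, amplitude, phase of A*cos(w*n + p).
        params.append((omega, math.hypot(a, b), math.atan2(-b, a)))
    return params, residual

def energy(signal):
    return sum(x * x for x in signal)

N = 128  # hypothetical frame length

def grid(k):
    return 2.0 * math.pi * k / N

# Hypothetical test signal: two strong tones plus one weak component.
signal = [1.0 * math.cos(grid(5) * n + 0.3)
          + 0.5 * math.cos(grid(12) * n - 1.0)
          + 0.05 * math.cos(grid(20) * n)
          for n in range(N)]

params, residual = extract_sinusoids(signal, 2)
for omega, amplitude, phase in params:
    print("bin %d, amplitude %.3f, phase %.3f"
          % (round(omega * N / (2.0 * math.pi)), amplitude, phase))
print("residual energy: %.4f" % energy(residual))
```

In this sketch the two strongest tones end up in the parameter list and only the weak component remains in the residual. How the signal is divided between the two parts in a real hybrid codec is a design decision that the sketch does not address.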
At medium or higher bit rates, transform coders are well-suited for coding of music due to their natural sound. There, the transparency requirements of the underlying psychoacoustic model are fully or almost fully met. However, at low bit rates, coders have to seriously violate the requirements of the psychoacoustic model and in such a situation transform coders are prone to warbling, roughness, and musical noise artifacts.
Although fully parametric audio codecs are best suited for lower bit rates, they are known to sound unpleasantly artificial. Moreover, these codecs do not scale seamlessly to perceptual transparency, since a gradual refinement of the rather coarse parametric model is not feasible.
Hybrid waveform and parametric coding could potentially overcome the limits of the individual approaches and benefit from the mutually orthogonal properties of both techniques. However, in the current state of the art, it is hampered by a lack of interplay between the transform coding part and the parametric part of the hybrid codec. Problems relate to the division of the signal between the parametric and transform codec parts, bit budget steering between the transform and parametric parts, parameter signalling techniques, and seamless merging of the parametric and transform codec outputs.