Transform-based coding is the most commonly used scheme in today's audio compression and transmission systems. The major steps in such a scheme are to first convert a short block of the signal waveform into the frequency domain by a suitable transform, e.g., DFT (Discrete Fourier Transform), DCT (Discrete Cosine Transform), or MDCT (Modified Discrete Cosine Transform). The transform coefficients are then quantized, transmitted or stored, and later used to reconstruct the audio signal. This approach works well for general audio signals, but requires a bitrate high enough to represent the transform coefficients with sufficient accuracy. Below, a high-level overview of such transform-domain coding schemes is given.
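As an illustration of the transform, quantize, and reconstruct steps, the following is a minimal sketch using the MDCT with a sine window and 50% block overlap. The function names (sine_window, mdct, imdct, encode, decode), the uniform scalar quantizer, and the zero padding at the signal edges are assumptions made for this example and are not taken from any particular codec. The sketch also assumes that the signal length is a multiple of the number of coefficients per block.

```python
import math


def sine_window(frame_len):
    """Sine window; satisfies w[n]^2 + w[n + N]^2 = 1 for N = frame_len / 2."""
    return [math.sin(math.pi * (n + 0.5) / frame_len) for n in range(frame_len)]


def mdct(frame):
    """Map 2N (windowed) time samples to N MDCT coefficients."""
    n_coef = len(frame) // 2
    shift = 0.5 + n_coef / 2
    return [
        sum(x * math.cos(math.pi / n_coef * (n + shift) * (k + 0.5))
            for n, x in enumerate(frame))
        for k in range(n_coef)
    ]


def imdct(coefs):
    """Map N coefficients back to 2N time-aliased samples (scale 2/N)."""
    n_coef = len(coefs)
    shift = 0.5 + n_coef / 2
    return [
        2.0 / n_coef * sum(c * math.cos(math.pi / n_coef * (n + shift) * (k + 0.5))
                           for k, c in enumerate(coefs))
        for n in range(2 * n_coef)
    ]


def encode(signal, n_coef, step=None):
    """Split into 50%-overlapping blocks, transform, optionally quantize."""
    window = sine_window(2 * n_coef)
    # Assumption: pad with N zeros on each side so every sample is
    # covered by two overlapping blocks.
    padded = [0.0] * n_coef + list(signal) + [0.0] * n_coef
    blocks = []
    for start in range(0, len(padded) - 2 * n_coef + 1, n_coef):
        frame = [w * x for w, x in zip(window, padded[start:start + 2 * n_coef])]
        coefs = mdct(frame)
        if step is not None:
            # Uniform scalar quantizer (illustrative choice only).
            coefs = [step * round(c / step) for c in coefs]
        blocks.append(coefs)
    return blocks


def decode(blocks, n_coef):
    """Inverse transform, window again, and overlap-add the blocks."""
    window = sine_window(2 * n_coef)
    out = [0.0] * ((len(blocks) + 1) * n_coef)
    for i, coefs in enumerate(blocks):
        for j, y in enumerate(imdct(coefs)):
            out[i * n_coef + j] += window[j] * y
    return out[n_coef:-n_coef]  # drop the zero padding added by encode()
```

Without quantization (step=None), the overlap-add cancels the time-domain aliasing of the individual blocks and the input is recovered up to rounding error. With a quantizer step, the reconstruction error grows with the step size, which reflects the bitrate dependence described above.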
On a block-by-block basis, the waveform to be encoded is transformed to the frequency domain. One transform commonly used for this purpose is the so-called Modified Discrete Cosine Transform (MDCT). The frequency domain transform vector thus obtained is split into a spectrum envelope (slowly varying energy) and a spectrum residual. The spectrum residual is obtained by normalizing the frequency domain vector with said spectrum envelope. The spectrum envelope is quantized, and the quantization indices are transmitted to the decoder. Next, the quantized spectrum envelope is used as input to a bit distribution algorithm, and bits for encoding of the residual vectors are distributed based on the characteristics of the spectrum envelope. As an outcome of this step, a certain number of bits is assigned to each of the different parts of the residual (residual vectors or “sub-vectors”). Some residual vectors do not receive any bits and have to be noise-filled or bandwidth-extended. Typically, the coding of residual vectors is a two-step procedure: first, the amplitudes of the vector elements are coded, and next, the signs of the non-zero elements are encoded (sign should not be confused with “phase”, which is associated with, e.g., Fourier transforms). Quantization indices for the residual's amplitude and sign are transmitted to the decoder, where the residual and the spectrum envelope are combined and finally transformed back to the time domain.
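The envelope and residual split, the envelope-driven bit distribution, and the separate coding of amplitude and sign can be sketched as follows. This is only an illustration under stated assumptions: the envelope is taken as the per-band RMS value, the bit distribution is a simple greedy rule in which each assigned bit lowers a band's priority by one (in log2 units), and all function names are hypothetical. Envelope quantization, residual quantization, and noise filling are left out.

```python
import math


def split_envelope_residual(coefs, band_size):
    """Return the per-band RMS envelope and the envelope-normalized residual."""
    envelope, residual = [], []
    for start in range(0, len(coefs), band_size):
        band = coefs[start:start + band_size]
        rms = math.sqrt(sum(c * c for c in band) / len(band))
        envelope.append(rms)
        residual.append([c / rms if rms > 0 else 0.0 for c in band])
    return envelope, residual


def combine(envelope, residual):
    """Decoder side: scale each residual vector by its envelope value."""
    return [e * r for e, band in zip(envelope, residual) for r in band]


def allocate_bits(envelope, total_bits):
    """Greedy bit distribution (assumed rule): the band with the highest
    log2 envelope gets the next bit; each bit lowers its priority by 1."""
    priority = [math.log2(e) if e > 0 else -math.inf for e in envelope]
    bits = [0] * len(envelope)
    for _ in range(total_bits):
        best = max(range(len(bits)), key=lambda b: priority[b])
        if priority[best] == -math.inf:
            break  # only silent bands remain
        bits[best] += 1
        priority[best] -= 1.0
    return bits


def split_amplitude_sign(quantized):
    """Step 1: amplitudes of all elements. Step 2: one sign bit per
    non-zero element (1 means negative)."""
    amplitudes = [abs(q) for q in quantized]
    signs = [1 if q < 0 else 0 for q in quantized if q != 0]
    return amplitudes, signs


def merge_amplitude_sign(amplitudes, signs):
    """Decoder side: re-attach the sign bits to the non-zero amplitudes."""
    sign_bits = iter(signs)
    return [0 if a == 0 else (-a if next(sign_bits) else a) for a in amplitudes]
```

With a limited bit budget, low-energy bands may end up with zero bits under this rule, which corresponds to the residual vectors that have to be noise-filled or bandwidth-extended. Sending sign bits only for non-zero elements avoids spending bits on signs that carry no information.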
The capacity in telecommunication networks is continuously increasing. However, despite the increased capacity, there is still a strong drive to limit the required bandwidth per communication channel. In mobile networks, smaller transmission bandwidths for each call yield lower power consumption in both the mobile device and the base station serving the device. This translates to energy and cost savings for the mobile operator, while the end user will experience prolonged battery life and increased talk-time. Further, the less bandwidth that is consumed per user, the more users can be served (in parallel) by the mobile network.
One way of improving the quality of an audio signal, which is to be conveyed using a low or moderate bitrate, is to focus the available bits on accurately representing the lower frequencies in the audio signal. Then, bandwidth extension (BWE) techniques may be used to model the higher frequencies based on the lower frequencies, which requires only a small number of bits. The background for these techniques is that the sensitivity of the human auditory system is frequency dependent. In particular, the human auditory system, i.e. our hearing, is less accurate for higher frequencies.
In a typical frequency-domain BWE scheme, high-frequency transform coefficients are grouped in bands. A gain (energy) for each band is calculated, quantized, and transmitted to a decoder of the signal. At the decoder, a flipped or translated, energy-normalized version of the received low-frequency coefficients is scaled with the high-frequency gains. In this way the BWE is not completely “blind,” since at least the spectral energy resembles that of the high-frequency bands of the target signal.
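The band-gain scheme described above can be sketched as follows. The sketch assumes that the gain of a band is its RMS value, that gain quantization is omitted, and that the low-frequency coefficients are reused cyclically if there are fewer of them than high-frequency coefficients. The function names and the flip parameter are hypothetical and chosen for this example only.

```python
import math


def rms(band):
    """Root-mean-square value of a band of coefficients."""
    return math.sqrt(sum(c * c for c in band) / len(band))


def encode_high_band_gains(high_coefs, band_size):
    """Encoder side: one gain (RMS energy) per high-frequency band.
    Quantization of the gains is omitted in this sketch."""
    return [rms(high_coefs[start:start + band_size])
            for start in range(0, len(high_coefs), band_size)]


def decode_high_band(low_coefs, gains, band_size, flip=False):
    """Decoder side: rebuild the high band from the low band.

    The low-frequency coefficients are either translated (copied as is)
    or flipped (mirrored), normalized to unit energy per band, and then
    scaled with the received high-frequency gains.
    """
    source = list(low_coefs)[::-1] if flip else list(low_coefs)
    high = []
    for b, gain in enumerate(gains):
        # Assumption: wrap around if the low band is shorter than the high band.
        band = [source[(b * band_size + i) % len(source)]
                for i in range(band_size)]
        energy = rms(band)
        high.extend(gain * c / energy if energy > 0 else 0.0 for c in band)
    return high
```

Each reconstructed band has the energy of the corresponding target band, while its fine structure is borrowed from the low frequencies. Only the band energies are guaranteed to resemble those of the target signal.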
However, for certain audio signals, BWE may produce output containing defects that are annoying to a listener.