State-of-the-art audio coding uses time-frequency decomposition to represent the signal in a meaningful way for data reduction. Specifically, audio coders use transforms to perform a mapping of the time-domain samples into frequency-domain coefficients. Discrete-time transforms used for this time-to-frequency mapping are typically based on kernels of sinusoidal functions, such as the Discrete Fourier Transform (DFT) and the Discrete Cosine Transform (DCT). It can be shown that such transforms achieve “energy compaction” of the audio signal. This means that, in the transform (or frequency) domain, the energy distribution is localized on fewer significant coefficients than in the time-domain samples. Coding gains can then be achieved by applying adaptive bit allocation and suitable quantization to the frequency-domain coefficients. At the receiver, the bits representing the quantized and encoded parameters (for example, the frequency-domain coefficients) are used to recover the quantized frequency-domain coefficients (or other quantized data such as gains), and the inverse transform generates the time-domain audio signal. Such coding schemes are generally referred to as transform coding.
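The energy compaction property described above can be illustrated with a small numerical sketch. It is only an illustration under stated assumptions: the test signal (two sinusoids), the 99% energy threshold, and the helper names `dct2_orthonormal` and `count_for_energy` are hypothetical choices and do not come from any particular coder.

```python
import numpy as np

def dct2_orthonormal(x):
    """Orthonormal DCT-II computed from its definition (O(N^2), for illustration)."""
    N = len(x)
    n = np.arange(N)
    k = n[:, None]
    basis = np.cos(np.pi / N * (n + 0.5) * k)  # basis[k, n]
    scale = np.full(N, np.sqrt(2.0 / N))
    scale[0] = np.sqrt(1.0 / N)
    return scale * (basis @ x)

def count_for_energy(values, fraction=0.99):
    """Smallest number of entries (largest first) holding `fraction` of the energy."""
    energy = np.sort(np.asarray(values) ** 2)[::-1]
    cumulative = np.cumsum(energy)
    return int(np.searchsorted(cumulative, fraction * cumulative[-1])) + 1

# Assumed test signal: two sinusoids that do not fall exactly on transform bins.
n = np.arange(256)
signal = np.sin(2 * np.pi * 5.3 * n / 256) + 0.5 * np.sin(2 * np.pi * 20.7 * n / 256)
coeffs = dct2_orthonormal(signal)

time_count = count_for_energy(signal)  # time-domain samples needed
freq_count = count_for_energy(coeffs)  # DCT coefficients needed
print(time_count, freq_count)
```

Because the transform is orthonormal, the total energy is the same in both domains; only its distribution changes, which is what adaptive bit allocation exploits.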
By definition, transform coding operates on consecutive blocks of samples of the input audio signal. Since quantization introduces some distortion in each synthesized block of audio signal, using non-overlapping blocks may introduce discontinuities at the block boundaries, which may degrade the audio signal quality. Hence, in transform coding, to avoid discontinuities, the encoded blocks of audio signal are overlapped prior to applying the discrete transform, and appropriately windowed in the overlapping segment to allow a smooth transition from one decoded block to the next. Unfortunately, applying a “standard” transform such as the DFT (or its fast equivalent, the FFT) or the DCT to overlapped blocks results in what is called “non-critical sampling”. For example, with a typical 50% overlap, encoding a block of N consecutive time-domain samples actually requires taking a transform on 2N consecutive samples (N samples from the present block and N samples from the overlapping part of the next block). Hence, for every block of N time-domain samples, 2N frequency-domain coefficients are encoded. Critical sampling in the frequency domain, by contrast, implies that N input time-domain samples produce only N frequency-domain coefficients to be quantized and coded.
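The 2N-for-N overhead of a standard transform with 50% overlap can be checked by counting. The sketch below is a rough illustration only: the function name `overlapped_dft_value_count` and the block size are assumptions, and windowing is omitted because it does not change the number of values produced.

```python
import numpy as np

def overlapped_dft_value_count(x, N):
    """Count real values produced by DFTs of 2N-sample blocks taken every N samples."""
    total = 0
    for start in range(0, len(x) - 2 * N + 1, N):
        spectrum = np.fft.rfft(x[start:start + 2 * N])
        # rfft returns N + 1 complex bins; the DC and Nyquist bins are purely
        # real for real input, leaving 2 * (N + 1) - 2 = 2N independent reals.
        total += 2 * len(spectrum) - 2
    return total

N = 64  # assumed block advance
x = np.random.default_rng(0).standard_normal(8 * N)
num_blocks = len(x) // N - 1  # 2N-sample blocks with a hop of N
values = overlapped_dft_value_count(x, N)
values_per_new_sample = values / (num_blocks * N)
print(values_per_new_sample)
```

A critically sampled scheme would bring this ratio down to one value per new input sample.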
Specialized transforms have been designed to allow the use of overlapping windows and still maintain critical sampling in the transform domain: 2N time-domain samples at the input of the transform result in N frequency-domain coefficients at the output of the transform. To achieve this, the block of 2N time-domain samples is first reduced to a block of N time-domain samples through special time inversion and summation of specific parts of the 2N-sample long windowed signal. This special time inversion and summation introduces what is called “time-domain aliasing” or TDA. Once this aliasing is introduced in the block of signal, it cannot be removed using only that block. It is this time-domain aliased signal that is the input of a transform of size N (and not 2N), producing the N frequency-domain coefficients of the transform. To recover N time-domain samples, the inverse transform actually has to use the transform coefficients from two consecutive and overlapping frames to cancel out the TDA, in a process called time-domain aliasing cancellation, or TDAC.
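A minimal sketch of TDAC follows, using the MDCT as the critically sampled transform. Several choices here are assumptions made for illustration: the sine window (one of several windows that satisfy the overlap condition), the 2/N scaling of the inverse transform, the direct matrix evaluation instead of a fast algorithm, and all function names.

```python
import numpy as np

def mdct_basis(N):
    """N x 2N cosine basis of the MDCT."""
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))

def mdct(block):
    """2N windowed samples -> N coefficients."""
    return mdct_basis(len(block) // 2) @ block

def imdct(coeffs):
    """N coefficients -> 2N time-aliased samples (2/N scaling is an assumed convention)."""
    N = len(coeffs)
    return (2.0 / N) * (mdct_basis(N).T @ coeffs)

def sine_window(N):
    """2N-sample sine window; satisfies w[n]^2 + w[n + N]^2 = 1."""
    return np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))

def tdac_roundtrip(x, N):
    """Analyse 50%-overlapped blocks, then overlap-add the inverse transforms."""
    w = sine_window(N)
    y = np.zeros(len(x))
    for start in range(0, len(x) - 2 * N + 1, N):
        coeffs = mdct(w * x[start:start + 2 * N])   # N coefficients per N new samples
        y[start:start + 2 * N] += w * imdct(coeffs)  # aliasing cancels in the overlap
    return y

N = 16
x = np.random.default_rng(0).standard_normal(8 * N)
y = tdac_roundtrip(x, N)
```

In this sketch, every sample covered by two overlapping blocks is recovered exactly, while the first and last N samples, which are covered by only one block, remain aliased.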
An example of such a transform applying TDAC, which is widely used in audio coding, is the Modified Discrete Cosine Transform (or MDCT). Actually, the MDCT performs the above-mentioned TDA without explicit folding in the time domain. Rather, time-domain aliasing is introduced when considering both the direct MDCT and the inverse MDCT (IMDCT) of a single block. This comes from the mathematical construction of the MDCT and is well known to those of ordinary skill in the art. But it is also known that this implicit time-domain aliasing can be seen as equivalent to first inverting parts of the time-domain samples and adding (or subtracting) these inverted parts to other parts of the signal. This is known as “folding”.
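The equivalence between the implicit aliasing of the MDCT and explicit folding can be checked numerically. The sketch below assumes the block is split into four quarters a, b, c, d, uses a DCT-IV of size N on the folded signal, and adopts a 2/N scaling for the inverse transform; the helper names are hypothetical.

```python
import numpy as np

def mdct_basis(N):
    """N x 2N cosine basis of the MDCT."""
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))

def fold(block):
    """Reduce 2N samples (quarters a, b, c, d) to N samples by time inversion and summation."""
    a, b, c, d = np.split(block, 4)  # requires N to be even
    return np.concatenate([-c[::-1] - d, a - b[::-1]])

def dct4(u):
    """Size-N DCT-IV computed from its definition."""
    idx = np.arange(len(u)) + 0.5
    return np.cos(np.pi / len(u) * np.outer(idx, idx)) @ u

def aliased_form(block):
    """Expected IMDCT(MDCT(block)) under the assumed 2/N scaling."""
    a, b, c, d = np.split(block, 4)
    return np.concatenate([a - b[::-1], b - a[::-1], c + d[::-1], d + c[::-1]])

N = 8
block = np.random.default_rng(1).standard_normal(2 * N)
basis = mdct_basis(N)

direct = basis @ block                                 # MDCT without explicit folding
via_folding = dct4(fold(block))                        # folding followed by a size-N transform
single_block_output = (2.0 / N) * (basis.T @ direct)   # IMDCT of that single block
```

Under these assumptions, `direct` matches `via_folding`, and `single_block_output` matches `aliased_form(block)`: each half of the decoded block is the original half plus or minus its own time-reversed copy.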
A problem arises when an audio coder switches between two coding models, one using TDAC and the other not. Suppose for example that a codec switches from a TDAC coding model to a non-TDAC coding model. The side of the block of samples encoded using the TDAC coding model, and which is common to the block encoded without using TDAC, contains aliasing which cannot be cancelled out using the block of samples encoded using the non-TDAC coding model.
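The uncancelled aliasing at such a switch can be made concrete. The following sketch decodes a single windowed MDCT block, as would be the case for the last TDAC block before a non-TDAC block, and inspects its second half. The sine window, the 2/N scaling and the function names are assumptions made for illustration.

```python
import numpy as np

def mdct_basis(N):
    """N x 2N cosine basis of the MDCT."""
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))

def sine_window(N):
    """2N-sample sine window (assumed analysis and synthesis window)."""
    return np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))

def last_tdac_block_tail(x_block, N):
    """Second half of one decoded block when no following TDAC block is available."""
    w = sine_window(N)
    basis = mdct_basis(N)
    coeffs = basis @ (w * x_block)
    decoded = w * (2.0 / N) * (basis.T @ coeffs)
    return decoded[N:]

N = 16
x_block = np.random.default_rng(2).standard_normal(2 * N)
tail = last_tdac_block_tail(x_block, N)

# Closed form: windowed signal plus its time-reversed copy, windowed again.
w2 = sine_window(N)[N:]
x2 = x_block[N:]
expected_tail = w2 * (w2 * x2 + (w2 * x2)[::-1])
residual = tail - x2  # what the next coding model would have to correct or replace
```

The time-reversed term in `expected_tail` is the aliasing that a following TDAC block would normally cancel; a non-TDAC block contributes nothing that cancels it.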
A first solution is to discard the samples which contain aliasing that cannot be cancelled out.
This solution results in an inefficient use of transmission bandwidth because the block of samples for which TDA cannot be cancelled out is encoded twice, once by the TDAC-based codec and a second time by the non-TDAC based codec.
A second solution is to use specially designed windows which do not introduce TDA in at least one part of the window when the time inversion and summation process is applied. FIG. 1 is a diagram of an exemplary window introducing TDA on its left side but not on its right side. More specifically, in FIG. 1, a 2N-sample window 100 introduces TDA 110 on its left side. The window 100 of FIG. 1 is useful for transitions from a TDAC-based codec to a non-TDAC based codec. The first half of this window is shaped so that it introduces TDA 110, which can be cancelled if the previous window also uses TDA with overlapping. However, the right side of the window in FIG. 1 has a zero-valued sample 120 after the folding point at position 3N/2. This part of the window 100 therefore does not introduce any TDA when the time-inversion and summation (or folding) process is performed around the folding point at position 3N/2.
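The effect of zeroing the window after the folding point at 3N/2 can be sketched as follows. The window built here is only a hypothetical stand-in for the window 100 of FIG. 1: it uses a sine taper over the whole left half, a flat section from N to 3N/2, and zeros afterwards. The exact shape in FIG. 1, the 2/N scaling and the function names are not taken from the text.

```python
import numpy as np

def mdct_basis(N):
    """N x 2N cosine basis of the MDCT."""
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))

def transition_window(N):
    """Assumed window: sine taper on the left half, flat up to 3N/2, zeros after."""
    w = np.zeros(2 * N)
    w[:N] = np.sin(np.pi * (np.arange(N) + 0.5) / (2 * N))  # side that introduces TDA
    w[N:3 * N // 2] = 1.0                                     # flat, before the folding point
    return w                                                  # samples after 3N/2 stay zero

N = 16  # must be even so that 3N/2 is an integer
x = np.random.default_rng(3).standard_normal(2 * N)
w = transition_window(N)
basis = mdct_basis(N)

coeffs = basis @ (w * x)                    # N coefficients
decoded = (2.0 / N) * (basis.T @ coeffs)    # no synthesis window applied here

alias_free = decoded[N:3 * N // 2]   # samples between N and 3N/2
mirror = decoded[3 * N // 2:]        # time-reversed copy, to be discarded
left_half = decoded[:N]              # still aliased, needs the previous block
```

In this sketch, the samples between N and 3N/2 come out of the inverse transform without any aliasing, at the cost of transmitting a window whose last N/2 samples carry no signal.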
Further, the left side of the window 100 contains a flat region 130 preceded by a tapered region 140. The purpose of the tapered region 140 is to provide a good spectral resolution when the transform is computed and to smooth the transition during overlap-and-add operations between adjacent blocks. Increasing the duration of the flat region 130 of the window reduces the information bandwidth and decreases the spectral performance of the window because a part of the window is sent without any information.
In the multi-mode Moving Picture Experts Group (MPEG) Unified Speech and Audio Coding (USAC) audio codec, several special windows such as the one described in FIG. 1 are used to manage the different transitions from frames using rectangular, non-overlapping windows to frames using non-rectangular, overlapping windows. These special windows were designed to achieve different compromises between spectral resolution, data overhead reduction and smoothness of transition between these different frame types.