State-of-the-art audio coding uses time-frequency decomposition to represent the signal in a meaningful way for data reduction. More specifically, audio coders use transforms to perform a mapping of the time-domain samples into frequency-domain coefficients. Discrete-time transforms used for this time-to-frequency mapping are typically based on kernels of sinusoidal functions, such as the Discrete Fourier Transform (DFT) and the Discrete Cosine Transform (DCT). It can be shown that such transforms achieve energy compaction of the audio signal. Energy compaction means that, in the transform (or frequency) domain, the energy distribution is localized on fewer significant frequency-domain coefficients than in the time-domain samples. Coding gains can then be achieved by applying adaptive bit allocation and suitable quantization to the frequency-domain coefficients. At the receiver, the bits representing the quantized and coded parameters (including the frequency-domain coefficients) are used to recover the quantized frequency-domain coefficients (or other quantized data such as gains), and the inverse transform generates the time-domain audio signal. Such coding schemes are generally referred to as transform coding.
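By way of illustration only, the energy compaction property described above can be observed numerically. The following minimal sketch (assuming Python with NumPy; all identifiers are the present illustration's own, not part of any codec) applies a DFT to a tonal test signal and counts how few frequency-domain coefficients carry almost all of the energy:

```python
import numpy as np

# A toy "audio" block: a sum of two sinusoids, as is typical of tonal audio.
N = 256
n = np.arange(N)
x = np.sin(2 * np.pi * 10 * n / N) + 0.5 * np.sin(2 * np.pi * 40 * n / N)

# Frequency-domain representation via the DFT (real-input FFT).
X = np.fft.rfft(x)

# Energy compaction: sort coefficient energies in decreasing order and count
# how many are needed to reach 99% of the total energy.
energies = np.sort(np.abs(X) ** 2)[::-1]
cumulative = np.cumsum(energies) / np.sum(energies)
k = int(np.searchsorted(cumulative, 0.99)) + 1
print(f"{k} of {len(X)} coefficients carry 99% of the energy")
```

For this tonal signal, a handful of coefficients dominate, whereas the energy of the time-domain samples is spread over the whole block; this concentration is what adaptive bit allocation exploits.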
By definition, transform coding operates on consecutive blocks (usually called “frames”) of samples of the input audio signal. Since quantization introduces some distortion in each synthesized block of audio signal, using non-overlapping blocks may introduce discontinuities at the block boundaries, which may degrade the audio signal quality. Hence, in transform coding, to avoid such discontinuities, the coded blocks of audio signal are overlapped prior to applying the transform, and appropriately windowed in the overlapping segments to allow a smooth transition from one decoded block of samples to the next. Using a transform such as the DFT (or its fast equivalent, the Fast Fourier Transform (FFT)) or the DCT and applying it to overlapped blocks of samples unfortunately results in what is called “non-critical sampling”. For example, under a typical 50% overlap condition, coding a block of N consecutive time-domain samples actually requires taking a transform on 2N consecutive samples, including N samples from the present block and N samples from the overlapping parts of the preceding and following blocks. Hence, for every block of N time-domain samples, 2N frequency-domain coefficients are coded. Critical sampling in the frequency domain, in contrast, implies that N input time-domain samples produce only N frequency-domain coefficients to be quantized and coded.
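The counting argument behind non-critical sampling can be made concrete. The short sketch below (illustrative only; assuming Python with NumPy, with hypothetical variable names) advances a 2N-sample DFT analysis block by N samples at a time, showing that each hop of N new input samples yields 2N transform coefficients:

```python
import numpy as np

N = 8
signal = np.arange(4 * N, dtype=float)   # arbitrary input samples

# 50% overlap: each analysis block spans 2N samples but advances by only N.
hop, block = N, 2 * N
starts = range(0, len(signal) - block + 1, hop)
coeffs_per_block = [len(np.fft.fft(signal[s:s + block])) for s in starts]

# Every hop consumes N new input samples but produces 2N DFT coefficients:
# twice as many values to quantize and code as samples consumed.
print(coeffs_per_block)
```

A critically sampled scheme would instead produce exactly N coefficients per hop of N samples, which is what the specialized transforms described next achieve.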
Specialized transforms have been designed to allow the use of overlapping windows while still maintaining critical sampling in the transform domain. With such specialized transforms, the 2N time-domain samples at the input of the transform result in only N frequency-domain coefficients at the output of the transform. To achieve this, the block of 2N time-domain samples is first reduced to a block of N time-domain samples through a special time inversion, a summation of specific parts of the 2N-sample-long windowed signal at one end of the window, and a subtraction of specific parts of the 2N-sample-long windowed signal from each other at the other end of the window. This special time inversion, summation and subtraction introduces what is called “time-domain aliasing” (TDA). Once TDA is introduced in a block of samples of the audio signal, it cannot be removed using only that block. It is this time-domain aliased signal that forms the input of a transform of size N (and not 2N), producing the N frequency-domain coefficients of the transform. To recover the N time-domain samples, the inverse transform uses the transform coefficients from two consecutive and overlapping frames or blocks to cancel out the TDA, in a process called time-domain aliasing cancellation (TDAC).
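The fold/unfold mechanism and the cancellation of TDA across two overlapping blocks can be demonstrated directly, without the transform itself (the size-N transform is orthogonal up to scaling and therefore does not affect the aliasing). The sketch below is purely illustrative (Python with NumPy assumed; one common sign convention for the folding, and a sine window satisfying the usual power-complementarity condition, are chosen here):

```python
import numpy as np

def fold(x):
    """Reduce 2N windowed samples to N samples by time inversion and
    summation/subtraction of the four window quarters (one common convention)."""
    q = len(x) // 4
    a, b, c, d = x[:q], x[q:2*q], x[2*q:3*q], x[3*q:]
    return np.concatenate([-c[::-1] - d, a - b[::-1]])

def unfold(f):
    """Expand N folded samples back to 2N samples; the TDA is still present."""
    h = len(f) // 2
    f1, f2 = f[:h], f[h:]
    return np.concatenate([f2, -f2[::-1], -f1[::-1], -f1])

N = 32
# Sine window of length 2N: w[n]^2 + w[n+N]^2 = 1 (power complementarity).
w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))

rng = np.random.default_rng(0)
x = rng.standard_normal(4 * N)
y = np.zeros_like(x)
for start in (0, N, 2 * N):                    # 50% overlapped blocks
    block = x[start:start + 2 * N] * w         # analysis window
    f = fold(block)                            # N samples containing TDA
    y[start:start + 2 * N] += unfold(f) * w    # synthesis window + overlap-add

# The TDA in one block alone cannot be removed, but overlap-add of two
# consecutive blocks cancels it: the middle samples are recovered exactly.
err = np.max(np.abs(y[N:3 * N] - x[N:3 * N]))
print("max reconstruction error:", err)
```

The first and last N samples are not reconstructed because each is covered by only one block, which is precisely the situation that arises at a switch away from a TDAC coding mode, discussed below.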
An example of such a transform applying TDAC, which is widely used in audio coding, is the Modified Discrete Cosine Transform (MDCT). Actually, the MDCT introduces TDA without explicit folding in the time domain. Rather, time-domain aliasing appears when considering both the direct MDCT and the inverse MDCT (IMDCT) of a single block of samples. This comes from the mathematical construction of the MDCT and is well known to those of ordinary skill in the art. But it is also known that this implicit time-domain aliasing can be seen as equivalent to first inverting parts of the time-domain samples and adding these inverted parts to (or subtracting them from) other parts of the signal. This operation is known as “folding”.
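This equivalence between the MDCT's implicit aliasing and explicit time-domain folding can be checked numerically. The following sketch (illustrative only; Python with NumPy assumed, using direct O(N²) formulas under one standard definition of the MDCT kernel) verifies that the MDCT of a 2N-sample block equals a size-N DCT-IV applied to the explicitly folded block:

```python
import numpy as np

def mdct_direct(x):
    """Direct O(N^2) MDCT: 2N samples -> N coefficients."""
    N = len(x) // 2
    n = np.arange(2 * N)
    k = np.arange(N)[:, None]
    return (x * np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))).sum(axis=1)

def dct_iv(y):
    """Direct O(N^2) DCT-IV: N samples -> N coefficients."""
    N = len(y)
    n = np.arange(N)
    k = np.arange(N)[:, None]
    return (y * np.cos(np.pi / N * (n + 0.5) * (k + 0.5))).sum(axis=1)

def fold(x):
    """Explicit time-domain folding: 2N samples -> N samples containing TDA."""
    q = len(x) // 4
    a, b, c, d = x[:q], x[q:2*q], x[2*q:3*q], x[3*q:]
    return np.concatenate([-c[::-1] - d, a - b[::-1]])

rng = np.random.default_rng(0)
x = rng.standard_normal(16)          # a 2N-sample block, with N = 8
coeffs_mdct = mdct_direct(x)         # MDCT, folding implicit in the kernel
coeffs_folded = dct_iv(fold(x))      # explicit folding, then a size-N DCT-IV
# The two N-coefficient outputs are identical (up to rounding).
```

This is why the MDCT, although defined on 2N samples, is critically sampled: its output is fully determined by the N-sample folded signal.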
A problem arises when an audio coder switches between two coding modes, one using TDAC and the other not. Suppose, for example, that a codec switches from a TDAC coding mode to a non-TDAC coding mode. The side of the block of samples coded using the TDAC coding mode which is common to the block coded without TDAC contains TDA that cannot be cancelled out, since the block of samples coded using the non-TDAC coding mode provides no matching aliased counterpart.
A first solution is to discard the samples which contain aliasing that cannot be cancelled out.
This first solution results in an inefficient use of transmission bandwidth because the block of samples for which TDA cannot be cancelled out is coded twice, once by the TDAC-based codec and a second time by the non-TDAC-based codec.
A second solution is to use specially designed windows which do not introduce TDA in at least one part of the window when the time-inversion and summation/subtraction process is applied. FIG. 1 is a diagram of an example of a 2N-sample window introducing TDA on its left side but not on its right side. The window 100 of FIG. 1 is useful for transitions from a TDAC-based codec to a non-TDAC-based codec. The first half of the window 100 is shaped so that it introduces TDA 110, which can be cancelled if the previous window also uses TDA with overlapping. However, the right side of the window 100 in FIG. 1 has a zero-valued region 120 after the folding point at position 3N/2. This region 120 of the window 100 therefore does not introduce any TDA when the time-inversion and summation/subtraction (or folding) process is performed around the folding point at position 3N/2.
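The effect of such a window can be verified numerically. The sketch below (illustrative only; it assumes Python with NumPy, and the taper lengths and shapes are arbitrary choices resembling FIG. 1, not taken from any standard) builds a window that is zero after the folding point 3N/2 and confirms that folding then leaves the corresponding samples free of TDA:

```python
import numpy as np

N = 64
L = N // 2   # length of the left-side tapered region 140 (assumed)
R = N // 4   # length of the right-side taper down to zero (assumed)

# Window resembling FIG. 1: left taper (TDA side), flat region 130,
# right taper reaching zero at 3N/2, then zero-valued region 120.
w = np.ones(2 * N)
w[:L] = np.sin(np.pi / (2 * L) * (np.arange(L) + 0.5))
w[3*N//2 - R:3*N//2] = np.cos(np.pi / (2 * R) * (np.arange(R) + 0.5))
w[3*N//2:] = 0.0

def fold(x):
    """Time-domain folding: 2N samples -> N samples (quarters a, b, c, d)."""
    q = len(x) // 4
    a, b, c, d = x[:q], x[q:2*q], x[2*q:3*q], x[3*q:]
    return np.concatenate([-c[::-1] - d, a - b[::-1]])

rng = np.random.default_rng(2)
x = rng.standard_normal(2 * N)
f = fold(x * w)

# Quarter d lies entirely in the zero region 120, so the first half of the
# folded signal is simply -c time-reversed: two signal segments are never
# mixed there, hence no TDA, and quarter c is recoverable from this block
# alone. The second half (a - b reversed) still contains TDA 110.
c = (x * w)[N:3*N//2]
left_fold = f[:N//2]
```

Re-reversing and negating `left_fold` recovers quarter c exactly, which is what permits a clean hand-over to a non-TDAC-based codec on the right side of the window.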
As illustrated in FIG. 1, the window 100 contains a flat region 130 preceded by a left-side tapered region 140. The purpose of the tapered region 140 is to provide good spectral resolution when the transform is computed and to smooth the transition during overlap-and-add operations between adjacent blocks. Increasing the duration of the flat region 130 of the window 100 reduces the amount of overhead information per coded frame. However, the region 120 decreases the spectral performance of the window 100, since only zero-valued sample information is conveyed in region 120.
Therefore, there is a need for an improved TDAC technique usable, for example, in the multi-mode Moving Picture Experts Group (MPEG) Unified Speech and Audio Coding (USAC) codec, to manage the different transitions between frames using rectangular, non-overlapping windows and frames using non-rectangular, overlapping windows, while ensuring proper spectral resolution, reduced data overhead and smooth transitions between these different frame types.