The present application is concerned with a downscaled decoding concept.
The MPEG-4 Enhanced Low Delay AAC (AAC-ELD) usually operates at sample rates up to 48 kHz, which results in an algorithmic delay of 15 ms. For some applications, e.g. lip-sync transmission of audio, an even lower delay is desirable. AAC-ELD already provides such an option by operating at higher sample rates, e.g. 96 kHz, and therefore provides operation modes with even lower delay, e.g. 7.5 ms. However, this operation mode comes along with an unnecessary high complexity due to the high sample rate.
The solution to this problem is to apply a downscaled version of the filter bank and therefore, to render the audio signal at a lower sample rate, e.g. 48 kHz instead of 96 kHz. The downscaling operation is already part of AAC-ELD as it is inherited from the MPEG-4 AAC-LD codec, which serves as a basis for AAC-ELD.
The question which remains, however, is how to find the downscaled version of a specific filter bank. That is, the only uncertainty is the way the window coefficients are derived whilst enabling clear conformance testing of the downscaled operation modes of the AAC-ELD decoder.
In the following the principles of the down-scaled operation mode of the AAC-(E)LD codecs are described.
The downscaled operation mode or AAC-LD is described for AAC-LD in ISO/IEC 14496-3:2009 in section 4.6.17.2.7 “Adaptation to systems using lower sampling rates” as follows:
“In certain applications it may be necessary to integrate the low delay decoder into an audio system running at lower sampling rates (e.g. 16 kHz) while the nominal sampling rate of the bitstream payload is much higher (e.g. 48 kHz, corresponding to an algorithmic codec delay of approx. 20 ms). In such cases, it is favorable to decode the output of the low delay codec directly at the target sampling rate rather than using an additional sampling rate conversion operation after decoding.
This can be approximated by appropriate downscaling of both, the frame size and the sampling rate, by some integer factor (e.g. 2, 3), resulting in the same time/frequency resolution of the codec. For example, the codec output can be generated at 16 kHz sampling rate instead of the nominal 48 kHz by retaining only the lowest third (i.e. 480/3=160) of the spectral coefficients prior to the synthesis filterbank and reducing the inverse transform size to one third (i.e. window size 960/3=320).
As a consequence, decoding for lower sampling rates reduces both memory and computational requirements, but may not produce exactly the same output as a full-bandwidth decoding, followed by band limiting and sample rate conversion.
Please note that decoding at a lower sampling rate, as described above, does not affect the interpretation of levels, which refers to the nominal sampling rate of the AAC low delay bitstream payload.”
Please note that AAC-LD works with a standard MDCT framework and two window shapes, i.e. sine-window and low-overlap-window. Both windows are fully described by formulas and therefore, window coefficients for any transformation lengths can be determined.
Compared to AAC-LD, the AAC-ELD codec shows two major differences:                The Low Delay MDCT window (LD-MDCT)        The possibility of utilizing the Low Delay SBR tool        
The IMDCT algorithm using the low delay MDCT window is described in 4.6.20.2 in [1], which is very similar to the standard IMDCT version using e.g. the sine window. The coefficients of the low delay MDCT windows (480 and 512 samples frame size) are given in Table 4.A.15 and 4.A.16 in [1]. Please note that the coefficients cannot be determined by a formula, as the coefficients are the result of an optimization algorithm. FIG. 9 shows a plot of the window shape for frame size 512.
In case the low delay SBR (LD-SBR) tool is used in conjunction with the AAC-ELD coder, the filter banks of the LD-SBR module are downscaled as well. This ensures that the SBR module operates with the same frequency resolution and, therefore, no more adaptions are implemented.
Thus, the above description reveals that there is a need for downscaling decoding operations such as, for example, downscaling a decoding at an AAC-ELD. It would be feasible to find out the coefficients for the downscaled synthesis window function anew, but this is a cumbersome task, necessitates additional storage for storing the downscaled version and renders a conformity check between the non-downscaled decoding and the downscaled decoding more complicated or, from another perspective, does not comply with the manner of downscaling requested in the AAC-ELD, for example. Depending on the downscale ratio, i.e. the ratio between the original sampling rate and the downscaled sampling rate, one could derive the downscaled synthesis window function simply by downsampling, i.e. picking out every second, third, . . . window coefficient of the original synthesis window function, but this procedure does not result in a sufficient conformity of the non-downscaled decoding and downscaled decoding, respectively. Using more sophisticated decimating procedures applied to the synthesis window function, lead to unacceptable deviations from the original synthesis window function shape. Therefore, there is a need in the art for an improved downscaled decoding concept.