During the last 20 years, particularly since the development of the MPEG-1 Layer 3 (MP3) and AC-2 (Dolby Digital) coders, perceptual audio coding has relied exclusively on the modified discrete cosine transform (MDCT) for waveform-preserving spectral quantization. The MDCT was introduced by Princen et al. (see [1], [2]) and further investigated, under the name modulated lapped transform (MLT), by Malvar (see [3]). The inverse of this transform, given a length-M spectrum X′i for frame index i, can be written as
$$x'_i(n) = \frac{2}{M} \sum_{k=0}^{M-1} X'_i(k) \cos\left(\frac{\pi}{M}\left(n + \frac{M+1}{2}\right)\left(k + \frac{1}{2}\right)\right), \tag{1}$$
with 0≤n<N and N being the window length. Since M = N/2, the overlapping ratio is 50%. In recent standards based on the MPEG-2 Advanced Audio Coding (AAC) specification (see [4], [5]), this concept has been extended to also allow parametric tools such as noise filling in the MDCT domain. The MPEG-H 3D Audio framework (see [6], [7]), for example, offers the following functionalities for semi-parametric transform-domain coding: noise filling of zeroed spectral lines above some frequency; stereo filling for semi-parametric joint-stereo coding (see [8], [9]); and Intelligent Gap Filling (IGF) for bandwidth extension (see [10]).
In [9], the combination of IGF and stereo filling, entitled spectral band substitution (SBS) in [8], assisted by transform kernel switching for input with non-trivial inter-channel phase differences, was shown to deliver good audio quality for most signals. On quasi-stationary harmonic segments, however, the subjective performance was lower than that of the alternative high-delay/complexity 3D Audio configuration using spectral band replication (SBR) and “unified stereo” MPEG Surround in a pseudo-QMF domain. An explanation for this behavior is the higher frequency resolution of the MDCTs utilized in the latter configuration: at the given output sample rate of 48 kHz, the M-size core transforms operate on 24-kHz downsampled downmix and residual signals, doubling the frame length.
SBS-based 3D Audio coding, due to its delay, complexity, and temporal-resolution advantages [8], represents the variant of choice at least for mono- and stereophonic signals, and it is desirable to improve its design—while maintaining the frame length—such that its performance can match that of the QMF-based configuration even on single-instrument and other tonal recordings. A viable solution for increased spectral efficiency on quasi-stationary segments is the extended lapped transform (ELT) proposed by Malvar (see [11], [12]), whose inverse (synthesis) version is identical to (1), except that 0≤n<L with L≥4M.
Thus, formula (1) describes the inverse MLT as well as the inverse ELT. The only difference is that, in the case of the inverse MLT, n is defined for 0≤n<N, e.g., with N=2·M, whereas in the case of the inverse ELT, n is defined for 0≤n<L, e.g., with L≥4M.
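The shared form of formula (1) can be illustrated with a minimal NumPy sketch. The function name `inverse_lapped_transform` and its parameters `spectrum` and `length` are hypothetical names chosen for this illustration. The sketch evaluates only the cosine sum of formula (1); synthesis windowing and overlap-add, which a complete decoder would need, are deliberately left out.

```python
import numpy as np

def inverse_lapped_transform(spectrum, length):
    """Evaluate formula (1) for 0 <= n < length (no windowing, no overlap-add).

    spectrum: the M coefficients X'_i(k) of one frame (hypothetical input).
    length:   number of output samples, e.g. N = 2*M (MLT) or L = 4*M (ELT).
    """
    X = np.asarray(spectrum, dtype=float)
    M = X.shape[0]
    n = np.arange(length)[:, None]   # time indices as a column
    k = np.arange(M)[None, :]        # frequency indices as a row
    # cos(pi/M * (n + (M+1)/2) * (k + 1/2)), one row per output sample
    basis = np.cos(np.pi / M * (n + (M + 1) / 2) * (k + 0.5))
    return (2.0 / M) * (basis @ X)

# Illustrative frame with M = 8 arbitrary coefficients
M = 8
rng = np.random.default_rng(0)
X = rng.standard_normal(M)

x_mlt = inverse_lapped_transform(X, 2 * M)   # inverse MLT range, N = 2M
x_elt = inverse_lapped_transform(X, 4 * M)   # inverse ELT range, L = 4M
```

Because both calls use the same cosine kernel, the first N samples of `x_elt` coincide with `x_mlt`; in this unwindowed sketch the two outputs differ only in the range of n.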
Unfortunately, as will be shown below, the ELT's overlap ratio is at least 75% instead of the MDCT's 50%, which often leads to audible artifacts for transient waveform parts like drum hits or tone onsets. Moreover, practical solutions for block length switching between ELTs of different lengths (or between an ELT and an MLT), similar to the technique applied in MDCT codecs for precisely such transient frames, have not been presented; only theoretical work has been published (see, for example, [13], [14], [15], [16], [17]).
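The two overlap figures can be checked with a short calculation. It assumes that the overlap ratio r denotes the fraction of a window shared with neighboring frames when consecutive frames advance by M samples; this definition is an assumption made for the illustration and is not stated explicitly in the text.

$$r = \frac{L - M}{L} = 1 - \frac{M}{L}, \qquad r_{\mathrm{MDCT}} = 1 - \frac{M}{2M} = 50\%, \qquad r_{\mathrm{ELT}} \geq 1 - \frac{M}{4M} = 75\% \quad \text{for } L \geq 4M.$$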