In applications having a very small bit rate available, it is known, in the context of encoding audio signals, to use an SBR technique for encoding. Only the low-frequency portion is encoded fully, i.e. at an adequate temporal and spectral resolution. For the high-frequency portion, only the spectral envelope, or the envelope of the spectral temporal curve of the audio signal, is detected and encoded. On the decoder side, the low-frequency portion is retrieved from the encoded signal and is subsequently used to reconstruct, or “replicate”, the high-frequency portion therefrom. However, to adapt the energy of the high-frequency portion, which has thus been preliminarily reconstructed, to the actual energy within the high-frequency portion of the original audio signal, the spectral envelope transmitted is used, on the decoder side, for spectral weighting of the high-frequency portion reconstructed preliminarily.
For the above effort to be worthwhile, it is important, of course, that the number of bits used for transmitting the spectral envelopes be as small as possible. It is therefore desirable for the temporal grid within which the spectral envelope is encoded to be as coarse as possible. On the other hand, however, too coarse a grid leads to audible artefacts, which is notable, in particular, with transients, i.e. at locations where the high-frequency portions will predominate rather than, as usual, the low-frequency portions, or where there is at least a rapid increase in the amplitude of the high-frequency portions. In audio signals, such transients correspond, for example, to the beginnings of a note, such as actuation of a piano string or the like. If the grid is too coarse over the time period of a transient, this may lead to audible artefacts in the decoder-side reconstruction of the entire audio signal. For, as one knows, on the decoder side, the high-frequency signal is reconstructed from the low-frequency portion in that, within the grid area, the spectral energy of the decoded low-frequency portion is normalized and then adapted to the spectral envelope transmitted by means of weighting. In other words, spectral weighting is simply performed within the grid area so as to reproduce the high-frequency portion from the low-frequency portion. However, if the grid area around the transient is too large, a lot of energy will be located, within this grid area, in addition to the energy of the transient, in the background and/or chord portion in the low-frequency portion which is used for reproducing the high-frequency portion. Said low-frequency portion is co-amplified by the weighting factor, even though this does not result in a good estimation of the high-frequency portion. Across the entire grid area, this will lead to an audible artefact which, in addition, will set in even before the actual transient. This problem may also be referred to as “pre-echo”.
The problem could be solved by making the grid area around the transient fine enough that the transient/background ratio of the part of the low-frequency portion within this grid area is improved. Small grid areas or small grid boundary distances, however, conflict with the above-outlined desire for a low bit consumption for encoding the spectral envelopes.
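The effect described above may be illustrated by a toy calculation. The following is only a sketch under simplifying assumptions (a single frequency band, per-slot energies instead of subband samples, invented numbers); the function name reconstruct_hf_energy and all values are hypothetical and are not taken from the standard.

```python
def reconstruct_hf_energy(lf_energy, hf_energy, borders):
    """Return the per-slot energy of the reconstructed high-frequency portion.

    lf_energy, hf_energy: per-time-slot energies of the decoded low-frequency
    portion and of the original high-frequency portion (toy values).
    borders: grid boundaries in slots, e.g. [0, 16] for a single grid area.
    """
    out = [0.0] * len(lf_energy)
    for start, stop in zip(borders[:-1], borders[1:]):
        target = sum(hf_energy[start:stop])  # envelope value sent by the encoder
        source = sum(lf_energy[start:stop])  # energy of the replicated portion
        weight = target / source if source > 0 else 0.0
        for t in range(start, stop):
            # One weight per grid area: the background is co-amplified.
            out[t] = weight * lf_energy[t]
    return out


# Steady background (e.g. a chord) in the low-frequency portion,
# and a single transient at slot 12 in the high-frequency portion.
lf = [1.0] * 16
hf = [0.0] * 16
hf[12] = 16.0

coarse = reconstruct_hf_energy(lf, hf, [0, 16])        # one grid area
fine = reconstruct_hf_energy(lf, hf, [0, 12, 13, 16])  # transient isolated

print(coarse[:12])  # energy already present before the transient (pre-echo)
print(fine[:12])    # no energy before the transient
```

With the coarse grid, the transient energy is smeared over all 16 slots, including the 12 slots preceding the transient. With the fine grid, the energy stays in the slot of the transient, but three envelope values instead of one have to be transmitted.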
In the ISO/IEC 14496-3 standard (simply referred to as “the standard” below), an SBR encoding is described in the context of the AAC encoder. The AAC encoder encodes the low-frequency portion in a frame-by-frame manner. For each such SBR frame, the above-specified time and frequency resolution is defined at which the spectral envelope of the high-frequency portion is encoded in this frame. To address the problem that transients may also fall on SBR frame boundaries, the standard allows the temporal grid to be temporarily defined such that the grid boundaries do not necessarily coincide with the frame boundaries. Rather, in this standard, the encoder transmits, per frame, a syntax element bs_frame_class to the decoder, said syntax element indicating whether the temporal grid of the spectral envelope gridding for the respective frame is defined precisely between the two frame boundaries or between boundaries which are offset from the frame boundaries, specifically at the front and/or at the back. Overall, there are four different classes of SBR frames, i.e. FIXFIX, FIXVAR, VARFIX and VARVAR. The syntax used by the encoder in the standard to define the grid per SBR frame is depicted in a pseudo code representation in FIG. 12. In FIG. 12, those syntax elements which are actually encoded and/or transmitted by the encoder are printed in bold type, the number of the bits used for transmission and/or encoding being indicated in the second column from the right in the respective row. As may be seen, the syntax element bs_frame_class which has just been mentioned is initially transmitted for each SBR frame. As a function thereof, further syntax elements will follow which, as will be illustrated, define the temporal resolution and/or gridding.
If, for example, the 2-bit syntax element bs_frame_class indicates that the SBR frame in question is a FIXFIX SBR frame, the syntax element tmp will be transmitted as the second syntax element; it defines the number of grid areas, or envelopes, in this SBR frame as bs_num_env = 2^tmp. The syntax element bs_amp_res, which is used for the quantization step size for encoding the spectral envelope in the current SBR frame, is automatically adjusted as a function of bs_num_env, and is not encoded or transmitted. Finally, for a FIXFIX frame, a bit bs_freq_res is transmitted for determining the frequency resolution of the grid. FIXFIX frames are defined precisely for one frame, i.e. the grid boundaries coincide with the frame boundaries as defined by the AAC encoder.
This is different for the other three classes. For FIXVAR, VARFIX and VARVAR frames, syntax elements bs_var_bord_1 and/or bs_var_bord_0 are transmitted to indicate the number of time slots, i.e. the time units in which the filter bank for spectral decomposition of the audio signal operates, by which the grid boundaries are offset relative to the normal frame boundaries. As a function thereof, syntax elements bs_num_rel_1 and an associated tmp and/or bs_num_rel_0 and an associated tmp are also transmitted so as to define a number of grid areas, or envelopes, and the size thereof from the offset frame boundary. In addition, a syntax element bs_pointer is transmitted within the variable SBR frames, said syntax element pointing to one of the defined envelopes and serving to define one or two noise envelopes for determining the noise portion within the frame as a function of the spectral envelope gridding, which, however, shall not be explained in detail below in order to simplify the representation. Finally, the respective frequency resolution is determined, namely by a respective one-bit syntax element bs_freq_res per envelope, for all grid areas and/or envelopes in the respective variable frames.
FIG. 13a represents, by way of example, a FIXFIX frame wherein the syntax element tmp is 1, so that the number of envelopes is bs_num_env = 2^1 = 2. In FIG. 13a it shall be assumed that the time axis extends from the left to the right in a horizontal manner. An SBR frame, i.e. one of the frames in which the AAC encoder encodes the low-frequency portion, is indicated by reference numeral 902 in FIG. 13a. As can be seen, the SBR frame 902 has a length of 16 QMF slots, the QMF slots being, as has been mentioned, the time slots in units of which the analysis filter bank operates, and being indicated by box 904 in FIG. 13a. In FIXFIX frames, the envelopes, or grid areas, 906a and 906b, i.e. two in number here, have the same length within the SBR frame 902, so that a time grid and/or envelope boundary 908 is defined precisely in the center of the SBR frame 902. In this manner the exemplary FIXFIX frame of FIG. 13a defines that a spectral distribution for the grid area, or the envelope, 906a, and a further one for envelope 906b, is temporally determined from the spectral values of the analysis filter bank. The envelopes, or grid areas, 906a and 906b thus specify the grid in which the spectral envelope is encoded and/or transmitted.
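The equidistant gridding of a FIXFIX frame may be sketched as follows. This is merely an illustration of the relationship bs_num_env = 2^tmp described above; the function name fixfix_borders and the default frame length of 16 QMF slots (taken from the example of FIG. 13a) are assumptions made for this sketch, and the code is not the derivation prescribed by the standard.

```python
def fixfix_borders(tmp, num_slots=16):
    """Return the envelope boundaries (in QMF slots) of a FIXFIX frame.

    tmp: value of the transmitted syntax element; the frame is divided
         into bs_num_env = 2**tmp envelopes of equal length.
    num_slots: frame length in QMF slots (16 in the example of FIG. 13a).
    """
    bs_num_env = 2 ** tmp
    # Boundaries coincide with the frame boundaries at 0 and num_slots.
    return [i * num_slots // bs_num_env for i in range(bs_num_env + 1)]


# Example of FIG. 13a: tmp = 1 yields two envelopes with a boundary at the center.
print(fixfix_borders(1))
```

For tmp = 1 and 16 QMF slots, the boundaries are at slots 0, 8 and 16, which corresponds to the two envelopes 906a and 906b and the central boundary 908.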
By comparison, FIG. 13b shows a VARVAR frame. SBR frame 902 and associated QMF slots 904 are indicated again. For this SBR frame, however, syntax elements bs_var_bord_0 and/or bs_var_bord_1 have defined that the envelopes 906a′, 906b′ and 906c′ associated therewith are not to start at the SBR frame start 902a and/or to end at the SBR frame end 902b. Rather, one may see from FIG. 13b that the previous SBR frame (not to be seen in FIG. 13b) has already been extended two QMF slots beyond the SBR frame start 902a of the current SBR frame, so that the last envelope 910 of the preceding SBR frame still extends into the current SBR frame 902. The last envelope 906c′ of the current frame also extends beyond the SBR frame end of the current SBR frame 902, namely, by way of example, also by two QMF slots here. In addition, one can also see here, by way of example, that the syntax elements bs_num_rel_0 and bs_num_rel_1 of the VARVAR frame are each set to 1, with the additional information that the envelopes thus defined at the start and at the end of the SBR frame 902, i.e. 906a′ and 906c′, have a length of four QMF slots in accordance with tmp=1, so as to extend from the frame boundaries into the SBR frame 902 by this number of slots. The remaining space of the SBR frame 902 will then be occupied by the remaining envelope, in this case the third envelope 906b′.
By having T in one of the QMF slots 904, FIG. 13b indicates, by way of example, the reason why a VARVAR frame has been defined here, namely because the transient position T is located close to the SBR frame end 902b, and because there probably was a transient (not to be seen) also in the SBR frame preceding the current one.
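The gridding of the exemplary VARVAR frame of FIG. 13b may be sketched as follows. The sketch takes the offsets and envelope lengths directly as slot counts, as they are described above, and deliberately leaves out how these values are coded in the bit stream; the function name variable_borders and its parameter names are hypothetical and do not appear in the standard.

```python
def variable_borders(start_offset, end_offset, rel_from_start, rel_from_end,
                     num_slots=16):
    """Return envelope boundaries (in QMF slots) of a frame with variable borders.

    start_offset / end_offset: slots by which the first / last boundary is
        offset beyond the frame start (slot 0) / frame end (slot num_slots).
    rel_from_start / rel_from_end: lengths of the envelopes counted from the
        offset start / offset end towards the inside of the frame.
    """
    borders = [start_offset]
    for length in rel_from_start:
        borders.append(borders[-1] + length)

    tail = [num_slots + end_offset]
    for length in rel_from_end:
        tail.append(tail[-1] - length)
    borders.extend(reversed(tail))

    # The remaining envelope fills the space between the two groups.
    if any(a >= b for a, b in zip(borders, borders[1:])):
        raise ValueError("envelope boundaries must be strictly increasing")
    return borders


# Values read from the description of FIG. 13b: both boundaries offset by
# two slots, one four-slot envelope at the start and one at the end.
print(variable_borders(2, 2, [4], [4]))
```

Under these assumptions, the boundaries lie at slots 2, 6, 14 and 18, so that envelope 906a′ covers slots 2 to 6, the remaining envelope 906b′ covers slots 6 to 14, and envelope 906c′ covers slots 14 to 18, i.e. it extends two slots beyond the SBR frame end 902b.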
The standardized version in accordance with ISO/IEC 14496-3 thus involves overlapping of two successive SBR frames. This enables setting the envelope boundaries in a variable manner, irrespective of the actual SBR frame boundaries, in accordance with the waveform. Transients may thus be enveloped by envelopes of their own, and their energy may be cut off from the remaining signal. However, an overlap also involves an additional system delay, as was illustrated above. In particular, four frame classes are used for signaling in the standard. In the FIXFIX class, the boundaries of the SBR envelopes coincide with the boundaries of the core frame, as is shown in FIG. 13a. The FIXFIX class is used when no transient is present in this frame. The number of envelopes specifies their equidistant distribution within the frame. The FIXVAR class is provided when there is a transient in the current frame. Here, the respective set of envelopes thus starts at the SBR frame boundary and ends, in a variable manner, in the SBR transmission area. The VARFIX class is provided for the event that a transient is not located in the current, but in the previous frame. The sequence of envelopes from the last frame here is continued by a new set of envelopes which ends at the SBR frame boundary. The VARVAR class is provided for the case that a transient is present both in the last frame and in the current frame. Here, a variable sequence of envelopes is continued by a further variable sequence. As has been described above, the boundaries of the variable envelopes are transmitted in relation to one another.
The possibility of offsetting the boundaries relative to the fixed frame boundaries by a number of QMF slots, by means of the syntax elements bs_var_bord_0 and bs_var_bord_1, results in a delay on the decoder side, since envelopes occur which extend beyond SBR frame boundaries and thus necessitate the formation and/or averaging of spectral signal energies across SBR frame boundaries. However, this time delay is not tolerable in some applications, such as in applications in the field of telephony or other live applications which rely on the time delay caused by the encoding and decoding being small. Even though the occurrence of pre-echoes is thus prevented, the solution is not suitable for applications necessitating a short delay time. In addition, the number of bits needed for transmitting the SBR frames in the above-described standard is relatively high.