Embodiments according to the invention are related to an audio encoder for providing an encoded audio information on the basis of an input audio information and to an audio decoder for providing a decoded audio information on the basis of an encoded audio information. Further embodiments according to the invention are related to an encoded audio information. Yet further embodiments according to the invention are related to a method for providing a decoded audio information on the basis of an encoded audio information and to a method for providing an encoded audio information on the basis of an input audio information. Further embodiments are related to computer programs for performing the inventive methods.
An embodiment of the invention is related to a proposed update to a unified-speech-and-audio-coding (USAC) bitstream syntax.
In the following, some background of the invention will be explained in order to facilitate the understanding of the invention and the advantages thereof. During the past decade, considerable effort has been put into creating the possibility to digitally store and distribute audio contents. One important achievement along this way is the definition of the international standard ISO/IEC 14496-3. This part 3 of the standard is related to an encoding and decoding of audio contents, and subpart 4 of part 3 is related to general audio coding. ISO/IEC 14496 part 3, subpart 4 defines a concept for encoding and decoding of general audio content. In addition, further improvements have been proposed in order to improve the quality and/or reduce the required bit rate.
According to the concept described in said standard, a time domain audio signal is converted into a time-frequency representation. The transform from the time domain to the time-frequency domain is typically performed using transform blocks, which are also designated as “frames” of time domain samples. It has been found that it is advantageous to use overlapping frames, which are shifted, for example, by half a frame, because the overlap makes it possible to efficiently avoid (or at least reduce) artifacts. In addition, it has been found that a windowing should be performed in order to avoid the artifacts originating from this processing of temporally limited frames. Also, the windowing allows for an optimization of an overlap-and-add process of subsequent temporally shifted but overlapping frames.
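For illustration only, the interplay of half-overlapping frames, windowing and overlap-and-add can be sketched as follows. This is a minimal sketch, not the processing defined in said standard: the function names `sine_window` and `overlap_add`, the frame length of 16 samples and the choice of a sine window are assumptions made for the example, and the actual transform, quantization and inverse transform are omitted. The sine window is applied once on the analysis side and once on the synthesis side, so that the squared window values of two overlapping frames sum to one.

```python
import math


def sine_window(frame_len):
    """Sine window; satisfies w[i]**2 + w[i + frame_len // 2]**2 == 1."""
    return [math.sin(math.pi * (i + 0.5) / frame_len) for i in range(frame_len)]


def overlap_add(signal, frame_len):
    """Window each frame twice and overlap-add with a hop of half a frame.

    Hypothetical helper for illustration; no transform is applied between
    the analysis and synthesis windowing steps.
    """
    hop = frame_len // 2
    w = sine_window(frame_len)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Analysis windowing of the temporally limited frame.
        frame = [signal[start + i] * w[i] for i in range(frame_len)]
        # (A transform, quantization and inverse transform would sit here.)
        for i in range(frame_len):
            # Synthesis windowing followed by overlap-and-add.
            out[start + i] += frame[i] * w[i]
    return out


signal = [math.sin(0.1 * n) for n in range(64)]
reconstructed = overlap_add(signal, frame_len=16)

# Samples covered by two overlapping frames are reconstructed exactly
# (up to rounding); the first and last half frame are covered only once.
max_error = max(abs(reconstructed[n] - signal[n]) for n in range(8, 56))
```

In this sketch, only the samples in the interior of the signal are covered by two frames, which is why the first and the last half frame are excluded from the error computation.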
However, it has been found that it is problematic to efficiently represent edges, i.e. sharp transitions or so-called transients within the audio content, using windows of uniform length, because the energy of a transition will be spread out over the entire duration of a window, which results in audible artifacts. Accordingly, it has been proposed to switch between windows of different lengths, such that approximately stationary portions of an audio content are encoded using long windows, and such that transitional portions (e.g. portions comprising a transient) of the audio content are encoded using shorter windows.
However, in a system which allows a choice between different windows for transforming an audio content from the time domain to the time-frequency domain, it is necessary to signal to a decoder which window should be used for a decoding of an encoded audio content of a given frame.
In conventional systems, for example in an audio decoder according to the international standard ISO/IEC 14496, part 3, subpart 4, a data element called “window_sequence”, which indicates the window sequence used in the current frame, is written with two bits into a bitstream in a so-called “ics_info” bitstream element. By taking the window sequence of the previous frame into account, eight different window sequences are signaled.
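The per-frame cost of such a two-bit data element can be illustrated with a small sketch. The helpers `write_bits` and `read_bits` are hypothetical and only model a two-bit field in isolation; the other fields of the “ics_info” bitstream element are not modeled. The four enumerator names and values follow the commonly used naming for “window_sequence” as recalled here and should be checked against the standard. The mapping onto eight window sequences using the previous frame is deliberately not shown.

```python
from enum import IntEnum


class WindowSequence(IntEnum):
    # Names and values as commonly used for "window_sequence";
    # an assumption of this sketch, to be verified against the standard.
    ONLY_LONG_SEQUENCE = 0
    LONG_START_SEQUENCE = 1
    EIGHT_SHORT_SEQUENCE = 2
    LONG_STOP_SEQUENCE = 3


def write_bits(bits, value, width):
    """Append `value` as `width` bits, most significant bit first."""
    for shift in range(width - 1, -1, -1):
        bits.append((value >> shift) & 1)


def read_bits(bits, pos, width):
    """Read `width` bits starting at `pos`; return (value, new position)."""
    value = 0
    for i in range(width):
        value = (value << 1) | bits[pos + i]
    return value, pos + width


# Hypothetical run of frames: long windows around one transient frame.
frames = [
    WindowSequence.ONLY_LONG_SEQUENCE,
    WindowSequence.LONG_START_SEQUENCE,
    WindowSequence.EIGHT_SHORT_SEQUENCE,
    WindowSequence.LONG_STOP_SEQUENCE,
    WindowSequence.ONLY_LONG_SEQUENCE,
]

bitstream = []
for seq in frames:
    write_bits(bitstream, int(seq), 2)  # two bits per frame

decoded = []
pos = 0
for _ in frames:
    value, pos = read_bits(bitstream, pos, 2)
    decoded.append(WindowSequence(value))

signaling_bits = len(bitstream)  # 2 bits for each of the 5 frames
```

In this sketch, every frame contributes two bits of window signaling to the bitstream, regardless of whether the window type actually changes from one frame to the next.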
In view of the above discussion, it can be seen that a bit load of the encoded bitstream representing an audio information is created by the need to signal the type of window used.
In view of this situation, there is a desire to create a concept which allows for a more bitrate-efficient signaling of a type of window used for a transform between a time domain representation of an audio content and a time-frequency domain representation of the audio content.