The present invention relates to audio signal encoding, decoding, and processing, and, in particular, to adjusting a level of a signal to be frequency-to-time converted (or time-to-frequency converted) to the dynamic range of a corresponding frequency-to-time converter (or time-to-frequency converter). Some embodiments of the present invention relate to adjusting the level of the signal to be frequency-to-time converted (or time-to-frequency converted) to the dynamic range of a corresponding converter implemented in fixed-point or integer arithmetic. Further embodiments of the present invention relate to clipping prevention for spectral decoded audio signals using time domain level adjustment in combination with side information.
Audio signal processing is becoming more and more important. Challenges arise as modern perceptual audio codecs are required to deliver satisfactory audio quality at increasingly low bit rates.
In current audio content production and delivery chains, the digitally available master content (a PCM (pulse code modulated) stream) is encoded, e.g., by a professional AAC (Advanced Audio Coding) encoder at the content creation side. The resulting AAC bitstream is then made available for purchase, e.g., through an online digital media store. It has been observed in rare cases that some decoded PCM samples are “clipping”, which means that two or more consecutive samples reach the maximum level that can be represented by the underlying bit resolution (e.g. 16 bit) of a uniformly quantized fixed-point representation (e.g. modulated according to PCM) for the output waveform. This may lead to audible artifacts (clicks or short distortion). Although an effort will typically be made at the encoder side to prevent the occurrence of clipping at the decoder side, clipping may nevertheless occur at the decoder side for various reasons, such as different decoder implementations, rounding errors, transmission errors, etc. Assuming an audio signal at the encoder's input that is below the threshold of clipping, the reasons for clipping in a modern perceptual audio encoder are manifold. First of all, the audio encoder applies quantization to the transmitted signal, which is available in a frequency decomposition of the input waveform, in order to reduce the transmission data rate. Quantization errors in the frequency domain result in small deviations of the signal amplitude and phase with respect to the original waveform. If amplitude or phase errors add up constructively, the resulting amplitude in the time domain may temporarily be higher than that of the original waveform. Secondly, parametric coding methods (e.g. spectral band replication, SBR) parameterize the signal power in a rather coarse manner. Phase information is typically omitted. Consequently, the signal at the receiver side is only regenerated with correct power but without waveform preservation.
Signals with an amplitude close to full scale are prone to clipping.
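The notion of clipping used above (two or more consecutive samples at the maximum representable level) can be illustrated with a minimal Python sketch. The function name `find_clipping_runs` and its parameters are hypothetical and chosen for illustration only; treating the most negative value as a limit as well is an assumption of this sketch.

```python
def find_clipping_runs(samples, bits=16, min_run=2):
    """Return (start_index, length) for each run of at least `min_run`
    consecutive samples sitting at the limits of a signed `bits`-bit word."""
    max_pos = (1 << (bits - 1)) - 1   # e.g. 32767 for 16 bit
    min_neg = -(1 << (bits - 1))      # e.g. -32768 for 16 bit
    runs = []
    start = None
    for i, s in enumerate(samples):
        at_limit = s >= max_pos or s <= min_neg
        if at_limit:
            if start is None:
                start = i             # a new run of limit samples begins
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - start))
            start = None
    # Close a run that extends to the end of the signal.
    if start is not None and len(samples) - start >= min_run:
        runs.append((start, len(samples) - start))
    return runs
```

A single isolated full-scale sample is not reported, in line with the "two or more consecutive samples" criterion stated above.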
Modern audio coding systems offer the possibility to convey a loudness level parameter (g1) giving decoders the possibility to adjust loudness for playback with unified levels. In general, this might lead to clipping, if the audio signal is encoded at sufficiently high levels and transmitted normalization gains suggest increasing loudness levels. In addition, common practice in mastering audio content (especially music) boosts audio signals to the maximum possible values, yielding clipping of the audio signal when coarsely quantized by audio codecs.
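As a rough illustration of how a transmitted normalization gain may push a signal encoded at a high level into clipping, the following Python sketch applies a gain to 16-bit samples and counts how many of them saturate. The function name and the representation of the gain in decibels (`gain_db`) are assumptions made for this example and are not taken from any codec specification.

```python
def apply_gain_16bit(samples, gain_db):
    """Apply a loudness gain (in dB) to signed 16-bit samples.

    Returns the saturated output and the number of samples that had to be
    clipped to the 16-bit range."""
    gain = 10.0 ** (gain_db / 20.0)   # dB to linear amplitude factor
    out = []
    clipped = 0
    for s in samples:
        v = int(round(s * gain))
        if v > 32767:
            v = 32767
            clipped += 1
        elif v < -32768:
            v = -32768
            clipped += 1
        out.append(v)
    return out, clipped
```

With a gain of about +6 dB, any sample above roughly half of full scale saturates, whereas a low-level sample is simply amplified.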
To prevent clipping of audio signals, so-called limiters are known as an appropriate tool to restrict audio levels. If an incoming audio signal exceeds a certain threshold, the limiter is activated and attenuates the audio signal in such a way that the audio signal does not exceed a given level at the output. Unfortunately, sufficient headroom (in terms of dynamic range and/or bit resolution) is required prior to the limiter.
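The basic behavior of such a limiter can be sketched as follows. This is a deliberately simple peak limiter with instantaneous attack and a gradual release, written for illustration only; the name `limit` and the `release` parameter are hypothetical, and practical limiters usually add look-ahead and smoother gain curves.

```python
def limit(samples, threshold=1.0, release=0.01):
    """Attenuate `samples` so that no output magnitude exceeds `threshold`.

    The gain drops immediately when a sample would exceed the threshold and
    then recovers towards unity at a rate set by `release`."""
    gain = 1.0
    out = []
    for x in samples:
        # Largest gain that keeps this sample within the threshold.
        needed = 1.0 if abs(x) <= threshold else threshold / abs(x)
        # Recover slowly towards 1.0, but never exceed the needed gain.
        gain = min(needed, gain + release * (1.0 - gain))
        out.append(x * gain)
    return out
```

Note that the input in this sketch is allowed to exceed the threshold (for example, values of 1.5 or -2.0 relative to a full scale of 1.0). The signal path feeding the limiter therefore has to represent such values without saturating, which is the headroom requirement mentioned above.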
Usually, any loudness normalization is achieved in the frequency domain together with a so-called “dynamic range control” (DRC). This allows smooth blending of loudness normalization even if the normalization gain varies from frame to frame because of the filter-bank overlap.
Further, due to poor quantization or parametric description, any coded audio signal might go into clipping if the original audio was mastered at levels near the clipping threshold.
It is typically desirable to keep computational complexity, memory usage, and power consumption as small as possible in highly efficient digital signal processing devices based on fixed-point arithmetic. For this reason, it is also desirable to keep the word length of audio samples as small as possible. To take any potential headroom for clipping due to loudness normalization into account, a filter bank, which typically is a part of an audio encoder or decoder, would have to be designed with a higher word length.
It would be desirable to allow signal limiting without losing data precision and/or without a need for using a higher word length for a decoder filter bank or an encoder filter bank. In the alternative or in addition it would be desirable if a relevant dynamic range of the signal to be frequency-to-time converted or vice versa could be determined continuously on a frame-by-frame basis for consecutive time sections or “frames” of the signal so that the level of the signal can be adjusted in a way that the current relevant dynamic range fits into the dynamic range provided by the converter (frequency-to-time domain converter or time-to-frequency-domain converter). It would also be desirable to make such a level shift for the purpose of frequency-to-time conversion or time-to-frequency conversion substantially “transparent” to other components of the decoder or encoder.
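One possible reading of the frame-by-frame level adjustment described above is sketched below in Python. It is an illustration under stated assumptions and not the claimed method: the frame is assumed to hold integer values, the level shift is assumed to be a whole number of bits, and the converter is assumed to be linear so that the shift can be undone after conversion. All names (`level_shift_for_frame`, `convert_with_level_shift`, `word_length`) are hypothetical.

```python
def level_shift_for_frame(frame, word_length=16):
    """Smallest right shift (in bits) that makes every value of `frame`
    fit into a signed word of `word_length` bits."""
    lo = -(1 << (word_length - 1))
    hi = (1 << (word_length - 1)) - 1
    shift = 0
    while any(not (lo <= (v >> shift) <= hi) for v in frame):
        shift += 1
    return shift


def convert_with_level_shift(frame, converter, word_length=16):
    """Shift the frame down so it fits the converter's word length, run the
    converter, then compensate the shift on the converter's output.

    `converter` stands in for a linear frequency-to-time (or
    time-to-frequency) converter operating on `word_length`-bit values."""
    shift = level_shift_for_frame(frame, word_length)
    shifted = [v >> shift for v in frame]    # fits the converter's range
    converted = converter(shifted)
    # Undo the shift in a wider word so other components see the
    # original level; low-order bits dropped by the shift are lost.
    return [y << shift for y in converted], shift
```

Because the shift is recomputed for each frame, a quiet frame passes through with a shift of zero and full precision, while a loud frame is scaled down only as far as needed. Compensating the shift after conversion is what keeps the adjustment hidden from components outside the converter.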