The present invention relates to an apparatus and a method for calculating a number of spectral envelopes, an audio encoder and a method for encoding audio signals.
Natural audio coding and speech coding are two major tasks of codecs for audio signals. Natural audio coding is commonly used for music or arbitrary signals at medium bit rates and generally offers wide audio bandwidths. On the other hand, speech coders are basically limited to speech reproduction, but can also be used at a very low bit rate.
Wide band speech offers a major subjective quality improvement over narrow band speech. Increasing the bandwidth improves not only the intelligibility and naturalness of speech, but also speaker recognition. Wide band speech coding is, thus, an important issue in the next generation of telephone systems. Further, due to the tremendous growth of the multimedia field, transmission of music and other non-speech signals at high quality over telephone systems is a desirable feature.
To drastically reduce the bit rate, source coding can be performed using split-band perceptual audio codecs. These natural audio codecs exploit perceptual irrelevancy and statistical redundancy in the signal. Moreover, it is common to reduce the sample rate and, thus, the audio bandwidth. It is also common to decrease the number of composition levels, occasionally allowing audible quantization distortion, and to employ degradation of the stereo field through intensity coding. Excessive use of such methods results in annoying perceptual degradation. In order to improve the coding performance, spectral band replication is used as an efficient method to generate high frequency signals in a high frequency reconstruction (HFR) based codec.
Spectral band replication (SBR) is a technique that gained popularity as an add-on to popular perceptual audio coders such as MP3 and Advanced Audio Coding (AAC). SBR is a method of bandwidth extension in which the low band (base band or core band) of the spectrum is encoded using a state-of-the-art codec, whereas the upper band (or high band) is coarsely parameterized using few parameters. SBR makes use of a correlation between the low band and the high band by predicting the wider band signal from the lower band using the extracted high band features. This is often sufficient, since the human ear is less sensitive to distortions in the higher band compared to the lower band. New audio coders, therefore, encode the lower spectrum using, for example, MP3 or AAC, whereas the higher band is encoded using SBR. The key to the SBR algorithm is the information used to describe the higher frequency portion of the signal. The primary design goal of this algorithm is to reconstruct the higher band spectrum without introducing any artifacts and to provide good spectral and temporal resolution. For example, a 64-band complex-valued polyphase filterbank is used in the analysis stage; at the encoder, the filterbank is used to obtain, e.g., energy samples of the original input signal's high band. These energy samples may then be used as reference values for an envelope adjustment scheme used at the decoder.
Spectral envelopes refer to a coarse spectral distribution of the signal in a general sense and comprise, for example, filter coefficients in a linear predictive-based coder or a set of time-frequency averages of sub-band samples in a sub-band coder. Envelope data refers, in turn, to the quantized and coded spectral envelope. Especially if the lower frequency band is coded with a low bit rate, the envelope data constitutes a larger part of the bitstream. Hence, it is important to represent the spectral envelope compactly, especially at lower bit rates.
Spectral band replication makes use of tools that are based on a replication of, e.g., sequences of harmonics truncated during encoding. Moreover, it adjusts the spectral envelope of the generated high band, applies inverse filtering, and adds noise and harmonic components in order to recreate the spectral characteristics of the original signal. Therefore, the input of the SBR tool comprises, for example, the quantized envelope data, miscellaneous control data, and a time domain signal from the core coder (e.g. AAC or MP3). The output of the SBR tool is either a time domain signal or a QMF-domain (QMF=Quadrature Mirror Filter) representation of a signal as, for example, in case the MPEG surround tool is used. The description of the bit stream elements for the SBR payload can be found in the Standard ISO/IEC 14496-3:2005, sub-clause 4.5.2.8; these elements comprise, among other data, SBR extension data and an SBR header, and indicate the number of SBR envelopes within an SBR frame.
For the implementation of SBR on the encoder side, an analysis is performed on the input signal. Information obtained from this analysis is used to choose the appropriate time/frequency resolution of the current SBR frame. The algorithm calculates the start and stop time borders of the SBR envelopes in the current SBR frame, the number of SBR envelopes, and their frequency resolution. The different frequency resolutions are calculated as described, for example, in the ISO/IEC 14496-3 Standard in sub-clause 4.6.18.3. The algorithm also calculates the number of noise floors for the given SBR frame and their start and stop time borders. The start and stop time borders of the noise floors should be a sub-set of the start and stop time borders of the spectral envelopes. The algorithm assigns the current SBR frame to one of four classes:
FIXFIX: Both the leading and the trailing time border equal nominal SBR-frame boundaries. All SBR envelope time borders in the frame are uniformly distributed in time. The number of envelopes is an integer power of two (1, 2, 4, 8, ...).
FIXVAR: The leading time border equals the leading nominal frame boundary. The trailing time border is variable and can be defined by bit stream elements. All SBR envelope time borders between the leading and the trailing time border can be specified as the relative distance in time slots to the previous border, starting from the trailing time border.
VARFIX: The leading time border is variable and can be defined by bit stream elements. The trailing time border equals the trailing nominal frame boundary. All SBR envelope time borders between the leading and trailing time borders are specified in the bit stream as the relative distance in time slots to the previous border, starting from the leading time border.
VARVAR: Both the leading and the trailing time border are variable and can be defined in the bit stream. All SBR envelope time borders between the leading and trailing time borders are also specified. The relative time borders starting from the leading time border are specified as the relative distance to the previous time border. Likewise, the relative time borders starting from the trailing time border are specified as the relative distance to the previous time border.
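The border rules of the classes above can be illustrated with a minimal Python sketch. It is not the bit stream syntax of the Standard: the function names, the frame length of 16 time slots, and the example distances are assumptions chosen only for illustration.

```python
def _check_increasing(borders):
    """Raise if the borders are not strictly increasing in time."""
    if any(b1 <= b0 for b0, b1 in zip(borders, borders[1:])):
        raise ValueError("time borders must be strictly increasing")
    return borders


def fixfix_borders(num_slots, num_envelopes):
    """FIXFIX-style grid: borders uniformly distributed over the nominal frame.

    The number of envelopes must be an integer power of two.
    """
    if num_envelopes < 1 or num_envelopes & (num_envelopes - 1):
        raise ValueError("number of envelopes must be a power of two")
    borders = [(i * num_slots) // num_envelopes for i in range(num_envelopes + 1)]
    return _check_increasing(borders)


def borders_from_leading(leading, rel_distances, trailing):
    """VARFIX-style grid: accumulate relative distances forward from the leading border."""
    borders = [leading]
    for distance in rel_distances:
        borders.append(borders[-1] + distance)
    borders.append(trailing)
    return _check_increasing(borders)


def borders_from_trailing(leading, rel_distances, trailing):
    """FIXVAR-style grid: accumulate relative distances backward from the trailing border."""
    borders = [trailing]
    for distance in rel_distances:
        borders.append(borders[-1] - distance)
    borders.append(leading)
    borders.reverse()
    return _check_increasing(borders)


# Hypothetical frame of 16 time slots.
uniform = fixfix_borders(16, 4)
forward = borders_from_leading(2, [3, 4], 16)
backward = borders_from_trailing(0, [2, 3], 18)
```

A VARVAR-style grid would combine both helpers, counting some borders forward from the leading border and the rest backward from the trailing border.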
There are no restrictions on SBR frame class transitions, i.e. any sequence of classes is allowed in the Standard. However, in accordance with this Standard, the maximal number of SBR envelopes per SBR frame is restricted to 4 for class FIXFIX and 5 for class VARVAR. Classes FIXVAR and VARFIX are syntactically limited to four SBR envelopes. The spectral envelopes of the SBR frame are estimated over the time segment and with the frequency resolution given by the time/frequency grid. The SBR envelope is estimated by averaging the squared complex sub-band samples over the given time/frequency regions.
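The averaging of squared complex sub-band samples can be sketched as follows. This is an illustrative sketch only: the data layout (subband[k][t] for sub-band k and time slot t), the function name, and the example grid are assumptions, not taken from the Standard.

```python
def estimate_envelopes(subband, time_borders, freq_borders):
    """Average |x|^2 of complex sub-band samples over each time/frequency region.

    subband      -- subband[k][t]: complex sample of sub-band k at time slot t
    time_borders -- increasing time slot indices delimiting the envelopes
    freq_borders -- increasing sub-band indices delimiting the frequency bands
    Returns envelopes[e][b]: mean energy of envelope e in frequency band b.
    """
    envelopes = []
    for t0, t1 in zip(time_borders, time_borders[1:]):
        row = []
        for k0, k1 in zip(freq_borders, freq_borders[1:]):
            total = sum(
                abs(subband[k][t]) ** 2
                for k in range(k0, k1)
                for t in range(t0, t1)
            )
            row.append(total / ((k1 - k0) * (t1 - t0)))
        envelopes.append(row)
    return envelopes


# Hypothetical input: 4 sub-bands, 4 time slots, constant amplitude k + 1 in band k.
samples = [[complex(k + 1, 0)] * 4 for k in range(4)]
energies = estimate_envelopes(samples, time_borders=[0, 2, 4], freq_borders=[0, 2, 4])
```

With this input, each of the two envelopes holds the mean energies 2.5 (bands 0 and 1) and 12.5 (bands 2 and 3).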
In SBR, transients generally receive a specific treatment by employing specific envelopes of variable lengths. Transients can be defined as portions within conventional signals in which a strong increase in energy appears within a short period of time, which may or may not be confined to a specific frequency region. Examples of transients are hits of castanets and of percussion instruments, but also certain sounds of the human voice as, for example, the letters P, T, K. So far, the detection of this kind of transient has always been implemented in the same way, i.e. by the same algorithm (using a transient threshold), independently of whether the signal is classified as speech or as music. In addition, a possible distinction between voiced and unvoiced speech does not influence the conventional or classical transient detection mechanism.
Hence, in case a transient is detected, the SBR data should be adjusted so that a decoder can replicate the detected transient appropriately. In WO 01/26095, an apparatus and a method are disclosed for spectral envelope coding, which take into account a detected transient in the audio signal. In this conventional method, a non-uniform time and frequency sampling of the spectral envelope is achieved by adaptively grouping sub-band samples from a fixed-size filterbank into frequency bands and time segments, each of which generates one envelope sample. The corresponding system defaults to long time segments and high frequency resolution, but in the vicinity of a transient, shorter time segments are used, whereby larger frequency steps can be used in order to keep the data size within limits. In case a transient is detected, the system switches from a FIXFIX frame to a FIXVAR frame followed by a VARFIX frame such that an envelope border is fixed right before the detected transient. This procedure repeats whenever a transient is detected.
In case the energy changes only slowly, the transient detector will not detect the change. These changes may, however, be strong enough to generate perceivable artifacts if not treated appropriately. A simple solution would be to lower the threshold in the transient detector. This would, however, result in frequent switching between different frame classes (FIXFIX to FIXVAR+VARFIX). As a consequence, a significant amount of additional data would have to be transmitted, implying poor coding efficiency, especially if the slow increase lasts over a longer time (e.g. over multiple frames). This is not acceptable, since the signal does not have the complexity that would justify a higher data rate, and hence lowering the threshold is not an option to solve the problem.
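The trade-off described above can be illustrated with a minimal threshold detector. This is only a sketch under stated assumptions: the slot-to-slot energy ratio criterion, the function name, and the threshold values are hypothetical and are not the detector of the Standard or of WO 01/26095.

```python
def detect_transients(slot_energies, threshold=4.0, floor=1e-12):
    """Return the time slot indices whose energy exceeds the previous slot's
    energy by more than the factor `threshold` (hypothetical criterion)."""
    hits = []
    for i in range(1, len(slot_energies)):
        previous = max(slot_energies[i - 1], floor)  # avoid division by zero
        if slot_energies[i] / previous > threshold:
            hits.append(i)
    return hits


# Both signals rise by a factor of 100 in energy over 16 time slots.
abrupt = [1.0] * 8 + [100.0] * 8                  # sudden jump at slot 8
slow = [100.0 ** (i / 15) for i in range(16)]     # gradual rise, about 1.36x per slot

abrupt_hits = detect_transients(abrupt)                     # one detection, at the jump
slow_hits = detect_transients(slow)                         # no detection
slow_hits_low_threshold = detect_transients(slow, threshold=1.3)  # fires in every slot
```

With the default threshold, the abrupt jump is detected once while the slow rise is missed entirely. With the lowered threshold, the slow rise triggers a detection in every time slot, which corresponds to the frequent frame class switching and the additional data rate described above.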