The present invention is based on audio coding and in particular on frequency enhancement procedures such as bandwidth extension, spectral band replication or intelligent gap filling.
The present invention is particularly related to non-guided frequency enhancement procedures, i.e. procedures in which the decoder side operates without side information or with only a minimum amount of side information.
Perceptual audio codecs often quantize and code only a lowpass part of the whole perceivable frequency range of an audio signal, especially when operated at (relatively) low bitrates. Although this approach guarantees an acceptable quality for the coded low-frequency signal, most listeners perceive the absence of the highpass part as a quality degradation. To overcome this issue, the missing high-frequency part can be synthesized by bandwidth extension schemes.
State-of-the-art codecs often use either a waveform-preserving coder, such as AAC, or a parametric coder, such as a speech coder, to code the low-frequency signal. These coders operate up to a certain stop frequency, which is called the crossover frequency. The frequency portion below the crossover frequency is called the low band. The signal above the crossover frequency, which is synthesized by means of a bandwidth extension scheme, is called the high band.
A bandwidth extension typically synthesizes the missing bandwidth (high band) by means of the transmitted signal (low band) and extra side information. If applied in the field of low-bitrate audio coding, the extra information should consume as little extra bitrate as possible. Thus, usually a parametric representation is chosen for the extra information. This parametric representation is either transmitted from the encoder at a comparably low bitrate (guided bandwidth extension) or estimated at the decoder based on specific signal characteristics (non-guided bandwidth extension). In the latter case, the parameters consume no bitrate at all.
The synthesis of the high band typically consists of two parts:
1. Generation of the high-frequency content. This can be done by either copying or flipping (parts of) the low frequency content to the high band, or inserting white or shaped noise or other artificial signal portions into the high band.
2. Adjustment of the generated high frequency content according to the parametric information. This includes manipulation of shape, tonality/noisiness and energy according to the parametric representation.
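The two parts listed above can be illustrated by the following minimal sketch. It is not taken from any particular codec. It assumes that the low band is available as a plain list of real-valued spectral coefficients, and the function names generate_high_band and adjust_energy, as well as the single target energy parameter, are hypothetical. Only the energy adjustment is shown; shape and tonality/noisiness adjustment are omitted.

```python
import math


def generate_high_band(low_band, num_high_bins, flip=False):
    """Part 1: fill the high band by copying (or flipping) low-band bins.

    low_band      -- list of spectral coefficients below the crossover
    num_high_bins -- number of high-band bins to generate (assumed parameter)
    flip          -- if True, the low band is mirrored before copying
    """
    if not low_band:
        raise ValueError("low band must not be empty")
    source = low_band[::-1] if flip else list(low_band)
    # Repeat the source patch as often as needed to fill the high band.
    return [source[k % len(source)] for k in range(num_high_bins)]


def adjust_energy(high_band, target_energy):
    """Part 2 (energy only): scale the generated bins to a target energy."""
    energy = sum(x * x for x in high_band)
    if energy == 0.0:
        return list(high_band)  # nothing to scale
    gain = math.sqrt(target_energy / energy)
    return [gain * x for x in high_band]


low = [4.0, 2.0, 1.0]
raw_high = generate_high_band(low, 5)
high = adjust_energy(raw_high, target_energy=1.0)
```

In a guided scheme the target energy would be derived from transmitted parameters, whereas in a non-guided scheme it would have to be estimated at the decoder.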
The goal of the synthesis process is usually to achieve a signal that is perceptually close to the original signal. If this goal cannot be met, the synthesized portion should be as little disturbing as possible for the listener.
Unlike a guided BWE scheme, a non-guided bandwidth extension cannot rely on extra information for the synthesis of the high band. Instead, it typically uses empirical rules to exploit correlation between low band and high band. Whereas most music pieces and voiced speech segments exhibit a high correlation between the high and low frequency bands, this is usually not the case for unvoiced or fricative speech segments. Fricative sounds have very little energy in the lower frequency range while having high energy above a certain frequency. If this frequency is close to the crossover frequency, then it can be problematic to generate the artificial signal above the crossover frequency, since in that case the low band contains few relevant signal parts. To cope with this problem, a good detection of such sounds is helpful.
HE-AAC is a well-known codec that consists of a waveform-preserving codec for the low band (AAC) and a parametric codec for the high band (SBR). At the decoder side, the high band signal is generated by transforming the decoded AAC signal into the frequency domain using a QMF filterbank. Subsequently, subbands of the low band signal are copied to the high band (generation of high frequency content). This high band signal is then adjusted in spectral envelope, tonality and noise floor based on the transmitted parametric side information (adjustment of the generated high frequency content). Since this method uses a guided BWE approach, a weak correlation between high and low band is in general not problematic and can be overcome by transmitting the appropriate parameter sets. However, this necessitates additional bitrate, which might not be acceptable for a given application scenario.
The ITU Standard G.722.2 is a speech codec that operates in the time domain only, i.e. without performing any calculations in the frequency domain. Such a decoder outputs a time domain signal with a sampling rate of 12.8 kHz, which is subsequently upsampled to 16 kHz. The generation of the high frequency content (6.4-7.0 kHz) is based on inserting bandpass noise. In most operation modes the spectral shaping of the noise is done without using any side information. Only in the operation mode with the highest bitrate is information about the noise energy transmitted in the bitstream. For reasons of simplicity, and since not all application scenarios can afford the transmission of extra parameter sets, in the following only the generation of the high band signal without using any side information is described.
For generating the high band signal, a noise signal is scaled to have the same energy as the core excitation signal. In order to give more energy to unvoiced parts of the signal, a spectral tilt e is calculated:
$$e = \frac{\sum_{n=1}^{63} s(n)\, s(n-1)}{\sum_{n=0}^{63} s^{2}(n)}$$

where s is the high-pass filtered decoded core signal with a cut-off frequency of 400 Hz and n is the sample index. In case of voiced segments, where less energy is present at high frequencies, e approaches 1, while for unvoiced segments e is close to zero. In order to have more energy in the high band signal for unvoiced speech, the energy of the noise is multiplied by (1−e). Finally, the scaled noise signal is filtered by a filter which is derived from the core Linear Predictive Coding (LPC) filter by extrapolation in the Line Spectral Frequency (LSF) domain.
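The tilt computation and the noise scaling described above can be sketched as follows. This is an illustration under stated assumptions and not the G.722.2 reference implementation: the function names spectral_tilt and scale_noise are hypothetical, the 400 Hz high-pass filtering and the subsequent LPC-based shaping are omitted, and the input frame is assumed to be already high-pass filtered.

```python
import math


def spectral_tilt(s):
    """Return e = sum_{n=1}^{N-1} s(n) s(n-1) / sum_{n=0}^{N-1} s(n)^2.

    s is one frame of the (already high-pass filtered) decoded core
    signal; N = 64 matches the summation limits in the text.
    """
    numerator = sum(s[n] * s[n - 1] for n in range(1, len(s)))
    denominator = sum(x * x for x in s)
    return numerator / denominator if denominator > 0.0 else 0.0


def scale_noise(noise, excitation, e):
    """Scale noise so that its energy equals (1 - e) times the
    energy of the core excitation signal."""
    target = sum(x * x for x in excitation) * (1.0 - e)
    current = sum(x * x for x in noise)
    if current == 0.0:
        return list(noise)
    gain = math.sqrt(max(target, 0.0) / current)
    return [gain * x for x in noise]


# Slowly varying frame (voiced-like): e is close to 1.
voiced = [math.sin(2.0 * math.pi * 0.02 * n) for n in range(64)]
e_voiced = spectral_tilt(voiced)

# Frame with negligible lag-one correlation (stand-in for an
# unvoiced frame): e is close to zero.
flat = [1.0, 1.0, -1.0, -1.0] * 16
e_flat = spectral_tilt(flat)
```

With these two test frames the noise keeps nearly the full excitation energy for the second frame and is strongly attenuated for the first one, which is the behavior the factor (1−e) is meant to produce.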
The non-guided bandwidth extension from G.722.2, which entirely operates in time domain, has the following drawbacks:
1. The generated HF content is based on noise. This creates audible artifacts if the HF signal is combined with a tonal, harmonic low-frequency signal (e.g. music). To avoid such artifacts, G.722.2 strongly limits the energy of the generated HF signal, which also limits potential benefits of the bandwidth extension. Thus, unfortunately also the maximum possible improvement of the brightness of a sound or the maximum obtainable increase in intelligibility of a speech signal is limited.
2. Since this non-guided bandwidth extension operates in the time domain, the filter operations cause additional algorithmic delay. This additional delay lowers the quality of the user experience in bi-directional communication scenarios or might not be allowed by the terms of requirement of a given communication technology standard.
3. Also, since the signal processing is performed in time domain, the filter operations are prone to instabilities. Moreover, the time domain filters have a high computational complexity.
4. Since only the overall sum of the energy of the high band signal is adapted to the energy of the core signal (and further weighted by the spectral tilt), there might be a significant local mismatch of energy at the crossover frequency between upper frequency range of the core signal (the signal just below the crossover frequency) and the high band signal. For example, this will be the case especially for tonal signals that exhibit an energy concentration in the very low frequency range but contain little energy in the upper frequency range.
5. Furthermore, it is computationally complex to estimate a spectral slope in a time domain representation. In frequency domain, an extrapolation of a spectral slope can be done very efficiently. Since most of the energy of e.g. fricatives is concentrated in the high frequency range, these may sound dull if a conservative energy and spectral slope estimation strategy like in G.722.2 is applied (see 1).
To summarize, the known non-guided or blind bandwidth extension schemes may necessitate a significant computational complexity on the decoder side and nevertheless result in a limited audio quality, specifically for problematic speech sounds such as fricatives. Furthermore, guided bandwidth extension schemes, although providing a better audio quality and sometimes necessitating less computational complexity on the decoder side, cannot provide substantial bitrate reductions, because the additional parametric information on the high band can necessitate a significant amount of additional bitrate with respect to the encoded core audio signal.
It is therefore an object of the present invention to provide an improved concept for audio processing in the context of non-guided frequency enhancement technologies.