The present invention relates to audio coding and, in particular, to audio coding in the context of frequency enhancement, i.e., where a decoder output signal has a higher number of frequency bands than the encoded signal. Such procedures comprise bandwidth extension, spectral replication, or intelligent gap filling.
Contemporary speech coding systems are capable of encoding wideband (WB) digital audio content, that is, signals with frequencies of up to 7-8 kHz, at bitrates as low as 6 kbit/s. The most widely discussed examples are the ITU-T recommendations G.722.2 [1] as well as the more recently developed G.718 [4, 10] and MPEG-D Unified Speech and Audio Coding (USAC) [8]. Both G.722.2, also known as AMR-WB, and G.718 employ bandwidth extension (BWE) techniques between 6.4 and 7 kHz to allow the underlying ACELP core-coder to “focus” on the perceptually more relevant lower frequencies (particularly the ones at which the human auditory system is phase-sensitive), and thereby achieve sufficient quality, especially at very low bitrates. In the USAC eXtended High Efficiency Advanced Audio Coding (xHE-AAC) profile, enhanced spectral band replication (eSBR) is used for extending the audio bandwidth beyond the core-coder bandwidth, which is typically below 6 kHz at 16 kbit/s. Current state-of-the-art BWE processes can generally be divided into two conceptual approaches:
        Blind or artificial BWE, in which high-frequency (HF) components are reconstructed from the decoded low-frequency (LF) core-coder signal alone, i.e., without requiring side information transmitted from the encoder. This scheme is used by AMR-WB and G.718 at 16 kbit/s and below, as well as by some backward-compatible BWE post-processors operating on traditional narrowband telephonic speech [5, 9, 12] (example: FIG. 15).
        Guided BWE, which differs from blind BWE in that some of the parameters used for HF content reconstruction are transmitted to the decoder as side information instead of being estimated from the decoded core signal. AMR-WB, G.718, xHE-AAC, as well as some other codecs [2, 7, 11] use this approach, but not at very low bitrates (FIG. 16).
FIG. 15 illustrates such a blind or artificial bandwidth extension as described in the publication Bernd Geiser, Peter Jax, and Peter Vary: “ROBUST WIDEBAND ENHANCEMENT OF SPEECH BY COMBINED CODING AND ARTIFICIAL BANDWIDTH EXTENSION”, Proceedings of International Workshop on Acoustic Echo and Noise Control (IWAENC), 2005. The stand-alone bandwidth extension algorithm illustrated in FIG. 15 comprises an interpolation procedure 1500, an analysis filter 1600, an excitation extension 1700, a synthesis filter 1800, a feature extraction procedure 1510, an envelope estimation procedure 1520 and a statistical model 1530. After an interpolation of the narrowband signal to a wideband sample rate, a feature vector is computed. Then, by means of a pre-trained statistical hidden Markov model (HMM), an estimate for the wideband spectral envelope is determined in terms of linear prediction (LP) coefficients. These wideband coefficients are used for analysis filtering of the interpolated narrowband signal. After the extension of the resulting excitation, an inverse synthesis filter is applied. Because the excitation extension is chosen such that it does not alter the narrowband, the procedure is transparent with respect to the narrowband components.
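The analysis and synthesis filtering steps of FIG. 15 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the cited publication: the function names, the coefficient convention A(z) = 1 + a[1] z^-1 + ..., and the identity default for `extend_excitation` are hypothetical, and the feature extraction and HMM-based envelope estimation are omitted. The sketch shows that filtering with A(z) and then with 1/A(z) returns the input signal whenever the excitation is left unchanged.

```python
def analysis_filter(x, a):
    """Apply A(z) = 1 + a[1] z^-1 + ... to x; returns the excitation (residual)."""
    e = []
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc += a[k] * x[n - k]
        e.append(acc)
    return e


def synthesis_filter(e, a):
    """Apply the all-pole filter 1/A(z) to e, starting from a zero filter state."""
    y = []
    for n in range(len(e)):
        acc = e[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def blind_bwe_core(x_interp, a_wb, extend_excitation=lambda e: e):
    """Analysis filtering, excitation extension, synthesis filtering.

    x_interp: narrowband signal already interpolated to the wideband rate.
    a_wb: estimated wideband LP coefficients, with a_wb[0] == 1.
    extend_excitation: hypothetical hook; the identity default adds no
    highband content and only demonstrates the transparency property.
    """
    excitation = analysis_filter(x_interp, a_wb)
    extended = extend_excitation(excitation)
    return synthesis_filter(extended, a_wb)
```

With the identity default, `blind_bwe_core(x, a)` reproduces `x` up to rounding. A real excitation extension would add components only above the narrowband so that this property is preserved for the lower band.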
FIG. 16 illustrates a bandwidth extension with side information as described in the above-mentioned publication, the bandwidth extension comprising a telephone bandpass 1620, a side information extraction block 1610, a (joint) encoder 1630, a decoder 1640 and a bandwidth extension block 1650. This system for wideband enhancement of an error band speech signal by combined coding and bandwidth extension is illustrated in FIG. 16. At the transmitting terminal, the highband spectral envelope of the wideband input signal is analyzed and the side information is determined. The resulting message m is encoded either separately or jointly with the narrowband speech signal. At the receiver, the decoded side information is used to support the estimation of the wideband envelope within the bandwidth extension algorithm. The message m is obtained by several procedures. A spectral representation of frequencies from 3.4 kHz to 7 kHz is extracted from the wideband signal, which is available only at the sending side.
This subband envelope is computed by selective linear prediction, i.e., computation of the wideband power spectrum followed by an IDFT of its upper band components and the subsequent Levinson-Durbin recursion of order 8. The resulting subband LPC coefficients are converted into the cepstral domain and are finally quantized by a vector quantizer with a codebook of size M = 2^N. For a frame length of 20 ms, this results in a side information data rate of 300 bit/s. A combined estimation approach extends a calculation of a posteriori probabilities and reintroduces dependences on the narrowband feature. Thus, an improved form of error concealment is obtained which utilizes more than one source of information for its parameter estimation.
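The side information chain described above (band-selective autocorrelation, Levinson-Durbin recursion, cepstral conversion, vector quantization) can be sketched roughly as follows. This is a hedged illustration, not the method of the cited publication: all function names are hypothetical, the mapping of the selected band onto a half-sample cosine grid is one possible way to realize the IDFT of the upper band components, and the codebook contents are left to the caller. The value N = 6 bits per frame is inferred here from the stated 300 bit/s at a 20 ms frame length, which gives a codebook of 64 entries.

```python
import math


def band_autocorrelation(power_band, num_lags):
    """Inverse cosine transform of the selected band of the power spectrum.

    The band is treated as the full spectrum [0, pi] of a subband signal
    (assumption of this sketch); returns lags r[0..num_lags].
    """
    m = len(power_band)
    return [
        sum(p * math.cos(k * math.pi * (i + 0.5) / m)
            for i, p in enumerate(power_band)) / m
        for k in range(num_lags + 1)
    ]


def levinson_durbin(r, order):
    """Solve the LP normal equations; returns (a, err) with a[0] == 1
    for A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # updated prediction error
    return a, err


def lpc_to_cepstrum(a, num_ceps):
    """Cepstral coefficients c[1..num_ceps] of the all-pole model 1/A(z)."""
    order = len(a) - 1
    c = [0.0] * (num_ceps + 1)
    for n in range(1, num_ceps + 1):
        a_n = a[n] if n <= order else 0.0
        acc = sum(m * c[m] * a[n - m] for m in range(1, n) if n - m <= order)
        c[n] = -a_n - acc / n
    return c[1:]


def quantize(vector, codebook):
    """Index of the nearest codeword (squared Euclidean distance)."""
    def dist(cw):
        return sum((v - w) ** 2 for v, w in zip(vector, cw))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def side_info_rate(bits_per_frame, frame_seconds):
    """Side information rate in bit/s for one codebook index per frame."""
    return bits_per_frame / frame_seconds
```

For one frame, the message m would then be `quantize(lpc_to_cepstrum(levinson_durbin(band_autocorrelation(P_hf, 8), 8)[0], 8), codebook)`, where `P_hf` holds the power spectrum bins between 3.4 and 7 kHz and `codebook` has 2^N entries.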
A certain quality dilemma in WB codecs can be observed at low bitrates, typically below 10 kbit/s. On the one hand, such rates are already too low to justify the transmission of even moderate amounts of BWE data, ruling out typical guided BWE systems with 1 kbit/s or more of side information. On the other hand, a feasible blind BWE is found to sound significantly worse on at least some types of speech or music material because its parameters cannot be properly predicted from the core signal. This is particularly true for some vocal sounds, such as fricatives, with low correlation between HF and LF. It is therefore desirable to reduce the side information rate of a guided BWE scheme to a level far below 1 kbit/s, which would allow its adoption even in very-low-bitrate coding.
Numerous BWE approaches have been documented in recent years [1-10]. In general, all of these are either fully blind or fully guided at a given operating point, regardless of the instantaneous characteristics of the input signal. Furthermore, many blind BWE systems [1, 3, 4, 5, 9, 10] are optimized particularly for speech signals rather than for music and may therefore yield unsatisfactory results for music. Finally, most of the BWE realizations are relatively computationally complex, employing Fourier transforms, LPC filter computations, or vector quantization of the side information (Predictive Vector Coding in MPEG-D USAC [8]). This can be a disadvantage in the adoption of new coding technology in mobile telecommunication markets, given that the majority of mobile devices provide very limited computational power and battery capacity.
An approach which extends blind BWE with a small amount of side information is presented in [12] and is illustrated in FIG. 16. The side information “m”, however, is limited to the transmission of a spectral envelope of the bandwidth-extended frequency range.
A further problem of the procedure illustrated in FIG. 16 is the very complicated way of envelope estimation, which uses the lowband feature on the one hand and the additional envelope side information on the other hand. Both inputs, i.e., the lowband feature and the additional highband envelope, influence the statistical model. This results in a complicated decoder-side implementation, which is particularly problematic for mobile devices due to the increased power consumption. Furthermore, the statistical model is even more difficult to update because it is influenced not only by the lowband feature but also by the additional highband envelope data.