Numerous techniques exist for compressing (with loss) an audio frequency signal such as speech or music.
The conventional coding methods for conversational applications are generally classified as waveform coding (PCM for “Pulse Code Modulation”, ADPCM for “Adaptive Differential Pulse Code Modulation”, transform coding, etc.), parametric coding (LPC for “Linear Predictive Coding”, sinusoidal coding, etc.) and parametric hybrid coding with a quantization of the parameters by “analysis by synthesis”, of which CELP (“Code Excited Linear Prediction”) coding is the best known example.
For the non-conversational applications, the prior art for (mono) audio signal coding consists of perceptual coding by transform or in subbands, with a parametric coding of the high frequencies by band replication.
A review of the conventional speech and audio coding methods can be found in the works by W. B. Kleijn and K. K. Paliwal (eds.), Speech Coding and Synthesis, Elsevier, 1995; M. Bosi, R. E. Goldberg, Introduction to Digital Audio Coding and Standards, Springer 2002; J. Benesty, M. M. Sondhi, Y. Huang (Eds.), Handbook of Speech Processing, Springer 2008.
The focus here is more particularly on the 3GPP standardized AMR-WB (“Adaptive Multi-Rate Wideband”) codec (coder and decoder), which operates at an input/output frequency of 16 kHz and in which the signal is divided into two subbands: the low band (0-6.4 kHz), which is sampled at 12.8 kHz and coded by a CELP model, and the high band (6.4-7 kHz), which is reconstructed parametrically by “band extension” (or BWE, for “Bandwidth Extension”) with or without additional information depending on the mode of the current frame. It can be noted here that the limitation of the coded band of the AMR-WB codec at 7 kHz is essentially linked to the fact that the frequency response in transmission of the wideband terminals was approximated at the time of standardization (ETSI/3GPP then ITU-T) according to the frequency mask defined in the standard ITU-T P.341, and more specifically by using a so-called “P341” filter defined in the standard ITU-T G.191 which cuts the frequencies above 7 kHz (this filter observes the mask defined in P.341). However, in theory, it is well known that a signal sampled at 16 kHz can have a defined audio band from 0 to 8000 Hz; the AMR-WB codec therefore introduces a limitation of the high band by comparison with the theoretical bandwidth of 8 kHz.
The 3GPP AMR-WB speech codec was standardized in 2001 mainly for the circuit mode (CS) telephony applications on GSM (2G) and UMTS (3G). This same codec was also standardized in 2003 by the ITU-T in the form of recommendation G.722.2 “Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB)”.
It comprises nine bit rates, called modes, from 6.6 to 23.85 kbit/s, and comprises discontinuous transmission mechanisms (DTX, for “Discontinuous Transmission”) with voice activity detection (VAD) and comfort noise generation (CNG) from silence description frames (SID, for “Silence Insertion Descriptor”), and lost frame correction mechanisms (FEC for “Frame Erasure Concealment”, sometimes called PLC, for “Packet Loss Concealment”).
The details of the AMR-WB coding and decoding algorithm are not repeated here; a detailed description of this codec can be found in the 3GPP specifications (TS 26.190, 26.191, 26.192, 26.193, 26.194, 26.204), in ITU-T G.722.2 (and the corresponding annexes and appendix), in the article by B. Bessette et al. entitled “The adaptive multirate wideband speech codec (AMR-WB)”, IEEE Transactions on Speech and Audio Processing, vol. 10, no. 8, 2002, pp. 620-636, and in the source code of the associated 3GPP and ITU-T standards.
The principle of band extension in the AMR-WB codec is fairly rudimentary. Indeed, the high band (6.4-7 kHz) is generated by shaping a white noise through a time (applied in the form of gains per subframe) and frequency (by the application of a linear prediction synthesis filter or LPC, for “Linear Predictive Coding”) envelope. This band extension technique is illustrated in FIG. 1.
A white noise uHB1(n), n=0, . . . , 79, is generated at 16 kHz for each 5 ms subframe by a linear congruential generator (block 100). This noise uHB1(n) is shaped in time by application of gains for each subframe; this operation is broken down into two processing steps (blocks 102, 106 or 109). A first factor is computed (block 101) to set the white noise uHB1(n) (block 102) at a level similar to that of the excitation, u(n), n=0, . . . , 63, decoded at 12.8 kHz in the low band:
$$u_{HB2}(n) = u_{HB1}(n)\,\sqrt{\frac{\sum_{l=0}^{63} u(l)^{2}}{\sum_{l=0}^{79} u_{HB1}(l)^{2}}}$$
It can be noted here that the normalization of the energies is done by comparing blocks of different size (64 for u(n) and 80 for uHB1(n)) without compensation of the differences in sampling frequencies (12.8 or 16 kHz).
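The energy equalization described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the function names, the seed and the generator constants are assumptions made for the example, and the test excitation is synthetic.

```python
import math

def lcg_noise(seed, count):
    """Hypothetical 16-bit linear congruential generator.

    The multiplier and increment are illustrative assumptions, not values
    taken from the text above.
    """
    samples = []
    state = seed & 0xFFFF
    for _ in range(count):
        state = (state * 31821 + 13849) & 0xFFFF
        # Interpret the 16-bit state as a signed value in [-32768, 32767].
        samples.append(state - 65536 if state >= 32768 else state)
    return samples, state

def match_energy(noise, excitation):
    """Scale `noise` so that its block energy equals that of `excitation`."""
    e_exc = sum(x * x for x in excitation)
    e_noise = sum(x * x for x in noise)
    if e_noise == 0.0:
        return [0.0] * len(noise)
    scale = math.sqrt(e_exc / e_noise)
    return [scale * x for x in noise]

# One 5 ms subframe: 80 noise samples at 16 kHz, 64 excitation samples at 12.8 kHz.
u_hb1, _ = lcg_noise(seed=21845, count=80)
u = [1000.0 * math.sin(0.3 * n) for n in range(64)]  # synthetic stand-in for u(n)
u_hb2 = match_energy(u_hb1, u)

# Block energies are equal, but the mean power per sample is not:
# the 80-sample block carries 64/80 of the per-sample power of the 64-sample block.
power_ratio = (sum(x * x for x in u_hb2) / 80) / (sum(x * x for x in u) / 64)
```

Because the two blocks have different lengths, equal block energy means that `power_ratio` equals 64/80, which is the uncompensated mismatch pointed out above.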
The excitation in the high band is then obtained (block 106 or 109) in the form:
$$u_{HB}(n) = \hat{g}_{HB}\,u_{HB2}(n)$$
in which the gain ĝHB is obtained differently depending on the bit rate. If the bit rate of the current frame is <23.85 kbit/s, the gain ĝHB is estimated “blind” (that is to say without additional information). In this case, block 103 filters the signal decoded in the low band with a high-pass filter having a cut-off frequency of 400 Hz to obtain a signal ŝhp(n), n=0, . . . , 63; this high-pass filter eliminates the influence of the very low frequencies, which can skew the estimation made in block 104. The “tilt” (indicator of spectral slope), denoted etilt, of the signal ŝhp(n) is then computed by normalized autocorrelation (block 104):
$$e_{tilt} = \frac{\sum_{n=1}^{63} \hat{s}_{hp}(n)\,\hat{s}_{hp}(n-1)}{\sum_{n=0}^{63} \hat{s}_{hp}(n)^{2}}$$
and finally, ĝHB is computed in the form:
$$\hat{g}_{HB} = w_{SP}\,g_{SP} + (1 - w_{SP})\,g_{BG}$$
in which gSP=1−etilt is the gain applied in the active speech (SP) frames, gBG=1.25gSP is the gain applied in the inactive speech frames associated with a background (BG) noise and wSP is a weighting function which depends on the voice activity detection (VAD). It is understood that the estimation of the tilt (etilt) makes it possible to adapt the level of the high band as a function of the spectral nature of the signal; this estimation is particularly important when the spectral slope of the CELP decoded signal is such that the average energy decreases when the frequency increases (case of a voiced signal where etilt is close to 1, so that gSP=1−etilt is reduced). It should also be noted that the factor ĝHB in the AMR-WB decoding is bounded to take values within the range [0.1, 1.0]. Indeed, for the signals whose energy increases when the frequency increases (etilt close to −1, gSP close to 2), the gain ĝHB is usually underestimated.
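The blind gain estimation can be illustrated with the short sketch below. It follows the three relations given above; the function names are hypothetical, the weight wSP is simply passed in (its derivation from the VAD is not detailed here), and applying the [0.1, 1.0] bound as the last step is an assumption of this example.

```python
def spectral_tilt(s_hp):
    """Normalized first-lag autocorrelation of the high-pass filtered signal."""
    num = sum(s_hp[n] * s_hp[n - 1] for n in range(1, len(s_hp)))
    den = sum(x * x for x in s_hp)
    return num / den if den > 0.0 else 0.0

def blind_hb_gain(s_hp, w_sp):
    """Blind high-band gain from the tilt, bounded to [0.1, 1.0].

    `w_sp` is the VAD-dependent weight in [0, 1], supplied by the caller.
    Clamping at the very end is an assumption of this sketch.
    """
    e_tilt = spectral_tilt(s_hp)
    g_sp = 1.0 - e_tilt           # gain for active speech frames
    g_bg = 1.25 * g_sp            # gain for background-noise frames
    g_hb = w_sp * g_sp + (1.0 - w_sp) * g_bg
    return min(max(g_hb, 0.1), 1.0)

# A slowly varying (voiced-like) block has a tilt close to 1 and a small gain;
# a block alternating in sign has a tilt close to -1 and hits the upper bound.
flat_block = [1.0] * 64
alternating_block = [(-1.0) ** n for n in range(64)]
gain_flat = blind_hb_gain(flat_block, w_sp=1.0)
gain_alternating = blind_hb_gain(alternating_block, w_sp=1.0)
```

With the alternating block, gSP is close to 2 but the result is capped at 1.0, which illustrates the underestimation mentioned above for signals whose energy increases with frequency.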
At 23.85 kbit/s, a correction information item is transmitted by the AMR-WB coder and decoded (blocks 107, 108) in order to refine the gain estimated for each subframe (4 bits every 5 ms, or 0.8 kbit/s). The artificial excitation uHB(n) is then filtered (block 111) by an LPC synthesis filter of transfer function 1/AHB(z) operating at the sampling frequency of 16 kHz. The construction of this filter depends on the bit rate of the current frame:
    At 6.6 kbit/s, the filter 1/AHB(z) is obtained by weighting by a factor γ=0.9 an LPC filter of order 20, 1/Âext(z), which “extrapolates” the LPC filter of order 16, 1/Â(z), decoded in the low band (at 12.8 kHz); the details of the extrapolation in the domain of the ISF (Immittance Spectral Frequency) parameters are described in the standard G.722.2 in section 6.3.2.1. In this case, 1/AHB(z)=1/Âext(z/γ).
    At the bit rates >6.6 kbit/s, the filter 1/AHB(z) is of order 16 and corresponds simply to 1/AHB(z)=1/Â(z/γ), in which γ=0.6. It should be noted that, in this case, the filter 1/Â(z/γ) is used at 16 kHz, which results in a spreading (by proportional transformation) of the frequency response of this filter from [0, 6.4 kHz] to [0, 8 kHz].
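The weighting A(z/γ) amounts to multiplying the i-th LPC coefficient by γ^i, after which the excitation is passed through the all-pole filter. The sketch below shows both steps under simple assumptions: the function names are hypothetical, the coefficient convention is A(z) = Σ a[i] z^-i with a[0] = 1, and the low-order coefficients used in the example are invented for illustration.

```python
def weight_lpc(a, gamma):
    """Coefficients of A(z/gamma), given A(z) = sum_i a[i] * z**-i."""
    return [coef * gamma ** i for i, coef in enumerate(a)]

def synthesis_filter(a, excitation):
    """All-pole filtering of `excitation` by 1/A(z), assuming a[0] == 1."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i in range(1, len(a)):
            if n - i >= 0:
                acc -= a[i] * out[n - i]
        out.append(acc)
    return out

# Invented order-2 example; the codec itself uses order 16 (or 20 at 6.6 kbit/s).
a_hat = [1.0, -0.5, 0.25]
a_hb = weight_lpc(a_hat, gamma=0.6)
s_hb = synthesis_filter(a_hb, [1.0, 0.0, 0.0, 0.0])

# Running a filter designed at 12.8 kHz on samples at 16 kHz stretches its
# frequency axis by 16/12.8, so that 6.4 kHz is mapped to 8 kHz.
stretch = 16000.0 / 12800.0
```

A smaller γ pulls the poles towards the origin and flattens the envelope, which is why γ=0.6 gives a much smoother high-band shape than γ=0.9.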
The result, sHB(n), is finally processed by a bandpass filter (block 112) of FIR (“Finite Impulse Response”) type, to keep only the 6-7 kHz band; at 23.85 kbit/s, a low-pass filter also of FIR type (block 113) is added to the processing to further attenuate the frequencies above 7 kHz. The high frequency (HF) synthesis is finally added (block 130) to the low frequency (LF) synthesis obtained with the blocks 120 to 122 and re-sampled at 16 kHz (block 123). Thus, even if the high band extends in theory from 6.4 to 7 kHz in the AMR-WB codec, the HF synthesis is rather contained in the 6-7 kHz band before addition with the LF synthesis.
A number of drawbacks in the band extension technique of the AMR-WB codec can be identified, in particular:
    The estimation of gains for each subframe (blocks 101, 103 to 105) is not optimal. Partly, it is based on an equalization of the “absolute” energy per subframe (block 101) between signals at different sampling frequencies: an artificial excitation at 16 kHz (white noise) and a signal at 12.8 kHz (decoded ACELP excitation). It can be noted in particular that this approach implicitly induces an attenuation of the high-band excitation (by a ratio 12.8/16=0.8); it will also be noted that no de-emphasis is performed on the high band in the AMR-WB codec, which implicitly induces a relative amplification close to 0.6 (which corresponds to the value of the frequency response of 1/(1−0.68z^-1) at 6400 Hz). In fact, the factors of 1/0.8 and of 0.6 approximately compensate each other.
    Regarding speech, the 3GPP AMR-WB codec characterization tests documented in the 3GPP report TR 26.976 have shown that the mode at 23.85 kbit/s has a lower quality than the mode at 23.05 kbit/s, its quality being in fact similar to that of the mode at 15.85 kbit/s. This shows in particular that the level of the artificial HF signal has to be controlled very prudently, because the quality is degraded at 23.85 kbit/s even though the 4 bits per subframe are supposed to make it possible to best approximate the energy of the original high frequencies.
    The low-pass filter at 7 kHz (block 113) introduces a shift of almost 1 ms between the low and high bands, which can potentially degrade the quality of certain signals by slightly desynchronizing the two bands at 23.85 kbit/s; this desynchronization can also pose problems when switching bit rate from 23.85 kbit/s to other modes.
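The numerical values quoted in the first drawback can be checked with a few lines. This is only an arithmetic sketch: the function name is hypothetical, and the de-emphasis filter is assumed to run at the 12.8 kHz low-band sampling frequency, so that 6400 Hz is its Nyquist frequency.

```python
import cmath
import math

def deemphasis_gain(freq_hz, fs_hz=12800.0, beta=0.68):
    """Magnitude of 1 / (1 - beta * z**-1) at `freq_hz` for sampling rate `fs_hz`."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs_hz)
    return abs(1.0 / (1.0 - beta * z_inv))

# Response of the de-emphasis filter at the top of the low band (6400 Hz).
gain_at_6400 = deemphasis_gain(6400.0)

# Ratio of the two sampling frequencies involved in the energy equalization.
rate_ratio = 12800.0 / 16000.0

# Side information at 23.85 kbit/s: 4 bits per 5 ms subframe, in bit/s.
side_info_bps = 4 / 0.005
```

At 6400 Hz the term z^-1 equals −1, so the magnitude reduces to 1/1.68, which is the value of about 0.6 mentioned above.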
An example of band extension via a temporal approach is described in the 3GPP standard TS 26.290 describing the AMR-WB+ codec (standardized in 2005). This example is illustrated in the block diagrams of FIGS. 2a (general block diagram) and 2b (gain prediction by response level correction) which correspond respectively to FIGS. 16 and 10 of the 3GPP specification TS 26.290.
In the AMR-WB+ codec, the (mono) input signal sampled at the frequency Fs (in Hz) is divided into two separate frequency bands, in which two LPC filters are computed and coded separately:
    one LPC filter, denoted A(z), in the low band (0−Fs/4); its quantized version is denoted Â(z);
    another LPC filter, denoted AHF(z), in the spectrally aliased high band (Fs/4−Fs/2); its quantized version is denoted ÂHF(z).
The band extension is done in the AMR-WB+ codec as detailed in sections 5.4 (HF coding) and 6.2 (HF decoding) of the 3GPP specification TS 26.290. The principle thereof is summarized here: the extension consists in using the excitation decoded at low frequencies (LFC excit.) and in shaping this excitation by a temporal gain per subframe (block 205) and an LPC synthesis filtering (block 207); processing operations to enhance (post-process) the excitation (block 206) and to smooth the energy of the reconstructed HF signal (block 208) are moreover implemented as illustrated in FIG. 2a.
It is important to note that this extension in AMR-WB+ necessitates the transmission of additional information: the coefficients of the filter ÂHF(z) in 204 and a temporal shaping gain per subframe (block 201). One particular feature of the band extension algorithm in AMR-WB+ is that the gain per subframe is quantized by a predictive approach; in other words, the gains are not coded directly, but rather gain corrections which are relative to an estimation of the gain denoted gmatch. This estimation, gmatch, actually corresponds to a level equalization factor between the filters Â(z) and ÂHF(z) at the frequency of separation between low band and high band (Fs/4). The computation of the factor gmatch (block 203) is detailed in FIG. 10 of the 3GPP specification TS 26.290, reproduced here in FIG. 2b. This figure will not be described further here. It will simply be noted that the blocks 210 to 213 are used to compute the energy of the impulse response of
$$\frac{\hat{A}(z)}{(1 - 0.9\,z^{-1})\,\hat{A}_{HF}(z)},$$
while recalling that the filter ÂHF(z) models a spectrally aliased high band (because of the spectral properties of the filter bank separating the low and high bands). Since the filters are interpolated by subframes, the gain gmatch is computed only once per frame, and it is interpolated by subframes.
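The energy computation mentioned above can be sketched as follows. This only illustrates how the energy of a truncated impulse response of Â(z)/((1−0.9z^-1)ÂHF(z)) might be evaluated; the function names, the truncation length of 64 samples and the coefficient convention are assumptions of the example, and the way gmatch is derived from this energy is not shown.

```python
def poly_mul(p, q):
    """Product of two polynomials in z**-1, given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out

def impulse_response(num, den, length):
    """First `length` samples of the impulse response of num(z)/den(z)."""
    h = []
    for n in range(length):
        acc = num[n] if n < len(num) else 0.0
        for i in range(1, len(den)):
            if n - i >= 0:
                acc -= den[i] * h[n - i]
        h.append(acc / den[0])
    return h

def matching_energy(a_lf, a_hf, length=64):
    """Energy of the truncated impulse response of A_lf(z) / ((1 - 0.9 z^-1) A_hf(z)).

    `length` is a hypothetical truncation chosen for this sketch.
    """
    den = poly_mul([1.0, -0.9], a_hf)
    h = impulse_response(a_lf, den, length)
    return sum(x * x for x in h)

# Degenerate check with flat envelopes: only the 1/(1 - 0.9 z^-1) term remains,
# whose impulse response is 0.9**n.
energy_flat = matching_energy([1.0], [1.0])
```

In the degenerate case the result can be compared with the closed form of the geometric series, (1 − 0.81^64)/(1 − 0.81), which gives a simple sanity check of the recursion.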
The band extension gain coding technique in AMR-WB+, and more particularly the compensation of levels of the LPC filters at their junction, is an appropriate method in the context of a band extension by LPC models in low and high band, and it can be noted that such a level compensation between LPC filters is not present in the band extension of the AMR-WB codec. However, it is in practice possible to verify that the direct equalization of the level between the two LPC filters at the separation frequency is not an optimal method and can provoke an overestimation of energy in the high band and audible artifacts in certain cases; it will be recalled that an LPC filter represents a spectral envelope, and the principle of equalization of the level between two LPC filters for a given frequency amounts to adjusting the relative level of two LPC envelopes. Now, such an equalization performed at a precise frequency does not ensure a complete continuity and overall consistency of the energy (in frequency) in the vicinity of the equalization point when the frequency envelope of the signal fluctuates significantly in this vicinity. A mathematical way of posing the problem consists in noting that the continuity between two curves can be ensured by forcing them to meet at one and the same point, but there is nothing to guarantee that the local properties (successive derivatives) coincide so as to ensure a more global consistency. The risk in ensuring a pointwise continuity between low and high band LPC envelopes is that of setting the LPC envelope in the high band at a relative level that is too strong or too weak, the case of a level that is too strong being more damaging because it results in more annoying artifacts.
Moreover, the gain compensation in AMR-WB+ is primarily a prediction of the gain known to the coder and to the decoder, which serves to reduce the bit rate necessary for the transmission of gain information scaling the high-band excitation signal. Now, in the context of an interoperable enhancement of the AMR-WB coding/decoding, it is not possible to modify the existing coding of the gains by subframes (0.8 kbit/s) of the band extension in the AMR-WB 23.85 kbit/s mode. Furthermore, for the bit rates strictly less than 23.85 kbit/s, the compensation of levels of LPC filters in low and high bands can be applied in the band extension of a decoding compatible with AMR-WB, but experience shows that this sole technique derived from the AMR-WB+ coding, applied without optimization, can cause problems of overestimation of energy of the high band (>6 kHz).
There is therefore a need to improve the compensation of gains between linear prediction filters of different frequency bands for the frequency band extension in a codec of AMR-WB type or an interoperable version of this codec without in any way overestimating the energy in a frequency band and without requiring additional information from the coder.