Numerous techniques exist for compressing (with loss) an audio frequency signal such as speech or music.
The conventional coding methods for conversational applications are generally classified as waveform coding (PCM for “Pulse Code Modulation”, ADPCM for “Adaptive Differential Pulse Code Modulation”, transform coding, etc.), parametric coding (LPC for “Linear Predictive Coding”, sinusoidal coding, etc.) and parametric hybrid coding with a quantization of the parameters by “analysis by synthesis”, of which CELP (“Code Excited Linear Prediction”) coding is the best known example.
For the non-conversational applications, the prior art for (mono) audio signal coding consists of perceptual coding by transform or in subbands, with a parametric coding of the high frequencies by band replication.
A review of the conventional speech and audio coding methods can be found in the works by W. B. Kleijn and K. K. Paliwal (eds.), Speech Coding and Synthesis, Elsevier, 1995; M. Bosi, R. E. Goldberg, Introduction to Digital Audio Coding and Standards, Springer 2002; J. Benesty, M. M. Sondhi, Y. Huang (Eds.), Handbook of Speech Processing, Springer 2008.
The focus here is more particularly on the 3GPP standardized AMR-WB (“Adaptive Multi-Rate Wideband”) codec (coder and decoder), which operates at an input/output sampling frequency of 16 kHz and in which the signal is divided into two subbands: the low band (0-6.4 kHz), which is sampled at 12.8 kHz and coded by a CELP model, and the high band (6.4-7 kHz), which is reconstructed parametrically by “band extension” (or BWE, for “Bandwidth Extension”) with or without additional information depending on the mode of the current frame. It can be noted here that the limitation of the coded band of the AMR-WB codec to 7 kHz is essentially linked to the fact that the transmit frequency response of the wideband terminals was approximated at the time of standardization (ETSI/3GPP then ITU-T) according to the frequency mask defined in the standard ITU-T P.341, and more specifically by using a so-called “P341” filter defined in the standard ITU-T G.191 which cuts the frequencies above 7 kHz (this filter complies with the mask defined in P.341). However, in theory, it is well known that a signal sampled at 16 kHz can have a defined audio band from 0 to 8000 Hz; the AMR-WB codec therefore introduces a limitation of the high band by comparison with the theoretical bandwidth of 8 kHz.
The 3GPP AMR-WB speech codec was standardized in 2001 mainly for circuit-switched (CS) telephony applications on GSM (2G) and UMTS (3G). This same codec was also standardized in 2003 by the ITU-T in the form of recommendation G.722.2 “Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB)”.
It comprises nine bit rates, called modes, from 6.6 to 23.85 kbit/s, and comprises discontinuous transmission mechanisms (DTX, for “Discontinuous Transmission”) with voice activity detection (VAD) and comfort noise generation (CNG) from silence description frames (SID, for “Silence Insertion Descriptor”), and lost frame correction mechanisms (FEC for “Frame Erasure Concealment”, sometimes called PLC, for “Packet Loss Concealment”).
The details of the AMR-WB coding and decoding algorithm are not repeated here; a detailed description of this codec can be found in the 3GPP specifications (TS 26.190, 26.191, 26.192, 26.193, 26.194, 26.204) and in ITU-T-G.722.2 (and the corresponding annexes and appendix) and in the article by B. Bessette et al. entitled “The adaptive multirate wideband speech codec (AMR-WB)”, IEEE Transactions on Speech and Audio Processing, vol. 10, no. 8, 2002, pp. 620-636 and the source code of the associated 3GPP and ITU-T standards.
The principle of band extension in the AMR-WB codec is fairly rudimentary. Indeed, the high band (6.4-7 kHz) is generated by shaping a white noise through a time (applied in the form of gains per sub-frame) and frequency (by the application of a linear prediction synthesis filter or LPC, for “Linear Predictive Coding”) envelope. This band extension technique is illustrated in FIG. 1.
A white noise uHB1(n), n=0, . . . , 79 is generated at 16 kHz for each 5 ms sub-frame by a linear congruential generator (block 100). This noise uHB1(n) is shaped in time by application of gains for each sub-frame; this operation is broken down into two processing steps (blocks 102, 106 or 109):
A first factor is computed (block 101) to set the white noise uHB1(n) (block 102) at a level similar to that of the excitation, u(n), n=0, . . . , 63, decoded at 12.8 kHz in the low band:
$$u_{HB2}(n) = u_{HB1}(n)\,\sqrt{\frac{\sum_{l=0}^{63} u(l)^2}{\sum_{l=0}^{79} u_{HB1}(l)^2}}$$
It can be noted here that the normalization of the energies is done by comparing blocks of different sizes (64 samples for u(n) and 80 for uHB1(n)) without compensating for the difference in sampling frequencies (12.8 or 16 kHz).
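The noise generation and energy matching described above can be sketched as follows. This is only an illustration under stated assumptions: the generator constants (31821, 13849), the 16-bit state, the function names and the test excitation are hypothetical choices, not values taken from this text. As in the description, the energy is summed over 64 low-band samples and 80 high-band samples without any rate compensation.

```python
import math


def lcg_white_noise(seed, count=80):
    """Return `count` pseudo-random samples in [-1, 1) and the updated seed.

    Assumption: a 16-bit linear congruential generator with illustrative
    multiplier and increment; the text only says that an LCG is used.
    """
    samples = []
    for _ in range(count):
        seed = (seed * 31821 + 13849) & 0xFFFF
        # Reinterpret the 16-bit state as a signed value, then normalise.
        signed = seed - 0x10000 if seed & 0x8000 else seed
        samples.append(signed / 32768.0)
    return samples, seed


def scale_to_low_band_energy(u_hb1, u_low):
    """u_HB2(n) = u_HB1(n) * sqrt(sum u(l)^2 / sum u_HB1(l)^2)."""
    energy_low = sum(x * x for x in u_low)   # 64 samples at 12.8 kHz
    energy_hb1 = sum(x * x for x in u_hb1)   # 80 samples at 16 kHz
    if energy_hb1 == 0.0:
        return [0.0] * len(u_hb1)
    factor = math.sqrt(energy_low / energy_hb1)
    return [factor * x for x in u_hb1]


# Hypothetical low-band excitation for one 5 ms sub-frame (64 samples).
u_low = [math.sin(0.3 * n) for n in range(64)]
u_hb1, _ = lcg_white_noise(seed=21845)
u_hb2 = scale_to_low_band_energy(u_hb1, u_low)
```

After scaling, the 80-sample block u_hb2 carries the same total energy as the 64-sample block u_low, so its energy per sample is lower; this is the mismatch pointed out in the remark above.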
The excitation in the high band is then obtained (block 106 or 109) in the form:
$$u_{HB}(n) = \hat{g}_{HB}\,u_{HB2}(n)$$
in which the gain ĝHB is obtained differently depending on the bit rate. If the bit rate of the current frame is below 23.85 kbit/s, the gain ĝHB is estimated “blind” (that is to say without additional information). In this case, the block 103 filters the signal decoded in the low band by a high-pass filter having a cut-off frequency at 400 Hz to obtain a signal ŝhp(n), n=0, . . . , 63 (this high-pass filter eliminates the influence of the very low frequencies, which can skew the estimation made in the block 104); then the “tilt” (indicator of spectral slope), denoted etilt, of the signal ŝhp(n) is computed by normalized autocorrelation (block 104):
$$e_{tilt} = \frac{\sum_{n=1}^{63} \hat{s}_{hp}(n)\,\hat{s}_{hp}(n-1)}{\sum_{n=0}^{63} \hat{s}_{hp}(n)^2}$$
and finally, ĝHB is computed in the form:
$$\hat{g}_{HB} = w_{SP}\,g_{SP} + (1 - w_{SP})\,g_{BG}$$
in which gSP=1−etilt is the gain applied in the active speech (SP) frames, gBG=1.25gSP is the gain applied in the inactive speech frames associated with a background (BG) noise and wSP is a weighting function which depends on the voice activity detection (VAD). It is understood that the estimation of the tilt (etilt) makes it possible to adapt the level of the high band as a function of the spectral nature of the signal; this estimation is particularly important when the spectral slope of the CELP decoded signal is such that the average energy decreases when the frequency increases (case of a voiced signal, where etilt is close to 1 and gSP=1−etilt is therefore small). It should also be noted that the factor ĝHB in the AMR-WB decoding is bounded to take values within the range [0.1, 1.0].
At 23.85 kbit/s, a correction information item is transmitted by the AMR-WB coder and decoded (blocks 107, 108) in order to refine the gain estimated for each sub-frame (4 bits every 5 ms, or 0.8 kbit/s).
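The blind gain estimation used below 23.85 kbit/s can be sketched as follows. This is a minimal illustration of the formulas above, not the reference implementation: the function names are hypothetical, the input is assumed to be already high-pass filtered (block 103), and wSP is passed in as a plain number because its dependence on the VAD is not detailed here.

```python
def spectral_tilt(s_hp):
    """e_tilt = sum_{n>=1} s(n) s(n-1) / sum_{n>=0} s(n)^2."""
    energy = sum(x * x for x in s_hp)
    if energy == 0.0:
        return 0.0
    corr = sum(s_hp[n] * s_hp[n - 1] for n in range(1, len(s_hp)))
    return corr / energy


def blind_high_band_gain(s_hp, w_sp):
    """g_HB = w_SP * g_SP + (1 - w_SP) * g_BG, bounded to [0.1, 1.0]."""
    e_tilt = spectral_tilt(s_hp)
    g_sp = 1.0 - e_tilt      # gain for active speech frames
    g_bg = 1.25 * g_sp       # gain for background-noise frames
    g_hb = w_sp * g_sp + (1.0 - w_sp) * g_bg
    return min(1.0, max(0.1, g_hb))


# Toy sub-frame (64 samples) whose tilt is exactly 0.5.
toy = [1.0, 1.0] + [0.0] * 62
gain_speech = blind_high_band_gain(toy, w_sp=1.0)
gain_noise = blind_high_band_gain(toy, w_sp=0.0)

# A sign-alternating signal has a strongly negative tilt, so the gain saturates.
alternating = [(-1.0) ** n for n in range(64)]
gain_alternating = blind_high_band_gain(alternating, w_sp=1.0)
```

With the toy sub-frame, gSP is 0.5 and gBG is 0.625; the alternating signal gives a tilt of −63/64, so the unbounded gain of almost 2 is limited to 1.0.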
The artificial excitation uHB(n) is then filtered (block 111) by an LPC synthesis filter of transfer function 1/AHB(z) operating at the sampling frequency of 16 kHz. The construction of this filter depends on the bit rate of the current frame:
At 6.6 kbit/s, the filter 1/AHB(z) is obtained by weighting by a factor γ=0.9 an LPC filter of order 20, 1/Âext(z), which “extrapolates” the LPC filter of order 16, 1/Â(z), decoded in the low band (at 12.8 kHz); the details of the extrapolation in the domain of the ISF (Immittance Spectral Frequency) parameters are described in the standard G.722.2 in section 6.3.2.1. In this case, 1/AHB(z)=1/Âext(z/γ).
At bit rates above 6.6 kbit/s, the filter 1/AHB(z) is of order 16 and corresponds simply to 1/AHB(z)=1/Â(z/γ), in which γ=0.6. It should be noted that, in this case, the filter 1/Â(z/γ) is used at 16 kHz, which results in a spreading (by proportional transformation) of the frequency response of this filter from [0, 6.4 kHz] to [0, 8 kHz].

The result, sHB(n), is finally processed by a bandpass filter (block 112) of FIR (“Finite Impulse Response”) type, to keep only the 6-7 kHz band; at 23.85 kbit/s, a low-pass filter also of FIR type (block 113) is added to the processing to further attenuate the frequencies above 7 kHz. The high frequency (HF) synthesis is finally added (block 130) to the low frequency (LF) synthesis obtained with the blocks 120 to 123 and resampled at 16 kHz (block 123). Thus, even if the high band extends in theory from 6.4 to 7 kHz in the AMR-WB codec, the HF synthesis is rather contained in the 6-7 kHz band before addition with the LF synthesis.
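The weighting 1/Â(z/γ) and the all-pole synthesis filtering can be illustrated with the short sketch below. It assumes the sign convention A(z) = 1 + a1·z^−1 + … + ap·z^−p and a zero initial filter state; the function names and the first-order example are hypothetical, and neither the ISF extrapolation used at 6.6 kbit/s nor the FIR bandpass filter is reproduced.

```python
def weight_lpc(a, gamma):
    """Coefficients of A(z/gamma): a[i] is multiplied by gamma**i (a[0] == 1)."""
    return [coef * gamma ** i for i, coef in enumerate(a)]


def lpc_synthesis(a, excitation):
    """Filter `excitation` through 1/A(z), starting from a zero state.

    Assumed convention: A(z) = 1 + sum_{i>=1} a[i] z^-i, so that
    y(n) = x(n) - sum_{i>=1} a[i] y(n - i).
    """
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i in range(1, len(a)):
            if n - i >= 0:
                acc -= a[i] * out[n - i]
        out.append(acc)
    return out


# Hypothetical first-order example: A(z) = 1 - 0.9 z^-1, weighted by gamma = 0.6.
a_hat = [1.0, -0.9]
a_hb = weight_lpc(a_hat, gamma=0.6)          # about [1.0, -0.54]
impulse = [1.0, 0.0, 0.0, 0.0]
response = lpc_synthesis(a_hb, impulse)       # about [1, 0.54, 0.54**2, 0.54**3]
```

Multiplying the i-th coefficient by γ^i moves the poles of the filter towards the origin, which flattens the spectral envelope; a smaller γ (0.6 rather than 0.9) gives a flatter response.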
A number of drawbacks in the band extension technique of the AMR-WB codec can be identified:
The signal in the high band is a shaped white noise (shaped by temporal gains per sub-frame, by filtering by 1/AHB(z) and by bandpass filtering), which is not a good general model of the signal in the 6.4-7 kHz band. There are, for example, very harmonic music signals for which the 6.4-7 kHz band contains sinusoidal components (or tones) and little or no noise; for these signals the band extension of the AMR-WB codec greatly degrades the quality.
The low-pass filter at 7 kHz (block 113) introduces a shift of almost 1 ms between the low and high bands, which can potentially degrade the quality of certain signals by slightly desynchronizing the two bands at 23.85 kbit/s—this desynchronization can also pose problems when switching bit rate from 23.85 kbit/s to other modes.
The estimation of gains for each sub-frame (blocks 101, 103 to 105) is not optimal. In part, it is based on an equalization of the “absolute” energy per sub-frame (block 101) between signals at different sampling frequencies: the artificial excitation at 16 kHz (white noise) and a signal at 12.8 kHz (decoded ACELP excitation). It can be noted in particular that this approach implicitly induces an attenuation of the high-band excitation (by a ratio 12.8/16=0.8); it will also be noted that no de-emphasis is performed on the high band in the AMR-WB codec, which implicitly induces an amplification relatively close to 0.6 (which corresponds to the value of the frequency response of 1/(1−0.68z^−1) at 6400 Hz). In fact, the factors of 1/0.8 and of 0.6 approximately compensate each other.
Regarding speech, the 3GPP AMR-WB codec characterization tests documented in the 3GPP report TR 26.976 have shown that the mode at 23.85 kbit/s has a lower quality than the mode at 23.05 kbit/s, its quality being in fact similar to that of the mode at 15.85 kbit/s. This shows in particular that the level of the artificial HF signal has to be controlled very carefully, because the quality is degraded at 23.85 kbit/s even though the 4 bits transmitted per sub-frame are supposed to allow the energy of the original high frequencies to be approximated as closely as possible.
The limitation of the coded band to 7 kHz results from the application of a strict model of the transmission response of the acoustic terminals (filter P.341 in the ITU-T G.191 standard). However, for a sampling frequency of 16 kHz, the frequencies in the 7-8 kHz band remain important, particularly for music signals, to ensure a good quality level.
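The two numerical factors quoted in the gain-estimation drawback above (the ratio 12.8/16 and the response of the de-emphasis filter at 6400 Hz) can be checked with a short calculation. This is a sketch under the assumption that the de-emphasis filter 1/(1−0.68z^−1) is evaluated at the low-band sampling frequency of 12.8 kHz, where 6400 Hz is the Nyquist frequency; the function and variable names are hypothetical.

```python
import cmath
import math


def deemphasis_gain(freq_hz, fs_hz=12800.0, beta=0.68):
    """Magnitude of 1 / (1 - beta z^-1) at z = exp(j 2 pi f / fs)."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs_hz)
    return abs(1.0 / (1.0 - beta * z_inv))


rate_ratio = 12.8 / 16.0               # ratio of the two sampling frequencies
gain_6400 = deemphasis_gain(6400.0)    # z^-1 = -1 at Nyquist, so 1 / 1.68
```

At 6400 Hz the filter magnitude is 1/1.68, about 0.595, which is the value of roughly 0.6 mentioned in the text.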
The AMR-WB decoding algorithm has been improved partly with the development of the scalable ITU-T G.718 codec which was standardized in 2008.
The ITU-T G.718 standard comprises a so-called interoperable mode, for which the core coding is compatible with the G.722.2 (AMR-WB) coding at 12.65 kbit/s; furthermore, the G.718 decoder has the particular feature of being able to decode an AMR-WB/G.722.2 bit stream at all the possible bit rates of the AMR-WB codec (from 6.6 to 23.85 kbit/s).
The G.718 interoperable decoder in low delay mode (G.718-LD) is illustrated in FIG. 2. Below is a list of the improvements provided by the AMR-WB bit stream decoding functionality in the G.718 decoder, with references to FIG. 1 when necessary:
The band extension (described for example in clause 7.13.1 of Recommendation G.718, block 206) is identical to that of the AMR-WB decoder, except that the 6-7 kHz bandpass filter and the 1/AHB(z) synthesis filter (blocks 111 and 112) are applied in reverse order. In addition, at 23.85 kbit/s, the 4 bits transmitted per sub-frame by the AMR-WB coder are not used in the interoperable G.718 decoder; the synthesis of the high frequencies (HF) at 23.85 kbit/s is therefore identical to that at 23.05 kbit/s, which avoids the known problem of AMR-WB decoding quality at 23.85 kbit/s. Above all, the 7 kHz low-pass filter (block 113) is not used, and the specific decoding of the 23.85 kbit/s mode is omitted (blocks 107 to 109).
A post-processing of the synthesis at 16 kHz (see clause 7.14 of G.718) is implemented in G.718 by a “noise gate” in the block 208 (to “enhance” the quality of the silences by reducing their level), high-pass filtering (block 209), a low frequency post filter (called “bass postfilter”) in the block 210 attenuating the inter-harmonic noise at low frequencies, and a conversion to 16 bit integers with saturation control (with gain control or AGC) in the block 211.
However, the band extension in the AMR-WB and/or G.718 (interoperable mode) codecs is still limited on a number of aspects.
In particular, the synthesis of high frequencies by shaped white noise (by a temporal approach of LPC source-filter type) is a very limited model of the signal in the band of frequencies above 6.4 kHz.
Only the 6.4-7 kHz band is re-synthesized artificially, whereas a wider band (up to 8 kHz) is theoretically possible at the sampling frequency of 16 kHz, which can potentially enhance the quality of the signals if they are not pre-processed by a filter of P.341 type (50-7000 Hz) as defined in the Software Tool Library (standard G.191) of the ITU-T.
There is therefore a need to improve the band extension in a codec of AMR-WB type or an interoperable version of this codec or more generally to improve the band extension of an audio signal.