Many techniques exist at present for converting an audio-frequency (speech and/or audio) signal into the form of a digital signal and processing the signals digitized in this way. The standard high-quality audio coding methods are generally classified as “waveform coding”, “parametric coding by analysis by synthesis”, and “perceptual coding in sub-bands or by transforms”.
The first category includes quantizing techniques with or without memory such as PCM or ADPCM coding.
The second category includes techniques that represent the signal by means of a model, generally a linear predictive model, having parameters that are determined using methods derived from waveform coding. For this reason, this category is often referred to as hybrid coding. For example, CELP (code excited linear prediction) coding belongs to this second category. In CELP coding, the input signal is coded by means of a “source-filter” model inspired by the speech production process. The parameters transmitted represent separately the source (or “excitation”) and the filter. The filter is generally an all-pole filter. The basic concepts of coding audio-frequency signals and more particularly of CELP coding and quantization are explained in the following works in particular: W. B. Kleijn and K. K. Paliwal, editors, Speech Coding and Synthesis, Elsevier, 1995, and Nicolas Moreau, Techniques de compression des signaux [Signal compression techniques], Collection Technique et Scientifique des Télécommunications, Masson, 1995.
The third category includes coding techniques such as MPEG 1 and 2 Layer III, better known as MP3, or MPEG 4 AAC.
The ITU-T G.729 system is one example of CELP coding designed for speech signals in the telephone band (300 hertz (Hz)-3400 Hz) sampled at 8 kilohertz (kHz). It operates at a fixed bitrate of 8 kilobits per second (kbps) with 10 millisecond (ms) frames. Its operation is specified in detail in ITU-T Recommendation G.729, Coding of Speech at 8 kbps using Conjugate Structure Algebraic Code Excited Linear Prediction (CS-ACELP), March 1996.
FIGS. 1(a), 1(b) and 1(c) together constitute a simplified diagram of the associated coder and decoder. FIG. 1(c) shows how the G.729 decoder reconstructs the speech signal from data supplied by the demultiplexer (112). The excitation is reconstituted in 5 ms sub-frames by adding two contributions:
- an innovator code (113), 5 ms long, consisting of four ±1 pulses scaled by a gain gc (114 and 118), the remaining samples being zero;
- a 5 ms block taken from the past of the excitation and shifted by a fractional delay (specified by the pitch parameters T0, T0_frac) (115 and 116), scaled by a gain gp (117 and 118).
The excitation decoded in this way is shaped by a 10th order LPC (linear predictive coding) synthesis filter 1/A(z) (120), having coefficients that are decoded (119) in the LSF (line spectral frequencies) domain and interpolated at the 5 ms sub-frame level. To improve quality and to mask certain coding artefacts, the reconstructed signal is then processed by an adaptive post-filter (121) and a post-processing high-pass filter (122). The FIG. 1(c) decoder therefore relies on the "source-filter" model to synthesize the signal. The parameters associated with this model are listed in the FIG. 2 table, with those describing the excitation distinguished from those describing the filter.
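The sub-frame reconstruction described above can be sketched as follows. This is a schematic illustration only, not the bit-exact G.729 procedure: the function name is hypothetical, the pitch delay is restricted to an integer value at least equal to the sub-frame length, and the post-filtering stages are omitted.

```python
import numpy as np

def decode_subframe(exc_mem, code, T0, gp, gc, a, syn_mem):
    """Schematic CELP sub-frame decoding:
       excitation = gp * (past excitation delayed by T0) + gc * (innovator code),
       then shaping by the all-pole synthesis filter 1/A(z).
       Assumes an integer delay with T0 >= len(code)."""
    L = len(code)
    # Adaptive contribution: block taken T0 samples back in the past excitation.
    adaptive = exc_mem[len(exc_mem) - T0 : len(exc_mem) - T0 + L]
    exc = gp * adaptive + gc * np.asarray(code, dtype=float)
    # Direct-form all-pole filtering: s[n] = exc[n] - sum_k a[k] * s[n-k],
    # with a = [1, a1, ..., aP] the coefficients of A(z).
    out = np.zeros(L)
    hist = list(syn_mem)  # past synthesis samples, most recent first
    for n in range(L):
        s = exc[n] - sum(a[k + 1] * hist[k] for k in range(len(a) - 1))
        out[n] = s
        hist = [s] + hist[:-1]
    return exc, out
```

With gp = 0 and a single-pulse code, the output is simply the impulse response of 1/A(z) scaled by gc, which makes the "source-filter" behaviour easy to check.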
FIG. 1(a) represents a very high level diagram of the G.729 coder. It shows the pre-processing high-pass filtering (101), the LPC analysis and quantization (102), the coding of the excitation (103), and the multiplexing of the coding parameters (104). The pre-processing and LPC analysis and quantization blocks of the G.729 coder are not discussed here; for more details see the ITU-T recommendation referred to above. FIG. 1(b) is a diagram of the excitation coding. It shows how the excitation parameters listed in FIG. 2 are determined and quantized. The excitation is coded in three steps:
- determination of the pitch delay (106) and estimation of the pitch gain (107);
- determination of the parameters of the innovator code in the ACELP dictionary (positions and signs of the 4 pulses (108)) and estimation of the code gain (109);
- joint coding of the pitch and code gains.
The excitation parameters are determined by minimizing the quadratic error (111) between the CELP target (105) and the excitation filtered by W(z)/Â(z) (110). This process of analysis by synthesis is described in detail in the ITU-T recommendation referred to above.
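The minimization underlying this analysis-by-synthesis search can be sketched as follows. This is a minimal illustration of the principle only (the function name is hypothetical and the dictionary is exhaustively enumerated, unlike the structured ACELP search): each candidate codevector is filtered by the impulse response h of the weighted synthesis filter, and the entry maximizing the normalized correlation with the target is retained, which minimizes the quadratic error at the optimal gain.

```python
import numpy as np

def search_codebook(target, codebook, h):
    """For each candidate codevector c, form y = h * c (filtered codevector)
    and pick the entry maximizing <x,y>^2 / <y,y>; this minimizes
    ||x - g*y||^2 when the gain is set to its optimal value g = <x,y>/<y,y>."""
    best_index, best_gain, best_crit = -1, 0.0, -np.inf
    for i, c in enumerate(codebook):
        y = np.convolve(c, h)[: len(target)]  # zero-state filtering, truncated
        num = np.dot(target, y)
        den = np.dot(y, y)
        if den > 0 and num * num / den > best_crit:
            best_index, best_gain, best_crit = i, num / den, num * num / den
    return best_index, best_gain
```

When the target is exactly a scaled, filtered codevector, the search recovers that entry and its gain, which is a convenient sanity check of the criterion.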
In practice, the complexity of the G.729 coder/decoder (codec) is relatively high (around 18 WMOPS (weighted million operations per second)). To meet the requirements of applications such as simultaneous transmission of voice and data via DSVD (digital simultaneous voice and data) modems, an interworking system of lesser complexity (around 9 WMOPS) is also recommended by the ITU-T: the G.729A codec. This is described and compared to the G.729 codec in R. Salami et al., Description of ITU-T Recommendation G.729 Annex A: Reduced complexity 8 kbps CS-ACELP codec, ICASSP 1997.
Of the significant differences between G.729 and G.729A, the one that reduces the G.729 complexity the most relates to searching in the ACELP dictionary: in the G.729A coder a depth-first search of the four signed pulses replaces the interleaved loop search used in the G.729 coder. By virtue of its low complexity, the G.729A codec is now very widely used in voice over IP or ATM applications in the telephone band (300-3400 Hz).
With the growth of optical fiber and broadband networks such as ADSL, deploying new services can now be envisaged, such as bidirectional communication of much higher quality than standard systems using the telephone band. One step in this direction is to provide “wideband” quality, i.e. to use audio-frequency signals sampled at 16 kHz and limited to a usable band of 50 Hz-7000 Hz. The quality obtained is then similar to that of AM radio.
The choice of a codec for deploying "wideband" quality instead of "narrowband" quality must take a number of important factors into account.
- The infrastructure of existing IP networks and connection points (telephone modems, ADSL, LAN, WiFi, etc.) is extremely heterogeneous in terms of bitrate and quality of service, as characterized by jitter, packet loss rate, etc.
- The terminals reproducing the sounds (telephone, PC or other) sometimes differ in terms of sampling frequency and the number of audio channels. It is sometimes difficult for the coder to know in advance the real capacity of the terminals.
- Numerous standards for coding audio-frequency signals (including the G.729 and G.729A codecs) are already deployed in networks. Transcoding between the various associated formats is often necessary (for example in gateways or routers), although this generally implies a loss of quality and non-negligible complexity.
The approach known as “hierarchical” coding is the technical solution best suited to taking account of all these constraints.
Unlike conventional coding, such as G.729 or G.729A coding, generating a bit stream at fixed bitrate, hierarchical coding generates a bit stream that can be decoded in whole or in part. As a general rule, hierarchical coding comprises a core layer and one or more enhancement layers. The core layer is generated by a low fixed bitrate core codec, guaranteeing the minimum coding quality. This layer must be received by the decoder to maintain an acceptable quality level. The enhancement layers serve to improve quality. However, it can happen that they are not all received by the decoder, because of transmission errors, for example in the event of congestion of an IP network.
This technique therefore offers great flexibility in terms of the choice of the bitrate and the quality of reconstruction. The coder always generates the bit stream at the maximum bitrate; however, anywhere in the communication chain the bitrate can be adapted simply by truncating the bit stream. Hierarchical coding moreover makes it possible to deploy wideband quality progressively, relying on a standard telephone band CELP codec (such as ITU-T G.729 or G.729A).
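The layered structure and its truncation property can be sketched as follows. The function names and the simple concatenation framing are illustrative assumptions, not a standardized bitstream format: the point is only that any node may cut the stream at a layer boundary, and the decoder uses however many complete layers it receives.

```python
def pack_layers(core: bytes, enhancements: list) -> bytes:
    """Concatenate the core layer followed by the enhancement layers.
    Any node in the communication chain may lower the bitrate simply by
    truncating this stream at a layer boundary."""
    return core + b"".join(enhancements)

def layers_received(stream: bytes, layer_sizes: list) -> int:
    """Count how many complete layers survived a possible truncation.
    Quality improves with each additional layer available to the decoder;
    the core layer alone guarantees the minimum quality."""
    count, offset = 0, 0
    for size in layer_sizes:
        if offset + size > len(stream):
            break
        count += 1
        offset += size
    return count
```

Truncating the stream anywhere inside the second enhancement layer, for example, still leaves the core and first enhancement layer decodable.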
Of the various approaches to hierarchical coding based on a CELP core coder, the following four techniques may be mentioned:
- hierarchical CELP coding with excitation enrichment, as described in the paper by R. D. De Iacovo, D. Sereno, Embedded CELP coding for variable-rate between 6.4 and 9.6 kbps, ICASSP 1991;
- band extension with transmission of auxiliary information, as described in the paper by J.-M. Valin et al., Bandwidth Extension of Narrowband Speech for Low Bit-Rate Wideband Coding, Proc. IEEE Speech Coding Workshop (SCW), 2000, pp. 130-132;
- in the paper by S. K. Jung, K.-T. Kim, H.-G. Kang, A bit-rate/band scalable speech coder based on ITU-T G.723.1 standard, ICASSP 2004, a hierarchical coder is constructed from a G.723.1 coder with two enhancement layers, the first being of the telephone band cascade CELP type and the second being high-band transform coding obtained by QMF (quadrature mirror filter) filtering;
- in the paper by H. Taddéi et al., A Scalable Three Bit-rate (8, 14.2 and 24 kbps) Audio Coder, 107th AES Convention, 1999, the coding uses a G.729 8 kbps core coder, an intermediate telephone band enhancement layer to increase the bitrate to 14.2 kbps, followed by a wideband enhancement layer using transform coding to reach 24 kbps.
The concept of hierarchical CELP coding by excitation enrichment differs from the coding shown in FIG. 1(b) in the addition of an innovator dictionary to represent the CELP target better. This coding approach is in fact similar to multistage quantization effected in the domain of the CELP target (the "perceptually" weighted domain). This additional dictionary enriches, or enhances, the decoded excitation because its contribution is added at the decoder to the cumulative contribution of the two dictionaries, adaptive and fixed, of standard CELP decoding as shown in FIG. 1(c). This CELP excitation enrichment principle can also be varied to include an additional adaptive dictionary or a plurality of innovator dictionaries.
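The multistage idea behind excitation enrichment can be sketched as follows. This is a schematic, hypothetical illustration (the function name and exhaustive dictionary scan are assumptions, and the search is shown directly on the excitation rather than in the weighted domain): a second innovator dictionary is searched against the residual left by the standard adaptive plus fixed contributions, and its scaled best entry is simply added to the decoded excitation.

```python
import numpy as np

def enrich_excitation(target, base_exc, extra_dict):
    """Second-stage quantization: find the entry of the additional innovator
    dictionary that, with its optimal gain, best reduces the residual error
    between the target and the base excitation, and add its contribution."""
    residual = np.asarray(target, dtype=float) - np.asarray(base_exc, dtype=float)
    best_i, best_g = 0, 0.0
    best_err = np.dot(residual, residual)  # error with no enrichment
    for i, c in enumerate(extra_dict):
        c = np.asarray(c, dtype=float)
        den = np.dot(c, c)
        if den == 0:
            continue
        g = np.dot(residual, c) / den      # optimal gain for this entry
        err = np.dot(residual - g * c, residual - g * c)
        if err < best_err:
            best_i, best_g, best_err = i, g, err
    return np.asarray(base_exc, dtype=float) + best_g * np.asarray(extra_dict[best_i], dtype=float)
```

If the enhancement layer carrying this extra index and gain is lost, the decoder simply falls back to the base excitation, which is what makes the scheme hierarchical.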
The band extension system proposed in the above paper by J.-M. Valin is shown in the FIG. 3 diagram. A signal in the telephone band (300 Hz-3400 Hz) is widened to the 0-8000 Hz wideband by adding (31) three contributions:
- a baseband regenerated by the block (32);
- the telephone band signal, for example coded by the G.729 system (40) and resampled by the block (33) at 16 kHz;
- a high band constructed with the aid of the blocks (34) to (39).
Note more particularly in this diagram the extension of the high band, which is founded on the "source-filter" model. This begins with a narrowband LPC analysis (34) that determines the coefficients of the prediction filter ANB(z) (36). The result of this LPC analysis is also used by the LPC envelope extension unit (35) to determine the coefficients of a full-band LPC synthesis filter 1/BWB(z) (38). Envelope extension can be effected using codebook mapping techniques, for example, with no transmission of auxiliary information, or with explicit information requiring transmission by quantization at a low additional bitrate. In parallel, the narrowband LPC residual (or excitation) signal is calculated by the unit (36). The resulting excitation sampled at 8 kHz is extended to the sampling frequency of 16 kHz by the unit (37). This operation can be carried out in the excitation domain by employing non-linearity, oversampling and filtering, in order to extend the harmonic structure and to whiten the full-band excitation. The extended excitation is then shaped by the full-band synthesis filter 1/BWB (38) and the result is limited by the high-pass filter (39) to the 3400 Hz-8000 Hz band.
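The excitation extension step can be sketched as follows. This is a crude, hypothetical illustration of the oversampling plus non-linearity idea only (the function name, the choice of |x| as the non-linearity, and mean removal as a stand-in for whitening are all assumptions; a real system would follow this with the full-band synthesis filter 1/BWB(z) and the 3400-8000 Hz high-pass filter of blocks (38) and (39)).

```python
import numpy as np

def extend_excitation(exc_nb, fs_ratio=2):
    """Extend a narrowband excitation to a higher sampling rate:
    zero-insertion upsampling folds the spectrum around the old Nyquist
    frequency, a memoryless non-linearity regenerates harmonic structure
    above the original band, and mean removal crudely flattens (whitens)
    the result."""
    up = np.zeros(len(exc_nb) * fs_ratio)
    up[::fs_ratio] = exc_nb            # zero-insertion upsampling
    extended = np.abs(up)              # |x| non-linearity creates new harmonics
    return extended - extended.mean()  # crude whitening / DC removal
```

The output has fs_ratio times as many samples as the input and zero mean, which is the minimal behaviour expected of such a stage before full-band synthesis filtering.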
All known techniques of the prior art nevertheless give rise to the following problems:
- wideband speech degraded by certain artefacts, such as aliasing caused by the use of a bank of QMF filters;
- music badly coded by models linked to the speech production process;
- high bitrate granularity;
- quality degraded by the presence of pre-echo in the enhancement layer using transform coding;
- delay and complexity.
Moreover, certain fundamental problems are rarely touched on in the prior art: the phase non-linearity of pre-processing and post-processing is only rarely taken into account. The enhancement layers, which rely on coding a difference signal between the original signal (pre-processed or not) and the synthesis of the lower layer, suffer badly degraded performance if the phase non-linearity (or group delay) of the pre-processing and post-processing filters is not compensated or eliminated.