Hierarchical coding, detailed hereinafter, is capable of providing varied bitrates by apportioning the information relating to an audio signal to be coded into hierarchized subsets, in such a way that this information can be used in order of importance from the standpoint of quality of audio rendition. The criterion taken into account for determining the order is a criterion of optimization (or rather of least degradation) of the quality of the coded audio signal. Hierarchical coding is particularly suited to transmission on heterogeneous networks or on networks whose available bitrate varies over time, or else to transmission destined for terminals exhibiting varying capabilities.
The basic concept of hierarchical (or “scalable”) audio coding may be described as follows.
The binary stream comprises a base layer and one or more improvement layers. The base layer is generated by a fixed-bitrate codec, called a “core codec”, guaranteeing the minimum quality of the coding. This layer must be received by the decoder to maintain an acceptable quality level. The improvement layers serve to improve the quality. It may, however, happen that they are not all received by the decoder.
The main benefit of hierarchical coding is that it allows adaptation of the bitrate by simple “truncation of the binary stream”. The number of layers (that is to say the number of possible truncations of the binary stream) defines the granularity of the coding. One speaks of “coarse granularity” coding if the binary stream comprises few layers (of the order of 2 to 4) and of “fine granularity” coding if it allows, for example, an increment of the order of 1 to 2 kbit/s.
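The truncation principle may be illustrated by the following minimal Python sketch. The function name truncate_stream and the layer sizes are hypothetical and serve only as an illustration; they are not taken from any standard.

```python
def truncate_stream(layers, max_bits):
    """Keep the longest prefix of layers whose total size fits in max_bits.

    layers[0] is the base layer; the others are improvement layers in
    decreasing order of importance. Each layer is a string of '0'/'1' bits.
    """
    if not layers or len(layers[0]) > max_bits:
        raise ValueError("the base layer must fit in the available budget")
    kept, used = [], 0
    for layer in layers:
        if used + len(layer) > max_bits:
            break  # this layer and all later ones are dropped
        kept.append(layer)
        used += len(layer)
    return "".join(kept)


# Hypothetical frame: a 160-bit base layer and three 40-bit improvement layers.
frame = ["0" * 160, "1" * 40, "0" * 40, "1" * 40]
received = truncate_stream(frame, max_bits=250)
print(len(received))  # 240: base layer plus two improvement layers
```

In this sketch the decoder never needs to know which bitrate the coder originally targeted: any prefix that contains the base layer remains decodable.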
The techniques of bitrate- and bandwidth-scalable coding, with a core coder of CELP type in the telephone band and one or more improvement layer(s) in the widened band, are more particularly described hereinafter. An example of such a system is given in the ITU-T G.729.1 standard, which operates from 8 to 32 kbit/s with fine granularity. The G.729.1 coding/decoding algorithm is summarized hereinafter.
1. Reminders Regarding the G.729.1 Coder
The G.729.1 coder is an extension of the ITU-T G.729 coder. It is a hierarchical coder with a modified G.729 core, producing a signal whose band ranges from the narrow band (50-4000 Hz) to the widened band (50-7000 Hz) at a bitrate of 8 to 32 kbit/s for conversational services. This codec is compatible with existing voice over IP equipment which uses the G.729 codec.
The G.729.1 coder is shown diagrammatically in FIG. 1. The widened-band input signal swb, sampled at 16 kHz, is firstly decomposed into two sub-bands by QMF (“Quadrature Mirror Filter”) filtering. The low band (0-4000 Hz) is obtained by low-pass filtering LP (block 100) and decimation (block 101), and the high band (4000-8000 Hz) by high-pass filtering HP (block 102) and decimation (block 103). The filters LP and HP are of length 64.
The low band is preprocessed by a high-pass filter eliminating the components below 50 Hz (block 104), to obtain the signal sLB, before narrow-band CELP coding (block 105) at 8 and 12 kbit/s. This high-pass filtering takes account of the fact that the useful band is defined as covering the interval 50-7000 Hz. The narrow-band CELP coding is a cascade CELP coding comprising as first stage a modified G.729 coding without preprocessing filter and as second stage an additional fixed CELP dictionary.
The high band is firstly preprocessed (block 106) to compensate for the aliasing due to the high-pass filter (block 102) combined with the decimation (block 103). The high band is thereafter filtered by a low-pass filter (block 107) eliminating the components between 3000 and 4000 Hz of the high band (that is to say the components between 7000 and 8000 Hz in the original signal) to obtain the signal sHB. A parametric band extension (block 108) is carried out thereafter.
An important feature of the G.729.1 encoder according to FIG. 1 is the following. The error signal dLB of the low band is calculated (block 109) on the basis of the output of the CELP coder (block 105) and a predictive transform coding (of TDAC for “Time Domain Aliasing Cancellation” type in the G.729.1 standard) is carried out at the block 110. With reference to FIG. 1, it is seen in particular that the TDAC encoding is applied both to the error signal on the low band and to the filtered signal on the high band.
Additional parameters may be transmitted by the block 111 to a homologous decoder, this block 111 carrying out a processing termed “FEC” for “Frame Erasure Concealment”, with a view to reconstructing erased frames, if any.
The various binary streams generated by the coding blocks 105, 108, 110 and 111 are finally multiplexed and structured as a hierarchical binary train in the multiplexing block 112. The coding is carried out per blocks of samples (or frames) of 20 ms, i.e. 320 samples per frame.
The G.729.1 codec therefore has a three-stage coding architecture comprising:
    the cascade CELP coding,
    the parametric band extension by the module 108, of TDBWE (“Time Domain Bandwidth Extension”) type, and
    a predictive TDAC transform coding, applied after a transformation of MDCT (“Modified Discrete Cosine Transform”) type.
2. Reminders Regarding the G.729.1 Decoder
The G.729.1 decoder is illustrated in FIG. 2. The bits describing each 20-ms frame are demultiplexed in the block 200.
The binary stream of the layers at 8 and 12 kbit/s is used by the CELP decoder (block 201) to generate the narrow-band synthesis (0-4000 Hz). That portion of the binary stream associated with the layer at 14 kbit/s is decoded by the band extension module (block 202). That portion of the binary stream associated with the bitrates above 14 kbit/s is decoded by the TDAC module (block 203). A processing of the pre-echoes and post-echoes is carried out by the blocks 204 and 207 as well as an enhancement (block 205) and a post-processing of the low band (block 206).
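The way in which the received bitrate determines which decoding modules receive bits may be sketched as follows in Python. The function name active_decoder_stages and the stage labels are hypothetical; the thresholds are those stated above (layers at 8 and 12 kbit/s, layer at 14 kbit/s, bitrates above 14 kbit/s).

```python
def active_decoder_stages(bitrate_kbps):
    """Return the decoding stages that receive bits at the given bitrate."""
    if bitrate_kbps < 8:
        raise ValueError("the 8 kbit/s core layer is required")
    stages = ["CELP"]            # layers at 8 and 12 kbit/s (block 201)
    if bitrate_kbps >= 14:
        stages.append("TDBWE")   # layer at 14 kbit/s (block 202)
    if bitrate_kbps > 14:
        stages.append("TDAC")    # bitrates above 14 kbit/s (block 203)
    return stages


print(active_decoder_stages(12))  # ['CELP']
print(active_decoder_stages(32))  # ['CELP', 'TDBWE', 'TDAC']
```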
The widened-band output signal swb, sampled at 16 kHz, is obtained by way of the bank of synthesis QMF filters (blocks 209, 210, 211, 212 and 213) integrating the inverse aliasing (block 208).
The description of the transform-coding layer is detailed hereinafter.
3. Reminders Regarding the TDAC Transform Based Coder in the G.729.1 Coder
The transform coding of TDAC type in the G.729.1 coder is illustrated in FIG. 3.
The filter WLB(z) (block 300) is a perceptual weighting filter, with gain compensation, applied to the low-band error signal dLB. MDCT transforms are thereafter calculated (blocks 301 and 302) to obtain:
    the MDCT spectrum DLBw of the difference signal, perceptually filtered, and
    the MDCT spectrum SHB of the original signal of the high band.
These MDCT transforms (blocks 301 and 302) are applied to 20 ms of signal sampled at 8 kHz (160 coefficients). The spectrum Y(k) arising from the fusion block 303 thus comprises 2×160, i.e. 320 coefficients. It is defined as follows:
[Y(0) Y(1) . . . Y(319)] = [DLBw(0) DLBw(1) . . . DLBw(159) SHB(0) SHB(1) . . . SHB(159)]
This spectrum is divided into eighteen sub-bands, a sub-band j being assigned a number denoted nb_coef(j) of coefficients. The slicing into sub-bands is specified in table 1 hereinafter.
Thus, a sub-band j comprises the coefficients Y(k) with sb_bound(j)≦k<sb_bound(j+1).
Note that the coefficients 280-319 corresponding to the 7000 Hz-8000 Hz frequency band are not coded; they are set to zero at the decoder, since the passband of the codec is from 50-7000 Hz.
TABLE 1
Limits and size of the sub-bands in TDAC coding

j     sb_bound(j)   nb_coef(j)
0     0             16
1     16            16
2     32            16
3     48            16
4     64            16
5     80            16
6     96            16
7     112           16
8     128           16
9     144           16
10    160           16
11    176           16
12    192           16
13    208           16
14    224           16
15    240           16
16    256           16
17    272           8
18    280           —
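By way of illustration, the sub-band layout of table 1 and the fusion of the two MDCT spectra may be sketched as follows in Python. The helper names fuse_spectra and sub_band, and the constant test spectra, are hypothetical.

```python
NB_COEF = [16] * 17 + [8]          # nb_coef(j) for j = 0..17 (table 1)

# sb_bound(j) is the running sum of the sub-band sizes; sb_bound(18) = 280.
SB_BOUND = [0]
for n in NB_COEF:
    SB_BOUND.append(SB_BOUND[-1] + n)


def fuse_spectra(d_lbw, s_hb):
    """Concatenate the two 160-coefficient MDCT spectra into Y(0..319)."""
    if len(d_lbw) != 160 or len(s_hb) != 160:
        raise ValueError("each MDCT spectrum must hold 160 coefficients")
    return list(d_lbw) + list(s_hb)


def sub_band(Y, j):
    """Return the coefficients Y(k) with sb_bound(j) <= k < sb_bound(j+1)."""
    return Y[SB_BOUND[j]:SB_BOUND[j + 1]]


Y = fuse_spectra([1.0] * 160, [2.0] * 160)   # hypothetical constant spectra
print(SB_BOUND[17], SB_BOUND[18])            # 272 280
print(len(sub_band(Y, 17)))                  # 8
print(len(Y) - SB_BOUND[18])                 # 40 uncoded coefficients (280-319)
```

Sub-bands 0 to 9 fall entirely within the low-band spectrum and sub-bands 10 to 17 within the high-band spectrum, since sb_bound(10) = 160.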
The spectral envelope {log_rms(j)}j=0, . . . , 17 is calculated in the block 304 according to the formula:
$$\mathrm{log\_rms}(j) = \frac{1}{2}\log_2\left[\frac{1}{\mathrm{nb\_coef}(j)}\sum_{k=\mathrm{sb\_bound}(j)}^{\mathrm{sb\_bound}(j+1)-1} Y(k)^2 + \varepsilon_{rms}\right], \quad j = 0, \ldots, 17$$
where $\varepsilon_{rms} = 2^{-24}$.
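The calculation of the spectral envelope may be sketched as follows in Python. The function name log_rms_envelope and the test spectra are hypothetical; the sub-band limits are those of table 1.

```python
import math

SB_BOUND = [16 * j for j in range(17)] + [272, 280]   # sb_bound(0..18)
EPS_RMS = 2.0 ** -24


def log_rms_envelope(Y):
    """Return log_rms(j), j = 0..17, for a 320-coefficient spectrum Y."""
    envelope = []
    for j in range(18):
        lo, hi = SB_BOUND[j], SB_BOUND[j + 1]
        mean_energy = sum(y * y for y in Y[lo:hi]) / (hi - lo)
        envelope.append(0.5 * math.log2(mean_energy + EPS_RMS))
    return envelope


silent = log_rms_envelope([0.0] * 320)
print(silent[0])   # -12.0, the floor imposed by eps_rms
```

The term eps_rms keeps the logarithm finite for an all-zero sub-band, which then yields the floor value of -12.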
The spectral envelope is coded at variable bitrate in the block 305. This block 305 produces quantized, integer values, denoted rms_index(j) (with j=0, . . . , 17), obtained by simple scalar quantization:
rms_index(j) = round(2·log_rms(j))
where the notation “round” designates rounding to the nearest integer, and with the constraint:
−11 ≤ rms_index(j) ≤ +20
This quantized value rms_index(j) is transmitted to the bits allocation block 306.
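The scalar quantization of the envelope may be sketched as follows in Python. Two points are assumptions of this sketch and are not stated in the text above: the constraint is enforced by clipping, and exact halves are rounded upward. The function name quantize_envelope is hypothetical.

```python
import math


def quantize_envelope(log_rms_values):
    """Return rms_index(j) = round(2 * log_rms(j)), limited to [-11, 20]."""
    indices = []
    for value in log_rms_values:
        # Round to the nearest integer (ties upward: assumption of this sketch).
        index = math.floor(2.0 * value + 0.5)
        # Enforce -11 <= rms_index(j) <= +20 by clipping (assumption).
        indices.append(max(-11, min(20, index)))
    return indices


print(quantize_envelope([-12.0, 0.3, 2.0, 15.0]))  # [-11, 1, 4, 20]
```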
The coding of the spectral envelope, itself, is further performed by the block 305, separately for the low band (rms_index(j), with j=0, . . . , 9) and for the high band (rms_index(j), with j=10, . . . , 17). In each band, two types of coding may be chosen according to a given criterion, and, more precisely, the values rms_index(j):
    may be coded by so-called “differential Huffman” coding,
    or may be coded by natural binary coding.
A bit (0 or 1) is transmitted to the decoder to indicate the mode of coding which has been chosen.
The number of bits allocated to each sub-band for its quantization is determined at the block 306 on the basis of the quantized spectral envelope arising from the block 305.
The bit allocation performed minimizes the quadratic error while adhering to the constraint of an integer number of bits allocated per sub-band and of a maximum number of bits not to be exceeded. The spectral content of the sub-bands is thereafter coded by spherical vector quantization (block 307).
The various binary streams generated by the blocks 305 and 307 are thereafter multiplexed and structured as a hierarchical binary train at the multiplexing block 308.
4. Reminder Regarding the Transform Based Decoder in the G.729.1 Decoder
The step of TDAC type transform based decoding in the G.729.1 decoder is illustrated in FIG. 4.
In a symmetric manner to the encoder (FIG. 3), the decoded spectral envelope (block 401) makes it possible to retrieve the allocation of bits (block 402). The envelope decoding (block 401) reconstructs the quantized values of the spectral envelope (rms_index(j), for j=0, . . . , 17), on the basis of the binary train generated by the block 305 (multiplexed) and deduces therefrom the decoded envelope:
$$\mathrm{rms\_q}(j) = 2^{\frac{1}{2}\,\mathrm{rms\_index}(j)}$$
The spectral content of each of the sub-bands is retrieved by inverse spherical vector quantization (block 403). The untransmitted sub-bands, for lack of sufficient “budget” of bits, are extrapolated (block 404) on the basis of the MDCT transform of the signal output by the band extension block (block 202 of FIG. 2).
After upgrading of this spectrum (block 405) as a function of the spectral envelope and post-processing (block 406), the MDCT spectrum is split into two (block 407):
    with 160 first coefficients corresponding to the spectrum DLBw of the perceptually filtered, low-band decoded difference signal,
    and 160 subsequent coefficients corresponding to the spectrum SHB of the high-band decoded original signal.
These two spectra are transformed into temporal signals by inverse MDCT transform, denoted IMDCT (blocks 408 and 410), and the inverse perceptual weighting (filter denoted WLB(z)^(−1)) is applied to the signal dLBw (block 409) resulting from the inverse transform.
The allocation of bits to the sub-bands (block 306 of FIG. 3 or block 402 of FIG. 4) is more particularly described hereinafter.
The blocks 306 and 402 carry out an identical operation on the basis of the values rms_index(j), j=0, . . . , 17. Therefore, hereinafter merely the operation of the block 306 is described.
The aim of the binary allocation is to apportion between each of the sub-bands a certain (variable) budget of bits, denoted nbits_VQ, with:
nbits_VQ=351−nbits_rms, where nbits_rms is the number of bits used by the coding of the spectral envelope.
The result of the allocation is the integer number of bits, denoted nbit(j) (with j=0, . . . , 17), allocated to each of the sub-bands with, as global constraint:
$$\sum_{j=0}^{17} \mathrm{nbit}(j) \le \mathrm{nbits\_VQ}$$
In the G.729.1 standard, the values nbit(j) (j=0, . . . , 17), are moreover constrained by the fact that nbit(j) must be chosen from among a reduced set of values specified in table 2 hereinafter.
TABLE 2
Possible values of number of bits allocated in the TDAC sub-bands

Size of the sub-band j, nb_coef(j)    Set of authorized values nbit(j) (in number of bits)
8                                     R8 = {0, 7, 10, 12, 13, 14, 15, 16}
16                                    R16 = {0, 9, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32}
The allocation in the G.729.1 standard relies on a “perceptual importance” per sub-band related to the energy of the sub-band, denoted ip(j) (j=0 . . . 17), defined as follows:
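$$\mathrm{ip}(j) = \frac{1}{2}\log_2\left(\mathrm{rms\_q}(j)^2 \times \mathrm{nb\_coef}(j)\right) + \mathrm{offset}$$
where offset = −2.
Since the values $\mathrm{rms\_q}(j) = 2^{\frac{1}{2}\,\mathrm{rms\_index}(j)}$, this formula simplifies to the form:
$$\mathrm{ip}(j) = \begin{cases} \frac{1}{2}\,\mathrm{rms\_index}(j) & \text{for } j = 0, \ldots, 16 \\ \frac{1}{2}\left(\mathrm{rms\_index}(j) - 1\right) & \text{for } j = 17 \end{cases}$$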
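The simplification follows from the sub-band sizes: with nb_coef(j) = 16 the term (1/2)·log2(16) = 2 cancels the offset, whereas with nb_coef(17) = 8 the term (1/2)·log2(8) = 1.5 leaves a residue of −1/2. The following Python sketch checks this numerically over the whole range of rms_index; the function names are hypothetical.

```python
import math

NB_COEF = [16] * 17 + [8]   # sub-band sizes from table 1
OFFSET = -2.0


def rms_q(rms_index):
    """Decoded envelope value 2^(rms_index / 2)."""
    return 2.0 ** (0.5 * rms_index)


def ip_definition(j, rms_index):
    """Perceptual importance computed from its definition."""
    return 0.5 * math.log2(rms_q(rms_index) ** 2 * NB_COEF[j]) + OFFSET


def ip_simplified(j, rms_index):
    """Perceptual importance computed from the simplified form."""
    return 0.5 * rms_index if j <= 16 else 0.5 * (rms_index - 1)


worst = max(abs(ip_definition(j, i) - ip_simplified(j, i))
            for j in range(18) for i in range(-11, 21))
print(worst < 1e-12)  # True: both forms agree up to rounding error
```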
On the basis of the perceptual importance of each sub-band, the allocation nbit(j) is calculated as follows:
$$\mathrm{nbit}(j) = \arg\min_{r \in R_{\mathrm{nb\_coef}(j)}} \left| \mathrm{nb\_coef}(j) \times \left(\mathrm{ip}(j) - \lambda_{opt}\right) - r \right|$$
where $\lambda_{opt}$ is a parameter optimized by bisection (dichotomy) to satisfy the global constraint
$$\sum_{j=0}^{17} \mathrm{nbit}(j) \le \mathrm{nbits\_VQ}$$
by best approximating the threshold nbits_VQ.
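A possible form of this search is sketched below in Python. It is a minimal illustration under stated assumptions and does not reproduce the exact procedure of the G.729.1 standard: the distance to the authorized values is taken as an absolute value, the search interval and the number of iterations are arbitrary, and the envelope indices and the value of nbits_rms are hypothetical. All function names are likewise hypothetical.

```python
R8 = [0, 7, 10, 12, 13, 14, 15, 16]
R16 = [0, 9, 14] + list(range(16, 33))
NB_COEF = [16] * 17 + [8]


def perceptual_importance(rms_index):
    """ip(j) in its simplified form."""
    return [0.5 * idx if j < 17 else 0.5 * (idx - 1)
            for j, idx in enumerate(rms_index)]


def allocate_for(ip, lam):
    """nbit(j) = argmin over authorized r of |nb_coef(j)*(ip(j)-lam) - r|."""
    nbit = []
    for j, n in enumerate(NB_COEF):
        allowed = R8 if n == 8 else R16
        target = n * (ip[j] - lam)
        nbit.append(min(allowed, key=lambda r: abs(target - r)))
    return nbit


def allocate_bits(rms_index, nbits_vq, iterations=40):
    """Bisection on lam: the total allocation decreases as lam increases."""
    ip = perceptual_importance(rms_index)
    lo = min(ip) - 3.0           # every sub-band receives its maximum here
    hi = max(ip) + 1.0           # every sub-band receives 0 bits here
    if sum(allocate_for(ip, lo)) <= nbits_vq:
        return allocate_for(ip, lo)
    best = allocate_for(ip, hi)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        trial = allocate_for(ip, mid)
        if sum(trial) <= nbits_vq:
            hi, best = mid, trial    # feasible: try to spend more bits
        else:
            lo = mid                 # over budget: raise the threshold
    return best


# Hypothetical envelope indices and a hypothetical nbits_rms of 60 bits.
rms_index = [10, 9, 8, 8, 7, 6, 6, 5, 5, 4, 4, 3, 3, 2, 2, 1, 1, 0]
nbits_vq = 351 - 60
nbit = allocate_bits(rms_index, nbits_vq)
print(nbit, sum(nbit))
```

Since the decoder reconstructs the same values rms_index(j), running the same search on the decoder side yields the same allocation without any allocation data being transmitted.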
The impact of the perceptual weighting (filtering of the block 300) on the allocation of bits (block 306) of the TDAC transform based coder is now described in greater detail.
In the G.729.1 standard, the TDAC coding uses the filter WLB(z) for perceptual weighting in the low band (block 300), as indicated hereinabove. In essence, the perceptual weighting filtering makes it possible to shape the coding noise. The principle of this filtering is to utilize the fact that it is possible to inject more noise into the zones of frequencies where the original signal has high energy.
The perceptual weighting filters most commonly used in narrow-band CELP coding are of the form Â(z/γ1)/Â(z/γ2), where 0≦γ2≦γ1<1 and Â(z) represents a linear prediction spectrum (LPC). The analysis-by-synthesis search in CELP coding thus amounts to minimizing the quadratic error in a signal domain weighted perceptually by this type of filter.
However, to ensure spectral continuity when the spectra DLBw and SHB are adjoining (block 303 of FIG. 3), the filter WLB(z) is defined in the form:
$$W_{LB}(z) = \mathrm{fac}\,\frac{\hat{A}(z/\gamma_1)}{\hat{A}(z/\gamma_2)}$$
with $\gamma_1 = 0.96$, $\gamma_2 = 0.6$ and
$$\mathrm{fac} = \frac{\sum_{i=0}^{p} (-\gamma_2)^i\,\hat{a}_i}{\sum_{i=0}^{p} (-\gamma_1)^i\,\hat{a}_i}$$
The factor fac ensures that the filter has a gain of 1 at the junction of the low and high bands (4 kHz). It is important to note that, in the TDAC coding according to the G.729.1 standard, the coding relies only on an energy criterion.
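The role of the factor fac may be checked numerically with the following Python sketch. It assumes the usual convention Â(z) = Σ â_i z^(−i) for the linear prediction polynomial, under which 4 kHz corresponds to z = −1 for a low band sampled at 8 kHz. The coefficient values and the function names are hypothetical.

```python
import cmath
import math

GAMMA1, GAMMA2 = 0.96, 0.6


def weighted_poly(a_hat, gamma, z):
    """Evaluate A_hat(z / gamma) = sum_i a_hat[i] * gamma**i * z**(-i)."""
    return sum(a * (gamma ** i) * z ** (-i) for i, a in enumerate(a_hat))


def gain_factor(a_hat):
    """fac: ratio of the two polynomials evaluated at z = -1."""
    num = sum((-GAMMA2) ** i * a for i, a in enumerate(a_hat))
    den = sum((-GAMMA1) ** i * a for i, a in enumerate(a_hat))
    return num / den


def w_lb(a_hat, z):
    """W_LB(z) = fac * A_hat(z / gamma1) / A_hat(z / gamma2)."""
    return (gain_factor(a_hat) * weighted_poly(a_hat, GAMMA1, z)
            / weighted_poly(a_hat, GAMMA2, z))


a_hat = [1.0, -1.2, 0.8, -0.3]           # hypothetical LPC coefficients
z_4khz = cmath.exp(1j * math.pi)         # 4 kHz at a sampling rate of 8 kHz
print(abs(w_lb(a_hat, z_4khz)))          # close to 1
```

At z = −1 the numerator polynomial reduces to Σ (−γ1)^i â_i and the denominator polynomial to Σ (−γ2)^i â_i, so that multiplying by fac brings the gain to 1.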
5. Drawbacks of the Prior Art
The energy criterion of the TDAC coding of G.729.1, used in the high band (4000-7000 Hz), is not optimal from a perceptual point of view, especially for coding music signals.
The perceptual weighting filter is particularly suited to speech signals. It is widely used in standards for speech coding based on the coding format of CELP type. However, for music signals, it is apparent that this perceptual weighting based on a shaping of the quantization noise in accordance with the formants of the input signal is insufficient. Most audio coders rely on a transform coding using frequency masking models, or simultaneous masking; they are more generic (in the sense that they do not use a CELP-like speech production model) and are therefore more suitable for coding music signals.
Reference may be made to the document entitled “Introduction to digital audio coding and standards”, by M. Bosi and R. Goldberg, published by Kluwer Academic Publishers, in 2003, to get more details about masking models and their application in transform based coders.
There therefore exists a requirement to improve the quality of coding of the signals for better perceptual rendition, while retaining interoperability with G.729.1 coding.