Detailed hereinafter is hierarchical coding, having the capability of providing varied bitrates, by apportioning into hierarchized subsets the information relating to an audio signal to be coded, in such a way that this information can be used in order of importance from the standpoint of quality of audio rendition. The criterion taken into account for determining the order is a criterion of optimization (or rather of lesser degradation) of the quality of the coded audio signal. Hierarchical coding is particularly suited to transmission on heterogeneous networks or those exhibiting time-varying available bitrates, or else to transmission destined for terminals exhibiting varying capabilities.
The basic concept of hierarchical (or “scalable”) audio coding may be described as follows.
The binary stream comprises a base layer and one or more improvement layers. The base layer is generated by a fixed-bitrate codec, called a “core codec”, guaranteeing the minimum quality of the coding. This layer must be received by the decoder to maintain an acceptable quality level. The improvement layers serve to improve the quality. It may, however, happen that they are not all received by the decoder.
The main benefit of hierarchical coding is that it allows adaptation of the bitrate by simple “truncation of the binary stream”. The number of layers (that is to say the number of possible truncations of the binary stream) defines the granularity of the coding. One speaks of “coarse granularity” coding if the binary stream comprises few layers (of the order of 2 to 4) and of “fine granularity” coding if it allows for example an increment of the order of 1 to 2 kbit/s.
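By way of illustration only, the adaptation of the bitrate by truncation may be sketched as follows. The sketch assumes frames of 20 ms (the frame length of the G.729.1 coder summarized hereinafter) and a frame whose layers are stored in decreasing order of importance; the function names are hypothetical.

```python
FRAME_MS = 20  # assumed frame duration, in milliseconds


def bytes_per_frame(bitrate_kbps):
    """Whole bytes available for one frame at the given bitrate."""
    bits = int(round(bitrate_kbps * FRAME_MS))  # kbit/s x ms = bits
    return bits // 8


def truncate_frame(frame, target_kbps):
    """Keep only the leading layers that fit within the target bitrate."""
    return frame[:bytes_per_frame(target_kbps)]


if __name__ == "__main__":
    full_frame = bytes(bytes_per_frame(32))      # frame coded at 32 kbit/s
    reduced = truncate_frame(full_frame, 14)     # same frame cut to 14 kbit/s
    print(len(full_frame), len(reduced))
```

The decoder then works with whatever leading portion of the frame it has received, the base layer being located at the very start.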
The techniques of bitrate- and bandwidth-scalable coding, with a core coder of CELP type in the telephone band and one or more improvement layer(s) in the widened band, are more particularly described hereinafter. An example of such systems is given in the ITU-T G.729.1 standard, from 8 to 32 kbit/s with fine granularity. The G.729.1 coding/decoding algorithm is summarized hereinafter.
1. Reminders regarding the G.729.1 coder
The G.729.1 coder is an extension of the ITU-T G.729 coder. It is a hierarchical coder with a modified G.729 core, producing a signal whose band ranges from the narrow band (50-4000 Hz) to the widened band (50-7000 Hz) with a bitrate of 8 to 32 kbit/s for conversational services. This codec is compatible with existing Voice over IP equipment which uses the G.729 codec.
The G.729.1 coder is shown diagrammatically in FIG. 1. The widened-band input signal sWB, sampled at 16 kHz, is firstly decomposed into two sub-bands by QMF (“Quadrature Mirror Filter”) filtering. The low band (0-4000 Hz) is obtained by low-pass filtering LP (block 100) and decimation (block 101), and the high band (4000-8000 Hz) by high-pass filtering HP (block 102) and decimation (block 103). The filters LP and HP are of length 64.
The low band is preprocessed by a high-pass filter eliminating the components below 50 Hz (block 104), to obtain the signal sLB, before narrow-band CELP coding (block 105) at 8 and 12 kbit/s. This high-pass filtering takes account of the fact that the useful band is defined as covering the interval 50-7000 Hz. The narrow-band CELP coding is a cascade CELP coding comprising as first stage a modified G.729 coding without preprocessing filter and as second stage an additional fixed CELP dictionary.
The high band is firstly preprocessed (block 106) to compensate for the aliasing due to the high-pass filter (block 102) combined with the decimation (block 103). The high band is thereafter filtered by a low-pass filter (block 107) eliminating the components between 3000 and 4000 Hz of the high band (that is to say the components between 7000 and 8000 Hz in the original signal) to obtain the signal SHB. A parametric band extension (block 108) is carried out thereafter.
An important feature of the G.729.1 encoder according to FIG. 1 is the following: the error signal dLB of the low band is calculated (block 109) on the basis of the output of the CELP coder (block 105) and a predictive transform coding (of TDAC for “Time Domain Aliasing Cancellation” type in the G.729.1 standard) is carried out at the block 110. With reference to FIG. 1, it is seen in particular that the TDAC encoding is applied both to the error signal on the low band and to the filtered signal on the high band.
Additional parameters may be transmitted by the block 111 to a homologous decoder, this block 111 carrying out a processing termed “FEC” for “Frame Erasure Concealment”, with a view to reconstructing erased frames, if any.
The various binary streams generated by the coding blocks 105, 108, 110 and 111 are finally multiplexed and structured as a hierarchical binary train in the multiplexing block 112. The coding is carried out per blocks of samples (or frames) of 20 ms, i.e. 320 samples per frame.
The G.729.1 codec therefore has a three-stage coding architecture comprising:
the cascade CELP coding,
the parametric band extension by the module 108, of TDBWE (“Time Domain Bandwidth Extension”) type, and
a predictive TDAC transform coding, applied after a transformation of MDCT (“Modified Discrete Cosine Transform”) type.
2. Reminders regarding the G.729.1 decoder
The G.729.1 decoder is illustrated in FIG. 2. The bits describing each 20-ms frame are demultiplexed in the block 200.
The binary stream of the layers at 8 and 12 kbit/s is used by the CELP decoder (block 201) to generate the narrow-band synthesis (0-4000 Hz). That portion of the binary stream associated with the layer at 14 kbit/s is decoded by the band extension module (block 202). That portion of the binary stream associated with the bitrates above 14 kbit/s is decoded by the TDAC module (block 203). A processing of the pre-echoes and post-echoes is carried out by the blocks 204 and 207 as well as an enhancement (block 205) and a post-processing of the low band (block 206).
The widened-band output signal ŝwb, sampled at 16 kHz, is obtained by way of the bank of synthesis QMF filters (blocks 209, 210, 211, 212 and 213) integrating the inverse aliasing (block 208).
The description of the transform-coding layer is detailed hereinafter.
3. Reminders regarding the TDAC transform based coder in the G.729.1 coder
The transform coding of TDAC type in the G.729.1 coder is illustrated in FIG. 3.
The filter WLB(z) (block 300) is a perceptual weighting filter, with gain compensation, applied to the low-band error signal dLB. MDCT transforms are thereafter calculated (blocks 301 and 302) to obtain:
the MDCT spectrum DLBw of the difference signal, perceptually filtered, and
the MDCT spectrum SHB of the original signal of the high band.
These MDCT transforms (blocks 301 and 302) are applied to 20 ms of signal sampled at 8 kHz (160 coefficients). The spectrum Y(k) arising from the fusion block 303 thus comprises 2×160, i.e. 320 coefficients. It is defined as follows:
$$[Y(0)\ Y(1)\ \ldots\ Y(319)] = [D_{LB}^{w}(0)\ D_{LB}^{w}(1)\ \ldots\ D_{LB}^{w}(159)\ S_{HB}(0)\ S_{HB}(1)\ \ldots\ S_{HB}(159)]$$
This spectrum is divided into eighteen sub-bands, a sub-band j being assigned a number denoted nb_coef(j) of coefficients. The slicing into sub-bands is specified in table 1 hereinafter.
Thus, a sub-band j comprises the coefficients Y(k) with sb_bound(j) ≤ k < sb_bound(j+1).
Note that the coefficients 280-319 corresponding to the 7000 Hz-8000 Hz frequency band are not coded; they are set to zero at the decoder, since the passband of the codec is from 50-7000 Hz.
TABLE 1
Limits and size of the sub-bands in TDAC coding

j     sb_bound(j)   nb_coef(j)
0     0             16
1     16            16
2     32            16
3     48            16
4     64            16
5     80            16
6     96            16
7     112           16
8     128           16
9     144           16
10    160           16
11    176           16
12    192           16
13    208           16
14    224           16
15    240           16
16    256           16
17    272           8
18    280           —
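The fusion of the two spectra and the slicing into sub-bands may be sketched as follows. This is an illustrative sketch only: the boundaries are those of Table 1, while the function names (fuse_spectra, subband) are hypothetical.

```python
# sb_bound(j) for j = 0..18, as listed in Table 1.
SB_BOUND = [16 * j for j in range(17)] + [272, 280]


def nb_coef(j):
    """Number of coefficients in sub-band j (j = 0..17)."""
    return SB_BOUND[j + 1] - SB_BOUND[j]


def fuse_spectra(d_lbw, s_hb):
    """Concatenate the low-band and high-band MDCT spectra into Y(0..319)."""
    if len(d_lbw) != 160 or len(s_hb) != 160:
        raise ValueError("each spectrum must hold 160 coefficients")
    return list(d_lbw) + list(s_hb)


def subband(spectrum, j):
    """Coefficients Y(k) with sb_bound(j) <= k < sb_bound(j+1)."""
    return spectrum[SB_BOUND[j]:SB_BOUND[j + 1]]


if __name__ == "__main__":
    y = fuse_spectra([0.0] * 160, [1.0] * 160)
    print([nb_coef(j) for j in range(18)])
    print(len(subband(y, 17)))
```

The coefficients 280 to 319 belong to no sub-band, which matches the fact that they are not coded.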
The spectral envelope {log_rms(j)}, j=0, …, 17, is calculated in the block 304 according to the formula:
$$\mathrm{log\_rms}(j) = \frac{1}{2}\log_2\left[\frac{1}{\mathrm{nb\_coef}(j)}\sum_{k=\mathrm{sb\_bound}(j)}^{\mathrm{sb\_bound}(j+1)-1} Y(k)^2 + \varepsilon_{rms}\right],\quad j=0,\ldots,17$$
where $\varepsilon_{rms} = 2^{-24}$.
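The calculation of the spectral envelope may be sketched as follows, for illustration only. The boundaries are those of table 1 and the constant is the value 2^-24 stated above; the function name log_rms simply mirrors the notation of the formula.

```python
import math

# sb_bound(j) for j = 0..18, as listed in table 1.
SB_BOUND = [16 * j for j in range(17)] + [272, 280]
EPS_RMS = 2.0 ** -24  # keeps the logarithm finite for an all-zero sub-band


def log_rms(spectrum, j):
    """Half the base-2 logarithm of the mean energy of sub-band j."""
    lo, hi = SB_BOUND[j], SB_BOUND[j + 1]
    mean_square = sum(y * y for y in spectrum[lo:hi]) / (hi - lo)
    return 0.5 * math.log2(mean_square + EPS_RMS)


if __name__ == "__main__":
    silence = [0.0] * 320
    print([log_rms(silence, j) for j in range(18)])
```

For an all-zero sub-band the result is the floor value 0.5 × log2(2^-24), i.e. −12.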
The spectral envelope is coded at variable bitrate in the block 305. This block 305 produces quantized, integer values, denoted rms_index(j) (with j=0, …, 17), obtained by simple scalar quantization:
$$\mathrm{rms\_index}(j) = \mathrm{round}\left(2\cdot\mathrm{log\_rms}(j)\right)$$
where the notation “round” designates rounding to the nearest integer, and with the constraint:
$$-11 \le \mathrm{rms\_index}(j) \le +20$$
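This scalar quantization may be sketched as follows. Two points are assumptions of the sketch and are not stated above: the constraint is enforced by clamping, and exact halves are rounded upward.

```python
import math

RMS_INDEX_MIN = -11
RMS_INDEX_MAX = 20


def quantize_log_rms(log_rms_value):
    """Round 2*log_rms to the nearest integer, then clamp to [-11, +20]."""
    # Assumption: ties are rounded upward (floor(x + 0.5)).
    nearest = math.floor(2.0 * log_rms_value + 0.5)
    # Assumption: the range constraint is applied by clamping.
    return max(RMS_INDEX_MIN, min(RMS_INDEX_MAX, nearest))


if __name__ == "__main__":
    for value in (-12.0, -0.3, 1.2, 2.0, 10.3):
        print(value, quantize_log_rms(value))
```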
This quantized value rms_index(j) is transmitted to the bit allocation block 306.
The coding of the spectral envelope, itself, is further performed by the block 305, separately for the low band (rms_index(j), with j=0, . . . , 9) and for the high band (rms_index(j), with j=10, . . . , 17). In each band, two types of coding may be chosen according to a given criterion, and, more precisely, the values rms_index(j):
may be coded by so-called “differential Huffman” coding,
or may be coded by natural binary coding.
A bit (0 or 1) is transmitted to the decoder to indicate the mode of coding which has been chosen.
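The choice between the two modes of coding may be sketched as follows, for one band. The sketch rests on assumptions: the 32 possible values of rms_index(j) are taken to occupy 5 bits each in natural binary coding, the criterion is taken to be the shorter of the two codings, and the length of the differential Huffman coding is supplied by the caller as a hypothetical figure, since no Huffman table is given here.

```python
MIN_INDEX = -11
BITS_PER_INDEX = 5  # assumption: 32 values in [-11, +20] fit in 5 bits


def natural_binary(indices):
    """Natural binary coding of each index as an offset from -11."""
    return "".join(format(i - MIN_INDEX, "05b") for i in indices)


def choose_mode(indices, huffman_bits):
    """Return (mode, total bits including the one-bit mode flag).

    huffman_bits is a hypothetical bit count that a differential
    Huffman coder would need for the same indices.
    """
    natural_bits = BITS_PER_INDEX * len(indices)
    if huffman_bits < natural_bits:
        return ("huffman", huffman_bits + 1)
    return ("natural", natural_bits + 1)


if __name__ == "__main__":
    low_band = [3, 4, 4, 5, 2, 0, -1, -3, -6, -11]
    print(natural_binary(low_band))
    print(choose_mode(low_band, 37))
```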
The number of bits allocated to each sub-band for its quantization is determined at the block 306 on the basis of the quantized spectral envelope arising from the block 305.
The bit allocation performed minimizes the quadratic error while adhering to the constraint of an integer number of bits allocated per sub-band and of a maximum number of bits not to be exceeded. The spectral content of the sub-bands is thereafter coded by spherical vector quantization (block 307).
The various binary streams generated by the blocks 305 and 307 are thereafter multiplexed and structured as a hierarchical binary train at the multiplexing block 308.
4. Reminder regarding the transform based decoder in the G.729.1 decoder
The step of TDAC type transform based decoding in the G.729.1 decoder is illustrated in FIG. 4.
In a symmetric manner to the encoder (FIG. 3), the decoded spectral envelope (block 401) makes it possible to retrieve the allocation of bits (block 402). The envelope decoding (block 401) reconstructs the quantized values of the spectral envelope (rms_index(j), for j=0, …, 17), on the basis of the binary train generated by the block 305 (multiplexed) and deduces therefrom the decoded envelope:
$$\mathrm{rms\_q}(j) = 2^{\frac{1}{2}\mathrm{rms\_index}(j)}$$
The spectral content of each of the sub-bands is retrieved by inverse spherical vector quantization (block 403). The untransmitted sub-bands, for lack of sufficient “budget” of bits, are extrapolated (block 404) on the basis of the MDCT transform of the signal output by the band extension block (block 202 of FIG. 2).
After upgrading of this spectrum (block 405) as a function of the spectral envelope and post-processing (block 406), the MDCT spectrum is split into two (block 407):
with 160 first coefficients corresponding to the spectrum {circumflex over (D)}LBw of the perceptually filtered, low-band decoded difference signal,
and 160 subsequent coefficients corresponding to the spectrum ŜHB of the high-band decoded original signal.
These two spectra are transformed into temporal signals by inverse MDCT transform, denoted IMDCT (blocks 408 and 410), and the inverse perceptual weighting (filter denoted WLB(z)−1) is applied to the signal {circumflex over (d)}LBw (block 409) resulting from the inverse transform.
The allocation of bits to the sub-bands (block 306 of FIG. 3 or block 402 of FIG. 4) is more particularly described hereinafter.
The blocks 306 and 402 carry out an identical operation on the basis of the values rms_index(j), j=0, . . . , 17. Therefore, hereinafter merely the operation of the block 306 is described.
The aim of the binary allocation is to apportion between each of the sub-bands a certain (variable) budget of bits, denoted nbits_VQ, with:
nbits_VQ=351−nbits_rms, where nbits_rms is the number of bits used by the coding of the spectral envelope.
The result of the allocation is the integer number of bits, denoted nbit(j) (with j=0, . . . , 17), allocated to each of the sub-bands with, as overall constraint:
$$\sum_{j=0}^{17}\mathrm{nbit}(j) \le \mathrm{nbits\_VQ}$$
In the G.729.1 standard, the values nbit(j) (j=0, . . . , 17), are moreover constrained by the fact that nbit(j) must be chosen from among a reduced set of values specified in table 2 hereinafter.
TABLE 2
Possible values of number of bits allocated in the TDAC sub-bands.

Size of the sub-band j, nb_coef(j)   Set of authorized values nbit(j) (in number of bits)
8                                    R8 = {0, 7, 10, 12, 13, 14, 15, 16}
16                                   R16 = {0, 9, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32}
The allocation in the G.729.1 standard relies on a “perceptual importance” per sub-band related to the energy of the sub-band, denoted ip(j) (j=0, …, 17), defined as follows:
$$ip(j) = \frac{1}{2}\log_2\left(\mathrm{rms\_q}(j)^2 \times \mathrm{nb\_coef}(j)\right) + \mathrm{offset}$$
where offset = −2.
Since the values $\mathrm{rms\_q}(j) = 2^{\frac{1}{2}\mathrm{rms\_index}(j)}$, this formula simplifies to the form:
$$ip(j) = \begin{cases} \frac{1}{2}\,\mathrm{rms\_index}(j) & \text{for } j = 0, \ldots, 16 \\ \frac{1}{2}\left(\mathrm{rms\_index}(j) - 1\right) & \text{for } j = 17 \end{cases}$$
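The simplification can be checked numerically, as in the following illustrative sketch: with nb_coef(j) = 16 the term ½·log2(16) = 2 cancels the offset, and with nb_coef(j) = 8 the term ½·log2(8) = 1.5 leaves a residue of −½. The function names are hypothetical.

```python
import math

OFFSET = -2.0


def rms_q(rms_index):
    """Decoded envelope value 2^(rms_index / 2)."""
    return 2.0 ** (0.5 * rms_index)


def ip_general(rms_index, nb_coef):
    """Perceptual importance from the general definition."""
    return 0.5 * math.log2(rms_q(rms_index) ** 2 * nb_coef) + OFFSET


def ip_simplified(rms_index, j):
    """Perceptual importance from the simplified form (j = 0..17)."""
    return 0.5 * rms_index if j < 17 else 0.5 * (rms_index - 1)


if __name__ == "__main__":
    for idx in (-11, 0, 7, 20):
        print(idx, ip_general(idx, 16), ip_simplified(idx, 0),
              ip_general(idx, 8), ip_simplified(idx, 17))
```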
On the basis of the perceptual importance of each sub-band, the allocation nbit(j) is calculated as follows:
$$\mathrm{nbit}(j) = \arg\min_{r \in R_{\mathrm{nb\_coef}(j)}} \left|\, \mathrm{nb\_coef}(j) \times \left(ip(j) - \lambda_{opt}\right) - r \,\right|$$
where $\lambda_{opt}$ is a parameter optimized by dichotomy to satisfy the overall constraint
$$\sum_{j=0}^{17}\mathrm{nbit}(j) \le \mathrm{nbits\_VQ}$$
by best approximating the threshold nbits_VQ.
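The allocation by dichotomy may be sketched as follows. This is an illustrative sketch and not the procedure of the standard: it assumes that the minimization is read as the distance between the target nb_coef(j) × (ip(j) − λ) and the authorized value r, that the search interval for λ can be bounded from the extreme values of ip(j), and that a fixed number of bisection steps suffices. The function names are hypothetical.

```python
R8 = [0, 7, 10, 12, 13, 14, 15, 16]
R16 = [0, 9, 14] + list(range(16, 33))
NB_COEF = [16] * 17 + [8]


def perceptual_importance(rms_index):
    """Simplified form of ip(j) for the 18 sub-bands."""
    return [0.5 * idx if j < 17 else 0.5 * (idx - 1)
            for j, idx in enumerate(rms_index)]


def allocate(ip, lam):
    """Nearest authorized value to nb_coef(j) * (ip(j) - lam), per sub-band."""
    bits = []
    for j, importance in enumerate(ip):
        allowed = R8 if NB_COEF[j] == 8 else R16
        target = NB_COEF[j] * (importance - lam)
        bits.append(min(allowed, key=lambda r: abs(target - r)))
    return bits


def bit_allocation(rms_index, nbits_vq, iterations=40):
    """Bisect on lam; keep the fullest allocation that respects the budget."""
    ip = perceptual_importance(rms_index)
    lo = min(ip) - 3.0   # low lam: every sub-band gets its maximum
    hi = max(ip) + 1.0   # high lam: every sub-band gets zero bits
    best = allocate(ip, hi)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        candidate = allocate(ip, mid)
        if sum(candidate) <= nbits_vq:
            best, hi = candidate, mid
        else:
            lo = mid
    return best


if __name__ == "__main__":
    rms_index = [12, 10, 9, 8, 8, 7, 6, 6, 5, 5, 4, 4, 3, 3, 2, 2, 1, 0]
    nbits_rms = 60                      # hypothetical envelope cost
    bits = bit_allocation(rms_index, 351 - nbits_rms)
    print(bits, sum(bits))
```

Since each target decreases as λ grows, the total number of allocated bits does not increase with λ, which is what allows a search by dichotomy.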
New initiatives for extending a core coder of G.729.1 type such as described hereinabove or of G.718 type to super widened band (SWB for “Super Wide Band”), are currently undergoing discussion.
A possible extension solution is described for example in the document by the authors M. Tammi, L. Laaksonen, A. Rämö, H. Toukomaa, entitled “Scalable Superwideband Extension for Wideband Coding”, ICASSP, 2009.
This document describes a super-widened band coding/decoding system comprising a core coding stage of G.729.1 or G.718 type and a band extension stage.
The core coding performs the coding of the frequency band ranging from 0 to 7 kHz whereas the extension band performs a coding in the frequency band ranging from 7 to 14 kHz.
A first extension coding layer is based on a parametric model relying on two modes of coding: a generic mode and a sinusoidal mode.
The generic mode uses a procedure for transposition in the MDCT domain for artificially generating the high-frequency (7-14 kHz) MDCT coefficients on the basis of the low frequencies (0-7 kHz). The low frequency band making it possible to code a high frequency band is selected on a criterion for maximizing the normalized correlation.
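The selection of the low-frequency band may be sketched as follows. The sketch is illustrative only: the exact correlation measure and search range of the cited article are not reproduced here, and the function names are hypothetical. It slides a window over the low-frequency MDCT coefficients and retains the offset whose content has the highest normalized correlation with the high-frequency band to be coded.

```python
import math


def normalized_correlation(a, b):
    """Inner product of a and b divided by the product of their norms."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0


def best_source_offset(low, target):
    """Offset in `low` whose window best matches `target` in shape."""
    n = len(target)
    return max(range(len(low) - n + 1),
               key=lambda i: normalized_correlation(low[i:i + n], target))


if __name__ == "__main__":
    low = [0.0, 1.0, 0.0, 0.0, 3.0, -2.0, 5.0, 0.0, 1.0]
    target = [6.0, -4.0, 10.0]          # same shape as low[4:7], scaled by 2
    print(best_source_offset(low, target))
```

Because the correlation is normalized, the match depends on spectral shape only; a gain would still have to be conveyed separately.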
The sinusoidal mode is normally used for particularly harmonic or tonal signals. In this mode, the highest-energy components are selected. Their positions, their amplitudes and their signs are then transmitted.
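The selection performed in the sinusoidal mode may be sketched as follows, for illustration only. The number of components retained and the way the parameters are subsequently quantized are not specified here; the function name is hypothetical.

```python
def pick_sinusoids(coeffs, count):
    """Return (position, amplitude, sign) for the `count` largest components.

    The result is ordered by position; amplitude is the absolute value
    of the coefficient and sign is +1 or -1.
    """
    by_energy = sorted(range(len(coeffs)),
                       key=lambda k: abs(coeffs[k]),
                       reverse=True)[:count]
    return [(k, abs(coeffs[k]), 1 if coeffs[k] >= 0 else -1)
            for k in sorted(by_energy)]


if __name__ == "__main__":
    spectrum = [0.1, -3.0, 0.5, 2.0, -0.2]
    print(pick_sinusoids(spectrum, 2))
```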
This first layer is transmitted with a bitrate of 4 kbit/s. The article also proposes a second layer for improving the 7-14 kHz band, based on the coding of extra sinusoids making it possible to best approximate the MDCT spectrum of the input signal. The allocation of bits for this second extension layer is fixed once and for all.
Thus, the extension coding presented in this document improves the signal only in the extension frequency band ranging from 7 to 14 kHz. The frequency band from 0 to 7 kHz of the core coding is not modified.
It may happen, however, that certain frequency sub-bands of the core frequency band do not receive sufficient bitrate.
In the case where zero bits are allocated to a core coding sub-band, the decoder then makes direct use of the synthesized signal arising from the first band extension coding layer TDBWE for the 4-7 kHz band, to fill in the unallocated bands.
It turns out, however, that these bands may sometimes penalize the perceived quality when the coder is combined with a 7-14 kHz band extension module.
Indeed, the addition of the high frequencies sometimes increases the perception of defects arising from the low frequencies.
Thus, a band extension may accentuate the core layer coding defects.
There therefore exists a requirement for overall improvement to the quality of the coded signal on the whole of the frequency band and not only on the extension frequency band.