The present invention relates to processing acoustic data.
This processing is suitable in particular for the transmission and/or storage of digital signals such as audio-frequency signals (speech, music, or other).
Various techniques exist for coding an audio-frequency signal in digital form. The most common techniques are:
- waveform coding methods such as pulse code modulation (PCM) and adaptive differential pulse code modulation (ADPCM),
- analysis-by-synthesis parametric coding methods such as code excited linear prediction (CELP) coding, and
- sub-band perceptual coding or transform coding methods.
These techniques process the input signal sequentially, sample by sample (PCM or ADPCM) or by blocks of samples called “frames” (CELP and transform coding).
Briefly, it will be recalled that a sound signal such as a speech signal can be predicted from its recent past (for example from 8 to 12 samples at 8 kHz) using parameters estimated over short windows (10 to 20 ms in this example). These short-term predictive parameters, representing the vocal tract transfer function (for example for pronouncing consonants), are obtained by linear prediction coding (LPC) methods. A longer-term correlation is also used to determine the periodicities of voiced sounds (for example the vowels) resulting from the vibration of the vocal cords. This involves determining at least the fundamental frequency of the voiced signal, which typically varies from 60 Hz (low voice) to 600 Hz (high voice) according to the speaker. A long term prediction (LTP) analysis is then used to determine the LTP parameters of a long-term predictor, in particular the inverse of the fundamental frequency, often called the "pitch period". The number of samples in a pitch period is then given by the ratio Fe/F0 (or its integer part), where:
- Fe is the sampling rate, and
- F0 is the fundamental frequency.
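As an illustration, the relation above can be sketched in a few lines of Python (the function name is hypothetical):

```python
# Hypothetical illustration: the pitch period in samples is the integer
# part of Fe/F0, where Fe is the sampling rate and F0 the fundamental frequency.
def pitch_period_samples(fe_hz: int, f0_hz: float) -> int:
    """Return the integer part of Fe/F0, i.e. the pitch period in samples."""
    return int(fe_hz // f0_hz)

# At Fe = 8000 Hz, a low voice at 60 Hz gives a long pitch period,
# while a high voice at 600 Hz gives a short one.
print(pitch_period_samples(8000, 60))   # → 133
print(pitch_period_samples(8000, 600))  # → 13
```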
It will be recalled therefore that the long-term prediction LTP parameters, including the pitch period, represent the fundamental vibration of the speech signal (when voiced), while the short-term prediction LPC parameters represent the spectral envelope of this signal.
In certain coders, the set of these LPC and LTP parameters thus resulting from a speech coding can be transmitted by blocks to a homologous decoder via one or more telecommunications networks so that the original speech can then be reconstructed.
In standard speech coding, the coder generates a fixed bit rate bitstream. This bit-rate constraint simplifies the implementation and use of the coder and the decoder. Examples of such systems are the ITU-T G.711 64 kbit/s coding standard, the ITU-T G.729 8 kbit/s coding standard, and the GSM-EFR 12.2 kbit/s coding.
In certain applications (such as mobile telephony or voice over IP (Internet Protocol)), it is preferable to generate a variable-rate bitstream, the bit-rate values being taken from a predefined set. Such a coding technique, called "multi-rate" coding, thus proves more flexible than a fixed bit-rate coding technique.
Several multi-rate coding techniques can be distinguished:
- source- and/or channel-controlled multi-mode coding, used in particular in the 3GPP AMR-NB, 3GPP AMR-WB, and 3GPP2 VMR-WB coders,
- hierarchical, or "scalable", coding, which generates a so-called "hierarchical" bitstream comprising a core bit rate and one or more enhancement layers (coding according to standard G.722 at 48, 56 and 64 kbit/s being typically bit-rate scalable, while ITU-T G.729.1 and MPEG-4 CELP codings are both bit-rate and bandwidth scalable),
- multiple-description coding, described in particular in: "A multiple description speech coder based on AMR-WB for mobile ad hoc networks", H. Dong, A. Gersho, J. D. Gibson, V. Cuperman, ICASSP, pp. 277-280, vol. 1 (May 2004).
Details will be given below of hierarchical coding, which has the capacity to provide varied bit rates by distributing the information relating to an audio signal to be coded into hierarchically arranged subsets, so that this information can be used in order of importance with respect to the audio rendering quality. The criterion used to determine this order is the optimization (or rather the minimum degradation) of the quality of the coded audio signal. Hierarchical coding is particularly suited to transmission over heterogeneous networks or networks whose available bit rates vary over time, as well as to transmission to terminals having variable capacities.
The basic concept of hierarchical (or “scalable”) audio coding can be described as follows.
The bitstream comprises a base layer and one or more enhancement layers. The base layer is generated by a low, fixed bit-rate codec known as the "core codec", which guarantees the minimum coding quality. This layer must be received by the decoder in order to maintain an acceptable level of quality. The enhancement layers serve to improve the quality, although they may not all be received by the decoder.
The main advantage of hierarchical coding is that it allows the bit rate to be adapted simply by "bitstream truncation". The number of layers (i.e. the number of possible bitstream truncations) defines the granularity of the coding. The coding is said to be of "coarse granularity" if the bitstream comprises few layers (of the order of 2-4), while "fine granularity" coding allows, for example, bit-rate steps of the order of 1-2 kbit/s.
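As a sketch of the principle (all names and the layered frame layout are hypothetical, not the G.729.1 bitstream format), rate adaptation by truncation can be illustrated as follows:

```python
# Illustrative sketch: a hierarchical frame is modeled as a list of layers
# ordered by importance; rate adaptation is a simple truncation of that list.
def truncate(layers, budget_bits):
    """Keep the core layer and as many enhancement layers as the budget allows."""
    kept, used = [], 0
    for layer in layers:
        # The core layer (first element) is always kept, as the decoder
        # requires it; enhancement layers are dropped once the budget is hit.
        if used + len(layer) > budget_bits and kept:
            break
        kept.append(layer)
        used += len(layer)
    return kept

frame = ["0" * 160, "1" * 80, "1" * 120]  # core + two enhancement layers (bits)
print(len(truncate(frame, 250)))  # → 2 (core + first enhancement layer)
print(len(truncate(frame, 100)))  # → 1 (core only)
```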
More particularly described below are bit-rate and bandwidth-scalable coding techniques with a CELP-type core coder in the telephony band, plus one or more enhancement layers in wideband. An example of such a system is given in the ITU-T G.729.1 8-32 kbit/s fine-granularity standard. The G.729.1 coding/decoding algorithm is summarized hereafter.
Reminders on the G.729.1 Coder
The G.729.1 coder is an extension of the ITU-T G.729 coder. It is a hierarchical coder with a modified G.729 core, producing a signal whose band extends from narrowband (50-4000 Hz) to wideband (50-7000 Hz) at bit rates of 8-32 kbit/s for speech services. This codec is compatible with existing voice over IP equipment (for the most part equipped according to standard G.729). It should finally be noted that standard G.729.1 was approved in May 2006.
The G.729.1 coder is shown diagrammatically in FIG. 1. The wideband input signal swb, sampled at 16 kHz, is firstly split into two sub-bands by quadrature mirror filtering (QMF). The low band (0-4000 Hz) is obtained by low-pass filtering LP (block 100) and decimation (block 101), and the high band (4000-8000 Hz) by high-pass filtering HP (block 102) and decimation (block 103). The LP and HP filters are of length 64 (i.e. 64 coefficients).
The low band is pre-processed by a high-pass filter removing components below 50 Hz (block 104), in order to obtain the signal sLB, before narrowband CELP coding (block 105) at 8 and 12 kbit/s. This high-pass filtering takes into account the fact that the useful band is defined as covering the range 50-7000 Hz. The narrowband CELP coding is a CELP cascade coding comprising as a first stage a modified G.729 coding without a pre-processing filter and as a second stage an additional fixed CELP dictionary.
The high band is firstly pre-processed (block 106) in order to compensate for the aliasing due to the high-pass filter (block 102) in combination with the decimation (block 103). The high band is then filtered by a low-pass filter (block 107) eliminating the high-band components between 3000 and 4000 Hz (i.e. the components in the original signal between 7000 and 8000 Hz) in order to obtain the signal sHB. Band expansion (block 108) is then carried out.
A significant feature of the G.729.1 encoder according to FIG. 1 is the following. The low-band error signal dLB is computed (block 109) on the basis of the output of the CELP coder (block 105) and a predictive transform coding (for example of the TDAC (time domain aliasing cancellation) type in standard G.729.1) is carried out at block 110. With reference to FIG. 1, it can be seen in particular that the TDAC encoding is applied both to the low-band error signal and to the high-band filtered signal.
Additional parameters can be transmitted by block 111 to a corresponding decoder, this block 111 carrying out a processing called “FEC” for “Frame Erasure Concealment”, in order to reconstitute any erased frames.
The different bitstreams generated by coding blocks 105, 108, 110 and 111 are finally multiplexed and structured in a hierarchical bitstream in the multiplexing block 112. The coding is carried out by blocks of samples (or frames) of 20 ms, i.e. 320 samples per frame.
The G.729.1 codec thus has a three-stage coding architecture comprising:
- cascade CELP coding,
- bandwidth extension by the time domain bandwidth extension (TDBWE) type module 108, and
- TDAC predictive transform coding, applied after a modified discrete cosine transform (MDCT) type transform.
Reminders on the G.729.1 Decoder
The corresponding decoder according to standard G.729.1 is shown in FIG. 2. The bits describing each frame of 20 ms are demultiplexed in block 200.
The bitstream of layers at 8 and 12 kbit/s is used by the CELP decoder (block 201) to generate the narrowband synthesis (0-4000 Hz). The portion of the bitstream associated with the layer at 14 kbit/s is decoded by the bandwidth expansion module (block 202). The portion of the bitstream associated with bit rates higher than 14 kbit/s is decoded by the TDAC module (block 203). A pre- and post-echo processing is carried out by blocks 204 and 207 as well as an enhancement (block 205) and post-processing of the low band (block 206).
The wideband output signal ŝwb, sampled at 16 kHz, is obtained using the QMF synthesis filterbank (blocks 209, 210, 211, 212 and 213) integrating the aliasing cancellation (block 208).
The description of the transform coding layer is detailed hereafter.
Reminders on the TDAC Transform Coder in the G.729.1 Coder
The TDAC type transform coding in the G.729.1 coder is shown in FIG. 3.
The filter WLB(z) (block 300) is a perceptual weighting filter, with gain compensation, applied to the low-band error signal dLB. MDCT transforms are then computed (blocks 301 and 302) in order to obtain:
- the MDCT spectrum DLBw of the perceptually filtered difference signal, and
- the MDCT spectrum SHB of the original high-band signal.
These MDCT transforms (blocks 301 and 302) are each applied to 20 ms of signal sampled at 8 kHz (160 coefficients). The spectrum Y(k) coming from the merging block 303 thus comprises 2×160 = 320 coefficients. It is defined as follows:

[Y(0) Y(1) … Y(319)] = [DLBw(0) DLBw(1) … DLBw(159) SHB(0) SHB(1) … SHB(159)]
This spectrum is divided into eighteen sub-bands, a sub-band j being allocated a number of coefficients denoted nb_coef(j). The division into sub-bands is specified in Table 1 hereafter.
Thus, a sub-band j comprises the coefficients Y(k) with sb_bound(j) ≤ k < sb_bound(j+1).
TABLE 1
Boundaries and sizes of the sub-bands in TDAC coding

   j    sb_bound(j)    nb_coef(j)
   0        0              16
   1       16              16
   2       32              16
   3       48              16
   4       64              16
   5       80              16
   6       96              16
   7      112              16
   8      128              16
   9      144              16
  10      160              16
  11      176              16
  12      192              16
  13      208              16
  14      224              16
  15      240              16
  16      256              16
  17      272               8
  18      280              —
The spectral envelope {log_rms(j)}, j = 0, …, 17, is computed in block 304 according to the formula:

log_rms(j) = (1/2)·log2[ (1/nb_coef(j)) · Σ_{k = sb_bound(j)}^{sb_bound(j+1)−1} Y(k)² + ε_rms ],  j = 0, …, 17

where ε_rms = 2^(−24).
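As an illustration, the envelope computation can be sketched in Python (a sketch only, not the normative implementation; the boundary list follows Table 1 and the helper names are hypothetical):

```python
import math

# Sub-band boundaries from Table 1: seventeen boundaries at multiples of 16
# (sub-bands 0..16 of 16 coefficients each), then 272 and 280 (sub-band 17
# has 8 coefficients); coefficients 280..319 are not covered by the envelope.
SB_BOUND = [16 * j for j in range(17)] + [272, 280]
EPS_RMS = 2.0 ** -24

def log_rms(Y):
    """Per-sub-band log-energy of the merged 320-coefficient MDCT spectrum Y."""
    env = []
    for j in range(18):
        lo, hi = SB_BOUND[j], SB_BOUND[j + 1]
        nb_coef = hi - lo
        energy = sum(Y[k] ** 2 for k in range(lo, hi)) / nb_coef
        env.append(0.5 * math.log2(energy + EPS_RMS))
    return env

Y = [1.0] * 320                           # toy spectrum with unit coefficients
env = log_rms(Y)
print(max(abs(v) for v in env) < 1e-6)    # → True: unit mean-square energy gives log_rms ~ 0
```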
The spectral envelope is coded at a variable bit rate in block 305. This block 305 produces quantized integer values denoted rms_index(j) (with j = 0, …, 17), obtained by simple scalar quantization:

rms_index(j) = round(2·log_rms(j))

where the notation "round" denotes rounding to the nearest integer, subject to the constraint:

−11 ≤ rms_index(j) ≤ +20
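This scalar quantization with its clipping constraint can be sketched as follows (the helper name is hypothetical; Python's built-in round, which rounds ties to even, stands in for nearest-integer rounding):

```python
# Sketch of the envelope quantization: rms_index(j) = round(2 * log_rms(j)),
# clipped to the interval [-11, +20].
def quantize_envelope(log_rms_values):
    """Quantize each envelope value to an integer index in [-11, 20]."""
    return [max(-11, min(20, round(2.0 * v))) for v in log_rms_values]

# Values beyond the representable range are saturated at the bounds.
print(quantize_envelope([0.0, 3.3, 12.0, -7.0]))  # → [0, 7, 20, -11]
```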
This quantized value rms_index(j) is transmitted to the bit allocation block 306.
Coding of the spectral envelope itself is also carried out by block 305, separately for the low band (rms_index(j), with j = 0, …, 9) and for the high band (rms_index(j), with j = 10, …, 17). In each band, two types of coding can be chosen according to a given criterion; more precisely, the values rms_index(j):
- can be encoded by so-called "differential Huffman coding", or
- can be encoded by natural binary coding.
A bit (0 or 1) is transmitted to the decoder in order to indicate the chosen coding mode.
The number of bits allocated to each sub-band for its quantization is determined at block 306, on the basis of the quantized spectral envelope coming from block 305. The bit allocation carried out minimizes the root mean square deviation while respecting the constraint of a whole number of bits allocated per sub-band and a maximum number of bits that is not to be exceeded. The spectral content of the sub-bands is then encoded by spherical vector quantization (block 307).
The different bitstreams generated by blocks 305 and 307 are then multiplexed and structured in a hierarchical bitstream at the multiplexing block 308.
Reminders on the Transform Decoder in the G.729.1 Decoder
The stage of TDAC type transform decoding in the decoder G.729.1 is shown in FIG. 4.
In a similar manner to the encoder (FIG. 3), the decoded spectral envelope (block 401) makes it possible to retrieve the bit allocation (block 402). The envelope decoding (block 401) reconstructs the quantized values rms_index(j) (for j = 0, …, 17) of the spectral envelope on the basis of the (multiplexed) bitstream generated by block 305, deducing the decoded envelope therefrom:

rms_q(j) = 2^(rms_index(j)/2)
The spectral content of each of the sub-bands is retrieved by inverse spherical vector quantization (block 403). The sub-bands which are not transmitted due to an insufficient “bit budget” are extrapolated (block 404) on the basis of the MDCT transform of the output signal of the band extension (block 202 in FIG. 2).
After level adjustment of this spectrum (block 405) in relation to the spectral envelope and post-processing (block 406), the MDCT spectrum is split in two (block 407):
- the first 160 coefficients corresponding to the spectrum {circumflex over (D)}LBw of the decoded low-band difference signal, perceptually filtered, and
- the following 160 coefficients corresponding to the spectrum ŜHB of the decoded high-band signal.
These two spectra are transformed into time signals by inverse MDCT transform, denoted IMDCT (blocks 408 and 410), and the inverse perceptual weighting filter, denoted WLB(z)^(−1), is applied to the signal {circumflex over (d)}LBw (block 409) resulting from the inverse transform.
The allocation of bits to the sub-bands (block 306 in FIG. 3 or block 402 in FIG. 4) is more particularly described hereafter.
Blocks 306 and 402 carry out an identical operation on the basis of the values rms_index(j), j=0, . . . , 17. Thus it will be considered sufficient to describe below the functions of block 306 only.
The purpose of the binary allocation is to distribute between each of the sub-bands a certain (variable) bit budget denoted nbits_VQ, with:
nbits_VQ=351−nbits_rms, where nbits_rms is the number of bits used by the coding of the spectral envelope.
The result of the allocation is the whole number of bits, denoted nbit(j) (with j=0, . . . , 17), allocated to each of the sub-bands, having as an overall constraint:
Σ_{j=0}^{17} nbit(j) ≈ nbits_VQ
In standard G.729.1, the values nbit(j) (j = 0, …, 17) are moreover constrained by the fact that nbit(j) must be chosen from a restricted value set specified in Table 2 below.
TABLE 2
Possible values for the number of bits allocated in the TDAC sub-bands

  Size of the sub-band j,   Set of permitted values for nbit(j)
  nb_coef(j)                (in number of bits)
  8                         R8 = {0, 7, 10, 12, 13, 14, 15, 16}
  16                        R16 = {0, 9, 14, 16, 17, 18, 19, 20, 21, 22, 23,
                            24, 25, 26, 27, 28, 29, 30, 31, 32}
The allocation in standard G.729.1 relies on a “perceptual importance” per sub-band linked to the sub-band energy, denoted ip(j) (j=0 . . . 17), defined as follows:
ip(j) = (1/2)·log2( rms_q(j)² × nb_coef(j) ) + offset,  where offset = −2.
Since rms_q(j) = 2^(rms_index(j)/2), this formula can be simplified in the form:

ip(j) = (1/2)·rms_index(j)           for j = 0, …, 16
ip(j) = (1/2)·(rms_index(j) − 1)     for j = 17
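This simplification can be checked numerically (a sketch; the function names are hypothetical, and it relies on nb_coef(j) being 16 for j ≤ 16 and 8 for j = 17):

```python
import math

# Full perceptual-importance formula: ip(j) = (1/2)*log2(rms_q(j)^2 * nb_coef(j)) - 2,
# with rms_q(j) = 2**(rms_index(j)/2).
def ip_full(rms_index, nb_coef, offset=-2.0):
    rms_q = 2.0 ** (rms_index / 2.0)
    return 0.5 * math.log2(rms_q ** 2 * nb_coef) + offset

# Simplified closed form: rms_index/2 for the 16-coefficient sub-bands
# (log2(16)/2 = 2 cancels the offset), (rms_index - 1)/2 for the 8-coefficient one.
def ip_simple(rms_index, j):
    return 0.5 * rms_index if j <= 16 else 0.5 * (rms_index - 1)

ok = all(
    math.isclose(ip_full(r, 16), ip_simple(r, 0), abs_tol=1e-9) and
    math.isclose(ip_full(r, 8), ip_simple(r, 17), abs_tol=1e-9)
    for r in range(-11, 21)            # the permitted rms_index range
)
print(ok)  # → True
```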
On the basis of the perceptual importance of each sub-band, the allocation nbit(j) is computed as follows:
nbit(j) = argmin_{r ∈ R_{nb_coef(j)}} | nb_coef(j) × (ip(j) − λ_opt) − r |

where λ_opt is a parameter optimized by dichotomy.
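A minimal sketch of this allocation is given below (not the normative G.729.1 routine: the bracketing interval for λ, the toy ip values, and the function names are assumptions). For a candidate λ, each sub-band takes the permitted value r closest to nb_coef(j)·(ip(j) − λ); λ is then tuned by bisection ("dichotomy") so that the total approaches the budget nbits_VQ without exceeding it.

```python
# Permitted bit counts per sub-band size (Table 2).
R8 = [0, 7, 10, 12, 13, 14, 15, 16]
R16 = [0, 9] + list(range(14, 33))

def allocate(ip, nb_coef, lam):
    """For each sub-band, pick the permitted value closest to nb_coef*(ip - lam)."""
    return [min(R16 if n == 16 else R8, key=lambda r: abs(n * (i - lam) - r))
            for i, n in zip(ip, nb_coef)]

def find_lambda(ip, nb_coef, budget, lo=-10.0, hi=10.0, iters=30):
    """Bisect lambda: total bits decrease as lambda grows, so keep hi feasible."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(allocate(ip, nb_coef, mid)) > budget:
            lo = mid          # too many bits spent: raise lambda
        else:
            hi = mid          # within budget: try spending more
    return hi

ip = [4.0] * 17 + [3.5]       # toy perceptual importances
nb_coef = [16] * 17 + [8]
lam = find_lambda(ip, nb_coef, budget=200)
print(sum(allocate(ip, nb_coef, lam)) <= 200)  # → True: stays within the budget
```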
The incidence of the perceptual weighting (filtering of block 300) on the bit allocation (block 306) of the TDAC transform coder will now be described in more detail.
In standard G.729.1, the TDAC coding uses the perceptual weighting filter WLB(z) in the low band (block 300), as described above. In substance, the perceptual weighting filtering makes it possible to shape the coding noise. The principle of this filtering is to use the fact that it is possible to inject more noise in the frequency zones where the original signal has a strong energy.
The perceptual weighting filters most commonly used in narrowband CELP coding have the form Â(z/γ1)/Â(z/γ2), where 0 < γ2 < γ1 < 1 and Â(z) represents a linear prediction (LPC) spectrum. Thus the effect of the CELP analysis-by-synthesis coding is to minimize the root mean square deviation in a signal domain perceptually weighted by this type of filter.
However, in order to ensure the spectral continuity when the spectra DLBw and SHB are adjacent (block 303 in FIG. 3), the filter WLB(z) is defined in the form:
W_LB(z) = fac · Â(z/γ1) / Â(z/γ2)

with γ1 = 0.96, γ2 = 0.6, and

fac = ( Σ_{i=0}^{p} (−γ2)^i · â_i ) / ( Σ_{i=0}^{p} (−γ1)^i · â_i )
The factor fac provides a filter gain of 1 at the junction of the low and high bands (4 kHz). It is important to note that, in TDAC coding according to standard G.729.1, the coding relies on an energy criterion alone.
Drawbacks of the Prior Art
In standard G.729.1, the TDAC encoder jointly processes:
- the difference signal between the original low band and the CELP synthesis, perceptually filtered by a gain-compensated filter of the type Â(z/γ1)/Â(z/γ2) (ensuring spectral continuity), and
- the high band, which contains the original high-band signal.
The low-band signal corresponds to the 50 Hz-4 kHz frequencies, while the high-band signal corresponds to the 4-7 kHz frequencies.
The joint coding of these two signals is carried out in the MDCT domain according to the root mean square deviation criterion. Thus the high band is coded according to energy criteria, which is sub-optimal (in the “perceptual” sense of the term).
More generally still, a coding in several bands can be considered, in which a perceptual weighting filter is applied in the time domain to the signal of at least one band, and the set of sub-bands is coded jointly by transform coding. If it is desired to apply the perceptual weighting in the frequency domain, the problem then posed is that of the continuity and homogeneity of the spectra between sub-bands.