Code-Excited Linear Prediction (CELP) is a widely used technique that encodes a speech signal by exploiting specific characteristics of the human voice or a model of human voice production. Examples of a CELP inner core layer combined with a first Modified Discrete Cosine Transform (MDCT) enhancement layer can be found in the ITU-T G.729.1 and G.718 standards, the relevant contents of which are summarized below. A detailed description can be found in the ITU-T standard documents.
General Description of ITU-T G.729.1
ITU-T G.729.1, also called the G.729EV coder, is an 8-32 kbit/s scalable wideband (50-7,000 Hz) extension of ITU-T Rec. G.729. By default, the encoder input and decoder output are sampled at 16,000 Hz. The bitstream produced by the encoder is scalable and has 12 embedded layers, which will be referred to as Layers 1 to 12. Layer 1 is the core layer corresponding to a bit rate of 8 kbit/s. This layer is compliant with the G.729 bitstream, which makes G.729EV interoperable with G.729. Layer 2 is a narrowband enhancement layer adding 4 kbit/s, while Layers 3 to 12 are wideband enhancement layers adding a total of 20 kbit/s in steps of 2 kbit/s.
This coder is designed to operate with a digital signal sampled at 16,000 Hz followed by conversion to 16-bit linear pulse code modulation (PCM) for the input to the encoder. However, the 8,000 Hz input sampling frequency is also supported. Similarly, the format of the decoder output is 16-bit linear PCM with a sampling frequency of 8,000 or 16,000 Hz. Other input/output characteristics are converted to 16-bit linear PCM with 8,000 or 16,000 Hz sampling before encoding, or from 16-bit linear PCM to the appropriate format after decoding.
The G.729EV coder is built upon a three-stage structure: embedded Code-Excited Linear-Prediction (CELP) coding, Time-Domain Bandwidth Extension (TDBWE) and predictive transform coding that will be referred to as Time-Domain Aliasing Cancellation (TDAC). The embedded CELP stage generates Layers 1 and 2, which yield a narrowband synthesis (50-4,000 Hz) at 8 kbit/s and 12 kbit/s. The TDBWE stage generates Layer 3 and allows producing a wideband output (50-7000 Hz) at 14 kbit/s. The TDAC stage operates in the MDCT domain and generates Layers 4 to 12 to improve quality from 14 to 32 kbit/s. TDAC coding represents jointly the weighted CELP coding error signal in the 50-4,000 Hz band and the input signal in the 4,000-7,000 Hz band.
The G.729EV coder operates on 20 ms frames. However, the embedded CELP coding stage operates on 10 ms frames, like G.729. As a result, two 10 ms CELP frames are processed per 20 ms frame. In the following, to be consistent with the text of ITU-T Rec. G.729, the 20 ms frames used by G.729EV will be referred to as superframes, whereas the 10 ms frames and the 5 ms subframes involved in the CELP processing will be called frames and subframes, respectively.
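The frame hierarchy translates into the sample counts sketched below. The constant and function names are illustrative and not taken from the Recommendation, and the CELP stage is assumed to run on a decimated band sampled at 8,000 Hz.

```python
# Durations in milliseconds, as stated in the text.
SUPERFRAME_MS = 20
FRAME_MS = 10
SUBFRAME_MS = 5


def samples(duration_ms: int, sample_rate_hz: int) -> int:
    """Number of samples covering duration_ms at sample_rate_hz."""
    return duration_ms * sample_rate_hz // 1000


superframe_wb = samples(SUPERFRAME_MS, 16000)  # wideband input superframe
superframe_nb = samples(SUPERFRAME_MS, 8000)   # one decimated sub-band (assumed 8 kHz)
frame_nb = samples(FRAME_MS, 8000)             # CELP frame
subframe_nb = samples(SUBFRAME_MS, 8000)       # CELP subframe

frames_per_superframe = SUPERFRAME_MS // FRAME_MS
subframes_per_frame = FRAME_MS // SUBFRAME_MS
```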
G.729.1 Encoder
A functional diagram of the G.729.1 encoder is presented in FIG. 1. The encoder operates on 20 ms input superframes. By default, input signal 101, sWB(n), is sampled at 16,000 Hz; therefore, the input superframes are 320 samples long. Input signal sWB(n) is first split into two sub-bands using a quadrature mirror filterbank (QMF) defined by the filters H1(z) and H2(z). Lower-band input signal 102, sLBqmf(n), obtained after decimation, is pre-processed by a high-pass filter Hh1(z) with a 50 Hz cut-off frequency. The resulting signal 103, sLB(n), is coded by the 8-12 kbit/s narrowband embedded CELP encoder. To be consistent with ITU-T Rec. G.729, the signal sLB(n) will also be denoted s(n). The difference 104, dLB(n), between s(n) and the local synthesis 105, ŝenh(n), of the CELP encoder at 12 kbit/s is processed by the perceptual weighting filter WLB(z). The parameters of WLB(z) are derived from the quantized LP coefficients of the CELP encoder. Furthermore, the filter WLB(z) includes a gain compensation that guarantees the spectral continuity between the output 106, dLBw(n), of WLB(z) and the higher-band input signal 107, sHB(n). The weighted difference dLBw(n) is then transformed into the frequency domain by MDCT. The higher-band input signal 108, sHBfold(n), obtained after decimation and spectral folding by (−1)^n, is pre-processed by a low-pass filter Hh2(z) with a 3,000 Hz cut-off frequency. The resulting signal sHB(n) is coded by the TDBWE encoder. The signal sHB(n) is also transformed into the frequency domain by MDCT. The two sets of MDCT coefficients, 109, DLBw(k), and 110, SHB(k), are finally coded by the TDAC encoder. In addition, some parameters are transmitted by the frame erasure concealment (FEC) encoder in order to introduce parameter-level redundancy in the bitstream. This redundancy allows improved quality in the presence of erased superframes.
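Spectral folding by (−1)^n can be illustrated with a single tone: multiplying sample n by (−1)^n moves a component at frequency f to fs/2 − f. The sketch below assumes a decimated band sampled at 8,000 Hz and an arbitrary 1,000 Hz test tone; the function name is hypothetical.

```python
import math


def spectral_fold(x):
    """Multiply sample n by (-1)^n, which mirrors the spectrum about fs/4."""
    return [(-v if n % 2 else v) for n, v in enumerate(x)]


fs = 8000.0    # sampling rate of the decimated band (assumed)
f_in = 1000.0  # test tone frequency, chosen for illustration
N = 160        # one 20 ms superframe at 8 kHz

tone = [math.cos(2 * math.pi * f_in * n / fs) for n in range(N)]
folded = spectral_fold(tone)

# After folding, the tone is expected at fs/2 - f_in.
f_out = fs / 2 - f_in
mirror = [math.cos(2 * math.pi * f_out * n / fs) for n in range(N)]
max_err = max(abs(a - b) for a, b in zip(folded, mirror))
```

Applying the folding twice restores the original samples, which is why the same operation can be used on both the encoder and decoder side.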
G.729.1 Decoder
A functional diagram of the G.729.1 decoder is presented in FIG. 2a; however, the specific case of frame erasure concealment is not considered in this figure. The decoding depends on the actual number of received layers or, equivalently, on the received bit rate.
If the received bit rate is:
8 kbit/s (Layer 1): The core layer is decoded by the embedded CELP decoder to obtain 201, ŝLB(n)=ŝ(n). Then, ŝLB(n) is postfiltered into 202, ŝLBpost(n), and post-processed by a high-pass filter (HPF) into 203, ŝLBqmf(n)=ŝLBhpf(n). The QMF synthesis filterbank defined by the filters G1(z) and G2(z) generates the output with a high-frequency synthesis 204, ŝHBqmf(n), set to zero.
12 kbit/s (Layers 1 and 2): The core layer and narrowband enhancement layer are decoded by the embedded CELP decoder to obtain 201, ŝLB(n)=ŝenh(n), and ŝLB(n) is then postfiltered into 202, ŝLBpost(n) and high-pass filtered to obtain 203, ŝLBqmf(n)=ŝLBhpf(n). The QMF synthesis filterbank generates the output with a high-frequency synthesis 204, ŝHBqmf(n) set to zero.
14 kbit/s (Layers 1 to 3): In addition to the narrowband CELP decoding and lower-band adaptive postfiltering, the TDBWE decoder produces a high-frequency synthesis 205, ŝHBbwe(n), which is then transformed into the frequency domain by MDCT so as to zero the frequency band above 3,000 Hz in the higher-band spectrum 206, ŜHBbwe(k). The resulting spectrum 207, ŜHB(k), is transformed into the time domain by inverse MDCT and overlap-add before spectral folding by (−1)^n. In the QMF synthesis filterbank, the reconstructed higher-band signal 204, ŝHBqmf(n), is combined with the respective lower-band signal 202, ŝLBqmf(n)=ŝLBpost(n), reconstructed at 12 kbit/s without high-pass filtering.
Above 14 kbit/s (Layers 1 to 4+): In addition to the narrowband CELP and TDBWE decoding, the TDAC decoder reconstructs MDCT coefficients 208, D̂LBw(k), and 207, ŜHB(k), which correspond to the reconstructed weighted difference in the lower band (0-4,000 Hz) and the reconstructed signal in the higher band (4,000-7,000 Hz). Note that in the higher band, the non-received sub-bands and the sub-bands with zero bit allocation in TDAC decoding are replaced by the level-adjusted sub-bands of ŜHBbwe(k). Both D̂LBw(k) and ŜHB(k) are transformed into the time domain by inverse MDCT and overlap-add. Lower-band signal 209, d̂LBw(n), is then processed by the inverse perceptual weighting filter WLB(z)^−1. To attenuate transform coding artifacts, pre/post-echoes are detected and reduced in both the lower- and higher-band signals 210, d̂LB(n), and 211, ŝHB(n). The lower-band synthesis ŝLB(n) is postfiltered, while the higher-band synthesis 212, ŝHBfold(n), is spectrally folded by (−1)^n. The signals ŝLBqmf(n)=ŝLBpost(n) and ŝHBqmf(n) are then combined and upsampled in the QMF synthesis filterbank.
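The dependence of the decoding path on the received bit rate can be summarized in a small dispatch function. This is only a sketch of the four cases listed above; the function name and the stage labels are hypothetical, and frame erasure concealment is ignored.

```python
from typing import List


def active_decoder_stages(bit_rate_kbps: int) -> List[str]:
    """Return the decoding stages used at a received bit rate (kbit/s)."""
    if bit_rate_kbps < 8:
        raise ValueError("at least the 8 kbit/s core layer is required")
    stages = ["CELP core (Layer 1)"]
    if bit_rate_kbps >= 12:
        stages.append("CELP enhancement (Layer 2)")
    if bit_rate_kbps >= 14:
        stages.append("TDBWE (Layer 3)")
    if bit_rate_kbps > 14:
        stages.append("TDAC (Layers 4+)")
    return stages


def output_is_wideband(bit_rate_kbps: int) -> bool:
    """The higher band is set to zero below 14 kbit/s."""
    return bit_rate_kbps >= 14
```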
Bit Allocation to Coder Parameters and Bitstream Layer Format
For a given bit rate, the bitstream is obtained by concatenation of the contributing layers. For example, at 24 kbit/s, which corresponds to 480 bits per superframe, the bitstream comprises Layer 1 (160 bits) + Layer 2 (80 bits) + Layer 3 (40 bits) + Layers 4 to 8 (200 bits). The G.729EV bitstream format is illustrated in FIG. 2b.
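The layer arithmetic of this example can be checked with a few lines of code. The sketch below uses the layer rates stated earlier (8, 4, then ten steps of 2 kbit/s) and the 20 ms superframe; the helper names are illustrative.

```python
SUPERFRAME_MS = 20

# Incremental bit rate of each layer in kbit/s: Layer 1, Layer 2, Layers 3 to 12.
LAYER_RATES_KBPS = [8, 4] + [2] * 10


def layer_bits(layer: int) -> int:
    """Bits contributed by one layer (1-based) to a 20 ms superframe."""
    # kbit/s multiplied by ms gives bits.
    return LAYER_RATES_KBPS[layer - 1] * SUPERFRAME_MS


def layers_for_rate(bit_rate_kbps: int) -> int:
    """Number of layers whose cumulative rate equals bit_rate_kbps."""
    total = 0
    for count, rate in enumerate(LAYER_RATES_KBPS, start=1):
        total += rate
        if total == bit_rate_kbps:
            return count
    raise ValueError(f"{bit_rate_kbps} kbit/s is not a layer boundary")


def superframe_bits(bit_rate_kbps: int) -> int:
    """Total bits per superframe when decoding up to bit_rate_kbps."""
    n_layers = layers_for_rate(bit_rate_kbps)
    return sum(layer_bits(k) for k in range(1, n_layers + 1))
```

For instance, `superframe_bits(24)` sums Layers 1 to 8 and reproduces the 480 bits quoted in the text.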
Since the TDAC coder employs spectral envelope entropy coding and adaptive sub-band bit allocation, the TDAC parameters are encoded with a variable number of bits. However, the bitstream above 14 kbit/s can still be formatted into layers of 2 kbit/s, because the TDAC encoder performs a bit allocation on the basis of the maximum encoder bit rate (32 kbit/s) and the TDAC decoder can handle bitstream truncations at arbitrary positions.
G.729.1 TDAC Encoder (Layers 4 to 12)
A G.729.1 Time Domain Aliasing Cancellation (TDAC) encoder is illustrated in FIG. 3. The TDAC encoder jointly represents two split MDCT spectra 301, DLBw(k), and 302, SHB(k), by gain-shape vector quantization. DLBw(k) represents the CELP coding error in the weighted spectrum domain over [0, 4 kHz], and SHB(k) is the unquantized weighted spectrum over [4 kHz, 8 kHz]. The joint spectrum is divided into sub-bands. The gains in each sub-band define the spectral envelope, and the shape of each sub-band is encoded by embedded spherical vector quantization using trained permutation codes.
G.729.1 Perceptual Weighting of the CELP Difference Signal
The difference 104, dLB(n), between the embedded CELP encoder input s(n) and the 12 kbit/s local synthesis 105, ŝenh(n), is processed by a perceptual weighting filter WLB(z) defined as:
$$W_{LB}(z) = fac \cdot \frac{\hat{A}(z/\gamma_1')}{\hat{A}(z/\gamma_2')}, \qquad (1)$$
where fac is a gain compensation and âi are the coefficients of the quantized linear-prediction filter Â(z) obtained from the embedded CELP encoder. The gain compensation factor guarantees the spectral continuity between the output 106, dLBw(n), of WLB(z) and the signal 107, sHB(n), in the adjacent higher band. The filter WLB(z) models the short-term inverse frequency masking curve and allows applying MDCT coding optimized for the mean-square error criterion. It also maps the difference signal 104, dLB(n), into a weighted domain similar to the CELP target domain used at 8 and 12 kbit/s.
Sub-Bands
The MDCT coefficients in the 0-7,000 Hz band are split into 18 sub-bands. The j-th sub-band comprises nb_coef(j) coefficients 103, Y(k), with sb_bound(j) ≤ k < sb_bound(j+1). The first 17 sub-bands comprise 16 coefficients (400 Hz), and the last sub-band comprises 8 coefficients (200 Hz). The spectral envelope is defined as the root mean square (rms) 304 in log domain of the 18 sub-bands:
$$\mathrm{log\_rms}(j) = \frac{1}{2}\log_2\left[\frac{1}{\mathrm{nb\_coef}(j)} \sum_{k=\mathrm{sb\_bound}(j)}^{\mathrm{sb\_bound}(j+1)-1} Y(k)^2 + \varepsilon_{rms}\right], \quad j = 0, \ldots, 17, \qquad (2)$$
where ε_rms = 2^−24. The spectral envelope is quantized with 5 bits by uniform scalar quantization and the resulting quantization indices are coded using a two-mode binary encoder. The 5-bit quantization consists in computing the indices 305, rms_index(j), j=0, . . . , 17, as follows:
$$\mathrm{rms\_index}(j) = \mathrm{round}\left(2 \cdot \mathrm{log\_rms}(j)\right), \qquad (3)$$
with the restriction:
$$-11 \le \mathrm{rms\_index}(j) \le +20, \qquad (4)$$
i.e., the indices are limited to the range from −11 to +20 (32 possible values). The resulting quantized full-band envelope is then divided into two subvectors:
lower-band spectral envelope: (rms_index(0), rms_index(1), . . . , rms_index(9)); and
higher-band spectral envelope: (rms_index(10), rms_index(11), . . . , rms_index(17)).
These two subvectors are coded separately using a two-mode lossless encoder which switches adaptively between differential Huffman coding (mode 0) and direct natural binary coding (mode 1). Differential Huffman coding is used to minimize the average number of bits, whereas direct natural binary coding is used to limit the worst-case number of bits as well as to correctly encode the envelope of signals which are saturated by differential Huffman coding (e.g., sinusoids). The lower-band and higher-band spectral envelopes are encoded in the same way, and for each of them one bit is used to indicate the selected mode to the spectral envelope decoder.
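The sub-band layout and the envelope quantization described above can be sketched as follows. Two points are assumptions rather than statements from the text: the index is taken as the rounded value of twice log_rms(j), so that rms_q(j) = 2^(rms_index(j)/2) approximates the sub-band rms, and rounding is implemented as floor(x + 0.5). The Huffman and natural binary coding of the indices is not shown, and all names are illustrative.

```python
import math

N_SUBBANDS = 18
EPS_RMS = 2.0 ** -24

# 19 boundaries: 17 sub-bands of 16 coefficients, then one of 8 (280 in total).
SB_BOUND = [16 * j for j in range(18)] + [280]
NB_COEF = [SB_BOUND[j + 1] - SB_BOUND[j] for j in range(N_SUBBANDS)]


def log_rms(Y, j):
    """log2 of the rms of sub-band j, floored by EPS_RMS."""
    energy = sum(Y[k] ** 2 for k in range(SB_BOUND[j], SB_BOUND[j + 1]))
    return 0.5 * math.log2(energy / NB_COEF[j] + EPS_RMS)


def rms_index(Y, j):
    """Quantize in half-log2 steps and clamp to the 5-bit range [-11, +20].

    Assumption: index = round(2 * log_rms), rounding half up.
    """
    idx = math.floor(2.0 * log_rms(Y, j) + 0.5)
    return max(-11, min(20, idx))


def quantize_envelope(Y):
    """Return the (lower-band, higher-band) index subvectors."""
    indices = [rms_index(Y, j) for j in range(N_SUBBANDS)]
    return indices[:10], indices[10:]
```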
Sub-Band Ordering by Perceptual Importance
The perceptual importance 307, ip(j), j=0 . . . 17, of each sub-band is defined as:
$$ip(j) = \frac{1}{2}\log_2\left(\mathrm{rms\_q}(j)^2 \times \mathrm{nb\_coef}(j)\right) + \mathrm{offset}, \qquad (5)$$
where rms_q(j) = 2^(rms_index(j)/2) is the quantized rms and rms_q(j)^2 × nb_coef(j) corresponds to the quantized sub-band energy. Consequently, the perceptual importance is equivalent to the sub-band log-energy (apart from the offset). This information is related to the quantized spectral envelope as follows:
$$ip(j) = \frac{1}{2}\left[\mathrm{rms\_index}(j) + \log_2\left(\mathrm{nb\_coef}(j)\right)\right] + \mathrm{offset}. \qquad (6)$$
The offset value is introduced to further simplify the expression of 307, ip(j). Using offset=−2, the perceptual importance reduces to:
$$ip(j) = \begin{cases} \frac{1}{2}\,\mathrm{rms\_index}(j) & \text{for } j = 0, \ldots, 16 \\ \frac{1}{2}\left(\mathrm{rms\_index}(j) - 1\right) & \text{for } j = 17. \end{cases} \qquad (7)$$
The sub-bands are then sorted by decreasing perceptual importance. The result is an index 0 ≤ ord_ip(j) < 18, j=0, . . . , 17, for each sub-band, which indicates that sub-band j has the (ord_ip(j)+1)-th largest perceptual importance. This ordering is used for bit allocation and multiplexing of vector quantization indices.
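A minimal sketch of the importance computation with offset=−2 and of the resulting ordering is given below. The text does not say how ties are resolved, so the rule used here (the lower sub-band index wins) is an assumption, and the function names are illustrative.

```python
def perceptual_importance(rms_idx):
    """ip(j) for the 18 sub-bands, using offset = -2."""
    ip = [0.5 * rms_idx[j] for j in range(17)]  # 16-coefficient sub-bands
    ip.append(0.5 * (rms_idx[17] - 1))          # 8-coefficient last sub-band
    return ip


def order_by_importance(ip):
    """Return ord_ip, where ord_ip[j] is the rank of sub-band j (0 = most important).

    Ties go to the lower sub-band index; this tie rule is an assumption.
    """
    ranked = sorted(range(len(ip)), key=lambda j: (-ip[j], j))
    ord_ip = [0] * len(ip)
    for rank, j in enumerate(ranked):
        ord_ip[j] = rank
    return ord_ip
```

Because both functions depend only on the quantized envelope indices, an encoder and a decoder running them on the same indices obtain the same ordering.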
Bit Allocation for Split Spherical Vector Quantization
The number of bits allocated to each sub-band is determined using the perceptual importance ip(j), j=0, . . . , 17, which is also computed at the TDAC decoder. As a result, the decoder can perform the same operation without any side information. The maximum allocation is limited to 2 bits per sample. The total bit budget is nbits_VQ = 351 − nbits_HB − nbits_LB, where nbits_LB and nbits_HB correspond to the number of bits used to encode the lower-band and higher-band spectral envelope, respectively. The total number of allocated bits never exceeds the bit budget (due to the properly initialized search interval). However, it may be lower than the bit budget. In this case, the remaining bits are distributed to the sub-bands in order of decreasing perceptual importance (this procedure is based on the indices ord_ip(j)).
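The budget computation and the final redistribution step can be sketched as below. The initial allocation search is not described in the text and is taken as an input here. The top-up rule (fill each sub-band up to the 2 bits per sample cap before moving to the next one) is a hypothetical simplification, not the procedure of the Recommendation, and the function names are illustrative.

```python
MAX_BITS_PER_SAMPLE = 2


def vq_bit_budget(nbits_lb: int, nbits_hb: int) -> int:
    """Bits left for vector quantization after the spectral envelope."""
    return 351 - nbits_hb - nbits_lb


def distribute_remaining(nbit, nb_coef, ord_ip, budget):
    """Give leftover bits to sub-bands in decreasing perceptual importance.

    Hypothetical rule: top up each sub-band to its cap, most important first.
    """
    alloc = list(nbit)
    remaining = budget - sum(alloc)
    if remaining < 0:
        raise ValueError("initial allocation exceeds the bit budget")
    # Visit sub-bands from most important (rank 0) to least important.
    for j in sorted(range(len(alloc)), key=lambda j: ord_ip[j]):
        cap = MAX_BITS_PER_SAMPLE * nb_coef[j]
        extra = min(remaining, cap - alloc[j])
        if extra > 0:
            alloc[j] += extra
            remaining -= extra
    return alloc
```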
Quantization of MDCT Coefficients
Each sub-band j=0, . . . , 17 of dimension nb_coef(j) is encoded with nbit(j) bits by spherical vector quantization. This operation is divided into two steps: (1) searching for the best codevector and (2) indexing of the selected codevector.
TDAC Decoder (Layers 4 to 12)
The TDAC decoder is depicted in FIG. 4. The received normalization factor (called norm_MDCT) transmitted by the encoder with 4 bits is used in the TDAC decoder to scale the MDCT coefficients. The factor is used to scale the signal reconstructed by two inverse MDCTs.
Spectral Envelope Decoding
The higher-band spectral envelope is decoded first. The bit indicating the selected coding mode at the encoder may be: 0→differential Huffman coding, 1→natural binary coding. If mode 0 is selected, 5 bits are decoded to obtain an index rms_index(10) in [−11, +20]. Then, the Huffman codes associated with the differential indices diff_index(j), j=11, . . . , 17, are decoded. The index, 401, rms_index(j), j=11, . . . , 17, is reconstructed as follows:
$$\mathrm{rms\_index}(j) = \mathrm{rms\_index}(j-1) + \mathrm{diff\_index}(j). \qquad (8)$$
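The mode 0 reconstruction of equation (8) amounts to a running sum. The sketch below assumes that the Huffman decoding yielding diff_index(j) has already been done; the function name is hypothetical.

```python
def decode_differential(first_index, diff_indices):
    """Mode 0: rebuild the indices from the first index and the differences."""
    if not -11 <= first_index <= 20:
        raise ValueError("first index outside [-11, +20]")
    indices = [first_index]
    for diff in diff_indices:
        # rms_index(j) = rms_index(j-1) + diff_index(j)
        indices.append(indices[-1] + diff)
    return indices


# Higher band: rms_index(10) followed by seven differences for j = 11..17.
hb_indices = decode_differential(3, [1, -2, 0, 4, -1, -1, 2])
```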
If mode 1 is selected, rms_index(j), j=10, . . . , 17, is obtained in [−11, +20] by decoding 8×5 bits. If the number of bits is not sufficient to decode the higher-band spectral envelope completely, the decoded indices rms_index(j) are kept to allow partial level-adjustment of the decoded higher-band spectrum. The bits related to the lower band, i.e., rms_index(j), j=0, . . . , 9, are decoded in a similar way as in the higher band, including one bit to select mode 0 or 1. The decoded indices are combined into a single vector [rms_index(0) rms_index(1) . . . rms_index(17)], which represents the reconstructed spectral envelope in log domain. This envelope is converted into the linear domain as follows, 402:
$$\mathrm{rms\_q}(j) = 2^{\frac{1}{2}\,\mathrm{rms\_index}(j)} \qquad (9)$$
If the spectral envelope is not completely decoded, neither the sub-band ordering nor the bit allocation is performed.
Decoding of the Vector Quantization Indices
The vector quantization indices are read from the TDAC bitstream according to their perceptual importance. If sub-band j has zero bits allocated, i.e., 403, nbit(j)=0, or if the corresponding vector quantization index is not received, its coefficients are set to zero at this stage. In sub-band j of dimension nb_coef(j) and non-zero bit allocation, 403, nbit(j), the vector quantization index identifies a codevector y which is a signed permutation of an absolute leader y0.
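The notion of a signed permutation of an absolute leader can be made concrete with a small check. The leader values below are made up for illustration and do not come from the trained codebooks; the function names are hypothetical.

```python
def is_signed_permutation(y, leader):
    """True if y rearranges the leader's magnitudes with arbitrary signs."""
    return sorted(abs(v) for v in y) == sorted(leader)


def squared_norm(v):
    """Sum of squared components."""
    return sum(c * c for c in v)


leader = [2, 1, 1, 0]        # hypothetical absolute leader y0 (non-negative)
codevector = [-1, 0, 2, -1]  # one of its signed permutations
```

All signed permutations of a leader share the same norm, which is consistent with the term spherical vector quantization.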
Extrapolation of Missing Higher-Band Sub-Bands and Level Adjustment of Extrapolated Sub-Bands
In the higher-band spectrum (for sub-bands j=10, . . . , 17) the non-received sub-bands and the sub-bands with nbit(j)=0 are replaced by the equivalent sub-bands in the MDCT of the TDBWE synthesis, i.e., 406, Ŷext(sb_bound(j)+k)=ŜHBbwe(sb_bound(j)−160+k), k=0, . . . , nb_coef(j)−1. To gracefully improve quality with the number of received TDAC layers, the MDCT coefficients of the signal, 405, ŝHBbwe(n) obtained by bandwidth extension (TDBWE) are level adjusted based on the received TDAC spectral envelope. The rms of the extrapolated sub-bands is therefore set to, 402, rms_q(j) if this higher-band envelope information is available.
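Level adjustment of an extrapolated sub-band can be sketched as a single gain that brings the sub-band rms to the decoded value rms_q(j). This is a simplified illustration under that assumption; the exact procedure of the Recommendation may differ, and the function names are hypothetical.

```python
import math


def rms_q(rms_index):
    """Linear-domain rms from a decoded envelope index."""
    return 2.0 ** (0.5 * rms_index)


def level_adjust(coefs, target_rms):
    """Scale a sub-band so that its rms equals target_rms."""
    rms = math.sqrt(sum(c * c for c in coefs) / len(coefs))
    if rms == 0.0:
        return list(coefs)  # an all-zero sub-band cannot be rescaled
    gain = target_rms / rms
    return [gain * c for c in coefs]


# A 16-coefficient sub-band taken from the TDBWE spectrum (made-up values).
extrapolated = [1.0, -1.0] * 8
adjusted = level_adjust(extrapolated, rms_q(4))
```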
Inverse Perceptual Weighting Filter
The inverse filter WLB(z)^−1 is defined as:
$$W_{LB}(z)^{-1} = \frac{1}{fac} \cdot \frac{\hat{A}(z/\gamma_2')}{\hat{A}(z/\gamma_1')}, \qquad (10)$$
where 1/fac is a gain compensation factor and âi are the coefficients of the decoded linear-predictive filter Â(z) obtained from the narrowband embedded CELP decoder as in 4.1.1/G.729. As in the encoder, these coefficients are updated every 5 ms subframe. The role of WLB(z)^−1 is to shape the coding noise introduced by the TDAC decoder in the lower band. The factor 1/fac is adapted to guarantee the spectral continuity between d̂LB(n) and ŝLB(n).
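The relationship between the weighting filter and its inverse can be checked numerically. In the sketch below, the LP coefficients, the values chosen for γ1′ and γ2′, and the value of fac are all illustrative assumptions, the filter order is reduced to 2 for brevity, and the per-subframe coefficient update is not modeled. The convention Â(z) = 1 + Σ âi z^−i is assumed, so that Â(z/γ) has coefficients âi·γ^i.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): a[i] is scaled by gamma**i (a[0] = 1)."""
    return [c * gamma ** i for i, c in enumerate(a)]


def pole_zero_filter(x, num, den, gain=1.0):
    """Apply gain * num(z)/den(z) to x with zero initial state; den[0] must be 1."""
    y = []
    for n in range(len(x)):
        acc = sum(num[i] * x[n - i] for i in range(len(num)) if n - i >= 0)
        acc -= sum(den[i] * y[n - i] for i in range(1, len(den)) if n - i >= 0)
        y.append(acc)
    return [gain * v for v in y]


a_hat = [1.0, -0.9, 0.2]  # hypothetical quantized LP filter (order 2)
g1, g2 = 0.94, 0.6        # illustrative values for gamma_1' and gamma_2'
fac = 0.8                 # illustrative gain compensation

num = bandwidth_expand(a_hat, g1)
den = bandwidth_expand(a_hat, g2)

d = [1.0, 0.5, -0.25, 0.0, 0.75, -1.0, 0.3, 0.0]
d_w = pole_zero_filter(d, num, den, gain=fac)            # weighting filter
d_rec = pole_zero_filter(d_w, den, num, gain=1.0 / fac)  # inverse filter
max_err = max(abs(a - b) for a, b in zip(d, d_rec))
```

With matching coefficients and zero initial state on both sides, the inverse filter undoes the weighting up to rounding error, so any difference seen at the decoder output comes from the quantization performed in between.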