1. Field of the Invention
The present invention is generally in the field of audio/speech coding. In particular, the present invention is in the field of low bit rate audio/speech coding.
2. Background Art
Frequency domain coding (transform coding) has been widely used in various ITU-T, MPEG, and 3GPP standards. When the bit rate is very low, BandWidth Extension (BWE) can be used. BWE usually comprises frequency envelope coding, temporal envelope coding, and spectral fine structure generation. Unavoidable errors in generating the fine spectrum can lead to an unstable decoded signal or clearly audible echoes, especially for fast-changing signals. Fine or precise quantization of temporal envelope shaping can clearly reduce echoes and/or perceptual distortion, but it can require many bits if a traditional approach is used. A well-known prior-art example of BWE can be found in the standard ITU-T G.729.1, in which the algorithm is named TDBWE (Time Domain Bandwidth Extension). The description of ITU-T G.729.1 related to TDBWE is given here.
The frequency domain can be defined as the FFT-transformed domain; it can also be the MDCT (Modified Discrete Cosine Transform) domain.
General Description of ITU-T G.729.1
ITU-T G.729.1, also called the G.729EV coder, is an 8-32 kbit/s scalable wideband (50-7000 Hz) extension of ITU-T Rec. G.729. By default, the encoder input and decoder output are sampled at 16000 Hz. The bitstream produced by the encoder is scalable and consists of 12 embedded layers, which will be referred to as Layers 1 to 12. Layer 1 is the core layer corresponding to a bit rate of 8 kbit/s. This layer is compliant with the G.729 bitstream, which makes G.729EV interoperable with G.729. Layer 2 is a narrowband enhancement layer adding 4 kbit/s, while Layers 3 to 12 are wideband enhancement layers adding 20 kbit/s in steps of 2 kbit/s.
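The layer structure described above implies a simple mapping from the highest received layer to the cumulative bit rate. The following Python sketch illustrates that mapping; the function name `layer_bit_rate_kbps` is hypothetical and is not taken from the Recommendation.

```python
def layer_bit_rate_kbps(layer):
    """Cumulative bit rate in kbit/s when Layers 1..layer are received."""
    if not 1 <= layer <= 12:
        raise ValueError("the bitstream consists of Layers 1 to 12")
    if layer == 1:
        return 8                      # core layer, G.729-compatible
    if layer == 2:
        return 12                     # narrowband enhancement adds 4 kbit/s
    return 12 + 2 * (layer - 2)       # each wideband layer adds 2 kbit/s


rates = [layer_bit_rate_kbps(k) for k in range(1, 13)]
```

Under these assumptions, Layer 3 corresponds to 14 kbit/s and Layer 12 to 32 kbit/s, which matches the 8-32 kbit/s range stated above.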
This coder is designed to operate with a digital signal sampled at 16000 Hz followed by conversion to 16-bit linear PCM for the input to the encoder. However, the 8000 Hz input sampling frequency is also supported. Similarly, the format of the decoder output is 16-bit linear PCM with a sampling frequency of 8000 or 16000 Hz. Other input/output characteristics should be converted to 16-bit linear PCM with 8000 or 16000 Hz sampling before encoding, or from 16-bit linear PCM to the appropriate format after decoding. The bitstream from the encoder to the decoder is defined within this Recommendation.
The G.729EV coder is built upon a three-stage structure: embedded Code-Excited Linear-Prediction (CELP) coding, Time-Domain Bandwidth Extension (TDBWE) and predictive transform coding that will be referred to as Time-Domain Aliasing Cancellation (TDAC). The embedded CELP stage generates Layers 1 and 2 which yield a narrowband synthesis (50-4000 Hz) at 8 and 12 kbit/s. The TDBWE stage generates Layer 3 and allows producing a wideband output (50-7000 Hz) at 14 kbit/s. The TDAC stage operates in the Modified Discrete Cosine Transform (MDCT) domain and generates Layers 4 to 12 to improve quality from 14 to 32 kbit/s. TDAC coding represents jointly the weighted CELP coding error signal in the 50-4000 Hz band and the input signal in the 4000-7000 Hz band.
The G.729EV coder operates on 20 ms frames. However, the embedded CELP coding stage operates on 10 ms frames, like G.729. As a result, two 10 ms CELP frames are processed per 20 ms frame. In the following, to be consistent with the text of ITU-T Rec. G.729, the 20 ms frames used by G.729EV will be referred to as superframes, whereas the 10 ms frames and the 5 ms subframes involved in the CELP processing will be respectively called frames and subframes. Of the G.729EV stages, the TDBWE algorithm is the one relevant to the present topic.
G.729.1 Encoder
A functional diagram of the encoder part is presented in FIG. 1. The encoder operates on 20 ms input superframes. By default, the input signal 101, sWB(n), is sampled at 16000 Hz. Therefore, the input superframes are 320 samples long. The input signal sWB(n) is first split into two sub-bands using a QMF filter bank defined by the filters H1(z) and H2(z). The lower-band input signal 102, sLBqmf(n), obtained after decimation is pre-processed by a high-pass filter Hh1(z) with 50 Hz cut-off frequency. The resulting signal 103, sLB(n), is coded by the 8-12 kbit/s narrowband embedded CELP encoder. To be consistent with ITU-T Rec. G.729, the signal sLB(n) will also be denoted s(n). The difference 104, dLB(n), between s(n) and the local synthesis 105, ŝenh(n), of the CELP encoder at 12 kbit/s is processed by the perceptual weighting filter WLB(z). The parameters of WLB(z) are derived from the quantized LP coefficients of the CELP encoder. Furthermore, the filter WLB(z) includes a gain compensation which guarantees the spectral continuity between the output 106, dLBw(n), of WLB(z) and the higher-band input signal 107, sHB(n). The weighted difference dLBw(n) is then transformed into the frequency domain by MDCT. The higher-band input signal 108, sHBfold(n), obtained after decimation and spectral folding by (−1)^n is pre-processed by a low-pass filter Hh2(z) with 3000 Hz cut-off frequency. The resulting signal sHB(n) is coded by the TDBWE encoder. The signal sHB(n) is also transformed into the frequency domain by MDCT. The two sets of MDCT coefficients 109, DLBw(k), and 110, SHB(k), are finally coded by the TDAC encoder. In addition, some parameters are transmitted by the frame erasure concealment (FEC) encoder in order to introduce parameter-level redundancy in the bitstream. This redundancy allows improving quality in the presence of erased superframes.
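The spectral folding step mentioned above multiplies sample n by (−1)^n, which mirrors the spectrum of the decimated signal so that a component at frequency f moves to fs/2 − f. The following Python sketch checks this on a pure tone. The sub-band sampling rate of 8000 Hz is an assumption inferred from the 16000 Hz input being split into two bands, and the names `spectral_fold` and `tone` are hypothetical.

```python
import math

FS = 8000  # assumed sampling rate of each decimated sub-band, in Hz


def spectral_fold(x):
    """Multiply sample n by (-1)^n, mirroring the spectrum about FS/4."""
    return [(-1) ** n * v for n, v in enumerate(x)]


def tone(freq_hz, length):
    """Cosine test tone at freq_hz, sampled at FS."""
    return [math.cos(2 * math.pi * freq_hz * n / FS) for n in range(length)]


# A 3000 Hz tone should become a 1000 Hz tone (FS/2 - 3000) after folding.
folded = spectral_fold(tone(3000, 160))
mirrored = tone(FS / 2 - 3000, 160)
max_diff = max(abs(a - b) for a, b in zip(folded, mirrored))
```

Here `max_diff` is at the level of floating-point rounding, since cos(a)·cos(πn) equals cos(πn − a) for integer n.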
TDBWE Encoder
The TDBWE encoder is illustrated in FIG. 2. The Time Domain Bandwidth Extension (TDBWE) encoder extracts a fairly coarse parametric description from the pre-processed and downsampled higher-band signal 201, sHB(n). This parametric description comprises time envelope 202 and frequency envelope 203 parameters. A summarized description of respective envelope computations and the parameter quantization scheme will be given later.
The 20 ms input speech superframe 201, sHB(n) is subdivided into 16 segments of length 1.25 ms each, i.e., each segment comprises 10 samples. The 16 time envelope parameters 202, Tenv(i), i=0, . . . , 15, are computed as logarithmic subframe energies:
$$T_{env}(i) = \frac{1}{2}\log_2\left(\frac{1}{10}\sum_{n=0}^{9} s_{HB}^{2}(n + i \cdot 10)\right), \quad i = 0, \ldots, 15 \qquad (1)$$
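As an illustration of Equation 1, the following Python sketch computes the 16 time envelope parameters from one 160-sample higher-band superframe. The function name `time_envelope` and the small energy floor that guards against taking the logarithm of zero are assumptions made for this sketch; the text above does not specify how silent segments are handled.

```python
import math

NUM_SEGMENTS = 16      # 1.25 ms segments per 20 ms superframe
SEGMENT_LEN = 10       # samples per segment
ENERGY_FLOOR = 1e-10   # assumed guard against log2(0); not from the source


def time_envelope(s_hb):
    """Return T_env(i), i = 0..15, as half the log2 of each segment's mean energy."""
    if len(s_hb) != NUM_SEGMENTS * SEGMENT_LEN:
        raise ValueError("expected one superframe of 160 samples")
    env = []
    for i in range(NUM_SEGMENTS):
        segment = s_hb[i * SEGMENT_LEN:(i + 1) * SEGMENT_LEN]
        mean_energy = sum(x * x for x in segment) / SEGMENT_LEN
        env.append(0.5 * math.log2(max(mean_energy, ENERGY_FLOOR)))
    return env


# A constant signal of amplitude 4 has mean energy 16 in every segment.
t_env = time_envelope([4.0] * 160)
```

Because of the factor 1/2, each parameter is the base-2 logarithm of the segment's RMS amplitude; in the example every `t_env[i]` equals 2.0.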
The TDBWE parameters Tenv(i), i=0, . . . , 15, are quantized by mean-removed split vector quantization. First, a mean time envelope 204 is calculated:
$$M_T = \frac{1}{16}\sum_{i=0}^{15} T_{env}(i) \qquad (2)$$
The mean value 204, MT, is then scalar quantized with 5 bits using uniform 3 dB steps in log domain. This quantization gives the quantized value 205, M̂T. The quantized mean is then subtracted:
$$T_{env}^{M}(i) = T_{env}(i) - \hat{M}_T, \quad i = 0, \ldots, 15 \qquad (3)$$
The mean-removed time envelope parameter set is split into two vectors of dimension 8:
$$T_{env,1} = \left(T_{env}^{M}(0), T_{env}^{M}(1), \ldots, T_{env}^{M}(7)\right) \quad \text{and} \quad T_{env,2} = \left(T_{env}^{M}(8), T_{env}^{M}(9), \ldots, T_{env}^{M}(15)\right) \qquad (4)$$
Finally, vector quantization using pre-trained quantization tables is applied. Note that the vectors Tenv,1 and Tenv,2 share the same vector quantization codebooks to reduce storage requirements. The codebooks (or quantization tables) for Tenv,1/Tenv,2 have been generated by modifying generalized Lloyd-Max centroids such that a minimal distance between two centroids is verified. The codebook modification procedure consists in rounding Lloyd-Max centroids on a rectangular grid with a step size of 6 dB in log domain.
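The mean-removed split vector quantization of Equations 2 to 4 can be sketched in Python as follows. Several details are assumptions made for illustration only: a 3 dB energy step is taken to be a step of 0.5 in the (1/2)·log2 domain (since 3 dB is roughly a factor of 2 in energy), the quantizer range starts at a hypothetical offset of 0.0, and the codebook passed to `nearest_codeword` is a toy table, not the pre-trained tables of the standard.

```python
MEAN_STEP = 0.5     # assumed: about 3 dB in the (1/2)*log2(energy) domain
MEAN_LEVELS = 32    # 5 bits


def quantize_mean(m_t, offset=0.0):
    """Uniform scalar quantizer; the offset (lowest level) is hypothetical."""
    index = round((m_t - offset) / MEAN_STEP)
    index = min(max(index, 0), MEAN_LEVELS - 1)
    return index, offset + index * MEAN_STEP


def mean_removed_split(t_env):
    """Apply Equations 2-4: mean, quantized-mean removal, split into 2 x 8."""
    m_t = sum(t_env) / 16
    index, m_hat = quantize_mean(m_t)
    residual = [t - m_hat for t in t_env]
    return index, m_hat, residual[:8], residual[8:]


def nearest_codeword(vector, codebook):
    """Index of the codebook entry with the smallest squared error."""
    def sq_err(k):
        return sum((a - b) ** 2 for a, b in zip(vector, codebook[k]))
    return min(range(len(codebook)), key=sq_err)


index, m_hat, t_env_1, t_env_2 = mean_removed_split([3.2] * 16)
toy_codebook = [[0.0] * 8, [0.25] * 8]   # illustrative only, shared by both halves
choice_1 = nearest_codeword(t_env_1, toy_codebook)
choice_2 = nearest_codeword(t_env_2, toy_codebook)
```

Using one codebook for both halves mirrors the storage-saving choice described above.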
For the computation of the 12 frequency envelope parameters 203, Fenv(j), j=0, . . . , 11, the signal 201, sHB(n), is windowed by a slightly asymmetric analysis window wF(n). The maximum of the window wF(n) is centered on the second 10 ms frame of the current superframe. The window wF(n) is constructed such that the frequency envelope computation has a lookahead of 16 samples (2 ms) and a lookback of 32 samples (4 ms). The windowed signal sHBw(n) is transformed by FFT. Finally, the frequency envelope parameter set is calculated as logarithmic weighted sub-band energies for 12 evenly spaced and equally wide overlapping sub-bands in the FFT domain. The j-th sub-band starts at the FFT bin of index 2j and spans a bandwidth of 3 FFT bins.
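The sub-band layout described above (start bin 2j, width 3 bins) can be made concrete with a short Python sketch. Only the bin indices are shown; the analysis window and the sub-band weights are not specified in this description, so they are left out. The function name `subband_bins` is hypothetical.

```python
NUM_SUBBANDS = 12
BINS_PER_SUBBAND = 3


def subband_bins(j):
    """FFT bin indices covered by the j-th frequency envelope sub-band."""
    if not 0 <= j < NUM_SUBBANDS:
        raise ValueError("j must be in 0..11")
    start = 2 * j
    return list(range(start, start + BINS_PER_SUBBAND))


layout = [subband_bins(j) for j in range(NUM_SUBBANDS)]
# Bins shared by each pair of neighbouring sub-bands.
overlaps = [sorted(set(a) & set(b)) for a, b in zip(layout, layout[1:])]
```

With a stride of 2 and a width of 3, each pair of adjacent sub-bands shares exactly one bin, and the 12 sub-bands together cover bins 0 to 24.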
G.729.1 Decoder
A functional diagram of the decoder is presented in FIG. 3. The specific case of frame erasure concealment is not considered in this figure. The decoding depends on the actual number of received layers or equivalently on the received bit rate.
If the received bit rate is:
    8 kbit/s (Layer 1): The core layer is decoded by the embedded CELP decoder to obtain 301, ŝLB(n)=ŝ(n). Then ŝLB(n) is postfiltered into 302, ŝLBpost(n), and post-processed by a high-pass filter (HPF) into 303, ŝLBqmf(n)=ŝLBhpf(n). The QMF synthesis filterbank defined by the filters G1(z) and G2(z) generates the output with a high-frequency synthesis 304, ŝHBqmf(n), set to zero.
    12 kbit/s (Layers 1 and 2): The core layer and narrowband enhancement layer are decoded by the embedded CELP decoder to obtain 301, ŝLB(n)=ŝenh(n), and ŝLB(n) is then postfiltered into 302, ŝLBpost(n), and high-pass filtered to obtain 303, ŝLBqmf(n)=ŝLBhpf(n). The QMF synthesis filterbank generates the output with a high-frequency synthesis 304, ŝHBqmf(n), set to zero.
    14 kbit/s (Layers 1 to 3): In addition to the narrowband CELP decoding and lower-band adaptive postfiltering, the TDBWE decoder produces a high-frequency synthesis 305, ŝHBbwe(n), which is then transformed into the frequency domain by MDCT so as to zero the frequency band above 3000 Hz in the higher-band spectrum 306, ŜHBbwe(k). The resulting spectrum 307, ŜHBpost(k), is transformed into the time domain by inverse MDCT and overlap-add before spectral folding by (−1)^n. In the QMF synthesis filterbank the reconstructed higher-band signal 304, ŝHBqmf(n), is combined with the respective lower-band signal 302, ŝLBqmf(n)=ŝLBpost(n), reconstructed at 12 kbit/s without high-pass filtering.
    Above 14 kbit/s (Layers 1 to 4+): In addition to the narrowband CELP and TDBWE decoding, the TDAC decoder reconstructs MDCT coefficients 308, D̂LBw(k), and 307, ŜHB(k), which correspond to the reconstructed weighted difference in the lower band (0-4000 Hz) and the reconstructed signal in the higher band (4000-7000 Hz). Note that in the higher band, the non-received sub-bands and the sub-bands with zero bit allocation in TDAC decoding are replaced by the level-adjusted sub-bands of ŜHBbwe(k).
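The dependence of the decoding path on the received bit rate, as listed above, can be summarized by a small Python sketch. The function name `active_decoder_stages` and the stage labels are hypothetical; the sketch only records which stages contribute at each rate and does not perform any decoding.

```python
def active_decoder_stages(bit_rate_kbps):
    """List the decoder stages that contribute at a given received bit rate."""
    if bit_rate_kbps < 8:
        raise ValueError("at least the 8 kbit/s core layer is required")
    stages = ["CELP core (Layer 1)"]
    if bit_rate_kbps >= 12:
        stages.append("CELP enhancement (Layer 2)")
    if bit_rate_kbps >= 14:
        stages.append("TDBWE (Layer 3)")
    if bit_rate_kbps > 14:
        stages.append("TDAC (Layers 4+)")
    return stages


def higher_band_is_zero(bit_rate_kbps):
    """At 8 and 12 kbit/s the high-frequency synthesis is set to zero."""
    return bit_rate_kbps < 14


stages_at_14 = active_decoder_stages(14)
stages_at_32 = active_decoder_stages(32)
```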
Both D̂LBw(k) and ŜHB(k) are transformed into the time domain by inverse MDCT and overlap-add. The lower-band signal 309, d̂LBw(n), is then processed by the inverse perceptual weighting filter WLB(z)^−1. To attenuate transform coding artifacts, pre/post-echoes are detected and reduced in both the lower- and higher-band signals 310, d̂LB(n), and 311, ŝHB(n). The lower-band synthesis ŝLB(n) is postfiltered, while the higher-band synthesis 312, ŝHBfold(n), is spectrally folded by (−1)^n. The signals ŝLBqmf(n)=ŝLBpost(n) and ŝHBqmf(n) are then combined and upsampled in the QMF synthesis filterbank.
TDBWE Decoder
FIG. 4 illustrates the concept of the TDBWE decoder module. The received TDBWE parameters are used to shape an artificially generated excitation signal 402, ŝHBexc(n), according to the desired time and frequency envelopes 408, T̂env(i), and 409, F̂env(j). This is followed by a time-domain post-processing procedure.
The quantized parameter set consists of the value M̂T and of the following vectors: T̂env,1, T̂env,2, F̂env,1, F̂env,2, and F̂env,3. The split vectors are defined by Equation 4. The quantized mean time envelope M̂T is used to reconstruct the time envelope and the frequency envelope parameters from the individual vector components, i.e.:
$$\hat{T}_{env}(i) = \hat{T}_{env}^{M}(i) + \hat{M}_T, \quad i = 0, \ldots, 15 \qquad (5)$$
and
$$\hat{F}_{env}(j) = \hat{F}_{env}^{M}(j) + \hat{M}_T, \quad j = 0, \ldots, 11 \qquad (6)$$
The TDBWE excitation signal 401, exc(n), is generated on a 5 ms subframe basis from parameters which are transmitted in Layers 1 and 2 of the bitstream. Specifically, the following parameters are used: the integer pitch lag T0=int(T1) or int(T2) depending on the subframe, the fractional pitch lag frac, the energy of the fixed codebook contributions
$$E_c = \sum_{n=0}^{39} \left(\hat{g}_c \cdot c(n) + \hat{g}_{enh} \cdot c'(n)\right)^2,$$
and the energy of the adaptive codebook contribution
$$E_p = \sum_{n=0}^{39} \left(\hat{g}_p \cdot v(n)\right)^2.$$
The parameters of the excitation generation are computed every 5 ms subframe. The excitation signal generation consists of the following steps:
    estimation of two gains gv and guv for the voiced and unvoiced contributions to the final excitation signal 401, exc(n);
    pitch lag post-processing;
    generation of the voiced contribution;
    generation of the unvoiced contribution; and
    low-pass filtering.
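The two subframe energies Ec and Ep defined above can be computed as in the following Python sketch. The function and argument names are hypothetical; the vectors c(n), c'(n) and v(n) are assumed to be 40-sample sequences, consistent with the summation limits n = 0, . . . , 39. How the gains gv and guv are derived from these energies is not stated in this description and is therefore not shown.

```python
SUBFRAME_LEN = 40  # samples per 5 ms subframe, from the summation limits


def fixed_codebook_energy(g_c, c, g_enh, c_enh):
    """E_c: energy of the combined, gain-scaled fixed codebook contributions."""
    if len(c) != SUBFRAME_LEN or len(c_enh) != SUBFRAME_LEN:
        raise ValueError("expected 40-sample codebook vectors")
    return sum((g_c * a + g_enh * b) ** 2 for a, b in zip(c, c_enh))


def adaptive_codebook_energy(g_p, v):
    """E_p: energy of the gain-scaled adaptive codebook contribution."""
    if len(v) != SUBFRAME_LEN:
        raise ValueError("expected a 40-sample adaptive codebook vector")
    return sum((g_p * x) ** 2 for x in v)


# Toy inputs: 40 * (0.5 * 1.0)^2 = 10 and 40 * (0.5 * 2.0)^2 = 40.
e_c = fixed_codebook_energy(0.5, [1.0] * 40, 1.0, [0.0] * 40)
e_p = adaptive_codebook_energy(0.5, [2.0] * 40)
```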
The shaping of the time envelope of the excitation signal 402, sHBexc(n), utilizes the decoded time envelope parameters 408, T̂env(i), with i=0, . . . , 15 to obtain a signal 403, ŝHBT(n), with a time envelope which is near-identical to the time envelope of the encoder side higher-band signal 201, sHB(n). This is achieved by simple scalar multiplication:
$$\hat{s}_{HB}^{T}(n) = g_T(n) \cdot s_{HB}^{exc}(n), \quad n = 0, \ldots, 159 \qquad (7)$$
In order to determine the gain function gT(n), the excitation signal 402, sHBexc(n), is segmented and analyzed in the same manner as the parameter extraction in the encoder. The obtained analysis results are, again, time envelope parameters T̃env(i) with i=0, . . . , 15. They describe the observed time envelope of sHBexc(n). Then a preliminary gain factor is calculated:
$$g'_T(i) = 2^{\hat{T}_{env}(i) - \tilde{T}_{env}(i)}, \quad i = 0, \ldots, 15 \qquad (8)$$
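Because the time envelope parameters are base-2 logarithms of segment RMS amplitudes, the preliminary gain factor of Equation 8 is simply the ratio between the desired and the observed RMS amplitude of each segment. A minimal Python sketch, with the hypothetical function name `preliminary_gains`:

```python
def preliminary_gains(t_decoded, t_observed):
    """Equation 8: g'_T(i) = 2 ** (decoded envelope - observed envelope)."""
    if len(t_decoded) != len(t_observed):
        raise ValueError("both envelopes must have the same number of segments")
    return [2.0 ** (td - to) for td, to in zip(t_decoded, t_observed)]


# A segment that should be 2 log2-units louder needs a gain of 4;
# a segment that should be 1 log2-unit quieter needs a gain of 0.5.
gains = preliminary_gains([3.0, 1.0], [1.0, 2.0])
```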
For each signal segment with index i=0, . . . , 15, these gain factors are interpolated using a “flat-top” Hanning window
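$$w_t(n) = \begin{cases} \dfrac{1}{2}\left[1 - \cos\left((n+1) \cdot \dfrac{\pi}{6}\right)\right], & n = 0, \ldots, 4 \\ 1, & n = 5, \ldots, 9 \\ \dfrac{1}{2}\left[1 - \cos\left((n+9) \cdot \dfrac{\pi}{6}\right)\right], & n = 10, \ldots, 14 \end{cases} \qquad (9)$$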
This interpolation procedure finally yields the desired gain function:
$$g_T(n + i \cdot 10) = \begin{cases} w_t(n) \cdot g'_T(i) + w_t(n+10) \cdot g'_T(i-1), & n = 0, \ldots, 4 \\ w_t(n) \cdot g'_T(i), & n = 5, \ldots, 9 \end{cases} \qquad (10)$$
where g′T(−1) is defined as the memorized gain factor g′T(15) from the last 1.25 ms segment of the preceding superframe.
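The interpolation of Equations 9 and 10 can be sketched in Python as follows. The function names are hypothetical, and `g_prev` stands for the memorized gain factor of the last segment of the preceding superframe. The rising and falling window halves sum to one, so the first five samples of each segment cross-fade from the previous gain factor to the current one.

```python
import math


def flat_top_window(n):
    """Equation 9: 'flat-top' Hanning window defined for n = 0..14."""
    if 0 <= n <= 4:
        return 0.5 * (1.0 - math.cos((n + 1) * math.pi / 6))
    if 5 <= n <= 9:
        return 1.0
    if 10 <= n <= 14:
        return 0.5 * (1.0 - math.cos((n + 9) * math.pi / 6))
    raise ValueError("window is defined for n = 0..14 only")


def gain_function(g_prelim, g_prev):
    """Equation 10: per-sample gains g_T(n), n = 0..159, from 16 segment gains."""
    gains = []
    for i in range(16):
        previous = g_prelim[i - 1] if i > 0 else g_prev
        for n in range(10):
            if n <= 4:   # cross-fade from the previous segment's gain
                gains.append(flat_top_window(n) * g_prelim[i]
                             + flat_top_window(n + 10) * previous)
            else:        # flat part of the window
                gains.append(flat_top_window(n) * g_prelim[i])
    return gains


flat = gain_function([2.0] * 16, 2.0)              # constant gain stays constant
step = gain_function([1.0] * 8 + [3.0] * 8, 1.0)   # smooth transition at n = 80
```

The shaped signal of Equation 7 is then obtained by multiplying each excitation sample by the corresponding entry of the returned list.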
The signal 404, ŝHBF(n), is obtained by shaping the excitation signal sHBexc(n) (generated from parameters estimated in the lower band by the CELP decoder) according to the desired time and frequency envelopes. There is in general no coupling between this excitation and the related envelope shapes T̂env(i) and F̂env(j). As a result, some clicks may be present in the signal ŝHBF(n). To attenuate these artifacts, an adaptive amplitude compression is applied to ŝHBF(n). Each sample of ŝHBF(n) in the i-th 1.25 ms segment is compared to the decoded time envelope T̂env(i), and the amplitude of ŝHBF(n) is compressed in order to attenuate large deviations from this envelope. The TDBWE synthesis 405, ŝHBbwe(n), is transformed to ŜHBbwe(k) by MDCT. This spectrum is used by the TDAC decoder to extrapolate missing sub-bands.