In modern audio/speech signal compression technology, the concept of BandWidth Extension (BWE) is widely used. The same or similar technology is sometimes also called High Band Extension (HBE), SubBand Replica (SBR), or Spectral Band Replication (SBR). Although the names differ, they all refer to encoding/decoding some frequency sub-bands (usually the high bands) with a small bit-rate budget (or even a zero bit-rate budget), or at a significantly lower bit rate than normal encoding/decoding approaches. Low bit rate coding sometimes causes low quality. If a few bits can improve the quality, they are worth spending.
The frequency domain can be defined as the FFT-transformed domain or as the Modified Discrete Cosine Transform (MDCT) domain. A well-known BWE algorithm can be found in the ITU-T G.729.1 standard, in which it is named Time Domain Bandwidth Extension (TDBWE).
General Description of ITU G.729.1
ITU-T G.729.1 is also called a G.729EV coder, which is an 8-32 kbit/s scalable wideband (50 Hz-7,000 Hz) extension of ITU-T Rec. G.729. By default, the encoder input and decoder output are sampled at 16,000 Hz. The bitstream produced by the encoder is scalable and consists of 12 embedded layers, which will be referred to as Layers 1 to 12. Layer 1 is the core layer corresponding to a bit rate of 8 kbit/s. This layer is compliant with G.729 bitstream, which makes G.729EV interoperable with G.729. Layer 2 is a narrowband enhancement layer adding 4 kbit/s, while Layers 3 to 12 are wideband enhancement layers adding 20 kbit/s with steps of 2 kbit/s.
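The embedded layer structure described above can be tabulated with a short sketch. The function name `layer_bitrates_kbps` is hypothetical and introduced only for illustration; the rates follow directly from the layer description (8 kbit/s core, +4 kbit/s for Layer 2, then +2 kbit/s per layer for Layers 3 to 12).

```python
def layer_bitrates_kbps():
    """Cumulative bit rate (kbit/s) when decoding Layers 1..k, k = 1..12."""
    rates = [8, 12]  # Layer 1 (G.729-compatible core), Layer 2 (+4 kbit/s)
    # Layers 3..12: ten wideband layers, each adding 2 kbit/s
    rates += [12 + 2 * k for k in range(1, 11)]
    return rates


rates = layer_bitrates_kbps()
for layer, rate in enumerate(rates, start=1):
    print(f"Layer {layer:2d}: {rate} kbit/s")
```

Truncating the bitstream after any layer yields a decodable stream at the corresponding rate, which is what makes the codec scalable.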
The G.729EV coder is designed to operate with a digital signal sampled at 16,000 Hz followed by a conversion to 16-bit linear PCM before the converted signal is inputted to the encoder. However, the 8,000 Hz input sampling frequency is also supported. Similarly, the format of the decoder output is 16-bit linear PCM with a sampling frequency of 8,000 or 16,000 Hz. Other input/output characteristics are converted to 16-bit linear PCM with 8,000 or 16,000 Hz sampling before encoding, or from 16-bit linear PCM to the appropriate format after decoding. The bitstream from the encoder to the decoder is defined within this Recommendation.
The G.729EV coder is built upon a three-stage structure: embedded Code-Excited Linear-Prediction (CELP) coding, Time-Domain Bandwidth Extension (TDBWE), and predictive transform coding that is also referred to as Time-Domain Aliasing Cancellation (TDAC). The embedded CELP stage generates Layers 1 and 2, which yield a narrowband synthesis (50 Hz-4,000 Hz) at 8 kbit/s and 12 kbit/s. The TDBWE stage generates Layer 3 and allows producing a wideband output (50 Hz-7,000 Hz) at 14 kbit/s. The TDAC stage operates in the MDCT domain and generates Layers 4 to 12 to improve quality from 14 kbit/s to 32 kbit/s. TDAC coding represents the weighted CELP coding error signal in the 50 Hz-4,000 Hz band and the input signal in the 4,000 Hz-7,000 Hz band.
The G.729EV coder operates on 20 ms frames. However, the embedded CELP coding stage operates on 10 ms frames, as in G.729. As a result, two 10 ms CELP frames are processed per 20 ms frame. In the following, to be consistent with the context of ITU-T Rec. G.729, the 20 ms frames used by G.729EV will be referred to as superframes, whereas the 10 ms frames and the 5 ms subframes involved in the CELP processing will be called frames and subframes, respectively.
TDBWE Encoder
The TDBWE encoder is illustrated in FIG. 1. The TDBWE encoder extracts a fairly coarse parametric description from the pre-processed and down-sampled higher-band signal 101, sHB(n). This parametric description comprises time envelope 102 and frequency envelope 103 parameters. The 20 ms input speech superframe sHB(n) (8 kHz sampling frequency) is subdivided into 16 segments of length 1.25 ms each, i.e., with each segment comprising 10 samples. The 16 time envelope parameters 102, Tenv(i), i=0, . . . , 15, are computed as logarithmic subframe energies before the quantization is performed. For the computation of the 12 frequency envelope parameters 103, Fenv(j), j=0, . . . , 11, the signal 101, sHB(n), is windowed by a slightly asymmetric analysis window. This window is 128 taps long (16 ms) and is constructed from the rising slope of a 144-tap Hanning window, followed by the falling slope of a 112-tap Hanning window.
The maximum of the window is centered on the second 10 ms frame of the current superframe. The window is constructed such that the frequency envelope computation has a lookahead of 16 samples (2 ms) and a lookback of 32 samples (4 ms). The windowed signal is transformed by FFT. The even bins of the full-length 128-tap FFT are computed using a polyphase structure. Finally, the frequency envelope parameter set is calculated as logarithmic weighted sub-band energies for 12 evenly spaced and equally wide overlapping sub-bands in the FFT domain.
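The time envelope segmentation described for the encoder can be sketched as follows. This is a minimal illustration, not the reference implementation: the exact form of the logarithmic energy (here one half of the base-2 logarithm of the mean segment energy), the small energy floor, and the name `time_envelope` are assumptions; quantization is omitted.

```python
import math

SEGMENTS = 16  # segments per 20 ms superframe
SEG_LEN = 10   # samples per 1.25 ms segment at 8 kHz


def time_envelope(s_hb, floor=1e-12):
    """Return 16 logarithmic segment energies for one 160-sample superframe.

    Assumed form: T_env(i) = 0.5 * log2(mean energy of segment i).
    The floor (an assumption) avoids log2(0) for silent segments.
    """
    if len(s_hb) != SEGMENTS * SEG_LEN:
        raise ValueError("expected a 160-sample superframe")
    env = []
    for i in range(SEGMENTS):
        seg = s_hb[i * SEG_LEN:(i + 1) * SEG_LEN]
        mean_energy = sum(x * x for x in seg) / SEG_LEN
        env.append(0.5 * math.log2(max(mean_energy, floor)))
    return env
```

With this assumed form, a constant signal of amplitude 4 has a mean energy of 16 in every segment, so each envelope value equals 2.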
TDBWE Decoder
FIG. 2 illustrates the concept of the TDBWE decoder module. The received TDBWE parameters, which are computed by the parameter extraction procedure, are used to shape an artificially generated excitation signal 202, ŝHBexc(n), according to the desired time and frequency envelopes T̂env(i) and F̂env(j). This is followed by a time-domain post-processing procedure.
The TDBWE excitation signal 201, exc(n), is generated for each 5 ms subframe based on parameters which are transmitted in Layers 1 and 2 of the bitstream. Specifically, the following parameters are used: the integer pitch lag T0=int(T1) or int(T2) depending on the subframe, the fractional pitch lag frac, the energy Ec of the fixed codebook contributions, and the energy Ep of the adaptive codebook contribution. Energy Ep is mathematically expressed as
$$E_p = \sum_{n=0}^{39} \left( \hat{g}_p \cdot v(n) \right)^2,$$
while energy Ec is expressed as
$$E_c = \sum_{n=0}^{39} \left( \hat{g}_c \cdot c(n) + \hat{g}_{enh} \cdot c'(n) \right)^2.$$
A detailed description can be found in the ITU-T G.729.1 Recommendation.
The parameters of the excitation generation are computed every 5 ms subframe. The excitation signal generation consists of the following steps:
estimation of two gains gv and guv for the voiced and unvoiced contributions to the final excitation signal exc(n);
pitch lag post-processing;
generation of the voiced contribution;
generation of the unvoiced contribution; and
low-pass filtering.
In G.729.1, TDBWE is used to code the wideband signal from 4 kHz to 7 kHz. The narrowband (NB) signal from 0 to 4 kHz is coded with the G.729 CELP coder, in which the excitation consists of an adaptive codebook contribution and a fixed codebook contribution. The adaptive codebook contribution comes from the periodicity of voiced speech. The fixed codebook contributes the unpredictable portion. The ratio ξ of the energies of the adaptive and fixed codebook excitations (including the enhancement codebook) is computed for each subframe as:
$$\xi = \frac{E_p}{E_c}. \tag{1}$$
In order to reduce this ratio ξ in case of unvoiced sounds, a “Wiener filter” characteristic is applied:
$$\xi_{post} = \xi \cdot \frac{\xi}{1 + \xi}. \tag{2}$$
This leads to more consistent unvoiced sounds. The gains for the voiced and unvoiced contributions of exc(n) are determined using the following procedure. An intermediate voiced gain g′v is calculated by:
$$g'_v = \sqrt{\frac{\xi_{post}}{1 + \xi_{post}}}, \tag{3}$$
which is slightly smoothed to obtain the final voiced gain gv:
$$g_v = \sqrt{\frac{1}{2} \left( g'^{2}_{v} + g'^{2}_{v,old} \right)}, \tag{4}$$
where g′v,old is the value of g′v of the preceding subframe.
To satisfy the constraint $g_v^2 + g_{uv}^2 = 1$, the unvoiced gain is represented as:
$$g_{uv} = \sqrt{1 - g_v^2}. \tag{5}$$
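The gain estimation of Equations (1) to (5) can be sketched for a single subframe as follows. The function name `tdbwe_gains` and the guard against a zero fixed-codebook energy are assumptions added for illustration; the square roots follow from the unit-energy constraint on the voiced and unvoiced gains.

```python
import math


def tdbwe_gains(e_p, e_c, g_v_prime_old, tiny=1e-12):
    """Voiced/unvoiced gains for one 5 ms subframe.

    e_p: adaptive codebook energy, e_c: fixed codebook energy,
    g_v_prime_old: intermediate voiced gain of the preceding subframe.
    Returns (g_v, g_uv, g_v_prime); g_v_prime is the state for the next call.
    """
    xi = e_p / max(e_c, tiny)                         # Eq. (1); guard is an assumption
    xi_post = xi * xi / (1.0 + xi)                    # Eq. (2), "Wiener filter" shape
    g_v_prime = math.sqrt(xi_post / (1.0 + xi_post))  # Eq. (3)
    g_v = math.sqrt(0.5 * (g_v_prime ** 2 + g_v_prime_old ** 2))  # Eq. (4)
    g_uv = math.sqrt(max(0.0, 1.0 - g_v ** 2))        # Eq. (5)
    return g_v, g_uv, g_v_prime
```

For equal energies (ξ = 1) and a zero previous gain, ξpost is 0.5, the squared intermediate gain is 1/3, and the squared voiced gain is 1/6, so most of the excitation energy is assigned to the unvoiced part.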
The generation of a consistent pitch structure within the excitation signal exc(n) requires a good estimate of the fundamental pitch lag t0 of the speech production process. Within Layer 1 of the bitstream, the integer and fractional pitch lag values T0 and frac are available for the four 5 ms subframes of the current superframe. For each subframe, the estimation of t0 is based on these parameters.
The aim of the G.729 encoder-side pitch search procedure is to find the pitch lag that minimizes the power of the LTP residual signal. That is, the LTP pitch lag is not necessarily identical with t0, which is a requirement for the concise reproduction of voiced speech components. The most typical deviations are pitch-doubling and pitch-halving errors, i.e., the frequency corresponding to the LTP lag is a half or double that of the original fundamental speech frequency. In particular, pitch-doubling (or tripling, etc.) errors are preferably avoided. Thus, the following post-processing of the LTP lag information is used. First, the LTP pitch lag for an oversampled time-scale is reconstructed from T0 and frac, and a bandwidth expansion factor of 2 is considered:
$$t_{LTP} = 2 \cdot (3 \cdot T_0 + frac). \tag{6}$$
The (integer) factor between the currently observed LTP lag tLTP and the post-processed pitch lag of the preceding subframe tpost,old (see Equation 9) is calculated as:
$$f = \mathrm{int}\left( \frac{t_{LTP}}{t_{post,old}} + 0.5 \right). \tag{7}$$
If the factor f falls into the range 2, . . . , 4, a relative error is evaluated as:
$$e = 1 - \frac{t_{LTP}}{f \cdot t_{post,old}}. \tag{8}$$
If the magnitude of this relative error is below a threshold ε=0.1, it is assumed that the current LTP lag is the result of a beginning pitch-doubling (-tripling, etc.) error phase. Thus, the pitch lag is corrected by dividing by the integer factor f, thereby producing a continuous pitch lag behavior with respect to the previous pitch lags:
$$t_{post} = \begin{cases} \mathrm{int}\left( \dfrac{t_{LTP}}{f} + 0.5 \right), & |e| < \varepsilon,\ f > 1,\ f < 5 \\ t_{LTP}, & \text{otherwise,} \end{cases} \tag{9}$$
which is further smoothed as:
$$t_p = \frac{1}{2} \cdot \left( t_{post,old} + t_{post} \right). \tag{10}$$
Note that this moving average leads to a virtual precision enhancement from a resolution of ⅓ to ⅙ of a sample. Finally, the post-processed pitch lag tp is decomposed into integer and fractional parts:
$$t_{0,int} = \mathrm{int}\left( \frac{t_p}{6} \right) \quad \text{and} \quad t_{0,frac} = t_p - 6 \cdot t_{0,int}. \tag{11}$$
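The pitch lag post-processing of Equations (6) to (11) can be sketched as one function per subframe. The name `postprocess_pitch_lag` is hypothetical, and two details are assumptions made for illustration: the smoothed lag of Equation (10) is kept as an integer by truncating the average, and the previous post-processed lag is assumed to be positive.

```python
def postprocess_pitch_lag(t0, frac, t_post_old, eps=0.1):
    """Return (t0_int, t0_frac, t_post) for one subframe.

    t0, frac: integer and fractional LTP lag from the bitstream
    (frac in units of 1/3 sample). t_post_old: post-processed lag of the
    preceding subframe, in units of 1/6 sample (assumed > 0).
    """
    t_ltp = 2 * (3 * t0 + frac)            # Eq. (6): lag in 1/6-sample units
    f = int(t_ltp / t_post_old + 0.5)      # Eq. (7): rounded lag ratio
    t_post = t_ltp
    if 2 <= f <= 4:
        e = 1.0 - t_ltp / (f * t_post_old)  # Eq. (8): relative error
        if abs(e) < eps:
            t_post = int(t_ltp / f + 0.5)   # Eq. (9): undo doubling/tripling
    t_p = (t_post_old + t_post) // 2        # Eq. (10); integer average is an assumption
    t0_int = t_p // 6                       # Eq. (11)
    t0_frac = t_p - 6 * t0_int
    return t0_int, t0_frac, t_post
```

For example, if the previous lag was 120 (20 samples) and the bitstream suddenly reports T0 = 40, the factor f is 2 with zero relative error, so the lag is divided back to 20 samples and the pitch track stays continuous.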
The voiced components 206, sexc,v(n), of the TDBWE excitation signal are represented as shaped and weighted glottal pulses. The voiced components 206 sexc,v(n) are thus produced by overlap-add of single pulse contributions:
$$s_{exc,v}(n) = \sum_{p} g_{Pulse}^{[p]} \times P_{n_{Pulse,frac}^{[p]}}\left( n - n_{Pulse,int}^{[p]} \right), \tag{12}$$
where $n_{Pulse,int}^{[p]}$ is a pulse position, $P_{n_{Pulse,frac}^{[p]}}\left(n - n_{Pulse,int}^{[p]}\right)$ is the pulse shape, and $g_{Pulse}^{[p]}$ is a gain factor for each pulse. These parameters are derived in the following. The post-processed pitch lag parameters t0,int and t0,frac determine the pulse spacing. Accordingly, the pulse positions may be expressed as:
$$n_{Pulse,int}^{[p]} = n_{Pulse,int}^{[p-1]} + t_{0,int} + \mathrm{int}\left( \frac{n_{Pulse,frac}^{[p-1]} + t_{0,frac}}{6} \right), \tag{13}$$
where p is the pulse counter, i.e., $n_{Pulse,int}^{[p]}$ is the (integer) position of the current pulse and $n_{Pulse,int}^{[p-1]}$ is the (integer) position of the previous pulse.
The fractional part of the pulse position may be expressed as:
$$n_{Pulse,frac}^{[p]} = n_{Pulse,frac}^{[p-1]} + t_{0,frac} - 6 \cdot \mathrm{int}\left( \frac{n_{Pulse,frac}^{[p-1]} + t_{0,frac}}{6} \right). \tag{14}$$
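The pulse position recursion of Equations (13) and (14) behaves like an addition with carry in units of 1/6 sample. The sketch below illustrates it; the function names and the starting position (0, 0) used in the example are assumptions, and the fractional parts are assumed to be non-negative integers in the range 0 to 5.

```python
def next_pulse_position(n_int_prev, n_frac_prev, t0_int, t0_frac):
    """Advance one pulse: return (n_int, n_frac) per Eqs. (13) and (14)."""
    carry = int((n_frac_prev + t0_frac) / 6)    # whole samples carried over
    n_int = n_int_prev + t0_int + carry         # Eq. (13)
    n_frac = n_frac_prev + t0_frac - 6 * carry  # Eq. (14), stays in 0..5
    return n_int, n_frac


def pulse_positions(t0_int, t0_frac, count, start=(0, 0)):
    """List the next `count` pulse positions after `start` (hypothetical helper)."""
    positions = []
    n_int, n_frac = start
    for _ in range(count):
        n_int, n_frac = next_pulse_position(n_int, n_frac, t0_int, t0_frac)
        positions.append((n_int, n_frac))
    return positions
```

With a lag of 20.5 samples (t0,int = 20, t0,frac = 3), the pulses land at 20 + 3/6, 41 + 0/6, and 61 + 3/6 samples, so the average spacing matches the fractional lag even though each integer position is whole.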
The fractional part of the pulse position serves as an index for the pulse shape selection. The prototype pulse shapes Pi(n) with i=0, . . . , 5 and n=0, . . . , 56 are taken from a lookup table as plotted in FIG. 3. These pulse shapes are designed such that a certain spectral shaping, for example, a smooth increase of the attenuation of the voiced excitation components towards higher frequencies, is incorporated and the full sub-sample resolution of the pitch lag information is utilized. Further, the crest factor of the excitation signal is significantly reduced and an improved subjective quality is obtained.
The gain factor gPulse[p] for the individual pulses is derived from the voiced gain parameter gv and from the pitch lag parameters:
$$g_{Pulse}^{[p]} = \left( 2 \cdot \mathrm{even}\left( n_{Pulse,int}^{[p]} \right) - 1 \right) \cdot g_v \cdot \sqrt{6\, t_{0,int} + t_{0,frac}}. \tag{15}$$
Therefore, it is ensured that increasing pulse spacing does not result in a decrease in the contained energy. The function even( ) returns 1 if the argument is an even integer number, and returns 0 otherwise.
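The pulse gain of Equation (15) can be sketched as below; the function names are hypothetical. The sign alternates with the parity of the integer pulse position, and the square-root term scales each pulse with the pitch lag expressed in 1/6-sample units.

```python
import math


def even(n):
    """Return 1 if n is an even integer, 0 otherwise."""
    return 1 if n % 2 == 0 else 0


def pulse_gain(n_pulse_int, g_v, t0_int, t0_frac):
    """Gain of one glottal pulse per Eq. (15)."""
    sign = 2 * even(n_pulse_int) - 1  # +1 at even positions, -1 at odd ones
    return sign * g_v * math.sqrt(6 * t0_int + t0_frac)
```

Because the pulse amplitude grows with the square root of the spacing, the energy per unit time stays roughly constant when the pitch lag increases, which is the property stated above.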
The unvoiced contribution 207, sexc,uv(n), is produced using the scaled output of a white noise generator:
$$s_{exc,uv}(n) = g_{uv} \cdot \mathrm{random}(n), \quad n = 0, \ldots, 39. \tag{16}$$
Having the voiced and unvoiced contributions sexc,v(n) and sexc,uv(n), the final excitation signal 202, sHBexc(n), is obtained by low-pass filtering of exc(n)=sexc,v(n)+sexc,uv(n).
The low-pass filter has a cut-off frequency of 3,000 Hz and its implementation is identical with the pre-processing low-pass filter for the high band signal.
The shaping of the time envelope of the excitation signal sHBexc(n) utilizes the decoded time envelope parameters T̂env(i) with i=0, . . . , 15 to obtain a signal 203, ŝHBT(n), with a time envelope which is nearly identical to the time envelope of the encoder side HB signal sHB(n). This is achieved by a simple scalar multiplication of a gain function gT(n) with the excitation signal sHBexc(n). In order to determine the gain function gT(n), the excitation signal sHBexc(n) is segmented and analyzed in the same manner as described for the parameter extraction in the encoder. The obtained analysis results from sHBexc(n) are, again, time envelope parameters T̃env(i) with i=0, . . . , 15. They describe the observed time envelope of sHBexc(n). Then, a preliminary gain factor is calculated by comparing T̂env(i) with T̃env(i). For each signal segment with index i=0, . . . , 15, these gain factors are interpolated using a “flat-top” Hanning window. This interpolation procedure finally yields the desired gain function.
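The comparison of decoded and observed time envelopes can be illustrated with a simplified sketch. Several points are assumptions and not taken from the text above: the envelope parameters are treated as one half of the base-2 logarithm of the mean segment energy, the preliminary gain is taken as 2 raised to the envelope difference, and the "flat-top" Hanning interpolation is replaced by a piecewise-constant gain per segment. The function names are hypothetical.

```python
import math

SEG_LEN = 10   # samples per 1.25 ms segment at 8 kHz
SEGMENTS = 16  # segments per 20 ms superframe


def observed_envelope(x, floor=1e-12):
    """Assumed envelope form: 0.5 * log2(mean energy) per segment."""
    env = []
    for i in range(SEGMENTS):
        seg = x[i * SEG_LEN:(i + 1) * SEG_LEN]
        mean_energy = sum(v * v for v in seg) / SEG_LEN
        env.append(0.5 * math.log2(max(mean_energy, floor)))
    return env


def shape_time_envelope(s_exc, t_env_decoded):
    """Scale each segment so its observed envelope matches the decoded one.

    Simplification (assumption): constant gain per segment, no
    window-based interpolation between neighboring segments.
    """
    t_obs = observed_envelope(s_exc)
    out = []
    for i in range(SEGMENTS):
        gain = 2.0 ** (t_env_decoded[i] - t_obs[i])  # preliminary gain factor
        out.extend(gain * v for v in s_exc[i * SEG_LEN:(i + 1) * SEG_LEN])
    return out
```

Interpolating the gains between segments, as the text describes, avoids the gain steps at segment boundaries that this piecewise-constant version would produce.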
The decoded frequency envelope parameters F̂env(j) with j=0, . . . , 11 are representative for the second 10 ms frame within the 20 ms superframe. The first 10 ms frame is covered by parameter interpolation between the current parameter set and the parameter set from the preceding superframe. The signal 203, ŝHBT(n), is analyzed twice per superframe. This is done for the first (l=1) and for the second (l=2) 10 ms frame within the current superframe and yields two observed frequency envelope parameter sets F̃env,l(j) with j=0, . . . , 11 and frame index l=1, 2. Now, a correction gain factor per sub-band is determined for the first frame and for the second frame by comparing the decoded frequency envelope parameters F̂env(j) with the observed frequency envelope parameter sets F̃env,l(j). These gains control the channels of a filterbank equalizer. The filterbank equalizer is designed such that its individual channels match the sub-band division. It is defined by its filter impulse responses and a complementary high-pass contribution.
The signal 204, ŝHBF(n), is obtained by shaping both the desired time and frequency envelopes on the excitation signal sHBexc(n) (generated from parameters estimated in the lower band by the CELP decoder). There is in general no coupling between this excitation and the related envelope shapes T̂env(i) and F̂env(j). As a result, some clicks may occur in the signal ŝHBF(n). To attenuate these artifacts, an adaptive amplitude compression is applied to ŝHBF(n). Each sample of ŝHBF(n) of the i-th 1.25 ms segment is compared to the decoded time envelope T̂env(i), and the amplitude of ŝHBF(n) is compressed in order to attenuate large deviations from this envelope. The signal after this post-processing is denoted 205, ŝHBbwe(n).