Transform coding in the frequency domain has been widely used in various ITU-T, MPEG, and 3GPP standards. If the bit rate is high enough, spectral subbands are often coded with some kind of vector quantization (VQ) approach; if the bit rate is very low, the concept of BandWidth Extension (BWE) can also be used. The VQ approach gives good quality at the cost of a high bit rate, while the BWE approach requires a very low bit rate but its quality may not be adequately stable.
Concepts similar to BWE include High Band Extension (HBE), SubBand Replica, Spectral Band Replication (SBR) and High Frequency Reconstruction (HFR). Two examples of prior art BWE are Time Domain Bandwidth Extension (TDBWE), which is used in ITU-T G.729.1, and SBR, which is employed by the MPEG-4 audio coding standard. TDBWE works with FFT transformation and SBR usually operates in the MDCT (Modified Discrete Cosine Transform) domain.
General Description of ITU G.729.1
ITU G.729.1, also called the G.729EV coder, is an 8-32 kbit/s scalable wideband (50-7000 Hz) extension of ITU-T Rec. G.729. By default, the encoder input and decoder output are sampled at 16,000 Hz. The bitstream produced by the encoder is scalable and has 12 embedded layers, which will be referred to as Layers 1 to 12. Layer 1 is the core layer corresponding to a bit rate of 8 kbit/s. This layer is compliant with the G.729 bitstream, which makes G.729EV interoperable with G.729. Layer 2 is a narrowband enhancement layer adding 4 kbit/s, while Layers 3 to 12 are wideband enhancement layers adding a total of 20 kbit/s in steps of 2 kbit/s.
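The layer structure described above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name is hypothetical and only the bit rates stated in the text are used.

```python
def g7291_layer_rates_kbps():
    """Cumulative bit rate (kbit/s) reached after decoding Layers 1..12."""
    rates = [8, 12]              # Layer 1 (core) and Layer 2 (adds 4 kbit/s)
    for _ in range(3, 13):       # Layers 3..12 each add 2 kbit/s
        rates.append(rates[-1] + 2)
    return rates


rates = g7291_layer_rates_kbps()
for layer, rate in enumerate(rates, start=1):
    print(f"Layer {layer:2d}: {rate} kbit/s")
```

The ten wideband layers add 10 x 2 = 20 kbit/s on top of the 12 kbit/s narrowband rate, which gives the 32 kbit/s maximum.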
This coder is designed to operate with a digital signal sampled at 16,000 Hz followed by conversion to 16-bit linear PCM for the input to the encoder. However, the 8,000 Hz input sampling frequency is also supported. Similarly, the format of the decoder output is 16-bit linear PCM with a sampling frequency of 8,000 or 16,000 Hz. Other input/output characteristics are generally converted to 16-bit linear PCM with 8,000 or 16,000 Hz sampling before encoding, or from 16-bit linear PCM to the appropriate format after decoding.
The G.729EV coder is built upon a three-stage structure: embedded Code-Excited Linear-Prediction (CELP) coding, Time-Domain Bandwidth Extension (TDBWE) and predictive transform coding that will be referred to as Time-Domain Aliasing Cancellation (TDAC). The embedded CELP stage generates Layers 1 and 2 which yield a narrowband synthesis (50-4,000 Hz) at 8 and 12 kbit/s. The TDBWE stage generates Layer 3 and allows producing a wideband output (50-7000 Hz) at 14 kbit/s. The TDAC stage operates in the Modified Discrete Cosine Transform (MDCT) domain and generates Layers 4 to 12 to improve quality from 14 to 32 kbit/s. TDAC coding represents jointly the weighted CELP coding error signal in the 50-4000 Hz band and the input signal in the 4000-7000 Hz band.
The G.729EV coder operates on 20 ms frames. However, the embedded CELP coding stage operates on 10 ms frames, like G.729. As a result two 10 ms CELP frames are processed per 20 ms frame. The 20 ms frames used by G.729EV are referred to as superframes, whereas the 10 ms frames and the 5 ms subframes involved in the CELP processing are referred to as frames and subframes.
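As a quick illustration of the frame hierarchy, the sketch below converts the durations named above into sample counts. The helper name is hypothetical; the 8,000 Hz rate for the CELP stage follows from the decimated lower band.

```python
def samples(duration_ms, fs_hz):
    """Number of samples in duration_ms milliseconds at sampling rate fs_hz."""
    return duration_ms * fs_hz // 1000


superframe_wb = samples(20, 16000)  # wideband input superframe
frame_nb = samples(10, 8000)        # CELP frame in the decimated lower band
subframe_nb = samples(5, 8000)      # CELP subframe in the decimated lower band

print(superframe_wb, frame_nb, subframe_nb)
```

The 40-sample subframe is the range n = 0, ..., 39 that appears in the energy sums used later by the TDBWE decoder.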
G.729.1 Encoder
A functional diagram of the encoder part is presented in FIG. 1. The encoder operates on 20 ms input superframes. By default, the input signal 101, sWB(n), is sampled at 16,000 Hz; therefore, the input superframes are 320 samples long. Input signal sWB(n) is first split into two sub-bands using a QMF filter bank defined by the filters H1(z) and H2(z). Lower-band input signal 102, sLBqmf(n), obtained after decimation, is pre-processed by a high-pass filter Hh1(z) with a 50 Hz cut-off frequency. The resulting signal 103, sLB(n), is coded by the 8-12 kbit/s narrowband embedded CELP encoder. To be consistent with ITU-T Rec. G.729, the signal sLB(n) is also denoted as s(n). The difference 104, dLB(n), between s(n) and the local synthesis 105, ŝenh(n), of the CELP encoder at 12 kbit/s is processed by the perceptual weighting filter WLB(z). The parameters of WLB(z) are derived from the quantized LP coefficients of the CELP encoder. Furthermore, filter WLB(z) includes a gain compensation that guarantees spectral continuity between the output 106, dLBw(n), of WLB(z) and the higher-band input signal 107, sHB(n).
The weighted difference dLBw(n) is then transformed into the frequency domain by MDCT. The higher-band input signal 108, sHBfold(n), obtained after decimation and spectral folding by (−1)^n, is pre-processed by a low-pass filter Hh2(z) with a 3000 Hz cut-off frequency. The resulting signal sHB(n) is coded by the TDBWE encoder. The signal sHB(n) is also transformed into the frequency domain by MDCT. The two sets of MDCT coefficients 109, DLBw(k), and 110, SHB(k), are finally coded by the TDAC encoder. In addition, some parameters are transmitted by the frame erasure concealment (FEC) encoder in order to introduce parameter-level redundancy in the bitstream. This redundancy improves quality in the presence of erased superframes.
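The effect of spectral folding by (−1)^n can be seen on a single sinusoid: multiplying by (−1)^n mirrors the spectrum about one quarter of the sampling rate. The sketch below is illustrative only; the test signal, the block length of 32 and the naive DFT are assumptions and are not part of the coder.

```python
import cmath
import math


def fold(x):
    """Spectral folding: multiply sample n by (-1)^n."""
    return [(-1) ** n * v for n, v in enumerate(x)]


def dft_magnitudes(x):
    """Magnitudes of a naive DFT (adequate for a short illustration)."""
    size = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / size) for n in range(size)))
        for k in range(size)
    ]


def peak_bin(x):
    """Index of the largest DFT magnitude among bins 0..N/2."""
    mags = dft_magnitudes(x)
    return max(range(len(x) // 2 + 1), key=lambda k: mags[k])


N = 32
x = [math.cos(2 * math.pi * 3 * n / N) for n in range(N)]  # tone at bin 3
peak_before = peak_bin(x)
peak_after = peak_bin(fold(x))
print(peak_before, peak_after)
```

A tone at bin 3 moves to bin N/2 − 3 = 13 after folding, so content near the top of the band ends up near the bottom and vice versa.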
TDBWE Encoder
A TDBWE encoder is illustrated in FIG. 2. The TDBWE encoder extracts a fairly coarse parametric description from the pre-processed and down-sampled higher-band signal 201, sHB(n). This parametric description comprises time envelope 202 and frequency envelope 203 parameters. The 20 ms input speech superframe sHB(n) (8 kHz sampling frequency) is subdivided into 16 segments of length 1.25 ms each, i.e., each segment comprises 10 samples. The 16 time envelope parameters 202, Tenv(i), i=0, . . . , 15, are computed as logarithmic subframe energies before the quantization. For the computation of the 12 frequency envelope parameters 203, Fenv(j), j=0, . . . , 11, the signal 201, sHB(n), is windowed by a slightly asymmetric analysis window. This window is 128 taps long (16 ms) and is constructed from the rising slope of a 144-tap Hanning window, followed by the falling slope of a 112-tap Hanning window. The maximum of the window is centered on the second 10 ms frame of the current superframe. The window is constructed such that the frequency envelope computation has a lookahead of 16 samples (2 ms) and a lookback of 32 samples (4 ms). The windowed signal is transformed by FFT. The even bins of the full-length 128-tap FFT are computed using a polyphase structure. Finally, the frequency envelope parameter set is calculated as logarithmic weighted sub-band energies for 12 evenly spaced and equally wide overlapping sub-bands in the FFT domain.
G.729.1 Decoder
A functional diagram of the G.729.1 decoder is presented in FIG. 3. The specific case of frame erasure concealment is not considered in this figure. The decoding depends on the actual number of received layers or equivalently on the received bit rate.
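The time envelope extraction can be sketched as follows. The text only states that the parameters are logarithmic subframe energies, so the exact form used here (one half of the base-2 logarithm of the mean segment energy, with a small floor to avoid log of zero) is an assumption, as are the function and constant names.

```python
import math

SEGMENTS = 16   # segments per 20 ms superframe
SEG_LEN = 10    # samples per 1.25 ms segment at 8 kHz


def time_envelope(s_hb, floor=1e-12):
    """Return 16 log-domain segment energies for one 160-sample superframe.

    Assumed form: Tenv(i) = 0.5 * log2(mean energy of segment i).
    """
    if len(s_hb) != SEGMENTS * SEG_LEN:
        raise ValueError("expected a 160-sample higher-band superframe")
    env = []
    for i in range(SEGMENTS):
        seg = s_hb[i * SEG_LEN:(i + 1) * SEG_LEN]
        energy = sum(v * v for v in seg) / SEG_LEN
        env.append(0.5 * math.log2(energy + floor))
    return env


t_env = time_envelope([4.0] * 160)  # constant amplitude 4 -> mean energy 16
print(t_env[0])
```

With this assumed definition, the envelope value is the base-2 logarithm of the RMS amplitude of each segment, which is why a constant amplitude of 4 maps to a value of about 2.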
If the received bit rate is:
        8 kbit/s (Layer 1): The core layer is decoded by the embedded CELP decoder to obtain 301, ŝLB(n)=ŝ(n). Then, ŝLB(n) is postfiltered into 302, ŝLBpost(n), and post-processed by a high-pass filter (HPF) into 303, ŝLBqmf(n)=ŝLBhpf(n). The QMF synthesis filterbank defined by the filters G1(z) and G2(z) generates the output with a high-frequency synthesis 304, ŝHBqmf(n), set to zero.
        12 kbit/s (Layers 1 and 2): The core layer and narrowband enhancement layer are decoded by the embedded CELP decoder to obtain 301, ŝLB(n)=ŝenh(n), and ŝLB(n) is then postfiltered into 302, ŝLBpost(n), and high-pass filtered to obtain 303, ŝLBqmf(n)=ŝLBhpf(n). The QMF synthesis filterbank generates the output with a high-frequency synthesis 304, ŝHBqmf(n), set to zero.
        14 kbit/s (Layers 1 to 3): In addition to the narrowband CELP decoding and lower-band adaptive postfiltering, the TDBWE decoder produces a high-frequency synthesis 305, ŝHBbwe(n), which is then transformed into the frequency domain by MDCT so as to zero the frequency band above 3000 Hz in the higher-band spectrum 306, ŜHBbwe(k). The resulting spectrum 307, ŜHB(k), is transformed into the time domain by inverse MDCT and overlap-add before spectral folding by (−1)^n. In the QMF synthesis filterbank the reconstructed higher-band signal 304, ŝHBqmf(n), is combined with the respective lower-band signal 302, ŝLBqmf(n)=ŝLBpost(n), reconstructed at 12 kbit/s without high-pass filtering.
        Above 14 kbit/s (Layers 1 to 4+): In addition to the narrowband CELP and TDBWE decoding, the TDAC decoder reconstructs MDCT coefficients 308, D̂LBw(k), and 307, ŜHB(k), which correspond to the reconstructed weighted difference in the lower band (0-4000 Hz) and the reconstructed signal in the higher band (4000-7000 Hz). Note that in the higher band, the non-received sub-bands and the sub-bands with zero bit allocation in TDAC decoding are replaced by the level-adjusted sub-bands of ŜHBbwe(k).
Both D̂LBw(k) and ŜHB(k) are transformed into the time domain by inverse MDCT and overlap-add. The lower-band signal 309, d̂LBw(n), is then processed by the inverse perceptual weighting filter WLB(z)^−1. To attenuate transform coding artefacts, pre/post-echoes are detected and reduced in both the lower- and higher-band signals 310, d̂LB(n), and 311, ŝHB(n). The lower-band synthesis ŝLB(n) is postfiltered, while the higher-band synthesis 312, ŝHBfold(n), is spectrally folded by (−1)^n. The signals ŝLBqmf(n)=ŝLBpost(n) and ŝHBqmf(n) are then combined and upsampled in the QMF synthesis filterbank.
TDBWE Decoder
FIG. 4 illustrates the concept of the TDBWE decoder module. The received TDBWE parameters, which are computed by the parameter extraction procedure, are used to shape an artificially generated excitation signal 402, ŝHBexc(n), according to the desired time and frequency envelopes 408, T̂env(i), and 409, F̂env(j). This is followed by a time-domain post-processing procedure.
The TDBWE excitation signal 401, exc(n), is generated on a 5 ms subframe basis from parameters which are transmitted in Layers 1 and 2 of the bitstream. Specifically, the following parameters are used: the integer pitch lag T0=int(T1) or int(T2) depending on the subframe, the fractional pitch lag frac, the energy Ec of the fixed codebook contributions, and the energy Ep of the adaptive codebook contribution. Ec is mathematically expressed as
$$E_c = \sum_{n=0}^{39} \left( \hat{g}_c \cdot c(n) + \hat{g}_{enh} \cdot c'(n) \right)^2;$$
and Ep is
$$E_p = \sum_{n=0}^{39} \left( \hat{g}_p \cdot v(n) \right)^2.$$
The parameters of the excitation generation are computed every 5 ms subframe. The excitation signal generation consists of the following steps:
        estimation of two gains gv and guv for the voiced and unvoiced contributions to the final excitation signal exc(n);
        pitch lag post-processing;
        generation of the voiced contribution;
        generation of the unvoiced contribution; and
        low-pass filtering.
In G.729.1, TDBWE is used to code the wideband signal from 4 kHz to 7 kHz. The narrowband (NB) signal from 0 to 4 kHz is coded with the G.729 CELP coder, where the excitation consists of an adaptive codebook contribution and a fixed codebook contribution. The adaptive codebook contribution comes from the periodicity of voiced speech; the fixed codebook contributes the unpredictable portion. The ratio of the energies of the adaptive and fixed codebook excitations (including the enhancement codebook) is computed for each subframe:
$$\xi = \frac{E_p}{E_c} \qquad (1)$$
In order to reduce this ratio ξ in case of unvoiced sounds, a “Wiener filter” characteristic is applied:
$$\xi_{post} = \xi \cdot \frac{\xi}{1 + \xi} \qquad (2)$$
This leads to more consistent unvoiced sounds. The gains for the voiced and unvoiced contributions of exc(n) are determined using the following procedure. An intermediate voiced gain g′v is calculated by:
$$g'_v = \sqrt{\frac{\xi_{post}}{1 + \xi_{post}}} \qquad (3)$$
which is slightly smoothed to obtain the final voiced gain gv:
$$g_v = \sqrt{\frac{1}{2} \left( g'^{2}_{v} + g'^{2}_{v,old} \right)} \qquad (4)$$
where g′v,old is the value of g′v of the preceding subframe.
To satisfy the constraint $g_v^2 + g_{uv}^2 = 1$, the unvoiced gain is given by:
$$g_{uv} = \sqrt{1 - g_v^2} \qquad (5)$$
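The gain estimation of Equations (1) to (5) can be sketched for one subframe as follows. This is an illustrative sketch rather than the reference implementation: the function name is hypothetical, Ec is assumed to be strictly positive, and Equations (3) and (4) are assumed to carry square roots so that the gains are amplitude factors consistent with Equation (5).

```python
import math


def tdbwe_gains(e_p, e_c, gv_prime_old):
    """Voiced/unvoiced gains for one 5 ms subframe.

    e_p, e_c      : adaptive and fixed codebook energies (e_c > 0 assumed)
    gv_prime_old  : intermediate voiced gain g'_v of the preceding subframe
    Returns (g_v, g_uv, g'_v); g'_v is kept for the next subframe.
    """
    xi = e_p / e_c                                   # Eq. (1)
    xi_post = xi * xi / (1.0 + xi)                   # Eq. (2), "Wiener" shaping
    gv_prime = math.sqrt(xi_post / (1.0 + xi_post))  # Eq. (3), assumed sqrt form
    g_v = math.sqrt(0.5 * (gv_prime ** 2 + gv_prime_old ** 2))  # Eq. (4)
    g_uv = math.sqrt(max(0.0, 1.0 - g_v ** 2))       # Eq. (5)
    return g_v, g_uv, gv_prime


g_v, g_uv, gv_prime = tdbwe_gains(e_p=1.0, e_c=1.0, gv_prime_old=0.0)
print(round(g_v, 4), round(g_uv, 4))
```

For equal codebook energies and no history, ξ = 1 and ξpost = 0.5, so the voiced gain is already well below the unvoiced gain; this reflects how the Wiener-type characteristic pushes weakly periodic subframes towards the unvoiced contribution.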
The generation of a consistent pitch structure within the excitation signal exc(n) requires a good estimate of the fundamental pitch lag t0 of the speech production process. Within Layer 1 of the bitstream, the integer and fractional pitch lag values T0 and frac are available for the four 5 ms subframes of the current superframe. For each subframe the estimation of t0 is based on these parameters.
The voiced components 406, sexc,v(n), of the TDBWE excitation signal are represented as shaped and weighted glottal pulses. Thus sexc,v(n) is produced by overlap-add of single pulse contributions. The prototype pulse shapes Pi(n) with i=0, . . . , 5 and n=0, . . . , 56 are taken from a lookup table, which is plotted in FIG. 5. These pulse shapes are designed such that a certain spectral shaping, i.e., a smooth increase of the attenuation of the voiced excitation components towards higher frequencies, is incorporated and the full sub-sample resolution of the pitch lag information is utilized. Further, the crest factor of the excitation signal is strongly reduced and an improved subjective quality is obtained.
The unvoiced contribution 407, sexc,uv(n), is produced using the scaled output of a white noise generator:
$$s_{exc,uv}(n) = g_{uv} \cdot random(n), \quad n = 0, \ldots, 39 \qquad (6)$$
Having the voiced and unvoiced contributions sexc,v(n) and sexc,uv(n), the final excitation signal 402, sHBexc(n), is obtained by low-pass filtering of exc(n)=sexc,v(n)+sexc,uv(n).
The low-pass filter has a cut-off frequency of 3,000 Hz and its implementation is identical with the pre-processing low-pass filter for the high band signal.
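The unvoiced contribution of Equation (6) and its combination with the voiced contribution can be sketched as below. The uniform generator stands in for the coder's own white noise generator, the all-zero voiced contribution is a placeholder, and the 3,000 Hz low-pass filter is deliberately omitted because its coefficients are not given here; all names are hypothetical.

```python
import random

SUBFRAME_LEN = 40  # samples per 5 ms subframe at 8 kHz


def unvoiced_contribution(g_uv, rng):
    """Eq. (6): scaled white noise; rng.uniform is an assumed noise source."""
    return [g_uv * rng.uniform(-1.0, 1.0) for _ in range(SUBFRAME_LEN)]


def combine(s_exc_v, s_exc_uv):
    """exc(n) = s_exc_v(n) + s_exc_uv(n), before low-pass filtering."""
    return [v + u for v, u in zip(s_exc_v, s_exc_uv)]


rng = random.Random(0)                 # fixed seed for a repeatable sketch
s_v = [0.0] * SUBFRAME_LEN             # placeholder voiced contribution
s_uv = unvoiced_contribution(0.5, rng)
exc = combine(s_v, s_uv)
print(len(exc))
```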
The shaping of the time envelope of the excitation signal sHBexc(n) utilizes the decoded time envelope parameters T̂env(i) with i=0, . . . , 15 to obtain a signal 403, ŝHBT(n), with a time envelope which is nearly identical to the time envelope of the encoder side HB signal sHB(n). This is achieved by a simple scalar multiplication of a gain function gT(n) with the excitation signal sHBexc(n). In order to determine the gain function gT(n), the excitation signal sHBexc(n) is segmented and analyzed in the same manner as described for the parameter extraction in the encoder. The obtained analysis results from sHBexc(n) are, again, time envelope parameters T̃env(i) with i=0, . . . , 15. They describe the observed time envelope of sHBexc(n). Then, a preliminary gain factor is calculated by comparing T̂env(i) with T̃env(i). For each signal segment with index i=0, . . . , 15, these gain factors are interpolated using a “flat-top” Hanning window. This interpolation procedure finally yields the desired gain function.
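The comparison of decoded and observed time envelopes can be sketched as follows. Two assumptions are made and should be read as such: the envelope values are taken to be base-2 logarithms of segment amplitudes, so the preliminary gain is 2 raised to their difference, and the “flat-top” Hanning interpolation is replaced by a piecewise-constant gain to keep the sketch short. The function names are hypothetical.

```python
SEG_LEN = 10  # samples per 1.25 ms segment at 8 kHz


def preliminary_gains(t_env_decoded, t_env_observed):
    """Per-segment gain from decoded vs. observed log2-domain envelopes."""
    return [2.0 ** (d - o) for d, o in zip(t_env_decoded, t_env_observed)]


def apply_piecewise_gain(excitation, gains):
    """Scale each sample by the gain of its segment (no interpolation)."""
    return [gains[n // SEG_LEN] * v for n, v in enumerate(excitation)]


gains = preliminary_gains([1.0] * 16, [0.0] * 16)  # decoded one octave above observed
shaped = apply_piecewise_gain([1.0] * 160, gains)
print(gains[0], shaped[0])
```

In the coder itself the interpolation smooths the transitions between neighbouring segments, which a piecewise-constant gain does not do.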
The decoded frequency envelope parameters F̂env(j) with j=0, . . . , 11 are representative for the second 10 ms frame within the 20 ms superframe. The first 10 ms frame is covered by parameter interpolation between the current parameter set and the parameter set from the preceding superframe. The superframe of 403, ŝHBT(n), is analyzed twice per superframe. This is done for the first (l=1) and for the second (l=2) 10 ms frame within the current superframe and yields two observed frequency envelope parameter sets F̃env,l(j) with j=0, . . . , 11 and frame index l=1, 2. A correction gain factor per sub-band is then determined for the first and for the second frame by comparing the decoded frequency envelope parameters F̂env(j) with the observed frequency envelope parameter sets F̃env,l(j). These gains control the channels of a filterbank equalizer. The filterbank equalizer is designed such that its individual channels match the sub-band division and is defined by its filter impulse responses and a complementary high-pass contribution.
The signal 404, ŝHBF(n), is obtained by shaping both the desired time and frequency envelopes on the excitation signal sHBexc(n) (generated from parameters estimated in the lower band by the CELP decoder). There is in general no coupling between this excitation and the related envelope shapes T̂env(i) and F̂env(j). As a result, some clicks may be present in the signal ŝHBF(n). To attenuate these artifacts, an adaptive amplitude compression is applied to ŝHBF(n). Each sample of ŝHBF(n) of the i-th 1.25 ms segment is compared to the decoded time envelope T̂env(i), and the amplitude of ŝHBF(n) is compressed in order to attenuate large deviations from this envelope. The signal after this post-processing is denoted 405, ŝHBbwe(n).
The SBR Principle
When analyzing the capabilities of today's leading waveform audio codecs, it becomes clear that for high compression ratios of, for example, 20:1 and above, the resulting audio quality is not satisfactory. In this compression range, the psychoacoustic demand to stay below the so-called masking threshold curve in the frequency domain cannot be fulfilled due to bit starvation. As a result, the quantization noise introduced during the encoding process will become audible and annoying to the listener. One way to cope with this problem is to limit the audio bandwidth, such that fewer spectral lines have to be encoded. This basic trade-off is used for most waveform audio codecs. As an example, the typical bandwidth of the latest MPEG waveform codec, AAC, at a bit rate of 24 kbps, mono, is limited to around 7 kHz, resulting in a reasonably clean, but dull impression.
The basic idea behind SBR is the observation that usually a strong correlation between the characteristics of the high frequency range of a signal (further referred to as ‘highband’) and the characteristics of the low frequency range (further referred to as ‘lowband’) of the same signal is present. Thus, a good approximation for the representation of the original input signal highband can be achieved by a transposition from the lowband to the highband (see FIG. 6 (a)). In addition to the transposition, the reconstruction of the highband incorporates shaping of the spectral envelope as outlined in FIG. 6 (b). This process is controlled by transmission of the highband spectral envelope of the original input signal. Further guidance information sent from the encoder controls other synthesis means, such as inverse filtering, noise and sine addition, in order to cope with program material where transposition alone is insufficient. The guidance information is further referred to as SBR data. SBR data is generally coded as efficiently as possible to achieve a low overhead data rate.
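The two core operations, transposition and envelope adjustment, can be illustrated with a toy decoder-side sketch. It is not the SBR algorithm of any standard: the spectrum is a plain list of coefficients, the highband is assumed to be as wide as the lowband, the transmitted envelope is assumed to be one target energy per band, and inverse filtering, noise addition and sine addition are left out. All names are hypothetical.

```python
import math


def sbr_toy_decode(lowband, target_energies, band_width):
    """Rebuild a full-band spectrum from a decoded lowband and an envelope.

    lowband         : decoded lowband spectral coefficients
    target_energies : transmitted highband energy for each band
    band_width      : number of coefficients per highband band
    """
    if len(lowband) != len(target_energies) * band_width:
        raise ValueError("toy model expects highband width == lowband width")
    highband = list(lowband)                       # transposition: copy lowband up
    for b, target in enumerate(target_energies):   # envelope adjustment per band
        lo, hi = b * band_width, (b + 1) * band_width
        energy = sum(c * c for c in highband[lo:hi])
        gain = math.sqrt(target / energy) if energy > 0.0 else 0.0
        for k in range(lo, hi):
            highband[k] *= gain
    return list(lowband) + highband


full = sbr_toy_decode([1.0, 1.0, 2.0, 2.0], target_energies=[8.0, 2.0], band_width=2)
print(full)
```

The fine structure of the highband comes entirely from the lowband, while only a few energy values per frame have to be transmitted; this is where the low overhead data rate comes from.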
The SBR process can be combined with any conventional waveform audio codec by pre-processing at the encoder side, and post-processing at the decoder side. The SBR encodes the high frequency portion of an audio signal at very low cost, whereas the conventional audio codec is still used to code the lower frequency portion of the signal. Relaxing the conventional codec by limiting its audio bandwidth while maintaining the full output audio bandwidth can, therefore, be realized. At the encoder side, the original input signal is analyzed, the highband's spectral envelope and its characteristics in relation to the lowband are encoded and the resulting SBR data is multiplexed with the core codec bitstream. At the decoder side, the SBR data is first de-multiplexed. The decoding process is organized in two stages: Firstly, the core decoder generates the low band. Secondly, the SBR decoder operates as a postprocessor, using the decoded SBR data to guide the spectral band replication process. A full bandwidth output signal is obtained. Non-SBR enhanced decoders can still decode the backward compatible part of the bit stream, resulting in only a band-limited output signal.
Whereas the basic approach seems simple, making it work reasonably well is not. It is a non-trivial task to code the SBR data in a way that achieves good spectral resolution, allows sufficient time resolution on transients to avoid pre-echoes, has a low overhead data rate that achieves a significant coding gain, and takes care of cases with low correlation between lowband and highband characteristics, so as to avoid the artificial sound caused by using transposition and envelope adjustment alone.
SBR Combined with Traditional Audio Codecs
As mentioned above, SBR can be combined with any waveform codec. When combining AAC with SBR, the resulting codec is named aacPlus and has recently been standardized within MPEG-4 (1). Another example is mp3PRO, where SBR has been added to MPEG-1/2 Layer-3 (mp3) (3).
SBR Combined with Speech Codecs
Parametric codecs such as HVXC (Harmonic Vector eXcitation Coding) or CELP generally reach a point where adding more bits within the existing coding scheme does not lead to any significant increase in subjective audio quality. However, the SBR method has turned out to be useful together with speech codecs as well. Today's listeners are used to the full audio bandwidth of CDs. Although the sound quality obtained from SBR-enhanced speech codecs is far from transparent, an increase in bandwidth from the 4 kHz or less typically offered by speech codecs to 10 kHz or more is generally appreciated. Furthermore, speech intelligibility under noisy listening conditions increases, since reproduction of fricatives (‘s’, ‘f’, etc.) improves once the bandwidth is extended.