SBR (Spectral Band Replication), like other bandwidth extension techniques, is meant to encode and decode spectral high band parts of audio signals on top of a core coder stage. SBR is standardized in [ISO09] and used jointly with AAC in the MPEG-4 Profile HE-AAC, which is employed in various application standards, e.g. 3GPP [3GP12a], DAB+ [EBU10] and DRM [EBU12].
State of the art SBR decoding in conjunction with AAC is described in [ISO09, section 4.6.18].
FIG. 1 illustrates the state-of-the-art SBR decoder, which comprises an analysis and a synthesis filterbank, SBR data decoding, an HF generator and an HF adjuster:
In state-of-the-art SBR decoding, the output of the core coder is a low-pass filtered representation of the original signal. It is the input xpcm_in to the QMF analysis filterbank of the SBR decoder.
The output of this filterbank, XQMF_ana, is handed over to the HF generator, where the patching takes place. Patching is essentially a replication of the low-band spectrum up into the high bands.
The patched spectrum XHF_patched is then passed to the HF adjuster, together with the spectral information of the high bands (envelopes) obtained from the SBR data decoding. The envelope information is Huffman decoded, then differentially decoded and finally de-quantized in order to obtain the envelope data (see FIG. 2). The obtained envelope data is a set of scale factors which covers a certain amount of time, e.g. a full frame or parts of it. The HF adjuster adjusts the energies of the patched high bands so that, for every band k, they match the original high-band energies at the encoder side as closely as possible. Equation 1 and FIG. 2 clarify this:
$$g_{sbr}[k] = \frac{E_{Ref}[k]}{E_{EstAvg}[l]}, \qquad E_{Adj}[k] = E_{Est}[k] \times g_{sbr}[k] \qquad (1)$$
where
ERef[k] denotes the energy for one band k, being transmitted in encoded form in the SBR bitstream,
EEst[k] denotes the energy from one high-band k, patched by the HF generator;
EEstAvg[l] denotes the averaged high-band energy inside one scale factor band l, which is defined as a range of bands between a start band k_start^l and a stop band k_stop^l:
$$E_{EstAvg}[l] = \frac{1}{N_l} \sum_{k=k_{start}^{l}}^{k_{stop}^{l}} E_{Est}(k) \qquad (2)$$
EAdj[k] denotes the energy from one high-band k, adjusted by the HF adjuster, using gsbr;
gsbr[k] denotes one gain factor, resulting from the division shown in equation (1).
The synthesis QMF filterbank decodes the processed QMF samples xHF_adj to PCM audio xpcm_out.
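The gain computation of equations (1) and (2) may be sketched in C as follows. This is a minimal illustration only: the function name, the array layout and the guard against a zero average are assumptions, and N_l is assumed to be the number of bands inside scale factor band l.

```c
#include <stddef.h>

/* Hypothetical helper (not taken from any standard): adjusts the energies of
 * the bands k_start..k_stop of one scale factor band l. */
void sbr_adjust_band_energies(const float *e_ref, const float *e_est,
                              float *e_adj, size_t k_start, size_t k_stop)
{
    /* N_l: assumed to be the number of bands in scale factor band l */
    size_t n_l = k_stop - k_start + 1;

    /* Equation (2): average of the patched energies over the band range */
    float e_est_avg = 0.0f;
    for (size_t k = k_start; k <= k_stop; k++)
        e_est_avg += e_est[k];
    e_est_avg /= (float)n_l;

    for (size_t k = k_start; k <= k_stop; k++) {
        /* Equation (1); the zero check is an added safeguard (assumption) */
        float g_sbr = (e_est_avg > 0.0f) ? e_ref[k] / e_est_avg : 0.0f;
        e_adj[k] = e_est[k] * g_sbr;
    }
}
```

For example, with reference energies {4, 4} and patched energies {1, 3}, the average is 2, the gain is 2 for both bands, and the adjusted energies become {2, 6}.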
If the reconstructed spectrum lacks noise that was present in the original high bands but was not patched by the HF generator, additional noise can be added with a certain noise floor Q for each band k.
$$Q[k] = \frac{\mathrm{Energy}_{\mathrm{Additional\_Noise}}[k]}{\mathrm{Energy}_{\mathrm{HF\_Generated}}[k]} \qquad (3)$$
Moreover, state of the art SBR allows for moving SBR frame borders within certain limits and multiple envelopes per frame.
SBR decoding in conjunction with CELP/HVXC is described in [EBU12, section 5.6.2.2]. The CELP/HVXC+SBR decoder in DRM is closely related to state of the art SBR decoding in HE-AAC, described in section 1.1.1. Basically, FIG. 1 applies.
Decoding of envelope information is adapted to spectral properties of speech-like signals, as described in [EBU12, section 5.6.2.2.4].
In regular AMR-WB decoding, the high-band excitation is obtained by generating white noise uHB1(n). The power of the high-band excitation is set equal to the power of the lower band excitation u2(n),
which means that
$$u_{HB2}(n) = u_{HB1}(n) \sqrt{\frac{\sum_{k=0}^{63} u_2^2(k)}{\sum_{k=0}^{63} u_{HB1}^2(k)}} \qquad (4)$$
Finally, the high-band excitation is found by
$$u_{HB}(n) = \hat{g}_{HB} \cdot u_{HB2}(n) \qquad (5)$$
where ĝHB is a gain factor.
In the 23.85 kbit/s mode, ĝHB is decoded from the received gain index (side information).
In the 6.60, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85 and 23.05 kbit/s modes, gHB is estimated using voicing information bounded by [0.1, 1.0]. First, the tilt of synthesis etilt is found:
$$e_{tilt} = \frac{\sum_{n=0}^{63} \hat{s}_{hp}(n) \cdot \hat{s}_{hp}(n-1)}{\sum_{n=0}^{63} \hat{s}_{hp}^2(n)} \qquad (6)$$
where ŝhp is the high-pass filtered lower band speech synthesis ŝhp12,8(n) with cut-off frequency of 400 Hz. gHB is then found by
$$g_{HB} = w_{SP} \cdot g_{SP} + (1 - w_{SP}) \cdot g_{BG} \qquad (7)$$
where gSP=1−etilt is the gain for the speech signal, gBG=1.25 gSP is the gain for the background noise signal, and wSP is a weighting function set to 1, when voice activity detection (VAD) is ON, and 0 when VAD is OFF. gHB is bounded between [0.1, 1.0]. In case of voiced segments where less energy is present at high frequencies, etilt approaches 1 resulting in a lower gain gHB. This reduces the energy of the generated noise in case of voiced segments.
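The gain estimation of equations (6) and (7) may be sketched in C as follows. The function name and parameters are hypothetical; it is assumed that the sums run over one subframe and that the sample preceding the subframe is supplied by the caller as s_hp_prev.

```c
#include <stddef.h>

/* Hypothetical helper: estimates the high-band gain from the spectral tilt
 * of the high-pass filtered lower-band synthesis s_hp. */
float amrwb_estimate_hb_gain(const float *s_hp, size_t len,
                             float s_hp_prev, int vad_on)
{
    float num = 0.0f, den = 0.0f;
    float prev = s_hp_prev; /* s_hp(n-1) for n = 0 (assumption) */
    for (size_t n = 0; n < len; n++) {
        num += s_hp[n] * prev;
        den += s_hp[n] * s_hp[n];
        prev = s_hp[n];
    }
    /* Equation (6); zero check added as a safeguard (assumption) */
    float e_tilt = (den > 0.0f) ? num / den : 0.0f;

    float g_sp = 1.0f - e_tilt;        /* gain for the speech signal */
    float g_bg = 1.25f * g_sp;         /* gain for background noise */
    float w_sp = vad_on ? 1.0f : 0.0f; /* weighting from the VAD decision */

    /* Equation (7), then bounded to [0.1, 1.0] */
    float g_hb = w_sp * g_sp + (1.0f - w_sp) * g_bg;
    if (g_hb < 0.1f) g_hb = 0.1f;
    if (g_hb > 1.0f) g_hb = 1.0f;
    return g_hb;
}
```

For the samples {2, 1, 0, 0} with a preceding sample of 0, the tilt is 2/5 = 0.4, giving a gain of 0.6 with VAD on and 0.75 with VAD off. A constant signal yields a tilt of 1 and therefore the lower bound of 0.1.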
Then the high-band LP synthesis filter AHB (z) is derived from the weighted low-band LP synthesis filter:
$$A_{HB}(z) = \hat{A}\left(\frac{z}{0.8}\right) \qquad (8)$$
where Â(z) is the interpolated LP synthesis filter. Â(z) has been computed by analyzing the signal at a sampling rate of 12.8 kHz, but it is now used for a 16 kHz signal. This means that the band 5.1-5.6 kHz in the 12.8 kHz domain will be mapped to 6.4-7.0 kHz in the 16 kHz domain.
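Evaluating an LP polynomial at z/0.8 amounts to multiplying its i-th coefficient by 0.8^i. A minimal C sketch, assuming the convention A(z) = a[0] + a[1]·z^-1 + ... + a[order]·z^-order (the function name and the parameter gamma are hypothetical):

```c
#include <stddef.h>

/* Hypothetical helper: derives the weighted filter A(z/gamma) from A(z) by
 * scaling coefficient i with gamma^i. */
void weight_lp_filter(const float *a, float *a_hb, size_t order, float gamma)
{
    float w = 1.0f; /* gamma^i, starting with gamma^0 */
    for (size_t i = 0; i <= order; i++) {
        a_hb[i] = a[i] * w;
        w *= gamma;
    }
}
```

Calling weight_lp_filter with gamma = 0.8f corresponds to equation (8). The frequency mapping mentioned above follows from the ratio of the sampling rates, 16/12.8 = 1.25: 5.1 kHz becomes about 6.4 kHz and 5.6 kHz becomes 7.0 kHz.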
uHB(n) is then filtered through AHB(z). The output of this high-band synthesis sHB(n) is filtered through a band-pass FIR filter HHB(z), which has the pass-band from 6 to 7 kHz. Finally, sHB is added to synthesized speech to produce the synthesized output speech signal.
In AMR-WB+, the HF signal is composed of the frequency components of the input signal above fs/4. To represent the HF signal at a low rate, a bandwidth extension (BWE) approach is employed. In BWE, energy information is sent to the decoder in the form of spectral envelope and frame energy, but the fine structure of the signal is extrapolated at the decoder from the received (decoded) excitation signal in the LF signal.
The spectrum of the down sampled signal sHF can be seen as a folded version of the high-frequency band prior to down-sampling. An LP analysis is performed on sHF(n) to obtain a set of coefficients, which model the spectral envelope of this signal. Typically, fewer parameters may be used than in the LF signal. Here, a filter of order 8 is used. The LP coefficients are then transformed into ISP representation and quantized for transmission.
The synthesis of the HF signal implements a kind of bandwidth extension (BWE) mechanism and uses some data from the LF decoder. It is an evolution of the BWE mechanism used in the AMR-WB speech decoder (see above). The HF decoder is detailed in FIG. 3.
The HF signal is synthesized in 2 steps:
1. Calculation of the HF excitation;
2. Computation of the HF signal from the HF excitation.
The HF excitation is obtained by shaping the LF excitation signal in time-domain with scalar factors (or gains) on a 64-sample subframe basis. This HF excitation is post-processed to reduce the “buzziness” of the output, and then filtered by an HF linear-predictive synthesis filter 1/AHF(z). The result is further post-processed to smooth energy variations. For further information please refer to [3GP09].
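The first step, the time-domain shaping with one gain per 64-sample subframe, may be sketched in C as follows. The function name and buffer layout are assumptions; the buzziness reduction, the synthesis filter and the energy smoothing are deliberately omitted.

```c
#include <stddef.h>

#define SUBFRAME_LEN 64 /* subframe length stated in the text */

/* Hypothetical helper: derives the HF excitation by scaling each 64-sample
 * subframe of the LF excitation with its own scalar gain. */
void shape_hf_excitation(const float *exc_lf, float *exc_hf,
                         const float *gains, size_t num_subframes)
{
    for (size_t sf = 0; sf < num_subframes; sf++) {
        for (size_t n = 0; n < SUBFRAME_LEN; n++) {
            size_t idx = sf * SUBFRAME_LEN + n;
            exc_hf[idx] = gains[sf] * exc_lf[idx];
        }
    }
}
```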
The packet-loss concealment in SBR in conjunction with AAC is specified in 3GPP TS 26.402 [3GP12a, section 5.2] and was subsequently reused in DRM [EBU12, section 5.6.3.1] and DAB [EBU10, section A2].
In case of a frame loss, the number of envelopes per frame is set to one and the last valid received envelope data is reused and decreased in energy by a constant ratio for every concealed frame.
The resulting envelope data are then fed into the normal decoding process where the HF adjuster uses them to calculate the gains, which are used for adjusting the patched highbands out of the HF generator. The rest of SBR decoding takes place as usual.
Moreover, the coded noise floor delta values are set to zero, which lets the delta-decoded noise floor remain static. At the end of the decoding process, this means that the energy of the noise floor follows the energy of the HF signal.
Furthermore, the flags for adding sines are cleared.
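The concealment steps above may be sketched in C as follows. The data structure, its field names, the band limit and the attenuation parameter are all hypothetical; the referenced standards do not prescribe this layout, and the value of the constant ratio is not given here. The envelope is assumed to be stored as linear energies.

```c
#include <stddef.h>
#include <string.h>

#define MAX_BANDS 64 /* hypothetical upper limit on the number of bands */

/* Hypothetical container for the SBR data of one frame. */
typedef struct {
    int num_envelopes;
    size_t num_bands;
    float envelope[MAX_BANDS];          /* envelope energies (linear) */
    float noise_floor_delta[MAX_BANDS]; /* coded noise floor deltas */
    unsigned char add_sine[MAX_BANDS];  /* flags for adding sines */
} SbrFrameData;

/* Conceals one lost frame in place. On entry, frame holds the data of the
 * last valid (or previously concealed) frame. */
void sbr_conceal_frame(SbrFrameData *frame, float attenuation)
{
    /* One envelope per concealed frame */
    frame->num_envelopes = 1;

    /* Reuse the last envelope, decreased by a constant ratio */
    for (size_t k = 0; k < frame->num_bands; k++)
        frame->envelope[k] *= attenuation;

    /* Zero deltas keep the delta-decoded noise floor static */
    memset(frame->noise_floor_delta, 0, sizeof frame->noise_floor_delta);

    /* Clear the flags for adding sines */
    memset(frame->add_sine, 0, sizeof frame->add_sine);
}
```

Calling the function once per lost frame applies the same ratio repeatedly, so the envelope energy decays geometrically over consecutive losses.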
State of the art SBR concealment also takes care of recovery. It provides a smooth transition from the concealed signal to the correctly decoded signal with regard to energy gaps that may result from mismatched frame borders.
State of the art SBR concealment in conjunction with CELP/HVXC is described in [EBU12, section 5.6.3.2] and briefly outlined in the following:
Whenever a corrupted frame has been detected, a predetermined set of data values is applied to the SBR decoder. This yields “a static highband spectral envelope at a low relative playback level, exhibiting a roll-off towards the higher frequencies.” [EBU12, section 5.6.3.2]. Here, SBR concealment inserts some kind of comfort noise, which has no dedicated fading in SBR domain. This protects the listener's ears from potentially loud audio bursts and keeps the impression of a constant bandwidth.
State of the art concealment of the BWE of G.718 is described in [ITU08, 7.11.1.7.1] and briefly outlined as follows:
In the low delay mode, which is exclusively available for layers 1 and 2, the concealment of the high-frequency band 6000-7000 Hz is performed exactly in the same way as when no frame erasures occur. The clean-channel decoder operation for layers 1, 2 and 3 is as follows: a blind bandwidth extension is applied. The spectrum in the range 6400-7000 Hz is filled up with a white noise signal, properly scaled in the excitation domain (energy of the high-band matches the low band energy). It is then synthesized with a filter derived by weighting from the same LP synthesis filter as used in the 12.8 kHz domain. For layers 4 and 5 no bandwidth extension is performed, since those layers cover the full band up to 8 kHz.
In the default operation a low complexity processing is performed to reconstruct the high-frequency band of the synthesized signal at 16 kHz sampling frequency. First, the scaled high-frequency band excitation, u″HB(n), is linearly attenuated throughout the frame as
$$u'''_{HB}(n) = u''_{HB}(n) \cdot g_{att}(n), \quad \text{for } n = 0, \ldots, 319 \qquad (9)$$
where the frame length is 320 samples and gatt(n) is an attenuation factor which is given by
$$g_{att}(n) = 1.0 - n \cdot \frac{1.0 - \bar{g}_p}{320}, \quad \text{for } n = 0, \ldots, 319 \qquad (10)$$
In the equation above, ḡp is the average pitch gain. It is the same gain as used during concealment of the adaptive codebook. Then, the memory of the band-pass filter in the frequency range 6000-7000 Hz is attenuated using gatt(n), as derived in equation 10, to prevent any discontinuities. Finally, the high-frequency excitation signal, u′″HB(n), is filtered through the synthesis filter. The synthesized signal is then added to the concealed synthesis at a 16 kHz sampling frequency.
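The linear attenuation of equations (9) and (10) may be sketched in C as follows. The function name and argument names are assumptions; only the excitation scaling is shown, not the filter memory handling or the synthesis.

```c
#include <stddef.h>

#define FRAME_LEN 320 /* frame length stated in the text */

/* Hypothetical helper: attenuates the scaled high-band excitation linearly
 * over one frame, from 1.0 at n = 0 towards the average pitch gain. */
void g718_attenuate_hb_excitation(const float *u_in, float *u_out,
                                  float g_p_avg)
{
    for (size_t n = 0; n < FRAME_LEN; n++) {
        /* Equation (10): attenuation factor for sample n */
        float g_att = 1.0f - (float)n * (1.0f - g_p_avg) / (float)FRAME_LEN;
        /* Equation (9): attenuated excitation */
        u_out[n] = u_in[n] * g_att;
    }
}
```

With an average pitch gain of 0.5, the factor is 1.0 at n = 0, 0.75 at n = 160 and slightly above 0.5 at n = 319, so the target gain itself is approached but not reached within the frame.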
State of the art concealment of blind bandwidth extension in AMR-WB is outlined in [3GP12b, 6.2.4] and briefly summarized here:
When a frame is lost or partly lost, the high-band gain parameter is not received and an estimation for the high-band gain is used instead. This means that in case of bad/lost speech frames, the high-band reconstruction operates in the same way for all the different modes.
In case a frame is lost, the high-band LP synthesis filter is derived as usual from the LPC coefficients of the core band. The only exception is that the LPC coefficients have not been decoded from the bitstream, but were extrapolated using the regular AMR-WB concealment approach.
State of the art concealment of bandwidth extension in AMR-WB+ is outlined in [3GP09, 6.2] and briefly summarized here:
In the case of a packet loss, the control data which are internal to the HF decoder are generated from the bad frame indicator vector BFI=(bfi0, bfi1, bfi2, bfi3). These data are bfiisfhf, BFIGAIN, and the number of subframes for ISF interpolation. The nature of these data is defined in more detail below:
bfiisfhf is a binary flag indicating the loss of the ISF parameters. As the ISF parameters for the HF signal are transmitted in the first packet (containing the first subframe) being either HF20, 40 or 80, the loss flag is set to the bfi indicator of the first subframe (bfi0). The same holds true for the indication of lost HF gains. If the first packet/subframe of the current mode is lost (HF20, 40 or 80) the gain is lost and needs to be concealed.
The concealment of the HF ISF vectors is very similar to the ISF concealment for the core ISFs. The main idea is to reuse the last good ISF vector, but shift it towards the mean ISF vector (where the mean ISF vector is offline trained):
$$isf_q[i] = 0.9 \cdot isf_q[i] + 0.1 \cdot mean\_isf\_hf[i] \qquad (11)$$
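The update of equation (11) may be sketched in C as follows. The function name and the order parameter are assumptions; for the HF filter of order 8 mentioned earlier, the caller would pass order = 8.

```c
#include <stddef.h>

/* Hypothetical helper: shifts the last good HF ISF vector towards the
 * offline-trained mean vector, in place. */
void conceal_hf_isf(float *isf_q, const float *mean_isf_hf, size_t order)
{
    for (size_t i = 0; i < order; i++)
        isf_q[i] = 0.9f * isf_q[i] + 0.1f * mean_isf_hf[i];
}
```

Applied once per lost frame, each call moves the vector a further 10% of the remaining distance towards the mean, so the spectral envelope flattens gradually towards its long-term average.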
The BWE gains (g0, . . . , gnb−1) are estimated according to the following source code (in the code: ĝi ≙ gain_q[i]; 2.807458 is a decoder constant).
/* use the past gains slightly shifted towards the means */
*past_q = (0.9f*(*past_q + 20.0f)) - 20.0f;
for (i=0; i<4; i++) {
    gain_q[i] = *past_q + 2.807458f;
}
tmp = 0.0;
for (i=0; i<4; i++) {
    tmp += gain_q[i];
}
*past_q = 0.25f*tmp - 2.807458f;
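For a quick check of its behavior, the fragment above can be wrapped in a function. The wrapper name, the constants' names and the declarations of i and tmp are assumptions added for illustration; the statements themselves are those of the fragment.

```c
#define NB_GAINS 4            /* number of gains set by the fragment */
#define GAIN_OFFSET 2.807458f /* decoder constant from the fragment */

/* Hypothetical wrapper around the gain concealment fragment. */
void conceal_bwe_gains(float *past_q, float gain_q[NB_GAINS])
{
    int i;
    float tmp;

    /* use the past gains slightly shifted towards the means */
    *past_q = (0.9f * (*past_q + 20.0f)) - 20.0f;
    for (i = 0; i < NB_GAINS; i++) {
        gain_q[i] = *past_q + GAIN_OFFSET;
    }

    /* average of the four (identical) gains, minus the constant */
    tmp = 0.0f;
    for (i = 0; i < NB_GAINS; i++) {
        tmp += gain_q[i];
    }
    *past_q = 0.25f * tmp - GAIN_OFFSET;
}
```

Because all four gains are identical, the final assignment restores the value computed in the first statement (up to rounding). Starting from *past_q = 0, one call gives about -2 and a second call about -3.8, so the memory decays towards -20 over consecutive lost frames.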
In order to derive the “gains to match the magnitude at fs/4”, the same algorithm as in clean channel decoding is performed, with the exception that the ISFs for the HF and/or the LF part may already be concealed. All following steps, such as linear/dB interpolation, summation and application of gains, are the same as in the clean channel case.
To derive the excitation, the same procedure is applied as in a correctly received frame, where the lower band excitation is used after it was randomized, amplified in the time-domain with subframe gains and shaped in the frequency domain with an LP filter, and after its energy was smoothed over time.
Then the synthesis is performed according to FIG. 3.
AES convention paper 6789: Schneider, Krauss and Ehret [SKE06] describe a concealment technique which reuses the last valid SBR envelope data. If more than one SBR frame is lost, a fadeout is applied. “The basic principle is to simply lock the last known valid SBR envelope values until SBR processing may be continued with newly transmitted data. In addition a fade-out is performed if more than one SBR frame is not decodable.”
AES convention paper 6962: Sang-Uk Ryu and Kenneth Rose [RR06] describe a concealment technique which estimates the parametric information, utilizing SBR data from the previous and the next frame. High band envelopes are adaptively estimated from energy evolution in the surrounding frames.
The packet-loss concealment concepts may produce a perceptually degraded audio signal during packet loss.