Audio bandwidth extension (BWE) technology is typically used in modern audio codecs to code wide-band audio signals efficiently at low bit rates. Its principle is to use a parametric representation of the original high frequency (HF) content to synthesize an approximation of the HF from the lower frequency (LF) data.
FIG. 1 is a diagram showing such a BWE technology-based audio codec. In its encoder, a wide-band audio signal is first separated (101 & 103) into an LF part and an HF part. The LF part is coded (104) in a waveform-preserving way. Meanwhile, the relationship between the LF part and the HF part is analyzed (102) (typically in the frequency domain) and described by a set of HF parameters. Because the HF part is described parametrically, the multiplexed (105) waveform data and HF parameters can be transmitted to the decoder at a low bit rate.
In the decoder, the LF part is first decoded (107). To approximate the original HF part, the decoded LF part is transformed (108) to the frequency domain, and the resulting LF spectrum is modified (109) to generate an HF spectrum under the guidance of some of the decoded HF parameters. The HF spectrum is further refined (110) by post-processing, also under the guidance of some of the decoded HF parameters. The refined HF spectrum is converted (111) to the time domain and combined with the delayed (112) LF part. As a result, the final reconstructed wide-band audio signal is output.
Note that one important step in the BWE technology is generating an HF spectrum from the LF spectrum (109). There are several ways to do this, such as copying the LF portion to the HF location, non-linear processing, or upsampling.
The most well-known audio codec that uses such a BWE technology is MPEG-4 HE-AAC, in which the BWE technology is specified as spectral band replication (SBR). In SBR, the HF part is generated by simply copying the LF portion, within the QMF representation, to the HF spectral location.
Such a spectral copying operation, also called patching, is simple and has proved to be efficient in most cases. However, at very low bit rates (e.g. <20 kbit/s mono), where only small LF part bandwidths are feasible, such SBR technology can lead to undesired auditory artifacts such as roughness and unpleasant timbre (for example, see Non-Patent Literature (NPL) 1).
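The copy-based patching described above can be illustrated with a minimal Python sketch. It assumes the LF spectrum is available as a plain list of subband values and fills the HF range by repeating those values in order. The function name copy_up_patch and the sample values are hypothetical, and the post-processing (110) that refines the copied spectrum is omitted.

```python
def copy_up_patch(lf_bins, num_hf_bins):
    """Fill num_hf_bins HF bins by repeating the LF bins in order (hypothetical helper)."""
    if not lf_bins:
        raise ValueError("lf_bins must not be empty")
    # Bin i of the HF range is taken from LF bin (i mod len(lf_bins)).
    return [lf_bins[i % len(lf_bins)] for i in range(num_hf_bins)]


# Illustrative LF magnitudes only; these are not measured data.
lf = [1.0, 0.8, 0.5, 0.2]
hf = copy_up_patch(lf, 6)
full_band = lf + hf  # LF spectrum followed by the patched HF spectrum
```

Because the HF bins are exact copies, the spacing between harmonics is generally not preserved across the LF/HF border, which is one way to picture the roughness mentioned above.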
Therefore, to avoid such artifacts, which result from the mirroring or copying operation in low bit rate coding scenarios, the standard SBR technology is enhanced and extended with the following main changes (for example, see NPL 2):
(1) to modify the patching algorithm from copying pattern to a phase vocoder driven patching pattern
(2) to increase adaptive time resolution for post-processing parameters.
As a result of the first modification (aforementioned (1)), by spreading the LF spectrum with multiple integer factors, the harmonic continuity in the HF is ensured intrinsically. In particular, no unwanted roughness sensation due to beating effects can emerge at the border between low frequency and high frequency and between different high frequency parts (for example, see NPL 1).
The second modification (aforementioned (2)) makes the refined HF spectrum more adaptive to the signal fluctuations in the replicated frequency bands.
As the new patching preserves the harmonic relation, it is called harmonic bandwidth extension (HBE). The advantages of the prior-art HBE over standard SBR have also been experimentally confirmed for low bit rate audio coding (for example, see NPL 1).
Note that the above two modifications only affect the HF spectrum generator (109); the remaining processes in HBE are identical to those in SBR.
FIG. 2 is a diagram showing the HF spectrum generator in the prior-art HBE. It should be noted that the HF spectrum generator includes a T-F transform 108 and an HF reconstruction 109. Given the LF part of a signal, suppose its HF spectrum is composed of (T−1) HF harmonic patches (each patching process produces one HF patch), from the 2nd order (the HF patch with the lowest frequency) to the T-th order (the HF patch with the highest frequency). In the prior-art HBE, all these HF patches are generated independently and in parallel by phase vocoders.
As shown in FIG. 2, (T−1) phase vocoders (201˜203) with different stretching factors (from 2 to T) are employed to stretch the input LF part. The stretched outputs, which have different lengths, are bandpass filtered (204˜206) and resampled (207˜209) to generate HF patches by converting time dilation into frequency extension. By setting each stretching factor to twice the corresponding resampling factor, the HF patches maintain the harmonic structure of the signal and have double the length of the LF part. Then all HF patches are delay aligned (210˜212) to compensate for the potentially different delays contributed by the resampling operation. In the last step, all delay-aligned HF patches are summed and transformed (213) into the QMF domain to produce the HF spectrum.
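The length relationship stated above can be checked with a short arithmetic sketch in Python. For a patch of a given order, it takes the stretching factor equal to the order and the resampling factor equal to half of it, then computes the patch length. The names patch_plan, patched_frequency and LF_LEN are hypothetical. The frequency mapping additionally assumes that the patches run at twice the LF sampling rate, which the text implies by the doubled length but does not state explicitly.

```python
from fractions import Fraction


def patch_plan(lf_len, order):
    """Return (stretch factor, resampling factor, patch length) for one HF patch."""
    stretch = order                        # phase vocoder stretching factor
    resample = Fraction(order, 2)          # stretching factor = 2 x resampling factor
    stretched_len = lf_len * stretch       # length after time stretching
    patch_len = stretched_len / resample   # length after resampling
    return stretch, resample, int(patch_len)


def patched_frequency(f_hz, order):
    """Frequency of an LF tone after patching.

    Assumption: the patch is played at twice the LF sampling rate.
    """
    resample = Fraction(order, 2)
    return float(f_hz * resample * 2)


LF_LEN = 1024  # hypothetical LF frame length in samples
plans = {order: patch_plan(LF_LEN, order) for order in (2, 3, 4)}
```

Under these assumptions every patch has length 2 × LF_LEN regardless of its order, and a tone at f in the LF part lands at order × f in the patch, which is the harmonic relation the text refers to.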
The above HF spectrum generator requires a large amount of computation. This computation mainly comes from the time stretching operation, which is realized by the series of Short Time Fourier Transforms (STFT) and Inverse Short Time Fourier Transforms (ISTFT) adopted in the phase vocoders, and from the succeeding QMF operation, which is applied to the time-stretched HF part.
A general introduction to the phase vocoder and the QMF transform is given below.
A phase vocoder is a well-known technique that uses frequency-domain transformations to implement a time-stretching effect, that is, to modify a signal's temporal evolution while keeping its local spectral characteristics unchanged. Its basic principle is described below.
FIG. 3A and FIG. 3B are diagrams showing the basic principle of time stretching performed by the phase vocoder.
The audio is divided into overlapping blocks, and these blocks are respaced so that the hop size (the time interval between successive blocks) is not the same at the input and at the output, as illustrated in FIG. 3A. Here, the input hop size Ra is smaller than the output hop size Rs. As a result, the original signal is stretched with a rate r shown in (Equation 1) below.
[Math 1]
$$r = \frac{R_a}{R_s} \qquad \text{(Equation 1)}$$
As shown in FIG. 3B, the respaced blocks are overlapped in a coherent pattern, which requires a frequency domain transformation. Typically, the input blocks are transformed into the frequency domain and, after a proper modification of the phases, the new blocks are transformed back into output blocks.
Following the above principle, most classic phase vocoders adopt short time Fourier transform (STFT) as the frequency domain transform, and involve an explicit sequence of analysis, modification and resynthesis for time stretching.
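The analysis, modification and resynthesis sequence can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the codec's implementation: it uses a naive DFT in place of an optimized STFT, a periodic Hann window for both analysis and synthesis, and a standard phase-propagation rule. The hop sizes ra and rs correspond to Ra and Rs in the text; all function names, the block size and the test tone are hypothetical.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spec):
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def princarg(p):
    """Wrap a phase value to the interval [-pi, pi)."""
    return (p + math.pi) % (2 * math.pi) - math.pi


def time_stretch(x, block, ra, rs):
    """Respace overlapping blocks from input hop ra to output hop rs."""
    if len(x) < block:
        raise ValueError("input is shorter than one block")
    win = [0.5 - 0.5 * math.cos(2 * math.pi * i / block) for i in range(block)]
    n_frames = (len(x) - block) // ra + 1
    out = [0.0] * ((n_frames - 1) * rs + block)
    norm = [0.0] * len(out)
    prev_phase = acc_phase = None
    for m in range(n_frames):
        # Analysis: window one input block and transform it.
        spec = dft([x[m * ra + i] * win[i] for i in range(block)])
        phase = [cmath.phase(c) for c in spec]
        if m == 0:
            acc_phase = phase[:]
        else:
            # Modification: estimate each bin's instantaneous frequency from
            # the phase advance over ra samples, then advance it over rs samples.
            for k in range(block):
                omega = 2 * math.pi * k / block
                dev = princarg(phase[k] - prev_phase[k] - omega * ra)
                acc_phase[k] += (omega + dev / ra) * rs
        prev_phase = phase
        # Resynthesis: keep the magnitudes, use the new phases, overlap-add.
        y = idft_real([cmath.rect(abs(spec[k]), acc_phase[k]) for k in range(block)])
        for i in range(block):
            out[m * rs + i] += y[i] * win[i]
            norm[m * rs + i] += win[i] ** 2
    # Divide by the summed squared window to undo the overlap-add gain.
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, norm)]


# Hypothetical test tone with a period of 8 samples.
tone = [math.cos(2 * math.pi * n / 8) for n in range(256)]
stretched = time_stretch(tone, block=32, ra=4, rs=8)
```

With rs twice ra, the output is close to twice as long as the input while the tone keeps its period of 8 samples, so the temporal evolution changes and the local spectrum does not.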
QMF banks transform time domain representations into joint time-frequency domain representations (and vice versa). They are typically used in parametric coding schemes such as spectral band replication (SBR), parametric stereo coding (PS), and spatial audio coding (SAC). A characteristic of these filter banks is that the complex-valued frequency (subband) domain signals are effectively oversampled by a factor of two. This enables post-processing operations on the subband domain signals without introducing aliasing distortion.
In more detail, given a real-valued discrete time signal x(n), the analysis QMF bank yields the complex-valued subband domain signals s_k(n) through (Equation 2) below.
[Math 2]
$$s_k(n) = \sum_{l=0}^{L-1} x(M \cdot n - l)\, p(l)\, e^{\,j \frac{\pi}{M} (k + 0.5)(l + \alpha)} \qquad \text{(Equation 2)}$$
In (Equation 2), p(n) represents a low-pass prototype filter impulse response of order L−1, α represents a phase parameter, M represents the number of bands, and k represents the subband index (k=0, 1, . . . , M−1).
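(Equation 2) can be evaluated directly, as in the following Python sketch. The prototype filter here is a hypothetical Hann-windowed sinc with cutoff π/(2M), standing in for whatever prototype a real codec specifies. The values M=8, L=80, α=0 and the test tone are likewise assumptions chosen only to keep the example small, and input samples before the start of the signal are taken as zero.

```python
import cmath
import math


def prototype_lowpass(length, num_bands):
    """Hypothetical prototype p(l): Hann-windowed sinc with cutoff pi/(2M)."""
    fc = 1.0 / (4 * num_bands)          # cutoff in cycles per sample
    center = (length - 1) / 2.0
    taps = []
    for l in range(length):
        t = l - center
        sinc = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        hann = 0.5 - 0.5 * math.cos(2 * math.pi * (l + 0.5) / length)
        taps.append(sinc * hann)
    return taps


def qmf_analysis(x, p, num_bands, alpha=0.0):
    """Direct evaluation of (Equation 2); returns s[n][k] per time slot n and subband k."""
    M, L = num_bands, len(p)
    slots = []
    for n in range(len(x) // M):
        slot = []
        for k in range(M):
            acc = 0j
            for l in range(L):
                idx = M * n - l
                sample = x[idx] if 0 <= idx < len(x) else 0.0  # zero before the signal start
                acc += sample * p[l] * cmath.exp(1j * math.pi / M * (k + 0.5) * (l + alpha))
            slot.append(acc)
        slots.append(slot)
    return slots


M, L, K0 = 8, 80, 2
p = prototype_lowpass(L, M)
# Test tone placed at the center frequency of subband K0.
x = [math.cos(math.pi / M * (K0 + 0.5) * n) for n in range(256)]
s = qmf_analysis(x, p, M)
# Energy per subband, skipping the first L/M slots where the filter is still filling.
energy = [sum(abs(s[n][k]) ** 2 for n in range(L // M, len(s))) for k in range(M)]
```

In this setup 256 input samples give 32 time slots of 8 complex subband values each, and the energy of the tone concentrates in subband K0.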
Note that, like the STFT, the QMF transform is also a joint time-frequency transform. That is, it provides both the frequency content of a signal and the change in that frequency content over time, where the frequency content is represented by frequency subbands and the timeline is represented by time slots.
FIG. 4 is a diagram showing the QMF analysis and synthesis scheme.
In detail, as illustrated in FIG. 4, a given real audio input is divided into successive overlapping blocks of length L with a hop size of M (FIG. 4 (a)), and the QMF analysis process transforms each block into one time slot composed of M complex subband signals. In this way, L time domain input samples are transformed into L complex QMF coefficients, arranged as L/M time slots of M subbands each (FIG. 4 (b)). Each time slot, combined with the previous (L/M−1) time slots, is synthesized by the QMF synthesis process to reconstruct M real time domain samples (FIG. 4 (c)) with near-perfect reconstruction.