The present invention relates to audio processing and, in particular, to the processing of a decoded audio signal for the purpose of quality enhancement.
Recently, further developments regarding switched audio codecs have been achieved. A high quality and low bit rate switched audio codec is the unified speech and audio coding concept (USAC concept). There is a common pre/post-processing consisting of an MPEG surround (MPEGS) functional unit to handle stereo or multichannel processing and an enhanced SBR (eSBR) unit which handles the parametric representation of the higher audio frequencies in the input signal. Subsequently, there are two branches, one consisting of an advanced audio coding (AAC) tool path and the other consisting of a linear prediction coding (LP or LPC domain) based path which, in turn, features either a frequency domain representation or a time domain representation of the LPC residual. All transmitted spectra for both AAC and LPC are represented in the MDCT domain following quantization and arithmetic coding. The time domain representation uses an ACELP excitation coding scheme. Block diagrams of the encoder and the decoder are given in FIG. 1.1 and FIG. 1.2 of ISO/IEC CD 23003-3.
An additional example of a switched audio codec is the extended adaptive multi-rate wideband (AMR-WB+) codec as described in 3GPP TS 26.290 V10.0.0 (2011-3). The AMR-WB+ audio codec processes input frames of 2048 samples at an internal sampling frequency Fs. The internal sampling frequencies are limited to the range of 12800 to 38400 Hz. The 2048-sample frames are split into two critically sampled equal frequency bands. This results in two super frames of 1024 samples corresponding to the low frequency (LF) and high frequency (HF) bands. Each super frame is divided into four 256-sample frames. Sampling at the internal sampling rate is obtained by using a variable sampling conversion scheme which re-samples the input signal. The LF and HF signals are then encoded using two different approaches: the LF signal is encoded and decoded using a “core” encoder/decoder based on switched ACELP and transform coded excitation (TCX). In the ACELP mode, the standard AMR-WB codec is used. The HF signal is encoded with relatively few bits (16 bits per frame) using a bandwidth extension (BWE) method. The AMR-WB coder includes a pre-processing functionality, an LPC analysis, an open loop search functionality, an adaptive codebook search functionality, an innovative codebook search functionality and a memory update. The ACELP decoder comprises several functionalities such as decoding the adaptive codebook, decoding the gains, decoding the innovative codebook, decoding the ISP, a long term prediction filter (LTP filter), the construct excitation functionality, an interpolation of the ISP for four sub-frames, a post-processing, a synthesis filter, a de-emphasis and an up-sampling block in order to finally obtain the lower band portion of the speech output. The higher band portion of the speech output is generated by gain scaling using an HB gain index, a VAD flag, and a 16 kHz random excitation. Furthermore, an HB synthesis filter is used, followed by a band pass filter. More details are in FIG. 3 of G.722.2.
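The framing arithmetic described above can be sketched as follows. This is a minimal illustration only: the function name and dictionary keys are hypothetical, and the HF bit budget assumes that "16 bits per frame" refers to each 256-sample frame, which the text does not state explicitly.

```python
FRAME_SAMPLES = 2048          # input frame length at the internal rate Fs
FS_MIN_HZ, FS_MAX_HZ = 12800, 38400
HF_BITS_PER_FRAME = 16        # assumption: counted per 256-sample frame


def superframe_layout(fs_hz):
    """Return the band split and timing for one 2048-sample input frame."""
    if not FS_MIN_HZ <= fs_hz <= FS_MAX_HZ:
        raise ValueError("internal sampling frequency out of range")
    band_samples = FRAME_SAMPLES // 2      # LF and HF super frames
    subframe_samples = band_samples // 4   # four frames per super frame
    frame_ms = 1000.0 * FRAME_SAMPLES / fs_hz
    hf_bits = 4 * HF_BITS_PER_FRAME        # HF bits per super frame (assumed)
    return {
        "band_samples": band_samples,
        "subframe_samples": subframe_samples,
        "frame_ms": frame_ms,
        "hf_bits": hf_bits,
        "hf_bitrate_bps": hf_bits / (frame_ms / 1000.0),
    }


layout = superframe_layout(25600)
print(layout)
```

With an internal sampling frequency of 25600 Hz, chosen here only as an example value inside the permitted range, one input frame spans 80 ms.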
This scheme has been enhanced in AMR-WB+ by performing a post-processing of the mono low-band signal. Reference is made to FIGS. 7, 8 and 9 illustrating the functionality in AMR-WB+. FIG. 7 illustrates a pitch enhancer 700, a low pass filter 702, a high pass filter 704, a pitch tracking stage 706 and an adder 708. The blocks are connected as illustrated in FIG. 7 and are fed by the decoded signal.
In the low-frequency pitch enhancement, a two-band decomposition is used and adaptive filtering is applied only to the lower band. This results in a total post-processing that is mostly targeted at frequencies near the first harmonics of the synthesized speech signal. FIG. 7 shows the block diagram of the two-band pitch enhancer. In the higher branch, the decoded signal is filtered by the high pass filter 704 to produce the higher band signal sH. In the lower branch, the decoded signal is first processed through the adaptive pitch enhancer 700 and then filtered through the low pass filter 702 to obtain the lower band post-processed signal sLEF. The post-processed decoded signal is obtained by adding the lower band post-processed signal and the higher band signal. The object of the pitch enhancer is to reduce the inter-harmonic noise in the decoded signal, which is achieved by a time-varying linear filter with a transfer function HE indicated in the first line of FIG. 9 and described by the equation in the second line of FIG. 9. α is a coefficient that controls the inter-harmonic attenuation. T is the pitch period of the input signal ŝ(n) and sLE(n) is the output signal of the pitch enhancer. The parameters T and α vary with time and are given by the pitch tracking module 706. With a value of α=1, the gain of the filter described by the equation in the second line of FIG. 9 is exactly zero at the frequencies 1/(2T), 3/(2T), 5/(2T), etc., i.e., at the mid-points between DC (0 Hz) and the harmonic frequencies 1/T, 2/T, 3/T, etc. When α approaches zero, the attenuation between the harmonics produced by the filter as defined in the second line of FIG. 9 decreases. When α is zero, the filter has no effect and is an all-pass. To confine the post-processing to the low frequency region, the enhanced signal sLE is low pass filtered to produce the signal sLEF, which is added to the high pass filtered signal sH to obtain the post-processed synthesis signal sE.
Another configuration equivalent to the illustration in FIG. 7 is illustrated in FIG. 8, and the configuration in FIG. 8 eliminates the need for high pass filtering. This is explained with respect to the third equation for sE in FIG. 9. Here, hLP(n) is the impulse response of the low pass filter and hHP(n) is the impulse response of the complementary high pass filter. Then, the post-processed signal sE(n) is given by the third equation in FIG. 9. Thus, the post-processing is equivalent to subtracting the scaled, low pass filtered long-term error signal α·eLT(n) from the synthesis signal ŝ(n). The transfer function of the long-term prediction filter is given as indicated in the last line of FIG. 9. This alternative post-processing configuration is illustrated in FIG. 8. The value T is given by the received closed-loop pitch lag in each subframe (the fractional pitch lag rounded to the nearest integer). A simple tracking for checking pitch doubling is performed: if the normalized pitch correlation at delay T/2 is larger than 0.95, then the value T/2 is used as the new pitch lag for the post-processing. The factor α is given by α=0.5 gp, constrained to α greater than or equal to zero and lower than or equal to 0.5, where gp is the decoded pitch gain bounded between 0 and 1. In TCX mode, the value of α is set to zero. A linear phase FIR low pass filter with 25 coefficients is used with a cut-off frequency of about 500 Hz (the filter delay is 12 samples). The upper branch needs to introduce a delay corresponding to the delay of the processing in the lower branch in order to keep the signals in the two branches time aligned before performing the subtraction. In AMR-WB+, Fs is equal to twice the sampling rate of the core. The core sampling rate is equal to 12800 Hz, so the cut-off frequency is equal to 500 Hz.
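The equivalence of the two configurations can be checked numerically. The following Python sketch is illustrative only. The Hamming-windowed sinc design of the 25-tap low pass filter, the function names, the test signal and the enhancer used as a stand-in are assumptions that do not come from the specification; the actual enhancer is the one defined in FIG. 9. The identity itself relies only on the high pass filter being the complement of the low pass filter, i.e., a unit impulse delayed by 12 samples minus hLP(n).

```python
import math


def lowpass_fir(num_taps=25, cutoff_hz=500.0, fs_hz=12800.0):
    """Hamming-windowed sinc low pass (assumed design, unity gain at DC)."""
    mid = (num_taps - 1) // 2
    fc = cutoff_hz / fs_hz  # cut-off in cycles per sample
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    dc_gain = sum(taps)
    return [t / dc_gain for t in taps]


def convolve(x, h):
    """Full linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def standin_enhancer(s, T, alpha):
    """Hypothetical stand-in: s(n) minus alpha times an assumed long-term error."""
    def at(n):
        return s[n] if 0 <= n < len(s) else 0.0
    return [at(n) - alpha * (at(n) - 0.5 * at(n - T) - 0.5 * at(n + T))
            for n in range(len(s))]


def fig7_two_band(s_hat, s_le, h_lp):
    """Low pass the enhanced signal, high pass the decoded signal, then add."""
    mid = (len(h_lp) - 1) // 2
    h_hp = [-c for c in h_lp]
    h_hp[mid] += 1.0  # complementary high pass: delayed impulse minus low pass
    low = convolve(s_le, h_lp)
    high = convolve(s_hat, h_hp)
    return [a + b for a, b in zip(low, high)]


def fig8_subtractive(s_hat, s_le, h_lp):
    """Delay the decoded signal and subtract the low-passed scaled error."""
    mid = (len(h_lp) - 1) // 2
    scaled_error = [a - b for a, b in zip(s_hat, s_le)]  # alpha * e_LT(n)
    low_error = convolve(scaled_error, h_lp)
    delayed = [0.0] * mid + list(s_hat) + [0.0] * mid    # upper-branch delay
    return [d - e for d, e in zip(delayed, low_error)]


T = 40
s_hat = [math.sin(2.0 * math.pi * n / T) + 0.3 * math.sin(2.0 * math.pi * 0.37 * n)
         for n in range(200)]
s_le = standin_enhancer(s_hat, T, alpha=0.4)
h_lp = lowpass_fir()
out7 = fig7_two_band(s_hat, s_le, h_lp)
out8 = fig8_subtractive(s_hat, s_le, h_lp)
max_diff = max(abs(a - b) for a, b in zip(out7, out8))
print("largest difference between the two structures:", max_diff)
```

The 12 zeros prepended in `fig8_subtractive` correspond to the delay that the upper branch has to introduce so that both branches are time aligned before the subtraction.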
It has been found that, particularly for low delay applications, the filter delay of 12 samples introduced by the linear phase FIR low pass filter contributes to the overall delay of the encoding/decoding scheme. There are other sources of systematic delays at other places in the encoding/decoding chain, and the FIR filter delay accumulates with the other sources.
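The size of this delay contribution follows from the filter length. The sketch below is a simple worked calculation under the assumption that the filter operates at the core sampling rate of 12800 Hz; the function names are hypothetical. A linear phase FIR filter with an odd number N of coefficients delays the signal by (N-1)/2 samples.

```python
def fir_delay_samples(num_taps):
    """Delay of an odd-length linear phase FIR filter, in samples."""
    if num_taps < 1 or num_taps % 2 == 0:
        raise ValueError("expected an odd, positive number of coefficients")
    return (num_taps - 1) // 2


def samples_to_ms(samples, fs_hz):
    """Convert a delay in samples to milliseconds at sampling rate fs_hz."""
    return 1000.0 * samples / fs_hz


delay_samples = fir_delay_samples(25)               # 25 coefficients
delay_ms = samples_to_ms(delay_samples, 12800.0)    # assumed core rate
print(delay_samples, "samples =", delay_ms, "ms")
```

Under these assumptions the low pass filter alone accounts for slightly less than one millisecond, which adds to the other systematic delays of the chain.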