The present invention relates to an encoder for encoding an audio signal, an audio transmission system, a method for determining correction values and a computer program. The invention further relates to immittance spectral frequency/line spectral frequency weighting.
In today's speech and audio codecs it is state of the art to extract the spectral envelope of the speech or audio signal by Linear Prediction and then to quantize and code a transformation of the Linear Prediction coefficients (LPC). Examples of such transformations are the Line Spectral Frequencies (LSF) and the Immittance Spectral Frequencies (ISF).
Vector Quantization (VQ) is usually preferred over scalar quantization for LPC quantization because of its better performance. However, it has been observed that optimal LPC coding shows a different scalar sensitivity for each frequency of the vector of LSFs or ISFs. As a direct consequence, using a classical Euclidean distance as the metric in the quantization step leads to a suboptimal system. This can be explained by the fact that the performance of an LPC quantization is usually measured by a distance such as the Logarithmic Spectral Distance (LSD) or the Weighted Logarithmic Spectral Distance (WLSD), neither of which is directly proportional to the Euclidean distance.
LSD is defined as the logarithm of the Euclidean distance between the spectral envelope of the original LPC coefficients and that of their quantized version. WLSD is a weighted version which takes into account that low frequencies are perceptually more relevant than high frequencies.
Both LSD and WLSD are too complex to be computed within an LPC quantization scheme. Therefore most LPC coding schemes use either the simple Euclidean distance or a weighted version of it (WED), defined as:
$$\mathrm{WED} = \sum_{i} w_i \cdot \left(\mathrm{lsf}_i - \mathrm{qlsf}_i\right)^2,$$
where lsf_i is the parameter to be quantized and qlsf_i is the quantized parameter. The weights w_i penalize distortion more strongly for certain coefficients and less strongly for others.
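The role of the weights in a codebook search can be illustrated with a minimal Python sketch. The function names `weighted_euclidean_distance` and `nearest_codeword`, the toy codebook and the weight values are assumptions made only for illustration; they are not taken from any of the cited schemes.

```python
def weighted_euclidean_distance(lsf, qlsf, w):
    """WED = sum_i w_i * (lsf_i - qlsf_i)^2 for equal-length sequences."""
    if not (len(lsf) == len(qlsf) == len(w)):
        raise ValueError("lsf, qlsf and w must have the same length")
    return sum(wi * (x - q) ** 2 for wi, x, q in zip(w, lsf, qlsf))


def nearest_codeword(lsf, codebook, w):
    """Return the index of the codebook entry with the smallest WED (hypothetical helper)."""
    return min(
        range(len(codebook)),
        key=lambda k: weighted_euclidean_distance(lsf, codebook[k], w),
    )


# Toy example: two candidate code vectors for a 2-dimensional LSF vector.
lsf = [0.3, 1.0]
codebook = [[0.2, 1.3], [0.5, 1.0]]

# With uniform weights the WED reduces to the squared Euclidean distance.
uniform_choice = nearest_codeword(lsf, codebook, [1.0, 1.0])

# Emphasizing the first coefficient changes which code vector is selected.
weighted_choice = nearest_codeword(lsf, codebook, [10.0, 0.1])
```

In this toy case the uniform weights select the second code vector, whereas the weights that emphasize the first coefficient select the first one, which shows how the choice of w_i steers the quantizer.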
Laroia et al. [1] presented a heuristic approach, known as the inverse harmonic mean, to compute weights that give more importance to LSFs close to formant regions. If two LSF parameters are close together, the signal spectrum is expected to comprise a peak near that frequency. Hence an LSF that is close to one of its neighbors has a high scalar sensitivity and should be given a higher weight:
$$w_i = \frac{1}{\mathrm{lsf}_i - \mathrm{lsf}_{i-1}} + \frac{1}{\mathrm{lsf}_{i+1} - \mathrm{lsf}_i}$$
The first and the last weighting coefficients are calculated with the pseudo LSFs lsf_0 = 0 and lsf_{p+1} = π, where p is the order of the LP model. The order is usually 10 for speech signals sampled at 8 kHz and 16 for speech signals sampled at 16 kHz.
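The inverse harmonic mean weighting described above can be sketched in a few lines of Python. The function name `ihm_weights` and the example values are hypothetical; the sketch assumes the LSFs are given in radians, strictly increasing and inside the open interval (0, π).

```python
import math


def ihm_weights(lsf):
    """Inverse harmonic mean weights: w_i = 1/(lsf_i - lsf_{i-1}) + 1/(lsf_{i+1} - lsf_i)."""
    # Pad with the pseudo LSFs lsf_0 = 0 and lsf_{p+1} = pi.
    padded = [0.0] + list(lsf) + [math.pi]
    for lower, upper in zip(padded, padded[1:]):
        if upper <= lower:
            raise ValueError("LSFs must be strictly increasing within (0, pi)")
    return [
        1.0 / (padded[i] - padded[i - 1]) + 1.0 / (padded[i + 1] - padded[i])
        for i in range(1, len(padded) - 1)
    ]


# Toy example with p = 3: the first two LSFs are close together,
# so they receive larger weights than the isolated third one.
example_weights = ihm_weights([0.5, 0.6, 2.0])
```

The computation needs only two divisions per coefficient, which is consistent with the low complexity attributed to this approach.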
Gardner and Rao [2] derived the individual scalar sensitivity for LSFs from a high-rate approximation (e.g. when using a VQ with 30 or more bits). In such a case the derived weights are optimal and minimize the LSD. The scalar weights form the diagonal of a so-called sensitivity matrix given by:
$$D_\omega(\omega) = 4\beta\, J_\omega^{T}(\omega)\, R_A\, J_\omega(\omega),$$
where R_A is the autocorrelation matrix of the impulse response of the synthesis filter 1/A(z) derived from the original predictive coefficients of the LPC analysis, and J_ω(ω) is a Jacobian matrix transforming LSFs to LPC coefficients.
The main drawback of this solution is the computational complexity of computing the sensitivity matrix.
The ITU recommendation G.718 [3] expands Gardner's approach by adding some psychoacoustic considerations. Instead of considering the matrix R_A, it considers the impulse response of a perceptually weighted synthesis filter W(z):
$$W(z) = \frac{W_B(z)}{A(z)},$$
where W_B(z) is an IIR filter approximating the Bark weighting filter, giving more importance to the low frequencies. The sensitivity matrix is then computed by replacing 1/A(z) with W(z).
Although the weighting used in G.718 is theoretically a near-optimal approach, it inherits a very high complexity from Gardner's approach. Today's audio codecs are standardized with a limit on complexity, and therefore the tradeoff between complexity and gain in perceptual quality is unsatisfactory with this approach.
The approach presented by Laroia et al. may yield suboptimal weights, but it is of low complexity. The weights generated with this approach treat the whole frequency range equally, although the sensitivity of the human ear is highly nonlinear: distortion in lower frequencies is much more audible than distortion in higher frequencies.
Thus, there is a need for improving encoding schemes.