The present invention relates to audio signal coding, and, in particular, to an apparatus for encoding a speech signal employing ACELP in the autocorrelation domain.
In speech coding by Code-Excited Linear Prediction (CELP), the spectral envelope (or equivalently, short-time time-structure) of the speech signal is described by a linear predictive (LP) model and the prediction residual is modelled by a long-time predictor (LTP, also known as the adaptive codebook) and a residual signal represented by a codebook (also known as the fixed codebook). The latter, the fixed codebook, is generally applied as an algebraic codebook, where the codebook is represented by an algebraic formula or algorithm, whereby there is no need to store the whole codebook, but only the algorithm, while simultaneously allowing for a fast search algorithm. CELP codecs applying an algebraic codebook for the residual are known as Algebraic Code-Excited Linear Prediction (ACELP) codecs (see [1], [2], [3], 4]).
In speech coding, employing an algebraic residual codebook is the approach of choice in main stream codecs such as [17], [13], [18]. ACELP is based on modeling the spectral envelope by a linear predictive (LP) filter, the fundamental frequency of voiced sounds by a long time predictor (LTP) and the prediction residual by an algebraic codebook. The LTP and algebraic codebook parameters are optimized by a least squares algorithm in a perceptual domain, where the perceptual domain is specified by a filter.
The computationally most complex part of ACELP-type algorithms, the bottleneck, is optimization of the residual codebook. The only currently known optimal algorithm would be an exhaustive search of a size Np space for every sub-frame, where at every point, an evaluation of (N2) complexity may be performed. Since typical values are sub-frame length N=64 (i.e. 5 ms) with p=8 pulses, this implies more than 1020 operations per second. Clearly this is not a viable option. To stay within the complexity limits set by hardware requirements, codebook optimization approaches have to operate with non-optimal iterative algorithms. Many such algorithms and improvements to the optimization process have been presented in the past, for example [17], [19], [20], [21], [22].
Explicitly, the ACELP optimisation is based on describing the speech signal x(n) as the output of a linear predictive model such that the estimated speech signal is{circumflex over (x)}(n)=−Σk=1ma(k){circumflex over (x)}(n−k)+{circumflex over (e)}(k)  (1)where a(k) are the LP coefficients and ê(k) is the residual signal. In vector form, this equation can be expressed as{circumflex over (x)}=Hê  (2)where matrix H is defined as the lower triangular Toeplitz convolution matrix with diagonal h(0) and lower diagonals h(1), . . . , h(39) and the vector h(k) is the impulse response of the LP model. It should be noted that in this notation the perceptual model (which usually corresponds to a weighted LP model) is omitted, but it is assumed that the perceptual model is included in the impulse response h(k). This omission has no impact on the generality of results, but simplifies notation. The inclusion of the perceptual model is applied as in [1].
The fitness of the model is measured by the squared error. That is,ϵ2=Σk=1N(x(k)−{circumflex over (x)}(k))2=(e−ê)HHHH(e−ê).  (3)
This squared error is used to find the optimal model parameters. Here, it is assumed that the LTP and the pulse codebook are both used to model the vector e. The practical application can be found in the relevant publications (see [1-4]).
In practice, the above measure of fitness can be simplified as follows. Let the matrix B=HTH comprise the correlations of h(n), let ck be the k'th fixed codebook vector and set ê=g ck, where g is a gain factor. By assuming that g is chosen optimally, then the codebook is searched by maximizing the search criterion
                                          C            k            2                                E            k                          =                                                            (                                                      x                    T                                    ⁢                                                                          ⁢                                      Hc                    k                                                  )                            2                                                      c                k                T                            ⁢                              Bc                k                                              =                                                    (                                                      d                    T                                    ⁢                                      c                    k                                                  )                            2                                                      c                k                T                            ⁢                              Bc                k                                                                        (        4        )            where d=HTx is a vector comprising the correlation between the target vector and the impulse response h(n) and superscript T denotes transpose. The vector d and the matrix B are computed before the codebook search. This formula is commonly used in optimization of both the LTP and the pulse codebook.
Plenty of research has been invested in optimising the usage of the above formula. For example,    1) Only those elements of matrix B are calculated that are actually accessed by the search algorithm. Or:    2) The trial-and-error algorithm of the pulse search is reduced to trying only such codebook vectors which have a high probability of success, based on prior screening (see for example [1,5]).
A practical detail of the ACELP algorithm is related to the concept of zero impulse response (ZIR). The concept appears when considering the original domain synthesis signal in comparison to the synthesised residual. The residual is encoded in blocks corresponding to the frame or sub-frame size. However, when synthesising the original domain signal with the LP model of Equation 1, the fixed length residual will have an infinite length “tail”, corresponding to the impulse response of the LP filter. That is, although the residual codebook vector is of finite length, it will have an effect on the synthesis signal far beyond the current frame or sub-frame. The effect of a frame into the future can be calculated by extending the codebook vector with zeros and calculating the synthesis output of Equation 1 for this extended signal. This extension of the synthesised signal is known as the zero impulse response. Then, to take into account the effect of prior frames in encoding the current frame, the ZIR of the prior frame is subtracted from the target of the current frame. In encoding the current frame, thus, only that part of the signal is considered, which was not already modelled by the previous frame.
In practice, the ZIR is taken into account as follows: When a (sub)frame N−1 has been encoded, the quantized residual is extended with zeros to the length of the next (sub)frame N. The extended quantized residual is filtered by the LP to obtain the ZIR of the quantized signal. The ZIR of the quantized signal is then subtracted from the original (not quantized) signal and this modified signal forms the target signal when encoding (sub)frame N. This way, all quantization errors made in (sub)frame N−1 will be taken into account when quantizing (sub)frame N. This practice improves the perceptual quality of the output signal considerably.
However, it would be highly appreciated if further improved concepts for audio coding would be provided.