Existing stereo, or in general multi-channel, coding techniques require a rather high bit-rate. Parametric stereo is often used at very low bit-rates. However, these techniques are designed for a wide class of generic audio material, i.e. music, speech and mixed content.
Very little has been done in multi-channel speech coding. Most work has focused on an inter-channel prediction (ICP) approach. ICP techniques exploit the fact that the left and right channels are correlated. Many different methods that reduce this redundancy in the stereo signal are described in the literature, e.g. in [1][2][3].
The ICP approach models the case of a single speaker quite well; however, it fails to model multiple speakers and diffuse sound sources (e.g. diffuse background noise). Encoding the ICP residual is therefore necessary in several cases, which places quite high demands on the required bit-rate.
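The contrast between the single-speaker case and the diffuse case can be illustrated with a minimal sketch. It assumes the simplest possible predictor, a single least-squares gain applied to the left channel, and synthetic test signals; it does not reproduce the methods of [1][2][3], and all function and variable names are hypothetical.

```python
import random


def energy(signal):
    """Sum of squared samples."""
    return sum(x * x for x in signal)


def icp_gain(left, right):
    """Least-squares gain g that minimizes sum((right - g * left)^2)."""
    left_energy = energy(left)
    if left_energy == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(left, right)) / left_energy


def icp_residual(left, right):
    """Part of the right channel that the left channel cannot predict."""
    g = icp_gain(left, right)
    return [y - g * x for x, y in zip(left, right)]


rng = random.Random(0)
left = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Assumption: a single source appears in both channels with a level difference.
single_source_right = [0.5 * x for x in left]
# Assumption: a diffuse source is modeled as noise independent of the left channel.
diffuse_right = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Fraction of the right-channel energy left over after prediction.
ratio_single = energy(icp_residual(left, single_source_right)) / energy(single_source_right)
ratio_diffuse = energy(icp_residual(left, diffuse_right)) / energy(diffuse_right)
```

Under these assumptions the residual nearly vanishes for the single source, while for the independent channel almost all of the energy remains in the residual and would still have to be encoded.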
Most existing speech codecs are monophonic and are based on the code-excited linear predictive (CELP) coding model. Examples include AMR-NB and AMR-WB (Adaptive Multi-Rate Narrow Band and Adaptive Multi-Rate Wide Band). In the CELP model, the excitation signal at the input of a short-term LP synthesis filter is constructed by adding two excitation vectors, one from an adaptive codebook and one from a fixed (innovative) codebook. The speech is synthesized by feeding the two properly chosen vectors from these codebooks through the short-term synthesis filter. The optimum excitation sequence in a codebook is chosen using an analysis-by-synthesis search procedure in which the error between the original and synthesized speech is minimized according to a perceptually weighted distortion measure.
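The analysis-by-synthesis principle can be sketched as follows. This is a simplified illustration rather than the procedure of any standardized codec: it assumes fixed gains, a zero initial filter state, and a plain squared error in place of the perceptually weighted measure. The names `lp_synthesis`, `build_excitation` and `search_fixed_codebook`, the filter coefficient and the toy codebook are all hypothetical.

```python
def lp_synthesis(excitation, lpc):
    """All-pole filter 1/A(z), with A(z) = 1 + sum(lpc[i] * z^-(i+1))."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for i, a in enumerate(lpc):
            if n - 1 - i >= 0:
                acc -= a * out[n - 1 - i]
        out.append(acc)
    return out


def build_excitation(adaptive_vec, fixed_vec, gain_a, gain_f):
    """Excitation = scaled adaptive vector + scaled fixed vector."""
    return [gain_a * a + gain_f * f for a, f in zip(adaptive_vec, fixed_vec)]


def search_fixed_codebook(target, adaptive_vec, codebook, lpc, gain_a, gain_f):
    """Synthesize every candidate and keep the one closest to the target."""
    best_index, best_error = -1, float("inf")
    for index, fixed_vec in enumerate(codebook):
        excitation = build_excitation(adaptive_vec, fixed_vec, gain_a, gain_f)
        synth = lp_synthesis(excitation, lpc)
        # Assumption: unweighted squared error instead of perceptual weighting.
        error = sum((t - s) ** 2 for t, s in zip(target, synth))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error


lpc = [-0.9]  # hypothetical first-order filter: A(z) = 1 - 0.9 z^-1
adaptive_vec = [0.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# Toy fixed codebook: one unit pulse at position 0, 2, 4 or 6.
codebook = [[1.0 if n == p else 0.0 for n in range(8)] for p in (0, 2, 4, 6)]

# Build a target from entry 2, then check that the search recovers it.
target = lp_synthesis(build_excitation(adaptive_vec, codebook[2], 1.0, 1.0), lpc)
best_index, best_error = search_fixed_codebook(
    target, adaptive_vec, codebook, lpc, 1.0, 1.0
)
```

The exhaustive loop makes the cost of the search explicit: every candidate vector is passed through the synthesis filter before it can be compared with the target.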
There are two types of fixed codebooks. The first type is the so-called stochastic codebook. Such a codebook often involves substantial physical storage. Given the index in a codebook, the excitation vector is obtained by conventional table lookup. The size of the codebook is therefore limited by the bit-rate and the complexity.
The second type is the algebraic codebook. In contrast to stochastic codebooks, algebraic codebooks are not random and require virtually no storage. An algebraic codebook is a set of indexed code vectors in which the amplitudes and positions of the pulses constituting the kth code vector are derived directly from the corresponding index k. The size of an algebraic codebook is therefore not limited by memory requirements. Additionally, algebraic codebooks are well suited to efficient search procedures.
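How a code vector can be derived directly from its index, without any stored table, is sketched below. The layout is purely hypothetical and is not taken from AMR-NB, AMR-WB or any other standard: a 16-sample subframe, two interleaved tracks, and one signed unit pulse per track, each described by 3 position bits and 1 sign bit.

```python
SUBFRAME = 16   # hypothetical subframe length in samples
TRACKS = 2      # hypothetical number of interleaved pulse tracks
POS_BITS = 3    # 2**3 = 8 candidate positions per track

FIELD_BITS = POS_BITS + 1              # position bits plus one sign bit
CODEBOOK_SIZE = 1 << (TRACKS * FIELD_BITS)


def decode_algebraic(index):
    """Derive the code vector for `index` with bit operations only."""
    vec = [0] * SUBFRAME
    for track in range(TRACKS):
        field = (index >> (track * FIELD_BITS)) & ((1 << FIELD_BITS) - 1)
        slot = field & ((1 << POS_BITS) - 1)   # which position on the track
        sign = -1 if field >> POS_BITS else 1  # top bit of the field
        # Track t owns positions t, t + TRACKS, t + 2 * TRACKS, ...
        vec[track + TRACKS * slot] = sign
    return vec


example = decode_algebraic(0b10110010)
```

With this layout the 8-bit index addresses 256 distinct vectors, yet nothing is stored beyond the three constants. In the example, the low field `0010` places a positive pulse at position 4 and the high field `1011` places a negative pulse at position 7.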
It is important to note that a substantial, and often the major, part of the bits available to the speech codec is allocated to encoding the fixed codebook excitation. For instance, in the AMR-WB standard, the share of bits allocated to the fixed codebook ranges from 36% up to 76%. Additionally, the fixed codebook excitation search accounts for most of the encoder complexity.
In [7], a multi-part fixed codebook is used, comprising an individual fixed codebook for each channel and a shared codebook common to all channels. This strategy makes a good representation of the inter-channel correlations possible. However, it comes at the expense of increased complexity and storage. Additionally, the bit rate required to encode the fixed codebook excitations is quite large, because the shared codebook index must be transmitted in addition to the codebook index of each channel. In [8] and [9], similar methods for encoding multi-channel signals are described in which the encoding mode depends on the degree of correlation between the channels. These techniques are already well known from Left/Right and Mid/Side encoding, where switching between the two encoding modes depends on a residual and thus on the correlation.
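The correlation-dependent switching between Left/Right and Mid/Side encoding can be illustrated with a small sketch. The decision rule used here, comparing the side-signal energy with a fraction of the mid-signal energy, is an assumption chosen for illustration and is not taken from [8] or [9]; the threshold value of 0.1 and all names are hypothetical.

```python
import random


def energy(signal):
    """Sum of squared samples."""
    return sum(x * x for x in signal)


def choose_mode(left, right, threshold=0.1):
    """Return the selected mode name and the two signals to encode."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    # Assumption: a small side signal (the residual of the mid signal)
    # indicates strongly correlated channels, so Mid/Side is selected.
    if energy(side) < threshold * energy(mid):
        return "M/S", mid, side
    return "L/R", left, right


rng = random.Random(1)
left = [rng.gauss(0.0, 1.0) for _ in range(500)]
correlated_right = [0.9 * x for x in left]
independent_right = [rng.gauss(0.0, 1.0) for _ in range(500)]

mode_correlated = choose_mode(left, correlated_right)[0]
mode_independent = choose_mode(left, independent_right)[0]
```

In the Mid/Side case the original channels remain recoverable as left = mid + side and right = mid - side, so the switch changes only which pair of signals is encoded.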
In [10], a method for encoding multi-channel signals is described which generalizes different elements of a single-channel linear predictive codec. The method has the disadvantage of requiring an enormous number of computations, rendering it unusable in real-time applications such as conversational applications. Another disadvantage of this technology is the number of bits needed to encode the various decorrelation filters used for encoding.
A further disadvantage of the solutions cited above is their incompatibility with existing standardized monophonic conversational codecs: no monophonic signal is separately encoded, which prevents a monophonic-only signal from being decoded directly.