Speech codecs generally encode only a monaural representation of speech. Such monaural codecs are typically used in communication equipment (such as mobile telephones and teleconference equipment) where the signal comes from a single source such as a human voice. Monaural coding was previously considered sufficient because transmission bandwidth and the processing speed of digital signal processors (DSPs) were limited. As advances in technology have increased the available bandwidth, however, speech quality has become an important factor requiring further consideration, and the shortcomings of monaural speech have become apparent. One such shortcoming is the failure to provide spatial information (such as sound imaging and caller location). An example of an application in which identifying the caller's location is useful is high-quality multi-speaker teleconference equipment that identifies the location of each caller when multiple callers are present simultaneously. Spatial information is conveyed by representing speech as multichannel signals. In addition, speech is preferably coded at as low a bit rate as possible.
In contrast to speech coding, audio coding generally uses multichannel coding. Multichannel audio coding sometimes exploits the cross-correlation redundancy between channels. For example, for stereo (that is, two-channel) audio signals, cross-correlation redundancy is exploited through joint stereo coding. Joint stereo refers to stereo technology that combines middle-side (MS) stereo mode and intensity (I) stereo mode. Using these modes in combination achieves a better compression ratio and reduces the coding bit rate.
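As a rough illustration of the middle-side mode mentioned above, the following sketch converts a left/right sample pair into a middle (sum) signal and a side (difference) signal and back again. The function names and the 1/2 scaling are assumptions chosen for illustration; actual codecs may use a different scaling and typically apply the transform to spectral coefficients rather than to time-domain samples.

```python
def ms_encode(left, right):
    """Convert left/right samples to middle/side signals.

    Assumed convention: M = (L + R) / 2, S = (L - R) / 2.
    """
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Invert ms_encode: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


if __name__ == "__main__":
    # Hypothetical sample values, chosen only to show the round trip.
    left = [0.5, 0.25, -0.75]
    right = [0.5, 0.0, -0.25]
    mid, side = ms_encode(left, right)
    print(mid, side)
    print(ms_decode(mid, side))
```

When the two channels are strongly correlated, the side signal is small compared with the middle signal, which is where the saving in bits comes from.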
However, with MS stereo, aliasing distortion readily occurs when coding is performed at a low bit rate, and the stereo imaging of the signal is affected as well. In addition, while I stereo is useful in high frequency bands, where the frequency resolution of the human auditory system decreases, it is not always useful in low frequency bands. Speech codecs are generally regarded as parametric coders that model the human vocal tract with parameters using a form of linear prediction, which makes joint stereo coding unsuitable for speech codecs.
On the other hand, multichannel coding has not been studied as thoroughly for speech coding as for audio coding. An example of a conventional apparatus that encodes multichannel signals in speech coding is the apparatus described in Patent Document 1. The basic concept of the technology disclosed in this document is to represent speech signals using parameters. More specifically, the band in use is divided into multiple frequency bands (called sub-bands), and parameters are calculated for each sub-band. An example of a calculated parameter is the interchannel level difference, i.e., the power ratio between the left (L) channel and the right (R) channel. The interchannel level difference is used to correct the spectral coefficients on the decoding side.
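The per-sub-band interchannel level difference described above can be sketched as follows. This is a minimal illustration, not the method of Patent Document 1: the function names, the representation of each channel as a list of spectral coefficients, the band_edges layout, and the small eps guard against division by zero are all assumptions introduced here.

```python
import math


def interchannel_level_difference(left_spec, right_spec, band_edges, eps=1e-12):
    """Return the L/R power ratio for each sub-band.

    left_spec, right_spec: spectral coefficients of the two channels.
    band_edges: hypothetical index boundaries; sub-band k covers
                coefficients band_edges[k] up to band_edges[k + 1] - 1.
    eps: assumed guard value that avoids division by zero in silent bands.
    """
    ratios = []
    for start, stop in zip(band_edges[:-1], band_edges[1:]):
        p_left = sum(abs(c) ** 2 for c in left_spec[start:stop])
        p_right = sum(abs(c) ** 2 for c in right_spec[start:stop])
        ratios.append((p_left + eps) / (p_right + eps))
    return ratios


def to_decibels(ratio):
    """Express a power ratio in decibels."""
    return 10.0 * math.log10(ratio)


if __name__ == "__main__":
    # Hypothetical coefficients: two sub-bands of two coefficients each.
    left_spec = [2.0, 2.0, 1.0, 1.0]
    right_spec = [1.0, 1.0, 1.0, 1.0]
    ild = interchannel_level_difference(left_spec, right_spec, [0, 2, 4])
    print([round(to_decibels(r), 2) for r in ild])
```

A ratio greater than 1 (positive in decibels) indicates that the left channel carries more power in that sub-band, which is the information a decoder would need to rescale the coefficients of each channel.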
Patent Document 1: International Publication No. 03/090208 (Pamphlet)