As broadband transmission has become the norm in mobile communication and IP communication, and the services offered over them have diversified, speech communication with higher sound quality and higher fidelity is in demand. For example, demand is expected to grow for hands-free speech communication in video telephone services, speech communication in video conferencing, multi-point speech communication in which a number of callers at different locations hold a conversation simultaneously, and speech communication capable of transmitting background sound without losing fidelity. In these cases, it is preferable to implement speech communication using stereo speech, which offers higher fidelity than a monaural signal and makes it possible to recognize the positions from which the callers are talking. To implement speech communication using a stereo signal, stereo speech encoding is essential.
Further, to implement traffic control and multicast communication in speech data communication over an IP network, speech encoding employing a scalable configuration is preferred. A scalable configuration is a configuration in which speech data can be decoded at the receiving side even from only part of the encoded data. Coding processing in a speech coding scheme employing a scalable configuration is layered into a core layer and an enhancement layer. Consequently, the encoded data generated by this coding processing includes encoded data of the core layer and encoded data of the enhancement layer.
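As a minimal sketch of the layered idea, and not of any particular codec, the core layer below is assumed to be a coarse scalar quantizer and the enhancement layer a finer quantizer applied to the core layer's residual. The function names and step sizes are hypothetical and chosen only for illustration.

```python
def encode_scalable(samples, core_step=0.25, enh_step=0.03125):
    """Split samples into core-layer and enhancement-layer indices.

    Assumption for illustration: both layers are uniform scalar quantizers.
    """
    # Core layer: coarse quantization of the input.
    core = [round(x / core_step) for x in samples]
    # Enhancement layer: finer quantization of what the core layer missed.
    residual = [x - c * core_step for x, c in zip(samples, core)]
    enh = [round(r / enh_step) for r in residual]
    return core, enh


def decode_scalable(core, enh=None, core_step=0.25, enh_step=0.03125):
    """Decode from the core layer alone, or refine it if enh is available."""
    out = [c * core_step for c in core]
    if enh is not None:
        out = [y + e * enh_step for y, e in zip(out, enh)]
    return out


samples = [0.1, -0.4, 0.33, 0.9]
core, enh = encode_scalable(samples)
coarse = decode_scalable(core)        # only the core layer was received
fine = decode_scalable(core, enh)     # both layers were received
```

Decoding with the core layer alone still yields a usable, lower-precision signal, which is the property that allows a network node to drop the enhancement layer under load.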
Accordingly, when stereo speech is encoded and transmitted, it is also preferable to employ a monaural-stereo scalable configuration, in which the receiving side can select between decoding a stereo signal and decoding a monaural signal from part of the encoded data.
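One way to picture a monaural-stereo scalable configuration is sketched below. It assumes, purely for illustration, that the core layer carries the average of the two channels and the enhancement layer carries half their difference; the text does not specify the downmix, and a real codec would further compress both layers. All names are hypothetical.

```python
def encode_mono_stereo(left, right):
    """Build a two-layer representation of a stereo signal.

    Assumption for illustration: core = (L + R) / 2, enhancement = (L - R) / 2.
    """
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return {"core": mono, "enhancement": side}


def decode_mono_stereo(data, want_stereo=True):
    """Return (left, right) if stereo is wanted and available, else mono."""
    mono = data["core"]
    side = data.get("enhancement")
    if not want_stereo or side is None:
        # Monaural decoding uses only part of the encoded data.
        return mono
    left = [m + s for m, s in zip(mono, side)]
    right = [m - s for m, s in zip(mono, side)]
    return left, right


left = [0.5, 0.25, -1.0]
right = [0.25, 0.75, 0.5]
data = encode_mono_stereo(left, right)
mono_only = decode_mono_stereo({"core": data["core"]})  # enhancement layer lost
stereo = decode_mono_stereo(data)                        # both layers present
```

A receiver that obtains only the core layer falls back to monaural output, while a receiver with both layers recovers the two channels.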
Speech coding methods employing a monaural-stereo scalable configuration include, for example, predicting signals between channels (abbreviated as “ch” where appropriate), that is, predicting a second channel signal from a first channel signal or predicting the first channel signal from the second channel signal, using pitch prediction between channels. In other words, encoding is performed by utilizing the correlation between the two channels (see Non-Patent Document 1).
Non-Patent Document 1: Ramprashad, S. A., “Stereophonic CELP coding using cross channel prediction”, Proc. IEEE Workshop on Speech Coding, pp. 136-138, September 2000.
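The prediction between channels described above can be illustrated with a one-tap delay-and-gain predictor, in the style of a pitch predictor. This is a simplified sketch and not necessarily the method of Non-Patent Document 1: the exhaustive delay search, the least-squares gain, and all function names are assumptions made for illustration.

```python
import random


def shift(signal, delay):
    """Delay a signal by `delay` samples (negative advances it), zero-padded."""
    n = len(signal)
    return [signal[i - delay] if 0 <= i - delay < n else 0.0 for i in range(n)]


def predict_channel(src, target, max_delay=8):
    """Find the (delay, gain) that best predicts `target` from `src`.

    Returns (delay, gain, squared_error). With gain 0 the error equals the
    target energy, which is used as the starting point of the search.
    """
    best = (0, 0.0, sum(t * t for t in target))
    for delay in range(-max_delay, max_delay + 1):
        s = shift(src, delay)
        energy = sum(x * x for x in s)
        if energy == 0.0:
            continue
        # Least-squares optimal gain for this delay.
        gain = sum(t * x for t, x in zip(target, s)) / energy
        err = sum((t - gain * x) ** 2 for t, x in zip(target, s))
        if err < best[2]:
            best = (delay, gain, err)
    return best


# Synthetic example: the second channel is a delayed, attenuated first channel.
rng = random.Random(0)
ch1 = [rng.uniform(-1.0, 1.0) for _ in range(64)]
ch2 = [0.5 * x for x in shift(ch1, 3)]
delay, gain, err = predict_channel(ch1, ch2)
```

When the two channels are strongly correlated, as in this synthetic case, only the delay, the gain, and a small prediction residual need to be encoded for the second channel instead of the full waveform.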