In conventional speech communication systems, monaural speech signals are transmitted under the constraint of limited bandwidth. With the development of broadband communication networks, users' expectations for speech communication have moved from mere intelligibility toward naturalness, and a trend toward providing stereophonic speech has emerged. At this transitional point, where monophonic and stereophonic systems coexist, it is desirable to achieve stereophonic communication while maintaining downward compatibility with monophonic systems.
To achieve this goal, a stereophonic speech coding system can be built on a monophonic speech codec. The monophonic speech codec usually encodes a monaural signal generated by downmixing a stereophonic signal. In the stereo speech coding system, the stereophonic signal is recovered by applying additional processing to the monaural signal decoded in the decoder.
There are many related techniques that realize stereo coding while maintaining downward compatibility with a monophonic codec. FIGS. 9 and 10 show a coding apparatus and a decoding apparatus, respectively, in a general transform-coded excitation (TCX) codec. AMR-WB+ is a known codec employing an advanced modification of TCX (see Non-Patent Document 1).
In the coding apparatus shown in FIG. 9, first, adder 1 and multiplier 2 transform left signal L(n) and right signal R(n) of a stereo signal into monaural signal M(n), and subtractor 3 and multiplier 4 transform the left signal and the right signal into side signal S(n) (see equation 1).

$$
\begin{aligned}
M(n) &= (L(n) + R(n)) \cdot 0.5 \\
S(n) &= (L(n) - R(n)) \cdot 0.5
\end{aligned}
\qquad \text{(Equation 1)}
$$
Monaural signal M(n) is transformed into an excitation signal Me(n) by a linear prediction (LP) process. Linear prediction is very commonly used in speech coding to separate a speech signal into formant components (parameterized by linear prediction coefficients) and excitation components.
Further, monaural signal M(n) is subjected to LP analysis in LP analysis section 5 to generate linear prediction coefficients AM(z). Quantizer 6 quantizes and encodes linear prediction coefficients AM(z) to acquire coded information AqM. Dequantizer 7 then dequantizes coded information AqM to acquire linear prediction coefficients AdM(z). LP inverse filter 8 performs an LP inverse filtering process on monaural signal M(n) using linear prediction coefficients AdM(z) to acquire monophonic excitation signal Me(n).
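The LP analysis and inverse filtering steps can be illustrated with a minimal sketch. The text does not say how the coefficients are computed, so the autocorrelation method with the Levinson-Durbin recursion, the LP order of 4, the synthetic test signal, and all function names below are assumptions made for illustration. Quantization and dequantization of the coefficients are omitted, so the unquantized coefficients stand in for AdM(z).

```python
import random

def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite frame (zero outside the frame)."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for a[0..order], a[0] = 1, so that A(z) = sum_k a[k] z^-k."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                       # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                   # remaining prediction error
    return a

def lp_inverse_filter(x, a):
    """Excitation e(n) = sum_k a[k] x(n-k), with zero initial state."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]

def lp_synthesis(e, a):
    """Inverse of lp_inverse_filter: y(n) = e(n) - sum_{k>=1} a[k] y(n-k)."""
    y = []
    for n in range(len(e)):
        y.append(e[n] - sum(a[k] * y[n - k]
                            for k in range(1, len(a)) if n - k >= 0))
    return y

ORDER = 4                                    # assumed LP order
rng = random.Random(0)
signal = [0.0, 0.0]
for _ in range(198):                         # synthetic stand-in for M(n)
    signal.append(1.3 * signal[-1] - 0.6 * signal[-2] + rng.uniform(-1.0, 1.0))

coeffs = levinson_durbin(autocorrelation(signal, ORDER), ORDER)
excitation = lp_inverse_filter(signal, coeffs)   # plays the role of Me(n)
rebuilt = lp_synthesis(excitation, coeffs)       # LP synthesis undoes the filter
```

In this sketch the excitation carries less energy than the input frame, and passing it back through the synthesis filter returns the original samples up to rounding error.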
When coding is carried out at a low bit rate, excitation signal Me(n) is encoded using an excitation codebook (see Non-Patent Document 1). When coding is carried out at a high bit rate, T/F transformation section 9 time-to-frequency transforms time-domain monaural excitation signal Me(n) into frequency-domain Me(f). Either discrete Fourier transform (DFT) or modified discrete cosine transform (MDCT) can be employed for this purpose. In the case of MDCT, it is necessary to concatenate two signal frames. Quantizer 10 quantizes part of frequency-domain excitation signal Me(f), to form coded information Mqe. Quantizer 10 is able to further compress the amount of quantized coded information using a lossless coding method such as Huffman Coding.
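The remark that MDCT requires concatenating two signal frames can be made concrete with a small sketch. Each transform takes 2N samples (two adjacent frames) and yields N coefficients, and overlap-adding consecutive inverse transforms recovers the signal. The sine window, the frame length of 8, the 2/N scaling of the inverse, and the test signal are assumptions chosen for illustration, and quantization is left out.

```python
import math

def sine_window(frame_len):
    """Sine window over two concatenated frames (satisfies w[n]^2 + w[n+N]^2 = 1)."""
    total = 2 * frame_len
    return [math.sin(math.pi * (n + 0.5) / total) for n in range(total)]

def mdct(block):
    """MDCT: 2N input samples -> N coefficients."""
    n = len(block) // 2
    return [sum(block[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                for t in range(2 * n))
            for k in range(n)]

def imdct(coeffs):
    """Inverse MDCT: N coefficients -> 2N time-aliased samples."""
    n = len(coeffs)
    return [(2.0 / n) * sum(coeffs[k] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                            for k in range(n))
            for t in range(2 * n)]

N = 8                                            # assumed frame length
excitation = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) for t in range(4 * N)]
window = sine_window(N)
output = [0.0] * len(excitation)

# Hop by one frame; every block spans the current frame and the next one.
for start in range(0, len(excitation) - N, N):
    block = [excitation[start + t] * window[t] for t in range(2 * N)]
    coeffs = mdct(block)                         # N coefficients per 2N samples
    rebuilt = imdct(coeffs)
    for t in range(2 * N):
        output[start + t] += rebuilt[t] * window[t]   # windowed overlap-add
```

Only samples covered by two overlapping blocks (here the two middle frames) are reconstructed, because the time-domain aliasing of one block is cancelled by its neighbor.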
Side signal S(n) is subjected to the same series of processes as monaural signal M(n). LP analysis section 11 performs an LP analysis on side signal S(n) to generate linear prediction coefficients AS(z). Quantizer 12 quantizes and encodes linear prediction coefficients AS(z) to acquire coded information AqS. Dequantizer 13 dequantizes coded information AqS to acquire linear prediction coefficients AdS(z). LP inverse filter 14 performs an LP inverse filtering process on side signal S(n) using linear prediction coefficients AdS(z) to acquire side excitation signal Se(n). T/F transformation section 15 time-to-frequency transforms time-domain side excitation signal Se(n) into frequency-domain side excitation signal Se(f). Quantizer 16 quantizes part of frequency-domain side excitation signal Se(f) to form coded information Sqe. All quantized and coded information is multiplexed in multiplexing section 17 to form a bit stream.
When monophonic decoding is performed in the decoding apparatus shown in FIG. 10, demultiplexing section 21 demultiplexes coded information AqM of the linear prediction coefficients and coded information Mqe of the frequency-domain monaural excitation signal from the bit stream. Dequantizer 22 decodes and dequantizes coded information AqM to acquire linear prediction coefficients AdM(z). Meanwhile, dequantizer 23 decodes and dequantizes coded information Mqe to acquire monophonic excitation signal Mde(f) in the frequency domain. F/T transformation section 24 transforms frequency-domain monophonic excitation signal Mde(f) into time-domain Mde(n). LP synthesis section 25 performs LP synthesis on Mde(n) using linear prediction coefficients AdM(z) to recover monaural signal Md(n).
When stereo decoding is carried out, information about the side signal is also demultiplexed from the bit stream in demultiplexing section 21. The side signal is subjected to the same series of processes as the monaural signal, namely: decoding and dequantizing of coded information AqS in dequantizer 26; lossless decoding and dequantizing of coded information Sqe in dequantizer 27; transformation from the frequency domain to the time domain in F/T transformation section 28; and LP synthesis in LP synthesis section 29.
Once monaural signal Md(n) and side signal Sd(n) have been recovered, adder 30 and subtractor 31 recover left signal Lout(n) and right signal Rout(n) according to equation 2 below.

$$
\begin{aligned}
L_{out}(n) &= M_d(n) + S_d(n) \\
R_{out}(n) &= M_d(n) - S_d(n)
\end{aligned}
\qquad \text{(Equation 2)}
$$
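As a small worked example of Equations 1 and 2, the following sketch downmixes a stereo pair into monaural and side signals and then recovers the left and right signals. The function names and sample values are hypothetical, and the coding, quantization, and decoding stages between the two steps are left out, so the decoded signals are taken to equal the encoder's M(n) and S(n).

```python
def downmix(left, right):
    """Equation 1: monaural and side signals from left/right samples."""
    mono = [(l + r) * 0.5 for l, r in zip(left, right)]
    side = [(l - r) * 0.5 for l, r in zip(left, right)]
    return mono, side

def upmix(mono, side):
    """Equation 2: left/right signals from monaural and side samples."""
    left = [m + s for m, s in zip(mono, side)]
    right = [m - s for m, s in zip(mono, side)]
    return left, right

left = [0.5, -0.25, 1.0, 0.0]
right = [0.25, 0.75, -1.0, 0.0]

mono, side = downmix(left, right)        # a mono-only decoder would stop at `mono`
left_out, right_out = upmix(mono, side)  # a stereo decoder also uses `side`
```

Without quantization the round trip is exact, which is why a monophonic decoder can use the monaural signal alone while a stereo decoder adds the side signal to restore both channels.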
Another example of a stereo codec with downward compatibility with monophonic systems employs intensity stereo (IS). Intensity stereo has the advantage of realizing very low coding bit rates. It exploits a psychoacoustic property of the human ear and is therefore regarded as a perceptual coding tool. At frequencies of about 5 kHz or more, the human ear is insensitive to the phase relationship between the left and right signals. Accordingly, even when the left and right signals are replaced with monaural signals scaled to the same energy levels, a listener perceives almost the same stereo sensation as with the original signals. With intensity stereo, only the monaural signal and the scale factors need to be encoded to preserve the original stereo sensation in the decoded signals. Since the side signal is not encoded in that frequency range, the bit rate can be decreased. Intensity stereo is used in MPEG-2/4 AAC (see Non-Patent Document 2).
FIG. 11 is a block diagram showing the configuration of a general coding apparatus using intensity stereo. Time-domain left signal L(n) and right signal R(n) are subjected to time-to-frequency transformation in T/F transformation sections 41 and 42 to produce frequency-domain L(f) and R(f), respectively. Adder 43 and multiplier 44 transform frequency-domain left signal L(f) and right signal R(f) into frequency-domain monaural signal M(f), and subtractor 45 and multiplier 46 transform frequency-domain left signal L(f) and right signal R(f) into frequency-domain side signal S(f) (see equation 3).

$$
\begin{aligned}
M(f) &= (L(f) + R(f)) \cdot 0.5 \\
S(f) &= (L(f) - R(f)) \cdot 0.5
\end{aligned}
\qquad \text{(Equation 3)}
$$
Quantizer 47 quantizes and performs lossless coding on M(f) to acquire coded information Mq. Intensity stereo is not suitable for the low frequency range, and therefore spectrum split section 48 extracts the low frequency part of S(f) (i.e., the part lower than 5 kHz). Quantizer 49 quantizes and performs lossless coding on the extracted low frequency part to acquire coded information Sq1.
To compute the scale factors for intensity stereo, spectrum split sections 51, 52 and 53 extract the high frequency parts of left signal L(f), right signal R(f) and monaural signal M(f), respectively. These outputs are denoted Lh(f), Rh(f) and Mh(f). Scale factor calculation sections 54 and 55 calculate the scale factor for the left signal, α, and the scale factor for the right signal, β, respectively, by the following equation 4.
$$
\alpha = \frac{\sum_{f > 5\,\mathrm{kHz}} L_h^2(f)}{\sum_{f > 5\,\mathrm{kHz}} M_h^2(f)}, \qquad
\beta = \frac{\sum_{f > 5\,\mathrm{kHz}} R_h^2(f)}{\sum_{f > 5\,\mathrm{kHz}} M_h^2(f)}
\qquad \text{(Equation 4)}
$$
Quantizers 56 and 57 quantize scale factors α and β, respectively. Multiplexing section 58 multiplexes all quantized and encoded information, to form a bit stream.
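The scale factor calculation can be sketched as follows, assuming the factors are the ratios of high-band energies described for Equation 4. The function names, the handling of a silent monaural band, and the sample spectra are hypothetical, and the quantization performed by quantizers 56 and 57 is omitted.

```python
def band_energy(spectrum):
    """Sum of squared magnitudes over the bins of one band."""
    return sum(abs(v) ** 2 for v in spectrum)

def intensity_scale_factors(left_high, right_high, mono_high):
    """Energy of each channel's high band relative to the monaural high band."""
    mono_energy = band_energy(mono_high)
    if mono_energy == 0.0:
        # Assumption: return zero gains for a silent band to avoid dividing by zero.
        return 0.0, 0.0
    alpha = band_energy(left_high) / mono_energy
    beta = band_energy(right_high) / mono_energy
    return alpha, beta

# Hypothetical high-band bins (the part of each spectrum above 5 kHz).
left_high = [2.0, 0.0, -2.0]
right_high = [1.0, 0.0, 1.0]
mono_high = [(l + r) * 0.5 for l, r in zip(left_high, right_high)]

alpha, beta = intensity_scale_factors(left_high, right_high, mono_high)
```

Note that these values are energy ratios. If the gains are instead intended to equalize band energy after a direct multiplication of the monaural spectrum, the square root of each ratio would be the corresponding amplitude gain; that variant is an assumption and is not stated in the text.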
FIG. 12 is a block diagram showing the configuration of a general decoding apparatus using intensity stereo. First, demultiplexing section 61 demultiplexes all the information in the bit stream. Dequantizer 62 performs lossless decoding and dequantization of the monaural signal to recover frequency-domain monaural signal Md(f). When only monaural decoding is carried out, Md(f) is transformed into Md(n), and the decoding process is finished.
When stereo decoding is carried out, spectrum split section 63 splits Md(f) into high frequency components Mdh(f) and low frequency components Md1(f). In addition, dequantizer 64 performs lossless decoding and dequantization of coded information Sq1 of the low frequency part of the side signal to acquire Sd1(f).
Adder 65 and subtractor 66 recover the low frequency parts Ld1(f) and Rd1(f) of the left and right signals from Md1(f) and Sd1(f) according to equation 5 below.

$$
\begin{aligned}
L_{d1}(f) &= M_{d1}(f) + S_{d1}(f) \\
R_{d1}(f) &= M_{d1}(f) - S_{d1}(f)
\end{aligned}
\qquad \text{(Equation 5)}
$$
Dequantizers 67 and 68 dequantize the intensity stereo scale factors αq and βq to acquire αd and βd, respectively. Multipliers 69 and 70 recover the high frequency parts Ldh(f) and Rdh(f) of the left and right signals from Mdh(f), αd and βd according to equation 6 below.

$$
\begin{aligned}
L_{dh}(f) &= M_{dh}(f) \cdot \alpha_d \\
R_{dh}(f) &= M_{dh}(f) \cdot \beta_d
\end{aligned}
\qquad \text{(Equation 6)}
$$
Combination section 71 combines low frequency part Ld1(f) and high frequency part Ldh(f) of the left signal to acquire full spectrum Lout(f) of the left signal. Likewise, combination section 72 combines low frequency part Rd1(f) and high frequency part Rdh(f) of the right signal to acquire full spectrum Rout(f) of the right signal.
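The stereo decoding path for intensity stereo, from the band split through Equations 5 and 6 to the combination of the two bands, can be sketched as below. Spectra are represented as plain lists of bins, and the function name, the `split_bin` index marking the 5 kHz boundary, and the sample values are hypothetical. Lossless decoding, dequantization and the final F/T transformation are omitted.

```python
def intensity_stereo_decode(mono, side_low, alpha_d, beta_d, split_bin):
    """Rebuild full left/right spectra from Md(f), Sd1(f) and the scale factors."""
    # Spectrum split: bins below split_bin form the low band, the rest the high band.
    mono_low, mono_high = mono[:split_bin], mono[split_bin:]

    # Equation 5: the low band uses the decoded side signal.
    left_low = [m + s for m, s in zip(mono_low, side_low)]
    right_low = [m - s for m, s in zip(mono_low, side_low)]

    # Equation 6: the high band is the monaural band scaled per channel.
    left_high = [m * alpha_d for m in mono_high]
    right_high = [m * beta_d for m in mono_high]

    # Combination: low band followed by high band gives the full spectrum.
    return left_low + left_high, right_low + right_high

mono = [1.0, 0.5, 0.25, -0.5]   # hypothetical decoded monaural spectrum
side_low = [0.5, -0.5]          # hypothetical decoded low-band side spectrum
left_out, right_out = intensity_stereo_decode(mono, side_low, 2.0, 0.5, split_bin=2)
```

The high band of both output channels is a scaled copy of the same monaural spectrum, which is where the bit-rate saving of intensity stereo comes from.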
Finally, F/T transformation sections 73 and 74 frequency-to-time transform frequency-domain Lout(f) and Rout(f), to acquire time-domain Lout(n) and Rout(n).

Non-Patent Document 1: 3GPP TS 26.290, “Extended AMR Wideband Speech Codec (AMR-WB+)”
Non-Patent Document 2: Jurgen Herre, “From Joint Stereo to Spatial Audio Coding - Recent Progress and Standardization”, Proc. of the 7th International Conference on Digital Audio Effects, Naples, Italy, Oct. 5-8, 2004.