In the field of digital mobile communication such as cellular telephones, there is a demand for a low bit rate speech compression coding method to cope with an increasing number of subscribers, and various research organizations are carrying forward research and development focused on this method.
In Japan, a coding method called “VSELP” with a bit rate of 11.2 kbps developed by Motorola, Inc. is used as a standard coding system for digital cellular telephones and digital cellular telephones using this system are on sale in Japan since the fall of 1994.
Furthermore, a coding system called “PSI-CELP” with a bit rate of 5.6 kbps developed by NTT Mobile Communications Network, Inc. is now commercialized. These systems are the improved versions of a system called “CELP” (described in “Code Excited Linear Prediction: M. R. Schroeder “High Quality Speech at Low Bit Rates”, Proc. ICASSP '85, pp. 937-940).
This CELP system is characterized by adopting a method (A-b-S: Analysis by Synthesis) consisting of separating speech into excitation information and vocal tract information, coding the excitation information using indices of a plurality of excitation samples stored in a codebook, while coding LPC (linear prediction coefficients) for the vocal tract information and making a comparison with input speech taking into consideration the vocal tract information during coding of the excitation information.
In this CELP system, an autocorrelation analysis and LPC analysis are conducted on the input speech data (input speech) to obtain LPC coefficients and the LPC coefficients obtained are coded to obtain an LPC code. The LPC code obtained is decoded to obtain decoded LPC coefficients. On the other hand, the input speech is assigned perceptual weight by a perceptual weighting filter using the LPC coefficients.
Two synthesized speeches are obtained by applying filtering to respective code vectors of excitation samples stored in an adaptive codebook and stochastic codebook (referred to as “adaptive code vector” (or adaptive excitation) and “stochastic code vector” (or stochastic excitation), respectively) using the obtained decoded LPC coefficients.
Then, a relationship between the two synthesized speeches obtained and the perceptual weighted input speech is analyzed, optimal values (optimal gains) of the two synthesized speeches are obtained, the power of the synthesized speeches is adjusted according to the optimal gains obtained and an overall synthesized speech is obtained by adding up the respective synthesized speeches. Then, coding distortion between the overall synthesized speech obtained and the input speech is calculated. In this way, coding distortion between the overall synthesized speech and input speech is calculated for all possible excitation samples and the indexes of the excitation samples (adaptive excitation sample and stochastic excitation sample) corresponding to the minimum coding distortion are identified as the coded excitation samples.
The gains and indexes of the excitation samples calculated in this way are coded and these coded gains and the indexes of the coded excitation samples are sent together with the LPC code to the transmission path. Furthermore, an actual excitation signal is created from two excitations corresponding to the gain code and excitation sample index, these are stored in the adaptive codebook and at the same time the old excitation sample is discarded.
By the way, excitation searches for the adaptive codebook and for the stochastic codebook are generally carried out on a subframe-basis, where subframe is a subdivision of an analysis frame. Coding of gains (gain quantization) is performed by vector quantization (VQ) that evaluates quantization distortion of the gains using two synthesized speeches corresponding to the excitation sample indexes.
In this algorithm, a vector codebook is created beforehand which stores a plurality of typical samples (code vectors) of parameter vectors. Then, coding distortion between the perceptual weighted input speech and a perceptual weighted LPC synthesis of the adaptive excitation vector and of the stochastic excitation vector is calculated using gain code vectors stored in the vector codebook from the following expression 1:
                    En        =                              ∑                          i              =              0                        I                    ⁢                                    (                              Xi                -                                  gn                  ×                  Ai                                -                                  hn                  ×                  Si                                            )                        2                                              Expression        ⁢                                  ⁢        1            
where:
En: Coding distortion when nth gain code vector is used
Xi: Perceptual weighted speech
Ai: Perceptual weighted LPC synthesis of adaptive code vector
Si: Perceptual weighted LPC synthesis of stochastic code vector
gn: Code vector element (gain on adaptive excitation side)
hn: Code vector element (gain on stochastic excitation side)
n: Code vector number
i: Excitation data index
I: Subframe length (coding unit of input speech) Then, distortion En when each code vector is used by controlling the vector codebook is compared and the number of the code vector with the least distortion is identified as the gain vector code. Furthermore, the number of the code vector with the least distortion is found from among all the possible code vectors stored in the vector codebook and identified to be the vector code.
Expression 1 above seems to require many computational complexity for every n, but since the sum of products on i can be calculated beforehand, it is possible to search n with a small amount of computationak complexity.
On the other hand, by determining a code vector based on the transmitted code of the vector, a speech decoder (decoder) decodes coded data and obtains a code vector.
Moreover, further improvements have been made over the prior art based on the above algorithm. For example, taking advantage of the fact that the human perceptual characteristic to sound intensity is found to have logarithmic scale, power is logarithmically expressed and quantized, and two gains normalized with that power is subjected to VQ. This method is used in the Japan PDC half rate CODEC standard system. There is also a method of coding using inter-frame correlations of gain parameters (predictive coding). This method is used in the ITU-T international standard G.729. However, even these improvements are unable to attain performance to a sufficient degree.
Gain information coding methods using the human perceptual characteristic to sound intensity and inter-frame correlations have been developed so far, providing more efficient coding performance of gain information. Especially, predictive quantization has drastically improved the performance, but the conventional method performs predictive quantization using the same values as those of previous subframes as state values. However, some of the values stored as state values are extremely large (small) and using those values for the next subframe may prevent the next subframe from being quantized correctly, resulting in local abnormal sounds.