The present invention relates to a speech coding/decoding method for coding a speech signal with high quality at a low bit rate, specifically 4.8 kb/s or less, with a relatively small amount of computation.
Known methods of coding a speech signal at a low bit rate of about 4.8 kb/s or less include the speech coding methods disclosed in, e.g., Japanese Patent Application No. 63-208201, published as Japanese Patent Laid-Open No. HEI 02-58100 (reference 1), and M. Schroeder and B. Atal, "Code-excited linear prediction: High quality speech at very low bit rates," ICASSP, pp. 937-940, 1985 (reference 2).
According to the method in reference 1, on the transmission side, a spectrum parameter representing the spectrum characteristics of a speech signal and a pitch parameter representing its pitch are extracted from the speech signal of each frame. Speech signals are classified into a plurality of types (e.g., vowel, explosive, and fricative sound signals) using acoustic features. A one-frame sound source signal in a vowel sound interval is represented by improved pitch interpolation in the following manner. One frame is divided into a plurality of pitch intervals, and the signal component in one of them (the representative interval) is represented by a multipulse. For each of the other pitch intervals in the same frame, amplitude and phase correction coefficients for correcting the amplitude and phase of the multipulse in the representative interval are obtained. Subsequently, the amplitude and position of the multipulse in the representative interval, the amplitude and phase correction coefficients in the other pitch intervals, and the spectrum and pitch parameters are transmitted. In an explosive sound interval, a multipulse for the entire frame is obtained. In a fricative sound interval, one type of noise signal is selected from a codebook constituted by predetermined types of noise signals so as to minimize the power of the difference between the input speech signal and the signal obtained by synthesizing the noise signal, and an optimal gain is calculated. As a result, an index representing the type of noise signal and the gain are transmitted. A description of the reception side is omitted.
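The fricative-interval selection described above can be illustrated by a minimal sketch. It assumes that each codebook entry has already been passed through the synthesis filter, and that the optimal gain is the least-squares gain for each entry; neither detail is specified in the description above. The function name search_noise_codebook and its arguments are hypothetical.

```python
def search_noise_codebook(target, synthesized_codebook):
    """Pick the codebook entry that minimizes the error power.

    target               -- input speech samples for the interval
    synthesized_codebook -- noise entries, assumed already synthesized
    Returns (index, gain, error_power).
    """
    best_index, best_gain, best_error = -1, 0.0, float("inf")
    for index, entry in enumerate(synthesized_codebook):
        energy = sum(v * v for v in entry)
        if energy == 0.0:
            continue  # an all-zero entry cannot be scaled to fit the target
        correlation = sum(t * v for t, v in zip(target, entry))
        gain = correlation / energy  # least-squares gain for this entry
        error_power = sum((t - gain * v) ** 2 for t, v in zip(target, entry))
        if error_power < best_error:
            best_index, best_gain, best_error = index, gain, error_power
    return best_index, best_gain, best_error


# Toy example with three hypothetical entries of length 3.
codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
index, gain, error_power = search_noise_codebook([2.0, 2.0, 0.0], codebook)
```

Only the selected index and the gain would then be transmitted, as stated above.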
In the conventional speech coding method disclosed in reference 1, for a female speaker having a short pitch period, a large number of pitch intervals are present in a frame, so improved pitch interpolation can be performed effectively, and a sufficient number of pulses can be equivalently obtained for the entire frame. For example, if the frame length is 20 ms, the pitch period is 4 ms, and the number of pulses in a representative interval is 4, then 20 pulses can be equivalently obtained for the entire frame.
For a male speaker having a long pitch period, however, a sufficient number of pulses cannot be equivalently obtained for the entire frame, so improved pitch interpolation does not exhibit a satisfactory effect and sound quality is degraded. For example, if the pitch period is 10 ms and the number of pulses per pitch is 4, the number of pulses in the entire 20-ms frame is 8, which is very small compared with the case of a female speaker. In order to increase the number of pulses in the entire frame, the number of pulses per pitch must be increased. However, increasing this number raises the bit rate. For this reason, it is difficult to increase the number of pulses.
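The two pulse counts above follow from simple arithmetic, sketched below. The sketch assumes that the same 20-ms frame applies to the male-speaker example and that the frame holds a whole number of pitch intervals; the function name equivalent_pulses_per_frame is hypothetical.

```python
def equivalent_pulses_per_frame(frame_ms, pitch_ms, pulses_per_pitch):
    """Pulses equivalently available in one frame under pitch interpolation."""
    pitch_intervals = frame_ms // pitch_ms  # whole pitch intervals in the frame
    return pitch_intervals * pulses_per_pitch


female_pulses = equivalent_pulses_per_frame(20, 4, 4)   # 5 intervals x 4 pulses
male_pulses = equivalent_pulses_per_frame(20, 10, 4)    # 2 intervals x 4 pulses
```

With the figures given above, this yields 20 pulses for the short pitch period and 8 for the long one, although the same number of pulses per pitch interval is coded in both cases.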
In addition, if the bit rate is decreased from 4.8 kb/s to 3 kb/s or 2.4 kb/s, the number of pulses per pitch must be decreased to 2 or 3, which makes the above-described problem worse. At such a low bit rate, the effect of improved pitch interpolation becomes insufficient even for a female speaker.
In the code-excited linear prediction (CELP) method disclosed in reference 2, if the bit rate is decreased below 4.8 kb/s, the number of bits of the codebook must be decreased, resulting in abrupt degradation of sound quality. For example, at 4.8 kb/s, a 10-bit codebook is generally used for a subframe of 5 ms. At 2.4 kb/s, however, the number of bits of the codebook must be decreased to 5 if the subframe length is kept at 5 ms. Since 5 bits are too few to cover the various types of sound source signals, the sound quality is abruptly degraded at bit rates lower than about 4.8 kb/s.
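The codebook figures above can be checked with a short calculation. The sketch assumes one codebook index is sent per subframe and nothing else is counted; the function names are hypothetical.

```python
def codebook_bit_rate(bits_per_subframe, subframe_ms):
    """Bits per second spent on codebook indices alone."""
    return bits_per_subframe * 1000 / subframe_ms


def codebook_size(bits_per_subframe):
    """Number of distinct sound source vectors an index can address."""
    return 2 ** bits_per_subframe


rate_10_bit = codebook_bit_rate(10, 5)  # codebook used at 4.8 kb/s
rate_5_bit = codebook_bit_rate(5, 5)    # codebook used at 2.4 kb/s
size_10_bit = codebook_size(10)
size_5_bit = codebook_size(5)
```

Under these assumptions, halving the index from 10 bits to 5 bits halves the codebook bit rate from 2 kb/s to 1 kb/s, but reduces the number of selectable sound source vectors from 1024 to 32.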