This invention relates to a speech coding system, and particularly to a system for improving the quality of coded and decoded speech when compressing the speech information to about 8 kbps (kilobits per second).
For PCM transmission of a speech signal over a broad-band cable, the signal is sampled, quantized and transformed into a binary digital signal. The transmission bit rate is 64 kbps.
In establishing a communication network using leased digital lines, reduction in the communication cost is a critical concern, and speech signals which carry as much information as 64 kbps cannot be transmitted directly. To cope with this problem, it is necessary to compress the information (i.e., to use low bit-rate coding) for the transmission of such speech signals.
A known method of compressing a speech signal to about 8 kbps is to separate the speech signal into spectrum envelope information and excitation information, and to code each kind of information individually. A method of separating the speech signal into the spectrum envelope information and the excitation information is described in the following. It is assumed that the speech waveform has already been sampled and transformed into a series of sample values x_i, in which the present sample value is x_t and the p preceding sample values are {x_(t-i)} (where i = 1, 2, ..., p). It is further assumed that the speech waveform can be predicted approximately from the p preceding samples. Among prediction schemes, the simplest, linear prediction, approximates the current value by a sum of past sample values, each multiplied by a certain coefficient. The difference between the actual value x_t and the predicted value y_t at the present time t is the prediction error ε, which is also called the "prediction residual" or simply the "residual".
The prediction residual waveform of a speech waveform is assumed to be the sum of two kinds of waveforms. One is an error component that has a moderate amplitude and resembles a random noise waveform. The other is an error attributable to the entry of a voiced sound pulse, which is highly unpredictable and therefore produces a residual waveform with a large amplitude. This latter error component appears cyclically, with the periodicity of the sound source.
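The linear prediction and residual described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the text: the coefficient symbol `a`, the use of the autocorrelation method with the Levinson-Durbin recursion to choose the coefficients, the function names, and the toy "voiced" signal (a two-pole resonance driven by one pulse every 40 samples). The sketch only shows that the residual of a predictable waveform is small except where a source pulse enters.

```python
def autocorrelation(x, max_lag):
    """r[k] = sum over t of x[t] * x[t-k], for k = 0..max_lag."""
    n = len(x)
    return [sum(x[t] * x[t - k] for t in range(k, n)) for k in range(max_lag + 1)]


def predictor_coefficients(x, p):
    """Estimate p coefficients so that y_t = sum_i a[i-1] * x[t-i].

    Assumption: autocorrelation method solved by the Levinson-Durbin
    recursion. The input must not be all zeros (r[0] > 0).
    """
    r = autocorrelation(x, p)
    a = [0.0] * (p + 1)  # a[0] is unused; a[i] multiplies x[t-i]
    err = r[0]
    for m in range(1, p + 1):
        k = (r[m] - sum(a[i] * r[m - i] for i in range(1, m))) / err
        prev = a[:]
        a[m] = k
        for i in range(1, m):
            a[i] = prev[i] - k * prev[m - i]
        err *= 1.0 - k * k
    return a[1:]


def prediction_residual(x, a):
    """Residual x_t - y_t for t = p .. len(x)-1."""
    p = len(a)
    residual = []
    for t in range(p, len(x)):
        y_t = sum(a[i - 1] * x[t - i] for i in range(1, p + 1))
        residual.append(x[t] - y_t)
    return residual


def toy_voiced_signal(n=400, period=40):
    """Hypothetical test signal: a resonance excited by a periodic pulse."""
    x = []
    for t in range(n):
        pulse = 1.0 if t % period == 0 else 0.0
        x1 = x[t - 1] if t >= 1 else 0.0
        x2 = x[t - 2] if t >= 2 else 0.0
        x.append(1.6 * x1 - 0.8 * x2 + pulse)
    return x


if __name__ == "__main__":
    x = toy_voiced_signal()
    a = predictor_coefficients(x, p=2)
    e = prediction_residual(x, a)
    print("coefficients:", a)
    print("signal energy:", sum(v * v for v in x))
    print("residual energy:", sum(v * v for v in e))
```

In this toy case the residual energy is far below the signal energy, and the largest residual values fall at the sample times where a pulse enters, which corresponds to the large-amplitude, cyclic error component.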
Speech has sections with periodicity (voiced sound) and sections without significant periodicity (unvoiced sound), and correspondingly the prediction residual waveform has periodicity in its voiced sound sections.
The so-called PARCOR (Partial Autocorrelation) method models the residual waveform with a single pulse train for voiced sound and with white noise for unvoiced sound. It achieves low bit-rate coding, but it suffers significant quality degradation. Other methods, which express the original sound by several pulse trains, include the multi-pulse excitation method (refer to Transactions of the Committee on Speech Research pp. 617-624, The Acoustical Society of Japan, entitled "Quality Modification in Multi-pulse Speech Coding System", S83-78 (Jan. 1984), by Ozawa, et al.) and the thinned-out residual method (refer to Digests of Conference in Oct. 1984, pp. 169-170, The Acoustical Society of Japan, entitled "Speech Synthesis Using Residual Information", by Yukawa, et al.).
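The two-state residual model used by the PARCOR method can be sketched roughly as follows. This is a simplified illustration, not the method of the cited works: the function names, the `pitch_period`, `gain` and `seed` parameters, and the direct-form all-pole synthesis filter are all assumptions (a PARCOR vocoder would normally drive a lattice filter with partial autocorrelation coefficients, which is swapped here for plain predictor coefficients to keep the example short).

```python
import random


def make_excitation(n, voiced, pitch_period=40, gain=1.0, seed=0):
    """Model residual for n samples.

    Voiced: a single pulse train with one pulse every pitch_period samples.
    Unvoiced: white (Gaussian) noise. All parameters are hypothetical.
    """
    if voiced:
        return [gain if t % pitch_period == 0 else 0.0 for t in range(n)]
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n)]


def synthesize(excitation, a):
    """Rebuild a waveform as x_t = y_t + e_t, where y_t is the linear
    prediction sum_i a[i-1] * x[t-i] and e_t is the model residual."""
    p = len(a)
    x = []
    for t, e_t in enumerate(excitation):
        y_t = sum(a[i - 1] * x[t - i] for i in range(1, p + 1) if t - i >= 0)
        x.append(y_t + e_t)
    return x


if __name__ == "__main__":
    voiced = synthesize(make_excitation(160, voiced=True), [1.6, -0.8])
    unvoiced = synthesize(make_excitation(160, voiced=False), [1.6, -0.8])
    print(voiced[:5])
    print(unvoiced[:5])
```

Because only a voiced/unvoiced decision, a period and a gain describe the excitation of each frame, very few bits are needed, but all detail of the true residual is discarded, which is consistent with the quality degradation noted above.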
In the above conventional techniques, an excitation pulse train is generated for each frame independently, based on a certain formulation. The frame is the time unit of speech analysis, and it is generally set to about 20 ms. In the multi-pulse method and the thinned-out residual method, the generated pulse trains can be regarded as an approximation of the residual, and therefore the pulse trains in voiced sound sections appear to have periodicity. However, since each pulse train is generated independently of the preceding and following frames, the relative positions of the pulses differ from frame to frame, which can cause the periodicity to fluctuate. Synthesizing speech from such pulse trains degrades the quality, for example by producing a rumbling sound.
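The effect of frame-independent pulse generation on periodicity can be illustrated with a deliberately simplified toy model. It is not the multi-pulse or thinned-out residual algorithm: the model simply restarts a pulse train at the beginning of every frame, and the frame length of 160 samples (20 ms at an assumed 8 kHz sampling rate), the period of 45 samples and all function names are assumptions chosen for illustration.

```python
def pulses_per_frame_independent(n_frames, frame_len, period):
    """Pulse positions when every frame restarts its own pulse train
    at the frame start, ignoring where the previous frame's pulses fell."""
    positions = []
    for f in range(n_frames):
        start = f * frame_len
        positions.extend(range(start, start + frame_len, period))
    return positions


def pulses_continuous(n_frames, frame_len, period):
    """Pulse positions when the pulse train runs on across frame boundaries."""
    return list(range(0, n_frames * frame_len, period))


def intervals(positions):
    """Distances between successive pulses."""
    return [b - a for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    frame_len, period = 160, 45  # hypothetical values
    independent = pulses_per_frame_independent(3, frame_len, period)
    continuous = pulses_continuous(3, frame_len, period)
    print("independent:", intervals(independent))
    print("continuous: ", intervals(continuous))
```

In this toy case the continuous train keeps a constant interval of 45 samples, while the per-frame train shows a shortened interval of 25 samples at every frame boundary, which is one simple way in which the pulse interval can fluctuate when frames are treated independently.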