The present invention relates to a speech signal processing system in which a prediction residual waveform is obtained by removing the short-time correlation from a speech waveform, and the prediction residual waveform is used, for example, for coding the speech waveform.
Prior art speech signal coding systems fall into two classes: waveform coding systems and analysis-synthesis systems (vocoders). In a linear predictive coding (LPC) vocoder, which belongs to the latter class, coefficients of an all-pole filter (prediction filter) representing the speech spectrum envelope are obtained by linear prediction analysis of an input speech waveform. The input speech waveform is then passed through an all-zero filter (inverse filter) whose characteristics are inverse to those of the prediction filter, so as to obtain a prediction residual waveform. A parameter extracting part extracts, as parameters characterizing the residual waveform, its periodicity (discrimination between voiced and unvoiced sound), its pitch period and its average power, and these extracted parameters are sent out together with the prediction filter coefficients. In the synthesizing part, an excitation source generating part outputs, in place of the prediction residual waveform, a train of periodic pulses having the received pitch period in the case of a voiced sound, or a noise waveform in the case of an unvoiced sound. This excitation is supplied to a prediction filter whose coefficients are set to the received filter coefficients, and the prediction filter outputs a speech waveform.
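The analysis and synthesis steps described above can be illustrated by the following minimal sketch. It is not taken from any cited system. It assumes the autocorrelation method with the Levinson-Durbin recursion for the linear prediction analysis, a toy second-order "speech" frame driven by a pulse train, and hypothetical function names (autocorrelation, levinson_durbin, inverse_filter, prediction_filter).

```python
def autocorrelation(x, max_lag):
    """Short-time autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return predictor coefficients a_1..a_p for x_hat[n] = sum_k a_k * x[n-k]."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]


def inverse_filter(x, coeffs):
    """All-zero (inverse) filter: e[n] = x[n] - sum_k a_k * x[n-k]."""
    out = []
    for n in range(len(x)):
        pred = sum(c * x[n - k] for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(x[n] - pred)
    return out


def prediction_filter(e, coeffs):
    """All-pole (prediction) filter: y[n] = e[n] + sum_k a_k * y[n-k]."""
    y = []
    for n in range(len(e)):
        pred = sum(c * y[n - k] for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        y.append(e[n] + pred)
    return y


def energy(signal):
    return sum(v * v for v in signal)


# Toy voiced frame (an assumption): one pulse every 50 samples drives a
# stable second-order resonator.
excitation = [1.0 if n % 50 == 0 else 0.0 for n in range(400)]
speech = prediction_filter(excitation, [1.3, -0.6])

# Analysis: estimate the filter, then remove the short-time correlation.
coeffs = levinson_durbin(autocorrelation(speech, 2), 2)
residual = inverse_filter(speech, coeffs)

# Synthesis with the true residual restores the frame; a vocoder would
# substitute a pulse train or noise for the residual at this point.
rebuilt = prediction_filter(residual, coeffs)
```

In this sketch the residual carries less energy than the frame it was derived from, and feeding the unmodified residual back through the prediction filter reproduces the frame up to rounding error.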
On the other hand, in an adaptive predictive coding (APC) system, which belongs to the former class of waveform coding, a prediction residual waveform is obtained in a manner similar to that of the vocoder, and the sampled values of this residual waveform are then directly quantized (coded) and sent out along with the coefficients of the prediction filter. In the synthesizing section, the received coded residual waveform is decoded and supplied to a prediction filter whose coefficients are set to the received prediction filter coefficients, and the prediction filter generates a speech waveform.
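The sample-by-sample coding of the residual described above can be sketched as follows. This is an illustration only: the prediction filter coefficients COEFFS, the uniform quantizer, the noise-excited test frame and all function names are assumptions, and the quantizer is placed outside the prediction loop for simplicity.

```python
import math
import random

COEFFS = [1.3, -0.6]  # hypothetical prediction filter coefficients


def inverse_filter(x, coeffs):
    """All-zero filter: e[n] = x[n] - sum_k a_k * x[n-k]."""
    return [x[n] - sum(c * x[n - k] for k, c in enumerate(coeffs, start=1) if n - k >= 0)
            for n in range(len(x))]


def prediction_filter(e, coeffs):
    """All-pole filter: y[n] = e[n] + sum_k a_k * y[n-k]."""
    y = []
    for n in range(len(e)):
        y.append(e[n] + sum(c * y[n - k] for k, c in enumerate(coeffs, start=1) if n - k >= 0))
    return y


def quantize(samples, bits):
    """Uniform quantizer over [-peak, +peak]; returns the decoded values."""
    peak = max(abs(s) for s in samples) or 1.0
    levels = 2 ** bits
    step = 2.0 * peak / levels
    decoded = []
    for s in samples:
        index = min(levels - 1, max(0, int((s + peak) / step)))
        decoded.append(-peak + (index + 0.5) * step)
    return decoded


def snr_in_db(reference, estimate):
    signal = sum(r * r for r in reference)
    noise = sum((r - q) ** 2 for r, q in zip(reference, estimate))
    return 10.0 * math.log10(signal / noise)


# Toy frame (an assumption): seeded Gaussian noise through the filter.
rng = random.Random(0)
speech = prediction_filter([rng.gauss(0.0, 1.0) for _ in range(400)], COEFFS)
residual = inverse_filter(speech, COEFFS)

# Code the residual with 2, 4 and 6 bits per sample and resynthesize.
snr_db = {}
for bits in (2, 4, 6):
    decoded_speech = prediction_filter(quantize(residual, bits), COEFFS)
    snr_db[bits] = snr_in_db(speech, decoded_speech)
```

The signal-to-noise ratio of the resynthesized frame rises with the number of quantizing bits, which reflects the trade-off between bit rate and quantization distortion in this class of system.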
The difference between these two conventional systems resides in the method of coding the prediction residual waveform. The LPC vocoder can achieve a large reduction in bit rate in comparison with the APC system, which transmits a quantized value for each sample of the residual waveform, because, with respect to the residual waveform, the LPC vocoder is required to transmit only the characterizing parameters (periodicity, pitch period and average power). On the other hand, the LPC vocoder cannot avoid the degradation in speech quality caused by replacing the residual waveform with a pulse train or noise, which results in what is called a mechanical synthetic voice. Even if the bit rate is increased, the improvement in quality saturates at about 6 kb/s. As a result, the LPC vocoder has the disadvantage that it cannot provide natural voice quality. Another factor lowering the quality is that the timing for controlling the prediction filter coefficients cannot be suitably determined relative to each pulse position (phase) in the pulse train supplied to the prediction filter, because information indicating each pitch position is lacking. The LPC vocoder has the further disadvantage that quality is lowered when erroneous characterizing parameters are extracted from the residual waveform. The APC system, in contrast, has the advantage that speech quality can be brought very close to that of the original speech by increasing the number of quantizing bits for the residual waveform, but it has the disadvantage that, when the bit rate is lowered below 16 kb/s, quantization distortion increases and the speech quality degrades abruptly.
Moreover, in the prior art systems, operations such as alteration of the pitch of a speech signal and combining of speech signal frames may happen to be carried out at time locations where the signal energy is concentrated, resulting in the generation of unnatural speech.
Furthermore, in the prior art, as disclosed in U.S. Pat. No. 4,214,125, F. S. MOZER, "Method and apparatus for speech synthesizing" or U.S. Pat. No. 3,892,919, A. ICHIKAWA, "Speech synthesizing system", the following processing procedure has been proposed. The Fourier transform is carried out on the samples in each waveform section of one pitch length cut out from a speech waveform, and the resultant sine component is set to zero, that is, the phase of each harmonic component is set to zero. The result is then subjected to the inverse Fourier transform so as to zero-phase the cut-out speech waveform, thereby temporally concentrating the signal energy into a pulse-like waveform. Each zero-phased waveform of the pitch length is coded. In the synthesizing part, the resultant codes are decoded and the zero-phased waveform sections, each having a duration of one pitch period, are concatenated with one another to restore the speech waveform. In this method, erroneous extraction of the pitch period greatly affects the speech quality, and processing distortion is caused by the zero-phasing process applied to the speech waveform. Furthermore, in this method, the location of the energy concentration (pulse) produced by the zero-phasing has nothing to do with the pitch location, that is, the portion of each pitch length where the energy of the original speech waveform is comparatively concentrated. Consequently, the restored speech waveform synthesized by successively concatenating the zero-phased speech waveform sections is far from the original speech waveform, and excellent speech quality cannot be obtained.
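The zero-phasing operation described above can be illustrated by the following sketch, which is not taken from the cited patents. It assumes a direct discrete Fourier transform over one waveform section, realizes "phase of each harmonic component set to zero" by replacing each spectral coefficient with its magnitude, and uses a toy 40-sample section whose energy starts at sample 17. All names and values are hypothetical.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform of one waveform section."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def zero_phase(section):
    """Keep each harmonic's magnitude and set its phase to zero."""
    magnitudes = [abs(c) for c in dft(section)]
    return [v.real for v in idft(magnitudes)]


def peak_index(signal):
    return max(range(len(signal)), key=lambda i: abs(signal[i]))


def energy(signal):
    return sum(v * v for v in signal)


# Toy pitch-length section (an assumption): a damped oscillation that
# starts at sample 17, so the original energy is concentrated there.
LENGTH, ONSET = 40, 17
section = [0.0 if n < ONSET else math.exp(-0.3 * (n - ONSET)) * math.cos(0.9 * (n - ONSET))
           for n in range(LENGTH)]

zero_phased = zero_phase(section)
original_peak = peak_index(section)
zero_phased_peak = peak_index(zero_phased)
```

In this sketch the energy of the section is preserved, but the pulse of the zero-phased section always sits at the start of the section, whatever the location of the energy concentration in the original section.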
Further, in J. IECE Jpn. Trans. A, vol. 62-t. No. 3, March 1979, "Function and basic characteristics of SPAC" by Takasugi, the following method is proposed. The auto-correlation function of a speech waveform is obtained, a certain kind of zero-phasing operation is conducted on the speech waveform, and each speech waveform section of a pitch length is coded. In the decoding part, the decoded waveform sections are successively concatenated with one another. In this method, the operation of obtaining the auto-correlation function is somewhat similar to performing a square operation, so that the low frequency components with large energy are emphasized, resulting in square-law distortion in the spectrum of the processed signal. In this case, the zero-phasing serves to concentrate energy in the form of a pulse in each pitch period of the auto-correlation function, but the pulse location does not necessarily coincide with the location where the energy in each pitch period of the speech waveform is concentrated. Therefore, when the decoded waveform sections are connected to one another to reconstruct a speech waveform, the reconstructed speech waveform may be far from the original speech waveform.
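The square-law effect mentioned above can be illustrated numerically. The following sketch is not the SPAC procedure itself. It assumes a circular auto-correlation over one 64-sample section, chosen for simplicity, and a toy waveform made of a strong low-frequency component and a weaker high-frequency component. All names and values are hypothetical.

```python
import cmath
import math

N = 64
LOW_BIN, HIGH_BIN = 2, 9

# Toy waveform (an assumption): a strong low-frequency component with
# amplitude 4 and a weaker high-frequency component with amplitude 1.
wave = [4.0 * math.cos(2 * math.pi * LOW_BIN * n / N)
        + 1.0 * math.cos(2 * math.pi * HIGH_BIN * n / N)
        for n in range(N)]


def circular_autocorrelation(x):
    """r[m] = sum_n x[n] * x[(n + m) mod N]."""
    n = len(x)
    return [sum(x[i] * x[(i + m) % n] for i in range(n)) for m in range(n)]


def dft_magnitude(x, k):
    """Magnitude of the k-th DFT coefficient of x."""
    n = len(x)
    return abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))


acf = circular_autocorrelation(wave)

# Ratio of the low-frequency to the high-frequency component, measured
# in the waveform and in its auto-correlation function.
ratio_wave = dft_magnitude(wave, LOW_BIN) / dft_magnitude(wave, HIGH_BIN)
ratio_acf = dft_magnitude(acf, LOW_BIN) / dft_magnitude(acf, HIGH_BIN)
```

Because the spectrum of the circular auto-correlation equals the squared magnitude spectrum of the waveform, the 4:1 amplitude ratio of the two components becomes 16:1 in the auto-correlation function, so the stronger low-frequency component is emphasized.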