The present invention relates to a speech analysis-synthesis method and apparatus in which a linear filter representing the spectral envelope characteristic of speech is excited by an excitation signal to synthesize a speech signal.
Heretofore, the linear predictive vocoder and multipulse predictive coding have been proposed for use in speech analysis-synthesis systems of this kind. The linear predictive vocoder is now widely used for speech coding in the low bit rate region below 4.8 kb/s, and it includes the PARCOR system and the line spectrum pair (LSP) system. These systems are described in detail in Saito and Nakata, "Fundamentals of Speech Signal Processing," ACADEMIC PRESS, INC., 1985, for instance. The linear predictive vocoder is made up of an all-pole filter representing the spectral envelope characteristic of speech and an excitation signal generating part for generating a signal for exciting the all-pole filter. The excitation signal is an impulse sequence at the pitch frequency for a voiced sound and white noise for an unvoiced sound. The excitation parameters are the voiced/unvoiced decision, the pitch frequency and the magnitude of the excitation signal. These parameters are extracted as average features of the speech signal in an analysis window of about 30 msec. In the linear predictive vocoder, since the speech feature parameters extracted for each analysis window as mentioned above are interpolated temporally to synthesize speech, features of the speech waveform cannot be reproduced with sufficient accuracy when the pitch frequency, magnitude and spectrum characteristic of the speech undergo rapid changes. Furthermore, since an excitation signal composed only of the pitch-frequency impulse sequence and the white noise is insufficient for reproducing features of various speech waveforms, it is difficult to produce highly natural-sounding synthesized speech. To improve the quality of the synthesized speech in the linear predictive vocoder, it is considered in the art to use an excitation which permits more accurate reproduction of features of the speech waveform.
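The excitation-and-filter structure of the linear predictive vocoder described above can be illustrated by the following minimal sketch. It is not taken from the cited reference. The function names `make_excitation` and `all_pole_filter`, the 8 kHz sampling rate, the 80-sample pitch period and the two filter coefficients are all assumptions chosen only for illustration.

```python
import random

def make_excitation(n_samples, voiced, pitch_period, gain, seed=0):
    """Impulse train at the pitch period for voiced frames, white noise otherwise."""
    if voiced:
        return [gain if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(excitation, a):
    """Synthesis filter 1/A(z) with A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p.

    Implements y[n] = x[n] - sum_k a[k-1] * y[n-k].
    """
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y

# One 30 msec frame at an assumed 8 kHz sampling rate (240 samples),
# voiced, with an assumed pitch of 100 Hz (period of 80 samples).
frame = make_excitation(240, voiced=True, pitch_period=80, gain=1.0)
# Hypothetical stable second-order envelope filter (not from the source).
speech = all_pole_filter(frame, a=[-1.2, 0.6])
```

Because the only excitation parameters in this sketch are the voiced/unvoiced flag, the pitch period and the gain, every voiced pitch period in the frame is excited identically, which reflects the limitation noted above when the speech changes rapidly within the analysis window.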
On the other hand, multipulse predictive coding is a method that uses an excitation of higher reproducibility than that of the conventional vocoder. With this method, the excitation signal is expressed using a plurality of impulses, and two all-pole filters representing the proximity (short-term) correlation and pitch correlation characteristics of speech are excited by the excitation signal to synthesize the speech. The temporal positions and magnitudes of the impulses are selected such that the error between the original input speech waveform and the synthesized speech waveform is minimized. This is described in detail in B. S. Atal, "A New Model of LPC Excitation for Producing Natural-Sounding Speech at Low Bit Rates," IEEE Int. Conf. on ASSP, pp. 614-617, 1982. With multipulse predictive coding, the speech quality can be enhanced by increasing the number of impulses used, but when the bit rate is low, the number of impulses is limited; consequently, reproducibility of the speech waveform is impaired and sufficient speech quality cannot be obtained. It is considered in the art that a bit rate of about 8 kb/s is needed to produce high speech quality.
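The selection of impulse positions and magnitudes by error minimization can be sketched as follows. This is a simplified greedy search under stated assumptions, not the procedure of the cited paper: it uses a single synthesis filter, omits the pitch-correlation filter and any perceptual weighting, and the names `impulse_response` and `multipulse_search` are hypothetical.

```python
def impulse_response(a, length):
    """Impulse response of 1/A(z), A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p."""
    h = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * h[n - k]
        h.append(acc)
    return h

def multipulse_search(target, h, n_pulses):
    """Greedily place pulses so the squared error to `target` shrinks most.

    `h` must be at least as long as `target`. Returns the chosen
    (position, magnitude) pairs and the remaining error signal.
    """
    n = len(target)
    residual = list(target)
    pulses = []
    for _ in range(n_pulses):
        best = None
        for pos in range(n):
            seg = h[: n - pos]  # filter response truncated at the frame end
            energy = sum(v * v for v in seg)
            if energy == 0.0:
                continue
            corr = sum(residual[pos + i] * seg[i] for i in range(len(seg)))
            reduction = corr * corr / energy  # error removed by the best gain
            if best is None or reduction > best[0]:
                best = (reduction, pos, corr / energy)
        if best is None:
            break
        _, pos, gain = best
        for i in range(n - pos):
            residual[pos + i] -= gain * h[i]
        pulses.append((pos, gain))
    return pulses, residual

# Usage: a target made from one pulse of magnitude 2.0 at position 3.
h = impulse_response([-0.5], 16)
target = [0.0] * 16
for i in range(13):
    target[3 + i] += 2.0 * h[i]
pulses, residual = multipulse_search(target, h, n_pulses=1)
```

Each additional pulse lowers the remaining error, which is why the attainable quality in this scheme depends on how many pulses the bit rate allows per frame.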
In multipulse predictive coding, the excitation is determined so that the input speech waveform itself is reproduced. On the other hand, there has also been proposed a method in which a phase-equalized speech signal, obtained by equalizing the phase component of the speech waveform to a certain phase, is subjected to multipulse predictive coding, as set forth in U.S. Pat. No. 4,850,022 issued to the inventor of this application. This method improves the speech quality at low bit rates, because the number of impulses needed for reproducing the excitation signal can be reduced by removing from the speech waveform the phase component of speech, to which human hearing is relatively insensitive. With this method, however, when the bit rate drops to 4.8 kb/s or so, the number of impulses becomes insufficient for reproducing features of the speech waveform with high accuracy, and high-quality speech cannot be produced either.
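The idea of equalizing the phase component while keeping the amplitude spectrum can be illustrated by the following sketch. It assumes that the "certain phase" is zero phase and it operates on a whole short segment by a direct discrete Fourier transform. Both choices are illustrative assumptions and are not stated to be the method of the cited patent. The function names are hypothetical.

```python
import cmath

def dft(x):
    """Direct O(N^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def equalize_to_zero_phase(x):
    """Keep each frequency component's magnitude and set its phase to zero."""
    magnitudes = [abs(c) for c in dft(x)]
    # For a real input the magnitude spectrum is symmetric, so the inverse
    # transform is real up to rounding; the tiny imaginary part is dropped.
    return [v.real for v in idft(magnitudes)]

# Usage on a short, arbitrary test segment (values are illustrative only).
segment = [0.0, 1.0, 0.5, -0.3, 0.2, 0.0, -0.1, 0.0]
equalized = equalize_to_zero_phase(segment)
```

In this sketch the amplitude spectrum of the segment is unchanged, while the largest sample of the equalized signal falls at index 0, so the waveform is concentrated around one instant. A signal of this shape can be approximated with fewer impulses, which is consistent with the reduction in the number of impulses described above.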