The present invention relates to a speech analysis-synthesis system and, in particular, to such a system of the linear prediction type for narrow band transmission of a speech signal.
A linear prediction type speech analysis-synthesis system is advantageous for high speed digital transmission of a speech signal. The general concept of the linear prediction type speech analysis-synthesis system is that a transmit side separates an input speech signal into an exciting signal and spectrum information (vocal tract information), and the two are transmitted separately. A receive side then synthesizes the original speech by applying the spectrum information received from the transmit side to the exciting information, which is a pulse signal in the case of a voiced sound, or white noise in the case of an unvoiced sound. The linear prediction type speech analysis-synthesis system has the features that (1) the spectrum information (vocal tract information) is expressed by an all pole filter H(Z): ##EQU1## and that (2) the exciting information in the receive side is either a periodic pulse signal or white noise, or a combination of those signals. Accordingly, for synthesizing speech in the receive side, it is enough to transmit the coefficients α_i of the all pole filter, the average amplitude or average energy V_0 of the speech signal, and the information indicating whether the speech is a voiced sound or an unvoiced sound (V/UV). In the case of a voiced sound, the period of the pulse signal which is used as the driving signal is also transmitted.
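The receive-side synthesis described above can be sketched as follows. This is only an illustrative sketch, not the system's actual implementation. It assumes the sign convention S_t = Σ α_i S_{t-i} + gain × e_t, a pitch period expressed in samples, and hypothetical function names `excitation` and `synthesize`.

```python
import random


def excitation(n, voiced, pitch_period=None, seed=0):
    """Return n samples of excitation (hypothetical helper).

    Voiced: a unit pulse every `pitch_period` samples.
    Unvoiced: white Gaussian noise from a seeded generator.
    """
    if voiced:
        if pitch_period is None or pitch_period <= 0:
            raise ValueError("a positive pitch_period is required for voiced sound")
        return [1.0 if t % pitch_period == 0 else 0.0 for t in range(n)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def synthesize(alpha, gain, excite):
    """Drive an all pole filter with the excitation.

    Assumed convention: out[t] = gain * excite[t] + sum_i alpha[i-1] * out[t-i].
    Samples before t = 0 are treated as zero.
    """
    p = len(alpha)
    out = []
    for t, e in enumerate(excite):
        acc = gain * e
        for i in range(p):
            if t - 1 - i >= 0:
                acc += alpha[i] * out[t - 1 - i]
        out.append(acc)
    return out
```

For instance, `synthesize([0.5], 2.0, [1.0, 0.0, 0.0])` returns `[2.0, 1.0, 0.5]`: a single pulse decays through the one-coefficient filter.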
The fact that the spectrum information is expressed by an all pole filter ##EQU2## corresponds to the fact that the speech signal S_t at a given time can be predicted from the p preceding signals S_{t-i} (i=1 through p) in the form of ##EQU3## in the least squares error sense. Further, since prediction in the above sense is possible, there exists a strong correlation between adjacent signals. The coefficient α_i is called a linear prediction coefficient or spectrum information.
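The equation images referenced in the text are not reproduced here. As a hedged sketch only, under the common convention in which the predictor is a plain weighted sum of past samples, the all pole filter, the prediction, and the prediction error would take the following form. The sign attached to α_i is an assumption, and the original equations may use the opposite sign.

```latex
H(z) = \frac{1}{1 - \sum_{i=1}^{p} \alpha_i z^{-i}},
\qquad
S_t' = \sum_{i=1}^{p} \alpha_i S_{t-i},
\qquad
\varepsilon_t = S_t - S_t'
```

Under this convention, S_t = Σ α_i S_{t-i} + ε_t, so passing ε_t through H(z) returns S_t.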
On the other hand, the exciting information is provided by obtaining the linear prediction coefficients from a time series signal S_t, forming an exciting signal ε_t which is the difference between the original time series signal S_t and the predicted time series signal S_t', and deriving the amplitude and the nature of the exciting information from the value ε_t. Equivalently, the exciting signal ε_t is obtained by removing the adjacent correlation components from the time series signal S_t.
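The analysis step can be illustrated with a minimal sketch. The text does not say how the least squares problem is solved; the autocorrelation method with the Levinson-Durbin recursion is used here as one common choice, and that choice is an assumption. The function names are hypothetical, and the predictor is assumed to be S_t' = Σ α_i S_{t-i}.

```python
def autocorrelation(frame, max_lag):
    """r[k] = sum_t frame[t] * frame[t + k] for k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[t] * frame[t + k] for t in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for the prediction coefficients.

    Returns (alpha, err): alpha[i-1] multiplies S_{t-i}; err is the
    remaining prediction error energy.
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predicted frame; stop early
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


def residual(signal, alpha):
    """Exciting signal: eps[t] = S_t - sum_i alpha[i-1] * S_{t-i}.

    Samples before t = 0 are treated as zero.
    """
    p = len(alpha)
    eps = []
    for t in range(len(signal)):
        pred = sum(alpha[i] * signal[t - 1 - i]
                   for i in range(p) if t - 1 - i >= 0)
        eps.append(signal[t] - pred)
    return eps
```

For a signal that is fully explained by its past, such as a decaying exponential, the residual is close to zero everywhere except at the first sample, which is the sense in which the adjacent correlation has been removed.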
In analyzing a speech signal, it is assumed that the spectrum information and the exciting information are constant over a short duration (for instance 30 msec). Therefore, an input speech signal is picked up through an analysis window (the width of which is, for instance, 20 msec), the speech signal within that window duration is analyzed, and the average features of the speech signal in that window are transmitted.
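The windowing step can be sketched as below. The sampling rate of 8000 Hz used in the example is an assumption and does not come from the text. The function names are hypothetical, and a plain rectangular window without overlap is used for simplicity.

```python
def frames(signal, sample_rate_hz, window_ms=20.0, hop_ms=None):
    """Cut the signal into fixed-width analysis windows.

    window_ms follows the 20 msec example in the text; hop_ms defaults
    to the window width, giving non-overlapping windows.
    """
    win = int(round(sample_rate_hz * window_ms / 1000.0))
    hop = win if hop_ms is None else int(round(sample_rate_hz * hop_ms / 1000.0))
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must each cover at least one sample")
    return [signal[start:start + win]
            for start in range(0, len(signal) - win + 1, hop)]


def average_energy(frame):
    """Mean squared amplitude of one window, one candidate for V_0."""
    return sum(x * x for x in frame) / len(frame)
```

At an assumed 8000 Hz, a 20 msec window holds 160 samples, so 480 samples yield three non-overlapping windows, each of which would be analyzed and transmitted as one set of average features.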
Although a prior speech analysis-synthesis system of the linear prediction type can provide synthesized speech with sufficient intelligibility, it is still not satisfactory for distinguishing individual speakers. The important reasons for this are that (1) an actual driving signal ε_t cannot be approximated by a pulse train in the case of a voiced sound, although a prior system utilizes a pulse train or white noise as the exciting signal or driving signal, and (2) the spectrum information is not constant during 20 or 30 msec. That disadvantage might be overcome by transmitting the driving signal or exciting signal ε_t in full. However, doing so requires a rather wide frequency band, and is therefore not compatible with narrow band transmission.
Further, a prior system has difficulty synthesizing an explosive (plosive) sound (p, t or k), since the width of the analysis window is fixed (for instance, 20 msec as mentioned above). The spectrum information and/or exciting information of an explosive sound is not constant during 20 msec, and it is preferable that the width of the analysis window be less than 5 msec when an explosive sound is analyzed or synthesized. However, if the analysis window is made less than 5 msec for all input speech, a voiced sound is not analyzed clearly. That is to say, a voiced sound has a pitch period, which is usually 15 msec, and therefore, if a voiced sound is analyzed with an analysis window shorter than that pitch period, the result of the analysis is not satisfactory.
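The conflict between window width and pitch period can be made concrete with a small sketch. It counts how many pitch pulses fall inside each analysis window, using the 20 msec, 5 msec and 15 msec figures from the text. The function name is hypothetical, and placing the first pulse at time zero is an assumption made for illustration.

```python
def pulses_per_window(total_ms, window_ms, pitch_period_ms):
    """Count pitch pulses inside each consecutive analysis window.

    All arguments are whole milliseconds; pulses are assumed to occur
    at 0, pitch_period_ms, 2 * pitch_period_ms, and so on.
    """
    pulse_times = range(0, total_ms, pitch_period_ms)
    counts = []
    for start in range(0, total_ms - window_ms + 1, window_ms):
        counts.append(sum(1 for p in pulse_times
                          if start <= p < start + window_ms))
    return counts
```

With a 15 msec pitch period over 60 msec, 20 msec windows give `[2, 1, 1]`, so every window sees at least one pulse, whereas 5 msec windows give `[1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]`, so most windows contain no pitch pulse at all.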