The present invention relates to improvements in a speech analysis-synthesis apparatus.
A method in which speech is separated into spectral envelope information, which mainly carries phonemic information such as the Japanese vowels "a" or "i", and source information, which carries accent and intonation, so that each may be processed or transmitted, is called the "source coding method". Examples are the PARCOR (Partial Auto-Correlation) coding method and the LSP (Line Spectrum Pair) coding method.
The source coding method can compress speech information, so it is well suited to applications such as voice mail, toys and educational devices. The information separability described above is also indispensable for speech synthesis by rule. In the source coding method of the prior art, as shown in FIG. 1(a), either white noise 1 or an impulse train 2, each serving as a model sound source, is switched in as the source information. The source information applied to the synthesizer therefore consists of (1) voiced/unvoiced information 3, (2) amplitude information 4, and (3) a pitch period (or pitch, or fundamental frequency) 5.
More specifically, the voiced/unvoiced information (1) selects the source: the impulse train is generated in the voiced case, whereas the white noise is generated in the unvoiced case. The amplitude of either signal is given by the amplitude information (2), and the interval at which the impulses of the train are generated is given by the pitch period (3).
The use of such model sound sources causes the following speech quality degradations, so that analysis-synthesis speech produced by the source coding method of the prior art has been unable to exceed a certain limit in quality:
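The switching between the two model sound sources described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the circuit of FIG. 1(a): the function name `make_excitation`, the frame-based interface, the Gaussian noise generator and the `seed` parameter are all hypothetical choices made for the example.

```python
import random


def make_excitation(voiced, amplitude, pitch_period, frame_length, seed=0):
    """Return one frame of model excitation as a list of floats.

    voiced       -- voiced/unvoiced information (1)
    amplitude    -- amplitude information (2)
    pitch_period -- pitch period in samples (3); used only when voiced
    frame_length -- number of samples to generate (assumed for this sketch)
    seed         -- noise seed (assumed, only to make the sketch repeatable)
    """
    if voiced:
        # Voiced case: impulse train, one impulse every pitch_period samples.
        return [amplitude if n % pitch_period == 0 else 0.0
                for n in range(frame_length)]
    # Unvoiced case: white noise scaled by the amplitude.
    # Gaussian noise is an assumption; the text only says "white noise".
    rng = random.Random(seed)
    return [amplitude * rng.gauss(0.0, 1.0) for _ in range(frame_length)]
```

For example, `make_excitation(True, 2.0, 4, 10)` places an impulse of height 2.0 at samples 0, 4 and 8 and zeros elsewhere, while `make_excitation(False, 2.0, 4, 10)` ignores the pitch period and returns ten noise samples.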
(1) Speech quality degradation due to the misjudgement of the voiced/unvoiced information in the analysis;
(2) Speech quality degradation due to an erroneous pitch extraction or detection;
(3) Speech quality degradation due to the incomplete separation of the formant component from the pitch component in sounds such as "i" or "u";
(4) Speech quality degradation caused by the limitation of the AR (Auto-Regressive) model of the PARCOR coding method, which cannot carry the zero (anti-pole) information of the spectrum; and
(5) Speech quality degradation caused by the loss of the non-stationary component or the fluctuation information that is important for the naturalness of speech.
One means for eliminating these causes of speech quality degradation is the "Multi-Pulse Exciting Method" (hereafter referred to as the MPE method), in which a plurality of pulses generated within one pitch period (or, in the unvoiced case, within a corresponding period) are used as the sound source in place of the "single impulse/white noise" source of the prior art.
References relating to this excitation method include the following:
(1) B. S. Atal and J. R. Remde: A New Model of LPC Excitation for Producing Natural-Sounding Speech at Low Bit Rates, Proc. ICASSP82, pp. 614-617 (1982);
(2) Ozawa, Arazeki and Ono: Examinations of Speech Coding Method of Multi-Pulse Exciting Type, Reports of Communication Association, CS82-161, pp. 115-122 (1983-3); and
(3) Ozawa, Ono and Arazeki: Improvements in Quality of Speech Coding Method of Multi-Pulse Exciting Type, Materials of Speech Research Party of Japanese Audio Association, S83-78 (1984-1).
This multi-pulse method is shown schematically in FIG. 1(b). This excitation method does improve the quality of synthesized speech, but a problem remains: the quality saturates and cannot be improved beyond a certain level even if the quantity of speech information (e.g., the number of pulses) is increased.
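To make the difference from the single-impulse source concrete, the following sketch assembles a multi-pulse excitation frame from a list of pulses and passes it through an all-pole synthesis filter. It is an illustration under assumptions only: the function names, the (position, amplitude) pulse representation and the sign convention of the filter coefficients are hypothetical, and the sketch does not show how the cited references determine the pulse positions and amplitudes during analysis.

```python
def multipulse_excitation(pulses, frame_length):
    """Build one excitation frame from (position, amplitude) pairs.

    The pair representation is an assumption made for this sketch.
    """
    frame = [0.0] * frame_length
    for position, amplitude in pulses:
        if 0 <= position < frame_length:
            # Several pulses may fall within one pitch period.
            frame[position] += amplitude
    return frame


def all_pole_synthesis(excitation, coeffs):
    """Filter the excitation with an all-pole (AR) synthesis filter.

    Assumed convention: y[n] = x[n] - sum_k coeffs[k-1] * y[n-k].
    """
    out = []
    for n, x in enumerate(excitation):
        y = x
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                y -= a * out[n - k]
        out.append(y)
    return out
```

For example, `multipulse_excitation([(0, 1.0), (3, -0.5)], 6)` yields a frame with two pulses of different amplitude, which `all_pole_synthesis(frame, [-0.5])` then shapes with a single-pole spectral envelope. Increasing the number of pulses per frame corresponds to increasing the quantity of speech information mentioned above.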