The present invention relates to a speech analysis and synthesis system and apparatuses thereof, in which a spectrum parameter analyzed based on the cepstrum, and a sound source signal obtained according thereto, are analyzed for each of a plurality of speech units (for example, several hundred CV and VC units) used for synthesis; the sound source signal is controlled with respect to its prosody (pitch, amplitude, time duration, etc.); and a synthesizing filter is driven with the sound source signal to synthesize speech.
There is a known system for synthesizing arbitrary words in which linear predictive coefficients obtained by linear predictive analysis or the like are used as the spectrum parameter for each speech unit, the speech unit is analyzed with the spectrum parameter to obtain a predictive residual signal, a part of which is used as the sound source signal, and a synthesizing filter constituted according to the linear predictive coefficients is driven by this sound source signal to synthesize speech. Such a method is disclosed in detail, for example, in the paper authored by Sato and entitled "Speech Synthesis based on CVC and Sound Source Element (SYMPLE)", Transaction of the Committee on Speech Research, The Acoustic Society of Japan, S83-69, 1984 (hereinafter referred to as "reference 1"). According to the method of reference 1, LSP coefficients are used as the linear predictive coefficients; the predictive residual signal obtained through linear predictive analysis of the original speech unit is used as the sound source signal in an unvoiced period; and a predictive residual signal sliced from one representative pitch period of a vowel interval is used as the sound source signal in a voiced period, to drive the synthesizing filter and thereby synthesize speech. This method improves speech quality compared with another method in which a train of impulses is used in the voiced period and a noise signal is used in the unvoiced period.
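The analysis and synthesis chain described above (linear predictive analysis, extraction of the predictive residual, and driving of a synthesizing filter by that residual) can be illustrated by the following minimal sketch. It is not the implementation of reference 1: it assumes plain linear predictive coefficients rather than LSP coefficients, the autocorrelation method solved by the Levinson-Durbin recursion, and zero initial filter states. All function names are hypothetical and chosen for illustration only.

```python
def autocorr(x, order):
    """Autocorrelation r[0..order] of a finite signal x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations (assumed sign convention:
    A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p).
    Returns (a, err) with a[0] = 1 and err the final prediction error."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err


def inverse_filter(x, a):
    """Predictive residual e[n] = sum_k a[k] x[n-k] (analysis filter A(z))."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]


def synthesis_filter(e, a):
    """All-pole synthesizing filter 1/A(z) driven by the sound source e."""
    y = []
    for n in range(len(e)):
        acc = e[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y
```

When the unmodified residual of a speech unit drives the filter built from the same coefficients, the original waveform is recovered up to rounding error. The distortion discussed in the background arises only once the residual is altered, for example by changing its pitch period.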
In speech synthesis, and particularly in arbitrary word synthesis, a plurality of speech units are concatenated to synthesize speech. In order to give the synthesized speech the intonation of natural human speech, it is necessary to change the pitch period of the speech signal or the sound source signal according to prosodic information or a prosodic rule. However, in the method of reference 1, when the pitch period of the residual signal serving as the sound source in the voiced period is changed, the pitch period of the original speech unit used in analyzing the coefficients of the synthesizing filter differs from that of the speech to be synthesized, so that a mismatch arises between the changed pitch of the residual signal and the spectrum envelope of the synthesizing filter. Consequently, the spectrum of the synthesized speech is considerably distorted, which causes serious drawbacks: the synthesized speech is greatly distorted, noise is superimposed, and clarity is greatly reduced. As a first problem, these drawbacks are particularly noticeable when the pitch period is changed greatly for a female speaker, whose pitch period is short.
Further, as in the case of reference 1, LPC analysis has conventionally been used frequently to analyze the spectrum parameter representing the spectrum envelope of the speech signal. In principle, however, the LPC analysis method has a drawback in that the predicted spectrum envelope is easily affected by the pitch structure of the speech signal to be analyzed. This drawback is particularly pronounced for vowels ("i", "u", "o", etc.) and nasal consonants, in which the first Formant frequency and the pitch frequency are close to each other, as in the case of a female speaker who has a high pitch frequency. In the LPC analysis, the prediction of the Formant is affected by the pitch frequency, causing a shift of the Formant frequency and an underestimation of the bandwidth. Accordingly, there is a second problem that speech quality is greatly degraded when the pitch is changed for synthesis, particularly in the case of a female speaker.
Moreover, in the foregoing method of reference 1, the predictive residual signal of one representative pitch interval of a vowel interval is, in general, used repeatedly throughout that vowel interval, so that the change over time in the spectrum and phase of the residual signal cannot be fully represented in vowel intervals. Consequently, there has been a third problem that the speech quality is degraded in the vowel intervals.
With regard to the first problem, there is a known method that solves the problem to some extent, in which the Formant peak in the lower range of the spectrum envelope is shifted to coincide with the position of the pitch frequency when effecting synthesis. Such a method is disclosed, for example, in a paper authored by Sagisaka et al. and entitled "Synthesizing Method of Spectrum Envelope in Taking Account of Pitch Structure", The Acoustic Society of Japan, lecture Gazette pages 501-502, October 1979 (hereinafter referred to as "reference 2"). However, in the foregoing method of reference 2, the Formant peak position is merely shifted to that of the changed pitch frequency, which is not a fundamental solution, and this causes another problem that the clarity and speech quality are degraded due to the shift of the Formant position.
With regard to the second problem, in order to reduce the effect of the pitch structure, various analysis methods have been proposed, such as the Cepstrum method, the LPC Cepstrum analysis method, which is intermediate between the foregoing LPC analysis and the Cepstrum method, and the modified Cepstrum method, which is a modification of the Cepstrum method. Further, a method has been proposed to constitute a synthesizing filter directly from these Cepstrum coefficients. The Cepstrum method is disclosed, for example, in a paper authored by Oppenheim et al. and entitled "Homomorphic analysis of speech", IEEE Trans. Audio & Electroacoustics, AU-16, p. 221, 1968 (hereinafter referred to as "reference 3"). With regard to the LPC Cepstrum method, there is a known method of converting the linear predictive coefficients obtained by the LPC analysis into the Cepstrum. Such a method is disclosed, for example, in a paper authored by Atal et al. and entitled "Effectiveness of Linear Prediction Characteristics of the Speech Wave for Automatic Speaker Identification and Verification", J. Acoustical Soc. America, pp. 1304-1312, 1974 (hereinafter referred to as "reference 4"). Further, the modified Cepstrum method is disclosed, for example, in a paper authored by Imai et al. and entitled "Extraction of Spectrum Envelope According to Modified Cepstrum Method", Journal of Electro Communication Society, J62-A, pp. 217-223, 1979 (hereinafter referred to as "reference 5"). The method of constructing a synthesizing filter directly from Cepstrum coefficients is disclosed, for example, in a paper authored by Imai et al. and entitled "Direct Approximation of Logarithmic Transmission Characteristic in Digital Filter", Journal of Electro Communication Society, J59-A, pp. 157-164, 1976 (hereinafter referred to as "reference 6"). Detailed explanation of these methods is therefore omitted here.
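For orientation only, the conversion from linear predictive coefficients into Cepstrum coefficients mentioned in connection with the LPC Cepstrum method can be sketched with the widely used recursion for an all-pole model. The sketch assumes the sign convention A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p with the model 1/A(z), and omits the gain term c[0]. The function name is hypothetical, and the notation may differ from that of reference 4.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of the all-pole model 1/A(z).

    a      : [1, a1, ..., ap], coefficients of A(z) (assumed convention)
    n_ceps : number of Cepstrum coefficients to compute (may exceed p)
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)                # c[0] (log gain) is left unused
    for n in range(1, n_ceps + 1):
        a_n = a[n] if n <= p else 0.0       # a[n] = 0 beyond the model order
        acc = 0.0
        for m in range(1, n):
            if n - m <= p:
                acc += m * c[m] * a[n - m]
        c[n] = -a_n - acc / n
    return c[1:]
```

As a check on the recursion, a single real pole at z = r (that is, A(z) = 1 - r z^-1) has the closed-form Cepstrum c[n] = r^n / n, which the sketch reproduces.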
However, although the Cepstrum analysis method and the modified Cepstrum analysis method can solve the aforementioned problem of the LPC analysis, the structure of a synthesizing filter that uses these coefficients directly is considerably complicated, requires a great amount of calculation, and introduces delay, thereby causing another problem that the device is not easy to construct.