The invention relates to a speech synthesis method intended to prevent two kinds of quality degradation in synthesized speech: the degradation that occurs in text-to-speech conversion using speech segments when the fundamental frequency pattern of the speech to be produced deviates significantly from that of the speech segments, and the degradation that occurs in the analysis and synthesis of speech when the synthesized speech deviates significantly from the fundamental frequency pattern of the original speech.
In the prior art, text-to-speech conversion is performed by cutting out a one-period waveform from a pre-recorded speech segment at every fundamental period and rearranging the waveforms in conformity with a fundamental frequency pattern produced from an analysis of the text. This technique is referred to as the PSOLA technique and is disclosed, for example, in M. Moulines et al., "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones", Speech Communication, vol. 9, pp. 453-467 (1990-12).
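The rearrangement described above can be illustrated with a minimal sketch. It is not the implementation of the cited reference: it assumes that the pitch marks are already known, that the target fundamental period is constant, and that each waveform is cut out with a Hann window two local periods long, which is a common choice but is not stated in the text. The names `hann`, `psola`, `marks` and `target_period` are hypothetical.

```python
import math


def hann(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def psola(signal, marks, target_period):
    """Re-space pitch-synchronous waveforms at a constant target period.

    signal        -- list of float samples of the recorded segment
    marks         -- increasing sample indices of pitch marks (assumed known,
                     at least two of them)
    target_period -- desired spacing of the output pitch marks, in samples
    """
    out = [0.0] * len(signal)
    t = marks[0]  # position of the current output pitch mark
    while t <= marks[-1]:
        # Pick the recorded pitch mark nearest to the output position.
        k = min(range(len(marks)), key=lambda i: abs(marks[i] - t))
        # Estimate the local period from the neighbouring pitch mark.
        if k + 1 < len(marks):
            period = marks[k + 1] - marks[k]
        else:
            period = marks[k] - marks[k - 1]
        # Cut out a windowed waveform centred on the recorded mark and
        # overlap-add it centred on the output mark.
        for j, w in enumerate(hann(2 * period + 1)):
            src = marks[k] - period + j
            dst = t - period + j
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                out[dst] += w * signal[src]
        t += target_period
    return out
```

When `target_period` equals the recorded period, the overlapping windows sum to one and the segment is reproduced between the first and last pitch marks. A different `target_period` changes the spacing of the waveforms, and hence the fundamental frequency, while the shape of each waveform, and with it the spectrum, is left as recorded.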
In analysis and synthesis, an original speech is analyzed to retain its spectral features, which are then used to synthesize the original speech.
In the prior art, the quality of synthesized speech is markedly degraded if the fundamental frequency pattern of the speech to be synthesized deviates significantly from the fundamental frequency pattern of the pre-recorded speech segment. For details, refer to T. Hirokawa et al., "Segment Selection and Pitch Modification for High Quality Speech Synthesis using Waveform Segments", ICSLP90, pp. 337-340, and D. H. Klatt et al., "Analysis, synthesis, and perception of voice quality variations among female and male talkers", J. Acoust. Soc. Am. 87(2), February 1990, pp. 820-857. Accordingly, in the conventional PSOLA technique, rearranging the waveform directly in conformity with the fundamental frequency pattern produced from the analysis of the text may cause a substantial quality degradation, and it has been necessary to resort to a flat waveform exhibiting minimal variation in the fundamental frequency pattern.
The quality degradation of synthesized speech that results from largely changing the fundamental frequency of a speech segment is considered to be caused by an acoustical mismatch between the fundamental frequency and the spectrum. Synthesized speech of good quality can therefore be obtained by providing many speech segments whose spectral structure matches the fundamental frequency well. However, it is difficult to utter every speech segment at each desired fundamental frequency, and even if this were possible, the required storage capacity would be so large as to make implementation prohibitive.
In view of this, Japanese Laid-Open Patent Application No. 171,398 (laid open Oct. 21, 1982) proposes that spectrum envelope parameter values for a plurality of voices having different fundamental frequencies be stored for each vocal sound, and that the spectrum envelope parameter for the closest fundamental frequency be chosen for use. This has the drawback that the quality improvement is minimal because only a small number of fundamental frequencies are available, while the required storage capacity nevertheless becomes large.
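The selection step of such a scheme can be sketched as a nearest-neighbour lookup. This is only an illustration under assumptions and is not taken from the cited application: the function name `select_envelope`, the dictionary layout, and the example parameter values are all hypothetical.

```python
def select_envelope(stored, target_f0):
    """Return the stored entry whose fundamental frequency is closest.

    stored    -- dict mapping a recorded fundamental frequency in Hz to the
                 spectrum envelope parameters stored for one vocal sound
                 (hypothetical layout)
    target_f0 -- fundamental frequency wanted for synthesis, in Hz
    """
    nearest = min(stored, key=lambda f0: abs(f0 - target_f0))
    return nearest, stored[nearest]


# Hypothetical parameters for one vocal sound, stored at two frequencies.
envelopes = {100.0: [0.9, 0.4, 0.1], 200.0: [0.7, 0.5, 0.3]}
chosen_f0, params = select_envelope(envelopes, 140.0)
```

With only two stored frequencies, a request for 140 Hz falls back to the 100 Hz parameters, a 40 Hz mismatch. Reducing that mismatch requires storing more frequencies for each vocal sound, which is the trade-off between quality and storage capacity noted above.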
In Japanese Laid-Open Patent Application No. 104,795/95 (laid open Apr. 21, 1995), a human voice is modelled to prepare a conversion rule, and the spectrum is modified as the fundamental frequency changes. With this technique, the voice modelling is not always accurate, so the conversion rule cannot properly match the human voice and an improvement in quality cannot be expected.
A modification of the fundamental frequency and the spectrum for the purpose of speech synthesis is proposed in Assembly of Lecture Manuscripts, pp. 337 to 338, of a meeting held in March 1996 by the Acoustical Society of Japan. The proposal is directed to a rough transformation that spreads an interval in the spectrum as the fundamental frequency F0 increases, and it cannot provide synthesized speech of good quality.
In the analysis and synthesis, there remains a problem of a quality degradation of synthesized speech when the synthesized speech to be produced has a pitch periodicity which significantly differs from the pitch periodicity of an original speech.
It is to be noted that the present invention has been published in part or in whole by the present inventors at times later than the claimed priority date of the present Application in the following institutes and associations and their associated journals:
A. Kimihiko Tanaka and Masanobu Abe, "A New Fundamental Frequency Modification Algorithm With Transformation of Spectrum Envelope According to F0", 1997 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 97) Vol. II, pp. 951-954, The Institute of Electrical and Electronics Engineers (IEEE) Signal Processing Society, Apr. 21-24, 1997.
B. Kimihiko Tanaka and Masanobu Abe, "Text Speech Synthesis System Modifying Spectrum Envelope in accordance with Fundamental Frequency", Institute of Electronics, Information and Communication Engineers of Japan, Research Report Vol. 96, No. 566, pp. 23-30, SP96-130, Mar. 7, 1997 (published on 6th). Corporation: Institute of Electronics, Information and Communication Engineers of Japan.
C. Kimihiko Tanaka and Masanobu Abe, "Speech Synthesis Technique Modifying Spectrum Envelope according to F0", in Assembly of Lecture Manuscripts I, pp. 217-218, for 1997 Spring Meeting of Acoustical Society of Japan held on Mar. 17, 1997. Corporation: Acoustical Society of Japan.
D. Domestic Divulgation and Assembly of Manuscripts: Kimihiko Tanaka and Masanobu Abe, "Speech Synthesis Technique Modifying Spectrum Envelope according to Fundamental Frequency", in Assembly of Lecture Manuscripts I, pp. 217-218, for 1996 Autumn Meeting of Acoustical Society of Japan held on Sep. 25, 1996. Corporation: Acoustical Society of Japan.