Artificial generation of a speech signal from an arbitrary sentence is called “text-to-speech synthesis”. In general, text-to-speech synthesis includes three steps: language processing, prosody processing, and speech synthesis.
First, a language processing section morphologically and semantically analyzes an input text. Next, a prosody processing section processes the accent and intonation of the text based on the analysis result, and outputs a phoneme sequence and prosodic information (fundamental frequency, phoneme segmental duration, and power). Third, a speech synthesis section synthesizes a speech signal based on the phoneme sequence and prosodic information. In this way, text-to-speech synthesis is realized.
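The three-stage pipeline described above can be sketched as follows. This is a minimal illustration: the function names, the `Prosody` container, and the placeholder prosodic values are assumptions for exposition, not part of the source; a real system would perform morphological analysis and waveform generation at each stage.

```python
from dataclasses import dataclass

@dataclass
class Prosody:
    phonemes: list    # phoneme symbol sequence
    f0: list          # fundamental frequency targets (Hz)
    durations: list   # phoneme segmental durations (s)
    power: list       # power per phoneme

def language_processing(text: str) -> list:
    """Language processing stage: here, a naive split into lowercase tokens
    stands in for morphological/semantic analysis."""
    return text.lower().split()

def prosody_processing(tokens: list) -> Prosody:
    """Prosody processing stage: assign placeholder prosodic targets
    (constant f0, duration, and power) to each token."""
    n = len(tokens)
    return Prosody(phonemes=tokens,
                   f0=[120.0] * n,
                   durations=[0.2] * n,
                   power=[1.0] * n)

def speech_synthesis(prosody: Prosody) -> list:
    """Speech synthesis stage: stubbed to return one (phoneme, f0, duration)
    'frame' per phoneme instead of an actual waveform."""
    return [(p, f, d) for p, f, d in
            zip(prosody.phonemes, prosody.f0, prosody.durations)]

tokens = language_processing("Hello world")
prosody = prosody_processing(tokens)
frames = speech_synthesis(prosody)
```

Each stage consumes only the previous stage's output, mirroring the section-by-section structure of the text.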
The principle of a synthesizer that synthesizes an arbitrary phoneme symbol sequence is explained as follows. Assume that a vowel is represented as “V” and a consonant as “C”. Feature parameters (speech units) of base units such as CV, CVC, and VCV are stored in advance. Speech is synthesized by concatenating the speech units while controlling their pitch and duration. In this method, the quality of the synthesized speech largely depends on the stored speech units.
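A minimal sketch of this concatenative principle follows. The unit inventory, the tone waveforms, and the nearest-index resampling used for duration control are all illustrative assumptions; a real synthesizer stores recorded speech-unit parameters and uses proper pitch and duration modification.

```python
import numpy as np

fs = 8000  # sampling rate (Hz), assumed

def tone(freq, dur):
    """Placeholder 'speech unit' waveform: a sinusoid of given duration."""
    return np.sin(2 * np.pi * freq * np.arange(int(fs * dur)) / fs)

# Hypothetical inventory of stored base units (CV/VCV names are illustrative)
inventory = {"ka": tone(440.0, 0.15), "ai": tone(520.0, 0.12), "i": tone(600.0, 0.10)}

def synthesize(unit_names, duration_scale=1.0):
    """Concatenate stored units; duration is controlled crudely by
    resampling each unit to duration_scale times its stored length."""
    parts = []
    for name in unit_names:
        w = inventory[name]
        m = int(round(len(w) * duration_scale))
        idx = np.linspace(0, len(w) - 1, m).astype(int)  # nearest-index resampling
        parts.append(w[idx])
    return np.concatenate(parts)

speech = synthesize(["ka", "ai", "i"], duration_scale=1.2)
```

The output quality of such a scheme is bounded by the stored units themselves, which is the dependence the text notes.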
In one such speech synthesis method, a plurality of speech units is selected for each synthesis unit (each segment) by targeting the input phoneme sequence and prosodic information. A new speech unit is generated by fusing the plurality of speech units, and speech is synthesized by concatenating the new speech units. Hereinafter, this method is called the plural unit selection and fusion method. For example, this method is disclosed in JP-A No. 2005-164749 (Kokai).
In the plural unit selection and fusion method, first, speech units are selected from a large number of previously stored speech units based on the input phoneme sequence and prosodic information (the target). In the unit selection, the degree of distortion between the synthesized speech and the target is defined as a cost function, and the speech units are selected so that the value of the cost function is minimized. For example, a target distortion, representing the difference in prosody and phoneme environment between the target speech and each speech unit, and a concatenation distortion, incurred by concatenating speech units, are numerically evaluated as costs. The speech units used for speech synthesis are selected based on these costs and fused by a particular method; for example, the pitch waveforms of the speech units are averaged, or centroids of the speech segments are used. As a result, synthesized speech is stably obtained while suppressing the degradation of quality caused by editing and concatenating speech units.
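The selection-and-fusion step can be sketched as below. The cost definitions, candidate values, and single-segment scope are illustrative assumptions (a real system evaluates costs over the whole utterance, typically by dynamic programming); the fusion shown is the waveform-averaging variant mentioned in the text.

```python
import numpy as np

def make_unit(f0, dur):
    """Hypothetical stored unit: prosodic features plus a short pitch waveform."""
    return {"f0": f0, "dur": dur,
            "wave": np.sin(2 * np.pi * f0 * np.linspace(0.0, dur, 64))}

# Candidate units stored for one target segment (values illustrative)
candidates = [make_unit(118.0, 0.21), make_unit(125.0, 0.19), make_unit(180.0, 0.30)]
target = {"f0": 120.0, "dur": 0.20}
prev_unit = make_unit(119.0, 0.20)  # unit already chosen for the preceding segment

def target_cost(unit, tgt):
    """Target distortion: relative mismatch of the unit's prosody vs. the target."""
    return (abs(unit["f0"] - tgt["f0"]) / tgt["f0"]
            + abs(unit["dur"] - tgt["dur"]) / tgt["dur"])

def concat_cost(prev, unit):
    """Concatenation distortion: waveform discontinuity at the unit boundary."""
    return abs(prev["wave"][-1] - unit["wave"][0])

costs = [target_cost(u, target) + concat_cost(prev_unit, u) for u in candidates]
order = np.argsort(costs)
best_two = [candidates[i] for i in order[:2]]  # the plural units for this segment

# Fusion: average the pitch waveforms of the selected units
fused_wave = np.mean([u["wave"] for u in best_two], axis=0)
```

The 180 Hz candidate is rejected because its target distortion dominates, while the two candidates near the 120 Hz / 0.20 s target survive and are averaged into one new unit.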
Furthermore, as a method for generating speech units of high quality, the stored speech units are represented using formant frequencies. For example, this method is disclosed in Japanese Patent No. 3732793. In this method, the waveform of a formant (hereafter called a “formant waveform”) is represented by multiplying a window function by a sinusoidal wave having the formant frequency. A speech waveform is represented as the sum of the formant waveforms.
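This windowed-sinusoid construction can be sketched directly. The sampling rate, waveform length, choice of a Hanning window, and the three formant frequencies (roughly those of the vowel /a/) are illustrative assumptions; the source only specifies "window function times sinusoid, summed over formants".

```python
import numpy as np

fs = 16000                 # sampling rate (Hz), assumed
n = 512                    # length of one pitch waveform in samples
t = np.arange(n) / fs
window = np.hanning(n)     # window function; Hanning chosen for illustration

def formant_waveform(freq_hz, amplitude):
    """One formant waveform: a sinusoid at the formant frequency
    multiplied by the window function."""
    return amplitude * window * np.sin(2 * np.pi * freq_hz * t)

# A vowel-like pitch waveform as the sum of three formant waveforms
wave = (formant_waveform(700.0, 1.0)
        + formant_waveform(1200.0, 0.5)
        + formant_waveform(2600.0, 0.25))
```

Because each formant is an explicit parameterized component, formant frequencies can be manipulated individually before the waveforms are summed, which is what makes this representation attractive for high-quality unit generation.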
However, in speech synthesis by the plural unit selection and fusion method, the waveforms of the speech units are directly fused. Accordingly, the spectrum of the synthesized speech becomes unclear and the quality of the synthesized speech falls. This problem is caused by fusing speech units having different formant frequencies: the formants of the fused speech unit become blurred, and the quality falls.
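The blurring effect can be demonstrated numerically. Below, two hypothetical single-formant units at slightly different formant frequencies are averaged in the waveform domain, and the width of the spectral peak region is compared before and after fusion; all signal parameters are illustrative assumptions.

```python
import numpy as np

fs, n = 16000, 512
t = np.arange(n) / fs
window = np.hanning(n)

# Two speech units whose (single) formant frequencies differ
unit_a = window * np.sin(2 * np.pi * 700.0 * t)
unit_b = window * np.sin(2 * np.pi * 950.0 * t)

fused = 0.5 * (unit_a + unit_b)  # waveform-domain fusion by averaging

def spectral_peak_width(wave, rel=0.5):
    """Crude sharpness measure: the number of FFT bins whose magnitude
    exceeds rel times the spectral maximum."""
    spec = np.abs(np.fft.rfft(wave))
    return int(np.sum(spec > rel * spec.max()))
```

Each original unit shows one narrow peak, whereas the averaged waveform spreads its energy across both formant positions, so its above-half-maximum region is wider: the single clear formant has been replaced by a smeared one, which is the quality degradation the text describes.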