The production of speech consists of a combination of three elements: generation of a sound source, articulation by the vocal tract, and radiation from the lips and nostrils. By simplifying these elements and separating the sound source from articulation, a model of speech waveform generation can be constructed.
Generally, speech has two characteristics. One, relating to articulation, is the phonemic characteristic, which is mainly shown in the change patterns of the spectrum envelope of the sound. The other, relating to the sound source, is the prosody characteristic, which is mainly shown in the fundamental frequency patterns of the sound.
In speech synthesis based on text data, the information required for synthesizing the phonemic characteristic can be obtained from the text data by morphological analysis. In contrast, the fundamental frequency pattern required for synthesizing the prosody characteristic is not contained in the text data. This pattern must therefore be derived from the accent pattern of each word, the syntax of the sentence, the discourse structure of the sentences, and so on.
The Fujisaki model is a well-known model for the generation of fundamental frequency. It rests on the observation that, when the time course of fundamental frequency is expressed on a logarithmic scale, the shape of the contour remains nearly constant regardless of the overall level of the fundamental frequency. The model further assumes that the fundamental frequency pattern actually observed is the sum of a phrase component, which falls gradually from the beginning to the end of the phrase, and an accent component, which reflects the accent on each word. Under this assumption, the phrase component is approximated by the response of a critically damped second-order linear system to an impulse phrase command, and the accent component by the response of such a system to a step accent command.
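The following is a minimal sketch of the model just described, written in the form commonly cited in the literature rather than taken from this text. The function names, the command tuples, and the default constants (alpha = 3.0 per second, beta = 20.0 per second, ceiling gamma = 0.9) are assumptions chosen for illustration only.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of a critically damped second-order system.

    Returns alpha^2 * t * exp(-alpha * t) for t >= 0, and 0 before the command.
    """
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of a critically damped second-order system, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t, base_f0, phrase_cmds, accent_cmds, alpha=3.0, beta=20.0, gamma=0.9):
    """Logarithm of F0 at time t: log baseline + phrase components + accent components.

    phrase_cmds: list of (onset_time, magnitude) impulses (hypothetical layout).
    accent_cmds: list of (onset_time, offset_time, amplitude) steps (hypothetical layout).
    """
    value = math.log(base_f0)
    for onset, magnitude in phrase_cmds:
        value += magnitude * phrase_response(t - onset, alpha)
    for onset, offset, amplitude in accent_cmds:
        # A step that turns on at onset and off at offset is the difference
        # of two step responses.
        value += amplitude * (
            accent_response(t - onset, beta, gamma)
            - accent_response(t - offset, beta, gamma)
        )
    return value


def f0_contour(times, base_f0, phrase_cmds, accent_cmds):
    """F0 in Hz at each time in `times`."""
    return [math.exp(log_f0(t, base_f0, phrase_cmds, accent_cmds)) for t in times]


if __name__ == "__main__":
    # Illustrative values only: one phrase command and one accent command.
    times = [0.01 * i for i in range(200)]
    contour = f0_contour(times, 120.0, [(0.0, 0.5)], [(0.3, 0.6, 0.4)])
    for t, f0 in list(zip(times, contour))[::25]:
        print(f"{t:.2f} s  {f0:.1f} Hz")
```

Because the components are summed in the logarithmic domain, changing `base_f0` scales the whole contour without changing its shape, which is the property the model relies on.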
As described above, the phrase command and the accent command are calculated from the word's accent pattern, the syntax of the sentence, and the discourse structure of the sentences, and the fundamental frequency is then determined from these commands.
However, the above model for the generation of fundamental frequency has the problem that the fundamental frequency cannot be controlled precisely, because only a rise in fundamental frequency is taken into consideration. In other words, the range of expression that can be added to synthesized speech is limited. Another problem is that the phrase command and the accent command cannot be determined with certainty when an observed fundamental frequency pattern is analyzed.
A further problem is that a time lag occurs between the time at which the phrase command is designated and the time at which the phrase component actually appears, because the phrase component is taken to be the response of a critically damped second-order linear system to the impulsive phrase command.
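The size of this lag can be estimated with a short derivation, assuming the impulse response of a critically damped second-order system in its standard form. The symbol $\alpha$ (the natural angular frequency of the phrase control mechanism) and the numerical value used for it are assumptions for illustration and do not appear in the text above.

```latex
\begin{align*}
G_p(t) &= \alpha^{2}\, t\, e^{-\alpha t}, \qquad t \ge 0, \\
\frac{dG_p}{dt} &= \alpha^{2} e^{-\alpha t}\,(1 - \alpha t) = 0
  \;\Longrightarrow\; t_{\text{peak}} = \frac{1}{\alpha}, \\
G_p(t_{\text{peak}}) &= \frac{\alpha}{e}.
\end{align*}
```

The phrase component is zero at the instant of the command and reaches its maximum only $1/\alpha$ later. With an assumed value of $\alpha = 3\ \mathrm{s^{-1}}$, for example, the peak appears about 0.33 s after the command is issued.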