In recent years, text-to-speech synthesis systems that artificially generate a speech signal from an arbitrary sentence have been developed. In general, such a system is composed of a language processing unit, a prosody generation unit, and a speech signal generation unit.
Among the components of the text-to-speech synthesis system, the performance of the prosody generation unit affects the naturalness of the synthesized speech. In particular, the fundamental frequency pattern (the pattern of change in the pitch of the voice) has a large influence on the naturalness of the synthesized speech.
In conventional methods for generating a fundamental frequency pattern for text-to-speech synthesis, the pattern is generated using a relatively simple model. As a result, the synthesized speech sounds mechanical and has unnatural intonation.
To solve the above problem, another method for generating a fundamental frequency pattern is disclosed, for example, in JP-A No. 2007-33870 (KOKAI). In this method, a large number of fundamental frequency patterns extracted from natural speech are hierarchically clustered. By applying statistical processing to each cluster (a set of fundamental frequency patterns), a typical pattern is generated for each cluster.
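The clustering step described above can be sketched as follows. This is only an illustrative sketch and not the method of the cited publication: it assumes that each fundamental frequency pattern has already been normalized to a common length, that clusters are formed by splitting on context attributes in a fixed order, and that the "statistical processing" is a pointwise mean. The attribute names `accent` and `morae`, the function names, and the sample values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def typical_pattern(contours):
    """Pointwise mean of equal-length F0 contours (assumed form of the statistics)."""
    return [mean(values) for values in zip(*contours)]


def cluster_hierarchically(samples, attributes):
    """Recursively split samples by each attribute in turn.

    samples: list of (context dict, F0 contour in Hz) pairs.
    Returns a node holding the cluster size, its typical pattern,
    and one child node per attribute value.
    """
    node = {
        "size": len(samples),
        "typical": typical_pattern([contour for _, contour in samples]),
        "children": {},
    }
    if attributes:
        first, rest = attributes[0], attributes[1:]
        groups = defaultdict(list)
        for context, contour in samples:
            groups[context[first]].append((context, contour))
        for value, group in groups.items():
            node["children"][value] = cluster_hierarchically(group, rest)
    return node


# Hypothetical toy data: four contours, each resampled to three points.
samples = [
    ({"accent": 0, "morae": 3}, [120.0, 130.0, 125.0]),
    ({"accent": 0, "morae": 3}, [110.0, 120.0, 115.0]),
    ({"accent": 0, "morae": 4}, [100.0, 140.0, 120.0]),
    ({"accent": 1, "morae": 3}, [150.0, 130.0, 110.0]),
]
tree = cluster_hierarchically(samples, ["accent", "morae"])
```

Each node of `tree` stores the number of patterns in its cluster together with the typical pattern computed from them, so the cluster sizes at every layer can be inspected directly.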
However, because of the hierarchical clustering, the set of fundamental frequency patterns at a lower layer is necessarily small. Accordingly, the statistical reliability of the typical pattern is low, and both robustness and naturalness drop. To generate a natural fundamental frequency pattern, the set of fundamental frequency patterns at each lower layer must be kept at a predetermined size, and all types of fundamental frequency patterns must be covered. In other words, a large amount of speech data must be prepared in advance.
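The data requirement can be illustrated with a rough calculation. The sketch below assumes a balanced hierarchy in which every cluster splits into the same number of child clusters, and it uses the standard error of a mean as a simple stand-in for the statistical reliability of a typical pattern. The function names and all figures are hypothetical and are not taken from the cited publication.

```python
import math


def average_cluster_size(total_patterns, branching, depth):
    """Average number of patterns per cluster at a given depth of a balanced tree."""
    return total_patterns / branching ** depth


def standard_error(sigma_hz, n):
    """Standard error of a mean over n independent samples with spread sigma_hz."""
    return sigma_hz / math.sqrt(n)


# Hypothetical corpus: 10,000 patterns, 4-way splits, 20 Hz spread per point.
for depth in range(6):
    n = average_cluster_size(10_000, 4, depth)
    print(depth, round(n, 1), round(standard_error(20.0, n), 2))
```

Under these assumptions the average cluster size falls by a factor of four per layer, so the uncertainty of each typical pattern doubles per layer unless the amount of speech data grows accordingly.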