1. Field of the Invention
The present invention relates to a prosody-pattern generating apparatus, a speech synthesizing apparatus, and a computer program product and a method thereof.
2. Description of the Related Art
A technique of applying a hidden Markov model (HMM), which is used in speech recognition, to speech synthesizing technology that synthesizes speech from text has been receiving attention. In particular, speech is synthesized by generating a prosody pattern (a fundamental frequency pattern and a phoneme duration length pattern) that defines the characteristics of the speech by use of a prosody model, which is an HMM (see, for instance, Non-patent Document 1, “Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis” by T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, Proc. EUROSPEECH '99, pp. 2347-2350, September 1999).
With speech synthesizing technology that outputs speech parameters directly from an HMM and synthesizes speech from those parameters, various speech styles of various speakers can be readily realized.
In addition to the above HMM-based fundamental frequency pattern generation, a technique has been proposed that improves the naturalness of a fundamental frequency pattern by generating the pattern in consideration of the distribution of the fundamental frequencies over the entire sentence (see, for instance, Non-patent Document 2, “Speech parameter generation algorithm considering global variance for HMM-based speech synthesis” by T. Toda and K. Tokuda, Proc. INTERSPEECH 2005, pp. 2801-2804, September 2005).
However, the technique proposed in Non-patent Document 2 has a problem. Because optimal parameter strings are searched for by executing an algorithm repeatedly, the amount of calculation increases at the time of generating the fundamental frequency pattern.
Furthermore, because the technique of Non-patent Document 2 employs the distribution of the fundamental frequencies over the entire sentence of the text, a pattern cannot be generated sequentially for each segment of the sentence or the like. Thus, there is a problem that the speech cannot be output until the fundamental frequency pattern of the entire text has been generated.
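The dependence on the entire sentence described above can be illustrated with a minimal sketch. It assumes that the sentence-level statistic is the variance of the fundamental frequency values, as the title of Non-patent Document 2 suggests. The function name `global_variance`, the segment boundaries, and the log-F0 values are hypothetical and chosen only for illustration; the sketch does not reproduce the parameter generation algorithm of Non-patent Document 2.

```python
from statistics import pvariance


def global_variance(f0_values):
    """Population variance of the F0 values seen so far (hypothetical helper)."""
    return pvariance(f0_values)


# Hypothetical log-F0 values for three consecutive segments of one sentence.
segments = [
    [5.0, 5.1, 5.2],
    [5.4, 5.3],
    [4.9, 4.8, 4.7],
]

seen = []      # F0 values of all segments received so far
running = []   # variance estimate after each segment arrives
for segment in segments:
    seen.extend(segment)
    running.append(global_variance(seen))

# Only the last entry is the sentence-level value; the earlier entries are
# estimates that change as later segments arrive.
sentence_level = running[-1]
```

In this sketch the value computed after the first segment differs from the value for the whole sentence, so a pattern that is conditioned on the sentence-level value cannot be finalized for an early segment before the later segments are known.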