The invention relates to a method for determining the time characteristic of a fundamental frequency of a voice response to be synthesized.
At the ICASSP 97 conference in Munich, X. Huang et al. presented a method for synthesizing voice from a text under the title “Recent Improvements on Microsoft's Trainable Text-to-Speech System Whistler”. The method is completely trainable and assembles and generates the prosody of a text from prosody patterns stored in a database. The prosody of a text is essentially defined by the fundamental frequency, which is why this known method can also be considered a method for generating a fundamental frequency on the basis of corresponding patterns stored in a database. To achieve speech which is as natural as possible, elaborate correction methods are provided which interpolate, smooth and correct the contour of the fundamental frequency.
At the ICASSP 98 conference in Seattle, Ralf Haury et al. presented a further method for generating a synthetic voice response from a text under the title “Optimization of a Neural Network for Speaker and Task Dependent F0 Generation”. To generate the fundamental frequency, this known method uses a neural network instead of a database of patterns; the neural network defines the time characteristic of the fundamental frequency for the voice response.
The methods described above are intended to create a voice response which does not have the metallic, mechanical and unnatural sound known from conventional speech synthesis systems. They represent a distinct improvement over such conventional systems. Nevertheless, there are considerable tonal differences between a voice response based on these methods and a human voice.
In particular, speech synthesis in which the fundamental frequency is composed of individual fundamental-frequency patterns still generates a metallic, mechanical sound which can be clearly distinguished from a natural voice. If, in contrast, the fundamental frequency is defined by a neural network, the voice is more natural but somewhat dull.
One aspect of the invention is, therefore, based on the object of creating a method for determining the time characteristic of a fundamental frequency of a voice response to be synthesized, which method imparts to the voice response a natural sound very similar to that of a human voice.