To convert text to speech, a stream of text is typically converted into a speech wave form. This process generally includes determining the timing of speech events from a phonetic representation of the text. Typically, this involves the determination of the durations of speech segments that are associated with some speech elements, typically phones or phonemes. That is, for purposes of generating the speech, the speech is considered as a sequence of segments during each of which, some particular phoneme or phone is being uttered. (A phone is a particular manner in which a phoneme or part of a phoneme may be uttered. For example, the `t` sound in English, may be represented in the synthesized speech as a single phone, which could be a flap, a glottal stop, a `t` closure, or a `t` release. Alternatively, it could be represented by two phones, a `t` closure followed by a `t` release.) Speech timing is established by determining the durations of these segments.
In the prior art, rule-based systems generate segment durations using predetermined formulas with parameters that are adjusted by rules that act in a manner determined by the context in which the phonetic segment occurs, along with the identity of the phone to be generated during the phonetic segment. Present neural network-based systems provide full phonetic context information to the neural network, making it easy for the network to memorize, rather than generalize, which leads to poor performance on any phone sequence other than one of those on which the system has been trained.
Thus, there is a need for a neural network system that avoids the effects when the neural network depends only on chance correlations in training data and instead provides efficient segment durations.