Text-to-speech conversion involves converting a stream of text into a speech wave form. This conversion process generally includes the conversion of a phonetic representation of the text into a number of speech parameters. The speech parameters are then converted into a speech wave form by a speech synthesizer.
Concatenative systems are used to convert phonetic representations into speech parameters. Concatenative systems store patterns produced by an analysis of speech, which may be diphones or demisyllables, and concatenate the stored patterns, adjusting their duration and smoothing the transitions between them, to produce speech parameters in response to the phonetic representation. One problem with concatenative systems is the large number of patterns that must be stored, generally over 1000. In addition, the transition between stored patterns is not smooth.
Synthesis-by-rule systems are also used to convert phonetic representations into speech parameters. Synthesis-by-rule systems store target speech parameters for every possible phonetic representation. The target speech parameters are modified, according to a set of rules, based on the transitions between phonetic representations. The problem with synthesis-by-rule systems is that the transitions between phonetic representations are not natural, because the transition rules tend to produce only a few styles of transition. In addition, a large set of rules must be stored.
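The concatenative approach described above can be illustrated with a minimal sketch. It is not taken from any particular system: the pattern store `PATTERNS`, the unit names, the single scalar parameter per frame, and the midpoint smoothing at each join are all assumptions made for illustration. A real system would store over 1000 patterns, with a vector of speech parameters in every frame.

```python
from typing import Dict, List

# Hypothetical pattern store: unit name -> frames of one speech parameter.
PATTERNS: Dict[str, List[float]] = {
    "h-e": [0.0, 0.2, 0.4],
    "e-l": [0.8, 0.6, 0.4],
}


def resample(frames: List[float], n: int) -> List[float]:
    """Adjust duration by linearly interpolating a pattern to n frames."""
    if n == 1:
        return [frames[0]]
    out = []
    for i in range(n):
        pos = i * (len(frames) - 1) / (n - 1)  # position in the source pattern
        lo = int(pos)
        hi = min(lo + 1, len(frames) - 1)
        frac = pos - lo
        out.append(frames[lo] * (1 - frac) + frames[hi] * frac)
    return out


def concatenate(units: List[str], durations: List[int]) -> List[float]:
    """Join stored patterns, pulling the two frames at each join toward their midpoint."""
    track: List[float] = []
    for unit, n in zip(units, durations):
        frames = resample(PATTERNS[unit], n)
        if track:
            mid = (track[-1] + frames[0]) / 2
            track[-1] = (track[-1] + mid) / 2   # halves the jump at the boundary
            frames[0] = (frames[0] + mid) / 2
        track.extend(frames)
    return track


track = concatenate(["h-e", "e-l"], [3, 3])
```

Even with this smoothing, a discontinuity remains at each join, which is the kind of non-smooth transition between stored patterns noted above.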
Neural networks are also used to convert phonetic representations into speech parameters. The neural network is trained to associate speech parameters with the phonetic representation of the text of recorded messages. The training results in a neural network with weights that represent the transfer function required to produce speech wave forms from phonetic representations. Neural networks overcome the large storage requirements of concatenative and synthesis-by-rule systems, since the knowledge base is stored in the weights rather than in a memory.
One neural network implementation that converts a phonetic representation consisting of phonemes into speech parameters uses as its input a group, or window, of phonemes. The number of phonemes in the window is fixed and predetermined. The neural network generates several frames of speech parameters for the middle phoneme of the window, while the other phonemes in the window, which surround the middle phoneme, provide a context for the neural network to use in determining the speech parameters. The problem with this implementation is that the generated speech parameters do not produce smooth transitions between phonetic representations, and therefore the generated speech is not natural and may be incomprehensible.
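The fixed-size window input described above can be sketched as follows. This is only an illustration of how such windows might be assembled: the function name `phoneme_windows`, the padding symbol `PAD`, and the default window size of 5 are assumptions, and the neural network that would map each window to frames of speech parameters is omitted.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical symbol for positions beyond the ends of the utterance


def phoneme_windows(phonemes: List[str], size: int = 5) -> List[Tuple[str, List[str]]]:
    """Return (middle phoneme, window) pairs, one per phoneme in the input.

    The window length is fixed and odd, so the phoneme being synthesized
    sits in the middle and its neighbors supply the context.
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("window size must be a positive odd number")
    half = size // 2
    padded = [PAD] * half + phonemes + [PAD] * half
    return [(phonemes[i], padded[i:i + size]) for i in range(len(phonemes))]


windows = phoneme_windows(["k", "ae", "t"], size=3)
for middle, window in windows:
    print(middle, window)
```

Each window would be presented to the network as a separate input, and the network would emit several frames of speech parameters for that window's middle phoneme.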
Therefore, a need exists for a text-to-speech conversion system that reduces storage requirements and provides smooth transitions between phonetic representations, such that natural and comprehensible speech is produced.