The present invention relates to a method and an arrangement for speech synthesis and provides an automatic mechanism for simulating human speech. The method according to the present invention provides a number of control parameters for controlling a speech synthesis device.
In natural speech, the phonemes contained therein overlap one another. This phenomenon is called coarticulation. The present invention combines diphonic synthesis and formant synthesis for handling coarticulation. Furthermore, the present invention provides the possibility for polyphonic synthesis, especially diphonic synthesis, but also triphonic synthesis and quadraphonic synthesis.
It is known that the synthesis of text and/or speech often starts with a syntactic analysis of the text, in which words that can be interpreted in more than one way are given a correct pronunciation; that is to say, a suitable phonetic transcription is selected. An example of this is the Swedish word "buren", which can be interpreted as a noun or as the participle form of a verb.
By using syntactic analysis and the syllabic structure of the sentence as a starting point, a fundamental sound curve can be created for the whole phrase and the durations of the phonemes contained therein can be determined. After this process, the phonemes can be realised acoustically in a number of different ways.
A known method of speech synthesis is formant synthesis. With this method, the speech is produced by applying different filters to a source. The filters are controlled by means of a number of control parameters including, inter alia, formants, bandwidths and source parameters. A prototype set of control parameters is stored for each allophone. Coarticulation is handled by moving the start and end points of the control parameters with the aid of rules, i.e. rule synthesis. One problem with this method is that it needs a large number of rules for handling the many possible combinations of phonemes. Furthermore, it is difficult to gain an overview of the method.
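The source-filter principle described above can be illustrated with a minimal sketch. It assumes an impulse train as the source and a cascade of second-order digital resonators as the filters, each controlled by a formant frequency and a bandwidth. The function names, the sample rate and the formant values are illustrative assumptions and are not taken from the invention.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients of a two-pole resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalises the gain at 0 Hz to one
    return a, b, c

def resonate(signal, freq_hz, bw_hz, sample_rate):
    """Apply one formant resonator to a list of samples."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0_hz, duration_s, sample_rate):
    """Crude voiced source: one unit impulse per pitch period."""
    n = round(duration_s * sample_rate)
    period = max(1, round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def synthesize(f0_hz, formants, duration_s, sample_rate=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, sample_rate)
    return signal

# Hypothetical control parameters for a single vowel-like sound (assumed values).
EXAMPLE_FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = synthesize(120.0, EXAMPLE_FORMANTS, 0.05)
```

In a rule-based system of the kind described, the formant list would vary over time under the control of the rules; here it is held constant for brevity.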
Another known method of speech synthesis is diphonic synthesis. With this method, the speech is produced by linking together segments of waveforms taken from recorded speech, and the desired fundamental sound curve and duration are produced by signal processing. An underlying prerequisite of this method is that each diphone contains a range which is spectrally stationary, and that spectral similarity prevails there; otherwise, a spectral discontinuity is obtained there, which is a problem. It is also difficult with this method to change the waveforms after recording and segmentation, and difficult to apply rules, since the waveform segments are fixed.
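As an illustration of how an utterance is divided into the units that are linked together, the following sketch maps a phoneme sequence to a sequence of diphone labels. The boundary marker "#" for silence, the label format and the example phonemes are assumptions made for this sketch only.

```python
def to_diphones(phonemes, boundary="#"):
    """Return the diphone labels covering a phoneme sequence.

    Each label names a transition from one phoneme to the next; the
    utterance is padded with a boundary symbol at both ends (an assumption).
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

# Hypothetical three-phoneme utterance.
units = to_diphones(["s", "a", "t"])
```

Each label would then select one stored segment, and the segments would be concatenated in the listed order.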
There are no problems with spectral discontinuities in formant speech synthesis. Diphonic speech synthesis does not need any rules for handling the coarticulation problem.
It is an object of the present invention to use a diphonic synthesis method, that is to say, the use of stored control parameters which have been extracted by copying natural speech with the aid of synthesis, for generating speech by means of formant synthesis. An interpolation mechanism automatically handles coarticulation. If it is nevertheless desirable to apply rules, this can, in fact, be done.
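The passage above does not specify the interpolation mechanism. Purely as an illustration of the idea, the following sketch joins two stored control-parameter tracks (for example, first-formant values per frame) and spreads any jump at the boundary linearly over a few frames on each side. The function name, the `span` parameter and the numerical values are assumptions, and the tracks are assumed to be non-empty.

```python
def smooth_join(left, right, span):
    """Concatenate two parameter tracks and remove the jump at the join.

    Half of the boundary jump is added to the end of `left` and half is
    subtracted from the start of `right`, with a weight that falls off
    linearly over `span` frames on each side.
    """
    span = min(span, len(left), len(right))
    gap = right[0] - left[-1]
    out_left = list(left)
    out_right = list(right)
    for i in range(span):
        weight = 0.5 * (span - i) / span  # 0.5 at the boundary, smaller further away
        out_left[len(left) - 1 - i] += weight * gap
        out_right[i] -= weight * gap
    return out_left + out_right

# Hypothetical first-formant tracks (Hz) for two adjacent diphones.
f1_left = [500.0, 500.0, 520.0, 540.0]
f1_right = [600.0, 620.0, 640.0, 640.0]
f1_joined = smooth_join(f1_left, f1_right, span=2)
```

Because the stored items are control parameters rather than fixed waveform segments, an adjustment of this kind can be made before the formant synthesiser produces any sound.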