Speech is the most customary and most natural means of human-machine communication. The technology for converting a text input into a speech output is called text-to-speech (TTS) conversion or speech synthesis technology. It involves a plurality of fields such as acoustics, linguistics, digital signal processing and multimedia technology, and is a cutting-edge technology in the field of Chinese information processing.
FIG. 1 illustrates a signal flow of a speech synthesis system provided by the prior art. With reference to FIG. 1, in a training phase, a prosodic structure prediction model 103, an acoustics model 104 and a candidate unit 105 may be obtained by training on annotated data in a text corpus 101 and a speech corpus 102. The prosodic structure prediction model 103 provides a reference for prosodic structure prediction 107 in a speech synthesis phase; the acoustics model 104 provides a basis for speech synthesis 109; and the candidate unit 105 is a software unit for retrieving common candidate waveforms in speech synthesis 109 of the waveform concatenation type.
In the speech synthesis phase, firstly, text analysis 106 is performed on input text; then prosodic structure prediction 107 is performed on the input text according to the prosodic structure prediction model 103; then parameter prediction/unit selection 108 is performed according to the speech synthesis pattern in use, that is, speech synthesis of the parameter synthesis type or speech synthesis of the waveform concatenation type; and finally, speech synthesis 109 is performed.
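The order of the steps in the speech synthesis phase may be illustrated by the following minimal sketch. It is not the prior-art implementation: every function name, the toy break rule in `uniform_break_model`, the duration formula and the `candidate_units` lookup table are hypothetical stand-ins chosen only to show how steps 106, 107 and 108 feed one another before speech synthesis 109.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ProsodicStructure:
    words: List[str]
    breaks: List[int]  # break level after each word: 0 = none, higher = stronger


# A break model maps (word index, all words) to a break level.
BreakModel = Callable[[int, List[str]], int]


def text_analysis(text: str) -> List[str]:
    """Step 106 (toy version): split the input text into word tokens."""
    return text.split()


def uniform_break_model(index: int, words: List[str]) -> int:
    """Hypothetical stand-in for model 103: one fixed rule for every speaker."""
    if index == len(words) - 1:
        return 2  # strongest break at the end of the sentence
    return 1 if (index + 1) % 3 == 0 else 0  # minor break after every third word


def prosodic_structure_prediction(words: List[str], model: BreakModel) -> ProsodicStructure:
    """Step 107: assign a break level after each word according to the model."""
    return ProsodicStructure(words, [model(i, words) for i in range(len(words))])


def parameter_prediction(structure: ProsodicStructure) -> List[str]:
    """Step 108, parameter synthesis type (toy): longer duration before stronger breaks."""
    return [f"{w}:{100 + 50 * b}ms" for w, b in zip(structure.words, structure.breaks)]


def unit_selection(structure: ProsodicStructure, candidate_units: Dict[str, str]) -> List[str]:
    """Step 108, waveform concatenation type (toy): look up one stored unit per word."""
    return [candidate_units.get(w, f"<no unit for {w}>") for w in structure.words]


def synthesize(text: str, mode: str,
               model: BreakModel = uniform_break_model,
               candidate_units: Optional[Dict[str, str]] = None) -> List[str]:
    """Run steps 106 to 108 and return the plan that step 109 would render as audio."""
    words = text_analysis(text)
    structure = prosodic_structure_prediction(words, model)
    if mode == "parametric":
        return parameter_prediction(structure)
    if mode == "concatenative":
        return unit_selection(structure, candidate_units or {})
    raise ValueError(f"unknown synthesis mode: {mode}")
```

In this sketch the break model is passed in as a parameter, so the same pipeline yields the same prosodic structure for a given text regardless of who is assumed to be speaking, unless a different model is supplied.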
When the existing speech synthesis system performs prosodic structure prediction on a given input text, a prosodic hierarchy structure determined by the input text may be obtained. However, in people's actual communications, the prosodic hierarchy structure of speech is often affected by a variety of factors. FIG. 2 is a schematic diagram illustrating the factors that influence the prosodic structure of real person speech. With reference to FIG. 2, the prosodic structure of real person speech may be affected by the speaker's characteristics, emotions and fundamental frequency, and by the meaning of the sentences. Taking the characteristics of the speaker as an example, the prosodic structure of the speech of a man aged 70 is different from that of a woman aged 30.
Therefore, the prosodic structure of a sentence predicted according to a uniform prosodic structure prediction model 103 has poor flexibility, which results in poor naturalness of the speech finally synthesized by the speech synthesis system.