The goal of a TTS system and method is to convert input text into synthesized speech that is as natural as possible. Natural speech, as used hereinafter, refers to speech whose voice sounds like that of a human being. A natural voice is usually achieved by recording a real human being reading text aloud. TTS technology, especially TTS for natural speech, usually uses a speech corpus comprising a large amount of text together with the corresponding recorded speech, prosody labels, and other basic information labels. In general, a TTS system and method includes three components: text analysis, prosody parameter prediction, and speech synthesis. For plain text to be converted to speech based on the corpus, text analysis is responsible for parsing the plain text into rich text with descriptive prosody annotations, such as prosody structure information (including phrase boundaries and pauses), pronunciation, and accent annotations of the text. Prosody parameter prediction is responsible for predicting the phonetic representation of prosody, i.e., prosody parameters such as values of pitch, duration, and energy, according to the result of text analysis. Speech synthesis is responsible for generating speech for the text based on the prosody parameters. Based on a natural speech corpus, the resulting speech is an intelligible voice that physically realizes the semantic and prosodic information implicit in the plain text.
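By way of illustration only, the three components described above can be sketched as a toy pipeline. The names AnnotatedWord, ProsodyParams, text_analysis, predict_prosody, and synthesize are hypothetical, and the punctuation rule and numeric constants are assumptions chosen for the example. A real system would use models trained on a labeled corpus at each stage.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AnnotatedWord:
    text: str
    phrase_break: bool  # True if a prosodic phrase boundary follows this word


@dataclass
class ProsodyParams:
    pitch_hz: float
    duration_s: float
    energy: float


def text_analysis(plain_text: str) -> List[AnnotatedWord]:
    """Toy text analysis: assume a phrase boundary after punctuation."""
    words = []
    for token in plain_text.split():
        has_break = token[-1] in ",.;!?"
        words.append(AnnotatedWord(token.strip(",.;!?"), has_break))
    return words


def predict_prosody(words: List[AnnotatedWord]) -> List[ProsodyParams]:
    """Toy prediction: duration grows with word length (assumed constants)."""
    params = []
    for w in words:
        duration = 0.08 * max(len(w.text), 1)
        if w.phrase_break:
            duration *= 1.3  # assumed lengthening before a boundary
        pitch = 110.0 if w.phrase_break else 120.0
        params.append(ProsodyParams(pitch, duration, 1.0))
    return params


def synthesize(words: List[AnnotatedWord],
               params: List[ProsodyParams]) -> List[Tuple[str, float, float]]:
    """Stand-in for synthesis: returns (word, start_s, end_s) per word."""
    timeline, t = [], 0.0
    for w, p in zip(words, params):
        timeline.append((w.text, t, t + p.duration_s))
        t += p.duration_s
    return timeline


words = text_analysis("Hello world, this is speech.")
timeline = synthesize(words, predict_prosody(words))
```

In this sketch each stage consumes only the output of the previous one, which mirrors the division of responsibility described above.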
Statistics-based approaches are an important trend in current TTS technologies. In such approaches, the text analysis and prosody parameter prediction models are trained on a large labeled corpus, and speech synthesis is typically based on selecting from multiple candidates for each synthesis segment to obtain the required synthesized speech.
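As a minimal sketch of selection from multiple candidates, the following example picks one candidate per synthesis segment by dynamic programming over a target cost and a join cost. The cost functions, their weights, and the names target_cost, join_cost, and select_units are assumptions made for illustration, and are not taken from any particular prior art system.

```python
from typing import List, Tuple

# A candidate is a recorded segment described by (pitch_hz, duration_s).
Candidate = Tuple[float, float]


def target_cost(cand: Candidate, target: Candidate) -> float:
    """Assumed cost: distance between a candidate and the predicted prosody."""
    return abs(cand[0] - target[0]) / 100.0 + abs(cand[1] - target[1])


def join_cost(prev: Candidate, cur: Candidate) -> float:
    """Assumed cost: pitch discontinuity between adjacent candidates."""
    return abs(prev[0] - cur[0]) / 100.0


def select_units(targets: List[Candidate],
                 candidates: List[List[Candidate]]) -> List[int]:
    """Return the index of the chosen candidate for each segment."""
    if not targets:
        return []
    # best[i][j]: minimal total cost of a path ending at candidate j of segment i
    best = [[target_cost(c, targets[0]) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for c in candidates[i]:
            costs = [best[i - 1][k] + join_cost(p, c)
                     for k, p in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + target_cost(c, targets[i]))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]


targets = [(120.0, 0.2), (110.0, 0.3)]
candidates = [[(200.0, 0.2), (121.0, 0.2)],
              [(111.0, 0.3), (300.0, 0.3)]]
chosen = select_units(targets, candidates)
```

Here the second candidate of the first segment and the first candidate of the second segment are closest to the predicted prosody and also join smoothly, so they form the lowest-cost path.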
At present, the prosody structure of the text, an important component of text analysis, is generally regarded as the result of semantic and syntactic analysis of the text. Prior art technologies for prosody structure prediction hardly recognize or consider the influence of speech speed adjustment. However, a comparison between two corpora recorded at different speech speeds shows that the relationship between speech speed and prosody structure is significant.
Moreover, when a different speech speed is required for TTS, the prior art adjusts the duration among the prosody parameters in the speech synthesis phase to meet the speech speed requirement. This measure degrades the quality of the synthesized speech because it does not take into account the relationship between speech speed and prosody structure.
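The duration adjustment described above can be illustrated with a short sketch, assuming for simplicity that the adjustment is a uniform scaling of every segment duration. The names Segment and scale_durations are hypothetical. The sketch shows that such scaling changes only the durations, while the positions of pauses and phrase boundaries remain exactly as predicted for the original speed.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Segment:
    label: str
    duration_s: float
    is_pause: bool = False  # True for a pause at a prosodic boundary


def scale_durations(segments: List[Segment], speed_factor: float) -> List[Segment]:
    """Uniformly shorten (factor > 1) or lengthen (factor < 1) all durations."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return [replace(s, duration_s=s.duration_s / speed_factor) for s in segments]


normal = [Segment("hello", 0.4), Segment("<pause>", 0.2, True), Segment("world", 0.5)]
fast = scale_durations(normal, 2.0)

# Every duration is halved, but the pause is still present at the same position.
same_structure = [s.is_pause for s in fast] == [s.is_pause for s in normal]
```

Because the prosody structure is fixed before the scaling is applied, any dependence of phrase boundaries and pauses on speech speed is not reflected in the output.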