Speech synthesis, or text-to-speech (TTS), uses a computer-based system to convert a written document into audible speech. A good TTS system should generate speech that is natural, or human-like, and highly intelligible. In the early years, rule-based TTS systems, or formant synthesizers, were used. These systems generate intelligible speech, but the speech sounds robotic and unnatural.
To generate natural-sounding speech, unit-selection speech synthesis systems were invented. Such a system requires the recording of a large amount of speech. During synthesis, the input text is first converted into phonetic script and segmented into small pieces, and the matching pieces are then found in the large pool of recorded speech. Those individual pieces are then stitched together. Obviously, to accommodate arbitrary input text, the body of recorded speech must be gigantic, and it is very difficult to change the speaking style. Therefore, for decades, alternative speech synthesis systems that combine the advantages of formant systems, which are small and versatile, with the naturalness of unit-selection systems have been intensively sought.
In a related patent application, a system and method for speech synthesis using timbre vectors are disclosed. The said system and method enable the parameterization of recorded speech signals into a highly amenable format, timbre vectors. From the said timbre vectors, the speech signals can be regenerated with a substantial degree of modification, and the quality is very close to that of the original speech. For speech synthesis, the said modifications include prosody, which comprises the pitch contour, the intensity profile, and the duration of each voice segment. However, in the previous application, U.S. Ser. No. 13/692,584, no systems and methods for the generation of prosody are disclosed. In the current application, systems and methods for generating prosody for an input text are disclosed.