Text-to-speech (TTS) synthesis refers to a technique for artificially generating speech. The synthetic speech is generally composed by a computer system and designed to sound like human speech. Another technique, referred to as the Personalization of TTS, seeks to modify the synthesized speech from the TTS system to sound like a target speaker. One of the challenges in doing so is to match the rhythm and speaking style of the target speaker using a small amount of data, generally limited to a small number of utterances from that speaker. As a result, the syllable durations of a typical speaker do not match the syllable durations of a TTS system output for the same sentence.
The mismatch between a typical speaker and the corresponding TTS output is illustrated in FIG. 1, which shows the waveform 100 and spectrum 110 of a speech signal from a representative speaker (top) and the waveform 120 and spectrum 130 of the same sentence generated by a TTS system (bottom). As is evident from the speech boundary lines 140 in FIG. 1, the syllable durations associated with the speaker are sometimes longer and sometimes shorter than those of the TTS output for the same sentence. The temporal durations of different segments of the speech vary widely depending on the linguistic context, such as the phonetic content of the syllable, the preceding and following syllables, and the tones of these syllables. Even within a syllable, uniform expansion or compression is not sufficient to address the individual differences necessary to adapt the synthetic speech to the speaker.
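By way of a non-limiting illustration, the duration mismatch described above can be sketched numerically. The boundary times below are hypothetical values chosen for the example and are not measurements taken from FIG. 1; the function names are likewise assumptions introduced only for this sketch. The example computes a per-syllable duration ratio between the speaker and the TTS output, and then shows the error that remains when a single uniform time-scale factor is applied to the whole utterance.

```python
def durations(boundaries):
    """Convert a sorted list of boundary times (seconds) into segment durations."""
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


def duration_ratios(speaker_bounds, tts_bounds):
    """Per-syllable ratio of speaker duration to TTS duration."""
    spk = durations(speaker_bounds)
    tts = durations(tts_bounds)
    if len(spk) != len(tts):
        raise ValueError("both utterances must have the same number of syllables")
    return [s / t for s, t in zip(spk, tts)]


def uniform_scale_residuals(speaker_bounds, tts_bounds):
    """Per-syllable error (seconds) left after one global time-scale factor."""
    spk = durations(speaker_bounds)
    tts = durations(tts_bounds)
    if len(spk) != len(tts):
        raise ValueError("both utterances must have the same number of syllables")
    global_ratio = sum(spk) / sum(tts)  # single factor matching total length
    return [s - global_ratio * t for s, t in zip(spk, tts)]


# Hypothetical syllable boundaries (seconds) for the same three-syllable sentence.
speaker_bounds = [0.0, 0.25, 0.5, 1.0]
tts_bounds = [0.0, 0.5, 0.75, 1.0]

ratios = duration_ratios(speaker_bounds, tts_bounds)
residuals = uniform_scale_residuals(speaker_bounds, tts_bounds)
print(ratios)     # some syllables shorter (< 1), some longer (> 1) than the TTS output
print(residuals)  # nonzero entries remain even though total durations match
```

In this hypothetical case the two utterances have identical total length, so a uniform factor changes nothing, yet the first syllable of the speaker is half as long as the TTS syllable and the last is twice as long. This is the kind of syllable-level variation that a single expansion or compression factor cannot capture.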
There is therefore a need for a technique for adapting the speech output of the TTS system to match the target speaker, thereby generating synthetic speech that realistically sounds like the target speaker.