Generating speech with desirable properties has been a focus in text-to-speech. Efforts have been made to produce synthesized speech with a more natural sound. One approach to generating natural-sounding synthesized speech is to select phonetic units from a large unit database to produce a realization of a target unit sequence that is predicted based on the input text. To specify a desired sound, the predicted target unit sequence may be annotated with prosodic patterns and/or targets that represent linguistic prosodic characteristics.
FIG. 1 (Prior Art) illustrates a conventional framework 100 for unit-selection based text-to-speech processing. The conventional framework 100 typically comprises a text-to-speech (TTS) front end 110, a unit selection mechanism 160, a unit database 170, and a speech synthesis mechanism 180.
The TTS front end 110 takes text as input and produces a target unit sequence with an acoustic target as its output. The target unit sequence is predicted according to the text input. The acoustic target annotates the target units in the target unit sequence with acoustic prosodic characteristics. The acoustic prosodic characteristics may be generated with the goal that the synthesized speech using units selected according to the annotated target unit sequence has some desired speech properties.
To generate the target unit sequence with an acoustic target, the TTS front end 110 may process the text in several stages. The TTS front end 110 typically includes a text normalization mechanism 120, a linguistic analysis mechanism 130, a linguistic target generation mechanism 140, and an acoustic target generation mechanism 150. The text normalization mechanism 120 first converts input text containing abbreviated words into normalized text. During such processing, an abbreviated word such as “Corp.” may be converted into a normalized word such as “corporation”.
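The normalization step described above can be sketched as a table lookup. This is a minimal illustration only, not the implementation of the text normalization mechanism 120: the function name `normalize_text` and the abbreviation table are hypothetical, and only the “Corp.” entry comes from the text.

```python
import re

# Hypothetical abbreviation table; only "Corp." is taken from the text above.
ABBREVIATIONS = {
    "Corp.": "corporation",
    "Inc.": "incorporated",
}

def normalize_text(text: str) -> str:
    """Replace each known abbreviation with its expanded form."""
    # Try longer abbreviations first so a longer key wins over a shorter prefix.
    keys = sorted(ABBREVIATIONS, key=len, reverse=True)
    # The lookbehind keeps a match from starting in the middle of a word.
    pattern = re.compile(r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + ")")
    return pattern.sub(lambda m: ABBREVIATIONS[m.group(0)], text)
```

For instance, `normalize_text("Acme Corp. makes pots")` would yield a string in which “Corp.” is replaced by “corporation”. A practical normalizer would also have to handle numbers, dates, and ambiguous abbreviations, which this sketch does not attempt.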
The linguistic analysis mechanism 130 analyzes the normalized text and produces a sequence of phonetic units predicted based on the words contained in the normalized text. For instance, for the word “pot”, the linguistic analysis mechanism 130 may produce three phonemes arranged in the order of /p/, /a/, and /t/. The sequence of units produced at this stage specifies the necessary phonetics to produce the synthesized speech.
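As a rough illustration of this stage, the prediction of phonetic units can be sketched as a lexicon lookup. The lexicon, the function name `to_phonemes`, and the error handling are assumptions made for this example; only the “pot” entry and its phonemes come from the text.

```python
# Hypothetical pronunciation lexicon; only the "pot" entry is taken from the text.
LEXICON = {
    "pot": ["p", "a", "t"],
}

def to_phonemes(normalized_text: str) -> list:
    """Return the phoneme sequence for whitespace-separated normalized words."""
    units = []
    for word in normalized_text.lower().split():
        if word not in LEXICON:
            # A real system would need a fallback for unknown words; this sketch has none.
            raise KeyError(f"no pronunciation for {word!r}")
        units.extend(LEXICON[word])
    return units
```

Calling `to_phonemes("pot")` returns the three phonemes in the order given in the text.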
To produce desired prosodic properties, the linguistic target generation mechanism 140 annotates the units with desired linguistic prosodic characteristics. For example, if the word “pot” is to be stressed, the vowel in “pot” (i.e., phoneme /a/) may be annotated as “stressed”. If a word is the last word of a phrase, it is often lengthened, so all appropriate phonetic units within this word may be annotated as “end of phrase”. Such linguistic annotations specify a relevant linguistic prosodic context and therefore influence what the synthesized speech sounds like.
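The two annotation examples above can be sketched as follows. The `Unit` structure, the function `annotate_phrase`, the vowel set, and the phoneme spelling of “the” are hypothetical choices for illustration; the sketch also treats every unit of the last word as appropriate for the “end of phrase” label, which is a simplification.

```python
from dataclasses import dataclass, field

# Simplistic, assumed vowel inventory used only to decide which unit carries stress.
VOWELS = {"a", "e", "i", "o", "u"}

@dataclass
class Unit:
    phoneme: str
    annotations: set = field(default_factory=set)

def annotate_phrase(words, stressed):
    """Annotate one phrase.

    words:    list of (word, phonemes) pairs in phrase order
    stressed: set of indices into `words` that should be stressed
    """
    units = []
    last = len(words) - 1
    for i, (_, phonemes) in enumerate(words):
        for p in phonemes:
            unit = Unit(p)
            if i in stressed and p in VOWELS:
                unit.annotations.add("stressed")
            if i == last:
                unit.annotations.add("end of phrase")
            units.append(unit)
    return units

# Example: "the pot" with "pot" stressed and phrase-final.
phrase = [("the", ["dh", "ax"]), ("pot", ["p", "a", "t"])]
units = annotate_phrase(phrase, stressed={1})
```

In this example the /a/ of “pot” carries both annotations, while /p/ and /t/ carry only “end of phrase”.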
Linguistic annotation is at a symbolic level. To realize the intended speech effect, the conventional framework 100 maps such symbolic annotations to corresponding acoustic annotations, which specify how the intended speech effect is to be realized. For each linguistic annotation at a symbolic level, the acoustic target generation mechanism 150 translates the linguistic annotation into one or more acoustic annotations. For instance, for a phoneme /a/ annotated with a linguistic prosodic characteristic “stressed”, three acoustic annotations, associated individually with the acoustic features pitch, energy, and duration, may be generated. The acoustic annotations are generated in such a way that, by complying with the annotated acoustic features, the synthesized speech will have the intended linguistic prosodic characteristics. For example, synthesizing the vowel /a/ in accordance with the pitch, energy, and duration annotations translated from the linguistic annotation “stressed” may produce a stressed vowel /a/.
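One way to picture the translation from symbolic to acoustic annotations is a rule table that scales baseline feature values. Everything numeric here is invented for illustration: the baseline values, the scale factors, the feature names, and the function `acoustic_target` are assumptions, and the text does not state how the acoustic target generation mechanism 150 actually computes its targets.

```python
# Assumed neutral feature values for a unit with no linguistic annotation.
BASELINE = {"pitch_hz": 120.0, "energy": 60.0, "duration_ms": 80.0}

# Hypothetical rules: each symbolic annotation scales one or more acoustic features.
RULES = {
    "stressed": {"pitch_hz": 1.25, "energy": 1.25, "duration_ms": 1.5},
    "end of phrase": {"duration_ms": 1.5},
}

def acoustic_target(annotations):
    """Translate a set of symbolic annotations into pitch, energy, and duration targets."""
    target = dict(BASELINE)
    for annotation in annotations:
        # Annotations without a rule leave the baseline untouched.
        for feature, scale in RULES.get(annotation, {}).items():
            target[feature] *= scale
    return target
```

Under these assumed rules, “stressed” yields three acoustic annotations (raised pitch, raised energy, longer duration), mirroring the example in the text, while “end of phrase” affects duration only.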
In the conventional framework 100, the unit selection mechanism 160 takes the target unit sequence annotated with an acoustic target and selects units from the unit database 170 according to the acoustically annotated target unit sequence. That is, the selected units not only satisfy what is required according to the target unit sequence but also possess, to the greatest extent possible, the acoustic properties specified by the acoustic target. The output of the unit selection mechanism 160 is a selected unit sequence, which is then fed to the speech synthesis mechanism 180 to synthesize the speech.
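A minimal sketch of selection against an acoustic target follows. It picks, for each target unit independently, the database candidate of the same phoneme whose features are closest to the target. The cost function, the database layout, and all names are assumptions for illustration; in particular, the sketch ignores how well adjacent selected units join, which practical unit-selection systems generally also take into account.

```python
FEATURES = ("pitch_hz", "energy", "duration_ms")

def target_cost(candidate, target):
    """Sum of relative deviations from the acoustic target (assumed cost function)."""
    return sum(abs(candidate[f] - target[f]) / target[f] for f in FEATURES)

def select_units(targets, database):
    """targets:  list of (phoneme, acoustic_target) pairs
    database: dict mapping phoneme -> list of candidate units (dicts with FEATURES)
    """
    selected = []
    for phoneme, target in targets:
        candidates = database.get(phoneme, [])
        if not candidates:
            raise LookupError(f"no units for phoneme {phoneme!r}")
        # Keep the candidate that deviates least from the acoustic target.
        selected.append(min(candidates, key=lambda c: target_cost(c, target)))
    return selected

# Hypothetical two-candidate database for /a/.
database = {
    "a": [
        {"id": "a1", "pitch_hz": 110.0, "energy": 60.0, "duration_ms": 80.0},
        {"id": "a2", "pitch_hz": 150.0, "energy": 74.0, "duration_ms": 118.0},
    ]
}
targets = [("a", {"pitch_hz": 150.0, "energy": 75.0, "duration_ms": 120.0})]
chosen = select_units(targets, database)
```

Here candidate `a2` is chosen because its pitch, energy, and duration lie much closer to the stressed-vowel target than those of `a1`.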