Artificially creating a speech signal from arbitrary text is referred to as text-to-speech synthesis. Normally, text-to-speech synthesis is performed in three steps: text analysis, synthesis parameter generation, and speech synthesis.
In a typical text-to-speech synthesis system, a text analysis unit first performs morphological analysis, parsing, and the like on an input text and outputs language information. The language information includes phonetic symbol strings corresponding to the reading of the text, information on accent phrases serving as units of prosodic control, accent positions, and parts of speech. Next, a synthesis parameter generation unit generates synthesis parameters by performing prosodic control based on the language information while referring to a prosodic control dictionary. The synthesis parameters include prosodic parameters such as a fundamental frequency pattern (F0 pattern), phoneme duration, and power, and phonological parameters such as a phonemic symbol string. Finally, a speech synthesis unit generates synthesized speech in accordance with the synthesis parameters.
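The three-stage flow described above can be sketched as follows. This is only a minimal illustration under loud assumptions: the names `LanguageInfo`, `SynthesisParameters`, `analyze_text`, `generate_parameters`, and `synthesize` are hypothetical, the "analysis" merely splits on whitespace and treats each letter as a phonetic symbol, the prosodic control assigns a flat F0 with a fixed duration, and the "synthesis" emits a plain sine tone per phoneme. A real system would use morphological analysis, a prosodic control dictionary, and a proper waveform generator.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class LanguageInfo:
    phonetic_symbols: List[str]      # reading of the text
    accent_phrases: List[List[str]]  # units of prosodic control
    accent_positions: List[int]      # one accent position per accent phrase
    parts_of_speech: List[str]       # one tag per accent phrase


@dataclass
class SynthesisParameters:
    f0_pattern: List[float]  # prosodic: fundamental frequency in Hz, per phoneme
    durations: List[float]   # prosodic: phoneme duration in seconds
    power: List[float]       # prosodic: amplitude per phoneme
    phonemes: List[str]      # phonological: phonemic symbol string


def analyze_text(text: str) -> LanguageInfo:
    """Toy stand-in for text analysis (assumption: word = accent phrase)."""
    phrases = [list(word) for word in text.split()]
    symbols = [s for phrase in phrases for s in phrase]
    return LanguageInfo(
        phonetic_symbols=symbols,
        accent_phrases=phrases,
        accent_positions=[0] * len(phrases),       # placeholder values
        parts_of_speech=["unknown"] * len(phrases),
    )


def generate_parameters(info: LanguageInfo,
                        base_f0: float = 120.0,
                        phoneme_duration: float = 0.08) -> SynthesisParameters:
    """Toy stand-in for prosodic control (assumption: flat F0, fixed duration)."""
    n = len(info.phonetic_symbols)
    return SynthesisParameters(
        f0_pattern=[base_f0] * n,
        durations=[phoneme_duration] * n,
        power=[1.0] * n,
        phonemes=list(info.phonetic_symbols),
    )


def synthesize(params: SynthesisParameters, sample_rate: int = 16000) -> List[float]:
    """Toy stand-in for speech synthesis: one sine tone per phoneme."""
    samples: List[float] = []
    for f0, dur, amp in zip(params.f0_pattern, params.durations, params.power):
        n = int(round(dur * sample_rate))
        samples.extend(
            amp * math.sin(2.0 * math.pi * f0 * i / sample_rate) for i in range(n)
        )
    return samples


info = analyze_text("hello world")
params = generate_parameters(info)
waveform = synthesize(params)
```

The point of the sketch is the data flow: language information feeds parameter generation, and the resulting synthesis parameters are all the final stage needs to produce a waveform.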
Such text-to-speech synthesis usually produces speech with the tone of a human reading text aloud (a so-called reading style). Recently, a number of methods have been proposed for implementing a variety of prosodic features. For example, one proposed method generates new prosodic parameters by performing interpolation processing between a plurality of prosodic parameters, and generates synthesized speech using these new prosodic parameters, thereby offering synthesized speech having a variety of prosodic features.
In this method, however, the interpolation result may be inappropriate depending on the relationship between the prosodic parameters (for example, when the feature amounts of the prosodic parameters differ greatly). Take an F0 pattern as an example of a prosodic parameter. Assume that interpolation is performed between the prosodic parameter of a male target speaker and that of a female speaker. Since the F0 pattern is generally higher for women, the average F0 value of the prosodic pattern generated by interpolation becomes higher than the average value of the F0 pattern of the target speaker (the male speaker). As a result, the personality of the target speaker is lost in the generated prosodic parameter.
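The shift in average F0 described above can be illustrated with a small numeric sketch. It assumes simple pointwise linear interpolation with a hypothetical helper `interpolate_f0`; the method under discussion is not stated to be linear, and the F0 values below are made-up illustrative numbers, not measurements.

```python
from statistics import mean
from typing import List


def interpolate_f0(f0_a: List[float], f0_b: List[float], weight: float) -> List[float]:
    """Pointwise linear interpolation between two equal-length F0 patterns.

    weight = 0.0 returns f0_a unchanged, weight = 1.0 returns f0_b.
    """
    if len(f0_a) != len(f0_b):
        raise ValueError("F0 patterns must have the same length")
    return [(1.0 - weight) * a + weight * b for a, b in zip(f0_a, f0_b)]


# Illustrative F0 patterns in Hz (assumed values, not real data).
male_f0 = [110.0, 130.0, 120.0, 100.0]    # target speaker, mean 115 Hz
female_f0 = [210.0, 250.0, 230.0, 190.0]  # other speaker, mean 220 Hz

mixed_f0 = interpolate_f0(male_f0, female_f0, 0.5)

print(mean(male_f0))   # 115.0
print(mean(mixed_f0))  # 167.5
```

With an equal weighting, the interpolated pattern averages 167.5 Hz, well above the target speaker's 115 Hz. Under linear interpolation the average of the result is the same weighted mix of the two input averages, so any nonzero weight on the higher-pitched pattern raises the average away from that of the target speaker.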