In text-to-speech (TTS), in order to output speech that is easily understandable and natural for a listener, it is desirable to accurately determine a way of reading (hereafter simply called a reading), incorporating not only pronunciations but also accents. In conventional text-to-speech techniques, accents are generated by means of numerous rules for determining appropriate accents, which are found on a trial-and-error basis by analyzing standard speech of an announcer or the like. However, generating appropriate rules requires various kinds of work performed by experts, and there has been a risk of requiring enormous cost and time.
There has been proposed a technique for determining a pronunciation and an accent of a phrase in inputted text by using, instead of rules, statistical information such as appearance frequencies of pronunciations and accents of the phrase in previously provided learning data. See Nagano, Mori, and Nishimura, “Kakuritsuteki model wo mochiita yomikata oyobi akusento suitei (Reading and Accent Estimation Using Stochastic Model),” SIG-SLP57 (July, 2005). According to this technique, accurate appearance frequencies can be computed on the premise that a sufficient amount of learning data is available, and the processing for generating accents can be made more efficient since it is not necessary to generate rules.
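The frequency-based selection described above can be illustrated with a minimal sketch. The phrase entries, accent-type numbers, and function names below are illustrative assumptions, not part of the cited technique; the actual method of Nagano et al. estimates readings with a stochastic model over whole sentences rather than the simple per-phrase frequency lookup shown here.

```python
from collections import Counter

# Hypothetical learning data: each entry pairs a phrase with the
# (pronunciation, accent type) observed in annotated text.
learning_data = [
    ("端", ("hashi", 0)),  # accent type 0 (flat)
    ("端", ("hashi", 0)),
    ("橋", ("hashi", 2)),  # accent type 2
    ("箸", ("hashi", 1)),  # accent type 1
]

def estimate_reading(phrase, data):
    """Return the most frequent (pronunciation, accent) pair
    observed for the phrase in the learning data."""
    counts = Counter(pair for p, pair in data if p == phrase)
    if not counts:
        return None  # unseen phrase: would need rules or a fallback
    return counts.most_common(1)[0][0]

print(estimate_reading("端", learning_data))  # → ('hashi', 0)
```

As the fallback branch suggests, such a lookup is only as good as its coverage, which is why the technique presumes a sufficient amount of learning data.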
However, the abovementioned technique using statistical information requires a large amount of learning data for which accurate pronunciations and accents are provided. In order to generate such learning data, it is required that experts who are conversant with the classification of accents and the like manually provide information on accents for each phrase. On the other hand, in sound processing for generating actual speech from information on reading such as pronunciations and accents, data on waveforms of speech actually vocalized by an announcer or the like are often utilized. See Eide, E., et al., “Recent Improvements to the IBM Trainable Speech Synthesis System,” Proc. ICASSP 2003, Hong Kong, Vol. 1, pp. 708-711 (April, 2003). For this reason, the outputted speech sometimes becomes unnatural because inconsistency occurs between the manually provided information on accents and the speech synthesized from the actual speech data.
Consequently, an object of the present invention is to provide a system, a method, and a program capable of solving the abovementioned problem. This object is achieved by a combination of the characteristics described in the independent claims in the scope of claims. Additionally, the dependent claims therein define further advantageous specific examples.