1. Field of Invention
The present invention relates to a voice synthesis device which makes it possible to generate a voice that can express tension and relaxation of a phonatory organ, emotion, expression of the voice, or an utterance style.
2. Description of the Related Art
Conventionally, as a voice synthesis device or method thereof capable of expressing emotion or the like, it has been proposed to first synthesize standard or expressionless voices, then select voices whose characteristic vectors are similar to those of the synthesized voice and which are perceived as voices with expression such as emotion, and finally concatenate the selected voices (see Patent Reference 1, for example).
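The selection step described above amounts to a nearest-neighbor search over characteristic vectors. The following is a minimal sketch of that idea; the unit and database structures are hypothetical illustrations, not taken from Patent Reference 1:

```python
import math

def distance(a, b):
    """Euclidean distance between two characteristic vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_expressive_units(standard_units, expressive_db):
    """For each unit of the synthesized standard voice, select the
    expressive unit whose characteristic vector is most similar;
    the selected units are then concatenated into the output voice."""
    return [min(expressive_db,
                key=lambda cand: distance(cand["vector"], unit["vector"]))
            for unit in standard_units]
```

For instance, a standard unit near the "angry" region of the feature space would select the stored angry-voice unit.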
It has been further proposed to learn in advance, using a neural network, a conversion function that maps the synthesis parameters of a standard or expressionless voice to those of a voice having expression such as emotion, and then apply the learned conversion function to the parameter sequence used to synthesize the standard or expressionless voice (see Patent Reference 2, for example).
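As a rough sketch of this approach, the learned conversion function is applied frame by frame to the synthesis parameter sequence. Here a simple affine map stands in for the trained neural network; the coefficients are hypothetical:

```python
def apply_conversion(param_sequence, weights, bias):
    """Apply a learned conversion function to each frame of an
    expressionless synthesis parameter sequence, yielding the
    parameter sequence of a voice with expression."""
    return [[w * p + b for p, w, b in zip(frame, weights, bias)]
            for frame in param_sequence]
```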
It has been still further proposed to convert voice quality by transforming a frequency characteristic of the parameter sequence used to synthesize the standard or expressionless voice (see Patent Reference 3, for example).
It has been still further proposed, in order to control the degree of an emotion, to convert parameters using parameter conversion functions whose change rates differ according to that degree, and, in order to mix multiple kinds of expressions, to generate parameter sequences by interpolating between two kinds of synthesis parameter sequences having different expressions (see Patent Reference 4, for example).
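The two operations above can be sketched as a degree-dependent conversion and a per-frame interpolation that blends two differently expressive parameter sequences. The linear change rate used here is only illustrative, not the rule of Patent Reference 4:

```python
def convert_by_degree(params, degree, rate=0.2):
    """Convert parameters with a change rate that grows with the
    emotion degree (the linear rule here is a hypothetical example)."""
    return [p * (1.0 + rate * degree) for p in params]

def mix_expressions(seq_a, seq_b, ratio):
    """Blend two synthesis parameter sequences having different
    expressions; ratio = 0 gives seq_a, ratio = 1 gives seq_b."""
    return [(1.0 - ratio) * a + ratio * b for a, b in zip(seq_a, seq_b)]
```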
In addition to the above proposals, a method has been proposed to statistically learn, from natural voices containing the respective emotional expressions, voice generation models based on hidden Markov models (HMMs) corresponding to the respective emotions, then prepare conversion equations between the models, and convert a standard or expressionless voice into a voice expressing emotion (see Non-Patent Reference 1, for example).
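Such an inter-model conversion is commonly realized as a linear transform estimated between the two HMM parameter spaces. The sketch below applies a hypothetical conversion equation x' = Ax + b to a single parameter frame; the matrix and offset are assumed, not taken from Non-Patent Reference 1:

```python
def convert_frame(frame, A, b):
    """Apply a conversion equation estimated between the
    expressionless HMM and an emotional HMM: x' = A x + b."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, frame)) + b_i
            for row, b_i in zip(A, b)]
```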
FIG. 1 is a diagram showing the conventional voice synthesis device described in Patent Reference 4.
In FIG. 1, an emotion input interface unit 109 converts input emotion control information into parameter conversion information, which represents temporal changes in the proportions of the respective emotions as shown in FIG. 2, and then outputs the resulting parameter conversion information to an emotion control unit 108. The emotion control unit 108 converts the parameter conversion information into a reference parameter according to predetermined conversion rules as shown in FIG. 3, and thereby controls the operations of a prosody control unit 103 and a parameter control unit 104. The prosody control unit 103 generates an emotionless prosody pattern from a sequence of phonemes (hereinafter referred to as a "phonologic sequence") and language information, which are generated by a language processing unit 101 and selected by a selection unit 102, and then converts the resulting emotionless prosody pattern into a prosody pattern having emotion, based on the reference parameter generated by the emotion control unit 108. Furthermore, the parameter control unit 104 converts a previously generated emotionless parameter, such as a spectrum or an utterance speed, into an emotion parameter, using the above-mentioned reference parameter, and thereby adds emotion to the synthesized speech.
    Patent Reference 1: Japanese Unexamined Patent Application Publication No. 2004-279436, pages 8-10, FIG. 5.
    Patent Reference 2: Japanese Unexamined Patent Application Publication No. 7-72900, pages 6 and 7, FIG. 1.
    Patent Reference 3: Japanese Unexamined Patent Application Publication No. 2002-268699, pages 9 and 10, FIG. 9.
    Patent Reference 4: Japanese Unexamined Patent Application Publication No. 2003-233388, pages 8-10, FIGS. 1, 3, and 6.
Non-Patent Reference 1: "Consideration of Speaker-Adapting Method for Voice Quality Conversion based on HMM Voice Synthesis", Masatsune Tamura, Takashi Masuko, Keiichi Tokuda, and Takao Kobayashi, The Acoustical Society of Japan, Lecture Papers, volume 1, pp. 319-320, 1998.
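The control flow of FIG. 1 can be summarized as the sketch below: the emotion proportions are reduced to a reference parameter (emotion control unit 108), which then steers prosody control (103) and parameter control (104). The data formats and the scaling rules are hypothetical stand-ins for the conversion rules of FIG. 3:

```python
def synthesize_with_emotion(prosody, spectrum, emotion_proportions, rules):
    """Mirror the FIG. 1 pipeline of Patent Reference 4 at a high level."""
    # Emotion control unit 108: weight each emotion's rule value
    # by its current proportion to obtain a reference parameter.
    ref = sum(rules.get(e, 0.0) * w for e, w in emotion_proportions.items())
    # Prosody control unit 103: convert the emotionless prosody pattern.
    prosody_out = [p * (1.0 + ref) for p in prosody]
    # Parameter control unit 104: convert spectrum-like parameters as well.
    spectrum_out = [s * (1.0 + 0.5 * ref) for s in spectrum]
    return prosody_out, spectrum_out
```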