(1) Field of the Invention
The present invention relates to a speech synthesis apparatus which synthesizes speech using speech elements and to a speech synthesis method thereof, and, in particular, to a speech synthesis apparatus which transforms voice characteristics of the speech elements and to a speech synthesis method thereof.
(2) Description of the Related Art
Conventionally, speech synthesis apparatuses which perform voice characteristic transformation have been proposed (e.g., see Patent Reference 1: Japanese Laid-Open Patent Application No. 7-319495, paragraphs 0014 to 0019, Patent Reference 2: Japanese Laid-Open Patent Application No. 2003-66982, paragraphs 0035 to 0053, and Patent Reference 3: Japanese Laid-Open Patent Application No. 2002-215198).
The speech synthesis apparatus disclosed in the patent reference 1 has speech element sets, each of which has a different voice characteristic, and performs voice characteristic transformation by switching the speech element sets.
FIG. 1 is a block diagram showing a structure of the speech synthesis apparatus disclosed in the patent reference 1.
This speech synthesis apparatus includes a synthesis unit data information table 901, an individual code book storing unit 902, a likelihood calculating unit 903, a plurality of individual-specific synthesis unit databases 904, and a voice characteristic transforming unit 905.
The synthesis unit data information table 901 holds data elements (synthesis unit data), each relating to a synthesis unit to be speech synthesized. Each piece of synthesis unit data has a synthesis unit data ID for uniquely identifying the synthesis unit. The individual code book storing unit 902 holds information which indicates the identifiers of all the speakers (individual identification IDs) and the characteristics of each speaker's voice. The likelihood calculating unit 903 selects a synthesis unit data ID and an individual identification ID by referring to the synthesis unit data information table 901 and the individual code book storing unit 902, based on standard parameter information, synthesis unit names, phonetic environmental information, and target voice characteristic information.
Each of the individual-specific synthesis unit databases 904 holds a different speech element set which has a unique voice characteristic. Also, the individual-specific synthesis unit database is associated with an individual identification ID.
The voice characteristic transforming unit 905 obtains the synthesis unit data ID and individual identification ID selected by the likelihood calculating unit 903. The voice characteristic transforming unit 905 then generates a speech waveform by obtaining speech elements corresponding to the synthesis unit data indicated by the synthesis unit data ID from the individual-specific synthesis unit database 904 identified by the individual identification ID.
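The lookup performed by the voice characteristic transforming unit 905 can be illustrated with a minimal Python sketch. It is not taken from Patent Reference 1: the names `DATABASES` and `generate_waveform`, the speaker identifiers, and the sample values are all hypothetical, and concatenating the elements stands in for whatever waveform generation the reference actually uses.

```python
from typing import Dict, List

# Hypothetical stand-in for the individual-specific synthesis unit databases 904:
# individual identification ID -> (synthesis unit data ID -> speech element samples).
DATABASES: Dict[str, Dict[int, List[float]]] = {
    "speaker_a": {1: [0.1, 0.2], 2: [0.3]},
    "speaker_b": {1: [0.5, 0.6], 2: [0.7]},
}


def generate_waveform(individual_id: str, unit_ids: List[int]) -> List[float]:
    """Fetch the speech elements for the given IDs from one speaker's database
    and join them in order (simple concatenation is an assumption)."""
    database = DATABASES[individual_id]  # switch the element set by speaker ID
    waveform: List[float] = []
    for unit_id in unit_ids:
        waveform.extend(database[unit_id])
    return waveform


# The same synthesis unit data IDs give a different voice per speaker database.
waveform_a = generate_waveform("speaker_a", [1, 2])
waveform_b = generate_waveform("speaker_b", [1, 2])
```

In this sketch the voice characteristic changes only because a different database is selected; the speech elements themselves are not modified.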
On the other hand, the speech synthesis apparatus disclosed in the patent reference 2 transforms a voice characteristic of an ordinary synthesized speech using a transformation function for performing the voice transformation.
FIG. 2 is a block diagram showing a structure of the speech synthesis apparatus disclosed in the patent reference 2.
This speech synthesis apparatus includes a text input unit 911, an element storing unit 912, an element selecting unit 913, a voice characteristic transforming unit 914, a waveform synthesizing unit 915, and a voice characteristic transformation parameter input unit 916.
The text input unit 911 obtains text information indicating the details of words to be synthesized or phoneme information, and prosody information indicating accents and intonation of an overall speech. The element storing unit 912 holds a set of speech elements (synthesis speech unit). The element selecting unit 913, based on the phoneme information and prosody information obtained by the text input unit 911, selects optimum speech elements from the element storing unit 912, and outputs the selected speech elements. The voice characteristic transformation parameter input unit 916 obtains a voice characteristic parameter indicating a parameter relating to the voice characteristic.
The voice characteristic transforming unit 914 performs voice characteristic transformation on the speech elements selected by the element selecting unit 913, based on the voice characteristic parameter obtained by the voice characteristic transformation parameter input unit 916. Accordingly, a linear or non-linear frequency transformation is performed on the speech elements. The waveform synthesizing unit 915 generates a speech waveform based on the speech elements whose voice characteristics are transformed by the voice characteristic transforming unit 914.
FIG. 3 is a diagram explaining the transformation functions used for the voice characteristic transformation of the respective speech elements performed by the voice characteristic transforming unit 914 disclosed in Patent Reference 2. Here, the horizontal axis (Fi) in FIG. 3 indicates an input frequency of a speech element inputted to the voice characteristic transforming unit 914, and the vertical axis (Fo) in FIG. 3 indicates an output frequency of the speech element outputted by the voice characteristic transforming unit 914.
In the case where a transformation function f101 is used as the voice characteristic parameter, the voice characteristic transforming unit 914 outputs the speech element selected by the element selecting unit 913 without performing voice characteristic transformation. In the case where a transformation function f102 is used as the voice characteristic parameter, the voice characteristic transforming unit 914 linearly transforms the input frequency of the speech element selected by the element selecting unit 913 and outputs the result; in the case where a transformation function f103 is used as the voice characteristic parameter, it transforms the input frequency non-linearly and outputs the result.
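The three kinds of mapping from input frequency Fi to output frequency Fo can be sketched in Python as follows. The exact curves of FIG. 3 are not given in the text, so the slope, the exponent, and the upper band limit `MAX_FREQ_HZ` below are assumptions chosen only to show an identity mapping, a linear mapping, and a non-linear mapping.

```python
MAX_FREQ_HZ = 8000.0  # assumed upper band limit; not stated in the source


def f101(fi: float) -> float:
    """Identity mapping: the speech element passes through unchanged."""
    return fi


def f102(fi: float, slope: float = 1.2) -> float:
    """Linear mapping Fo = slope * Fi, clipped to the band limit.
    The slope value is an illustrative assumption."""
    return min(slope * fi, MAX_FREQ_HZ)


def f103(fi: float, gamma: float = 0.8) -> float:
    """Non-linear mapping: a power-law warp that keeps 0 and the band limit
    fixed. The power-law form and gamma are illustrative assumptions."""
    return MAX_FREQ_HZ * (fi / MAX_FREQ_HZ) ** gamma


# Map a few input frequencies through each transformation function.
inputs_hz = [0.0, 1000.0, 4000.0, 8000.0]
outputs = {f.__name__: [f(fi) for fi in inputs_hz] for f in (f101, f102, f103)}
```

With an exponent below 1, the power-law warp raises mid-band frequencies while leaving both ends of the band in place, which is one common way to realize a non-linear frequency transformation.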
In addition, a speech synthesis apparatus (voice characteristic transformation apparatus) disclosed in Patent Reference 3 determines, based on an acoustic characteristic of a phoneme whose voice characteristic is to be transformed, the group to which the phoneme belongs. The speech synthesis apparatus then transforms the voice characteristic of the phoneme using the transformation function set for that group.
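The group-based selection can be sketched as below. The text does not say which acoustic characteristic Patent Reference 3 uses or how many groups exist, so the spectral-centroid feature, the 2000 Hz threshold, the two group names, and the scaling functions are all hypothetical and serve only to show the two steps of classifying a phoneme and then applying its group's function.

```python
from typing import Callable, Dict


def classify_phoneme(spectral_centroid_hz: float) -> str:
    """Assign a phoneme to a group from one acoustic characteristic.
    The feature and the 2000 Hz threshold are assumptions."""
    return "low" if spectral_centroid_hz < 2000.0 else "high"


# One transformation function per group (hypothetical frequency scalings).
GROUP_FUNCTIONS: Dict[str, Callable[[float], float]] = {
    "low": lambda freq_hz: 1.1 * freq_hz,
    "high": lambda freq_hz: 0.9 * freq_hz,
}


def transform_frequency(freq_hz: float, spectral_centroid_hz: float) -> float:
    """Transform a frequency with the function set for the phoneme's group."""
    group = classify_phoneme(spectral_centroid_hz)
    return GROUP_FUNCTIONS[group](freq_hz)


# Two phonemes with different acoustic characteristics use different functions.
low_result = transform_frequency(1000.0, spectral_centroid_hz=500.0)
high_result = transform_frequency(1000.0, spectral_centroid_hz=3500.0)
```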