The present invention relates to voice synthesis techniques.
Heretofore, various techniques have been proposed for synthesizing voices imitative of real human voices. In Japanese Patent Application Laid-open Publication No. 2003-255974, for example, there is disclosed a technique for synthesizing a desired voice by cutting out a real human voice (hereinafter referred to as “input voice”) on a phoneme-by-phoneme basis to thereby sample voice segments of the human voice and then connecting together the sampled voice segments. Each voice segment (particularly, a voice segment including a voiced sound, such as a vowel) is extracted from the input voice with a boundary set at a time point where the waveform amplitude becomes substantially constant. FIG. 8 shows a manner in which an example of a voice segment [s_a], comprising a combination of a consonant phoneme [s] and a vowel phoneme [a], is extracted from an input voice. As shown in the figure, a region Ts from time point T1 to time point T2 is designated as the phoneme [s] and a next region Ta from time point T2 to time point T3 is designated as the phoneme [a], so that the voice segment [s_a] is extracted from the input voice. At that time, time point T3, which is the end point of the vowel phoneme [a], is set after time point T0 where the amplitude of the input voice becomes substantially constant (such time point T0 will hereinafter be referred to as “stationary point”). For example, a voice sound “sa” uttered by a person is synthesized by connecting the start point of the vowel phoneme [a] to the end point T3 of the voice segment [s_a].
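The extraction described above can be illustrated with a minimal sketch. It is not taken from the cited publication; the sampling rate, the numerical values of T1, T2, T0 and T3, the synthetic waveform standing in for the input voice, and the helper name `extract_segment` are all assumptions chosen only for illustration.

```python
import math

SAMPLE_RATE = 8000  # assumed sampling rate in Hz (illustrative)

# Assumed time points in seconds, ordered as in FIG. 8: T1 < T2 < T0 < T3
T1, T2, T0, T3 = 0.10, 0.30, 0.60, 0.70


def envelope(t):
    """Synthetic amplitude: low before T2, rising until the stationary
    point T0, and substantially constant afterwards."""
    if t < T2:
        return 0.1
    return min(1.0, 0.1 + 0.9 * (t - T2) / (T0 - T2))


# Stand-in for the input voice: one second of a 200 Hz tone shaped by the
# envelope above (a real input voice would be a recorded waveform).
input_voice = [
    envelope(n / SAMPLE_RATE) * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE)
]


def extract_segment(samples, sample_rate, t_start, t_end):
    """Return the samples lying between t_start and t_end (in seconds)."""
    i0 = int(round(t_start * sample_rate))
    i1 = int(round(t_end * sample_rate))
    return samples[i0:i1]


# Region Ts (phoneme [s]) and region Ta (phoneme [a])
region_s = extract_segment(input_voice, SAMPLE_RATE, T1, T2)
region_a = extract_segment(input_voice, SAMPLE_RATE, T2, T3)

# The voice segment [s_a] spans T1 to T3, so it contains the stationary point T0
segment_s_a = region_s + region_a
```

Because T3 lies after T0 in this sketch, the extracted segment always carries the fully developed vowel amplitude, which is the property of the conventional technique at issue here.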
However, because the voice segment [s_a] has the end point T3 set after the stationary point T0, the conventional technique cannot always synthesize a natural voice. Since the stationary point T0 corresponds to a time point when the person has gradually opened his or her mouth into a fully-opened position for utterance of the voice, the voice synthesized using the voice segment extending over the entire region including the stationary point T0 would inevitably become imitative of the voice uttered by the person fully opening his or her mouth. However, when actually uttering a voice, a person does not necessarily do so by fully opening the mouth. For example, in singing a fast-tempo music piece, it is sometimes necessary for a singing person to utter a next word before fully opening the mouth to utter a given word. Also, to enhance a singing expression, a person may sing without sufficiently opening the mouth at an initial stage immediately after the beginning of a music piece and then gradually increase the opening degree of the mouth as the tune rises or livens up. Despite such circumstances, the conventional technique is arranged to merely synthesize voices fixedly using voice segments corresponding to fully-opened mouth positions, and thus it cannot appropriately synthesize subtle voices like those uttered with the mouth insufficiently opened.
It is possible, after a fashion, to synthesize voices corresponding to various opening degrees of the mouth by sampling a plurality of voice segments from different input voices uttered with various opening degrees of the mouth and selectively using any of the sampled voice segments. In this case, however, a multiplicity of voice segments must be prepared, which involves a great amount of labor to create the voice segments; in addition, a storage device of a great capacity is required to hold the multiplicity of voice segments.