Conventionally, voice conversion and voice synthesis technologies have been developed to express emotion, vocal expression, attitude, situation, and the like using voices. In particular, emotion and the like are expressed not through the verbal content of voices but through para-linguistic expression such as the manner of speaking, speaking style, and tone of voice. These technologies are indispensable to the speech interaction interfaces of electronic devices such as robots and electronic secretaries.
Among para-linguistic expressions of voices, various methods have been proposed for changing prosody patterns. One disclosed method generates prosody patterns, such as a fundamental frequency pattern, a power pattern, and a rhythm pattern, based on a model, and modifies the fundamental frequency pattern and the power pattern using periodic fluctuation signals according to the emotion to be expressed, thereby generating prosody patterns of voices carrying that emotion (refer to Patent Reference 1, for example). As described in paragraph [0118] of Patent Reference 1, this method of generating voices with emotion by modifying prosody patterns requires periodic fluctuation signals whose cycles each exceed the duration of a syllable, in order to prevent voice quality change caused by the fluctuation.
On the other hand, as methods of achieving expression using voice quality, there have been developed: a voice conversion method that analyzes input voices to calculate synthesis parameters and changes the calculated parameters so as to change the voice quality of the input voices (refer to Patent Reference 2, for example); and a voice synthesis method that generates parameters for synthesizing standard voices or voices without emotion and changes the generated parameters (refer to Patent Reference 3, for example).
Further, among technologies of speech synthesis using concatenation of speech waveforms, a technology is disclosed that synthesizes standard voices or voices without emotion in advance, selects voices having feature vectors similar to those of the synthesized voices from among voices having expression such as emotion, and concatenates the selected voices to each other (refer to Patent Reference 4, for example).
Furthermore, among voice synthesis technologies that generate synthesis parameters using statistical models learned from synthesis parameters obtained by analyzing natural speech, a method is disclosed that statistically learns a voice generation model for each emotion from natural speech including emotional expressions, prepares formulas for conversion between the models, and thereby converts standard voices or voices without emotion into voices expressing emotion.
Among the above-mentioned conventional methods, however, the technique based on synthesis parameter conversion performs the conversion according to a uniform conversion rule predetermined for each emotion. This prevents the technique from reproducing the variety of voice qualities produced in natural utterances, such as voice quality in which a strained rough voice appears in only part of an utterance.
In addition, in the above method of extracting voices with vocal expression such as emotion whose feature vectors are similar to those of standard voices and concatenating the extracted voices, voices having characteristic and special voice quality, such as the "strained rough voice" that differs significantly from the voice quality of normal utterances, are rarely selected. This prevents the method from reproducing the variety of voice qualities produced in natural utterances.
Moreover, in the above method of learning statistical voice synthesis models from natural speech including emotional expressions, although variations of voice quality could in principle also be learned, voices whose quality is characteristic of a particular emotion are not produced frequently in natural speech, which makes learning such voice quality difficult. For example, the above-mentioned "strained rough voice", the whispery voice produced characteristically when speaking politely and gently, and the breathy voice that is also called a soft voice (refer to Patent References 4 and 5) are impressive voices with characteristic quality that draw the attention of listeners and thereby significantly influence the impression of a whole utterance. However, such voices occur only in portions of a real utterance, and their occurrence frequency is not high. Since the ratio of the duration of such voices to the entire utterance duration is low, models that reproduce "strained rough voice", "breathy voice", and the like are unlikely to be learned by statistical learning.
That is, the above-described conventional methods have difficulty in reproducing variations of partial voice quality, and therefore cannot richly express vocal expression with texture, reality, and fine temporal structure.
In order to address the above problems, a method is conceived of performing voice quality conversion specifically on voices with characteristic voice quality so as to reproduce such variations of voice quality. As physical features (characteristics) of voice quality that form the basis of this voice quality conversion, a "pressed" ("rikimi" in Japanese) voice, whose definition differs from that of the "strained rough" ("rikimi" in Japanese) voice in this description, and the above-mentioned "breathy" voice have been studied.
The "breathy voice" has the features of a weak spectrum in harmonic components and a large amount of noise components due to airflow. These features result from the fact that the glottis opens more widely when uttering a "breathy voice" than when uttering a normal, or modal, voice, and that a "breathy voice" is intermediate between a modal voice and a whisper: a modal voice has few noise components, while a whisper is uttered using only noise components without any periodic components. The feature of "breathy voice" is detected as a low correlation between the envelope waveform of the first formant band and the envelope waveform of the third formant band, in other words, a low correlation between the shape of the envelope of band-pass signals centered in the vicinity of the first formant and the shape of the envelope of band-pass signals centered in the vicinity of the third formant. By adding this feature to synthetic voices in voice synthesis, the "breathy" voice can be generated (refer to Patent Reference 5).
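The envelope-correlation measure described above can be sketched as follows. This is a minimal illustration, not the implementation of Patent Reference 5; the fixed band edges, the function names, and the use of an FFT-based analytic signal for envelope extraction are all choices made here for the sketch:

```python
import numpy as np

def band_envelope(x, fs, lo, hi):
    """Envelope of the band-limited signal, via an FFT-based analytic
    signal: keep only positive frequencies inside [lo, hi] Hz, doubled;
    the magnitude of the inverse FFT is the band's envelope."""
    n = len(x)
    X = np.fft.fft(x)
    freqs = np.fft.fftfreq(n, 1.0 / fs)
    mask = (freqs >= lo) & (freqs <= hi)
    Xa = np.zeros_like(X)
    Xa[mask] = 2.0 * X[mask]
    return np.abs(np.fft.ifft(Xa))

def envelope_correlation(x, fs, f1_band=(400.0, 1200.0),
                         f3_band=(2200.0, 3200.0)):
    """Pearson correlation between the envelopes of the (assumed)
    first- and third-formant bands. A low value suggests the
    noise-dominated excitation of a breathy voice."""
    e1 = band_envelope(x, fs, *f1_band)
    e3 = band_envelope(x, fs, *f3_band)
    return np.corrcoef(e1, e3)[0, 1]
```

For a harmonic signal whose two bands share the same amplitude modulation (modal-like), the correlation is near 1; for white noise, whose disjoint bands fluctuate independently (breathy/whisper-like), it is near 0.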
Moreover, as a "pressed voice" different from the "strained rough voice" in this description, which is produced in utterances of anger or excitement, a voice called "creaky" or "vocal fry" has been studied. In this study, the acoustic features of the "creaky voice" are: (i) significant partial change of energy; (ii) a fundamental frequency lower and less stable than that of normal utterance; and (iii) smaller power than that of a section of normal utterance. The study reveals that these features sometimes occur when the larynx is pressed during utterance, which disturbs the periodicity of vocal fold vibration, and that a "pressed voice" often occurs over a duration longer than the average syllable-basis duration. The "creaky voice" is considered to have the effect of enhancing the impression of sincerity of a speaker in emotional expression such as interest or hatred, or in attitude expression such as hesitation or a humble attitude. The "pressed voice" described in this study often occurs in: (i) the process of gradually ceasing speech, generally at the end of a sentence, a phrase, or the like; (ii) the ending of a word uttered in an extended manner while the speaker selects words or thinks; and (iii) exclamations or interjections such as "well . . . " and "um . . . " uttered when having no ready answer. The study further reveals that each of the "creaky voice" and the "vocal fry" includes a diplophonia, in which a double beat or a new period equal to double the fundamental period occurs. As a method of generating the diplophonia occurring in "vocal fry", there is disclosed a method of superposing a voice on a copy of itself whose phase is shifted by half a fundamental period (refer to Patent Reference 6).
    Patent Reference 1: Japanese Unexamined Patent Application Publication No. 2002-258886 (FIG. 8, paragraph [0118])
    Patent Reference 2: Japanese Patent No. 3703394
    Patent Reference 3: Japanese Unexamined Patent Application Publication No. 7-72900
    Patent Reference 4: Japanese Unexamined Patent Application Publication No. 2004-279436
    Patent Reference 5: Japanese Unexamined Patent Application Publication No. 2006-84619
    Patent Reference 6: Japanese Unexamined Patent Application Publication No. 2006-145867
    Patent Reference 7: Japanese Unexamined Patent Application Publication No. 3-174597
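The half-period superposition mentioned above for generating the diplophonia of "vocal fry" can be sketched as follows. This is a minimal illustration only, not the implementation of Patent Reference 6; the function name, the mixing ratio, and the pulse-train usage example are assumptions made here:

```python
import numpy as np

def add_diplophonia(x, fs, f0, mix=0.6):
    """Superpose the signal on a copy of itself delayed by half the
    fundamental period (T0 / 2). For a glottal pulse train this inserts
    a weaker pulse between successive pulses, giving the double-beat
    (diplophonic) character associated with vocal fry."""
    half_period = int(round(fs / (2.0 * f0)))  # T0 / 2 in samples
    delayed = np.concatenate([np.zeros(half_period), x[:-half_period]])
    return x + mix * delayed
```

Applied to a pulse train of period T0, the output contains pulses every T0/2 with alternating amplitudes 1 and `mix`, so the amplitude pattern repeats with period T0 and is heard as a double beat.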