Conventionally, voice conversion and voice synthesis technologies have been developed with the aim of expressing emotion, vocal expression, attitude, situation, and the like through voice. In particular, they aim to express these not through the verbal content of the speech but through para-linguistic expression such as a manner of speaking, a speaking style, and a tone of voice. These technologies are indispensable to the speech interaction interfaces of electronic devices such as robots and electronic secretaries. Moreover, technologies used in karaoke machines and music sound effect devices have been developed that process the waveform of a speech in order to add musical expression such as tremolo or vibrato, or to emphasize the expression of the speech.
In order to provide expression using voice quality as para-linguistic expression or musical expression of an input speech, a voice conversion method has been developed that analyzes the input speech to calculate synthesis parameters and then changes the calculated parameters to convert the voice quality of the input speech (refer to Patent Reference 1, for example). However, in the above conventional method, the parameter conversion is performed according to a uniform conversion rule that is predetermined for each emotion. The method therefore fails to reproduce the various kinds of voice quality that are produced in natural utterances, such as a voice that is partially strained and rough. Furthermore, the conventional method applies the uniform conversion rule to the entire input speech. It is therefore impossible to convert only the part of the input speech that a speaker wishes to emphasize, or to convert the input speech so as to emphasize the strength of an emotion or expression originally expressed in the input speech.
Meanwhile, a method has been disclosed for converting the singing voice of a user to imitate how the original singer of the song sings (refer to Patent Reference 2, for example). In more detail, the method uses singing data indicating the musical expression of the original singer's way of singing, namely, information on which sections of the song have tremolo or vibrato, a “strained rough voice”, or “unari” (a growling or groaning voice), and to what degree. Based on this data, the above conventional method converts the user's singing voice by changing its amplitude or fundamental frequency or by adding noise.
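The section-wise conversion described above can be illustrated with a minimal sketch. It is not the method of Patent Reference 2. It only shows the general idea of modulating amplitude inside the sections listed in singing data while leaving the rest of the signal untouched. The function name `apply_sectionwise_modulation`, the tuple format `(start_sec, end_sec, rate_hz, depth)`, and the raised-cosine gain curve are all assumptions introduced for illustration. Changes to fundamental frequency and the addition of noise are not covered.

```python
import math


def apply_sectionwise_modulation(samples, sample_rate, sections):
    """Apply tremolo-like amplitude modulation only inside the given sections.

    samples     : sequence of float sample values
    sample_rate : samples per second
    sections    : list of (start_sec, end_sec, rate_hz, depth) tuples.
                  This format is hypothetical; depth is in the range 0..1.
    Returns a new list; samples outside every section are copied unchanged.
    """
    out = list(samples)
    for start_s, end_s, rate_hz, depth in sections:
        first = max(0, int(round(start_s * sample_rate)))
        last = min(len(out), int(round(end_s * sample_rate)))
        for n in range(first, last):
            t = (n - first) / sample_rate
            # Raised-cosine gain: starts at 1.0 and dips to (1.0 - depth).
            gain = 1.0 - depth * 0.5 * (1.0 - math.cos(2.0 * math.pi * rate_hz * t))
            out[n] = samples[n] * gain
    return out


# Usage: a one-second 220 Hz tone, modulated only between 0.2 s and 0.4 s.
rate = 8000
tone = [math.sin(2.0 * math.pi * 220.0 * n / rate) for n in range(rate)]
converted = apply_sectionwise_modulation(tone, rate, [(0.2, 0.4, 5.0, 0.5)])
```

Because the modulation is driven by a fixed list of time sections, a sketch of this kind only works if the user's timing roughly matches the timing assumed in the section list.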
Moreover, in order to address a time lag between a user's singing and the original singer's singing of a song, a method has been disclosed that compares the user's singing data with the data of the song (namely, the original singer's singing) (refer to Patent Reference 3, for example). Combining these conventional technologies makes it possible to convert input singing voices (the user's singing data) to imitate the original singer's way of singing, as long as the singing timings of the user's singing data match those of the original singer's singing closely, even if not precisely.
As one of the various kinds of voice quality partially produced in a speech, a voice called “creaky” or “vocal fry” has been studied under the name “pressed voice”. This voice differs from the “strained rough voice” or “unari” (growling or groaning voice) that is described in this description and that is produced in an excited utterance or as an expression in singing. Non-Patent Reference 1 discloses that the acoustic features of the “creaky voice” are: significant partial changes in energy; a fundamental frequency that is lower and less stable than that of a normal utterance; and smaller power than that of a section of normal utterance. Non-Patent Reference 1 also discloses that these features sometimes occur when the larynx is pressed, which disturbs the periodicity of vocal cord vibration. It is further disclosed that a “pressed voice” often lasts longer than the average duration of a syllable. The “creaky voice” is considered to enhance the impression of a speaker's sincerity in expressing emotion such as interest or hatred, or in expressing attitude such as hesitation or humility. The “pressed voice” described in Non-Patent Reference 1 often occurs in: the process of a speech gradually trailing off, generally at the end of a sentence, a phrase, or the like; the ending of a word that is drawn out while the speaker is selecting words or thinking; and an exclamation or interjection such as “well . . . ” or “um . . . ” uttered when the speaker has no ready answer. Non-Patent Reference 1 still further discloses that each of the “creaky voice” and the “vocal fry” includes diplophonia, which gives rise to a new period in the form of a double beat or a doubling of the fundamental period. As a method of generating the diplophonia that occurs in “vocal fry”, a method has been disclosed that superposes voices whose phases are shifted from one another by half a period of the fundamental frequency.

Patent Reference 1: Japanese Patent No. 3703394
Patent Reference 2: Japanese Unexamined Patent Application Publication No. 2004-177984
Patent Reference 3: Japanese Patent No. 3760833
Non-Patent Reference 1: “Acoustic analysis for automatic detection of pressed voice”, Carlos Toshinori ISHII, Hiroshi ISHIGURO, and Norihiro HAGITA, Technical Report of the Institute of Electronics, Information and Communication Engineers, SP2006, vol. 7, pp. 1-6, 2006