1. Field of the Invention
This invention relates to a speech synthesis method and apparatus, a program and a recording medium which receive information on an emotion and synthesize speech accordingly, to a method and apparatus for generating constraint information, and to a robot apparatus which outputs the speech.
2. Description of Related Art
A mechanical apparatus that performs movements simulating those of a human being by electrical or magnetic operation is termed a “robot”. Robots came into wide use in this country towards the end of the sixties. Most of them were industrial robots, such as manipulators or transport robots, aimed at automating plant operations or running them unmanned.
Recently, development has been proceeding on practically useful robots that support human life as a partner, that is, robots that support human activities in various aspects of everyday life. In distinction from industrial robots, these robots are able to learn how to adapt to human beings of differing personalities and to varying conditions of the human living environment. For example, pet type robots, simulating the bodily mechanism of animals walking on four feet, such as dogs or cats, and ‘humanoid’ robots, modeled on the bodily mechanism or movements of human beings walking on two feet, have already been put to practical use.
As compared to industrial robots, these robots can perform various operations aimed principally at entertainment, and hence are sometimes termed entertainment robots. Some of these robot apparatus operate autonomously in response to information from outside or to their internal states.
The artificial intelligence (AI) used in these autonomously operating robots is an artificial realization of intellectual functions, such as inference or judgment. Attempts are also being made to artificially realize functions such as emotion or instinct. The artificial intelligence expresses itself to the outside by visual means and by acoustic means, and an example of the acoustic means is the use of speech.
For example, in a robot apparatus simulating an animal, such as a dog or a cat, the function of conveying its own emotion to the human user by speech is effective. The reason is that, even though the user cannot understand what an actual dog or cat is saying, he or she can understand the animal's condition from experience, and the sounds the pet makes are one of the elements in that judgment. In the case of human beings, the emotion of the speaker is judged on the basis of the meaning or contents of the words or speech uttered.
Among the robot apparatus now on the market, some are known to express emotion audibly by electronic sounds. Specifically, a short, high-pitched sound represents happiness, while a slow, low sound represents sadness. These electronic sounds are composed in advance and assigned to emotion classes based on human subjective judgment, and are then reproduced as needed. An emotion class is a category of emotion, such as happiness or anger. The customary auditory emotion representation employing electronic sounds has been pointed out to differ from the emotion expression of pets, such as dogs or cats, mainly in the following points, so that further improvement has been desired:
(i) monotony;
(ii) repetition of the same expression; and
(iii) uncertainty as to whether or not the power of expression is adequate.
In the specification and drawings of the JP Patent Application 2000-372091, the present Assignee proposed a technique which enables an autonomous robot apparatus to make auditory emotion expression closer to that of living creatures. In this technique, a table is first prepared which associates certain parameters, such as pitch, time duration and sound volume (intensity) of at least part of the phonemes contained in the sentence or sound array to be synthesized, with an emotion, such as happiness or anger. The table is switched depending on the emotion of the robot, as determined, and speech synthesis is executed to produce utterances representing the emotion. When the robot utters the nonsensical utterances so generated, tuned to emotion representation, the human being can be informed of the emotion entertained by the robot, even though the contents of the utterances are not quite clear.
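The table-switching idea described above can be illustrated by the following minimal sketch. It is not the implementation of the cited application. The names `ProsodyParams`, `EMOTION_TABLE`, `Phoneme` and `apply_emotion`, the use of simple scale factors, and all numeric values are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProsodyParams:
    """Hypothetical per-emotion scale factors (not taken from the source)."""
    pitch_scale: float
    duration_scale: float
    volume_scale: float


@dataclass(frozen=True)
class Phoneme:
    symbol: str
    pitch_hz: float
    duration_ms: float
    volume: float


# Assumed table: one parameter set per emotion class. Values are illustrative.
EMOTION_TABLE = {
    "neutral": ProsodyParams(1.0, 1.0, 1.0),
    "happiness": ProsodyParams(1.25, 0.75, 1.5),
    "sadness": ProsodyParams(0.75, 1.5, 0.5),
}


def apply_emotion(phonemes, emotion):
    """Select the table entry for `emotion` and scale every phoneme with it."""
    params = EMOTION_TABLE[emotion]
    return [
        Phoneme(
            p.symbol,
            p.pitch_hz * params.pitch_scale,
            p.duration_ms * params.duration_scale,
            p.volume * params.volume_scale,
        )
        for p in phonemes
    ]
```

Because the utterance is nonsensical, such a sketch is free to modify any phoneme. No phoneme carries linguistic meaning that the change could destroy.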
However, the technique disclosed in the specification and drawings of the JP Patent Application 2000-372091 is premised on the robot making nonsensical utterances. Therefore, various problems arise if the technique is applied to a robot apparatus which simulates the human being and which has the function of outputting meaningful synthesized speech in a specific language.
That is, when emotion is added to nonsensical utterances, no particular language imposes a constraint as to which portion of the output sound may be changed. The portion to be changed can therefore be chosen on the basis of probability or of position in the sentence. However, if the same technique is applied to emotional synthesis of a meaningful sentence, it is not clear which portion of the sentence is to be modified or how the portions that must not be changed are to be determined. As a result, the prosody, which is inherently essential in conveying the linguistic information, is changed, so that the meaning can hardly be conveyed or a meaning different from the original one is conveyed to the listener.
Changing the pitch is taken as an example. Japanese is a language that expresses accent by the pitch of speech. The accent position of each Japanese word is fixed, so that the accent positions a native Japanese speaker expects in a given sentence are largely determined. Therefore, if the pitch of a phoneme is changed in order to express emotion, there is a high risk that the resulting synthesized speech sounds unnatural to a native Japanese speaker.
It is also possible that not only is an unintended impression conveyed, but the meaning itself fails to be conveyed. In the case of the word ‘hashi’, meaning ‘chopstick,’ ‘bridge’ or ‘end’, the hearer distinguishes ‘chopstick,’ ‘bridge’ and ‘end’ based on whether the sound ‘ha’ is higher or lower than the sound ‘shi’. Therefore, if emotion is expressed by relative pitch and the relative pitch of a speech portion essential to distinguishing meaning in the language being synthesized is changed, the hearer is unable to understand the meaning correctly.
The same holds when the time duration is changed. For example, if, in synthesizing the word ‘Oka-san’ meaning Mr. Oka, the duration of the phoneme ‘a’ of the sound ‘ka’ is made longer than that of the other phonemes, the hearer may take the output synthesized speech as ‘Okaasan’ (meaning my mother).
Japanese does not distinguish meaning by the relative intensity of sound, and hence changes in sound intensity scarcely make the meaning ambiguous. In a language in which the relative intensity of sound leads to different meanings, as in English, relative intensity is used to differentiate words of the same spelling but of different meanings, and hence the meaning may not be conveyed correctly. For example, in the case of the word ‘present’, stress on the first syllable gives a noun meaning a ‘gift’, whereas stress on the second syllable gives a verb meaning ‘offer’ or ‘present oneself’.
Thus, when speech is synthesized for a meaningful sentence with emotion added, unless control is exercised so that the prosodic characteristics of the language in question, such as accent positions, duration or loudness, are maintained, there is a risk that the hearer cannot understand the meaning of the synthesized speech correctly.
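The problem stated above can be made concrete with a small sketch, given purely as an illustration and not as the method of the invention. It assumes that each mora carries hypothetical flags, `pitch_locked` and `duration_locked`, marking attributes whose change could alter the meaning. The names `Mora` and `modify`, the flags, and the pitch and duration values are all assumptions introduced here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Mora:
    text: str
    pitch_hz: float
    duration_ms: float
    pitch_locked: bool = False     # assumed flag: relative pitch carries meaning
    duration_locked: bool = False  # assumed flag: length carries meaning


def modify(morae, targets, pitch_scale=1.0, duration_scale=1.0,
           respect_locks=True):
    """Scale pitch and duration of the morae at indices in `targets`.

    With respect_locks=True, a locked attribute is left untouched.
    """
    out = []
    for i, m in enumerate(morae):
        if i in targets:
            if not (respect_locks and m.pitch_locked):
                m = replace(m, pitch_hz=m.pitch_hz * pitch_scale)
            if not (respect_locks and m.duration_locked):
                m = replace(m, duration_ms=m.duration_ms * duration_scale)
        out.append(m)
    return out


# 'hashi' with an assumed high-low pitch pattern; both morae are pitch-locked.
hashi = [
    Mora("ha", 220.0, 100.0, pitch_locked=True),
    Mora("shi", 180.0, 100.0, pitch_locked=True),
]

# Raising only 'shi' without regard to the locks inverts the high-low relation.
naive = modify(hashi, {1}, pitch_scale=1.5, respect_locks=False)

# Respecting the locks keeps the relation that distinguishes the word.
safe = modify(hashi, {1}, pitch_scale=1.5)
```

In `naive`, the pitch of ‘shi’ rises above that of ‘ha’, which is the kind of change that could lead the hearer to a different word. In `safe`, the locked pitches are unchanged. The same flags could protect the ‘a’ of ‘ka’ in ‘Oka-san’ from being lengthened.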