1. Technical Field
The present invention relates to emotion recognition apparatuses for recognizing a speaker's emotion based on his or her speech. More specifically, the present invention relates to speech-based emotion recognition apparatuses for recognizing a speaker's emotion by detecting an occurrence of a characteristic tone in a speech, which is caused by tension or relaxation of a vocal organ that varies momentarily according to the speaker's emotion, expression, attitude, or speaking style.
2. Background Art
In an interactive system provided with a voice interactive interface, such as an automatic telephone answering system, an electronic secretary, or an interactive robot, it is an important requirement to perceive the emotion of a user from his or her speech, in order to respond to the user's request more appropriately. For example, when the aforementioned automatic telephone answering system or interactive robot communicates with the user by voice, the interactive system may not always recognize the user's speech correctly. When the interactive system fails to recognize the user's speech correctly, it requests the user to input the speech again. In such a situation, the user may become somewhat angry or frustrated, and more so when the false recognition occurs repeatedly. The anger or frustration causes the user's way of speaking or voice quality to change, as a result of which the user's speech exhibits a different pattern from when he or she speaks in a normal state. This makes the interactive system, which stores the user's voice in the normal state as a model for recognition, more prone to false recognition. As a result, the interactive system makes even more annoying requests to the user, such as requesting the same answer again and again. When the interactive system falls into such a vicious circle, it becomes useless as an interactive interface.
To stop this vicious circle and normalize the voice communication between the device and the user, it is necessary to recognize the user's emotion from his or her speech. That is, if the interactive system is capable of perceiving the user's anger or frustration, the interactive system can ask the user again more politely or apologize for the false recognition. By doing so, the interactive system can bring the user's emotion close to normal, and draw normal-state speech from the user. As a result, the recognition rate can be recovered, and device operation through the interactive system can be performed smoothly.
Conventionally, for speech-based emotion recognition, a method has been proposed that extracts prosodic features such as a voice pitch (fundamental frequency), a volume (power), and a speech rate from a speech inputted by a speaker, and recognizes an emotion based on a judgment such as “high-pitched” or “loud” for the entire input speech (for example, see Patent Document 1 and Patent Document 2). Also, a method of making a judgment such as “energy is high in a high frequency region” for an entire input speech has been proposed (for example, see Patent Document 1). Further, a method of obtaining, from sequences of power and fundamental frequency of a speech, their statistical representative values such as a mean value, a maximum value, and a minimum value, and recognizing an emotion from those values, has been proposed (for example, see Patent Document 3). Moreover, a method of recognizing an emotion by using a time pattern of prosody, such as an intonation or an accent in a sentence or a word, has been proposed (for example, see Patent Document 4 and Patent Document 5).
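The statistical representative values mentioned above can be sketched as follows. This is only an illustrative sketch, not the method of Patent Document 3: the function name `representative_values` and the frame-wise sample values are assumptions introduced here, and the extraction of fundamental frequency and power from the speech signal is assumed to have been done already.

```python
from statistics import mean


def representative_values(sequence):
    """Return the mean, maximum, and minimum of a frame-wise feature sequence."""
    if not sequence:
        raise ValueError("sequence must contain at least one frame")
    return {"mean": mean(sequence), "max": max(sequence), "min": min(sequence)}


# Hypothetical frame-wise values for one utterance (illustrative only).
f0_hz = [120.0, 135.0, 150.0, 180.0, 165.0]      # fundamental frequency per frame
power_db = [60.0, 62.0, 70.0, 68.0, 65.0]        # power per frame

f0_stats = representative_values(f0_hz)
power_stats = representative_values(power_db)
print(f0_stats)
print(power_stats)
```

Note that each value summarizes the whole utterance, so a single set of numbers stands in for every frame of the input speech.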
FIG. 20 shows a conventional speech-based emotion recognition apparatus described in Patent Document 1.
A microphone 1 converts an input speech to an electrical signal. A speech code recognition unit 2 performs speech recognition on the speech inputted from the microphone 1, and outputs a recognition result to a sensitivity information extraction unit 3 and an output control unit 4.
Meanwhile, a speech rate detection unit 31, a fundamental frequency detection unit 32, and a volume detection unit 33 in the sensitivity information extraction unit 3 extract a speech rate, a fundamental frequency, and a volume from the speech inputted from the microphone 1, respectively.
A speech level judgment criterion storage unit 34 stores a criterion for comparing the speech rate, fundamental frequency, and volume of the input speech respectively with a reference speech rate, fundamental frequency, and volume and determining a speech level. A reference speech feature parameter storage unit 35 stores the reference speech rate, fundamental frequency, and volume that are used as a reference when judging the speech level. A speech level analysis unit 36 determines the speech level, that is, a speech rate level, a fundamental frequency level, and a volume level, based on a ratio between a feature parameter of the input speech and a reference speech feature parameter.
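The ratio-based determination performed by the speech level analysis unit 36 might be sketched as below. The text states only that the level is based on a ratio between an input feature parameter and a reference feature parameter; the thresholds `LOW_RATIO` and `HIGH_RATIO`, the three level names, the `SpeechFeatures` type, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechFeatures:
    speech_rate: float       # morae per second
    fundamental_freq: float  # Hz
    volume: float            # power, arbitrary linear units


# Hypothetical judgment criterion (stands in for storage unit 34).
LOW_RATIO = 0.8
HIGH_RATIO = 1.2


def level_from_ratio(value, reference):
    """Classify one feature by the ratio of the input value to the reference."""
    ratio = value / reference
    if ratio >= HIGH_RATIO:
        return "high"
    if ratio <= LOW_RATIO:
        return "low"
    return "normal"


def analyze_speech_level(inp, ref):
    """Return a level for speech rate, fundamental frequency, and volume."""
    return {
        "speech_rate": level_from_ratio(inp.speech_rate, ref.speech_rate),
        "fundamental_freq": level_from_ratio(inp.fundamental_freq, ref.fundamental_freq),
        "volume": level_from_ratio(inp.volume, ref.volume),
    }


# Hypothetical reference (stands in for storage unit 35) and one input utterance.
reference = SpeechFeatures(speech_rate=7.0, fundamental_freq=150.0, volume=1.0)
utterance = SpeechFeatures(speech_rate=9.1, fundamental_freq=195.0, volume=1.0)

levels = analyze_speech_level(utterance, reference)
print(levels)
```

Because every level is relative to a single stored reference, the result depends entirely on how well that reference matches the speaker.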
A sensitivity level analysis knowledge base storage unit 37 stores a rule for judging a sensitivity level according to each speech level determined by the speech level analysis unit 36. A sensitivity level analysis unit 38 judges the sensitivity level, that is, a sensitivity type and level, from the output of the speech level analysis unit 36 and the output of the speech code recognition unit 2, based on the rule stored in the sensitivity level analysis knowledge base storage unit 37.
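A rule lookup of the kind held in the sensitivity level analysis knowledge base storage unit 37 could look roughly like the following. The actual rules of Patent Document 1 are not reproduced in this text, so the table `RULES`, the emotion names, and the numeric levels are purely hypothetical. The sketch also omits the speech recognition result from the speech code recognition unit 2, which the described apparatus uses as a further input.

```python
# Hypothetical rule table: (speech rate level, fundamental frequency level,
# volume level) -> (sensitivity type, sensitivity level).
RULES = {
    ("high", "high", "high"): ("anger", 2),
    ("high", "high", "normal"): ("anger", 1),
    ("low", "low", "low"): ("sadness", 1),
}

# Returned when no rule matches the combination of speech levels.
DEFAULT = ("neutral", 0)


def judge_sensitivity(levels):
    """Look up the sensitivity type and level for a set of speech levels."""
    key = (levels["speech_rate"], levels["fundamental_freq"], levels["volume"])
    return RULES.get(key, DEFAULT)


result = judge_sensitivity(
    {"speech_rate": "high", "fundamental_freq": "high", "volume": "high"}
)
print(result)
```

Under such rules, any input whose three levels are all "high" maps to the same judgment, regardless of why the speaker's rate, pitch, and volume exceed the reference.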
The output control unit 4 generates an output corresponding to the sensitivity level of the input speech by controlling an output device 5, in accordance with the sensitivity level outputted from the sensitivity level analysis unit 38. Here, the information used for determining the speech level includes a speech rate expressed as the number of morae spoken per second, an average fundamental frequency, and other prosodic information obtained in a unit such as an utterance, a sentence, or a phrase.
However, prosodic information is also used for conveying linguistic information, and the way in which such linguistic information is conveyed differs between languages. For example, in Japanese, there are many homophones, such as “hashi” (“bridge”) and “hashi” (“chopsticks”), that have different meanings depending on an accent formed by rise and fall in fundamental frequency. In Chinese, it is known that the same sound can represent completely different meanings (characters) depending on a change in fundamental frequency known as the four tones. In English, an accent is expressed by a voice emphasis called stress rather than by fundamental frequency, and the position of the stress assists in distinguishing different meanings of a word or a phrase, or different word classes. To perform prosody-based emotion recognition, it is necessary to take such prosodic pattern differences among languages into consideration. Therefore, data for emotion recognition needs to be generated, for each language, in a manner that separates prosodic changes as emotional expressions from prosodic changes as language information. Also, even in the same language, there are individual differences, such as a person who speaks fast or a person who speaks in a high (or low) voice. Consequently, in prosody-based emotion recognition, a person who usually speaks loud and fast in a high voice, for example, will always end up being recognized as angry. To prevent such wrong emotion recognition caused by individual differences, it is also necessary to perform emotion recognition tailored to each individual, by storing reference data for each individual and comparing the speech of each individual with the corresponding reference data (for example, see Patent Document 2 and Patent Document 5).
Patent Document 1: Japanese Patent Application Publication No. H09-22296 (pp. 6 to 9, tables 1 to 5, FIG. 2)
Patent Document 2: Japanese Patent Application Publication No. 2001-83984 (pp. 4 to 5, FIG. 4)
Patent Document 3: Japanese Patent Application Publication No. 2003-99084
Patent Document 4: Japanese Patent Application Publication No. 2005-39501 (p. 12)
Patent Document 5: Japanese Patent Application Publication No. 2005-283647
As described above, prosody-based emotion recognition requires a large amount of voice data, analytical processing, and statistical processing, because variations in prosodic information used for expressing language information and variations in prosodic information as emotional expressions need to be separated for each language. Also, even in the same language, there are large regional differences, as well as individual differences attributable to age and the like. Besides, one person's voice can vary greatly depending on his or her physical condition and the like. Therefore, without reference data corresponding to each user, it is difficult to always produce stable results for an indefinite number of people, since emotional expressions by prosody have large regional differences and individual differences.
Moreover, the method of preparing reference data for each individual cannot be employed in a system that is intended for use by an indefinite number of people, such as a call center or an information system in a public place like a station, because it is impossible to prepare reference data of each speaker.
Furthermore, prosodic data, which includes statistical representative values such as the number of morae per second, a mean value, and a dynamic range, or time patterns, needs to be analyzed in a cohesive unit of voice such as an utterance, a sentence, or a phrase. Therefore, when a feature of a speech changes rapidly, it is difficult for the analysis to keep up with such a change. This causes a problem of being unable to perform speech-based emotion recognition with high accuracy.
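The difficulty with utterance-level statistics can be illustrated with a small numerical sketch. The frame values, the window size, and the helper `windowed_means` are assumptions chosen only to show the effect; they do not represent the method of any cited document or of the present invention.

```python
from statistics import mean


def windowed_means(sequence, size):
    """Mean of each consecutive, non-overlapping window of `size` frames."""
    return [mean(sequence[i:i + size]) for i in range(0, len(sequence), size)]


# Hypothetical fundamental frequency per frame (Hz): a steady 150 Hz voice
# with a brief jump to 300 Hz lasting only two frames.
f0_hz = [150.0] * 8 + [300.0] * 2 + [150.0] * 10

utterance_mean = mean(f0_hz)           # one value for the whole utterance
local_means = windowed_means(f0_hz, 2)  # one value per two-frame window

print(utterance_mean)
print(max(local_means))
```

In this example the utterance-level mean rises only from 150 Hz to 165 Hz, whereas the two-frame window containing the jump has a mean of 300 Hz. A statistic taken over the whole utterance therefore largely conceals a change that lasts only a short time.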
The present invention was conceived to solve the above conventional problems, and aims to provide a speech-based emotion recognition apparatus that can detect an emotion in a small unit, namely, a phoneme, and perform emotion recognition with high accuracy by using a relationship between a characteristic tone which has relatively small individual, language, and regional differences and a speaker's emotion.