Much of the work done to date on the analysis of speech signals has concentrated on recognizing the linguistic content of spoken words, i.e., what was said by the speaker. In addition, some efforts have been directed to automatic speaker identification, to determine who said the words being analyzed. However, the automatic analysis of prosodic information conveyed by speech has largely been ignored. In essence, prosody represents all of the information in a speech signal other than the linguistic information conveyed by the words, including such factors as the duration, loudness and pitch of the speech. These features indicate how the words were spoken, and thus contain information about the emotional state of the speaker.
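By way of illustration only, the three prosodic factors named above can be estimated from a short frame of digitized speech. The following is a minimal sketch under stated assumptions, not the method of the present invention: the function name `prosodic_features`, the use of root-mean-square amplitude as a stand-in for loudness, the autocorrelation-based pitch estimate, and the 60-400 Hz search range are all hypothetical choices made for this example.

```python
import math


def prosodic_features(samples, sample_rate, fmin=60.0, fmax=400.0):
    """Return (duration_s, rms_loudness, pitch_hz) for one mono frame.

    Hypothetical helper for illustration: loudness is approximated by RMS
    amplitude, and pitch by the lag that maximizes the raw autocorrelation
    within the assumed range fmin..fmax.
    """
    n = len(samples)
    if n == 0:
        return 0.0, 0.0, 0.0

    # Duration: number of samples divided by the sampling rate.
    duration = n / sample_rate

    # Loudness proxy: root-mean-square amplitude of the frame.
    rms = math.sqrt(sum(x * x for x in samples) / n)

    # Pitch: search lags corresponding to the assumed pitch range and keep
    # the lag at which the frame correlates most strongly with itself.
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), n - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score

    # A lag of zero means no positive correlation was found (e.g. silence).
    pitch = sample_rate / best_lag if best_lag else 0.0
    return duration, rms, pitch
```

For example, a 0.1 second frame of a pure 200 Hz tone sampled at 8000 Hz yields a pitch estimate of about 200 Hz, while a silent frame yields a pitch of zero. Real speech is considerably less regular than a pure tone, so a practical system would be expected to add voicing detection and smoothing across frames.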
Since the affective content of the message is conveyed by the prosody, it is independent of language. In the field of affective computing, therefore, automatic recognition of prosody can be used to provide a universal interactive interface with a speaker. For example, detection of the prosody in speech provides an indication of the "mood" of the speaker, and can be used to adjust colors and images in a graphical user interface. In another application, it can be used to provide interactive feedback during the play of a video game, or the like. As a further example, task-based applications such as teaching programs can employ information about a user's emotional state to adjust the pace of the task. Thus, if a student expresses frustration, the lesson can be switched to less-demanding concepts, whereas if the student is bored, a humorous element can be inserted. For further information regarding the field of affective computing, and the possible applications of the prosodic information provided by the present invention, reference is made to Affective Computing by R. W. Picard, MIT Press, 1997.
Accordingly, it is desirable to provide a system which is capable of automatically classifying the prosodic information in speech signals, to detect the emotional state of the speaker. In the past, systems have been developed which classify the spoken affect in speech, based primarily upon analysis of the pitch content of speech signals. See, for example, Roy et al., "Automatic Spoken Affect Classification and Analysis", IEEE Face and Gesture Conference, Killington, Vt., pages 363-367, 1996.