It has long been known that certain voice characteristics carry information regarding the emotional state of the speaker. As far back as 1934, Lynch noted differences in timing and pitch characteristics between factual and emotional speech (Lynch, G. E. (1934). A Phonophotographic Study of Trained and Untrained Voices Reading Factual and Dramatic Material, Arch. Speech, 1, 9-25).
Since then, many studies have demonstrated correlations between various non-verbal speech characteristics and specific emotional states, and research efforts have been directed to different aspects of the emotional speech phenomenon. One line of research focuses on identifying the carriers of emotion within the speech signal, and studies have shown complex correlation patterns between the emotional state of the speaker and pitch (the fundamental voice tone, dependent on the number of vibrations of the vocal cords per second), amplitude, timing, duration, pace, envelope contours and other speech variables. A second research area explores the expression of different emotional dimensions in speech, and the studies suggest correlations between constituent elements of speech and dimensions characterizing the emotional state of the subject. A further research effort focuses on revealing the distinctive correlations between parts of speech and various emotional states, including primary emotions such as anger, secondary emotions such as boredom, and specific stressful situations such as anxiety, workload and lying. Yet another area of research tries to point out the differences in emotional speech patterns between different individuals, between different groups of individuals (as categorized by sex, age, culture and personality type, for example), and even between the voice patterns corresponding to different physiological states of the same individuals.
Three extensive literature reviews summarizing the various findings regarding the vocal expression of emotion were published by Murray, I. R. and Arnott, J. L. (1993), Toward the Simulation of Emotion in Synthetic Speech: A Review of the Literature on Human Vocal Emotion, Journal of the Acoustical Society of America, vol. 93 (2), 1097-1108; by Frick, R. W. (1985), Communicating Emotion: The Role of Prosodic Features, Psychological Bulletin, 97, 412-429; and by Scherer, K. R. (1986), Vocal Affect Expression: A Review and a Model for Future Research, Psychological Bulletin, 99, 143-165. All these writers emphasize the fragmented nature of the research in this field, and point out that vocal emotion research forms only a very small and isolated part of the general emotion literature and the general speech analysis literature. These reviews support the notion that human voice characteristics vary in relation to the expression of emotion; yet they highlight the complexity of the interplay between physiology, psychology and speech regarding emotions. They also stress the need for generalized models for a more coherent understanding of the phenomena.
In recent years, a few studies have approached the task of automatically classifying the vocal expression of different emotional states by utilizing statistical pattern recognition models. Relative success has been achieved; see, for example, Dellaert, F., Polzin, T. S. and Waibel, A. (1996), Recognizing emotions in speech, in Proc. ICSLP, Philadelphia, Pa., USA, 1996, and Amir, N. and Ron, S. (1998), Towards an automatic classification of emotions in speech, in Proc. ICSLP, Sydney, 1998.
The field of emotion in speech is attracting increasing interest, and a special workshop dedicated to this topic was held in Belfast in September 2001 (ISCA workshop on Speech and Emotion—presented papers: http://www.qub.ac.uk/en/isca/proceedings/index.html). The papers, theoretical and empirical, reveal once more the complexity of the phenomenon, the lack of data and the various aspects that are involved.
With respect to the detection of emotion through speech analysis, the literature highlights several problems yet to be resolved. We would like to emphasize two of the major problems:
The first problem is the lack of a unified model of emotional acoustic correlates that would enable the different emotional content in speech to be addressed by one general indicator; the current state of the research only allows isolated acoustic correlations with specific emotional states to be pointed out.
The second problem is the difficulty in overcoming the different speech expression patterns of different speakers, which tend to mask the emotional differences. Prior research has tried to confront the latter problem by obtaining reference speech characteristics of the tested individual, or of specific groups of individuals. The references are either prior baseline (non-emotional) measurements of a specific subject, or the specific emotional speech profiles of relatively homogeneous groups of subjects, such as all subjects suffering from depression.
Several patents regarding this field have been registered over the years. These patents are mainly characterized as having the same limitations described above in regard to the academic research, namely, they focus on specific emotional states and depend on prior reference measurements. The patents also vary significantly in their measurement procedures and parameters.
Fuller, in three U.S. patents from 1974 (U.S. Pat. No. 3,855,416; U.S. Pat. No. 3,855,417 and U.S. Pat. No. 3,855,418), suggests a method for indicating stress in speech and for determining whether a subject is lying or telling the truth. The suggested method measures vibrato content (rapid modulation of the phonation) and the normalized peak amplitude of the speech signal, and is particularly directed to analyzing the speech of a subject under interrogation.
Bell et al., in 1976 (U.S. Pat. No. 3,971,034), also suggested a method for detecting psychological stress through speech. The method described is based mainly on the measurement of infrasonic modulation changes in the voice.
Williamson, in two patents from 1978 and 1979 (U.S. Pat. No. 4,093,821 and U.S. Pat. No. 4,142,067), describes a method for determining the emotional state of a person by analyzing frequency perturbations in the speech pattern. Analysis is based mainly on measurements of the first formant frequency of speech. However, the differences corresponding to the different emotional states are not specified clearly: in the first patent, the apparatus mainly indicates stress versus relaxation, whereas in the second patent, the user of the device should apply “visual integration and interpretation of the displayed output” for “making certain decisions with regard to the emotional state”.
Jones, in 1984 (U.S. Pat. No. 4,490,840), suggests a method for determining patterns of voice-style (resonance, quality), speech-style (variable-monotone, choppy-smooth, etc.) and perceptual-style (sensory-internal, hate-love, etc.), based on different voice characteristics, including six spectral peaks and pauses within the speech signal. However, the inventor states that “the presence of specific emotional content is not of interest to the invention disclosed herein.”
Silverman, in two U.S. patents from 1987 and 1992 (U.S. Pat. No. 4,675,904 and U.S. Pat. No. 5,148,483) suggests a method for detecting suicidal predisposition from a person's speech patterns, by identifying substantial decay on utterance conclusion and low amplitude modulation during the utterance.
Ron, in 1997 (U.S. Pat. No. 5,647,834), describes a speech-based biofeedback regulation system that enables a subject to monitor and to alter his emotional state. An emotional indication signal is extracted from the subject's speech (the method of measurement is not described in the patent) and compared to online physiological measurements of the subject that serve as a reference for his emotional condition. The subject can then try to alter the indication signal in order to gain control over his emotional state.
Bogdashevsky et al., in a U.S. patent from 1999 (U.S. Pat. No. 6,006,188), suggest a method for determining psychological or physiological characteristics of a subject based on the creation of specific prior knowledge bases for certain psychological and physiological states. The process described involves creating homogeneous groups of subjects according to their psychological assessment (e.g. personality diagnostic groups according to common psychological inventories), analyzing their unique speech patterns (based on cepstral coefficients) and forming specific knowledge bases for these groups. Matching to certain psychological and physiological groups can be accomplished by comparing the speech patterns of an individual (who is asked to speak a 30-phrase text similar to the text used by the reference group) to the knowledge-base characteristics of the group. The patent claims to enable verbal psychological diagnosis of relatively steady conditions, such as personality profile and mental status before and after therapy.
Petrushin, in 2000 (U.S. Pat. No. 6,151,571), describes a method for monitoring a conversation between a pair of speakers, detecting an emotion of at least one of the speakers, determining whether the emotion is one of three negative emotions (anger, sadness or fear) and then reporting the negative emotion to a third party. Regarding the emotion recognition process, the patent details the stages required for obtaining such results. First, conducting an experiment with the target subjects is recommended, in order “to determine which portions of a voice are most reliable as indicators of emotion”. It is suggested to use a set of the most reliable utterances of this experiment as “training and test data for pattern recognition algorithms run by a computer”. The second stage is the feature extraction for the emotional states based on the collected data. The patent suggests several possible feature extraction methods using a variety of speech features. The third stage is recognizing the emotions based on the extracted features. Two approaches are offered: neural networks and ensembles of classifiers. The previously collected sets of data (representing the emotions) can be used to train the algorithms to determine the emotions correctly. Exemplary apparatuses as well as techniques to improve emotion detection are presented.
Slaney, in a U.S. patent from 2001 (U.S. Pat. No. 6,173,260), describes an emotional speech classification system. The system described is based on an empirical procedure that extracts the best combination of speech features (different measures of pitch and spectral envelope shape) that characterizes a given set of speech utterances labeled in accordance with predefined classes of emotion. After the system has been “trained” on the given set of utterances, it can use the extracted features for further classification of other utterances into these emotional classes. The procedure does not present any general emotional indicator, however, and only assumes that different emotional features can be empirically extracted for different emotional situations.
Two published PCT applications by Liberman also relate to emotion in speech. Liberman, in 1999 (WO 99/31653), suggests a method for determining certain emotional states through speech, including emotional stress and lying-related states such as untruthfulness, confusion and uncertainty, psychological dissonance, sarcasm and exaggeration. The procedure is based on measuring speech intonation information, in particular plateaus and thorns in the speech signal envelope, using previous utterances of the speaker as a baseline reference.
Liberman, in 2000 (WO 00/62270), describes an apparatus for monitoring unconscious emotional states of an individual from speech specimens provided over the telephone to a voice analyzer. The emotional indicators include a sub-conscious cognitive activity level, a sub-conscious emotional activity level, an anticipation level, an attention level, a “love report” and sexual arousal. The method used is based on frequency spectrum analysis of the speech, wherein the frequency spectrum is divided into four frequency regions, and it is claimed that a higher percentage of frequencies in one of the regions reflects dominance of one of the emotional states above. It is suggested that cognitive activity would be correlated with the lowest frequencies, attention/concentration with main spectrum frequencies, emotional activity with high frequencies, and anticipation level with the highest frequencies.
Most of the abovementioned patents (Fuller, Bell, Jones, Silverman and Liberman) identify specific emotional states, such as stress, lying or a tendency to commit suicide, by correlating specific speech features to these emotional conditions. Two of the patents (Williamson, Ron) assume that the appropriate speech correlates of the emotional states are given as input and totally ignore the task of describing any general indicator of emotional speech features. Three of the patents (Bogdashevsky, Petrushin and Slaney) suggest procedures for the extraction of specific speech correlates by “learning” given emotional classes of speech utterances. Thus, none of the abovementioned patents suggests a generalized speech-based indicator of emotional arousal per se that describes the speech expression of the emotional response created by a wide range of different emotional states.
Furthermore, in order to overcome the differences between individuals, some of these patents (Fuller, Williamson) require a skilled expert to manually analyze the results. Other patents (Ron, Liberman) require a comparison of the subject's speech measurements to prior baseline measurements of the same individual, as reference. Still other patents (Bogdashevsky, Petrushin and Slaney) require a prior learning process of the speech characteristics of specific groups of individuals or specific psychological phenomena, to be used as reference.
Thus, none of the above-reviewed patents in this crowded art suggests an emotional speech indicator that is robust, having validity beyond different emotions and beyond the differences between specific individuals and specific groups. The present invention is directed to providing such a robust, general indicator of emotional arousal by speech analysis, which is insensitive to the differences between subjects and to particular emotion types, but sensitive to emotional arousal per se.