1. Field of the Invention
The present invention relates to a method and systems for processing a speech signal for the purpose of efficiently and accurately recognizing the emotional state of the utterer.
2. Description of Related Art
Current state-of-the-art emotion detectors only have an accuracy of around 40-50% at identifying the most dominant emotion from four to five different emotions. Thus, a problem for emotional speech processing is the limited functionality of speech recognition systems. Historically, the two most common algorithms in speech recognition systems are Dynamic Time Warping (DTW) and Hidden Markov Models (HMMs). DTW was widely used for speech recognition in the past but has now been largely displaced by the more successful HMM approach. HMMs are statistical models which output a sequence of symbols or quantities. They are popular in speech recognition because they can be trained automatically and are simple and computationally feasible to use. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach of HMMs, including utilizing Mel-Frequency Cepstral Coefficients (MFCCs).
A mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. In the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly-spaced frequency bands used in the normal cepstrum. This frequency warping can allow for better representation of sound, for example, in audio compression. Thus, MFCCs are commonly used features in speech recognition systems, such as systems which can automatically recognize numbers spoken into a telephone. Speech recognition algorithms, such as those that utilize MFCCs, can also help people suffering from social impairments, such as autism, recognize others' emotions.
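The MFC computation described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems; it assumes NumPy is available, uses one common form of the mel mapping (2595 log10(1 + f/700)), and the function names, the Hamming window, the 26 triangular filters, and the 13 retained coefficients are all illustrative choices rather than values specified in this document.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel-scale formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Band edges are equally spaced on the mel scale, then mapped back to Hz.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    fb = np.zeros((n_filters, bin_freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        # Triangular weight: zero outside [lo, hi], peak of 1 at mid.
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def dct_ii(x, n_coeffs):
    # Unnormalized type-II discrete cosine transform (the "linear cosine transform").
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)
    return x @ basis.T

def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    # Short-term power spectrum of one windowed frame.
    windowed = frame * np.hamming(frame.size)
    power = np.abs(np.fft.rfft(windowed)) ** 2
    # Energy in mel-spaced bands, then log, then cosine transform.
    mel_energies = mel_filterbank(n_filters, frame.size, sample_rate) @ power
    log_mel = np.log(mel_energies + 1e-10)  # small offset avoids log(0)
    return dct_ii(log_mel, n_coeffs)

# Usage: a 25 ms frame of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(400) / sample_rate
coeffs = mfcc_frame(np.sin(2.0 * np.pi * 440.0 * t), sample_rate)
```

In this sketch the frequency warping enters only through `mel_filterbank`: the bands are narrow at low frequencies and wide at high frequencies, which is what distinguishes the MFC from a cepstrum computed on linearly spaced bands.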
People with autism often lack the intuition about others that many people take for granted. Autistic children are less likely to exhibit social understanding, approach others spontaneously, imitate and respond to emotions, and communicate nonverbally. Despite the common belief that children with autism prefer to be alone, children with high-functioning autism suffer from more intense and frequent loneliness compared to non-autistic peers. Making and maintaining friendships often proves to be difficult for those with autism. Therefore, improvements in speech audio processing, or more specifically, emotional speech processing, could be a powerful tool to help autistic children identify others' emotions. Autistic children's newfound ability to communicate with others could improve their lives.
So as to reduce the complexity and length of the Detailed Specification, and to fully establish the state of the art in certain areas of technology, Applicant herein expressly incorporates by reference, in their entirety, all of the following materials identified in each numbered paragraph below.
U.S. Pat. No. 7,340,393 by Mitsuyoshi describes a method for speech-based emotion apparatuses capable of accurately detecting emotions of a human. The speech-based emotion apparatuses detect intensity, tempo, and intonation in the inputted voice based on parameters of change in the inputted voice signal, and video detects the change of expression on the subject's face. The emotion detecting method can be utilized for emotion detection in the medical field, for a variety of systems as part of artificial intelligence and artificial sensibility, and for a variety of systems that control the sensibility of virtual humans and robots.
U.S. Pub. No. 2010/0088088 by Bollano et al. describes an automated emotional recognition system and method adapted to improve the processing capability of artificial intelligence in a more flexible, portable, and personalized manner. The system determines the emotional state of a speaker by analyzing a speech signal, and involves the use of telecommunications terminals of scarce processing capability, such as telephones, mobile phones, PDAs, and similar devices.
U.S. Pat. No. 7,451,079 by Oudeyer describes an emotion apparatus and method which improves accuracy and processing capability by low-pass filtering the voice signal, extracting at least one feature from the filtered signal, and processing the extracted feature to detect the emotion in a short utterance. It then generates an emotion detecting algorithm using a teaching algorithm, exploiting at least one feature extracted from a low-pass filtered voice signal.
U.S. Pub. No. 2009/0313019 by Kato et al. also describes a speech-based emotion recognition apparatus which improves both accuracy and processing capability for recognizing a speaker's emotion by detecting a variation caused by tension or relaxation of a vocal organ, or an emotion, an expression or speaking style. The speech-based emotion recognition apparatus can detect an emotion in a small unit, a phoneme, and perform emotion recognition with high accuracy by using a relationship between characteristic tone, language, and regional differences and a speaker's emotion.
Kim, et al. discloses an attempt at building a real-time emotion detection system which utilizes multi-modal fusion of different timescale features of speech. “Conventional spectral and prosody features are used for intra-frame and supra-frame features respectively, and a new information fusion algorithm which takes care of the characteristics of each machine learning algorithm is introduced. In this framework, the proposed system can be associated with additional features, such as lexical or discourse information, in later steps. To verify the realtime system performance, binary decision tasks on angry and neutral emotion are performed using concatenated speech signal simulating realtime conditions.” (Kim, S., Georgiou, P., Lee, S., & Narayanan, S. (2007). Real-Time Emotion Detection System Using Speech: Multi-modal Fusion of Different Timescale Features. Proceedings of the IEEE Multimedia Signal Processing Workshop, 48-51).
Kwon, et al. discloses selecting pitch, log energy, formant, mel-band energies, and mel frequency cepstral coefficients (MFCCs) as the base features, and adding velocity/acceleration of pitch and MFCCs to form feature streams. “We extracted statistics used for discriminative classifiers, assuming that each stream is a one-dimensional signal. Extracted features were analyzed by using quadratic discriminant analysis (QDA) and support vector machine (SVM). Experimental results showed that pitch and energy were the most important factors. Using two different kinds of databases, we compared emotion recognition performance of various classifiers: SVM, linear discriminant analysis (LDA), QDA and hidden Markov model (HMM). With the text-independent SUSAS database, we achieved the best accuracy of 96.3% for stressed/neutral style classification and 70.1% for 4-class speaking style classification using Gaussian SVM, which is superior to the previous results. With the speaker-independent AIBO database, we achieved 42.3% accuracy for 5-class emotion recognition.” (Kwon, O. W., Chan, K., Hao, J., & Lee, T. W. (2003). Emotion Recognition by Speech signals. Eurospeech'03, 125-128).
Lee, et al. discloses a report on “the comparison between various acoustic feature sets and classification algorithms for classifying spoken utterances based on the emotional state of the speaker, [using] three different techniques—linear discriminant classifier (LDC), k-nearest neighborhood (k-NN) classifier, and support vector machine classifier (SVC) for classifying utterances into 2 emotion classes: negative and non-negative.” (Lee, C. M., Narayanan, S., & Pieraccini, R. (2002). Classifying emotions in human-machine spoken dialogs. International Conference on Multimedia & Expo '02, 1, 737-740).
Nwe, et al. discloses a text-independent method of emotion classification of speech. The disclosed method “makes use of short time log frequency power coefficients (LFPC) to represent the speech signals and a discrete hidden Markov model (HMM) as the classifier.” (Nwe, T., Foo, S., & Silva, L. D. (2003). Speech emotion recognition using hidden Markov models. Elsevier Speech Communications Journal, 41(4), 603-623).
Petrushin discloses “two experimental studies on vocal emotion expression and recognition. The first study deals with a corpus of 700 short utterances expressing five emotions: happiness, anger, sadness, fear, and normal (unemotional) state, which were portrayed by thirty non-professional actors. The second study uses a corpus of 56 telephone messages of varying length (from 15 to 90 seconds) expressing mostly normal and angry emotions that were recorded by eighteen non-professional actors.” (Petrushin, V. (1999). Emotion in Speech: Recognition and Application to Call Centers. Artificial Neural Networks in Engineering '99, 7-10).
Picard provides a thoughtful and thorough discussion on giving computers “affective abilities,” or the ability to process emotions. (Picard, R. W. (1997). Affective computing. Cambridge, Mass.: MIT Press).
Schuller, et al. discloses the challenge, the corpus, the features, and benchmark results of two popular approaches towards emotion recognition from speech: dynamic modeling of low level descriptors by hidden Markov models and static modeling using supra-segmental information on the chunk level. (Schuller, B., Steidl, S., & Batliner, A. (2009). The Interspeech 2009 emotion challenge. Proceedings of Interspeech 2009, 1, 312-315).
While general emotion recognition concepts are known, as shown by the above disclosures, there are still issues with emotional speech processing: poor accuracy in detecting the correct emotion and insufficient processing capability. Poor accuracy and insufficient processing capability are grave problems because they make a potentially life-altering technology, emotional speech processing, unreliable and functionally impractical. Therefore, improvements to both the accuracy and processing capability of emotional speech processing are needed to make the technology more reliable and practical.
Applicant believes that the material incorporated above is “non-essential” in accordance with 37 CFR 1.57, because it is referred to for purposes of indicating the background of the invention or illustrating the state of the art. However, if the Examiner believes that any of the above-incorporated material constitutes “essential material” within the meaning of 37 CFR 1.57(c)(1)-(3), applicant will amend the specification to expressly recite the essential material that is incorporated by reference as allowed by the applicable rules.