The present invention relates to preserving emotion across voice and text communication transformations.
Human voice communication can be characterized by two components: content and delivery. Therefore, understanding and replicating human speech involves analyzing and replicating the content of the speech as well as the delivery of the content. Natural speech recognition systems enable an appliance to recognize whole sentences and interpret them. Much of the research in this field, referred to as Automatic Speech Recognition (ASR), has been devoted to deciphering text from continuous human speech, thereby enabling the speaker to speak more naturally. Large vocabulary ASR systems operate on the principle that every spoken word can be atomized into an acoustic representation of linguistic phonemes. A phoneme is the smallest phonetic unit in a language that is capable of conveying a distinction in meaning. The English language contains approximately forty separate and distinct phonemes that make up the entire spoken language, e.g., consonants, vowels, and other sounds. Initially, the speech is filtered for stray sounds, tones and pitches that are not consistent with phonemes and is then translated into a gender-neutral, monotonic audio stream. Word recognition involves extracting phonemes from sound waves of the filtered speech, creating weighted chains of phonemes that represent the probability of word instances and, finally, evaluating the probability of the correct interpretation of a word from its chain. In large vocabulary speech recognition, a hidden Markov model (HMM) is trained for each phoneme in the vocabulary (sometimes referred to as an HMM phoneme). During recognition, the likelihood of each HMM in a chain is calculated, and the observed chain is classified according to the highest likelihood. In smaller vocabulary speech recognition, an HMM may be trained for each word in the vocabulary.
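The likelihood-based classification described above can be illustrated with a minimal sketch. It assumes discrete observation symbols and scores a sequence against each candidate HMM with the scaled forward algorithm, then picks the model with the highest likelihood. The model names (`unit_a`, `unit_b`), the toy probabilities, and the function names are hypothetical and chosen only for illustration; a practical recognizer would use trained models over acoustic feature vectors.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | HMM) using the scaled forward algorithm.

    obs   : list of discrete symbol indices
    start : start[s] is the probability of beginning in state s
    trans : trans[p][s] is the probability of moving from state p to s
    emit  : emit[s][k] is the probability that state s emits symbol k
    """
    n = len(start)
    # Initialise with the start distribution times the first emission.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_like = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step, then weight by the emission probability.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        log_like += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_like

def classify(obs, models):
    """Score obs against every model and return (best_name, all_scores)."""
    scores = {
        name: forward_log_likelihood(obs, *params)
        for name, params in models.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical two-state, two-symbol models (assumed values, not trained).
TRANS = [[0.7, 0.3], [0.0, 1.0]]
MODELS = {
    "unit_a": ([1.0, 0.0], TRANS, [[0.9, 0.1], [0.2, 0.8]]),
    "unit_b": ([1.0, 0.0], TRANS, [[0.1, 0.9], [0.8, 0.2]]),
}

if __name__ == "__main__":
    best, scores = classify([0, 0, 1, 1], MODELS)
    print(best, scores)
```

Working in log space with per-step rescaling keeps the computation stable for long observation chains, where the raw product of probabilities would underflow.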
Human speech communication conveys information other than lexicon to the audience, such as the emotional state of a speaker. Emotion can be inferred from voice by deducing acoustic and prosodic information contained in the delivery of the human speech. Techniques for deducing emotions from voice utilize complex speaker-dependent models of emotional state that are reminiscent of those created for voice recognition. Recently, emotion recognition systems have been proposed that operate on the principle that emotions (or the emotional state of the speaker) can be distilled into an acoustic representation of sub-emotion units that make up the delivery of the speech (i.e., specific pitches, tones, cadences and amplitudes, or combinations thereof, of the speech delivery). The aim is to identify the emotional content of speech using these predefined sub-emotion speech patterns, which can be combined into emotion unit models that represent the emotional state of the speaker. However, unlike text recognition, which filters the speech into a gender-neutral and monotonic audio stream, emotion recognition leaves the tone, timbre and, to some extent, the gender of the speech unaltered so that emotion units can be recognized more accurately. A hidden Markov model may be trained for each sub-emotion unit and, during recognition, the likelihood of each HMM in a chain is calculated, and the observed chain is classified according to the highest likelihood for an emotion.
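As a rough illustration of the kind of delivery information mentioned above, the following sketch computes two simple prosodic measurements for a single frame of audio samples: amplitude (as root-mean-square energy) and pitch (estimated from the autocorrelation peak). This is an assumed, simplified technique chosen for illustration and is not taken from the systems described; the function names, the 50 to 500 Hz search range, and the frame format (a list of floats) are all hypothetical.

```python
import math

def rms_amplitude(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def estimate_pitch_hz(frame, sample_rate, min_hz=50.0, max_hz=500.0):
    """Estimate pitch from the lag that maximises the raw autocorrelation.

    The autocorrelation is left unnormalised, so shorter lags are slightly
    favoured; for a periodic frame this selects the first period rather
    than one of its multiples. Returns 0.0 if no positive peak is found.
    """
    min_lag = max(1, int(sample_rate / max_hz))
    max_lag = min(int(sample_rate / min_hz), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0

def prosodic_features(frame, sample_rate):
    """Bundle the two measurements for one frame."""
    return {
        "amplitude": rms_amplitude(frame),
        "pitch_hz": estimate_pitch_hz(frame, sample_rate),
    }

if __name__ == "__main__":
    # Synthetic 200 Hz tone sampled at 8 kHz, standing in for voiced speech.
    rate = 8000
    tone = [0.5 * math.sin(2 * math.pi * 200 * i / rate) for i in range(400)]
    print(prosodic_features(tone, rate))
```

Sequences of such per-frame measurements, once quantized, are one plausible form of input for the per-unit hidden Markov models described above; how a given system actually derives its sub-emotion units is not specified here.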