Large organizations, such as commercial organizations or financial organizations, conduct numerous interactions with customers, users or other persons on a daily basis. Some of these interactions are vocal, such as telephone or voice over IP conversations, or at least comprise a vocal component, such as an audio part of a video or face-to-face interaction.
Many organizations record some or all of the interactions, whether because recording is required by law or regulations, for business intelligence, for quality assurance or quality management purposes, or for any other reason. Once the interactions are recorded, and also during the recording, the organization may want to extract as much information as possible from the interactions. The information is extracted and analyzed in order to enhance the organization's performance and achieve its business objectives. A major objective of business organizations that provide service is to achieve high customer satisfaction and prevent customer attrition. Measurements of negative emotions that are conveyed in customers' speech serve as a key performance indicator of customer satisfaction. In addition, handling emotional responses of customers to service provided by organization representatives increases customer satisfaction and decreases customer attrition.
Various prior art systems and methods enable post-interaction emotion detection, that is, detection of customer emotions conveyed in speech after the interaction has terminated, namely off-line emotion detection. For example, U.S. Pat. No. 6,353,810 and U.S. patent application Ser. No. 11/568,048 disclose methods for off-line emotion detection in audio interactions. Those systems and methods are based on prosodic features, the main feature being the fundamental frequency of the speaker's voice. In those systems and methods, emotional speech is detected based on large variations of this feature in speech segments.
The '048 patent application discloses the use of a learning phase in which the fundamental frequency variation of “neutral speech” is estimated and then used as the basis for the analysis of later segments. The learning phase may be performed by using the audio from the entire interaction or from the beginning of the interaction, which makes the method unsuitable for real time emotion detection.
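The general idea of such a prosodic approach can be illustrated by the following minimal sketch. It is not the method of the cited references. All function names, the use of standard deviation as the measure of variation, the convention that a value of 0 marks an unvoiced frame, and the threshold ratio of 2.0 are assumptions made for illustration only.

```python
from statistics import pstdev


def f0_variation(f0_track):
    """Spread (population std. dev., in Hz) of the voiced F0 values in one segment."""
    # Assumption: a value of 0 marks an unvoiced frame and is ignored.
    voiced = [f for f in f0_track if f > 0]
    return pstdev(voiced) if len(voiced) >= 2 else 0.0


def learn_neutral_variation(learning_segments):
    """Learning phase: average F0 variation over segments assumed to be neutral speech."""
    if not learning_segments:
        raise ValueError("at least one learning segment is required")
    variations = [f0_variation(seg) for seg in learning_segments]
    return sum(variations) / len(variations)


def flag_emotional_segments(segments, neutral_variation, ratio=2.0):
    """Return indices of segments whose F0 variation exceeds ratio * neutral_variation."""
    return [
        i for i, seg in enumerate(segments)
        if f0_variation(seg) > ratio * neutral_variation
    ]


# Hypothetical F0 tracks in Hz, one list per speech segment.
learning = [[120, 122, 0, 121, 119], [118, 121, 120, 0, 122]]
later = [[120, 121, 119, 122], [110, 160, 0, 95, 180]]

neutral = learn_neutral_variation(learning)
flagged = flag_emotional_segments(later, neutral)
print(flagged)  # [1]
```

In this sketch no segment can be classified until the learning segments have been collected, which illustrates why a scheme relying on such a learning phase is poorly suited to real time operation.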
Another limitation of such systems and methods is that they require separate audio streams for the customer side and for the organization representative side. When provided with a single audio stream that includes both the customer and the organization representative as input, which is common in many organizations, they provide very limited performance in terms of emotion detection precision and recall.
However, detecting and handling the emotions of customers of the organization in real time, while the conversation is taking place, contributes significantly to enhancing customer satisfaction.
There is thus a need in the art for a method and apparatus for real time emotion detection. Such detection enables customer emotions to be detected and handled while the interaction is taking place, thereby enhancing customer satisfaction.