1. Field
Example embodiments of the following disclosure relate to image processing, computer vision and machine learning, and more particularly, relate to emotion recognition in a video sequence.
2. Description of the Related Art
With recent developments in technology, significant attention has been given to enhancing human computer interaction (HCI). In particular, engineers and scientists are attempting to capitalize on basic human attributes, such as voice, gaze, gesture and emotional state, in order to improve HCI. The ability of a device to detect and respond to human emotions is known as “Affective Computing.”
Automatic facial expression recognition is a key component in the research field of human computer interaction. Automatic facial expression recognition also plays a major role in human behavior modeling, which has significant potential in applications like video conferencing, gaming, surveillance, and the like. Most of the research in automatic facial expression recognition, however, is directed to identifying six basic emotions (sadness, fear, anger, happiness, disgust, surprise) on posed facial expression datasets prepared under controlled laboratory conditions. Researchers have adopted static as well as dynamic methods to infer different emotions in the facial expression datasets. Static methods analyze frames in a video sequence independently, while dynamic methods consider a group of consecutive frames to infer a particular emotion.
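The distinction between static and dynamic methods can be illustrated with a minimal sketch. It is not taken from any particular prior-art system: the function names, the single-value frame features, the 0.5 threshold, the window length, and the "neutral" label are all hypothetical and chosen only to show per-frame inference versus inference over a group of consecutive frames.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]  # hypothetical per-frame feature vector


def static_inference(frames: List[Frame],
                     classify_frame: Callable[[Frame], str]) -> List[str]:
    """Static method: label every frame independently of its neighbors."""
    return [classify_frame(frame) for frame in frames]


def dynamic_inference(frames: List[Frame],
                      classify_window: Callable[[List[Frame]], str],
                      window: int = 3) -> List[str]:
    """Dynamic method: label each group of `window` consecutive frames."""
    return [classify_window(frames[i:i + window])
            for i in range(len(frames) - window + 1)]


# Toy classifiers (assumptions for illustration only): a frame or window is
# labeled "happiness" when its mean feature value exceeds 0.5.
def toy_frame_classifier(frame: Frame) -> str:
    return "happiness" if sum(frame) / len(frame) > 0.5 else "neutral"


def toy_window_classifier(frames: List[Frame]) -> str:
    values = [v for frame in frames for v in frame]
    return "happiness" if sum(values) / len(values) > 0.5 else "neutral"


video = [[0.9], [0.1], [0.9], [0.9]]
static_labels = static_inference(video, toy_frame_classifier)
dynamic_labels = dynamic_inference(video, toy_window_classifier, window=3)
```

In this toy input, the static path labels the second frame differently from its neighbors, while the dynamic path absorbs that single-frame dip because each decision is made over three consecutive frames.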
The mouth region of the human face contains highly discriminative information regarding human emotion and plays a key role in the recognition of facial expressions. However, in a general scenario, such as video conferencing, there will be significant temporal segments in which the person is talking, and any facial expression recognition system that relies upon the mouth region of the person's face for inferring emotions may potentially be misled by the random and complex formations around the lip region. The temporal segment information regarding talking segments in a video sequence is quite important in this context, as it can be used to enhance existing emotion recognition systems.
Few major works in the field of emotion recognition have addressed the condition of ‘talking faces,’ under which the Action Units (AU) inferred for the mouth region may be wrong, resulting in an erroneous emotion classification. Currently known methods are directed at determining active speakers in a multi-person environment and do not temporally segment the lip activities of a single person into talking and non-talking phases (the latter including neutral as well as various emotion segments). As a result, current systems suffer from the drawback of failing to capture emotions accurately.
For the abovementioned reasons, there is a need for methods that temporally segment lip activities into talking and non-talking phases and that classify emotions accurately.