The present disclosure is directed to a system and method for learning an involuntary physical characteristic and an associated underlying emotional state of a subject using automatic classification techniques. The disclosure finds application in an educational setting, but is amenable to other settings as well.
In a conventional vision-based facial expression recognition system, a computer is programmed to recognize certain facial expressions that indicate emotions. For example, a frown is recognized to indicate displeasure, and the computer is programmed to associate that facial expression with that emotion. Generally, certain facial expressions, such as smiles and frowns, are universally recognized across populations, as most human subjects voluntarily exhibit these expressions while experiencing certain emotions. Conventional systems have learned to classify a facial expression into one of a number of predetermined sentiments or emotional states.
Conventional emotion recognition systems have used multiple approaches, such as local binary patterns (LBP) and Histogram of Oriented Gradients (HOG) features, to learn facial expressions from video sample datasets that typically contain multiple subjects performing several prototypical and universal facial expressions that indicate happiness, sadness, and anger, among other emotions. For instance, if happiness or sadness is of interest, then it may be feasible for the conventional system to solicit a smile and a frown from a subject. Moreover, these types of facial expressions are considered “universal” in the sense that the expressions commonly exhibited for happiness and sadness are recognizable to most people. These systems can be precise at detecting a set of artificially induced facial expressions. Alternatively, another approach used for generic facial expression recognition is known as “expression spotting”, where spatial-temporal strain is used to determine moments in videos where facial deformation occurs.
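By way of a non-limiting illustration, the basic local binary pattern feature mentioned above can be sketched as follows. The sketch assumes the simplest 3x3 variant, in which each of the eight neighbors of a pixel contributes one bit that is set when the neighbor is at least as bright as the center pixel. The function names, the bit ordering, and the use of nested lists for a grayscale image are assumptions made for the example and are not taken from any particular conventional system.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel intensities

# Neighbor offsets (row, column), clockwise starting at the top-left pixel.
# The ordering is an assumption; any fixed ordering yields a valid LBP code.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image: Image, row: int, col: int) -> int:
    """Return the 8-bit LBP code of an interior pixel."""
    center = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        # Set the bit when the neighbor is at least as bright as the center.
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image: Image) -> List[float]:
    """Return the normalized 256-bin histogram of LBP codes over interior pixels."""
    hist = [0] * 256
    for row in range(1, len(image) - 1):
        for col in range(1, len(image[0]) - 1):
            hist[lbp_code(image, row, col)] += 1
    total = sum(hist) or 1  # guard against images with no interior pixels
    return [count / total for count in hist]
```

In such a sketch, the histogram rather than the individual codes would typically serve as the feature vector supplied to a classifier, since it summarizes local texture over the facial region without depending on exact pixel positions.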
However, other types of expressions are not as universal and show large inter-subject variability. Individuals can exhibit other facial characteristics, sometimes symptoms, under conditions such as stress, anxiety, confusion, and pleasure. Each individual may react to certain conditions differently, and his or her emotional response, also referred to herein as a physical trait, can be involuntary. Blinking, rigid head motions, and biting of lips are only a few non-limiting examples of facial characteristics that manifest as an emotional response. Conventional systems are unable to sufficiently detect the involuntary physical traits or manifestations of individualized emotional states or responses. Particularly, the task of collecting multiple samples of subjects imitating mostly involuntary facial characteristics can be difficult. Furthermore, subjects may voluntarily act out the intended facial behavior differently.
One setting where facial recognition can be used to identify emotional states is education. A teacher or educational system may desire to predict, using facial recognition, how a student is performing or struggling on an assessment or on specific questions. Previous approaches for assessing a student's emotional state included self-reports and teacher assessments. These were often cumbersome and provided instantaneous snapshots rather than continuous, longitudinal analyses of the student's affective (“emotional”) state. The computer vision approach for facial recognition provides a non-obtrusive method of monitoring a student's emotional state with high temporal resolution over a long period of time.
In “The Faces of Engagement: Automatic Recognition of Student Engagement from Facial Expressions”, IEEE Transactions on Affective Computing, vol. 5, no. 1, pp. 86-98 (2014), by J. Whitehill et al., levels of engagement are learned in a natural setting by presenting students with standardized assessments. Data is collected from a large pool of subjects, with labels being generated from subjective evaluations by expert judges. An engagement recognition engine is trained using the pooled data. In “Automatic facial expression recognition for intelligent tutoring systems”, Computer Vision and Pattern Recognition Workshops, pp. 1, 6, 23-28 (2008), by Whitehill et al., a similar approach is disclosed using a regression technique.
FIG. 1 shows a conventional facial recognition approach 10 according to the PRIOR ART. The method starts at S12. A computer is programmed to process a video frame or still image to detect a face at S14. Once the face is detected, facial registration (alignment) and normalization are performed on the image at S16. Next, features are extracted in the facial region of interest at S18. At S20, the features are used to train a classifier, where the features are annotated by a label of the input video frame or still image. The method ends at S22.
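The flow of FIG. 1 can be illustrated, in a non-limiting manner, by the following skeleton. Every stage here is a deliberately trivial stand-in chosen so that the example is self-contained: the face "detector" returns the whole frame, normalization is a min-max rescaling of the cropped region, the features are per-row mean intensities, and the classifier is a nearest-centroid classifier. All function names and these specific choices are assumptions made for the example; an actual conventional system would substitute a real detector, landmark-based alignment, and features such as LBP or HOG.

```python
from typing import Dict, List, Sequence, Tuple

Image = List[List[float]]       # grayscale image as rows of pixel intensities
Box = Tuple[int, int, int, int]  # (top, left, height, width)


def detect_face(image: Image) -> Box:
    """S14 stand-in: treat the whole frame as the detected face region."""
    return 0, 0, len(image), len(image[0])


def register_and_normalize(image: Image, box: Box) -> Image:
    """S16 stand-in: crop to the box and rescale intensities to [0, 1]."""
    top, left, height, width = box
    crop = [row[left:left + width] for row in image[top:top + height]]
    lo = min(min(row) for row in crop)
    hi = max(max(row) for row in crop)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat region
    return [[(pixel - lo) / span for pixel in row] for row in crop]


def extract_features(face: Image) -> List[float]:
    """S18 stand-in: mean intensity of each row of the normalized face."""
    return [sum(row) / len(row) for row in face]


def train_classifier(features: Sequence[Sequence[float]],
                     labels: Sequence[str]) -> Dict[str, List[float]]:
    """S20 stand-in: nearest-centroid model, one mean feature vector per label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vector, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model: Dict[str, List[float]], vector: Sequence[float]) -> str:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(model, key=lambda label: sum(
        (a - b) ** 2 for a, b in zip(model[label], vector)))


def frame_to_features(frame: Image) -> List[float]:
    """Run S14 through S18 on a single frame."""
    return extract_features(register_and_normalize(frame, detect_face(frame)))


def run_pipeline(frames: Sequence[Image],
                 labels: Sequence[str]) -> Dict[str, List[float]]:
    """Run S14 through S20: one labeled frame in, one trained model out."""
    return train_classifier([frame_to_features(f) for f in frames], labels)
```

Because each frame passes through the same stages and the classifier sees only the frame-level label, a model trained in this way on frames pooled from many subjects has no notion of which subject a frame came from.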
However, one drawback of these conventional computer vision approaches is their inherently limited accuracy: the classifiers are trained on pooled training data, whereas the manifestations of emotional states vary highly from individual to individual.
Thus, a personalized and natural approach is desired to automatically learn, in an unsupervised or semi-supervised fashion, the association between an individual's involuntary physical facial characteristics and his or her underlying emotional states. A system and approach are desired which can rely on the standard core modules of a conventional facial recognition system.