Pose estimation is a key component in many areas of real-time computer vision, such as human-computer interaction. For example, work has been done in driver monitoring to determine head pose from various facial cues and to derive related attentiveness information. In a driver assistance system, driver fatigue or attention monitoring is useful for alerting the driver when safety concerns arise. In other, more general human-computer interaction applications, head pose estimation is important for tasks that require information on user attention, for example, display control, online instruction, or the like. In addition, target orientation estimation is useful in other machine vision applications, for example, object identification, face recognition, and the like.
Conventional approaches to orientation estimation (either from a still image or from an image sequence) can generally be grouped into two major categories. The first category includes appearance-based methods, which use pattern classification techniques based on feature vectors extracted from target images. The second category includes approaches based on motion estimation, which use motion analysis techniques between successive images.
Appearance-based technology is generally based on image comparisons using pattern-classifier technologies, such as, for example, Naïve Bayesian Classifiers, Support Vector Machines (“SVM”), Neural Networks, Hidden Markov Models (“HMM”), or the like. These classifiers have been used successfully in many applications, but they are not without disadvantages. They need a large number of training examples, which are usually collected manually and each of which must be aligned exactly in order to extract feature vectors usable for comparison between a target and the model in the training samples. There are always some instances where classification fails, primarily because of appearance variation.
The physical differences in appearance between the model and the target present a problem for appearance-based classification. In human face classifiers in particular, selecting a set of features in a human face that can be tracked across all poses, between frames, and across a variety of target human faces is a challenging problem. This is especially true when determining side poses, since the side face appearance generally lacks distinct features that are shared among the general population, as compared to the front face. It is difficult to define a “common appearance” that applies to everybody. Appearance variation can be a problem even when operating on the same subject. For example, a person may be wearing sunglasses, may be wearing a hat, may shave off a beard, or the like. In addition, varying lighting conditions negatively impact classification performance.
Therefore, appearance-based orientation estimation systems that operate based on generic model databases can typically achieve only limited recognition performance. The great appearance variation between model and targets, or even within the same target at different times, leads to unstable results.
The other generalized approach is based on motion estimation technology. Motion estimation technology for pose estimation is generally based on visually recognizable features of the target. For example, human face pose estimation is generally based on the identification of face features, such as eyes, nose, mouth, and the like. This identification of particular features in an image is a hard problem in its own right. For example, conventional systems detect front faces in a scene through an exhaustive search based on various perceptual cues, e.g., skin color, motion, or the like. Once a face has been detected, the face region is tracked using related information such as facial features, edges, color, depth, motion, or the like. For real-time applications, e.g., live video, these methods underperform, particularly when the environment has a cluttered background.
These techniques suffer from several critical problems. For instance, automatic model pose initialization is still a difficult problem. Another drawback of motion estimation techniques is that the angle estimate is accurate only for a relatively short image sequence because error accumulation due to the incremental nature of the angle computation becomes too large for a long sequence. Eventually, the estimated angle drifts completely out of phase.
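The drift described above can be illustrated with a minimal sketch. It is not taken from any particular conventional system: the function names, the per-frame rotation of 0.5 degrees, and the bias and noise values are all hypothetical and chosen only for illustration. The sketch assumes the angle is obtained by summing per-frame increments, each carrying a small measurement error, and shows that the error of the summed angle grows with the length of the sequence.

```python
import math
import random


def accumulated_error(num_frames, true_delta_deg, bias_deg=0.0,
                      noise_std_deg=0.0, seed=0):
    """Return the absolute angle error after each frame.

    The estimate is built incrementally: each frame adds a measured
    rotation increment (true increment + bias + Gaussian noise) to the
    running angle. All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    true_angle = 0.0
    estimated_angle = 0.0
    errors = []
    for _ in range(num_frames):
        true_angle += true_delta_deg
        measured_delta = (true_delta_deg + bias_deg
                          + rng.gauss(0.0, noise_std_deg))
        estimated_angle += measured_delta  # per-frame errors are never corrected
        errors.append(abs(estimated_angle - true_angle))
    return errors


def rms_final_error(num_frames, trials, noise_std_deg):
    """Root-mean-square error at the last frame over several random trials."""
    total = 0.0
    for seed in range(trials):
        final = accumulated_error(num_frames, 0.5,
                                  noise_std_deg=noise_std_deg, seed=seed)[-1]
        total += final ** 2
    return math.sqrt(total / trials)


if __name__ == "__main__":
    # A constant bias of 0.01 degrees per frame grows linearly with frame count.
    biased = accumulated_error(100, 0.5, bias_deg=0.01)
    print("bias only, error after 100 frames:", biased[-1])

    # Zero-mean noise does not cancel out either: the RMS error keeps growing.
    print("noise only, RMS error after 10 frames:",
          rms_final_error(10, 200, 0.1))
    print("noise only, RMS error after 1000 frames:",
          rms_final_error(1000, 200, 0.1))
```

Under these assumptions, a constant per-frame bias produces an error that grows in proportion to the number of frames, and zero-mean noise produces an error whose root-mean-square value grows roughly with the square root of the number of frames. In both cases nothing in the incremental computation pulls the estimate back toward the true angle, which is why such estimates are reliable only over short sequences.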
Thus, there is a need for orientation estimation methods and systems that (1) are based on real-time image data, (2) are robust against appearance variation, and (3) can operate over long sequences without drifting.