Head pose is an important visual cue that enhances the ability of vision systems to process facial images. Head pose includes three angular components: yaw, pitch, and roll.
Yaw refers to the angle at which a head is turned to the right or left about a vertical axis. Pitch refers to the angle at which a head is pointed up or down about a lateral axis. Roll refers to the angle at which a head is tilted to the right or left about an axis perpendicular to the frontal plane.
Yaw and pitch are referred to as out-of-plane rotations because the direction in which the face points changes with respect to the frontal plane. By contrast, roll is referred to as an in-plane rotation because the direction in which the face points does not change with respect to the frontal plane.
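The distinction between out-of-plane and in-plane rotation can be illustrated with a minimal sketch. The axis convention used here (x lateral, y vertical, z perpendicular to the frontal plane) and the names `rotation`, `facing_after`, and `FACING` are assumptions introduced for illustration only; sign conventions for the angles vary between systems.

```python
import math

# Assumed convention: x = lateral axis, y = vertical axis,
# z = axis perpendicular to the frontal plane (the direction the face points).
FACING = [0.0, 0.0, 1.0]

def rotation(axis, degrees):
    """Return a 3x3 rotation matrix about the 'x', 'y', or 'z' axis."""
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    if axis == "x":  # pitch
        return [[1, 0, 0], [0, c, -s], [0, s, c]]
    if axis == "y":  # yaw
        return [[c, 0, s], [0, 1, 0], [-s, 0, c]]
    if axis == "z":  # roll
        return [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    raise ValueError("axis must be 'x', 'y', or 'z'")

def apply(matrix, vector):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def facing_after(axis, degrees):
    """Direction in which the face points after a single rotation."""
    return apply(rotation(axis, degrees), FACING)
```

Under this convention, `facing_after("z", 30)` leaves the facing direction unchanged (an in-plane rotation), whereas `facing_after("y", 30)` and `facing_after("x", 30)` both move it away from the z axis (out-of-plane rotations).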
Estimating head pose from photographs, video sequences, and other images is a highly complex task since it implicitly requires finding a face at an arbitrary pose angle. Several approaches for estimating head pose have been developed. These fall into two principal categories: model-based techniques and appearance-based techniques.
Model-based techniques typically recover an individual's 3-D head shape from an image and then use a 3-D model to estimate the head's orientation. An exemplary model-based system is disclosed in “Head Pose Estimation from One Image Using a Generic Model,” Proceedings IEEE International Conference on Automatic Face and Gesture Recognition, 1998, by Shimizu et al., which is hereby incorporated by reference. In the disclosed system, edge curves (e.g., the contours of eyes, lips, and eyebrows) are first defined for the 3-D model. Next, an input image is searched for curves corresponding to those defined in the model. After establishing a correspondence between the edge curves in the model and the input image, the head pose is estimated by iteratively adjusting the 3-D model through a variety of pose angles and determining the adjustment that exhibits the closest curve fit to the input image. The pose angle that exhibits the closest curve fit is determined to be the pose angle of the input image.
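The iterative search over pose angles can be sketched as follows. This is a simplified illustration, not the Shimizu system: it substitutes a handful of hypothetical point landmarks (`MODEL_POINTS`) for edge curves, assumes the correspondence between model and image is already known, uses an orthographic projection, and searches yaw only. All names are assumptions.

```python
import math

# Hypothetical 3-D landmark positions (x, y, z) on a generic head model.
MODEL_POINTS = [(-3.0, 2.0, 0.0), (3.0, 2.0, 0.0), (0.0, 0.0, 2.0), (0.0, -3.0, 0.5)]

def project(points, yaw_degrees):
    """Rotate points about the vertical axis, then drop depth (orthographic)."""
    c = math.cos(math.radians(yaw_degrees))
    s = math.sin(math.radians(yaw_degrees))
    return [(c * x + s * z, y) for x, y, z in points]

def fit_error(projected, observed):
    """Sum of squared 2-D distances between corresponding points."""
    return sum((px - ox) ** 2 + (py - oy) ** 2
               for (px, py), (ox, oy) in zip(projected, observed))

def estimate_yaw(observed, candidates=range(-90, 91, 5)):
    """Try each candidate yaw and keep the one with the closest fit."""
    return min(candidates,
               key=lambda yaw: fit_error(project(MODEL_POINTS, yaw), observed))
```

For example, `estimate_yaw(project(MODEL_POINTS, 20))` recovers a yaw of 20 degrees. Even in this reduced form, every candidate angle requires the model to be re-projected and re-scored, which suggests why a search over yaw, pitch, and roll with full edge curves is computationally expensive.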
But such model-based approaches suffer from several drawbacks. First, the computational complexity of model-based approaches is very high and beyond the capabilities of many personal computers.
Second, a single 3-D generic face model does not account for variations in head shape or facial expression. Thus, such models yield poor performance when applied to a wide variety of faces.
Third, the performance of model-based systems typically scales with input-image resolution: satisfactory performance requires resolutions on the order of 128 by 128 pixels, and performance degrades as the input-image resolution decreases.
In contrast to model-based techniques, appearance-based techniques typically compare a two-dimensional subject image with a set of two-dimensional model images. A distance metric is used to determine the distance between the subject image and each of the model images. The closest model image is used to determine the pose angle of the subject image.
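A minimal sketch of this comparison is given below. The Euclidean metric, the four-pixel toy images in `MODELS`, and the function names are assumptions chosen for illustration; the text does not specify a particular metric or image size.

```python
import math

# Hypothetical model images: (flattened pixel intensities, pose angle in degrees).
MODELS = [
    ([0.9, 0.1, 0.1, 0.1], -45),
    ([0.5, 0.5, 0.5, 0.5], 0),
    ([0.1, 0.1, 0.1, 0.9], 45),
]

def euclidean(a, b):
    """Euclidean distance between two equal-length pixel vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def estimate_pose(subject, model_images, metric=euclidean):
    """Return the pose angle of the model image closest to the subject image."""
    _, angle = min(model_images, key=lambda m: metric(subject, m[0]))
    return angle
```

Here `estimate_pose([0.2, 0.1, 0.2, 0.8], MODELS)` returns 45. The function evaluates the metric once per model image, so the work grows linearly with the size of the model set.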
But appearance-based techniques also suffer from significant drawbacks. In particular, the computational complexity of appearance-based methods depends on the number of model images used. If a large number of model images are used, then the system may not be able to perform the comparison in real time.
One appearance-based system that attempts to address this problem is disclosed in U.S. Pat. No. 6,144,755 to Niyogi et al., which is hereby incorporated by reference. Niyogi employs a tree-structured vector quantization technique to organize a training set of facial images. Each of the images in the training set is stored as a leaf of the tree. When an input image is received, the tree is traversed to determine the closest image in the training set. The pose angle of the closest image is output as the pose angle of the input image.
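The general idea of a tree-structured lookup can be sketched as follows. This is not Niyogi's implementation: the tree is built here by a simple recursive two-way clustering, and the node layout, seeding rule, and function names are all assumptions made for illustration.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def build(samples):
    """Build a binary tree from a list of (image_vector, pose_angle) samples."""
    centroid = mean([v for v, _ in samples])
    if len(samples) == 1:
        return {"centroid": centroid, "leaf": samples[0]}
    # Seed two clusters with a far-apart pair, then refine (2-means).
    a = max(samples, key=lambda s: dist(s[0], centroid))[0]
    b = max(samples, key=lambda s: dist(s[0], a))[0]
    for _ in range(5):
        left = [s for s in samples if dist(s[0], a) <= dist(s[0], b)]
        right = [s for s in samples if dist(s[0], a) > dist(s[0], b)]
        if not left or not right:
            # Degenerate split (e.g. duplicate images): fall back to halving.
            half = len(samples) // 2
            left, right = samples[:half], samples[half:]
            break
        a, b = mean([v for v, _ in left]), mean([v for v, _ in right])
    return {"centroid": centroid, "children": [build(left), build(right)]}

def lookup(tree, query):
    """Descend toward the nearer child centroid; return the leaf's pose angle."""
    node = tree
    while "children" in node:
        node = min(node["children"], key=lambda c: dist(c["centroid"], query))
    return node["leaf"][1]

# Hypothetical two-pixel training images with their pose angles.
TRAINING = [([0.0, 0.0], -90), ([1.0, 0.0], -45), ([5.0, 5.0], 0),
            ([9.0, 10.0], 45), ([10.0, 10.0], 90)]
```

With `tree = build(TRAINING)`, the call `lookup(tree, [9.2, 9.9])` returns 45. The traversal visits only one branch per level rather than every training image, but every training image must still be held in the tree, and the returned angle is always one of the stored training angles.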
One disadvantage of this system, however, is that it requires a large number of training images to be stored in memory throughout system operation. The storage requirements for these training images may exceed the amount of high-speed random-access memory found in many modern personal computers.
Furthermore, the output pose angles in this system are restricted to the pose-angle values available in the training-set images. Thus, this system will not exhibit adequate accuracy (within 5 to 10 degrees for many applications) unless a very large set of training images is stored.
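The relationship between accuracy and the number of stored pose values can be made concrete with a short calculation. It assumes evenly spaced training angles over a 180-degree span per axis (for example, -90 to +90 degrees); both the span and the function name `poses_needed` are illustrative assumptions. The calculation counts distinct pose values only, not the additional images needed to cover different subjects or lighting conditions.

```python
import math

def poses_needed(max_error_deg, span_deg=180.0):
    """Evenly spaced pose values per axis such that every angle in the span
    lies within max_error_deg of some stored value.

    The nearest stored value can be at most half the spacing away, so the
    spacing must not exceed 2 * max_error_deg.
    """
    return math.ceil(span_deg / (2.0 * max_error_deg)) + 1

per_axis = poses_needed(5.0)      # values needed along one axis
yaw_pitch_grid = per_axis ** 2    # combinations on a yaw-by-pitch grid
```

Under these assumptions, a 5-degree error bound needs 19 values per axis, or 361 yaw and pitch combinations, before any variation among subjects is taken into account.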
Another appearance-based pose-angle estimation method is disclosed in “Support Vector Regression and Classification Based Multi-view Face Detection and Recognition,” Proceedings IEEE International Conference on Automatic Face and Gesture Recognition, 2000, by Li et al., which is hereby incorporated by reference. In this technique, Principal Component Analysis (PCA) is first used to reduce the dimensionality of the input image. Then, a Support Vector Regression (SVR) module trained a priori estimates the head-pose angle.
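The dimensionality-reduction step can be sketched as a projection onto a precomputed basis. The sketch assumes the mean image and principal components were obtained from a training set beforehand; the four-pixel values in `MEAN` and `COMPONENTS` are hypothetical and are not taken from the Li paper.

```python
# Hypothetical precomputed PCA model for four-pixel images:
# a mean image and two orthonormal principal components.
MEAN = [0.5, 0.5, 0.5, 0.5]
COMPONENTS = [
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, 0.5, -0.5],
]

def pca_project(pixels, mean=MEAN, components=COMPONENTS):
    """Subtract the mean image, then take one dot product per component."""
    centered = [p - m for p, m in zip(pixels, mean)]
    return [sum(w * c for w, c in zip(row, centered)) for row in components]
```

For instance, `pca_project([1.0, 1.0, 0.0, 0.0])` reduces the four-pixel image to two coefficients; a reduced vector of this kind would then serve as the input to the trained regression module.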
But the estimation accuracy of this technique depends on the number of support vectors (SVs) employed, which can be a large portion of the training-set images. For instance, a 10,000-image training set requires, in this method, at least 1,500 SVs. Therefore, the SVR module requires a large amount of memory to estimate pose with an acceptable margin of error. Moreover, the large number of SVs increases the required computation time, making real-time implementation difficult.
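The dependence of run-time cost on the number of SVs follows from the standard form of a support-vector regression prediction, which sums one kernel evaluation per stored support vector. The sketch below illustrates that form; the RBF kernel, the `gamma` value, and the function names are assumptions, since the text does not state which kernel the Li system uses.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Radial basis function kernel (an assumed choice of kernel)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def svr_predict(x, support_vectors, coefficients, bias, kernel=rbf_kernel):
    """Evaluate f(x) = bias + sum_i coefficient_i * K(sv_i, x).

    Every stored support vector is read once and contributes one kernel
    evaluation, so both memory and time grow with the number of SVs.
    """
    return bias + sum(c * kernel(sv, x)
                      for sv, c in zip(support_vectors, coefficients))
```

With 1,500 SVs, each pose estimate therefore involves 1,500 kernel evaluations, and all 1,500 vectors must remain available in memory.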
Another appearance-based pose-angle estimation method is described in “Ensemble SVM Regression Based Multi-View Face Detection System,” Microsoft Technical Report MSR-TR-2001-09, Jan. 18, 2001, by Yan et al., which is hereby incorporated by reference. This system uses wavelet transforms to extract frontal, half-profile, and profile features of an input image and produces an image for each feature. Next, the feature images are provided to three support-vector classifiers. The outputs of these classifiers are provided to an ensemble SVR module that yields a pose angle. But since this system uses a support-vector technique, it suffers from the same problems as the Li system described above and cannot easily be implemented as a real-time system.
Therefore, a need remains for a rapid, robust, and cost-effective method to determine head pose. Such a system should preferably have low processing and memory requirements even when operating in real time.