Head pose estimation is generally addressed as a three-dimensional problem, where the studied parameters concern the three ways of rotating a head. The main uses of estimating head poses concern the man-machine interface, face recognition, teleconferencing for low bit-rate transmission, virtual reality, and the construction of a three-dimensional model of a face to simulate and display synthetic facial expressions. Conventional efforts have focused principally on developing hardware-based, three-dimensional (3D) image acquisition systems. Basically, such systems obtain 3D information using two different images of the same object. In the case of stereovision, two calibrated cameras are used. The company Eyetronics® has proposed ShapeSnatcher©, a low-cost 3D-acquisition system that uses a single camera coupled with a slide projector. A grid is projected onto the object and the distortion of the observed grid is compared with the original grid. In this manner, 3D information can be reconstructed.
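As a minimal sketch of how two images of the same object yield 3D information in the stereovision case, the following assumes an idealized, rectified pair of calibrated cameras (parallel optical axes, identical focal length, purely horizontal baseline). Under that assumption, depth follows from the standard relation Z = f·B/d, where d is the horizontal disparity of a matched point. The function name and parameters are illustrative and are not taken from any of the systems discussed here.

```python
def depth_from_disparity(focal_px, baseline_m, x_left_px, x_right_px):
    """Depth (in metres) of a point seen by a rectified stereo pair.

    Assumptions (not from the source text): both cameras share the focal
    length `focal_px` (in pixels), are separated horizontally by
    `baseline_m` (in metres), and the images are already rectified so a
    matched point lies on the same row in both images.
    """
    disparity = x_left_px - x_right_px  # horizontal shift between the views
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity


if __name__ == "__main__":
    # A point that shifts by 20 pixels between the two views.
    print(depth_from_disparity(800.0, 0.1, 420.0, 400.0))
```

The closer the point, the larger the disparity, which is why such systems depend on accurate calibration and on reliable matching of points between the two images.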
A number of patents deal with various aspects of head pose estimation including face location, face tracking, facial feature location, detection of eyes, detection of lips, face recognition, age classification of human faces, facial animation, facial motion tracking, synthesis of facial expressions for toys or figurines, and the like. Except for face location or face tracking, where the face is considered as a region of the image satisfying given properties, the other systems consider the face to be acquired in a frontal view, like a mug-shot photo.
An important challenge for face recognition systems is to recognize a face under varying poses. There are two streams of research in this field. The first tries to transform (using image processing) off-frontal views of faces into frontal views and then applies well-known frontal face recognition algorithms. The second tries to identify the pose of the face using a recognizer trained on one or more specific poses.
Hattori, K., Matsumori, S., and Sato, Y., “International Conference on Pattern Recognition”, 1998, pp 1183-1187, Brisbane, describe an analogous system. This hardware-based face measurement system uses a light source in a dark room, a color CCD camera, and two laser scanners, one on each side of the camera. The system obtains a precise 3D model of the face, from which the head pose can be computed.
A number of publications deal with head pose estimation algorithms, largely based on 3D image acquisition devices. One publication on face pose detection uses a single gray-level image and neural networks to classify face images into a few categories: left, frontal, and right. Another publication describes a system of pose discrimination based on the support vector machine using a single gray-level image. This approach uses a statistical method, based on a multi-dimensional linear classification, applied to the whole face image. The output is a qualitative response among the three poses: “left”, “frontal”, “right”. However, this system does not provide a quantitative measure of face direction, does not provide vertical directions, and cannot be further extended for facial gesture recognition.
Systems for transforming facial images from off-frontal to frontal include (1) Feng, G.C., Yuen, P.C. and Lai, J.H., “A New Method of View Synthesis under Perspective Projection”, Proceedings of the 2nd International Conference on Multi-modal Interfaces, pp IV-96, IV-101, Hong-Kong 1999, and (2) Maruyama, M., Asano, S., and Nakano, Y., “Face Recognition by Bi-directional View Synthesis”, Proceedings of the 14th International Conference on Pattern Recognition, pp 157-159, Brisbane, Australia, 1998. Each of these documents describes methods for transforming off-frontal faces into frontal faces. The aim of each of these techniques is solely to improve facial recognition performance. Disadvantageously, each of the documents teaches transforming an image before facial recognition is carried out on the image. Such transformations create a noisy image, as may be observed from the results reported in each document, leading to errors in facial recognition.
Systems for 3D pose estimation using a single camera are described by each of (1) Park, K.R., Nam, S.W., Han, S.C. and Kim, J., “A Vision-Based Gaze Detection Algorithm”, Proceedings of the 2nd International Conference on Multi-modal Interfaces, pp IV-47, IV-50, Hong-Kong 1999, (2) Chen, Q., Wu, H., Shioyama, T., Shimada, T. and Chihara, K., “A Robust Algorithm for 3D Head Pose Estimation”, Proceedings of the 14th International Conference on Pattern Recognition, pp 1356-1359, Brisbane, Australia, 1998, and (3) Wu, H., Shioyama, T. and Kobayashi, H., “Spotting Recognition of Head Gestures from Color Images Series”, Proceedings of the 14th International Conference on Pattern Recognition, pp 83-85, Brisbane, Australia, 1998. In each document, skin color and hair color regions are generally used to establish the 3D parameters. Information about the skin and hair color regions is used to estimate the pose of the head, because both types of information can be robustly extracted from color images and neither is sensitive to changes in facial expression, the wearing of glasses, or other local changes of facial features. A skin region is represented with the Skin Color Distribution Model, and a hair region is represented with the Hair Color Distribution Model.
In addition, the foregoing systems each approximate the head as a 3D ellipsoid and compute first- and second-order moments of these two regions to derive the 3D head pose. The tilt is simply the orientation of the principal axis of the face region, and the two other rotations (“up-down”, “left-right”) are expressed using third-order polynomials depending on the horizontal and vertical distance between the centroid of the face region and the centroid of the skin region. These two polynomials and their specific coefficients have been learnt statistically and are fixed. Disadvantageously, these methods require a crucial segmentation step. Segmentation can fail easily in certain circumstances, such as a complex background or a person without hair. Accordingly, each of these techniques has significant limitations and is not robust.
Kuno, Y., Adachi, Y., Murashima, T., and Shirai, Y., “Intelligent Wheelchair Looking at Its User”, Proceedings of the 2nd International Conference on Audio- and Video-based Biometric Person Authentication, pp 84-89, Washington, 1999, describe a system for determining face direction using a tracking algorithm. This system involves the detection of the face direction to guide an intelligent wheelchair, which is proposed as an alternative to the computation of the face direction. In this system, the face is tracked. If the face turns, the wheelchair turns. Precise computation of the degree of face turning is not required, and “up-down” movement is not considered, because the camera is always below the face of the user. The system roughly detects the eyes, the eyebrows, the nostrils, and the two mouth corners. These points are represented by square regions and, to compute the face direction, the relative positions of these squares are compared with the squares of the same face in a frontal view. Moreover, these squares are tracked throughout the sequence of images, stabilizing their extraction. The system uses several frames to decide the face direction and does not give a quantitative result.
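The moment-based computation described above for the single-camera systems can be sketched as follows. This is only an illustration under stated assumptions: the segmented face region is assumed to be available as a binary mask, the tilt is taken as the orientation of the principal axis derived from the second-order central moments, and the centroid offset is the raw input to the pose polynomials. The polynomial coefficients themselves are not reproduced in the text and are therefore not modelled here. All function names are hypothetical.

```python
import math


def region_moments(mask):
    """Centroid and second-order central moments of a binary region.

    `mask` is a list of rows; truthy entries belong to the region.
    Coordinates are image coordinates: x to the right, y downwards.
    """
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not points:
        raise ValueError("empty region: segmentation produced no pixels")
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mu20 = sum((x - cx) ** 2 for x, _ in points) / n
    mu02 = sum((y - cy) ** 2 for _, y in points) / n
    mu11 = sum((x - cx) * (y - cy) for x, y in points) / n
    return (cx, cy), (mu20, mu02, mu11)


def principal_axis_angle(mask):
    """Orientation (radians) of the region's principal axis, measured from the x axis."""
    _, (mu20, mu02, mu11) = region_moments(mask)
    return 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)


def centroid_offset(mask_a, mask_b):
    """Horizontal and vertical distance between the centroids of two regions."""
    (ax, ay), _ = region_moments(mask_a)
    (bx, by), _ = region_moments(mask_b)
    return ax - bx, ay - by


if __name__ == "__main__":
    # A thin diagonal region, used here only as a stand-in for a tilted face mask.
    diagonal = [[1 if x == y else 0 for x in range(5)] for y in range(5)]
    print(math.degrees(principal_axis_angle(diagonal)))
```

The sketch also makes the stated weakness concrete: every quantity is derived from the segmented masks, so an empty or badly segmented region (for example, no hair pixels) leaves the moments undefined or misleading.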
Huang, J., Shao, X., and Wechsler, H., “Face Pose Discrimination Using Support Vector Machine (SVM)”, Proceedings of the 14th International Conference on Pattern Recognition, pp 155-156, Brisbane, Australia, 1998, describe a system of pose discrimination based on the support vector machine using a single gray-level image. This approach uses a statistical method based on a multi-dimensional linear classification, applied to the whole face image. The output is a qualitative response among the three poses: “left”, “frontal”, “right”. However, this system does not provide a quantitative measure of face direction, does not provide vertical directions, and cannot be further extended for facial gesture recognition.
In the field of videoconference systems, several patents, including U.S. Pat. Nos. 5,500,671, 5,359,362, and 5,675,376, and several publications deal with gaze detection and line-of-sight estimation of the conferees. The purpose is to send the conferee a face image of an interlocutor that permits communication with eye-to-eye contact. In videoconference systems, conventional methods assume that the face is frontal and make intensive use of gaze detection to provide eye-to-eye communication between the conferees. Generally speaking, the systems seek to ensure that the faces transmitted through the communication networks look satisfactory. Using face direction estimation, rather than gaze detection alone, it is possible to enhance the quality of the transmitted images by generating a frontal view when the face is off-frontal, by sending a pre-registered frontal face, or by leaving unchanged the satisfactory frontal face that is currently displayed.
Thus, a need clearly exists for an improved system for recognizing faces from single images and for determining facial gestures.