Early techniques for determining head-pose used devices that were fixed to the head of the subject to be tracked. For example, reflective devices were attached to the subject's head, and a light source was used to illuminate the reflectors so that their locations could be determined. As such reflective devices are more easily tracked than the head itself, the problem of tracking head-pose was simplified greatly.
Virtual-reality headsets are another example of the subject wearing a device for the purpose of head-pose tracking. These devices typically rely on a directional antenna and radio-frequency sources, or directional magnetic measurement to determine head-pose.
Wearing a device of any sort is clearly a disadvantage, as the user's competence in wearing the device, and acceptance of it, then directly affects the reliability of the system. Devices are generally intrusive and will affect a user's behaviour, preventing natural motion or operation.
Structured light techniques that project patterns of light onto the face in order to determine head-pose are also known.
The light patterns are structured to facilitate the recovery of 3D information using simple image processing. However, the technique is prone to error in conditions of lighting variation and is therefore unsuitable for use under natural lighting conditions.
U.S. Pat. No. 6,049,747 describes a driver-monitoring device that determines head-pose using structured infra-red light. The technique measures head-pose and assumes it is an estimate of the driver's gaze direction. The lack of a more detailed gaze analysis clearly limits the usefulness of such a system.
Another group of head-pose tracking techniques are the so-called “classification techniques”. Classification techniques attempt to classify a video image as one of a set of possible outcomes. The techniques often use methods such as histograms, principal component analysis and template matching. The main problem with the approach is that only head orientation can be measured; head translation is not accounted for.
Head orientation is measured by classifying the instantaneous orientation as one of a finite set of possible orientations. As the number of candidate head positions increases, so does the probability of false classification. Another difficulty is that the set of candidate head positions must be generated in advance. This is a laborious process.
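The classification step described above can be sketched as a nearest-template search. This is only an illustrative sketch: the feature vectors, the orientation labels and the function name `classify_orientation` are hypothetical, and the known systems use richer features such as histograms or principal components.

```python
import math

def classify_orientation(features, candidates):
    """Return the label of the candidate template nearest to `features`.

    `candidates` maps an orientation label to a feature vector that was
    generated in advance (hypothetical data, for illustration only).
    """
    best_label, best_dist = None, math.inf
    for label, template in candidates.items():
        dist = math.dist(features, template)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical pre-generated set of candidate orientations.
CANDIDATES = {
    "left": [0.9, 0.1, 0.2],
    "frontal": [0.5, 0.5, 0.5],
    "right": [0.1, 0.9, 0.2],
}

label = classify_orientation([0.85, 0.15, 0.25], CANDIDATES)
```

The result can only ever be one of the pre-generated labels, and it carries no information about head translation. Adding more candidates brings the templates closer together, which is one way to see why the probability of false classification grows with the size of the set.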
The system presented by Pappu and Beardsley in “A Qualitative Approach to Classifying Gaze Direction”, Conference on Automatic Face and Gesture Recognition 1998, Nara Japan, provides an example of the present state of classification techniques for use in head-pose determination.
Other known systems of head-pose detection use techniques that rely on fitting a generic 3D head-mesh structure to sequences of images. This involves iteratively refining an estimation of the head-pose through measurement of the error between candidate 2D projections of the 3D mesh, and the image.
The technique is computationally expensive, and the accuracy is largely dependent on the similarity between the generic mesh model and the actual head being tracked. The wide variety of human face structure thus prevents any guaranteed measure of accuracy. The technique is likely to be applied for non-real-time image processing, with the aim of altering the appearance of a person's face.
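A much-reduced sketch of the iterative refinement described above follows. It is written under strong assumptions that are not taken from the cited systems: the head-pose is a single yaw angle, the projection is orthographic, the generic mesh `MESH` is six hypothetical points, and a coarse-to-fine search stands in for whatever optimiser a real system would use.

```python
import math

# Hypothetical generic head mesh: (x, y, z) points such as eyes, nose tip,
# mouth corners and chin. Real meshes contain far more vertices.
MESH = [(-3, 2, 0), (3, 2, 0), (0, 0, 3), (-2, -3, 0.5), (2, -3, 0.5), (0, -5, 1)]

def project(points3d, yaw):
    """Rotate points about the vertical (y) axis by `yaw` radians and
    project orthographically onto the image plane (z is dropped)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [(c * x + s * z, y) for x, y, z in points3d]

def reprojection_error(points3d, observed2d, yaw):
    """Sum of squared distances between projected mesh points and image points."""
    projected = project(points3d, yaw)
    return sum((u - ou) ** 2 + (v - ov) ** 2
               for (u, v), (ou, ov) in zip(projected, observed2d))

def fit_yaw(points3d, observed2d, iterations=20):
    """Refine a yaw estimate by repeatedly testing candidate poses around
    the current estimate and halving the search step."""
    yaw, step = 0.0, math.radians(30)
    for _ in range(iterations):
        candidates = (yaw - step, yaw, yaw + step)
        yaw = min(candidates,
                  key=lambda a: reprojection_error(points3d, observed2d, a))
        step /= 2
    return yaw

# Image points synthesised from the same mesh at a yaw of 20 degrees.
observed = project(MESH, math.radians(20))
estimate_deg = math.degrees(fit_yaw(MESH, observed))
```

Here the image points are generated from the same mesh that is being fitted, so the search can drive the error to zero. When the tracked head differs from the generic mesh, no pose reproduces the image points exactly and the minimum-error pose is biased, which is the accuracy limitation noted above.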
Examples of systems that use this style of technique can be seen in “A Robust Model-Based Approach for 3D Head Tracking in Video Sequences” by Marius Malciu and Francoise Preteux, and “Robust 3D Head Tracking Under Partial Occlusion” by Ye Zhang and Chandra Kambhamettu, both from Conference on Automatic Face and Gesture Recognition 2000, Grenoble France.
A further known technique which may be used for head-pose detection is the structure from motion technique. Structure from motion is a technique whereby the three-dimensional geometry of an object can be recovered from a single video source, using the information available from different views of the object as it moves relative to the camera. Such a technique is discussed in “Real Time Tracking and Modelling of Faces: An EKF-based Analysis by Synthesis Approach” in Proceedings of the Modelling People Workshop at the International Conference on Computer Vision, 1999 by J. Strom, T. Jebara, S. Basu and A. Pentland.
When used for head-pose tracking, a 3D model of the head is initialised using a generic three-dimensional mesh, and an extended Kalman filter is used to iteratively refine both the facial geometry and the head-pose.
Convergence of this technique is not assured due to the typical variation of human facial geometry. This is a similar problem to that of the template mesh model fitting technique described above, though somewhat lessened by the adaptive approach used.
It is also important to note that the technique is fragile to facial deformations such as smiling and blinking.
Fatigue measurement using blink detection is described in U.S. Pat. No. 5,867,587, and U.S. Pat. No. 5,878,156 describes a technique for fatigue measurement based on detecting the state of the eyes. Both methods are fragile if applied to tasks that involve wide ranging head motions, such as when driving a car.
Stereo reconstruction using feature templates is also known. Ming Xu and Takao Akatsuka in “Detecting Head Pose from Stereo Image Sequence for Active Face Recognition” from Conference on Automatic Face and Gesture Recognition 1998, Nara Japan, present a system for active face recognition. The system uses only four facial features, two of which are the eyes, to recover an approximate head-pose. The system cannot be used for practical head-pose tracking as it is fragile to head and eye motion (including blinking), and requires the image background to be uniform. The range and accuracy of head-motion measurement are also very limited due to the deformation and/or occlusion of features as the head is moved.
Work by Norbert Kruger, Michael Potzsch, Thomas Maurer and Michael Rinne in “Estimation of Face Position and Pose with Labelled Graphs” from Proceedings of British Machine Vision Conference, 1996, investigates the use of Gabor filter based template tracking combined with bunched graph fitting. This is an early paper, and further work can be seen at the Internet address: http://selforg.usc.edu:8376/~tmaurer/introduction.html
However, the system described has no facility for using the head-pose information to reliably track eye-gaze.
Similarly to head-pose detection, eye-gaze direction measurement has, in the past, been achieved with the use of devices worn by the subject.
Devices worn to detect eye-gaze direction have included mirrors, lenses, or cameras placed near the eye or in some instances special contact lenses to be placed on the eye. All the methods aim to obtain high-resolution or easily identifiable images of the eye that are independent of head-position.
Again, as for head-pose detection, wearing a device of any sort is a disadvantage, as the user's competence in wearing the device, and acceptance of it, then directly affects the reliability. Devices are generally intrusive and will affect a user's behaviour, interfering with natural motion and operation.
The infra-red technique involves shining an infra-red light on the face of the person being monitored, then detecting and analysing the reflections from the person's eyes.
Infra-red reflection techniques operate by detecting the reflection from the eye surface, the reflection from the cornea of the eye, or both.
The first reflection is from the spherical eye surface. This reflection determines the position of the eye. If the video camera is collocated with the source of the infra-red, the position of the reflection directly measures the position of the eyeball centre. The cornea, on the other hand, acts as a corner reflector.
Image processing is used to detect reflections from the eye surface or cornea, and to localise the centre of the limbus. An accurate gaze estimate can then be computed using the relative position difference between the iris centre and the reflections.
Infra-red sensing can yield very precise eye-gaze measurement. However, the technique is limited for the following reasons:
The cornea can only reflect light back to a source (act as a corner reflector) over a small range of angles. This limits the use of infra-red to applications where gaze is restricted to a small area.
To reliably analyse the reflections on the eye, a high-resolution image is required. Due to finite image sensor resolution, this limits the possible field-of-view for the sensor. To overcome this problem, either a very expensive high-resolution sensor must be used, or a bulky and failure-prone mechanical pan-tilt mechanism can be employed.
Natural lighting conditions can easily confuse the reflection detection process. Flashing techniques are often used to improve reliability; however, saturation of the pupil with sunlight will cause a flashing detector to fail. Fluctuating light on the pupil, typical of driving conditions, will also produce erroneous measurements.
The eye-gaze measurements taken using the infra-red reflection technique are in two dimensions only. That is, because the distance of the eyeball from the camera is not determined, the measurement is based on the assumption that the head remains at a fixed distance from the camera. Motion of the head towards or away from the camera will change the distance between the eye-reflections, and may be interpreted as a change in gaze-direction.
Techniques to compensate for motion toward or away from the camera are based on measuring the image area of the reflections or other regions on the face, and are prone to noise due to resolution constraints, overlapping reflections from other light sources, and distortion introduced by rotation of the head.
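The distance dependence of a two-dimensional measurement can be illustrated with a simple pinhole-camera calculation. All of the numbers below (focal length, physical offset, distances, calibrated gaze angle) are hypothetical and chosen only to make the arithmetic easy to follow. The fixed-gain mapping from pixel offset to gaze angle is likewise an assumed stand-in for a real calibration.

```python
def image_offset_px(offset_mm, distance_mm, focal_px):
    """Pinhole projection: image size of a small physical offset that lies
    at `distance_mm` from the camera."""
    return focal_px * offset_mm / distance_mm

def gaze_from_offset(offset_px, gain_deg_per_px):
    """Two-dimensional gaze estimate using a fixed gain that was calibrated
    at a single head distance."""
    return gain_deg_per_px * offset_px

FOCAL_PX = 1000.0           # hypothetical focal length in pixels
OFFSET_MM = 2.0             # hypothetical offset between iris centre and reflection
CALIB_DISTANCE_MM = 600.0   # head distance at which the system was calibrated
TRUE_GAZE_DEG = 10.0        # gaze angle that produces OFFSET_MM (assumed)

# Calibrate the gain so that the estimate is correct at the calibration distance.
GAIN = TRUE_GAZE_DEG / image_offset_px(OFFSET_MM, CALIB_DISTANCE_MM, FOCAL_PX)

# Same eye rotation, measured at the calibration distance and 100 mm closer.
at_calibration = gaze_from_offset(
    image_offset_px(OFFSET_MM, CALIB_DISTANCE_MM, FOCAL_PX), GAIN)
moved_closer = gaze_from_offset(
    image_offset_px(OFFSET_MM, 500.0, FOCAL_PX), GAIN)
```

With these assumed values the estimate is 10 degrees at the calibration distance and about 12 degrees once the head has moved 100 mm closer, although the eye has not rotated at all.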
The majority of the known techniques for passive eye-gaze analysis suffer from one or more of a number of common drawbacks, including the following:
The eye-gaze direction is assumed to be perpendicular to the orientation of the face; the technique is only an eye region detection technique. Passive eye-gaze analysis is not actually performed.
The technique performs eye-gaze tracking without three-dimensional head-pose tracking to compensate for head orientation and motion, thus limiting the application of the technique to strict translations in the image plane, or no head-motion at all.
Some known techniques use neural networks to estimate gaze direction. Neural networks require long training sequences for every person to be monitored, and do not allow for any head motion.
The techniques based on finding the distortion of the iris circle due to eye rotation tend to be extremely noise and resolution sensitive.
The technique used to find the iris in “Vision-Based Eye-Gaze Tracking for Human Computer Interface”, IEEE International Conference on Systems, Man and Cybernetics, Tokyo, Japan, 1999 by Kim Kyung-Nam et al., uses a circular Hough transform to find the iris centre. The gaze direction is determined using the distance of the iris centre from a fixed marker that must be worn on the face. Thus, head rotation or head translation along the camera axis will be interpreted by the technique as a change in gaze direction.
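A circular Hough transform of the general kind mentioned above can be sketched as follows. This is not the procedure of the cited paper: the iris radius is assumed to be known in advance, the edge points are assumed to have been extracted already, and the function name `hough_circle_centre` is hypothetical.

```python
import math
from collections import Counter

def hough_circle_centre(edge_points, radius, angle_steps=72):
    """Vote for circle centres of a known radius and return the winner.

    Each edge point votes for every pixel that lies `radius` away from it.
    The true centre collects a vote from each edge point on the circle.
    """
    votes = Counter()
    for x, y in edge_points:
        cells = set()
        for k in range(angle_steps):
            theta = 2 * math.pi * k / angle_steps
            cx = round(x - radius * math.cos(theta))
            cy = round(y - radius * math.sin(theta))
            cells.add((cx, cy))  # at most one vote per cell per edge point
        votes.update(cells)
    centre, _ = votes.most_common(1)[0]
    return centre

# Synthetic edge points on a circle of radius 8 centred at (20, 15).
EDGE_POINTS = [(round(20 + 8 * math.cos(math.radians(a))),
                round(15 + 8 * math.sin(math.radians(a))))
               for a in range(0, 360, 10)]

centre = hough_circle_centre(EDGE_POINTS, radius=8)
```

The transform yields only an image position for the iris centre. Any gaze estimate built on the distance from that centre to a marker on the face inherits the sensitivity to head rotation and to motion along the camera axis described above.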
U.S. Pat. No. 6,072,892 describes an “eye position detecting apparatus and method therefor”. The technique locates the position of the eyes in an image of a face using a histogram classification approach. The technique only locates the eyes, and does not perform any actual eye-gaze measurement.
U.S. Pat. No. 6,055,323 describes a technique for “face image processing” by locating the position of the eyes in an image of a face by first locating nares (nostrils) and then using a default model of the face to determine the eye image regions.
U.S. Pat. No. 5,859,686 describes an “eye finding and tracking system” which locates and tracks the eyes using normalised cross-correlation of iris image templates, combined with knowledge of probable eye-positions to reduce the probability of erroneous detection.
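Normalised cross-correlation of a template against an image can be sketched as below. For brevity the sketch works on a single row of grey levels rather than a two-dimensional image, and the pixel values and function names are hypothetical. It does not reproduce the patented system, which also uses knowledge of probable eye-positions.

```python
import math

def ncc(patch, template):
    """Normalised cross-correlation of two equal-length grey-level patches.

    Returns a score in [-1, 1], or 0.0 when either patch is uniform.
    """
    n = len(template)
    mean_p, mean_t = sum(patch) / n, sum(template) / n
    num = sum((p - mean_p) * (t - mean_t) for p, t in zip(patch, template))
    den = math.sqrt(sum((p - mean_p) ** 2 for p in patch) *
                    sum((t - mean_t) ** 2 for t in template))
    return num / den if den else 0.0

def best_match(row, template):
    """Slide the template along a row of pixels; return (offset, score)
    for the position with the highest correlation."""
    width = len(template)
    scores = [(ncc(row[i:i + width], template), i)
              for i in range(len(row) - width + 1)]
    score, offset = max(scores)
    return offset, score

# Hypothetical data: a dark iris between brighter sclera pixels.
TEMPLATE = [200, 40, 200]
ROW = [100, 100, 100, 150, 30, 150, 100, 100]

offset, score = best_match(ROW, TEMPLATE)
```

Because the means are subtracted and the result is divided by the patch energies, the score is unaffected by uniform changes in brightness and contrast. It is not unaffected by changes in the appearance of the eye as the head rotates, since the stored template then no longer resembles the image region.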
Each of the above techniques fails to account for three-dimensional motion of the head, and is prone to error due to head-rotation and head-translation along the camera axis.
Device wearing and active sensing are also used to detect eye closure and blinking.
Devices worn to measure eye-closure or detect blinking fall into two categories, namely, techniques using electrodes worn near the eyes that measure eyelid muscle activity, and devices worn on the head that project infra-red light onto the eye region, and determine eye-closure based on the amount of reflected light.
Clearly, wearing a device of any sort is a disadvantage, as the user's competence in wearing the device, and acceptance of it, then directly affects the reliability. Devices are generally intrusive and will affect a user's behaviour, interfering with natural motion and operation.
One known passive video technique used for eye closure and blink detection fits deformable eye templates to three parameters. The first two parameters represent the parabolic shapes of the eyelids, and the third represents the radius of a circle representing the edge between the iris and sclera. The technique relies on determining the location of the corners of each eye. Similarly to the techniques for eye finding, there is the in-built assumption that the head does not rotate away from the image plane, because the head-pose is not tracked in three dimensions.
U.S. Pat. No. 5,867,587 describes an “impaired operator detection and warning system employing eyeblink analysis”. The technique detects blink events, and measures blink duration with the aim of detecting a fatigued operator. The eyes are first found using the patented technique of U.S. Pat. No. 5,859,686. Blink events are detected when a fluctuation in the eye-template correlation meets a specific set of requirements. The blink technique described is defective in situations where an operator's head rotates significantly, due to the inadequate head-pose tracking.
U.S. Pat. No. 5,878,156 describes a technique for “detecting the open/closed state of the eyes based on analysis of relation between eye and eyebrow images in input face images”. The technique binarizes the image regions surrounding the eye and eyebrow, determines which regions represent the eye and eyebrow, then calculates the distance between the centroids of these regions. A technique that determines the ratio of the areas of the eye and eyebrow image regions is used to add robustness to variation in head-pose distance from the camera. This technique may be unreliable when the head is rotated left and right, as rotational motion of the head in this plane will cause the eyebrow and eye image-region area-ratio to change, which will be interpreted as a change in head-pose distance. The technique will also be unreliable when used on operators with fair or blonde eyebrows, when the eyebrows are moved around on the face, or when reflected light conditions on the eye change. Additionally, the technique will not work at all when the operator wears glasses or sunglasses, at the very least due to the fact that the frames will cover the eyebrow image regions.
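The binarise-and-measure idea described above can be sketched as follows. The sketch is a simplification under stated assumptions: the eyebrow and eye regions are separated by a fixed row, `split_row`, instead of being identified automatically; the area-ratio step is omitted; and the small test images and threshold are hypothetical.

```python
def binarize(image, threshold):
    """Mark pixels darker than `threshold` as foreground (1)."""
    return [[1 if px < threshold else 0 for px in row] for row in image]

def centroid(mask, row_range):
    """Centroid (row, col) of foreground pixels in rows [start, stop),
    or None when the region contains no foreground pixels."""
    start, stop = row_range
    pts = [(r, c) for r in range(start, stop)
           for c, v in enumerate(mask[r]) if v]
    if not pts:
        return None
    n = len(pts)
    return (sum(r for r, _ in pts) / n, sum(c for _, c in pts) / n)

def eye_brow_distance(image, threshold, split_row):
    """Vertical distance between the eyebrow centroid (above `split_row`)
    and the eye centroid (from `split_row` downwards)."""
    mask = binarize(image, threshold)
    brow = centroid(mask, (0, split_row))
    eye = centroid(mask, (split_row, len(mask)))
    if brow is None or eye is None:
        return None
    return eye[0] - brow[0]

B, D = 200, 50  # hypothetical bright (skin) and dark (hair, iris) grey levels
OPEN_EYE = [
    [B, B, B, B, B],
    [B, D, D, D, B],   # eyebrow
    [B, B, B, B, B],
    [B, B, B, B, B],
    [B, B, D, B, B],   # open eye: dark region spans three rows
    [B, D, D, D, B],
    [B, B, D, B, B],
    [B, B, B, B, B],
]
CLOSED_EYE = [row[:] for row in OPEN_EYE]
CLOSED_EYE[4] = [B, B, B, B, B]
CLOSED_EYE[5] = [B, B, B, B, B]
CLOSED_EYE[6] = [B, D, D, D, B]  # closed eye: a single dark line of lashes

FAIR_BROW = [row[:] for row in OPEN_EYE]
FAIR_BROW[1] = [B, 180, 180, 180, B]  # eyebrow too light to pass the threshold

open_distance = eye_brow_distance(OPEN_EYE, threshold=128, split_row=3)
closed_distance = eye_brow_distance(CLOSED_EYE, threshold=128, split_row=3)
fair_distance = eye_brow_distance(FAIR_BROW, threshold=128, split_row=3)
```

In these synthetic images the centroid distance differs between the open and closed states, which is the quantity the technique relies on. The fair-eyebrow image yields no eyebrow region at all after binarisation, so no distance can be computed, which illustrates one of the failure cases noted above.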
In summary, the known techniques for eye-gaze measurement, eye-closure measurement and blink detection have failed to use a three-dimensional estimate of head-pose as the foundation for further facial analysis, such as eye-gaze, closure or blink detection. The techniques have measured head-pose, eye-gaze, eye-closure or blink detection individually, but not simultaneously. Thus, the techniques have failed to take advantage of the relationships between the measures, and, in general, are limited in their application due to over-simplified approaches to the measurement problem.
More specifically, the known techniques have not suitably accounted for large variations in head position and rotation when measuring eye-gaze, eye-closure or blink event detection, and thus although claiming to be robust, are only robust given specific restrictions on head-pose. Thus the known techniques remain fragile when applied to head motions typical of operating a machine in a seated position, such as driving a car.
Additionally, no prior technique is known that automatically detects which parts of the face are flexible. There is a need for such a technique in the area of motion capture for facial animation.
Present-day facial animation systems map defined points or nodes on the face of a human onto another set of points or nodes on the face of a computer-animated character. These points are selected manually by placing markers on the face, and then placing corresponding control points onto the face geometry in the animation software. This process of identifying control points is lengthy and would be improved by automatically finding all the flexible points on the face.
U.S. Pat. No. 6,028,960 describes a technique for “face feature analysis for automatic lip-reading and character animation”. The technique tracks the face by identifying the nostril features, and then determining lip and mouth contours. The lip and mouth contours are then used to control an artificial face model. The technique simplifies the process of animating lip-motion for computer-animated talking characters. However, the technique makes no mention of using any facial feature other than the lips to perform this animation. It instead relies on the artificial generation of face structure using only the lip and mouth contours. It does not simplify the animation process for capturing facial expressions that involve eyelid, eyeball, eyebrow and other facial expressions not involving the mouth.
As the technique tracks the face using only the nostrils, it will work only while the nostrils are visible to the camera. Clearly, the system will fail for head orientations where the head is tilted forward so that the nostrils are obscured from the camera by the top of the nose.
U.S. Pat. No. 6,016,148 describes a technique for “automated mapping of facial images to animation wireframes topologies”. The technique describes the general principle of using measured positions of points or nodes on the face to alter corresponding points or nodes on a computer modelled wire-frame mesh topology. The patent does not include a method to automatically determine the location of these points or nodes on the face.