Traditional interaction between a user and a computer occurs with the computer waiting passively for the user to dictate its actions. Through input devices, such as a keyboard and a mouse, the user communicates actions and intentions to the computer. Although this one-sided interaction is common, it fails to fully exploit the capabilities of the computer.
It is desirable to have the computer play a more active role in interacting with the user rather than merely acting as a passive information source. A more interactive design involves linking the computer to a video camera so that the computer can interact with the user. The computer achieves this interaction by detecting the user's presence and tracking the user. The user's face in particular provides important indications of where the user's attention is focused. Once the computer is aware of where the user is looking, this information can be used to determine the user's actions and intentions, and the computer can react accordingly.
An important way in which a computer determines where a user's attention is focused is by determining the facial pose of the user. A facial pose is the orientation of the user's face. The facial pose can be described in terms of rotation about three axes, namely, pitch, roll and yaw. Typically, the pitch is the movement of the head up and down (nodding), the yaw is the movement of the head left and right (turning), and the roll is the tilting of the head from side to side.
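As an illustration only, the three rotations described above can be composed into a single rotation matrix that gives the direction in which the face points. The sketch below is not taken from any particular technique; it assumes a coordinate frame in which the x axis points to the right, the y axis points up, and the z axis points from a frontal face toward the camera, and it assumes that the rotations are applied in the order pitch, then yaw, then roll. The function names, the sign conventions and the composition order are all hypothetical choices, and other conventions are equally valid.

```python
import math

def rot_x(a):
    """Rotation by angle a (radians) about the x axis: pitch (nodding)."""
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]

def rot_y(a):
    """Rotation by angle a (radians) about the y axis: yaw (turning)."""
    c, s = math.cos(a), math.sin(a)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]

def rot_z(a):
    """Rotation by angle a (radians) about the z axis: roll (tilting)."""
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def facial_pose_matrix(pitch, yaw, roll):
    """Compose the pose as R = Rz(roll) * Ry(yaw) * Rx(pitch).

    The composition order is an assumption made for this sketch.
    """
    return matmul(rot_z(roll), matmul(rot_y(yaw), rot_x(pitch)))

def facing_direction(pitch, yaw, roll):
    """Direction the face points, given that a frontal face points along +z."""
    return apply(facial_pose_matrix(pitch, yaw, roll), [0.0, 0.0, 1.0])
```

Under these assumptions, a frontal face (all three angles zero) points along the z axis, a yaw of 90 degrees turns the facing direction onto the x axis, and a roll by itself leaves the facing direction unchanged, because tilting the head rotates the face about the very axis along which it is looking.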
Determining a user's facial pose in real time, however, presents many challenges. First, the user's head must be detected and tracked to determine the location of the head. One problem with current real-time head tracking techniques, however, is that they are often confused by waving hands or changing illumination. In addition, techniques that track only faces do not run at realistic camera frame rates or do not succeed in real-world environments. Moreover, head tracking techniques that use visual processing modalities may work well in certain situations but fail in others, depending on the nature of the scene being processed. Current visual modalities, used singly, are not discriminating enough to detect and track a head robustly. Color, for example, changes with shifts in illumination, and people move in different ways. Conversely, “skin color” is not restricted to skin, nor are people the only moving objects in the scene being analyzed.