1. Field of the Invention
The invention generally relates to the area of artificial intelligence and, more particularly, to computer vision, especially markerless head pose estimation and tracking from monocular video sequences captured by an imaging device (e.g., a video camera).
2. Related Art
In computer vision, head pose estimation is the process of inferring the orientation and position of a human head from digital imagery [1]. More precisely, for applications based on a monocular passive optical camera, it is the estimation of head motion in six degrees of freedom relative to a stationary camera, where the six degrees of freedom comprise three degrees of freedom for rotation about the three axes and three degrees of freedom for translation along the three axes.
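For illustration only, the six degrees of freedom can be sketched as three rotation angles plus a translation vector applied to a 3D point on the head. The function names, the pose tuple layout, and the rotation order (about x, then y, then z) are assumptions made for this sketch and are not taken from the cited approaches.

```python
import math

def rotation_matrix(rx, ry, rz):
    """Build R = Rz * Ry * Rx from angles in radians (the order is an assumed convention)."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy,     cy * sx,                cy * cx],
    ]

def apply_pose(pose, point):
    """Apply a 6-DOF pose (rx, ry, rz, tx, ty, tz) to a 3D point: p' = R p + t."""
    rx, ry, rz, tx, ty, tz = pose
    R = rotation_matrix(rx, ry, rz)
    t = (tx, ty, tz)
    # Rotate the point, then translate it along the three axes.
    return tuple(
        sum(R[i][j] * point[j] for j in range(3)) + t[i]
        for i in range(3)
    )

# Hypothetical pose: a quarter turn about the z axis, then a shift of 5 units along z.
moved = apply_pose((0.0, 0.0, math.pi / 2, 0.0, 0.0, 5.0), (1.0, 0.0, 0.0))
```

In this sketch the first three pose parameters carry the rotational degrees of freedom and the last three carry the translational ones, matching the decomposition described above.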
Although people can interpret head orientation and movement easily, head pose estimation remains one of the challenging problems in computer vision because the final pixel-based facial image is strongly affected by various factors, including geometric distortion of the camera, perspective camera projection, and varying illumination.
Error accumulation in pose estimation is another major concern during incremental pose tracking. A large number of existing head pose estimation approaches are based on estimating the incremental head motion between two successive video frames. Because each new pose in these incremental approaches is obtained by composing the previous pose with the latest increment, the errors of the individual increments add up over time, and the accumulated error degrades the estimation accuracy to the point that the final pose estimate becomes unusable.
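The effect of error accumulation can be sketched with a simplified simulation that tracks a single rotation angle rather than a full six-degree-of-freedom pose. The function name, the per-frame bias, and the noise level below are hypothetical values chosen only for illustration; they do not describe any particular tracker.

```python
import random

def accumulated_error(num_frames, bias=0.01, noise_std=0.1, seed=0):
    """Return the absolute error (degrees) of an incrementally tracked angle.

    Each frame adds a measured increment to the running estimate. The measured
    increment is the true increment plus an assumed small bias and random noise.
    """
    rng = random.Random(seed)
    true_angle = 0.0
    estimated_angle = 0.0
    for _ in range(num_frames):
        true_delta = rng.uniform(-1.0, 1.0)  # actual frame-to-frame motion
        measured_delta = true_delta + bias + rng.gauss(0.0, noise_std)
        true_angle += true_delta
        estimated_angle += measured_delta    # per-frame errors are never corrected
    return abs(estimated_angle - true_angle)

# Compare the drift after a short and a long sequence.
short_drift = accumulated_error(10)
long_drift = accumulated_error(1000)
```

Because no step in the loop ever re-anchors the estimate to an absolute reference, the bias term alone contributes an error proportional to the number of frames in this sketch.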
Two recent approaches [2, 3] show promising results for real-time head pose estimation under certain assumptions. Morency et al. [2] use an iterative normal flow constraint to estimate the pose differences between a current frame and key frames (including the last frame) that are collected in an online manner. Jang et al. [3] use feature point registration (the feature points including SIFT features and regularly sampled facial image points) to estimate the pose differences between the current frame and key frames (including the last frame) that are also collected in an online manner. However, the pose estimation results of Morency et al. and Jang et al. are sensitive to large translational motion because both the normal flow based computation and the feature point registration rely on local optimization. Consequently, neither approach is appropriate for interactive game applications, where the head of a user may make fast or large translational movements.
Thus there is a great need for pose estimation solutions that overcome the issues demonstrated in Morency et al. and Jang et al. Further, such solutions should be efficient enough for interactive applications (e.g., games) and should require neither intervention from the user(s) nor special constraints on the user(s).