1. Field of the Invention
This invention relates to a method of image processing, particularly to a method of image processing using three facial feature points in 3-D head motion tracking.
2. Description of Related Art
In recent years, model-based image coding has drawn public attention, specifically for its foreseeable application in visual communication, such as the videophone or the virtual meeting mentioned in MPEG-4 Applications Document, ISO/IEC JTC1/SC29/WG11/N1729, July 1997. Since the primary subject in visual communication is the head section (and part of the shoulders) of an image, the focus falls mainly on the head to reduce the data load in transmission. One possible approach is to introduce an explicit 3-D head model, such as the well-known CANDIDE face model (Mikael Rydfalk, "CANDIDE-a Parameterised Face," Linkoping University, Report LiTH-ISY-I-0866, October 1987) with texture mapping. A general system model of model-based face image coding is shown in FIG. 1. A user's face model is first input into an encoder 10 and then adapted to fit the face image, and analyses on the model are employed to extract meaningful face features, as well as head motion. These analysis data are then sent through a transmission medium 15 to a decoder 20 to synthesize a realistic face image.
Methods for inferring 3-D motion from 2-D images in 3-D motion estimation can largely be divided into the following two classifications:
1. Use of 2-D feature points; and
2. Use of optic flow information.
In most methods, correspondences between 2-D feature points or selected pixels are first established, and the 3-D motion is then inferred under perspective projection, provided that only rigid-body motion is involved.
In the first classification, in Thomas S. Huang and Arun N. Netravali's "Motion and Structure from Feature Correspondences: A Review," Proceedings of the IEEE, vol. 82, No. 2, pp. 252-268, February 1994, Huang and Netravali categorized and introduced different algorithms to infer 3D motion from 3D-to-3D feature correspondences, 2D-to-3D feature correspondences, or 2D-to-2D feature correspondences. They concluded that at least five feature points are necessary to determine the actual 3D motion. Further, in Roger Y. Tsai and Thomas S. Huang's "Uniqueness and Estimation of Three-Dimensional Motion Parameters of Rigid Objects with Curved Surfaces," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. PAMI-6, No. 1, pp. 13-27, January 1984, Tsai and Huang proposed a linear algorithm for solving the 2D-to-2D feature correspondences with eight feature points.
In the second classification, luminance differences between two 2D images are utilized. Under the assumption of fixed lighting, the intensity differences between two consecutive video frames are mainly due to object motion. This method is adopted in OBASC (Object-Based Analysis-Synthesis Coder) and KBASC (Knowledge-Based Analysis-Synthesis Coder) developed at the University of Hannover (Liang Zhang, "Tracking a face for knowledge-based coding of videophone sequences," Signal Processing: Image Communication, vol. 10, pp. 93-114, 1997; Jorn Ostermann, "Object-based analysis synthesis coding based on the source model of moving rigid 3D objects," Signal Processing: Image Communication, vol. 6, pp. 143-161, 1994). Li et al. also elaborated on this concept and proposed a method that estimates motion with "no correspondence problem" (in Haibo Li, Pertti Roivainen, and Robert Forchheimer, "3-D Motion Estimation in Model-Based Facial Image Coding," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 15, No. 6, pp. 545-555, June 1993). Moreover, Netravali and Salz (in A. N. Netravali and J. Salz, "Algorithms for Estimation of Three-Dimensional Motion," AT&T Technical Journal, vol. 64, No. 2, pp. 335-346, February 1985) derived robust algorithms for estimating the motion parameters of rigid bodies observed from a television camera. In their algorithms, the capture rate of a television camera (30 frames/sec) is specifically considered.
However, in a practical software-based application for visual communication, the following two constraints have to be considered:
1. Real-time requirement for on-line communication;
2. Each additional feature point adds extra workload in pattern recognition.
Accordingly, this invention provides a method of image processing of three-dimensional (3-D) head motion with three facial feature points, including the following steps: providing a user's source image to a first processing device; capturing the user's first image and providing it to a second processing device; selecting three facial feature points of the first image from the second processing device to form a 3-D feature triangle; capturing the user's consecutive video frames and providing them to the second processing device when the user proceeds with head motions; tracking the three facial feature points corresponding to the consecutive video frames to form a series of actual 2-D feature triangles; rotating and translating the 3-D feature triangle freely to form a plurality of geometric transformations, selecting one of the geometric transformations with acceptable error between the two consecutive 2-D feature triangles, and repeating the step until the last frame of the consecutive video frames, so that geometric transformations corresponding to the various consecutive video frames are formed; and providing the geometric transformations to the first processing device to generate a head motion corresponding to the user's source image.
The facial feature points are three feature points such as the positions of the lateral canthi of the two eyes and the nose, or the ear-tips and the lower chin. The three feature points form a feature triangle and are calibrated. The motion between two consecutive video frames is slight, and human head motion reveals the following characteristics: (1) the feature points are fixed, and the three feature points form a feature triangle that can be considered a rigid body; (2) most head motions are rotation dominated; and (3) the rotation pivot of one's head can be considered to be at the center of the neck. A 3-D head motion estimate can be inferred from consecutive video frames with a steepest-descent iterative method. Subsequently, if an estimate for 3-D head motion is not acceptable, error recovery can be made for the 3-D head motion estimate with a prediction process, such as Grey System.
The embodiment disclosed in this invention presents a procedure that estimates head motion from two consecutive video frames using three feature points. Here, a precise solution is not intended; rather, an approximate one is provided because, for a videophone application, one need not know how many degrees a user on the other side turns his or her head, but only needs to see a natural face orientation.
Also, a camera captures a new image every tenth to thirtieth of a second. During such a short period, changes in motion are small, and a simple steepest-descent iterative method that tries each transformation adaptively should be able to resolve the unknown transformation quickly. The local minimum obtained is usually close to the global minimum, since the 3-D positions obtained in the last frame give a good initial guess for the iteration used in the current frame.
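The iteration described above can be illustrated with a minimal sketch. It is not the procedure of the embodiment itself: it assumes a greedy trial-step descent that tries a small positive and negative change to each of six pose parameters (three rotations about the neck pivot, three translations), keeps the trial that reduces the image error most, and halves the step sizes when no trial improves. The focal length `FOCAL`, the coordinates, the step sizes, and all function names are hypothetical choices made for illustration.

```python
import math

FOCAL = 600.0  # assumed focal length in pixels (hypothetical value)

def transform(point, pose, pivot):
    """Rotate a 3-D point about the pivot (x, then y, then z axis), then translate."""
    rx, ry, rz, tx, ty, tz = pose
    x, y, z = (point[i] - pivot[i] for i in range(3))
    c, s = math.cos(rx), math.sin(rx)
    y, z = y * c - z * s, y * s + z * c
    c, s = math.cos(ry), math.sin(ry)
    x, z = x * c + z * s, -x * s + z * c
    c, s = math.cos(rz), math.sin(rz)
    x, y = x * c - y * s, x * s + y * c
    return (x + pivot[0] + tx, y + pivot[1] + ty, z + pivot[2] + tz)

def project(point):
    """Perspective projection (camera at the origin, looking along +z)."""
    return (FOCAL * point[0] / point[2], FOCAL * point[1] / point[2])

def triangle_error(tri3d, tri2d, pose, pivot):
    """Sum of squared image distances between projected and tracked triangle."""
    total = 0.0
    for p, q in zip(tri3d, tri2d):
        u, v = project(transform(p, pose, pivot))
        total += (u - q[0]) ** 2 + (v - q[1]) ** 2
    return total

def estimate_pose(tri3d, tri2d, pivot, tol=1e-6, max_iter=500):
    """Greedy trial-step descent starting from 'no motion since the last frame'."""
    pose = [0.0] * 6
    steps = [0.02, 0.02, 0.02, 1.0, 1.0, 1.0]  # radians, then length units
    best = triangle_error(tri3d, tri2d, pose, pivot)
    for _ in range(max_iter):
        if best < tol:
            break
        improved = None
        for i in range(6):
            for sign in (1.0, -1.0):
                trial = list(pose)
                trial[i] += sign * steps[i]
                err = triangle_error(tri3d, tri2d, trial, pivot)
                if err < best:  # remember the best trial seen in this sweep
                    best, improved = err, trial
        if improved is None:
            steps = [s * 0.5 for s in steps]  # adapt: refine the step sizes
        else:
            pose = improved
    return pose, best

# Hypothetical feature triangle (two eye corners and the nose) and neck pivot.
tri3d = [(-45.0, -30.0, 600.0), (45.0, -30.0, 600.0), (0.0, 10.0, 570.0)]
pivot = (0.0, 120.0, 680.0)
# Simulate the tracked 2-D triangle of the next frame after a small head turn.
tri2d = [project(transform(p, [0.0, 0.05, 0.0, 0.0, 0.0, 0.0], pivot)) for p in tri3d]
pose, err = estimate_pose(tri3d, tri2d, pivot)
```

Because three points under perspective projection can admit more than one rigid pose, the returned `pose` is only one transformation with a small image error, which is consistent with the stated aim of an approximate rather than a precise solution.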
Furthermore, some characteristics of human head motion may help to design good error criteria that guide iterations toward the global minimum. However, incorrect estimations are still possible, so the capability to recover from incorrect results is necessary. Prediction algorithms are usually employed to provide a next possible step from previous history data, so a prediction algorithm is included in the motion estimation procedure for error recovery. In fact, the prediction algorithm can also help to smooth estimated motions.
Moreover, this embodiment includes a calibration procedure that is simplified by exploiting human head characteristics.
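As an illustration of predicting the next step from history data, the sketch below assumes the first-order grey model commonly written GM(1,1), one member of the Grey System family mentioned earlier. The text does not state which grey model is used, so this choice, the function name `gm11_predict`, and the sample values are assumptions. GM(1,1) requires strictly positive samples, so a motion parameter that can change sign (for example a rotation angle) would first need a constant offset added.

```python
import math

def gm11_predict(history):
    """Predict the next value of a strictly positive series with GM(1,1)."""
    n = len(history)
    if n < 4 or min(history) <= 0:
        raise ValueError("need at least 4 strictly positive samples")
    # Accumulated generating series: acc[k] = history[0] + ... + history[k].
    acc, running = [], 0.0
    for value in history:
        running += value
        acc.append(running)
    # Background values: means of neighbouring accumulated values.
    z = [0.5 * (acc[k] + acc[k - 1]) for k in range(1, n)]
    y = history[1:]
    # Least-squares fit of y = -a * z + b (two-parameter linear regression).
    m = n - 1
    sz, sy = sum(z), sum(y)
    szz = sum(v * v for v in z)
    szy = sum(zi * yi for zi, yi in zip(z, y))
    slope = (m * szy - sz * sy) / (m * szz - sz * sz)
    b = (sy - slope * sz) / m
    a = -slope
    if abs(a) < 1e-12:
        return b  # flat history: the model degenerates to a constant
    def acc_hat(k):
        # Fitted accumulated value at 0-based index k.
        return (history[0] - b / a) * math.exp(-a * k) + b / a
    # The next raw value is the difference of two fitted accumulated values.
    return acc_hat(n) - acc_hat(n - 1)

# Hypothetical history of one motion parameter over five frames.
predicted = gm11_predict([2.0, 2.2, 2.42, 2.662, 2.9282])
```

In an error-recovery role, a value such as `predicted` could stand in for a motion estimate that was rejected as unacceptable; blending it with accepted estimates is one possible way to obtain the smoothing effect mentioned above.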