In computer graphics, synthetically constructing realistic human heads, particularly the face portion, remains a fundamental problem. Hereinafter, when referring to ‘head’ or ‘face’, the invention is primarily concerned with the portion of the head extending from chin to brow and from ear to ear. Most prior art methods require either extensive manual labor by a skilled artist, expensive active light 3D scanners, see Lee et al., “Realistic Modeling for Facial Animations,” Proceedings of SIGGRAPH 95, pages 55–62, August, 1995, or the availability of high quality texture images as a substitute for exact face geometry, see Guenter et al., “Making Faces,” Proceedings of SIGGRAPH 98, pages 55–66, July 1998, Lee et al., “Fast Head Modeling for Animation,” Image and Vision Computing, Vol. 18, No. 4, pages 355–364, March 2000, and Tarini et al., “Texturing Faces,” Proceedings Graphics Interface 2002, pages 89–98, May 2002.
More recent efforts have focused on the availability of an underlying model for human faces, see Atick et al., “Statistical Approach to Shape from Shading: Reconstruction of 3D Face Surfaces from Single 2D Images,” Neural Computation, Vol. 8, No. 6, pages 1321–1340, 1996, Blanz et al., “A Morphable Model for the Synthesis of 3D Faces,” Proceedings of SIGGRAPH 99, July 1999, Pighin et al., “Synthesizing Realistic Facial Expressions from Photographs,” Proceedings of SIGGRAPH 98, July 1998, and Shan et al., “Model-Based Bundle Adjustment with Application to Face Modeling,” Proceedings of ICCV 01, pages 644–651, July 2001.
The model-based approaches make use of the fact that human faces do not vary much in their general characteristics from person to person. Blanz et al. derive an approximate textured 3D face from a single photograph. They require knowledge of rendering parameters, e.g., light direction, intensity, etc., which need to be specified by the user and adjusted by an optimization process. However, texture often increases the uncertainty in the process.
Blanz et al. formulate an optimization problem to reconstruct a textured 3D face from photographs in the context of an inverse rendering paradigm. However, their method does not exploit point-to-point correspondence across many faces. Moreover, if the scale of their faces varies across samples, e.g., a baby's face vs. an adult face, only a partial set of points on the larger face is relevant.
A 3D variant of a gradient-based optical flow algorithm can be used to derive the necessary point-to-point correspondence, see Vetter et al., “Estimating Coloured 3D Face Models from Single Images: An Example Based Approach,” Computer Vision—ECCV '98, Vol II, 1998. Their method also employs color and/or texture information acquired during the scanning process. That approach does not work well for faces of different races or in different illumination given the inherent problems of using static textures.
The application of statistical methods to 3D face geometry is relatively rare and not well explored. Atick et al. recover an eigenhead from a single image by leveraging knowledge of an object class. Modular eigenspaces can be used to recover 3D facial features and their correlation with texture. These can then be used to reconstruct the structure and pose of a human face in live video sequences, see Jebara et al., “Mixtures of Eigenfeatures for Real-Time Structure from Texture,” Proceedings of ICCV '98, January, 1998.
A number of methods are known for recovering 3D object shape from 2D object silhouettes that do not depend on color or texture information, see Lazebnik et al., “On Computing Exact Visual Hulls of Solids Bounded by Smooth Surfaces,” Computer Vision and Pattern Recognition (CVPR'01), Vol. I, pages 156–161, December 2001, Matusik et al., “Image-Based Visual Hulls,” Proceedings of SIGGRAPH 00, July 2000, Potmesil, “Generating Octree Models of 3D Objects from their Silhouettes in a Sequence of Images,” CVGIP 40, pages 1–29, 1987, Szeliski, “Rapid Octree Construction from Image Sequences,” CVGIP: Image Understanding, Vol. 58, No. 1, pages 23–32, 1993, and Zheng, “Acquiring 3D Models from Sequences of Contours,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, No. 2, February 1994.
The reconstructed 3D shape is called a visual hull, which is a maximal approximation of the object consistent with the object's silhouettes. The accuracy of this approximate visual hull depends on the number and location of the cameras used to generate the input silhouettes. In general, a complex object such as a human face does not yield a good shape when only a small number of cameras is used to approximate the visual hull. Moreover, human faces possess numerous concavities, e.g., the eye sockets and the philtrum, which are impossible to reconstruct even in an exact visual hull because of the inherent limitation of silhouettes.
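By way of illustration only, the limitation described above can be sketched with a toy voxel-carving example. The sketch is not taken from any of the cited references. It assumes a hypothetical 8×8×8 voxel grid, three axis-aligned orthographic cameras, and a cube-shaped object with a small pit in one face; the names `make_object`, `silhouette`, and `visual_hull` are introduced here for illustration.

```python
from itertools import product

N = 8  # grid resolution (an assumption made for this sketch)


def project(voxel, axis):
    """Orthographic projection: drop the coordinate along the viewing axis."""
    return tuple(c for i, c in enumerate(voxel) if i != axis)


def make_object(n=N):
    """Solid cube with a hypothetical 2x2x3 pit in its top (+z) face."""
    voxels = set(product(range(n), repeat=3))
    for v in product(range(3, 5), range(3, 5), range(n - 3, n)):
        voxels.discard(v)  # carve the concavity
    return voxels


def silhouette(voxels, axis):
    """Set of 2D pixels covered by the object when viewed along `axis`."""
    return {project(v, axis) for v in voxels}


def visual_hull(silhouettes, n=N):
    """Keep every voxel whose projection lies inside all silhouettes."""
    return {
        v
        for v in product(range(n), repeat=3)
        if all(project(v, axis) in sil for axis, sil in silhouettes.items())
    }


obj = make_object()
sils = {axis: silhouette(obj, axis) for axis in range(3)}
hull = visual_hull(sils)

# The hull contains the object, but the pit voxels are filled back in,
# because the pit never changes the outline seen by any of the cameras.
filled_in = hull - obj
print(len(obj), len(hull), len(filled_in))
```

In this sketch the hull is a superset of the object, and the voxels belonging to the pit reappear in the hull, which mirrors why features such as eye sockets cannot be recovered from silhouettes alone.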
However, if there exists inherent knowledge of the object to be reconstructed, then this knowledge can be used to constrain the silhouette information to recover the shape of the object. For example, an optimal configuration of human motion parameters can be searched by applying a silhouette/contour likelihood term, see Sminchisescu, “Consistency and Coupling in Human Model Likelihoods,” IEEE International Conference on Automatic Face and Gesture Recognition, May 2002. Internal and external camera parameters can be recovered using exact information of an object and its silhouette images, see Lensch et al., “Automated Texture Registration and Stitching for Real World Models,” Proceedings of Pacific Graphics '00, October 2000.
Therefore, there is a need for a system and method that can reconstruct faces from 3D shape models and 2D images.