In automatic face recognition, an unknown subject is identified by comparing an input facial image against a database of previously identified persons. The database of stored images is referred to as a gallery or watch list, and the input image or video is usually referred to as a probe.
Biometric identifiers are distinctive, measurable characteristics used to label and describe individuals. Facial images are a commonly used biometric characteristic, as are images of the iris, fingerprints, gait, etc. Accurate and reliable face recognition can be utilized for surveillance and security tasks (e.g., entrance security, law enforcement, criminal record identification, etc.).
Two-dimensional facial recognition may be ineffective due to variations in illumination, expression, and pose/viewpoint. Use of three-dimensional (3-D) face models preserves the geometric structure of a face despite illumination, expression, and pose variations. For example, U.S. Pat. No. 7,620,217, entitled "Three-dimensional face recognition system and method," by W.-C. Chen, et al., herein incorporated by reference, discloses a generalized framework for three-dimensional face recognition. U.S. Patent Publication No. 2006/0078172, entitled "3D Face Authentication and Recognition Based on Bilateral Symmetry Analysis," by L. Zhang, et al., herein incorporated by reference, discloses the use of curvatures for 3-D face profile recognition and authentication.
Since depth information is lost in a two-dimensional image, construction of 3-D models from 2-D images requires specific algorithms and sensors. Examples of attempts to obtain a 3-D representation of a human face include U.S. Pat. No. 6,047,078, entitled "Method for Extracting a Three-dimensional Model Using Appearance-based Constrained Structure from Motion," by S. B. Kang, herein incorporated by reference, which discloses creation of a 3-D face model from a sequence of temporally related 2-D images by tracking the facial features. In the publication by Z. L. Sun, et al., entitled "Depth Estimation of Face Images Using the Nonlinear Least-squares Model," IEEE Transactions on Image Processing, 22(1): 17-30, (January 2013), the three-dimensional structure of a human face is reconstructed from its corresponding 2-D images with different poses. The depth values of feature points are estimated by a nonlinear least-squares method. These appearance-based approaches require two or more input 2-D images of different pose views of the subject, making them difficult to apply to moving subjects. Also, 2-D images are sensitive to environmental lighting.
Passive stereo sensors, as disclosed for example in U.S. Published Application No. 2005/0111705, entitled "Passive stereo sensing for 3-D facial shape biometrics," May 26, 2005, herein incorporated by reference, use two cameras to capture the object and determine the object's location in three-dimensional space by using a triangulation technique. Although passive stereo works well on textured scenes and has a high resolution, it has issues at occluding boundaries and difficulty with smooth regions (which lack texture for matching correspondences). Passive stereo requires illumination from the environment because it does not have an active light source. It is therefore only suitable for outdoor daylight or indoor strong light scenarios and cannot be applied in low light conditions.
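The triangulation underlying passive stereo can be summarized in a short sketch. For a rectified camera pair, depth is inversely proportional to the horizontal disparity between matched points; the function name and parameter values below are illustrative, not taken from the cited application.

```python
# Illustrative sketch of passive-stereo depth by triangulation: for a
# rectified camera pair with focal length f (pixels) and baseline B (meters),
# a match with disparity d pixels lies at depth Z = f * B / d.

def stereo_depth(disparity_px, focal_px, baseline_m):
    """Depth of a matched point from a rectified stereo pair."""
    if disparity_px <= 0:
        # e.g., a textureless region where no correspondence was found
        raise ValueError("no depth for zero or negative disparity")
    return focal_px * baseline_m / disparity_px

# A point matched with 40 px disparity, f = 800 px, B = 0.1 m:
z = stereo_depth(40.0, 800.0, 0.1)  # -> 2.0 meters
```

The failure mode noted above appears here directly: in a smooth, textureless region no reliable disparity exists, so no depth can be triangulated.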
Range sensors or Time-of-Flight (ToF) sensors resolve depth information by measuring the time or phase changes of the light emitted from the camera to the scene point. Similar to approaches based on the structured light technique, time-of-flight sensors emit active lighting of a certain spectrum into the scene and do not require additional environment light, so that they can be used in low light conditions. This class of apparatus is a popular solution in existing three-dimensional face recognition approaches due to high resolution and high speed. U.S. Pat. No. 6,947,579, entitled "Three-dimensional Face Recognition," issued September, 2005, by M. M. Bronstein, et al., discloses high precision three-dimensional representations of human faces acquired using a range camera; a 3-D isometric face model is used to deal with expression changes. In the publication by Kakadiaris, I. A., et al., entitled "Three-dimensional Face Recognition in the Presence of Facial Expressions: an Annotated Deformable Model Approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(4): 640-649, (2007), a laser scanner is used to acquire high resolution depth information. In this method, an annotated face model (AFM) is fitted to a face geometry image derived from the 3-D face data to eliminate the variations caused by expressions.
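The phase-based ranging principle used by such ToF sensors can be sketched briefly; the function below is an illustrative simplification (a single modulation frequency, no phase-wrapping handling), not the method of any cited reference.

```python
import math

# Sketch of phase-based time-of-flight ranging: the sensor measures the
# phase shift between emitted and received amplitude-modulated light, and
# depth follows from the speed of light and the modulation frequency:
#     d = c * dphi / (4 * pi * f_mod)
# The 4*pi (rather than 2*pi) accounts for the light's round trip.

C = 299_792_458.0  # speed of light, m/s

def tof_depth(phase_shift_rad, mod_freq_hz):
    """Depth in meters from a measured phase shift (radians)."""
    return C * phase_shift_rad / (4.0 * math.pi * mod_freq_hz)

# With 30 MHz modulation, a quarter-cycle shift (pi/2) is about 1.25 m:
d = tof_depth(math.pi / 2, 30e6)
```

Note that depths beyond half the modulation wavelength wrap around, which is why practical ToF cameras have a bounded unambiguous range.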
Due to the high cost of Time-of-Flight or range sensors, it is not practical to deploy them in a large scale environment for surveillance purposes. It is desirable to design an affordable large-scale face recognition system that can be used in dark/low light environments. For this reason Kinect sensors, such as disclosed in U.S. Pat. No. 7,433,024, are desirable for acquiring 3-D depth information. U.S. Pat. No. 7,433,024, entitled "Range Mapping Using Speckle Decorrelation," issued Oct. 7, 2008, by J. Garcia, et al., (herein incorporated by reference) discloses a Kinect sensor comprising a color camera (with three color channels: red, green, and blue, RGB) and an infrared (IR) projector and receiver. Similar to structured light approaches, the IR projector emits a random dotted pattern into the scene. By correlating the received pattern with the projected pattern, the depth information is resolved by stereo triangulation. Since the depth resolution acquired by the Kinect sensor is very low, while detecting facial features requires high-fidelity imagery, tailored algorithms are needed to make effective use of the Kinect sensor for face recognition.
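The pattern-correlation step can be illustrated with a toy one-dimensional sketch (not the patented speckle decorrelation method): the shift that maximizes the correlation between the projected and received dot patterns gives the disparity, from which depth follows by triangulation as in ordinary stereo.

```python
# Toy illustration of matching a projected dot pattern against the received
# pattern by correlation. A simple sum-of-products score stands in for full
# normalized cross-correlation; real sensors match 2-D speckle windows.

def best_shift(projected, received, max_shift):
    """Return the integer shift of `received` that best matches
    `projected` over shifts 0..max_shift."""
    n = len(projected)
    best, best_score = 0, float("-inf")
    for s in range(max_shift + 1):
        score = sum(projected[i] * received[i + s]
                    for i in range(n - max_shift))
        if score > best_score:
            best, best_score = s, score
    return best

pattern = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
received = [0, 0] + pattern[:-2]       # pattern displaced by 2 samples
disparity = best_shift(pattern, received, 4)  # -> 2
```

The recovered shift plays the role of the disparity in the triangulation relation, so nearer surfaces displace the dots more than distant ones.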
A Kinect sensor provides a color image (called an RGB image) and a depth image (called a D image) of the scene. Together, they are called RGB-D images. Using the depth information from a Kinect sensor for face recognition is an active area of research. Most existing methods use a single Kinect sensor for three-dimensional face data acquisition. The publication by R. I. Hg, et al., entitled "An RGB-D Database Using Microsoft's Kinect for Windows for Face Detection," published by the Eighth International Conference on Signal Image Technology and Internet Based Systems, Naples, (November, 2012), discloses the building of an RGB-D dataset using a Kinect sensor, comprising 1581 images of 31 targets. In this method, faces are aligned using a triangle formed by the two eyes and the nose. The publication by B. Y. Li, et al., entitled "Using Kinect for Face Recognition Under Varying Poses, Expressions, Illumination and Disguise," published by IEEE Workshop on Applications of Computer Vision, Tampa, Fla. (2013), discloses the use of "dictionary learning" to fit the noisy depth data acquired by a Kinect sensor onto a canonical model. The publication by G. Goswami, et al., entitled "On RGB-D Face Recognition Using Kinect," published by IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems, Arlington, Va. (2013), discloses the application of a histogram of gradients (HOG) feature on the entropy and saliency maps of both RGB and depth images for face classification. The publication by T. Huynh, et al., entitled "An efficient LBP-based descriptor for facial depth images applied to gender recognition using RGB-D face data," published by ACCV Workshop on Computer Vision with Local Binary Pattern Variants, Daejeon, Korea, (Nov. 5-6, 2012), discloses the use of gradient local binary patterns (G-LBP) to represent faces in depth images and the application of this descriptor to identify the gender of the subject.
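The local binary pattern (LBP) idea underlying the descriptor referenced above can be sketched as follows; this is the basic LBP operator, not the G-LBP variant of the cited paper, and the neighbor ordering is one of several common conventions.

```python
# Basic LBP sketch: each pixel is encoded by thresholding its 8 neighbors
# against the center value, producing an 8-bit code; histograms of these
# codes describe local texture. The same operator can be applied to depth
# images by treating depth values like intensities.

def lbp_code(patch):
    """8-bit LBP code for the center of a 3x3 patch (list of 3 rows)."""
    c = patch[1][1]
    # clockwise from top-left; ordering conventions vary by implementation
    neighbors = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                 patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    code = 0
    for bit, n in enumerate(neighbors):
        if n >= c:
            code |= 1 << bit
    return code

# The top row is nearer (larger depth value) than the center, so only the
# first three bits are set:
code = lbp_code([[5, 5, 5],
                 [1, 3, 1],
                 [1, 1, 1]])  # -> 0b00000111 == 7
```

Because the code depends only on sign comparisons with the center, it is invariant to monotonic changes in the depth or intensity scale, which is part of why LBP-style descriptors transfer well to noisy depth data.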
However, a single Kinect sensor has a limited field of view. In order to increase the field of view, some approaches use multiple Kinect sensors for data acquisition. The publication by M. Hossny, et al., entitled "Low cost multimodal facial recognition via Kinect sensors," published by Proceedings of the 2012 Land Warfare Conference, Melbourne, Victoria (2012), discloses a system comprising three Kinect sensors on a triangular rig to capture the 3-D information of a subject and the application of Haar features to detect human faces.
The above Kinect sensor based methods consider the depth image as a regular gray image and apply traditional statistical classification methods for recognition. See, e.g., M. A. Turk, et al., "Face Recognition Using Eigenfaces," Proceedings of IEEE Computer Vision and Pattern Recognition (CVPR), pp. 586-591 (1991).
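A minimal sketch of the eigenface-style statistical classification cited above: flatten each face image to a vector, find the dominant principal direction of the gallery by power iteration, and classify a probe by nearest neighbor in the projected subspace. The toy data and single principal component are illustrative assumptions; real systems keep many components and compute them via SVD.

```python
# Pure-Python eigenface-style sketch: project gallery and probe onto the
# dominant principal component of the (mean-centered) gallery, then classify
# the probe by nearest neighbor in that 1-D subspace.

def mean(vs):
    n, d = len(vs), len(vs[0])
    return [sum(v[i] for v in vs) / n for i in range(d)]

def power_iteration(vs, iters=100):
    """Dominant eigenvector of the covariance of the (already centered)
    row vectors `vs`, computed without forming the covariance matrix."""
    d = len(vs[0])
    w = [1.0] * d
    for _ in range(iters):
        w_new = [0.0] * d
        for v in vs:
            dot = sum(v[i] * w[i] for i in range(d))
            for i in range(d):
                w_new[i] += dot * v[i]  # w <- (sum_v v v^T) w
        norm = sum(x * x for x in w_new) ** 0.5
        w = [x / norm for x in w_new]
    return w

def classify(probe, gallery, labels):
    m = mean(gallery)
    centered = [[v[i] - m[i] for i in range(len(m))] for v in gallery]
    w = power_iteration(centered)
    proj = lambda v: sum((v[i] - m[i]) * w[i] for i in range(len(m)))
    scores = [abs(proj(probe) - proj(g)) for g in gallery]
    return labels[scores.index(min(scores))]

# Toy 2-D "face vectors": two subjects, two samples each.
gallery = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels = ["A", "A", "B", "B"]
subject = classify([9.0, 10.0], gallery, labels)  # -> "B"
```

Treating a depth image as a gray image, as the methods above do, means exactly this: the depth map is flattened and projected like any intensity image, without exploiting its 3-D geometric meaning.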
Pose variation is one of the major challenges in face recognition, even where three-dimensional data is available, because important facial features (e.g., eye corners and mouth corners) may not be visible in non-frontal poses. Most existing methods exploit facial symmetry to complete the missing data caused by pose variation or occlusion (See, for example, the publication by G. Passalis, et al., entitled "Using Facial Symmetry to Handle Pose Variations in Real-world 3D Face Recognition," published by IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(10): 1938-1951 (2011); and the publication by B. Y. Li, et al., entitled "Using Kinect for Face Recognition Under Varying Poses, Expressions, Illumination and Disguise," published by IEEE Workshop on Applications of Computer Vision, Tampa, Fla. (2013)). However, the completed face information obtained by this method lacks accuracy because the estimation is based on a hypothetical model. Taking a different approach, U.S. Published Application No. 2012/0293635, entitled "Head Pose Estimation Using RGBD Camera," published Nov. 22, 2012, by P. Sharma, et al., (herein incorporated by reference) discloses multiple temporally related depth images and application of an extended Kalman filter to estimate the 3-D head poses (translations and rotations) with respect to a reference pose. The accuracy of this approach depends on the number of depth images.
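The symmetry-completion strategy can be illustrated with a short sketch. It assumes the face has already been aligned so that its bilateral symmetry plane is x = 0; the rounding tolerance and point representation are illustrative choices.

```python
# Sketch of completing a partially occluded 3-D face by bilateral symmetry:
# after alignment so the symmetry plane is x = 0, points missing on one side
# are hypothesized by mirroring points from the visible side. This is why
# such completion is approximate -- the filled-in geometry is a reflection,
# not a measurement.

def complete_by_symmetry(points, tol=3):
    """Given (x, y, z) points from a face aligned to the x = 0 symmetry
    plane, append mirrored counterparts that are not already present."""
    key = lambda p: (round(p[0], tol), round(p[1], tol), round(p[2], tol))
    existing = {key(p) for p in points}
    completed = list(points)
    for x, y, z in points:
        mirrored = (-x, y, z)
        if key(mirrored) not in existing:
            completed.append(mirrored)
            existing.add(key(mirrored))
    return completed

# A half-face of two points gains its two mirrored counterparts:
cloud = [(1.0, 0.0, 5.0), (2.0, 1.0, 4.0)]
full = complete_by_symmetry(cloud)  # 4 points
```

The limitation noted above is visible in the sketch: any asymmetry of the real face (or any misalignment of the symmetry plane) is silently baked into the completed data.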
It is also known in the art to connect a plurality of Kinect sensors to obtain 3-D images. For example, Chinese Patent No. CN103279987A, by Huarong, et al., herein incorporated by reference, discloses a fast three-dimensional object modeling method based on Kinect, comprising the steps of: (step 1) fixing the relative positions of each Kinect and a rotating platform, with all Kinects directly facing the rotating platform from different visual angles, to obtain a relatively complete object model; (step 2) placing the object to be reconstructed in the center of the rotating platform, starting the system to reconstruct the object, modeling the scene from the depth information output by each Kinect using three-dimensional vision theory, and unifying the scene depth information of the Kinects, located in different coordinate systems, into a single coordinate system; (step 3) filtering erroneous three-dimensional point clouds using a removal method based on normal correction; specifically, obtaining dense three-dimensional point clouds from the scene depth information of step 2, extracting the normal information of the three-dimensional point clouds, constructing exterior-point judgment functions based on a local normal constraint, judging point cloud data that does not meet the local normal constraint to be exterior points, and removing the exterior points; and (step 4) obtaining a three-dimensional model of the object.
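The normal-constraint filtering of step 3 can be sketched in a simplified form. This is an illustrative interpretation, not the patented algorithm: each point is assumed to carry a precomputed unit normal and a precomputed neighbor list, and a point whose normal deviates too far from the mean normal of its neighbors is judged an exterior point and removed.

```python
import math

# Simplified sketch of exterior-point removal by a local normal constraint:
# a point is kept only if its normal is within `max_angle_deg` of the mean
# normal of its neighbors. Normal estimation and neighbor search (typically
# k-nearest neighbors on the point cloud) are assumed done elsewhere.

def filter_by_normals(normals, neighbors, max_angle_deg=60.0):
    """normals   -- list of unit normals (x, y, z), one per point
    neighbors -- neighbors[i] is the list of neighbor indices of point i
    Returns indices of points satisfying the local normal constraint."""
    cos_thresh = math.cos(math.radians(max_angle_deg))
    kept = []
    for i, n in enumerate(normals):
        idx = neighbors[i]
        if not idx:
            kept.append(i)       # isolated point: no constraint to test
            continue
        # mean neighbor normal, renormalized to unit length
        mx = [sum(normals[j][k] for j in idx) / len(idx) for k in range(3)]
        norm = math.sqrt(sum(c * c for c in mx)) or 1.0
        mean_n = [c / norm for c in mx]
        if sum(n[k] * mean_n[k] for k in range(3)) >= cos_thresh:
            kept.append(i)       # interior point: normal agrees locally
    return kept
```

In a fully connected toy neighborhood where two points share a normal and one disagrees sharply, the disagreeing point is classified as an exterior point and dropped.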