A number of computer vision applications require the automatic determination of the three-dimensional (3D) pose (3D rotation angles and 3D translation) of an object, as well as the 3D locations of landmark points on the object, from a 3D point cloud. In particular, some applications require the 3D pose of a human head, as well as the 3D locations of facial landmarks, such as centroids of the eyes, from a 3D point cloud. The 3D point cloud is typically constructed from a depth image acquired by a depth sensor, such as a Microsoft Kinect™, a Creative Senz3D™ sensor, or a stereo camera. The 3D point cloud can also be generated synthetically using a 3D model of the object, or the 3D point cloud can be acquired directly using a 3D scanner such as a Cyberware™ scanner.
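As an illustration of how a 3D point cloud can be constructed from a depth image, the following is a minimal sketch that back-projects each pixel through a pinhole camera model. The function name `depth_to_point_cloud`, the intrinsic parameters `fx`, `fy`, `cx`, `cy`, and the convention that non-positive depth values mark missing measurements are assumptions made for this example; actual sensors may use different calibration models and invalid-pixel conventions.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def depth_to_point_cloud(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Back-project a depth image into 3D points in the camera frame.

    depth[v][u] is the depth along the optical axis at column u, row v.
    Values <= 0 are treated as missing measurements (an assumption).
    """
    points: List[Point3D] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels with no valid depth
            # Pinhole model: pixel offset from the principal point,
            # scaled by depth and divided by focal length.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth image with two valid pixels (hypothetical values).
depth = [[0.0, 2.0],
         [1.0, 0.0]]
cloud = depth_to_point_cloud(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In this toy case, `cloud` holds one 3D point for each of the two pixels with valid depth; the two zero-valued pixels are dropped.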
Automatically determining the head pose and facial landmark locations is important for face recognition systems, human-computer interfaces, and augmented reality systems, to name but a few applications. In face recognition systems, for example, one of the impediments to high accuracy is variation in the pose of the head. By accurately determining the pose, computer graphics techniques can be used to re-render the face in a frontal pose and thus largely eliminate the variation due to the pose.
As another example, an augmented reality system for cars that uses the windshield as a head-up display needs to precisely determine the 3D position of the driver's eyes, so that information overlaid on the head-up display is correctly aligned with objects in the world that are visible through the windshield.
There are a number of prior-art solutions to the problem of head pose and facial landmark estimation. Many solutions use 2D images acquired by a grayscale or color camera to infer the 3D pose and location, e.g., by optimizing the pose, shape, and lighting parameters of a 3D morphable model to obtain a 2D rendering that matches an input image as closely as possible.
Some prior-art methods for solving this problem use depth images (also known as depth maps), which are 2D images in which the value at each pixel represents a depth value, or color-plus-depth images in which each pixel has color values and a depth value. Note that sensors that capture color-plus-depth images are sometimes called RGB-D (red, green, blue, depth) sensors, and the images the sensors produce are sometimes called RGB-D images. Also note that monochrome-plus-depth images (e.g., grayscale plus depth) can be considered as a type of color-plus-depth image.
One method uses a stereo pair of images to determine depths and then detects the head using skin color. A 3-layer neural network estimates the pose given the scaled depth image of the head region, see Seeman et al., “Head pose estimation using stereo vision for human-robot interaction,” IEEE International Conference on Automatic Face and Gesture Recognition, pp. 626-631, May 2004.
Another method uses a more accurate and faster system for head pose estimation that takes advantage of a low-noise depth image acquisition system and the speed of a graphics processing unit (GPU). First, candidate 3D nose positions are detected in a high-quality depth image. Then, the GPU is used to identify the best match between the input depth image and a number of stored depth images that were generated from an average head model located at each candidate nose position, see Breitenstein et al., “Real-time face pose estimation from single range images,” IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1-8, June 2008.
Other methods also use high-quality 3D depth images as input. Those methods are based on random regression forests and learned mappings from a patch of the depth image to head pose angles or facial landmark locations. In follow-up work, a Kinect sensor is used, which provides significantly noisier data than the high-quality scans used in the previous work, see Fanelli et al., “Random forests for real time 3D face analysis,” International Journal of Computer Vision, 101:437-458, 2013; Fanelli et al., “Real time head pose estimation with random regression forests,” IEEE International Conference on Computer Vision and Pattern Recognition, 2011; and Fanelli et al., “Real time head pose estimation from consumer depth cameras,” Proceedings of the German Association for Pattern Recognition (DAGM) Symposium, 2011.
One method estimates the pose from a Kinect sensor depth image by determining the 3D rotation of a template that best matches the input. However, that method requires an initial person-specific template in a known pose, which makes the method impractical for many applications, see Padeleris et al., “Head pose estimation on depth data based on particle swarm optimization,” CVPR Workshop on Human Activity Understanding from 3D Data, 2012.
U.S. Pat. Nos. 8,582,867 and 8,824,781 describe a method for human body pose estimation, in which the goal is to estimate the joint positions of the skeleton of a body. In that method, patches of a depth image are used to determine feature vectors, which are matched, using an approximate nearest neighbor algorithm, to a database of feature vectors from training patches with known displacements to the joint positions. Each nearest neighbor match is used to obtain displacements to joint locations, which are then used to derive estimates of the desired joint positions.
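The general idea of matching patch feature vectors to a training database and accumulating displacement votes can be sketched as follows. This is only an illustration under stated assumptions, not the method of the cited patents: it substitutes an exact brute-force nearest-neighbor search for the approximate one, handles a single joint, and derives the estimate by averaging the votes. The names `nearest_index` and `estimate_joint`, the feature vectors, and all numeric values are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(query: Sequence[float],
                  database: Sequence[Sequence[float]]) -> int:
    """Index of the database vector closest to the query.

    Exact brute-force search, used here in place of an approximate
    nearest-neighbor algorithm (an assumption for simplicity).
    """
    return min(range(len(database)),
               key=lambda i: squared_distance(query, database[i]))


def estimate_joint(patch_centers: Sequence[Vec3],
                   patch_features: Sequence[Sequence[float]],
                   train_features: Sequence[Sequence[float]],
                   train_displacements: Sequence[Vec3]) -> Vec3:
    """Estimate one joint position from per-patch displacement votes."""
    votes: List[Vec3] = []
    for center, feature in zip(patch_centers, patch_features):
        i = nearest_index(feature, train_features)
        dx, dy, dz = train_displacements[i]
        # Each match votes for: patch center + stored displacement.
        votes.append((center[0] + dx, center[1] + dy, center[2] + dz))
    n = len(votes)
    # Averaging the votes is an assumption made for this sketch.
    return (sum(v[0] for v in votes) / n,
            sum(v[1] for v in votes) / n,
            sum(v[2] for v in votes) / n)


# Toy training database: two feature vectors with known displacements.
train_features = [(0.0, 0.0), (1.0, 1.0)]
train_displacements = [(0.0, 0.0, 1.0), (2.0, 0.0, 0.0)]

# Two test patches: their 3D centers and their feature vectors.
patch_centers = [(0.0, 0.0, 0.0), (-2.0, 0.0, 2.0)]
patch_features = [(0.1, 0.0), (0.9, 1.2)]

joint = estimate_joint(patch_centers, patch_features,
                       train_features, train_displacements)
```

A practical system would replace the brute-force search with an approximate nearest-neighbor index and would likely use a more robust vote aggregation than a plain mean, since individual matches can be poor.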