Deep learning (DL) is a branch of machine learning based on artificial neural networks. It relies on a set of algorithms that attempt to model high-level abstractions in data by using a deep graph with multiple processing layers. A typical DL architecture can include many layers of neurons and millions of parameters. These parameters can be trained from large amounts of data on fast GPU-equipped computers, guided by novel training techniques that can work with many layers, such as rectified linear units (ReLU), dropout, data augmentation, and stochastic gradient descent (SGD).
Among the existing DL architectures, the convolutional neural network (CNN) is one of the most popular. Although the idea behind CNN has been known for more than 20 years, the true power of CNN has only been recognized after the recent development of deep learning theory. To date, CNN has achieved numerous successes in many artificial intelligence and machine learning applications, such as face recognition, image classification, image caption generation, visual question answering, and self-driving cars.
Face detection, i.e., detecting and locating the position of each face in an image, is usually the first step in many face recognition applications. A large number of face detection techniques can easily detect near-frontal faces. However, robust and fast face detection in uncontrolled situations can still be a challenging problem, because such situations are often associated with a significant amount of variation in faces, including pose changes, occlusions, exaggerated expressions, and extreme illumination variations. Some effective face detection techniques that can manage such uncontrolled situations include (1) a cascaded convolutional neural network (CNN) framework described in “A Convolutional Neural Network Cascade for Face Detection,” H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Jun. 1, 2015 (referred to as “the cascaded CNN” or “the cascaded CNN framework” hereinafter), and (2) a multitask cascaded CNN framework described in “Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks,” K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, IEEE Signal Processing Letters, Vol. 23, No. 10, pp. 1499-1503, October 2016 (referred to as “the MTCNN” or “the MTCNN framework” hereinafter).
In the cascaded CNN, a coarse-to-fine cascaded CNN architecture is proposed for face detection. More specifically, instead of using a single deep neural network, the cascaded CNN uses several shallow neural networks operating on different resolutions of the input image, so that it can quickly reject background regions in the low-resolution stages, and then carefully evaluate a small number of candidate regions in the final high-resolution stage. To improve localization effectiveness, a calibration stage is used after each detection/classification stage to adjust the position of the detection window (or “the bounding box”). As a result, the cascaded CNN typically requires six stages and six simple CNNs: three for binary face detection/classification, and three more for bounding box calibration. This face detection framework can be highly suitable for implementations in embedded environments due to the cascade design and the simple CNN used by each stage. Note, however, that each of the bounding box calibration stages in the cascaded CNN requires an additional CNN and thus extra computational expense. Moreover, in the cascaded CNN, the inherent correlation between face detection and face alignment is ignored.
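The coarse-to-fine rejection idea described above can be illustrated with a minimal Python sketch. It is not the implementation from the cited paper: the three scoring functions are toy stand-ins for the per-stage CNNs, the thresholds are arbitrary, the calibration stages are omitted, and every name here (`run_cascade`, `coarse_score`, and so on) is hypothetical. The sketch only shows how each later, more expensive stage has fewer windows to evaluate.

```python
from typing import Callable, List, Sequence, Tuple

Window = Tuple[int, int, int, int]               # (x, y, width, height)
Stage = Tuple[Callable[[Window], float], float]  # (scoring function, threshold)


def run_cascade(windows: Sequence[Window],
                stages: Sequence[Stage]) -> Tuple[List[Window], List[int]]:
    """Pass windows through successive stages.

    Returns the surviving windows and, for each stage, how many windows
    that stage had to evaluate.
    """
    survivors = list(windows)
    evaluated = []
    for score, threshold in stages:
        evaluated.append(len(survivors))
        # Windows scoring below the threshold are rejected as background.
        survivors = [w for w in survivors if score(w) >= threshold]
    return survivors, evaluated


# Toy stand-ins for the per-stage classifiers (assumptions, not real CNNs).
def coarse_score(w: Window) -> float:
    return 1.0 if w[2] >= 24 else 0.0   # cheap test: reject tiny windows


def medium_score(w: Window) -> float:
    return 1.0 if w[0] % 2 == 0 else 0.0


def fine_score(w: Window) -> float:
    return 1.0 if w[1] < 50 else 0.0


candidate_windows = [
    (0, 0, 12, 12),
    (2, 10, 24, 24),
    (3, 10, 24, 24),
    (4, 80, 48, 48),
    (6, 20, 48, 48),
]
cascade = [(coarse_score, 0.5), (medium_score, 0.5), (fine_score, 0.5)]
faces, evaluated_per_stage = run_cascade(candidate_windows, cascade)
```

In this toy run, the number of windows evaluated shrinks from stage to stage, which is the property that makes a cascade of simple networks attractive when computation is limited.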
In the MTCNN, a multi-task cascaded CNN is proposed, which integrates the face detection and face alignment operations using unified cascaded CNNs through a multi-task learning process. In principle, the MTCNN also uses several coarse-to-fine CNN stages to operate on different resolutions of the input image. However, in the MTCNN, facial landmark localization, binary face classification, and bounding box calibration are trained jointly using a single CNN in each stage. As a result, only three stages are needed in the MTCNN. More specifically, the first stage of the MTCNN generates candidate facial windows quickly through a shallow CNN. Next, the second stage of the MTCNN refines the candidate windows by rejecting a large number of non-face windows through a more complex CNN. Finally, the third stage of the MTCNN uses a more powerful CNN to further decide whether each input window is a face or not. If the window is determined to be a face, the locations of five facial landmarks are also estimated. The performance of the MTCNN is notably improved compared to previous face detection systems. The MTCNN framework is generally more suitable for implementations on resource-limited embedded systems compared to the aforementioned cascaded CNN framework.
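One common way to train several tasks jointly in a single network is to minimize a weighted sum of per-task losses. The following Python sketch illustrates that idea for the three tasks named above. It is a simplified illustration under stated assumptions rather than the exact training objective of the cited MTCNN paper: the weights `w_cls`, `w_box`, and `w_lmk` are illustrative defaults, the function names are hypothetical, and the choice to skip the box and landmark terms for non-face samples is a simplification.

```python
import math
from typing import Sequence


def cross_entropy(p_face: float, is_face: int) -> float:
    """Binary cross-entropy for the face / non-face classification task."""
    eps = 1e-12  # avoids log(0)
    return -(is_face * math.log(p_face + eps)
             + (1 - is_face) * math.log(1.0 - p_face + eps))


def squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of squared differences, used here for both regression tasks."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def joint_loss(p_face: float, is_face: int,
               box_pred: Sequence[float], box_target: Sequence[float],
               lmk_pred: Sequence[float], lmk_target: Sequence[float],
               w_cls: float = 1.0, w_box: float = 0.5,
               w_lmk: float = 0.5) -> float:
    """Weighted sum of the three task losses for one training sample.

    The weights are illustrative assumptions. Regression terms are only
    counted for face samples, since a background window has no box or
    landmark ground truth.
    """
    loss = w_cls * cross_entropy(p_face, is_face)
    if is_face:
        loss += w_box * squared_error(box_pred, box_target)
        loss += w_lmk * squared_error(lmk_pred, lmk_target)
    return loss
```

Because all three terms share one network, a single forward pass per window yields the face score, the box adjustment, and the landmark estimates, which is why no separate calibration network is needed at each stage.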
In many face detection applications, it is also desirable to estimate the pose of each face because each person's head/face can have different orientations, i.e., different poses, in different images, for example when a person is constantly moving in a video. Various techniques can be used to estimate the pose of the person's head/face. One example technique is to first estimate the locations of some facial landmarks, such as eyes, nose, and mouth, and then estimate the pose based on these landmark locations. Another technique involves representing the head pose with three Euler angles, i.e., yaw, pitch, and roll, and estimating the pose directly with these three angles. The angle-based pose estimation approach typically has a lower complexity than the landmark-based approach because the angle-based approach requires just three values, whereas the landmark-based approach generally requires more than three landmark coordinates in its estimation.
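To make the three-angle representation concrete, the Python sketch below builds a 3x3 rotation matrix from yaw, pitch, and roll. Axis and composition conventions vary between systems, so the choices here are assumptions: yaw rotates about the vertical (y) axis, pitch about the horizontal (x) axis, roll about the camera (z) axis, and the rotations are composed as Rz(roll) * Ry(yaw) * Rx(pitch). The function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_from_euler(yaw_deg: float, pitch_deg: float,
                        roll_deg: float) -> Matrix:
    """Rotation matrix for a head pose given in degrees.

    Assumed convention: R = Rz(roll) * Ry(yaw) * Rx(pitch).
    """
    y, p, r = (math.radians(v) for v in (yaw_deg, pitch_deg, roll_deg))
    rx = [[1.0, 0.0, 0.0],
          [0.0, math.cos(p), -math.sin(p)],
          [0.0, math.sin(p), math.cos(p)]]
    ry = [[math.cos(y), 0.0, math.sin(y)],
          [0.0, 1.0, 0.0],
          [-math.sin(y), 0.0, math.cos(y)]]
    rz = [[math.cos(r), -math.sin(r), 0.0],
          [math.sin(r), math.cos(r), 0.0],
          [0.0, 0.0, 1.0]]
    return matmul(rz, matmul(ry, rx))


def rotate(m: Matrix, v: Sequence[float]) -> List[float]:
    """Apply a 3x3 rotation matrix to a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


# A frontal face looks along +z in this convention; a 90-degree yaw turns
# the viewing direction toward +x.
turned = rotate(rotation_from_euler(90.0, 0.0, 0.0), [0.0, 0.0, 1.0])
```

The whole pose is carried by three numbers, whereas a landmark-based description with, say, five two-dimensional landmarks carries ten coordinates before any pose can be derived from them.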
Face detection on captured video images and pose estimation on the detected faces are useful in many embedded system applications. For example, in a surveillance camera system equipped with many cameras, to reduce the transmission bandwidth and the storage cost of the server, it is desirable that each camera sends only the faces in the captured video to the server, instead of sending the entire video. Hence, face detection can be used to generate the face images from video images. Moreover, to avoid sending and storing too many faces of the same person, it is also desirable to keep track of the pose change of each face, and send just the face image corresponding to the “best pose,” i.e., the face that is the closest to the frontal view (i.e., with the smallest rotations) of each detected person. Note that it is often beneficial to perform face detection and head-pose estimation in a joint process, because doing so can reduce the complexity of the overall system.
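The "best pose" selection described above can be sketched in a few lines of Python. The sketch assumes that an upstream tracker has already assigned each detected face a per-person identifier and that pose angles are available in degrees. The `FaceObservation` type, its field names, and the scoring rule (sum of absolute yaw, pitch, and roll) are illustrative assumptions; the text only requires that the chosen face has the smallest rotations.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class FaceObservation:
    track_id: int      # hypothetical per-person identifier from a tracker
    frame_index: int
    yaw: float         # degrees
    pitch: float       # degrees
    roll: float        # degrees


def rotation_score(obs: FaceObservation) -> float:
    """Smaller is closer to frontal. Summing absolute angles is one
    possible measure, chosen here for simplicity."""
    return abs(obs.yaw) + abs(obs.pitch) + abs(obs.roll)


def select_best_poses(
        observations: Iterable[FaceObservation]) -> Dict[int, FaceObservation]:
    """Keep, for each tracked person, the observation closest to frontal."""
    best: Dict[int, FaceObservation] = {}
    for obs in observations:
        current = best.get(obs.track_id)
        if current is None or rotation_score(obs) < rotation_score(current):
            best[obs.track_id] = obs
    return best


observations = [
    FaceObservation(track_id=1, frame_index=0, yaw=40.0, pitch=5.0, roll=2.0),
    FaceObservation(track_id=1, frame_index=1, yaw=-8.0, pitch=3.0, roll=1.0),
    FaceObservation(track_id=2, frame_index=1, yaw=15.0, pitch=-10.0, roll=0.0),
    FaceObservation(track_id=1, frame_index=2, yaw=25.0, pitch=0.0, roll=0.0),
]
best_by_person = select_best_poses(observations)
```

With this selection, a camera would need to send only one face image per tracked person instead of one per frame, which is the bandwidth and storage saving the paragraph describes.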