Deep learning (DL) is a branch of machine learning and artificial neural networks based on a set of algorithms that attempt to model high-level abstractions in data by using a deep graph with multiple processing layers. A typical DL architecture can include many layers of neurons and millions of parameters. These parameters can be trained from large amounts of data on fast GPU-equipped computers, guided by novel training techniques that can work with many layers, such as rectified linear units (ReLU), dropout, data augmentation, and stochastic gradient descent (SGD).
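For illustration only, three of the training techniques named above can be sketched in a few lines of plain Python. The function names `relu`, `dropout`, and `sgd_step` are hypothetical and are not taken from any particular DL library; the dropout shown is the commonly used "inverted" variant, which is an assumption rather than something the text above specifies.

```python
import random


def relu(values):
    """Rectified linear unit: clamp negative activations to zero."""
    return [max(0.0, v) for v in values]


def dropout(values, rate, rng):
    """Inverted dropout (assumed variant): zero each activation with
    probability `rate` and rescale the survivors by 1 / (1 - rate) so the
    expected activation is unchanged."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def sgd_step(params, grads, learning_rate):
    """One plain SGD update: move each parameter against its gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]
```

In an actual training loop, the gradients passed to `sgd_step` would be computed on a small random batch of the training data at each iteration, which is what makes the descent "stochastic".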
Among the existing DL architectures, the convolutional neural network (CNN) is one of the most popular. Although the idea behind CNN has been known for more than 20 years, the true power of CNN has only been recognized after the recent development of deep learning theory. To date, CNN has achieved numerous successes in many artificial intelligence and machine learning applications, such as face recognition, image classification, image caption generation, visual question answering, and self-driving cars.
Face detection is an important process in many face recognition applications. A large number of face detection techniques can easily detect near-frontal faces. However, robust and fast face detection in uncontrolled situations can still be a challenging problem, because such situations are often associated with significant variations in faces, including pose changes, occlusions, exaggerated expressions, and extreme illumination variations. Some effective face detection techniques that can manage such uncontrolled situations include (1) a cascaded convolutional neural network (CNN) framework described in “A Convolutional Neural Network Cascade for Face Detection,” H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Jun. 1, 2015 (referred to as “the cascaded CNN” or “the cascaded CNN framework” hereinafter), and (2) a multitask cascaded CNN framework described in “Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks,” K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, IEEE Signal Processing Letters, Vol. 23, No. 10, pp. 1499-1503, October 2016 (referred to as “the MTCNN” or “the MTCNN framework” hereinafter).
In the cascaded CNN, a coarse-to-fine cascaded CNN architecture is proposed for face detection. More specifically, instead of using a single deep neural network, the cascaded CNN uses several shallow neural networks operating on different resolutions of the input image, so that the cascade can quickly reject background regions in the low-resolution stages, and then carefully evaluate a small number of candidate regions in the final high-resolution stage. To improve localization effectiveness, a calibration stage is used after each detection/classification stage to adjust the position of the detection window (or “the bounding box”). As a result, the cascaded CNN typically requires six stages and six simple CNNs: three for binary face detection/classification, and three more for bounding box calibration. This face detection framework can be highly suitable for implementations in embedded environments due to the cascade design and the simple CNN used by each stage. Note, however, that each of the bounding box calibration stages in the cascaded CNN requires an additional CNN and thus extra computational expense. Moreover, in the cascaded CNN, the inherent correlation between face detection and face alignment is ignored.
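The control flow of such a coarse-to-fine cascade can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the cited paper: the classifiers and calibrators are stand-in callables rather than trained CNNs, and the names `run_cascade`, `Classifier`, `Calibrator`, and the `threshold` parameter are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)
Classifier = Callable[[Box], float]      # stand-in for a detection CNN; returns a face score
Calibrator = Callable[[Box], Box]        # stand-in for a calibration CNN; returns an adjusted box


def run_cascade(
    candidates: Sequence[Box],
    stages: Sequence[Tuple[Classifier, Calibrator]],
    threshold: float = 0.5,
) -> Tuple[List[Box], List[int]]:
    """Run candidate windows through (classifier, calibrator) pairs ordered
    from the cheapest, lowest-resolution stage to the most expensive one.

    Returns the surviving boxes and the number of windows each stage had
    to evaluate, which shows how early rejection shrinks the workload.
    """
    survivors = list(candidates)
    evaluated_per_stage = []
    for classify, calibrate in stages:
        evaluated_per_stage.append(len(survivors))
        # Reject windows scored as background, then calibrate the rest.
        survivors = [calibrate(box) for box in survivors if classify(box) >= threshold]
    return survivors, evaluated_per_stage
```

With three `(classifier, calibrator)` pairs, this loop invokes six models in total, matching the six-stage count described above. Each calibrator is a separate call on every surviving window, which is where the extra computational expense arises.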
In the MTCNN, a multi-task cascaded CNN is proposed, which integrates the face detection and face alignment operations using unified cascaded CNNs through multi-task learning. In principle, the MTCNN also uses several coarse-to-fine CNN stages to operate on different resolutions of the input image. However, in the MTCNN, facial landmark localization, binary face classification, and bounding box calibration are trained jointly using a single CNN in each stage. As a result, only three stages are needed in the MTCNN. More specifically, the first stage of the MTCNN generates candidate facial windows quickly through a shallow CNN. Next, the second stage of the MTCNN refines the candidate windows by rejecting a large number of non-face windows through a more complex CNN. Finally, the third stage of the MTCNN uses a more powerful CNN to further decide whether each input window is a face or not. If the window is determined to be a face, the locations of five facial landmarks are also estimated. The performance of the MTCNN is notably improved compared to previous face detection systems. The MTCNN framework is generally more suitable for implementations on resource-limited embedded systems compared to the aforementioned cascaded CNN framework.
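The structural difference from the six-stage cascade can be sketched as follows: each stage is a single model that returns a face score, a refined bounding box, and (optionally) landmarks in one call. This is a simplified sketch under stated assumptions, not the MTCNN implementation. The stage networks are stand-in callables, the initial windows are supplied by the caller rather than generated by the first stage, steps such as non-maximum suppression are omitted, and the names `StageOutput`, `Detection`, and `run_multitask_cascade` are hypothetical.

```python
from typing import Callable, List, NamedTuple, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)
Point = Tuple[float, float]


class StageOutput(NamedTuple):
    """Joint output of one stage: classification, calibration, landmarks."""
    face_score: float
    refined_box: Box
    landmarks: Optional[List[Point]]


class Detection(NamedTuple):
    box: Box
    score: float
    landmarks: Optional[List[Point]]


StageNet = Callable[[Box], StageOutput]  # stand-in for one multi-task CNN


def run_multitask_cascade(
    windows: Sequence[Box],
    stage_nets: Sequence[StageNet],
    threshold: float = 0.5,
) -> List[Detection]:
    """Pass windows through the stage networks in order. Each network is
    called once per window and yields score, refined box, and landmarks
    together, so no separate calibration model is needed."""
    detections = [Detection(box, 1.0, None) for box in windows]
    for net in stage_nets:
        kept = []
        for det in detections:
            out = net(det.box)
            if out.face_score >= threshold:
                kept.append(Detection(out.refined_box, out.face_score, out.landmarks))
        detections = kept
    return detections
```

With three entries in `stage_nets`, each window that survives to the end has been processed by three models, versus six in the earlier cascade, and the final detections carry the landmark estimates produced by the last stage.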