Deep learning (DL) is a branch of machine learning and artificial neural networks based on a set of algorithms that attempt to model high-level abstractions in data by using a deep graph with multiple processing layers. A typical DL architecture can include many layers of neurons and millions of parameters. These parameters can be trained from large amounts of data on fast GPU-equipped computers, guided by training techniques that work with many layers, such as rectified linear units (ReLU), dropout, data augmentation, and stochastic gradient descent (SGD).
Among the existing DL architectures, the convolutional neural network (CNN) is one of the most popular. Although the idea behind CNN has been known for more than 20 years, the true power of CNN has only been recognized after the recent development of deep learning theory. To date, CNN has achieved numerous successes in many artificial intelligence and machine learning applications, such as face recognition, image classification, image caption generation, visual question answering, and autonomous driving.
Face detection, i.e., detecting and locating the position of each face in an image, is usually the first step in many face recognition applications. A modern face detection system often includes two main modules: a face detection module and a face tracking module. The face detection module often employs a DL architecture such as CNN to detect human faces in digital images. Once a new face (i.e., a new person) is detected by the face detection module in an image frame of a video, the face tracking module tracks the new person through subsequent image frames in the video to find and re-identify the same person in each of the subsequent image frames. For some low-complexity embedded system applications, the face tracking module can be implemented based on simple tracking techniques, e.g., a Kalman filter and the Hungarian algorithm.
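As an illustration of the association step in such a simple tracker, the following minimal sketch matches predicted track boxes to newly detected face boxes by maximizing total intersection-over-union (IoU). It is not taken from any particular system: the function names `iou` and `associate`, the `(x1, y1, x2, y2)` box format, and the `min_iou` threshold are all assumptions made for this example. The track boxes are assumed to have already been predicted for the current frame (e.g., by a Kalman filter), and the optimal assignment is found by brute-force search over permutations, which is a stand-in that yields the same result the Hungarian algorithm would but is only practical for a handful of faces.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, min_iou=0.3):
    """Match track boxes to detection boxes so that the total IoU is maximal.

    Brute-force stand-in for the Hungarian algorithm (hypothetical helper).
    Returns (matches, unmatched_tracks, unmatched_detections), where matches
    is a list of (track_index, detection_index) pairs.
    """
    n, m = len(tracks), len(detections)
    if n <= m:
        # Every track gets a distinct detection.
        candidates = (list(enumerate(p)) for p in permutations(range(m), n))
    else:
        # Every detection gets a distinct track.
        candidates = ([(t, d) for d, t in enumerate(p)]
                      for p in permutations(range(n), m))

    best_pairs, best_score = [], -1.0
    for pairs in candidates:
        score = sum(iou(tracks[t], detections[d]) for t, d in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score

    # Reject assigned pairs that overlap too little to be the same face.
    matches = [(t, d) for t, d in best_pairs
               if iou(tracks[t], detections[d]) >= min_iou]
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    unmatched_tracks = [t for t in range(n) if t not in matched_t]
    unmatched_detections = [d for d in range(m) if d not in matched_d]
    return matches, unmatched_tracks, unmatched_detections


tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11), (50, 50, 60, 60)]
matches, lost, new = associate(tracks, detections)
print(matches, lost, new)
```

In this sketch, an unmatched detection would start a new tracker (a new person), while an unmatched track is a candidate for being declared lost.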
In many face detection applications, it is also desirable to perform face-pose estimation because each person's head/face can have different orientations, i.e., different poses in different images. Moreover, to avoid sending and storing too many faces of the same person, it is also desirable to keep track of the pose change of each face, and send just the face image corresponding to the “best pose,” e.g., the face that is the closest to the frontal view (i.e., with the smallest rotations) of each detected person. Hence, when a tracked person/face is lost by the face tracking module, and the corresponding face tracker needs to be destroyed, a face image of that face having the best pose can be transmitted and stored for future reference. Various techniques can be used to estimate the pose of the person's head/face. One example technique is to first estimate the locations of some facial landmarks, such as eyes, nose, and mouth, and then estimate the pose based on these landmark locations. Another technique involves representing the head pose with three Euler angles, i.e., yaw, pitch, and roll, and estimating the pose directly with these three angles. Using a CNN-based DL architecture, face detection and face-pose estimation can be performed as a joint process.
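The best-pose bookkeeping described above can be sketched as follows. This is only an illustrative example under stated assumptions: the names `PoseObservation`, `rotation_score`, and `BestPoseTracker` are hypothetical, the angles are assumed to be in degrees, and "smallest rotations" is interpreted here as the smallest sum of absolute yaw, pitch, and roll, which is one possible choice among many.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoseObservation:
    """One pose estimate for a tracked face (angles assumed in degrees)."""
    frame_index: int
    yaw: float
    pitch: float
    roll: float


def rotation_score(obs: PoseObservation) -> float:
    """Smaller is closer to frontal; sum of absolute angles is an assumption."""
    return abs(obs.yaw) + abs(obs.pitch) + abs(obs.roll)


class BestPoseTracker:
    """Keeps only the most frontal observation seen for one tracked face."""

    def __init__(self) -> None:
        self.best: Optional[PoseObservation] = None

    def update(self, obs: PoseObservation) -> None:
        # Replace the stored pose only when the new one is strictly better.
        if self.best is None or rotation_score(obs) < rotation_score(self.best):
            self.best = obs

    def close(self) -> Optional[PoseObservation]:
        """Called when the track is lost; returns the pose worth transmitting."""
        return self.best


tracker = BestPoseTracker()
for obs in [
    PoseObservation(0, yaw=30.0, pitch=10.0, roll=5.0),
    PoseObservation(1, yaw=5.0, pitch=-3.0, roll=2.0),
    PoseObservation(2, yaw=-20.0, pitch=0.0, roll=0.0),
]:
    tracker.update(obs)

best = tracker.close()
print(best.frame_index)
```

Storing only the current best observation per track keeps memory use constant for each tracked face, which suits a resource-limited embedded system; a real system would also keep the cropped face image alongside the angles.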
A large number of face detection techniques can easily detect near-frontal faces. However, robust and fast face detection in uncontrolled situations can still be a challenging problem, because such situations are often associated with significant variations in faces, including pose changes, occlusions, exaggerated expressions, and extreme illumination variations. Some effective CNN-based face detection techniques that can manage such uncontrolled situations include (1) a cascaded-CNN framework described in Li et al., “A Convolutional Neural Network Cascade for Face Detection,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Jun. 1, 2015 (referred to as “the cascaded CNN” or “the cascaded CNN framework” hereinafter), and (2) a multitask-cascaded-CNN framework described in Zhang et al., “Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks,” IEEE Signal Processing Letters, Vol. 23, No. 10, pp. 1499-1503, October 2016 (referred to as “the MTCNN” or “the MTCNN framework” hereinafter).
However, due to the high complexity of the MTCNN framework and the often limited computational resources available within an embedded system, many challenges exist in implementing MTCNN-based face detection on an embedded system with satisfactory real-time performance. Moreover, the simple face tracking techniques used by embedded systems often result in many near-duplicate faces being tracked and transmitted, thereby wasting computational resources and network bandwidth.