Deep learning (DL) is a branch of machine learning based on artificial neural networks, using a set of algorithms that attempt to model high-level abstractions in data through a deep graph with multiple processing layers. A typical DL architecture can include many layers of neurons and millions of parameters. These parameters can be trained from large amounts of data on fast GPU-equipped computers, guided by training techniques that work well with many layers, such as rectified linear units (ReLU), dropout, data augmentation, and stochastic gradient descent (SGD).
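The training techniques named above can be sketched on a toy problem. The following is a minimal, illustrative example (not from the original text): a small one-hidden-layer network trained with ReLU activations, inverted dropout, and plain SGD; all sizes, rates, and the toy task are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: learn y = x1 + x2 from random inputs.
X = rng.normal(size=(256, 2))
y = X.sum(axis=1, keepdims=True)

# Parameters of a 2-16-1 network (illustrative sizes).
W1 = rng.normal(scale=0.5, size=(2, 16))
b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1))
b2 = np.zeros(1)

lr, drop_p = 0.05, 0.2
for step in range(500):
    # Forward pass with ReLU activation.
    h = np.maximum(0.0, X @ W1 + b1)
    # Inverted dropout: zero random hidden units, rescale the survivors.
    mask = (rng.random(h.shape) > drop_p) / (1.0 - drop_p)
    h_drop = h * mask
    pred = h_drop @ W2 + b2
    err = pred - y                      # mean-squared-error gradient term
    # Backward pass (chain rule), then one SGD update.
    gW2 = h_drop.T @ err / len(X)
    gb2 = err.mean(axis=0)
    gh = (err @ W2.T) * mask * (h > 0)
    gW1 = X.T @ gh / len(X)
    gb1 = gh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

# Evaluate without dropout (inference mode); MSE should be low.
test_pred = np.maximum(0.0, X @ W1 + b1) @ W2 + b2
mse = float(np.mean((test_pred - y) ** 2))
print(mse)
```

Note the inference pass omits the dropout mask: because the mask is rescaled by 1/(1-p) during training, the trained weights can be used unchanged at test time.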
Among existing DL architectures, the convolutional neural network (CNN) is one of the most popular. Although the idea behind the CNN has been known for more than 20 years, its true power was recognized only after the recent development of deep learning theory. To date, CNNs have achieved numerous successes in artificial intelligence and machine learning applications, such as face recognition, image classification, image caption generation, visual question answering, and self-driving cars.
In some face recognition applications, it is also desirable to estimate the pose of each detected face, because a person's head/face can have different orientations, i.e., different poses, in a sequence of captured images due to movement. Using a CNN-based DL architecture, face detection and face-pose estimation can be performed as a joint process.
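The joint process described above is typically realized as a multi-task head: a shared feature vector feeds two output branches, so one forward pass yields both a face/non-face score and pose angles. The sketch below is a hypothetical illustration of that structure; the feature vector stands in for CNN activations, and all sizes and weights are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the shared features a CNN would produce for one image window.
feat = rng.normal(size=128)

# Detection branch: face / non-face probability via a sigmoid.
w_det = rng.normal(scale=0.1, size=128)
face_prob = 1.0 / (1.0 + np.exp(-(feat @ w_det)))

# Pose branch: regress yaw, pitch, roll angles from the same features.
W_pose = rng.normal(scale=0.1, size=(128, 3))
yaw, pitch, roll = feat @ W_pose

print(face_prob, yaw, pitch, roll)
```

Because both branches read the same shared features, the two tasks are computed jointly rather than by two separate networks, which is the efficiency the text alludes to.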
Performing face recognition on captured surveillance videos can provide extremely useful information for many applications, including crime/accident investigations and customer analysis in retail market research. Recently, with the rapid development of computer hardware and machine learning technology, it has become possible to perform many advanced tasks, such as face detection, directly on surveillance video systems. Face detection, i.e., detecting and locating the position of each face in an image, is usually the first step in a face recognition process. After face detection, the remaining face recognition tasks are often performed on a main server or a control center. However, if a surveillance video system includes a large number of surveillance cameras and all captured videos have to be transmitted to the control center or the main server to perform face recognition and other tasks, the requirements for network bandwidth and for the computational power of the control center or server can be prohibitively high. Hence, in a surveillance video system equipped with many cameras, it is desirable that each camera transmits only the detected faces in a captured video to the server, instead of sending the entire video. However, for surveillance video systems that routinely capture a large number of people, and in situations where people linger in a video for a long period of time, the amount of face image data generated and transmitted to the server can still be undesirably high.
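The bandwidth argument above can be made concrete with a back-of-the-envelope calculation. The numbers below are illustrative assumptions only (frame resolution, crop size, frame rate, and faces per frame are not from the text); they compare sending uncompressed full frames against sending only cropped face regions.

```python
# Assumed parameters for one camera (illustrative, not from the source).
frame_w, frame_h = 1920, 1080          # full-HD surveillance frame
face_w, face_h = 96, 96                # one detected face crop
fps, faces_per_frame = 30, 3

bytes_per_pixel = 3                    # 24-bit RGB, uncompressed
full_video = frame_w * frame_h * bytes_per_pixel * fps
faces_only = face_w * face_h * bytes_per_pixel * faces_per_frame * fps

ratio = full_video / faces_only
print(f"full video: {full_video/1e6:.1f} MB/s, "
      f"faces only: {faces_only/1e6:.3f} MB/s, ~{ratio:.0f}x reduction")
```

Under these assumptions the face-only stream is roughly 75 times smaller per camera, which illustrates why camera-side detection helps but also why, with many lingering faces, the transmitted volume can still grow large.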