With the popularity of smartphones and other electronic devices with camera and video-recording functions, more and more camera applications and cloud-computing-based services need to perform human-face recognition on live video in order to extract facial metadata, either from video clips captured by the camera or online in real time. For example, human-face recognition may be used to secure access to an electronic device.
However, performing human-face recognition directly on a video is more challenging than performing it on a still picture, because frame blurring and low resolution frequently occur in video, and under such circumstances serious recognition errors are inevitable.
To date, video-sequence-based face recognition has mainly followed three approaches:
1. Image Level Fusion
Image level fusion is fusion performed directly on the acquired raw images. It generally adopts a centralized fusion system to carry out the fusion processing. It is a low-level fusion; for example, determining a target property by performing image processing on a blurred image containing a plurality of pixels is an image level fusion.
For human-face recognition, an image super-resolution algorithm may specifically be employed to rebuild the human-face image. A super-resolution algorithm is a technique for enhancing the resolution of an image or video, with the aim that the resolution of the outputted image or video is higher than that of any inputted image or video frame. Here, "enhancing the resolution" means making the existing content clearer, or letting a user see a detail that could not be perceived previously. When it is relatively difficult or costly to obtain a high-quality image or video, using a super-resolution algorithm becomes essential. The process of super-resolution rebuilding may generally be performed in three steps:
(1) pre-processing, for example, de-noising, clipping, etc.;
(2) alignment, that is, estimating motion vectors between the low-resolution frames; and
(3) rebuilding, that is, fusing the information of multiple low-resolution frames.
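The three steps above can be illustrated with a minimal Python sketch. It is not the algorithm of any particular super-resolution method: it assumes purely translational, integer-pixel motion with wrap-around borders, uses a 3x3 mean filter as the pre-processing step, and keeps the output at the input resolution, so it shows only the pre-process, align, and fuse structure rather than an actual resolution gain. All function names (`shift`, `denoise`, `estimate_shift`, `fuse`) and the test frame are hypothetical.

```python
def shift(frame, dy, dx):
    """Circularly shift a 2-D frame (list of rows) down by dy and right by dx."""
    h, w = len(frame), len(frame[0])
    return [[frame[(y - dy) % h][(x - dx) % w] for x in range(w)]
            for y in range(h)]


def denoise(frame):
    """Step 1, pre-processing: 3x3 mean filter with wrap-around borders."""
    h, w = len(frame), len(frame[0])
    return [[sum(frame[(y + j) % h][(x + i) % w]
                 for j in (-1, 0, 1) for i in (-1, 0, 1)) / 9.0
             for x in range(w)] for y in range(h)]


def estimate_shift(ref, frame, max_shift=2):
    """Step 2, alignment: integer (dy, dx) such that frame ~ shift(ref, dy, dx)."""
    best, best_err = (0, 0), float("inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            cand = shift(ref, dy, dx)
            # Sum of squared differences between candidate and observed frame.
            err = sum((a - b) ** 2
                      for row_c, row_f in zip(cand, frame)
                      for a, b in zip(row_c, row_f))
            if err < best_err:
                best, best_err = (dy, dx), err
    return best


def fuse(frames, max_shift=2):
    """Step 3, rebuilding: undo each frame's shift, then average all frames."""
    clean = [denoise(f) for f in frames]
    ref = clean[0]
    h, w = len(ref), len(ref[0])
    acc = [[0.0] * w for _ in range(h)]
    for f in clean:
        dy, dx = estimate_shift(ref, f, max_shift)
        aligned = shift(f, -dy, -dx)
        for y in range(h):
            for x in range(w):
                acc[y][x] += aligned[y][x]
    return [[v / len(clean) for v in row] for row in acc]


# Hypothetical 8x8 frame with a small bright blob, observed at three offsets.
reference = [[0.0] * 8 for _ in range(8)]
reference[2][3], reference[3][3] = 100.0, 50.0
frames = [reference, shift(reference, 1, 2), shift(reference, -1, 0)]
fused = fuse(frames)
```

Even in this reduced form, the alignment step is a brute-force search over every candidate motion vector for every frame, which hints at why the full rebuilding process is computationally expensive.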
Super-resolution rebuilding of an image generally requires three-dimensional modeling, which results in high computational complexity.
Besides, there are also schemes that de-blur an image by restoring it according to the specific cause of the blur, for example, restoration for motion blur or restoration for defocus. The main purpose is to generate a clear picture on which tasks such as recognition and judgment can be performed.
However, at present, image level fusion for human-face recognition is mainly used for visual inspection; it is not very flexible and is quite sensitive to the environment (for example, noise), misalignment, and the like.
2. Feature Level Fusion
Feature level fusion mainly extracts local features of a human face from each frame of a video. Since samples of the same kind have a certain distribution in space, image set sub-space (mutual sub-space) methods and manifold learning may be employed to reduce the dimensionality of the sample feature space, and the dimension-reduced sample feature space is then matched against the eigenspace of the logged samples, thereby performing human-face recognition.
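As a rough illustration of the reduce-then-match idea, the following Python sketch substitutes plain principal component analysis and nearest-neighbour matching for the mutual sub-space and manifold learning methods named above, which are considerably more involved. The feature vectors, the labels, and every function name are hypothetical; the principal axes are obtained by power iteration with deflation, which assumes the data has at least `k` non-degenerate directions of variance.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leading_axis(rows, iters=200):
    """Power iteration for the dominant eigenvector of rows^T * rows."""
    d = len(rows[0])
    v = [float(i + 1) for i in range(d)]  # arbitrary non-uniform start vector
    for _ in range(iters):
        proj = [dot(r, v) for r in rows]
        w = [sum(p * r[i] for p, r in zip(proj, rows)) for i in range(d)]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v


def fit_eigenspace(samples, k):
    """Return the sample mean and k orthonormal principal axes."""
    n = len(samples)
    mean = [sum(col) / n for col in zip(*samples)]
    rows = [[x - m for x, m in zip(s, mean)] for s in samples]
    axes = []
    for _ in range(k):
        v = leading_axis(rows)
        axes.append(v)
        # Deflate: remove the found direction before searching for the next one.
        rows = [[x - dot(r, v) * vi for x, vi in zip(r, v)] for r in rows]
    return mean, axes


def project(feature, mean, axes):
    """Map a feature vector into the reduced (eigen) space."""
    centered = [x - m for x, m in zip(feature, mean)]
    return [dot(centered, a) for a in axes]


def recognize(probe, gallery, mean, axes):
    """Nearest neighbour in the reduced space; gallery holds (label, feature)."""
    p = project(probe, mean, axes)

    def dist(item):
        q = project(item[1], mean, axes)
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return min(gallery, key=dist)[0]


# Hypothetical logged samples: two per-frame feature vectors per identity.
gallery = [
    ("alice", [1.0, 0.1, 0.0, 0.2]),
    ("alice", [0.9, 0.2, 0.1, 0.1]),
    ("bob", [0.1, 1.0, 0.2, 0.0]),
    ("bob", [0.2, 0.9, 0.1, 0.1]),
]
mean, axes = fit_eigenspace([f for _, f in gallery], k=2)
label = recognize([0.95, 0.15, 0.05, 0.15], gallery, mean, axes)
```

Note that each probe vector here is still computed from a single frame, so the sketch inherits the per-frame limitation of this approach.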
In this scheme, all local features in a feature vector of the human face come from the same frame; therefore, the scheme remains bound by the constraint of individual frames.
3. Classifier Level Fusion
Classifier level fusion builds a multi-scale target classifier and a pose determiner, respectively, mainly based on the scale variation, pose variation, and image feature information of an object. It estimates a confidence level of the target recognition result, a weight for pose variation between adjacent frames, and a target scale weight. The recognition result of each frame is compared with image samples in a database to score each frame, and the target image fusion is then performed based on the score of each frame. However, classifier level recognition mainly relies on a single-frame classifier and on a decision based on the score of each frame; thus, it still suffers from inaccurate classification caused by insufficient feature extraction. Besides, in a complex dynamic environment, few classifier algorithms are suitable for recognizing a dynamic target.
Therefore, there is a need to recognize a human-face image from a video sequence rapidly, accurately, and robustly.
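The score-based fusion described above can be sketched in Python as follows. This is a simplification under stated assumptions: the confidence level, pose weight, and scale weight are collapsed into one hypothetical weight per frame, the per-frame scores are taken as already produced by some single-frame classifier, and the function name `fuse_frame_scores` is illustrative only.

```python
def fuse_frame_scores(frame_scores, frame_weights):
    """Weighted average of per-frame identity scores.

    frame_scores:  one dict per frame, mapping identity label -> score
    frame_weights: one non-negative weight per frame (assumed to summarize
                   confidence, pose, and scale in a single number)
    """
    totals = {}
    for scores, weight in zip(frame_scores, frame_weights):
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + weight * score
    total_weight = sum(frame_weights)
    fused = {label: t / total_weight for label, t in totals.items()}
    best = max(fused, key=fused.get)
    return best, fused


# Hypothetical scores: one sharp frontal frame and two blurred frames.
scores = [
    {"alice": 0.8, "bob": 0.2},
    {"alice": 0.3, "bob": 0.7},
    {"alice": 0.3, "bob": 0.7},
]
weighted_best, weighted = fuse_frame_scores(scores, [1.0, 0.2, 0.2])
unweighted_best, _ = fuse_frame_scores(scores, [1.0, 1.0, 1.0])
```

With these made-up numbers the weighted decision and the unweighted decision disagree, which shows how strongly the result depends on the per-frame weights; the fused decision can be no better than the single-frame scores it combines.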