The present application relates to video analysis systems.
In recent years, intelligent video analysis systems, e.g. automatic gender and age estimation, face verification and recognition, and wide-area surveillance, have flourished, nourished by steady advances in computer vision and machine learning technologies both in theory and in practice. In general, certain statistical models are learned offline from a huge amount of training data in the development stage. However, when deployed to real-world scenarios, these systems are often confronted by the model mismatch issue, that is, performance degradation originating from the fact that training data can hardly cover the large variations found in reality, such as different illumination conditions, image quality, and noise. It is extremely hard, if not impossible, to collect a sufficient amount of training data, because the relevant factors are unpredictable across scenarios. Thus, it is desirable to allow the statistical models in visual recognition systems to adapt to their specific deployment scenes by incremental learning, so as to enhance the systems' generalization capability.
To address this model mismatch issue, various strategies have been developed that work with training data block 10 to develop a model that is applied to testing data block 20. The most straightforward and ideal way is to obtain the ground truth labels of the testing data block 20 in the deployment scene and utilize them to perform supervised incremental learning, as shown in FIG. 1(a). Nevertheless, manual labels are costly and sometimes impractical to obtain. Alternatively, these systems can trust the predictions made by the model and simply employ them in incremental learning in a self-training manner, as illustrated in FIG. 1(b). However, such positive feedback is very risky in practice, because incorrect predictions are fed back into the model as if they were correct labels. Another alternative is to explore the structure and distances of unlabeled data using semi-supervised learning approaches, as in FIG. 1(c); however, whether the heuristic distance metric can capture the correct underlying structure of the unlabeled data is in question.
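By way of illustration only, the self-training strategy of FIG. 1(b) may be sketched as follows. The nearest-centroid classifier, the function names, and the toy two-dimensional data are assumptions made for this sketch and are not taken from the present application; the sketch merely shows how a model's own predictions can be fed back as labels during incremental learning.

```python
# Illustrative self-training sketch (hypothetical; not the application's method).
# A nearest-centroid classifier is updated with its own predictions on
# unlabeled deployment data, so any wrong prediction is reinforced.

def fit_centroids(samples, labels):
    """Return {label: (centroid, count)} learned from labeled training data."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: ([s / counts[y] for s in sums[y]], counts[y]) for y in sums}

def predict(model, x):
    """Return the label whose centroid is nearest to x (squared Euclidean)."""
    def dist(label):
        centroid, _ = model[label]
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=dist)

def self_train(model, unlabeled):
    """Incrementally update each centroid, using predictions as labels."""
    for x in unlabeled:
        y = predict(model, x)          # predicted label, trusted as if true
        centroid, n = model[y]
        updated = [(c * n + v) / (n + 1) for c, v in zip(centroid, x)]
        model[y] = (updated, n + 1)    # running-mean update of the centroid
    return model

# Toy stand-ins for training data block 10 and testing data block 20.
train_x = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [5.0, 4.0]]
train_y = ["a", "a", "b", "b"]
deploy_x = [[1.5, 1.0], [3.5, 3.0]]

model = fit_centroids(train_x, train_y)
model = self_train(model, deploy_x)
```

In this sketch no check is made on whether a prediction is correct before it is used to update the model, which is the source of the risk noted above.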
Inferring biological traits like gender and age from images can greatly help applications such as face verification and recognition, video surveillance, digital signage, and retail customer analysis. Both gender and age estimation from facial images have attracted considerable research interest for decades. Yet they remain challenging problems, especially age estimation, since facial aging patterns are highly variable and influenced by many factors such as gender, race, and lifestyle, not to mention the subtleties introduced into images by lighting, shading, and view angles. Thus, sophisticated representations and a huge amount of training data have been required to tackle these problems in real-world applications.