Recognition of objects within videos plays an important role for many video-related purposes, such as indexing and retrieval of videos based on identified objects, security and surveillance, and other similar functions. As used herein, the term “object” shall refer to a definable image within a video, such as a face, automobile, article of clothing, or virtually any other type of object. For example, FIG. 1 illustrates a sample frame of a video scene. Exemplary objects that are capable of being recognized within the illustrated video include characters' faces, a plant in a vase, a shoe, and an automobile tire, each of which is shown within a dashed box to indicate its detection and recognition within the frame. As will be understood, however, virtually any image may be detected and recognized within a given video.
Many object recognition systems, and particularly facial recognition systems, are known in the art, such as those described in R. Gross et al., Face Recognition Across Pose and Illumination, Handbook of Face Recognition, Springer-Verlag (2004), and W. Zhao et al., Face Recognition: A Literature Survey, ACM Computing Surveys (2003), and in other similar texts. A typical face recognition system includes three general stages: face data collection, facial modeling, and facial identification using the learned/generated models. Traditional photo-based face recognition technologies, such as those described in M. Turk and A. Pentland, Face Recognition Using Eigenfaces, IEEE Conference on Computer Vision and Pattern Recognition, pp. 586-91 (1991), utilize a single image or a set of images or photos to generate a model or models. These systems function properly only when the underlying photos, which are used for analysis and generation of facial models, are taken in controlled environments, such as with uniform or fixed lighting conditions. Further, the faces in the photos generally must be in frontal poses and show little or no expression. Because these traditional systems are constrained in their ability to adapt to variations in photos, and because they only provide fixed-face models, their applications, especially for videos (as opposed to still images), are highly limited.
Recently, in order to overcome the limitations of traditional photo-based technologies, some video-based facial recognition systems have emerged, such as those described in M. Kim et al., Face Tracking and Recognition with Visual Constraints in Real-World Videos, IEEE Conference on Computer Vision and Pattern Recognition (2008), and Krueger and Zhou, Exemplar-Based Face Recognition from Video, European Conference on Computer Vision, pp. 732-46 (2002), and in other similar texts. These proposed systems attempt to overcome the recognition and modeling problems posed by images with variations in lighting, background, and character pose, as well as continuous camera motion or character movement within a video scene. These systems generally function in one of two ways: either they treat each frame within a video as an independent image (essentially just a variation of a traditional photo-based system) and generate a plurality of facial models corresponding to each image, or they look at all images in the sequence as a whole and weight each image in the sequence equally to generate a combination model of all equally-weighted images.
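By way of illustration only, the two conventional approaches described above may be sketched as follows. This is a minimal sketch and not an implementation of any cited system: the function names, the use of short numeric feature vectors to stand in for per-frame face data, the assumption of one model per frame in the first approach, and the sample values are all hypothetical.

```python
from typing import List, Sequence


def per_frame_models(frame_features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Approach 1 (assumed form): treat every frame as an independent image,
    yielding one model per frame."""
    return [list(features) for features in frame_features]


def equal_weight_model(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Approach 2 (assumed form): combine all frames into a single model,
    giving every frame the same weight (a plain average)."""
    count = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[i] for f in frame_features) / count for i in range(dim)]


# Hypothetical feature vectors for four frames showing the same face.
frames = [
    [1.0, 0.0, 2.0],
    [1.0, 0.0, 2.0],
    [1.0, 0.0, 2.0],
    [5.0, 4.0, 6.0],  # hypothetical poor-quality frame (e.g., occluded or poorly lit)
]

models = per_frame_models(frames)      # four separate models, one per frame
combined = equal_weight_model(frames)  # [2.0, 1.0, 3.0]
```

In this sketch the single poor-quality frame becomes a model of its own under the first approach, and shifts the combined model away from the three consistent frames under the second, because neither approach takes image quality into account.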
Both types of video-based recognition systems, however, are cumbersome and inefficient, and they produce facial models that are often inaccurate. Particularly, because all images in a video are analyzed, the resulting model or models are necessarily generated using some images that are partially occluded, have low resolution, include non-frontal poses, are poorly lit, or suffer from a host of other issues, resulting in poor-quality models. Accordingly, recognition systems that incorporate models generated by conventional video-based systems often produce low recognition rates and overall poor results.
The ability to effectively and efficiently index, store, and retrieve videos, or portions of videos, based on objects in those videos is important for a variety of fields. For example, production companies or advertisement agencies often rely on old or previously-created movies, television shows, and other video clips for inclusion in new advertisements, promotions, trailers, and the like. Additionally, with the continuing advances of technology, online video viewing is becoming increasingly popular, and thus the capability to locate, retrieve, and present videos or clips based on user-entered search criteria is becoming progressively more vital. Further, security systems can benefit from accurate and consistent identification of perpetrators or victims within surveillance videos. However, conventional object and facial recognition systems are neither flexible nor accurate enough for these and other commercial applications.
For these and many other reasons, there is a long-felt but unresolved need for a system or method that is able to generate effective object models for object recognition based on video data, and track temporal coherence of videos in order to dynamically update and optimize the generated models.