1. Field of the Invention
This invention relates to face recognition. More specifically, it relates to identifying a video face track using a large dictionary of still face images of a few hundred people, while rejecting unknown individuals.
2. Brief Description of the Prior Art
Video face identification has recently risen to the forefront of face recognition research. Although still-image face recognition research has been ongoing for approximately three decades, applying still-image face recognition to video-based imagery is a complex process with a number of challenges arising from a person's motion and from unconstrained variations in pose, occlusion, and illumination. On the other hand, some aspects of video imagery create opportunities for more efficient face recognition. For example, video-based imagery provides numerous samples of a person from differing viewpoints, which could be harnessed to provide a strong prediction of the person's identity. Moreover, throughout a long video, such as a movie or a television show episode, the relationships among face tracks can be harnessed using strong affinity metrics.
In the last few years, there has been increased interest in face recognition in sitcoms. These methods have focused on using additional context such as script text, audio, and clothing; however, the employed face identification methods have not been very accurate.
Existing video face recognition methods tend to perform classification on a frame-by-frame basis and later combine those predictions using an appropriate metric. Applying the recently popular Sparse Representation Based Classification's l1-minimization in this fashion is computationally expensive.
Most video-based face recognition methods, if they retain any temporal information, only consider the relationship between frames, thus ignoring any temporal or visual affinity between individual face tracks. In any given sitcom or movie scene, many face tracks are generated for the actors present, sometimes as a result of poor tracking, shot changes, or pose variations. For this reason, face predictions may be noisy, meaning that a face track may be classified correctly at one point, and then a later track of the same person may be identified incorrectly.
Current video face recognition techniques fall into one of four categories: key-frame based, temporal model based, image-set matching based, and context based.
Key-Frame Based Methods
Key-frame based methods generally perform a prediction on the identity of each key-frame in a face track, followed by probabilistic fusion or majority voting to select the best match. Due to the large variations in the data, key-frame selection is claimed to be crucial in this paradigm. One version of this method, disclosed in Zhao et al., Large scale learning and recognition of faces in web videos, FG (2008), involves using a database of still images collected from the Internet. A model over this dictionary is established by learning key faces via clustering. These cluster centers are compared to test frames using a nearest-neighbor search, followed by majority or probabilistic voting to make a final prediction.
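The nearest-neighbor search followed by majority voting described above can be illustrated with a minimal sketch. It is not the implementation of Zhao et al.; the function names, the two-dimensional toy features, and the labels are hypothetical, and the key faces are assumed to have been obtained already (for example, as cluster centers).

```python
from collections import Counter

import numpy as np


def nearest_key_face(frame, key_faces, key_labels):
    """Return the label of the key face closest (Euclidean) to one frame."""
    dists = np.linalg.norm(key_faces - frame, axis=1)
    return key_labels[int(np.argmin(dists))]


def classify_track_by_vote(track, key_faces, key_labels):
    """Label every frame by nearest neighbor, then take the majority label."""
    votes = [nearest_key_face(f, key_faces, key_labels) for f in track]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical toy data: one key face per identity, three frames in the track.
key_faces = np.array([[0.0, 0.0], [10.0, 10.0]])
key_labels = ["alice", "bob"]
track = np.array([[0.5, -0.2], [1.0, 0.3], [9.0, 9.5]])

# Two frames fall near "alice" and one near "bob", so the vote picks "alice".
predicted = classify_track_by_vote(track, key_faces, key_labels)
```

A probabilistic fusion would replace the hard vote with a combination of per-frame scores; the hard vote is shown here only because it is the simplest variant.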
The dictionary-based methods of Chen et al., Dictionary-based face recognition from video, ECCV pp. 766-779 (2012), focus on dictionary learning done on a per face track basis. Finally, Bäuml et al., Semisupervised Learning with Constraints for Person Identification in Multimedia Data, CVPR, pp. 3602-3609 (2013) discloses a method that does not use key-frames, but similarly performs probabilistic voting over all frames in a track using a classifier trained via multinomial logistic regression (MLR).
Temporal Based Methods
Temporal model based methods learn the temporal dynamics of the face throughout a video. Several methods employ Hidden Markov Models (HMM) for this purpose. A version of this method, disclosed in Hadid et al., From still image to video-based face recognition: an experimental analysis, FG (2004), employs a still-image training library and imposes motion information upon it to train an HMM. Zhou et al., Probabilistic recognition of human faces from video, CVIU (2003) discloses a probabilistic generalization of a still-image library to accomplish video-to-video matching. Generally, training these models is prohibitively expensive, especially when the dataset size is large.
Image-Set Matching Based Methods
Image-set matching based methods allow the modeling of a face track as an image-set. Many methods, such as the ones disclosed in Yamaguchi et al., Face recognition using temporal image sequence, FG (1998) and Lee et al., Online learning of probabilistic appearance manifolds for video-based recognition and tracking, CVPR, pp. 852-859 (2005), compute a mutual subspace distance, in which each face track is modeled by its own subspace and a distance is computed between the subspaces of each pair of face tracks. These methods are effective with clean data, but they are sensitive to the variations inherent in video face tracks. Some experts attempt to address this issue by learning a subspace for each pose within a face track. Other methods, such as the one disclosed in Cinbis et al., Unsupervised metric learning for face identification in TV video, ICCV (2011), take a more statistical approach using Logistic Discriminant-based Metric Learning (LDML) to learn a relationship between images in face tracks, where the inter-class distances are maximized. LDML is very computationally expensive and focuses more on learning relationships within the data, without directly relating the test track to the training data.
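As an illustration of the mutual subspace idea, the following sketch models each face track by a low-dimensional subspace and compares two tracks through the smallest principal angle between their subspaces. It is a simplified reading of the general technique, not the procedure of any cited reference; the function names, the subspace dimension `k`, and the three-dimensional toy features are assumptions.

```python
import numpy as np


def subspace_basis(track, k):
    """Orthonormal basis (d x k) spanning the top-k directions of a track.

    `track` has one frame per row, so its transpose has one frame per column.
    `k` must not exceed min(number of frames, feature dimension).
    """
    u, _, _ = np.linalg.svd(track.T, full_matrices=False)
    return u[:, :k]


def mutual_subspace_similarity(track_a, track_b, k=2):
    """Cosine of the smallest principal angle between two track subspaces."""
    ua = subspace_basis(track_a, k)
    ub = subspace_basis(track_b, k)
    # Singular values of Ua^T Ub are the cosines of the principal angles.
    cosines = np.linalg.svd(ua.T @ ub, compute_uv=False)
    return float(np.clip(cosines[0], 0.0, 1.0))


# Hypothetical toy tracks: a and b both span the x-y plane, c spans the z axis.
track_a = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
track_b = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
track_c = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 2.0]])

same_plane = mutual_subspace_similarity(track_a, track_b, k=2)
orthogonal = mutual_subspace_similarity(track_a, track_c, k=1)
```

A similarity near 1 indicates nearly coincident subspaces and a value near 0 indicates nearly orthogonal ones. Because the basis is fitted to whatever frames the track contains, pose or illumination changes within a track alter the subspace, which is consistent with the sensitivity noted above.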
Context Based Methods
Context based methods have been popular due to their applicability to movies and sitcoms. These methods generally focus on simple face recognition techniques supplemented by context. Several variations of this method, such as the ones disclosed in Bojanowski et al., Finding actors and actions in movies, ICCV (2013), Everingham et al., Taking the bite out of automated naming of characters in TV video, CVIU (2009), and Tapaswi et al., “Knock! Knock! Who is it?” Probabilistic Person Identification in TV-Series, CVPR (2012), perform person identification using all available information, e.g. clothing appearance and audio, to identify the cast rather than the facial information alone. A small user-selected sample of characters in the given movie may be used to compute a pixel-wise Euclidean distance to handle occlusion. Other embodiments of this method, such as the one disclosed in Arandjelovic et al., Automatic Cast Listing in Feature-Length Films with Anisotropic Manifold Space, CVPR (2006), use a manifold for known characters, which successfully clusters input frames.
Still-Image Methods
The still-image face recognition literature is vast, and one popular approach, called Sparse Representation based Classification (SRC), is disclosed in J. Wright, et al., Robust face recognition via sparse representation, TPAMI (2009). SRC is based on the principle that a given test face can be represented by a linear combination of images from a large dictionary of faces. The key concept is enforcing sparsity on the representation, since a test face can be reconstructed best from a small subset of the large dictionary, i.e. training faces of the same class. A straightforward adaptation of this method would be to perform estimation on each frame and fuse the results probabilistically, similarly to key-frame based methods. However, l1-minimization is known to be computationally expensive, and therefore, what is needed is a constrained optimization with the knowledge that the images within a face track are of the same person. Imposing this fact reduces the problem to computing a single l1-minimization over the average face track.
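The reduction to a single l1-minimization over the average face track can be sketched as follows. This is a minimal illustration under stated assumptions and not the claimed method: the l1 problem is solved in its unconstrained (Lagrangian) form with a plain iterative soft-thresholding (ISTA) loop chosen for brevity, and the function names, the weight `lam`, the iteration count, and the toy dictionary are all hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Element-wise shrinkage toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ista_l1(D, y, lam=0.01, n_iter=200):
    """Approximately solve min_x 0.5*||y - D x||^2 + lam*||x||_1 with ISTA."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x


def classify_mean_track(track, D, labels, lam=0.01):
    """One l1 solve on the mean frame, then pick the class with least residual."""
    y = track.mean(axis=0)
    x = ista_l1(D, y, lam)
    labels = np.asarray(labels)
    residuals = {}
    for c in np.unique(labels):
        mask = labels == c
        # Reconstruct y using only the coefficients belonging to class c.
        residuals[str(c)] = float(np.linalg.norm(y - D[:, mask] @ x[mask]))
    return min(residuals, key=residuals.get), residuals


# Hypothetical toy dictionary: columns are training faces, two per identity.
D = np.eye(4)
labels = ["alice", "alice", "bob", "bob"]
track = np.array([[0.8, 0.6, 0.02, 0.0], [0.8, 0.6, 0.0, 0.02]])

predicted, residuals = classify_mean_track(track, D, labels)
```

With a track of T frames, a frame-by-frame adaptation would run the l1 solver T times, whereas this sketch runs it once on the mean frame. The class-wise residual rule follows the general SRC principle described above.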
Graph-Based Methods
Several graph-based methods employ Markov models in an active-learning paradigm in which a few samples are selected to be labeled by the user, then used to label the rest of the data. The version of this method disclosed in Gallagher et al., Using a Markov Network to Recognize People in Consumer Images, ICIP (2007) involves the step of creating a Markov network where similarity edges are formed between faces in different photos and dissimilarity edges between the others, with an edge weight defined by appearance. This graph is then used in Loopy Belief Propagation to label all unlabeled test samples.
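A small sketch may help illustrate how labels spread over such a graph with Loopy Belief Propagation. It is a generic sum-product example and not the model of Gallagher et al.: the signed edge weight (positive for similarity edges, negative for dissimilarity edges), the function name, and the three-face toy graph are assumptions made for illustration.

```python
import numpy as np


def loopy_bp(unary, edges, n_iter=20):
    """Sum-product belief propagation on a pairwise graph.

    unary: (n_faces, n_labels) array of per-face label scores.
    edges: list of (i, j, w); w > 0 favors equal labels, w < 0 unequal labels.
    """
    n, k = unary.shape
    psi, neighbors = {}, {i: [] for i in range(n)}
    for i, j, w in edges:
        # Pairwise potential: exp(w) on the diagonal, 1 elsewhere (symmetric).
        p = np.ones((k, k)) + (np.exp(w) - 1.0) * np.eye(k)
        psi[(i, j)] = p
        psi[(j, i)] = p
        neighbors[i].append(j)
        neighbors[j].append(i)
    msgs = {key: np.full(k, 1.0 / k) for key in psi}
    for _ in range(n_iter):
        new_msgs = {}
        for (i, j) in msgs:
            incoming = unary[i].astype(float)
            for nb in neighbors[i]:
                if nb != j:
                    incoming = incoming * msgs[(nb, i)]
            m = psi[(i, j)].T @ incoming  # marginalize over the label of i
            new_msgs[(i, j)] = m / m.sum()
        msgs = new_msgs
    beliefs = unary.astype(float)
    for i in range(n):
        for nb in neighbors[i]:
            beliefs[i] = beliefs[i] * msgs[(nb, i)]
    return beliefs / beliefs.sum(axis=1, keepdims=True)


# Hypothetical toy graph: face 0 is confidently label 0, faces 1 and 2 unknown.
unary = np.array([[0.9, 0.1], [0.5, 0.5], [0.5, 0.5]])
edges = [(0, 1, 2.0),   # similarity edge between faces 0 and 1
         (1, 2, -2.0)]  # dissimilarity edge between faces 1 and 2
beliefs = loopy_bp(unary, edges)
predicted_labels = beliefs.argmax(axis=1).tolist()
```

In this toy graph the confident face pulls its similar neighbor toward the same label and pushes the dissimilar face toward the other label. On graphs with cycles the same updates are only approximate.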
Another reference, Kapoor et al., Which faces to tag: Adding prior constraints into active learning, ICCV pp. 1058-1065 (2009), combines Gaussian Processes to enforce label smoothness with Markov Random Fields to encode the match and non-match structures, where matches are images of the same individual (faces within a track) and non-matches are faces in the same shot. More recently, Lin et al., Joint people, event, and location recognition in personal photo collections using cross-domain context, ECCV, Springer-Verlag (2010) disclosed creating a probabilistic Markov framework using multiple contexts (faces, events, and location) to improve recognition. One advantage of these methods is that they are iterative and allow feedback from users, and thus can label the unlabeled data with few samples. However, the efficacy of graph-based methods diminishes when a large number of face tracks is involved due to their inability to smooth the initial predictions across all tracks in one optimization.
Accordingly, what is needed is a new, more efficient video face recognition system. However, in view of the art considered as a whole at the time the present invention was made, it was not obvious to those of ordinary skill in the field of this invention how the shortcomings of the prior art could be overcome.
All referenced publications are incorporated herein by reference in their entirety. Furthermore, where a definition or use of a term in a reference, which is incorporated by reference herein, is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein applies and the definition of that term in the reference does not apply.
While certain aspects of conventional technologies have been discussed to facilitate disclosure of the invention, Applicants in no way disclaim these technical aspects, and it is contemplated that the claimed invention may encompass one or more of the conventional technical aspects discussed herein.
The present invention may address one or more of the problems and deficiencies of the prior art discussed above. However, it is contemplated that the invention may prove useful in addressing other problems and deficiencies in a number of technical areas. Therefore, the claimed invention should not necessarily be construed as limited to addressing any of the particular problems or deficiencies discussed herein.
In this specification, where a document, act or item of knowledge is referred to or discussed, this reference or discussion is not an admission that the document, act or item of knowledge or any combination thereof was at the priority date, publicly available, known to the public, part of common general knowledge, or otherwise constitutes prior art under the applicable statutory provisions; or is known to be relevant to an attempt to solve any problem with which this specification is concerned.