In many computer vision applications, such as camera surveillance and face recognition, it is necessary to determine whether persons or other objects represented in different images are the same. In the art, this task is known as person re-identification when the images selected for comparison are images of full bodies, and as face recognition when the images selected for comparison are images of faces. A person re-identification and/or face recognition system is a computer application capable of identifying or verifying a person from a digital image or a video frame from a video source. One way to do this is to compare selected image features computed from two images of two people's bodies or faces.
The images can be cropped regions in still images or cropped regions in frames in a video that contain all or a part of a body of a person. In surveillance and other applications in which persons are tracked by video cameras, the problem of determining whether different tracks are of the same person naturally arises. The tracks may be from different points in time, from the same video camera, or from two different video cameras. This problem can be solved by comparing the two cropped image regions and determining whether the regions represent the same person or not.
In recent years, a deep convolutional neural network (CNN) architecture for face recognition has emerged that achieves practical accuracy on various difficult test sets. The architecture takes a cropped face image as input and uses a strong baseline CNN, such as VGG or ResNet, to compute a feature vector, followed by a fully connected layer that outputs a vector of length C, where C is the number of unique identities in the training set. The network is trained to minimize the softmax loss between the output vector and a one-hot encoding of the correct identity for the input face image. In other words, the CNN learns to directly predict the identity of the input face by first computing a distinctive feature vector representing the identity of the face. After training, the final fully connected layer, which gives the probability of each training identity, is discarded because the training identities are not the same as the identities encountered during testing. Instead, the output of the layer before the final fully connected layer is used as an identity-specific feature vector. Feature vectors for two testing face images are L2 normalized and compared using L2 distance (or cosine similarity).
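The two computations described above, the softmax loss used for training and the comparison of L2-normalized feature vectors used for testing, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation of any particular system: the function names softmax_loss, l2_normalize and compare_features are hypothetical, and the feature vectors and logits are assumed to have already been produced by a CNN.

```python
import numpy as np

def softmax_loss(logits, label):
    """Softmax (cross-entropy) loss of a length-C output vector against
    a one-hot encoding of the correct identity index `label`."""
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()                          # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())  # log of the softmax probabilities
    return float(-log_probs[label])

def l2_normalize(v, eps=1e-12):
    """Scale a feature vector to unit L2 norm (eps guards against division by zero)."""
    v = np.asarray(v, dtype=np.float64)
    return v / max(np.linalg.norm(v), eps)

def compare_features(f1, f2):
    """Return (L2 distance, cosine similarity) of two L2-normalized feature vectors."""
    a, b = l2_normalize(f1), l2_normalize(f2)
    l2_distance = float(np.linalg.norm(a - b))
    cosine_similarity = float(np.dot(a, b))
    return l2_distance, cosine_similarity
```

For unit-length vectors the two measures carry the same information, since the squared L2 distance equals 2 minus twice the cosine similarity. A decision threshold on either value would be chosen on validation data; no threshold is specified here.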
Despite the good results achieved with this basic architecture, there is a fundamental mismatch between how the network is trained and how it is used during testing: the network is trained to classify a fixed set of training identities, but during testing its feature vectors are compared by distance. Several methods address this mismatch by using different loss functions for training. For example, one alternative loss function is the triplet loss described by F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815-823, 2015. The triplet loss takes as an input example an “anchor” face together with a positive example image (of the same identity as the anchor) and a negative example image (of a different identity), and attempts to minimize the distance between the anchor and positive feature vectors minus the distance between the anchor and negative feature vectors. One difficulty with this loss is that the number of triplets of face images available for training becomes very large, so some kind of hard-negative mining is needed.
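The triplet loss can be illustrated with a short NumPy sketch. The use of squared L2 distances, the hinge at zero and the margin value of 0.2 follow a common formulation of this loss and are assumptions here, since the text above does not specify them. The helper hardest_negative_index shows one simple form of hard-negative mining (picking the negative closest to the anchor) and is not necessarily the strategy used in the cited paper. Both function names are hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Anchor-positive distance minus anchor-negative distance, plus a margin,
    clipped at zero. Squared L2 distance and the margin are assumptions."""
    a, p, n = (np.asarray(x, dtype=np.float64) for x in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2)   # squared distance to the same identity
    d_neg = np.sum((a - n) ** 2)   # squared distance to a different identity
    return float(max(d_pos - d_neg + margin, 0.0))

def hardest_negative_index(anchor, negatives):
    """Index of the negative feature vector closest to the anchor,
    i.e. the one that currently violates the desired ordering the most."""
    a = np.asarray(anchor, dtype=np.float64)
    negs = np.asarray(negatives, dtype=np.float64)
    return int(np.argmin(np.sum((negs - a) ** 2, axis=1)))
```

A triplet whose negative is already much farther from the anchor than the positive yields zero loss and contributes nothing to training, which is why selecting informative triplets matters when the number of possible triplets is very large.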
Another loss function, known as the contrastive loss, has a similar effect to the triplet loss but uses a slightly different formulation. Yet another loss function, known as the center loss, attempts to minimize the distance between a face's feature vector and the mean feature vector for the class (the set of face images for a particular person). Using center loss plus softmax loss tends to yield clusters of feature vectors for each person that are compact and separable from other identities.
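As a rough illustration of these two losses, the NumPy sketch below uses commonly seen formulations, which are assumptions rather than definitions taken from the text above: the contrastive loss is written over a pair of feature vectors with a margin for pairs of different identities, and the center loss is half the mean squared distance of each feature vector to the center of its class. The function names, the margin value and the centers array (assumed to hold one mean feature vector per class, maintained elsewhere during training) are hypothetical.

```python
import numpy as np

def center_loss(features, labels, centers):
    """Half the mean squared distance between each feature vector and the
    center (mean feature vector) of its class. `centers[k]` is the assumed
    center for class k."""
    features = np.asarray(features, dtype=np.float64)
    centers = np.asarray(centers, dtype=np.float64)
    diffs = features - centers[np.asarray(labels)]   # per-sample offset from its class center
    return float(0.5 * np.mean(np.sum(diffs ** 2, axis=1)))

def contrastive_loss(f1, f2, same_identity, margin=1.0):
    """Pairwise loss: pull same-identity pairs together, push different-identity
    pairs apart until they are at least `margin` away (assumed formulation)."""
    d = np.linalg.norm(np.asarray(f1, dtype=np.float64) - np.asarray(f2, dtype=np.float64))
    if same_identity:
        return float(d ** 2)
    return float(max(margin - d, 0.0) ** 2)
```

When the center loss is combined with the softmax loss, the two terms would typically be summed with a weighting factor; that factor is a training hyperparameter and is not specified here.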
Three other related loss functions, A-softmax (for angular softmax), large-margin softmax, and L2-constrained softmax, modify the standard softmax loss function in a way that encourages feature vectors of a particular identity to cluster near each other. Each of these loss functions has its own advantages and disadvantages.