Visual recognition is a challenging but important problem in video surveillance systems, due to large appearance variations caused by lighting conditions, view angles, body poses, and mutual occlusions. Visual targets captured by surveillance cameras are usually small, so that many visual details, such as facial components, are indistinguishable, and different targets look very similar in appearance. This makes it difficult to match a reference image to a target among the many candidates in the gallery set using existing feature representations.
Various image identification systems have been used to address this problem. Typically, an identification system takes one or more reference images and compares them to a target image, generating a similarity score that helps to classify the target as the same person or a different person. This generally requires both a procedure for extracting features from an image and a procedure for defining a similarity metric. Features are ideally invariant under varying lighting conditions and camera positions, and can include color histogram data, Haar features, Gabor features, or the like. Similarity scoring can be based on saliency weighted distance, local Fisher discriminant analysis, Mahalanobis metric learning, locally adaptive decision functions, or the like.
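The two-part structure described above (feature extraction followed by similarity scoring) can be illustrated with a minimal sketch. It assumes color histogram features and a Mahalanobis-style distance; the function names, the bin count, the matrix M, and the threshold are hypothetical choices made only for illustration, and in practice M would be obtained by metric learning rather than supplied by hand.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Concatenate per-channel histograms of an H x W x 3 uint8 image (L1-normalized)."""
    channels = [
        np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
        for c in range(image.shape[-1])
    ]
    hist = np.concatenate(channels).astype(np.float64)
    return hist / hist.sum()

def mahalanobis_distance(x, y, M):
    """Return sqrt((x - y)^T M (x - y)) for a positive semi-definite matrix M."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

def same_identity(reference, target, M, threshold):
    """Hypothetical decision rule: a small distance is treated as the same person."""
    dist = mahalanobis_distance(color_histogram(reference), color_histogram(target), M)
    return dist < threshold
```

With M set to the identity matrix, the distance reduces to the ordinary Euclidean distance between the two histograms; a learned M reweights and correlates histogram bins.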
More recently, deep learning methods have been used. Both the feature extraction and the similarity scoring can use a deep neural network. For example, a first convolutional neural network (CNN) can be used to extract features from images, and a second CNN can be used to compare the features using a similarity metric. The first CNN can use patch matching for feature extraction, while the second CNN can use similarity metrics including cosine similarity and binomial deviance, Euclidean distance and triplet loss, or logistic loss, which directly forms a binary classification problem of whether the input image pair belongs to the same identity.
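Several of the similarity measures and losses named above can be written compactly for a pair (or triplet) of feature vectors. The following is a sketch under stated assumptions: the function names are hypothetical, the triplet loss uses unsquared Euclidean distances with an assumed margin of 1.0 (some formulations use squared distances), and binomial deviance is omitted.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two feature vectors (eps avoids division by zero)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Penalize triplets where the positive is not closer to the anchor than
    the negative by at least `margin`."""
    gap = euclidean_distance(anchor, positive) - euclidean_distance(anchor, negative)
    return max(0.0, gap + margin)

def logistic_loss(score, label):
    """Binary cross-entropy on a raw score; label is 1 (same identity) or 0 (different)."""
    z = score if label == 1 else -score
    # log(1 + exp(-z)), computed in a numerically stable way
    return float(np.logaddexp(0.0, -z))
```

In a trained system the inputs would be CNN feature vectors and the score would be a network output; here they are plain arrays and numbers so that each formula can be checked in isolation.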
One such system uses cross-input neighborhood differences and patch summary features that evaluate image pair similarity at an early CNN stage to make use of spatial correspondence in feature maps. This system, described by Ahmed, E., Jones, M., and Marks, T. K. in a paper titled “An improved deep learning architecture for person re-identification,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, IEEE (2015), uses two layers of convolution and max pooling to learn a set of features for comparing the two input images. A layer that computes cross-input neighborhood difference features then compares features from one input image with the features computed at neighboring locations of the other image. A subsequent layer distills these local differences into smaller patch summary features. Next, another convolutional layer with max pooling is used, followed by two fully connected layers with softmax output.
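The cross-input neighborhood difference and patch summary steps can be sketched as follows. This is an illustrative reading of the description above, not the cited authors' implementation: the 5 x 5 neighborhood, the zero padding at the borders, the function names, and the weight tensor are all assumptions, and the weights would be learned in a real network.

```python
import numpy as np

def cross_input_neighborhood_differences(f, g, k=5):
    """f, g: feature maps of shape (C, H, W) from the two input images.

    Returns an array of shape (C, H, W, k, k) in which entry [c, y, x] is
    f[c, y, x] minus the k x k neighborhood of g[c] centered at (y, x).
    """
    assert f.shape == g.shape and k % 2 == 1
    r = k // 2
    g_pad = np.pad(g, ((0, 0), (r, r), (r, r)))  # assumed zero padding at borders
    # windows[c, y, x] is the k x k block of g_pad starting at (y, x), which is
    # the neighborhood of the unpadded g centered at (y, x).
    windows = np.lib.stride_tricks.sliding_window_view(g_pad, (k, k), axis=(1, 2))
    return f[:, :, :, None, None] - windows

def patch_summary(diffs, weights):
    """Collapse each k x k difference block to one value per output channel.

    diffs: (C, H, W, k, k); weights: (O, C, k, k), hypothetical learned filters.
    Returns an array of shape (O, H, W).
    """
    return np.einsum("ocij,cyxij->oyx", weights, diffs)
```

The center of each difference block is simply f minus g at the same location, while the off-center entries compare a feature in one image against spatially shifted features in the other, which is what gives the layer some tolerance to misalignment between the two views.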