In many computer vision applications, such as camera surveillance, it is necessary to determine whether persons, or other objects, represented in different images are the same or not. When the objects are persons, this is known in the art as person re-identification. For person re-identification, the images can be cropped regions of still images, or cropped regions of frames in a video, that contain all or a part of a body of a person. In surveillance and other applications in which persons are tracked in videos, the problem of determining whether different tracks are of the same person naturally arises. This problem can be solved by comparing the cropped image regions from one of the tracks to those from a different track and determining whether the regions represent the same person or not. The images or tracks may be from the same camera at different points in time, or from different cameras at either the same point in time or different points in time.
Typically, methods for person re-identification include two components: a method for extracting features from images, and a metric for comparing the features extracted from different images. The focus in person re-identification research has been on improving the features or improving the comparison metric or both. The basic idea behind improving the features is to determine features that are at least partially invariant to changes in lighting, pose, and viewpoint. Typical features used in past methods include variations on color histograms, local binary patterns, Gabor features, salient color names, and local image patches.
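The two components described above can be illustrated with a minimal sketch that pairs a simple feature extractor (a joint color histogram) with a simple comparison metric (histogram intersection). The function names, the bin count, and the choice of histogram intersection are illustrative assumptions and are not taken from any particular prior method.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized joint RGB histogram of (r, g, b) tuples with values in 0..255.

    bins_per_channel is an illustrative default, not a value from the text.
    """
    hist = [0.0] * (bins_per_channel ** 3)
    count = 0
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..bins_per_channel-1.
        r_bin = r * bins_per_channel // 256
        g_bin = g * bins_per_channel // 256
        b_bin = b * bins_per_channel // 256
        hist[(r_bin * bins_per_channel + g_bin) * bins_per_channel + b_bin] += 1.0
        count += 1
    if count == 0:
        return hist
    return [value / count for value in hist]


def histogram_intersection(hist_a, hist_b):
    """Similarity in [0, 1]; larger values mean more similar histograms."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))
```

A plain color histogram of this kind is sensitive to lighting changes, which is one reason the research described above seeks features that are at least partially invariant to lighting, pose, and viewpoint.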
To improve the comparison metric, metric learning approaches determine a mapping from an original feature space into a new space in which feature vectors extracted from two different images of the same person are “closer” (more similar) than feature vectors extracted from two images that are of two different people. Metric learning approaches that have been applied to re-identification include Mahalanobis metric learning, locally adaptive decision functions, saliency-weighted distances, local Fisher discriminant analysis, marginal Fisher analysis, and attribute-consistent matching.
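As an illustration of the mapping idea, a Mahalanobis-style distance can be sketched as a Euclidean distance measured after a linear projection of the feature difference. In the sketch below, the matrix `projection` is a hypothetical stand-in for a learned mapping; in practice it would be estimated from labeled training pairs rather than chosen by hand.

```python
import math


def mahalanobis_distance(x, y, projection):
    """Distance between feature vectors x and y after a linear mapping.

    projection is a list of rows (a matrix L). The result equals
    sqrt((x - y)^T M (x - y)) with M = L^T L. The matrix here is a
    hypothetical placeholder for a learned mapping.
    """
    diff = [a - b for a, b in zip(x, y)]
    projected = [sum(w * d for w, d in zip(row, diff)) for row in projection]
    return math.sqrt(sum(p * p for p in projected))
```

With an identity matrix the function reduces to the ordinary Euclidean distance; a learned matrix instead stretches the directions that separate different people and shrinks the directions that vary between images of the same person.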
Some methods use a deep learning approach for person re-identification. One such deep learning approach uses a “Siamese” convolutional neural network (CNN) for metric learning. Siamese CNNs learn a non-linear similarity metric by repeatedly presenting pairs of images from a training set, along with a training label for each pair indicating whether the two images in the pair are images of the same person or of two different persons.
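The labeled pairs used in such training can be sketched as follows. The helper name `make_training_pairs`, the exhaustive pairing of all samples, and the use of 1 and 0 as labels are illustrative assumptions; actual training procedures may sample or balance the pairs differently.

```python
from itertools import combinations


def make_training_pairs(samples):
    """Build labeled image pairs from (image_id, person_id) tuples.

    Returns a list of (image_a, image_b, label), where label is 1 when
    both images show the same person and 0 otherwise. Exhaustive pairing
    is an assumption made for illustration only.
    """
    pairs = []
    for (image_a, person_a), (image_b, person_b) in combinations(samples, 2):
        label = 1 if person_a == person_b else 0
        pairs.append((image_a, image_b, label))
    return pairs
```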
In one previous method, the Siamese architecture includes three independent convolutional networks that act on three overlapping parts of the two images. Each part-specific network includes two convolutional layers with max pooling, followed by a fully connected layer. The fully connected layer produces an output vector for each image, and the two output vectors are compared using a cosine function. The cosine outputs for each of the three parts are then combined to obtain a similarity score.
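The comparison stage of this architecture can be sketched as a cosine similarity computed per part and then combined. The description above does not state how the three cosine outputs are combined, so the simple average used below is an assumption, as are the function names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_part_similarity(parts_a, parts_b):
    """Average the per-part cosine scores of two images.

    parts_a and parts_b each hold one output vector per body part.
    Averaging is an assumed combination rule, not one given in the text.
    """
    scores = [cosine_similarity(u, v) for u, v in zip(parts_a, parts_b)]
    return sum(scores) / len(scores)
```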
Another deep learning method uses a differential network. The differential architecture begins with a single convolutional layer with max pooling, followed by a patch-matching layer that multiplies convolutional feature responses from the two inputs at a variety of horizontal offsets. The response to each patch in one image is multiplied separately by the response to every other patch sampled from the same horizontal strip in the other image. This is followed by a max-out grouping layer that outputs the largest patch match response from each pair of patches in the horizontal strip, followed by another convolutional layer with max pooling, followed by a fully connected layer with 500 units and finally a fully connected layer with 2 units representing “same” or “different”. A softmax function is used to convert these final 2 outputs to probabilities.
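Two of the steps in this architecture lend themselves to a small sketch: the multiplication of responses within a horizontal strip, and the final softmax. The sketch below is heavily simplified. It treats each patch response as a single number rather than a feature map, and it reads the max-out grouping as keeping the largest product for each patch of the first image; both are assumptions, and the function names are hypothetical.

```python
import math


def strip_match_responses(strip_a, strip_b):
    """Multiply every response in strip_a by every response in strip_b.

    strip_a and strip_b hold scalar responses taken from the same
    horizontal strip of the two images (a simplification).
    """
    return [[a * b for b in strip_b] for a in strip_a]


def max_out(responses):
    """Keep the largest match response for each patch of the first image."""
    return [max(row) for row in responses]


def softmax(logits):
    """Convert raw outputs to probabilities that sum to 1."""
    largest = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]
```

Applied to the two final outputs, `softmax` yields one probability for “same” and one for “different”, and the larger of the two gives the decision.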