The present embodiments relate to person re-identification in a video system. In particular, a person in one image is identified as also being in another image.
Person re-identification is a challenging problem. For this inter-camera association or multi-camera tracking, people are matched across different, usually non-overlapping, camera fields of view. Matching is complicated by variations in lighting conditions, camera viewpoints, backgrounds, and human poses. In public spaces, face recognition and other fine biometric cues may not be available because of low image resolution and/or distance.
It can be quite challenging even for a human to match two images of the same person from among images of many people. Re-identification approaches may be divided into two categories: a) non-learning based (direct) methods, and b) learning-based methods. The direct methods usually extract a set of hand-crafted descriptive representations and combine their corresponding distance measurements without learning. Learning-based methods, on the other hand, usually extract a set of low-level descriptors, concatenate them into a long feature vector, and obtain discriminability from labeled training samples using machine learning techniques.
Two cues, spatio-temporal information and target appearance, may be fused for re-identification. The spatio-temporal cue may be learned. For the appearance cue, color information and learned brightness transfer functions (BTFs) or color calibration handle the changing lighting conditions between cameras. Distinct people may look similar if they wear clothes of the same color, which in turn increases the difficulty of finding correct associations. Appearance-based re-identification relies on the information provided by the visual appearance of the human body and clothing, under the assumption that the targets of interest do not change their clothes between cameras. However, this is a challenging problem since human appearance usually exhibits large variations across different cameras. These processor-implemented appearance-based models tend to degrade under lighting and pose changes.
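As an illustration of how a brightness transfer function might be learned between two cameras, the following minimal sketch uses cumulative histogram matching, so that a brightness level in the first camera maps to the level with the same cumulative frequency in the second camera. This is one commonly described way to estimate a BTF and is assumed here for illustration only; the function names, the 256-level quantization, and the use of plain lists of integer brightness values are hypothetical choices, not details taken from the approaches discussed above.

```python
def cumulative_histogram(values, levels=256):
    """Return the normalized cumulative histogram of integer brightness values."""
    if not values:
        raise ValueError("at least one brightness value is required")
    hist = [0] * levels
    for v in values:
        hist[v] += 1  # values are assumed to lie in 0..levels-1
    total = len(values)
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running / total)
    return cdf


def brightness_transfer_function(src_values, dst_values, levels=256):
    """Build a lookup table f with f[b] = H_dst^{-1}(H_src(b)).

    src_values and dst_values are brightness samples of corresponding
    observations in the source and destination cameras (an assumption of
    this sketch).
    """
    src_cdf = cumulative_histogram(src_values, levels)
    dst_cdf = cumulative_histogram(dst_values, levels)
    table = []
    j = 0
    for b in range(levels):
        # Advance to the smallest destination level whose cumulative
        # frequency reaches that of source level b. Because src_cdf is
        # non-decreasing, j never needs to move backwards.
        while j < levels - 1 and dst_cdf[j] < src_cdf[b]:
            j += 1
        table.append(j)
    return table


# Hypothetical example: the destination camera sees the same targets
# 20 brightness levels brighter than the source camera.
btf = brightness_transfer_function([50, 100, 150], [70, 120, 170])
```

The resulting table is monotonic by construction. A table estimated from so few samples would be far too coarse in practice; the example only shows the mapping mechanism.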
Many approaches address this problem mainly through two elements: descriptor extraction and similarity/distance measurement. For descriptor extraction, the goal is to find an invariant and distinctive representation to describe a person image. Several descriptors have been used, including color histograms, histograms of oriented gradients (HOG), texture filters, Maximally Stable Color Regions (MSCR), and decomposable triangulated models. For similarity/distance measures, standard distance measurements (e.g., Bhattacharyya distance, correlation coefficient, L1-Norm, or L2-Norm) are used. Among these descriptors and similarity measurements, the color histogram followed by the Bhattacharyya distance is most widely used since color information may be an important cue. However, the performance of the color histogram in any color space is still not satisfactory.
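The widely used baseline mentioned above, a color histogram compared with the Bhattacharyya distance, may be sketched as follows. The sketch assumes a person image is given as a list of (r, g, b) tuples with 8-bit channels and uses a joint RGB histogram with a hypothetical default of 4 bins per channel. It uses the form D = -ln(BC), where BC is the Bhattacharyya coefficient; some libraries use the related form sqrt(1 - BC) instead. The function names are illustrative only.

```python
import math


def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized joint RGB histogram for (r, g, b) pixels in 0..255."""
    width = 256 / bins_per_channel
    top = bins_per_channel - 1
    hist = [0.0] * (bins_per_channel ** 3)
    count = 0
    for r, g, b in pixels:
        # Quantize each channel to a bin index, clamped to the last bin.
        ri = min(int(r / width), top)
        gi = min(int(g / width), top)
        bi = min(int(b / width), top)
        hist[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1.0
        count += 1
    if count == 0:
        raise ValueError("at least one pixel is required")
    return [h / count for h in hist]


def bhattacharyya_distance(p, q):
    """Return -ln(BC) for two normalized histograms of equal length."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    bc = min(bc, 1.0)  # guard against floating-point overshoot
    if bc == 0.0:
        return math.inf  # histograms share no occupied bin
    return -math.log(bc)


# Hypothetical pixel sets: mostly red clothing versus mostly blue clothing.
mostly_red = [(200, 10, 10)] * 3 + [(10, 10, 200)]
mostly_blue = [(200, 10, 10)] + [(10, 10, 200)] * 3
all_blue = [(10, 10, 200)] * 4
all_red = [(200, 10, 10)] * 4

d_same = bhattacharyya_distance(color_histogram(mostly_red), color_histogram(mostly_red))
d_mixed = bhattacharyya_distance(color_histogram(mostly_red), color_histogram(mostly_blue))
d_disjoint = bhattacharyya_distance(color_histogram(all_red), color_histogram(all_blue))
```

A smaller distance indicates more similar color distributions. Because the histogram discards spatial layout and is sensitive to the bin boundaries, a moderate brightness shift between cameras may move pixels into different bins, which is one way the limitation noted above can arise.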