Public venues such as shopping centres, parking lots and train stations are increasingly subject to surveillance using large-scale networks of video cameras. Application domains of large-scale video surveillance include security, safety, traffic management and business analytics. A key task in many such applications is rapid and robust object re-identification, which is the problem of finding a specific object of interest across multiple cameras in the network. In one example application from the security domain, a security officer may want to view any video feed containing a particular suspicious target in order to identify undesirable activities. In another example from the business analytics domain, a shopping centre may wish to track a specific customer across multiple cameras in order to build a profile of shopping habits for that customer. In the following discussion, the term “object re-identification” will be understood to include the terms “object identification” and “object recognition”.
Robust object re-identification is a challenging problem for several reasons. Firstly, the viewpoint (i.e. the relative orientation of the camera with respect to an object in the camera's field of view) and lighting may vary significantly between cameras in the network. For example, a single network may contain both outdoor cameras viewing targets at large distances in bright daylight, and indoor cameras viewing targets at close range under artificial lighting. Furthermore, many targets may have similar appearance and may vary only in minor details. For example, many commuters on public transport wear similar business attire but their appearance varies in regard to details such as neckwear and hair length. Also, public venues are often characterized by crowds of uncooperative targets moving in uncontrolled environments with varying and unpredictable distance, speed and orientation relative to the camera. The term “uncooperative target” refers to a target that is neither consciously nor unconsciously maintaining a particular relationship to a camera. Finally, cameras in the network may have non-overlapping fields of view, so that a given object cannot be continuously tracked from one camera to the next.
Common approaches for object re-identification may be characterized by (i) whether they are appearance-based or attribute-based, and (ii) whether they use static cameras or active cameras. One known method for appearance-based object re-identification using static cameras models the appearance of an object by extracting a vector of low-level features based on colour, texture and shape from an exemplary image of the object. The features are extracted in a region of interest defined by a vertical stripe around the head of the target. Re-identification is based in part on an appearance dissimilarity score, computed as the Bhattacharyya distance between feature vectors extracted from images of a candidate target and the target of interest.
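The Bhattacharyya distance mentioned above can be sketched for the common case where each feature vector is an L1-normalised histogram; the example histograms below are illustrative only:

```python
import numpy as np

def bhattacharyya_distance(p, q, eps=1e-12):
    """Bhattacharyya distance between two histograms.

    Both inputs are re-normalised to sum to 1; eps guards against
    taking the log of zero when the histograms do not overlap.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient in [0, 1]
    return -np.log(max(bc, eps))

# identical histograms give (near) zero distance
h = [0.2, 0.3, 0.5]
print(bhattacharyya_distance(h, h))  # ~ 0.0
```

A small distance indicates similar appearance; in a re-identification setting the candidate with the lowest dissimilarity score to the target of interest would be preferred.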
Another known method for attribute-based re-identification in static cameras uses a bank of support vector machine (SVM) classifiers to determine the presence or absence of 15 binary attributes (such as sunglasses, backpacks and skirts) from an image of a pedestrian. The SVM classifiers are trained on 2784-dimensional low-level colour and texture feature vectors from a training set of pedestrians with known attributes. To overcome the problem that different attributes are detected with varying reliability, an attribute distance metric (Mahalanobis distance) is learned based on a dataset of matching pairs of images of pedestrians. Re-identification is based in part on computing the learned attribute distance metric between the 15 attributes extracted from images of a candidate target and the target of interest.
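The learned attribute distance step can be sketched as a Mahalanobis-style distance between attribute vectors; the 15-dimensional binary vectors and the identity metric matrix below are illustrative stand-ins for the learned metric, not the method's actual parameters:

```python
import numpy as np

def attribute_distance(a, b, M):
    """Distance between two attribute vectors under a learned metric.

    M is assumed to be a positive semi-definite matrix learned from
    matching pairs (here just an illustrative placeholder).
    """
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(d @ M @ d))

# hypothetical example: 15 binary attributes, identity metric,
# two pedestrians differing in a single attribute (e.g. "backpack")
a = np.zeros(15)
b = np.zeros(15)
b[0] = 1.0
print(attribute_distance(a, b, np.eye(15)))  # 1.0
```

With a genuinely learned M, attributes detected more reliably would be weighted more heavily than noisy ones, which is the point of learning the metric rather than using plain Euclidean distance.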
The performance of the above re-identification methods based on static cameras suffers when objects are viewed across a large distance, which is common in large-scale video surveillance systems. Re-identification methods based on pan-tilt-zoom (PTZ) cameras can overcome this limitation by controlling the camera to capture high-resolution imagery of candidate objects at large distances. This approach will be referred to as one form of “active re-identification”. One known method for active re-identification uses face detection to identify objects of interest. A static master camera is used to detect targets and estimate their gaze direction, and an active slave camera is used to obtain high-resolution face imagery of selected candidate targets. Candidate target selection is based on the expected information gain with respect to target identity from observing the target. The “expected information gain”, also known as “mutual information”, is the expected reduction in uncertainty about the identity of the target that results from making the observation. This method tends to select candidates that are both facing the slave camera and have uncertain identity. The drawback of this method is that it relies on a highly discriminative feature (i.e. face) captured in a specific viewpoint (i.e. frontal).
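The expected information gain described above can be sketched for a discrete identity model, as the prior entropy minus the expected posterior entropy; the prior, likelihood table and observation space below are illustrative, not taken from the cited method:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def expected_information_gain(prior, likelihood):
    """I(identity; observation) = H(prior) - E[H(posterior)].

    prior:      P(id), shape (K,)
    likelihood: P(obs | id), shape (K, M) over M possible observations
    """
    prior = np.asarray(prior, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    p_obs = prior @ likelihood                      # P(obs), shape (M,)
    gain = entropy(prior)
    for j in range(likelihood.shape[1]):
        if p_obs[j] > 0:
            posterior = prior * likelihood[:, j] / p_obs[j]
            gain -= p_obs[j] * entropy(posterior)
    return gain

# a perfectly discriminative observation removes all uncertainty
print(expected_information_gain([0.5, 0.5], [[1, 0], [0, 1]]))  # 1.0 bit
```

An uninformative observation (identical likelihood for every identity) yields zero gain, which is why such a criterion favours candidates whose identity is both uncertain and likely to be resolved by the observation.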
Another known method for active re-identification based on information theoretic concepts dynamically plans a sequence of PTZ settings to capture zoomed-in views of different regions on a candidate object to maximize the expected information gain with respect to the class of the candidate object. The term “class” refers to a semantic object category, such as “book” or “mug”. The information gain is computed in part from the learned distribution of low-level image features of the object of interest under different PTZ settings. This method assumes that multiple images of each class of object under all available PTZ settings can be obtained offline in order to learn the feature distributions.
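The planning step can be sketched as a greedy loop that repeatedly selects the unused PTZ setting with the highest expected gain; the setting names, gain values and the static (non-replanned) gain function below are simplifying assumptions, whereas the cited method re-plans dynamically from learned feature distributions:

```python
def plan_ptz_sequence(settings, gain_fn, budget=3):
    """Greedily pick up to `budget` PTZ settings in decreasing order of
    expected information gain. `gain_fn` stands in for the gain computed
    from learned feature distributions under each setting."""
    plan = []
    remaining = list(settings)
    for _ in range(min(budget, len(remaining))):
        best = max(remaining, key=gain_fn)
        plan.append(best)
        remaining.remove(best)
    return plan

# hypothetical expected gains for three zoomed-in regions
gains = {"head": 0.9, "torso": 0.5, "legs": 0.2}
print(plan_ptz_sequence(gains, gains.get, budget=2))  # ['head', 'torso']
```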
In another known related approach, a camera setting is controlled to maximize mutual information in a stochastic automaton such as an object detector. The stochastic automaton takes quantized image features (also known as “code words”) at different scales as input. Code words are initially detected in a first captured image, and the camera setting is iteratively updated to observe individual code words at higher resolution. The camera setting is selected by maximizing the mutual information with respect to the state of cells in the stochastic automaton after observing the code words that are taken as input to those cells. Similar to the previous method, this method requires training data of the object of interest to train the stochastic automaton.
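The quantization of image features into code words can be sketched as nearest-centroid assignment against a learned codebook; the two-dimensional features and two-entry codebook below are illustrative only:

```python
import numpy as np

def quantize_features(features, codebook):
    """Assign each local feature to its nearest codebook centre,
    yielding the code-word indices described above (illustrative)."""
    features = np.asarray(features, dtype=float)  # shape (N, D)
    codebook = np.asarray(codebook, dtype=float)  # shape (K, D)
    # squared distance from every feature to every codebook centre
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)                      # (N,) code-word indices

codebook = [[0.0, 0.0], [1.0, 1.0]]
print(quantize_features([[0.1, 0.0], [0.9, 1.1]], codebook))  # [0 1]
```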
Yet another known method actively re-identifies pedestrians from a gallery of known people based on a sequence of zoomed-in observations of different body regions. The method first captures a whole-body image of the candidate and extracts a feature vector based on colour and texture. The feature vector is used to rank the gallery based on a Bhattacharyya distance between the candidate and each gallery image. Each successive observation is then selected as the zoomed-in region giving the greatest feature variance across the gallery based on the current ranking. This method assumes that whole-body and zoomed-in views of all body regions are available for every object of interest in the gallery.
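The region-selection step can be sketched as choosing the body region whose gallery features vary most across the currently top-ranked entries; the region names, gallery size and feature values below are hypothetical:

```python
import numpy as np

def select_next_region(region_features, top_k_idx):
    """Pick the zoomed-in region whose features vary most across the
    top-ranked gallery entries (an illustrative sketch).

    region_features: dict mapping region name -> array of shape (G, D),
                     one feature vector per gallery entry
    top_k_idx:       indices of the currently top-ranked entries
    """
    def spread(feats):
        # total per-dimension variance over the top-ranked entries
        return float(np.var(feats[top_k_idx], axis=0).sum())
    return max(region_features, key=lambda r: spread(region_features[r]))

# hypothetical gallery of 3 people with 2-D features per region:
# torso features are nearly identical, leg features are varied
regions = {
    "torso": np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0]]),
    "legs":  np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]),
}
print(select_next_region(regions, top_k_idx=[0, 1, 2]))  # legs
```

Observing the most discriminative region first is what lets the method shorten the observation sequence needed to separate similar-looking gallery entries.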