Public venues such as shopping centres, parking lots and train stations are increasingly subject to surveillance using large-scale networks of video cameras. Application domains of large-scale video surveillance include security, safety, traffic management and business analytics. In one example application from the security domain, a security officer may want to view any video feed containing a particular suspicious person in order to identify undesirable activities. In another example from the business analytics domain, a shopping centre may wish to track customers across multiple cameras in order to build a profile of shopping habits.
Many surveillance applications require methods, known as “video analytics”, to detect, track, match and analyse multiple objects across multiple camera views. In one example, referred to as a “hand-off” application, object matching is used to persistently track multiple objects across first and second cameras with overlapping fields of view. In another example application, referred to as “re-identification”, object matching is used to locate a specific object of interest across multiple cameras in the network with non-overlapping fields of view.
Cameras at different locations may have different viewing angles and work under different lighting conditions, such as indoor and outdoor. The different viewing angles and lighting conditions may cause the visual appearance of a person to change significantly between different camera views. In addition, a person may appear in a different orientation in different camera views, such as facing towards or away from the camera, depending on the placement of the camera relative to the flow of pedestrian traffic. Robust person matching in the presence of appearance change due to camera viewing angle, lighting and person orientation is a challenging problem.
The terms “re-identification”, “hand-off” and “matching” relate to the task of relating an object of interest within at least partial view of a video camera to another object within at least partial view of the same or another video camera. A person re-identification process comprises two major steps: feature extraction and distance calculation. The feature extraction step often forms an appearance descriptor, or feature vector, to represent the appearance of a person. A descriptor is a derived value or set of derived values determined from the pixel values in an image of a person. One example of a descriptor is a histogram of colour values. Another example of a descriptor is a histogram of quantized image gradient responses. Given a person's image in a camera view, the matching step finds the closest match to the given image from a set of images in another camera view based on the distances from the given image to each image in the image set. The image with the smallest distance to the given image is considered to be the closest match to the given image. A distance metric must be selected to measure the distance between the appearance descriptors of two images. Selecting a good distance metric is advantageous for the matching performance of person re-identification. General-purpose distance metrics, e.g., Euclidean distance, cosine distance, and Manhattan distance, often fail to capture the characteristics of appearance descriptors, and hence the performance of general-purpose distance metrics is usually limited.
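The two-step pipeline described above can be sketched as follows. This is a minimal illustration in Python using NumPy; the colour-histogram descriptor and the Euclidean distance are examples named in the text, while the bin count, image layout and function names are illustrative assumptions only.

```python
import numpy as np

def colour_histogram(image, bins=8):
    """Appearance descriptor: a per-channel colour histogram of an
    H x W x 3 image with values in [0, 256), flattened and L1-normalised."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    descriptor = np.concatenate(hists).astype(float)
    return descriptor / descriptor.sum()

def closest_match(query_descriptor, gallery_descriptors):
    """Matching step: return the index of the gallery descriptor with the
    smallest Euclidean distance to the query descriptor."""
    distances = [np.linalg.norm(query_descriptor - g)
                 for g in gallery_descriptors]
    return int(np.argmin(distances))
```

In use, the query image from one camera view is reduced to a descriptor, every image in the other camera view's gallery is reduced in the same way, and the gallery image at the returned index is taken as the match.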
To avoid the limitation of the general-purpose distance metrics, a distance metric model may be learned from a training dataset. A distance metric learning method directly learns a distance metric from a given training dataset containing several training samples. Each training sample often contains a pair of appearance descriptors and a classification label indicating whether the two appearance descriptors are created from images belonging to the same person or to different persons. The classification label is defined as +1 if the appearance descriptors belong to the same person, while the classification label is defined as −1 if the appearance descriptors belong to different persons. The training samples with positive and negative classification labels are called positive and negative training samples, respectively. The distance metric is explicitly learned to minimize the distance between the appearance descriptors in each positive training sample and maximize the distance between the appearance descriptors in each negative training sample. Discriminative subspace analysis methods learn a projection that maps appearance descriptors to a subspace where appearance descriptors extracted from an image of a person are separated from appearance descriptors extracted from images of other people. During the matching process, the learned projection is used to map appearance descriptors extracted from images of persons to the subspace and calculate the distances between the projected appearance descriptors. One example of discriminative subspace analysis is kernel Fisher discriminant analysis. Another example of discriminative subspace analysis is discriminative null space analysis.
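The discriminative subspace idea can be sketched with a plain (non-kernel) Fisher discriminant analysis, which learns a projection maximising between-person scatter relative to within-person scatter. This is a minimal NumPy sketch under simplifying assumptions (a regularised eigen-solver, labelled descriptors per person); the function names and the regularisation constant are illustrative.

```python
import numpy as np

def fisher_projection(descriptors, person_ids, n_components=2, reg=1e-6):
    """Learn a linear projection W that maps appearance descriptors to a
    subspace where descriptors of the same person cluster together and
    descriptors of different persons are separated (Fisher criterion)."""
    X = np.asarray(descriptors, dtype=float)
    ids = np.asarray(person_ids)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-person scatter
    Sb = np.zeros((d, d))  # between-person scatter
    for pid in np.unique(ids):
        Xp = X[ids == pid]
        mu = Xp.mean(axis=0)
        Sw += (Xp - mu).T @ (Xp - mu)
        Sb += len(Xp) * np.outer(mu - overall_mean, mu - overall_mean)
    # Solve the generalised eigenproblem Sb w = lambda Sw w (regularised),
    # keeping the directions with the largest Fisher ratio.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw + reg * np.eye(d), Sb))
    order = np.argsort(-eigvals.real)
    return eigvecs.real[:, order[:n_components]]
```

At matching time, descriptors from both camera views are projected with the learned W and compared by distance in the subspace.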
A distance metric ensemble model may also be built by combining the models learned from distance metric learning methods and discriminative subspace analysis methods. A distance metric ensemble method often performs better than each individual metric learning method or discriminative subspace method.
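One simple way such an ensemble might combine its constituent models is to normalise each model's distances onto a common scale and take a weighted sum. The following NumPy sketch illustrates that idea; min-max normalisation, uniform default weights, and the function name are assumptions for illustration, not a description of any particular ensemble method.

```python
import numpy as np

def ensemble_distances(distance_lists, weights=None):
    """Combine distances from several metric models into one ensemble
    distance: min-max normalise each model's distances to [0, 1], then
    take a weighted average across models."""
    D = np.asarray(distance_lists, dtype=float)  # shape: (n_models, n_gallery)
    if weights is None:
        weights = np.ones(len(D)) / len(D)
    lo = D.min(axis=1, keepdims=True)
    span = D.max(axis=1, keepdims=True) - lo
    span[span == 0] = 1.0  # guard against a model reporting constant distances
    normalised = (D - lo) / span
    return np.average(normalised, axis=0, weights=weights)
```

The gallery image with the smallest combined distance is then reported as the match, exactly as with a single metric.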
The distance metric model or distance metric ensemble model learned from a training dataset often performs very poorly on a dataset collected in a new environment, e.g., an airport, which is different from the environment where the training dataset was collected, e.g., a city centre. The differences in lighting conditions, camera view angles, person orientations, and camera sensor characteristics introduce a significant change in the distribution of appearance descriptors. Hence the distributions of appearance descriptors from two different environments are significantly different. This problem is known as the domain shift problem and usually causes a significant degradation in the performance of a person re-identification system when the system is deployed to a new environment. The domain shift problem also exists in the same surveillance system installed at the same location. For example, the training dataset is collected in summer but the system is required to work in winter. The seasonal change introduces a significant change in the distribution of appearance descriptors. The environment where the training data is collected is called the source domain or training domain, and the environment where the system is deployed is called the target domain.
One known method to solve the problem of domain shift is to adaptively update a support vector machine (SVM) model learned from source domain data using unlabelled target domain data. The SVM model is updated based on the assumption that the difference between the mean values of positive and negative samples in the source domain is close to the difference between the mean values of positive and negative samples in the target domain. However, this assumption may not be reasonable when there is a large difference between the source and target domains, e.g., a large change in lighting conditions or camera view angles.
Another known method for domain adaptation uses a discriminative component analysis method to jointly learn the similarity measurements for person re-identification in different scenarios in an asymmetrical manner. A cross-task data discrepancy constraint is explored to learn a discriminant shared component across tasks. A drawback of the discriminative component analysis method is that a large amount of labelled training data from the target domain is required. Collecting labelled data from the target domain is often time consuming and impractical for large camera networks.
Another known method to solve the problem of domain shift is to capture unlabelled training data in the target domain and use multiple dictionaries to model the similarities and differences between the appearances of people in the source and target domains. In the unlabelled training data capture method, a shared dictionary represents characteristics of appearance that are common to the source and target domain, and an independent residual dictionary for each domain represents the characteristics of appearance unique to each domain. Furthermore, a target dictionary represents characteristics of appearance in the target domain that are not captured by the shared dictionary or residual dictionaries. However, a large amount of training data is required in the target domain to robustly train the residual and target dictionaries in the target domain. Capturing a large training set may not be possible if the target domain is sparsely populated.
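The shared/residual dictionary idea can be sketched in a highly simplified form: learn a shared dictionary from the pooled source and target descriptors, then learn a per-domain residual dictionary from whatever the shared dictionary fails to reconstruct. The sketch below uses truncated-SVD atoms in place of a proper sparse dictionary learning algorithm purely for illustration; the atom counts and function names are assumptions, and the target dictionary of the described method is omitted for brevity.

```python
import numpy as np

def svd_dictionary(X, n_atoms):
    """Learn n_atoms orthonormal dictionary atoms (rows) for a data
    matrix X (samples x features) via a truncated SVD."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:n_atoms]

def learn_domain_dictionaries(source, target, n_shared, n_residual):
    """Sketch of the multi-dictionary idea: a shared dictionary learned
    from both domains, plus a residual dictionary per domain learned
    from what the shared dictionary does not reconstruct."""
    both = np.vstack([source, target])
    D_shared = svd_dictionary(both, n_shared)
    # Project out the shared component, then model each domain's residual.
    res_src = source - source @ D_shared.T @ D_shared
    res_tgt = target - target @ D_shared.T @ D_shared
    D_src = svd_dictionary(res_src, n_residual)
    D_tgt = svd_dictionary(res_tgt, n_residual)
    return D_shared, D_src, D_tgt
```

The sketch makes the data requirement visible: the residual dictionaries are fitted only to each domain's own residuals, so a sparsely populated target domain yields too few samples to fit them robustly.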