Public venues such as shopping centres, parking lots and train stations are increasingly subject to surveillance using large-scale networks of video cameras. Application domains of large-scale video surveillance include security, safety, traffic management and business analytics. In one example application from the security domain, a security officer may want to view a video feed containing a particular suspicious person in order to identify undesirable activities. In another example from the business analytics domain, a shopping centre may wish to track customers across multiple cameras in order to build a profile of shopping habits.
Many surveillance applications require methods, known as “video analytics”, to detect, track, match and analyse multiple objects of interest across multiple camera views. In one example, referred to as a “hand-off” application, object matching is used to persistently track multiple objects across first and second cameras with overlapping fields of view. In another example application, referred to as “re-identification”, object matching is used to locate a specific object of interest across multiple cameras in the network with non-overlapping fields of view.
Cameras at different locations may have different viewing angles and work under different lighting conditions, such as indoor and outdoor. The different viewing angles and lighting conditions may cause the visual appearance of a person to change significantly between different camera views. In addition, a person may appear in a different orientation in different camera views, such as facing towards or away from the camera, depending on the placement of the camera relative to the flow of pedestrian traffic. Robust person matching in the presence of appearance change due to camera viewing angle, lighting and person orientation is a challenging problem.
In most person matching methods, the appearance of a person is represented by a “descriptor”, also referred to as an “appearance descriptor” or “feature vector”. A descriptor is a derived value or set of derived values determined from the pixel values in an image of a person. One example of a descriptor is a histogram of colour values. Another example of a descriptor is a histogram of quantized image gradient responses.
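A colour histogram descriptor of the kind mentioned above can be sketched as follows. This is a minimal illustration, not a method from the source; the function name and the joint-RGB binning scheme are assumptions for the example.

```python
import numpy as np

def colour_histogram_descriptor(image, bins=8):
    """Build a joint RGB colour histogram descriptor from an image.

    `image` is an H x W x 3 uint8 array. The result is a flattened,
    L1-normalised histogram with bins**3 entries: a set of derived
    values determined from the pixel values, as described above.
    (Illustrative sketch; real descriptors often also use spatial
    regions and gradient histograms.)
    """
    # Quantise each channel into `bins` levels.
    quantised = (image.astype(np.int64) * bins) // 256
    # Combine the three per-channel bin indices into one joint bin index.
    joint = (quantised[..., 0] * bins + quantised[..., 1]) * bins + quantised[..., 2]
    hist = np.bincount(joint.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()  # normalise so descriptors of different-sized images are comparable
```

Two images of the same person under similar conditions would then yield nearby descriptor vectors, which is the property the matching methods below rely on.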
In some known methods for person matching, known as “supervised learning”, a projection is learned from pairs of images of people captured from a pair of cameras. In each pair of images, the first image is captured from the first camera and the second image is captured from the second camera. Pairs of images of the same person are known as “positive” training images. Pairs of images of different people are known as “negative” training images. Pairs of appearance descriptors extracted from positive training images are known as “positive” training samples. Pairs of appearance descriptors extracted from negative training images are known as “negative” training samples.
The projection is learned with information related to whether the image pairs are positive or negative training samples. In one known method, known as “distance metric learning”, a projection is learned to minimize the distance between the appearance descriptors in each positive training sample and maximize the distance between the appearance descriptors in each negative training sample. In another method, known as “linear discriminant analysis”, a set of projections is learned to separate appearance descriptors associated with different positive training samples in a common subspace. In another method, known as “canonical correlation analysis”, a set of projections is learned to maximize the correlation between the appearance descriptors in each positive training sample in a common subspace.
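The canonical correlation analysis step can be sketched in a few lines of numpy. This is a generic textbook formulation (SVD of the whitened cross-covariance), offered only to make the idea concrete; the function name, the `reg` regularisation term, and the data layout (one positive training pair per row) are assumptions of this sketch, not details from the source.

```python
import numpy as np

def cca_projections(X, Y, n_components=2, reg=1e-6):
    """Canonical correlation analysis of paired descriptors.

    X holds appearance descriptors from the first camera and Y the
    corresponding descriptors from the second camera, one positive
    training pair per row. Returns projections Wx, Wy such that the
    correlation between X @ Wx and Y @ Wy is maximised, plus the
    canonical correlations themselves.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularised covariance and cross-covariance estimates.
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root via the symmetric eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular vectors of the whitened cross-covariance give the projections.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    Wx = Kx @ U[:, :n_components]
    Wy = Ky @ Vt.T[:, :n_components]
    return Wx, Wy, s[:n_components]
```

After learning, descriptors from the two cameras are projected into the common subspace and compared there, so that camera-specific appearance changes are suppressed.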
The supervised learning methods may be impractical due to the need for positive training images. In practice, generating a set of positive training images is time consuming and requires intensive manual labour. Furthermore, people may appear infrequently in some camera views, such as remote perimeters, making the collection of a large set of positive training images impractical. Therefore, other methods, known as “unsupervised learning” methods, learn a discriminative representation of appearance descriptors without the need to capture large quantities of positive training images for every pair of cameras.
In some known unsupervised methods for person matching, known as “dictionary learning”, a “dictionary” is learned to encode a compact, discriminative representation of an appearance descriptor. A dictionary consists of a set of dictionary “atoms” or basis vectors. An appearance descriptor of a person can be reconstructed as a linear weighted sum of dictionary atoms, each atom being weighted by a coefficient. The coefficients for all dictionary atoms collectively form a “code”. Given an appearance descriptor, the corresponding code is determined by finding the weighted sum of dictionary atoms that minimizes a difference, known as a “reconstruction error”, between the appearance descriptor and a reconstruction of the appearance descriptor using the dictionary atoms. A dissimilarity score (e.g., a Euclidean distance) between the codes of a pair of images determines whether the pair of images is matched.
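The encoding and matching steps just described can be sketched as follows. This sketch determines each code by regularised least squares for simplicity; the function names and the `reg` parameter are assumptions, and practical dictionary learning methods typically use an l1 sparsity penalty on the code, which requires an iterative solver rather than a closed-form solution.

```python
import numpy as np

def encode(descriptor, dictionary, reg=0.1):
    """Compute the code for an appearance descriptor given a dictionary.

    `dictionary` has one atom (basis vector) per column. The code c
    minimises the reconstruction error ||x - D c||^2 plus an l2
    penalty reg * ||c||^2; the closed-form solution is
    c = (D^T D + reg I)^{-1} D^T x.
    """
    D = dictionary
    return np.linalg.solve(D.T @ D + reg * np.eye(D.shape[1]), D.T @ descriptor)

def dissimilarity(x1, x2, dictionary, reg=0.1):
    """Euclidean distance between the codes of two appearance descriptors.

    A low score indicates the pair of images is likely a match.
    """
    c1 = encode(x1, dictionary, reg)
    c2 = encode(x2, dictionary, reg)
    return float(np.linalg.norm(c1 - c2))
```

In a full dictionary learning method the dictionary itself is also optimised over the training descriptors; here it is taken as given to isolate the encoding and scoring steps.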
In one known dictionary learning method, known as “cross-dataset dictionary learning”, multiple dictionaries are learned to model the similarities and differences between the appearance of people in different datasets collected from different environments. In the method, a shared dictionary represents characteristics of appearance that are common to all the datasets, and an independent residual dictionary for each dataset represents the characteristics of appearance unique to each environment. Furthermore, a target dictionary represents characteristics of appearance in the target dataset that are not captured by the shared dictionary or residual dictionaries. However, the cross-dataset dictionary learning method requires prior knowledge of a matching correspondence between training images received from query and gallery cameras in the target dataset. The matching correspondence may be obtained by manual annotation or by asking people to appear in the fields of view of the query and gallery cameras during data collection. Additionally, the cross-dataset dictionary learning method requires positive training images from other datasets collected from environments different from the target environment.
Another known dictionary learning method, known as “l1 graph-based dictionary learning”, uses an l1-norm graph regularisation term in the dictionary learning formulation to improve the robustness of the dictionary against outliers caused by changes in background, pose, and occlusion. However, the l1 graph-based dictionary learning method requires prior knowledge of a matching correspondence between training images received from query and gallery cameras. The matching correspondence may be obtained by manual annotation or by asking people to appear in the fields of view of the query and gallery cameras during data collection.