Public venues such as shopping centres, parking lots and train stations are increasingly subject to surveillance using large-scale networks of video cameras. Application domains of large-scale video surveillance include security, safety, traffic management and business analytics. In an example from the business analytics domain, a shopping centre may wish to track customers across multiple cameras in order to build a profile of shopping habits.
A key task in many of these applications is the performance of rapid and robust object matching across multiple camera views. In one example, also called “re-identification”, object matching is used to locate a specific object of interest or specific objects of interest across multiple cameras in the network with non-overlapping fields of view.
Robust object matching is a challenging problem for several reasons. Firstly, many objects may have similar appearance, such as a crowd of commuters on public transport wearing similar business attire. Furthermore, the viewpoint (i.e. the orientation and distance of an object in the camera's field of view) can vary significantly between cameras in the network. Finally, lighting, shadows and other photometric properties including focus, contrast, brightness and white balance can vary significantly between cameras and locations. In one example, a single network may simultaneously include outdoor cameras viewing objects in bright daylight, and indoor cameras viewing objects under artificial lighting.
One approach to solving this problem is to use group matching for object re-identification. There are several approaches to group matching.
In one approach, appearance-based ratio-occurrence descriptors are used for matching group images. A Centre Rectangular Ring Ratio-Occurrence (CRRRO) descriptor defines a rectangular ring structure expanding from the centre of a group image, and a Block-based Ratio-Occurrence (BRO) descriptor defines a block-based structure in which the entire image is divided into blocks. In both cases, a colour histogram is extracted for each ring or block and used for comparison. CRRRO has the disadvantage that it cannot handle large non-centre-rotational changes in people's positions within a group.
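The block-based comparison described above can be illustrated with a minimal sketch. It computes only a plain colour histogram per block and is not the full ratio-occurrence descriptor. The function names, the 2 x 2 grid, the number of bins and the use of histogram intersection as the comparison measure are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def block_histograms(image: Sequence[Sequence[Pixel]], rows: int, cols: int,
                     bins: int = 4) -> List[List[float]]:
    """Return one normalised colour histogram per block, in row-major order."""
    height, width = len(image), len(image[0])
    histograms = []
    for r in range(rows):
        for c in range(cols):
            hist = [0.0] * (bins ** 3)
            # Pixel bounds of block (r, c); integer division spreads any remainder.
            y0, y1 = r * height // rows, (r + 1) * height // rows
            x0, x1 = c * width // cols, (c + 1) * width // cols
            for y in range(y0, y1):
                for x in range(x0, x1):
                    red, green, blue = image[y][x]
                    # Quantise each channel into `bins` levels.
                    rq = red * bins // 256
                    gq = green * bins // 256
                    bq = blue * bins // 256
                    hist[(rq * bins + gq) * bins + bq] += 1.0
            total = sum(hist) or 1.0  # guard against an empty block
            histograms.append([v / total for v in hist])
    return histograms


def histogram_intersection(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity in [0, 1] for two normalised histograms (assumed measure)."""
    return sum(min(x, y) for x, y in zip(a, b))


def block_similarity(img_a, img_b, rows: int = 2, cols: int = 2,
                     bins: int = 4) -> float:
    """Mean histogram intersection over corresponding blocks."""
    ha = block_histograms(img_a, rows, cols, bins)
    hb = block_histograms(img_b, rows, cols, bins)
    return sum(histogram_intersection(x, y) for x, y in zip(ha, hb)) / len(ha)
```

Because each block is compared only with the block at the same grid position, a sketch of this kind is sensitive to people changing position within the group.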
In another approach, appearance-based covariance (COV) descriptors can be used for group matching. COV is a discriminative descriptor that captures both appearance and spatial properties of image regions.
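As a rough illustration of how a covariance descriptor combines spatial and appearance information, the sketch below computes the covariance matrix of per-pixel feature vectors over an image region. The choice of features (x position, y position and grey-level intensity) and the function name are assumptions, and the step of comparing two covariance matrices is omitted.

```python
from typing import List, Sequence


def covariance_descriptor(region: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the 3x3 sample covariance of (x, y, intensity) over a region.

    `region` is a 2D grid of grey-level intensities with at least two pixels.
    """
    # One feature vector per pixel: spatial position plus appearance.
    features = [(float(x), float(y), float(value))
                for y, row in enumerate(region)
                for x, value in enumerate(row)]
    n = len(features)
    dims = 3
    mean = [sum(f[k] for f in features) / n for k in range(dims)]
    return [[sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in features) / (n - 1)
             for j in range(dims)]
            for i in range(dims)]
```

The off-diagonal entries record how intensity varies with position, which is the sense in which the descriptor captures spatial as well as appearance properties.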
Both of these approaches have the disadvantage that they fail when matching groups with similar appearance, for example where people are dressed in business suits.
Some methods adopt a Bayesian framework for consistent labelling across non-overlapping camera views. The Bayesian framework uses spatio-temporal cues (as priors) and visual appearance cues (as likelihoods) to compute a maximum-a-posteriori (MAP) estimate. In situations comprising multiple disjoint fields of view (FOVs), for a newly detected object in one FOV, the Bayesian framework formulates a hypothesis space in another, earlier FOV. The framework assigns a higher prior to the objects composing a hypothesis if they enter the later FOV at the same time, and a lower prior if they enter at different times. This approach has the disadvantage that the hypothesis space does not scale: it is likely to become very large and incur a correspondingly large computational overhead. In addition, this method relies mainly on low-level visual appearance cues.
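The MAP selection described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the method of any particular system: the `Hypothesis` structure, the two prior values, the time tolerance and the scalar appearance likelihood are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Hypothesis:
    label: str
    entry_times: List[float]  # times at which the hypothesised objects enter the later FOV
    likelihood: float         # appearance likelihood of the hypothesis (assumed given)


def prior(h: Hypothesis, tolerance: float = 1.0,
          high: float = 0.8, low: float = 0.2) -> float:
    """Higher prior when all objects enter the later FOV at about the same time."""
    spread = max(h.entry_times) - min(h.entry_times)
    return high if spread <= tolerance else low


def map_estimate(hypotheses: Sequence[Hypothesis]) -> Hypothesis:
    """Return the hypothesis maximising prior * likelihood (unnormalised posterior)."""
    return max(hypotheses, key=lambda h: prior(h) * h.likelihood)


candidates = [
    Hypothesis("entered together", [10.0, 10.5], likelihood=0.5),
    Hypothesis("entered apart", [10.0, 30.0], likelihood=0.9),
]
best = map_estimate(candidates)  # 0.8 * 0.5 = 0.40 beats 0.2 * 0.9 = 0.18
```

In this example the spatio-temporal prior outweighs the stronger appearance likelihood of the second hypothesis. Every candidate grouping in the earlier FOV has to be scored in this way, which is the source of the scalability problem noted above.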
Another approach combines biometric cues, such as gait, with spatio-temporal cues, such as velocity and position. This approach uses a graph to model people and their relationships in an input video stream. Each person is modelled as a node in the graph, and an edge connects two persons if they are spatially close to each other. The identity of each person is then propagated to their connected neighbours by message passing in the graph via belief propagation. The message passed depends on a local evidence term capturing the biometric cue, such as face or gait, and a compatibility term describing the spatio-temporal cue. This approach is complicated and inefficient, since it performs message passing on the underlying graph model.
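The graph construction and message passing described above can be sketched as a generic sum-product belief propagation. The distance threshold, the two-label state space, the compatibility matrix and the function names are assumptions made for illustration; the source does not specify the form of the evidence or compatibility terms.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple


def build_graph(positions: Dict[str, Tuple[float, float]],
                radius: float) -> Dict[str, Set[str]]:
    """Connect two persons with an edge if they are within `radius` of each other."""
    edges: Dict[str, Set[str]] = {name: set() for name in positions}
    for a, b in combinations(positions, 2):
        if math.dist(positions[a], positions[b]) <= radius:
            edges[a].add(b)
            edges[b].add(a)
    return edges


def normalise(values: Sequence[float]) -> List[float]:
    total = sum(values)
    return [v / total for v in values]


def belief_propagation(edges: Dict[str, Set[str]],
                       evidence: Dict[str, Sequence[float]],
                       compatibility: Sequence[Sequence[float]],
                       iterations: int = 5) -> Dict[str, List[float]]:
    """Synchronous sum-product message passing; returns per-node beliefs."""
    k = len(compatibility)
    # messages[(i, j)] is the message sent from node i to node j.
    messages = {(i, j): [1.0 / k] * k for i in edges for j in edges[i]}
    for _ in range(iterations):
        updated = {}
        for (i, j) in messages:
            # Local evidence at i times all incoming messages except the one from j.
            pre = list(evidence[i])
            for n in edges[i]:
                if n != j:
                    pre = [p * m for p, m in zip(pre, messages[(n, i)])]
            updated[(i, j)] = normalise(
                [sum(pre[a] * compatibility[a][b] for a in range(k)) for b in range(k)])
        messages = updated
    beliefs = {}
    for i in edges:
        belief = list(evidence[i])
        for n in edges[i]:
            belief = [p * m for p, m in zip(belief, messages[(n, i)])]
        beliefs[i] = normalise(belief)
    return beliefs
```

Each iteration updates one message per direction of every edge, so the cost grows with the number of edges, the number of iterations and the square of the number of labels. This is consistent with the inefficiency noted above.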
In another method, an appearance context score is determined for each track by summing the weighted appearance similarity scores between the track and its neighbours. The weights are determined by the closeness of the tracks' end times. To match between tracks of objects, an inter-object similarity score for two tracks in different views is obtained by subtracting the corresponding appearance context scores. The final appearance affinity score between the tracks is obtained by adding the inter-object similarity score and the appearance similarity score. This approach relies heavily on low-level appearance, which is not robust for identifying people with similar appearances in a view, or where appearance changes across views (e.g. because of viewpoint and illumination changes). Also, this approach does not work well in situations where the neighbours change.
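The scoring steps of this method can be sketched as follows. Several details are assumptions, because the description does not fix them: appearance similarity is taken to be histogram intersection, the weight is an exponential decay in the difference between end times (with a hypothetical time constant `tau`), and the subtraction of context scores is implemented as a negated absolute difference so that tracks with closer context scores receive a higher inter-object similarity.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Track:
    histogram: Sequence[float]  # normalised appearance histogram (assumed representation)
    end_time: float             # time at which the track ends


def appearance_similarity(a: Track, b: Track) -> float:
    """Histogram intersection in [0, 1] (assumed similarity measure)."""
    return sum(min(x, y) for x, y in zip(a.histogram, b.histogram))


def context_score(track: Track, neighbours: Sequence[Track], tau: float = 2.0) -> float:
    """Sum of appearance similarities to neighbours, weighted by end-time closeness."""
    return sum(math.exp(-abs(track.end_time - n.end_time) / tau)
               * appearance_similarity(track, n)
               for n in neighbours)


def affinity(a: Track, neighbours_a: Sequence[Track],
             b: Track, neighbours_b: Sequence[Track], tau: float = 2.0) -> float:
    """Inter-object similarity (from context scores) plus direct appearance similarity."""
    inter_object = -abs(context_score(a, neighbours_a, tau)
                        - context_score(b, neighbours_b, tau))
    return inter_object + appearance_similarity(a, b)
```

In this sketch, two tracks with identical appearance receive a lower affinity when their neighbours differ, which illustrates why the approach degrades when the neighbours change between views.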