Public venues such as shopping centres, parking lots and train stations are increasingly subject to surveillance using large-scale networks of video cameras. Application domains of large-scale video surveillance include security, safety, traffic management and business analytics. In one example application from the security domain, a security officer may want to view any video feed containing a particular suspicious person in order to identify undesirable activities. In another example from the business analytics domain, a shopping centre may wish to track customers across multiple cameras in order to build a profile of shopping habits. In the arrangements described, the terms “person”, “target” and “object” relate to an object of interest within at least partial view of a video surveillance camera.
Many surveillance applications require methods, known as “video analytics”, to detect, track, match and analyse multiple objects across multiple camera views. In one example, referred to as a “hand-off” application, object matching is used to persistently track multiple objects across first and second cameras with overlapping fields of view. In another example application, referred to as “re-identification”, object matching is used to locate a specific object of interest across multiple cameras in the network with non-overlapping fields of view.
A known method for analysing an object in an image (for example a frame of a video sequence) includes steps of detecting a bounding box containing the object and extracting an appearance descriptor of the object from pixels within the bounding box. In the present disclosure, the term “bounding box” refers to a rectilinear region of an image containing an object, and an “appearance descriptor” refers to a set of values derived from pixels in the bounding box. One example of an appearance descriptor is a histogram of pixel colours within a bounding box. Another example of an appearance descriptor is a histogram of image gradients within a bounding box.
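The colour-histogram descriptor mentioned above can be illustrated with a minimal sketch. The function name `colour_histogram`, the `(x, y, width, height)` bounding box layout and the choice of 8 bins per channel are assumptions made for illustration only and are not taken from any particular known method.

```python
import numpy as np

def colour_histogram(image, box, bins=8):
    """Normalised per-channel colour histogram of the pixels in a bounding box.

    image: H x W x C uint8 array (hypothetical input layout).
    box:   (x, y, width, height) in pixels (hypothetical convention).
    """
    x, y, w, h = box
    # Crop the bounding box and flatten it to a list of pixels.
    patch = image[y:y + h, x:x + w].reshape(-1, image.shape[2])
    # One histogram per colour channel, concatenated into one descriptor.
    hists = [np.histogram(patch[:, c], bins=bins, range=(0, 256))[0]
             for c in range(patch.shape[1])]
    descriptor = np.concatenate(hists).astype(np.float64)
    total = descriptor.sum()
    # Normalise so the descriptor does not depend on the box size.
    return descriptor / total if total > 0 else descriptor
```

Normalising by the total count is one possible design choice; it allows descriptors extracted from bounding boxes of different sizes to be compared directly.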
Robust video analytics is typically challenging for two reasons. Firstly, cameras across a network may have different viewpoints, leading to projective distortions including stretching, skewing and rotation. For example, objects in images captured by a camera with a large tilt angle, roll angle or wide field of view can appear stretched, skewed and rotated by varying degrees depending on the object's location in the image. Secondly, cameras with a wide field of view often exhibit radial distortion, also known as “pincushion” or “barrel” distortion. Radial distortion causes straight lines, such as the edges of buildings and a vertical axis of a person, to appear curved in an image. Projective and radial distortions cause the appearance of an object to vary between cameras in a network. Variation in the appearance of the object reduces the reliability of object detection and descriptor extraction, and can cause tracking, matching and other video analytics processes to fail. Furthermore, displaying distorted images to an operator through a graphical user interface can cause operator fatigue, since additional effort is required to interpret the images.
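The curving of straight lines under radial distortion can be illustrated with a minimal sketch. The sketch assumes a one-parameter polynomial distortion model acting on normalised image coordinates with the origin at the distortion centre. The function name `apply_radial_distortion`, the coefficient `k1` and its value are hypothetical, and the sign convention (negative `k1` giving barrel distortion) is an assumption of this model.

```python
import numpy as np

def apply_radial_distortion(points, k1):
    """Apply the assumed one-parameter radial model to N x 2 normalised points."""
    points = np.asarray(points, dtype=np.float64)
    # Squared distance of each point from the distortion centre.
    r2 = np.sum(points ** 2, axis=1, keepdims=True)
    # Each point is scaled along its radius by a factor depending on r^2.
    return points * (1.0 + k1 * r2)

# A straight vertical line at x = 0.5, sampled at five heights.
line = np.stack([np.full(5, 0.5), np.linspace(-1.0, 1.0, 5)], axis=1)
bent = apply_radial_distortion(line, k1=-0.2)
# The x coordinates of `bent` are no longer constant: the ends of the line
# are pulled towards the centre more strongly than the middle.
```

Under this model a line passing through the distortion centre remains straight, while the curvature of other lines grows with their distance from the centre, which is consistent with distortion that varies with the location of the object in the image.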
In some known methods, the problems described above are addressed using knowledge of camera calibration to render a “rectified” image of the object in which the distortions are removed. In one known method, a rectified image of an object is rendered by first detecting a bounding box of the object in a distorted image, then determining an angular orientation of the object and finally rotating the bounding box so that the object is upright. To determine the angular orientation of the object, this method requires locating the object in 3D space based on a known camera location and a known height of the object. A drawback of the known method is that rotating a bounding box does not fully correct for radial and projective distortion. Another drawback is that the camera location and the height of the object are either unknown or known only with low accuracy in practical surveillance systems. Yet another drawback is that detecting a bounding box of an object in a distorted image is unreliable.
In another known method, a rectified image of an object is rendered by first detecting a bounding box of the object in a distorted image, then determining the 3D location of the object and finally rendering an image of the object from a virtual camera with zero roll and tilt. In the method, the 3D location of the object is determined using knowledge of the 3D camera location and orientation and assuming the scene has a planar ground. A limitation is that the method does not correct for radial distortion. Another limitation of the method is the requirement of a planar ground, which is invalid for scenes with stairs, curbs and other changes in elevation. Another drawback of the method, as with the previously described method, is that the camera location is either unknown or known only with low accuracy in practical surveillance systems.
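The idea of rendering from a virtual camera with zero roll and tilt can be illustrated with a simplified sketch. The sketch is not the known method itself: it assumes that the virtual camera shares the optical centre and intrinsic matrix `K` of the real camera, so that pixels are related by the standard rotation homography H = K R K^-1, and it ignores the 3D object location and the ground plane. The function names, the axis conventions and the meaning of `roll` and `tilt` as angles of the corrective rotation are all assumptions.

```python
import numpy as np

def rotation_homography(K, roll, tilt):
    """Homography from real-camera pixels to a rotated virtual camera.

    roll and tilt (radians) are the angles of the corrective rotation,
    under the assumed axis conventions noted below.
    """
    cr, sr = np.cos(roll), np.sin(roll)
    ct, st = np.cos(tilt), np.sin(tilt)
    # Rotation about the optical (z) axis.
    R_roll = np.array([[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]])
    # Rotation about the horizontal (x) axis.
    R_tilt = np.array([[1.0, 0.0, 0.0], [0.0, ct, -st], [0.0, st, ct]])
    R = R_tilt @ R_roll  # assumed order: undo roll first, then tilt
    return K @ R @ np.linalg.inv(K)

def warp_point(H, xy):
    """Map one pixel through a homography using homogeneous coordinates."""
    p = H @ np.array([xy[0], xy[1], 1.0])
    return p[:2] / p[2]
```

Such a homography corrects only the projective effect of camera rotation. Consistent with the limitation noted above, it does nothing to remove radial distortion.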
Yet another known method first renders the entire image corrected for radial distortion and then renders a set of overlapping sub-images corrected for projective distortion based on the first rendered image. The sub-images correspond to a set of virtual cameras with zero tilt and roll observing a grid of 3D locations on the ground plane of the scene. Sub-image corrections are computed based on knowledge of the 3D camera location. Each sub-image is processed independently to detect a bounding box containing an object. A drawback of the method is the high computational cost of rendering multiple overlapping corrected images covering the entire original image. Another drawback of the method, as with the previously described methods, is that the camera location is either unknown or known only with low accuracy in practical surveillance systems.