Public venues such as shopping centres, parking lots and train stations are increasingly subject to surveillance using large-scale networks of video cameras. Application domains of large-scale video surveillance include security, safety, traffic management and business analytics. In one example application from the security domain, a security officer may want to view any video feed containing a particular suspicious person in order to identify undesirable activities. In another example from the business analytics domain, a shopping centre may wish to track customers across multiple cameras in order to build a profile of shopping habits. In the following discussion, the terms “person”, “target” and “object” will be understood to mean an object of interest that may be within view of a video surveillance camera.
Many surveillance applications require targets to be detected, tracked, matched and analysed across multiple camera views. Robust analysis of video is challenging due to the large variation in viewpoint across cameras in a network. In one example, targets observed in a camera with a wide field of view may appear to be geometrically distorted when located far from the centre of the video frame. In another example, targets observed in a camera mounted with a large tilt angle may appear to be oriented away from a vertical direction when located far from the centre of the video frame. These geometric distortions can change the appearance of a target and cause detection, tracking, matching or some other analysis to fail.
The above challenges may be overcome based on knowledge of the geometric properties of the image formation process. In one example, knowledge of the camera geometry can be used to rectify an image to remove geometric distortions. In another example, knowledge of camera geometry can be used to align an observed target to a vertical orientation. Rectifying or aligning an image to a vertical orientation reduces the variation in the appearance of an object due to the viewpoint of the camera. In one application, known as “re-identification”, vertical alignment is applied to images of objects observed in two camera views, in order to determine whether the objects have the same identity.
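As a rough illustration of vertical alignment, the sketch below estimates how far a target leans away from the image vertical, given a vertical vanishing point. It assumes image coordinates with y increasing downwards and a vanishing point lying below the target. The function name `tilt_from_vertical_deg` and its arguments are hypothetical and are not taken from any of the methods discussed.

```python
import math

def tilt_from_vertical_deg(centre, vertical_vp):
    """Angle in degrees between the image 'down' axis and the direction
    from the target centre to the vertical vanishing point.

    Assumes y grows downwards and the vanishing point is below the target.
    A value of 0 means the target already appears upright.
    """
    dx = vertical_vp[0] - centre[0]
    dy = vertical_vp[1] - centre[1]
    # atan2(dx, dy) measures the angle from the +y (down) axis towards +x.
    return math.degrees(math.atan2(dx, dy))
```

Rotating the cropped target image by the opposite of this angle about its centre would bring it to an upright orientation. The sign of that rotation depends on the convention of the image library in use.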
It is well known that camera geometry can be estimated from knowledge of the vanishing points within an image. One known method for determining a vanishing point in an image first extracts at least two straight lines in the image, corresponding to the edges of static objects in the scene. In one example, two nearly vertical straight lines at the boundaries of a building are extracted by applying a Hough transform to edge pixels in an image. A vertical vanishing point is proposed by taking the intersection of these lines. Additional straight lines that pass near the vanishing point are extracted, and a reliability score for the proposed vanishing point is computed based in part on the length, contrast and intersections of these additional lines. In another example, multiple line segments are detected based on the magnitude of an image gradient. Intersections between multiple pairs of line segments are computed and clustered to determine a vanishing point. The clustering process is repeated multiple times to determine additional vanishing points. A drawback of the two approaches described above is that they rely on the presence of objects with parallel straight edges in an image. Some views in a surveillance camera network, such as a view of an outdoor park, may not contain sufficient parallel straight edges to determine a vanishing point.
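The intersection step in the first example can be sketched using homogeneous coordinates, where both the line through two points and the intersection of two lines are cross products. This is a minimal sketch that assumes the line end points have already been extracted (for example by a Hough transform). The helper names `cross`, `line_through` and `intersect` are illustrative only.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def line_through(p, q):
    """Homogeneous line (a, b, c), with a*x + b*y + c = 0, through two image points."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))

def intersect(l1, l2, eps=1e-9):
    """Intersection of two homogeneous lines as (x, y).

    Returns None when the lines are parallel or identical in the image,
    in which case the vanishing point is at infinity or undefined.
    The threshold eps is an arbitrary choice for this sketch.
    """
    x, y, w = cross(l1, l2)
    if abs(w) < eps:
        return None
    return (x / w, y / w)
```

For instance, two building edges running from (100, 400) to (110, 100) and from (300, 400) to (290, 100) converge above the image, and `intersect` places the proposed vanishing point at (200, -2600).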
Other known methods determine parallel lines from moving objects of arbitrary shape, such as a person, rather than static straight-edged structures. In one example, two known features on the object, such as the head location and foot location, are detected when the object is at different locations in a video frame. A vanishing point is determined at the intersection of the lines connecting the pairs of known features. A vanishing line is then determined from multiple vanishing points computed from different objects or the same object at multiple pairs of locations in the video frame. Finally, the camera geometry is determined from the vanishing line and a known height of at least one object in the image. A drawback of this method is that it relies on an object maintaining a fixed height at different locations in an image in order to extract parallel lines. This is generally not the case for a person, whose posture changes as they walk through a scene.
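The head and foot construction can be sketched as follows, under the same fixed-height assumption the method relies on. The line joining the two head locations and the line joining the two foot locations are parallel in the scene, so their image intersection is a vanishing point, and two such points define the vanishing line. The function names and the (head, foot) observation format are assumptions made for this sketch.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def line_through(p, q):
    """Homogeneous line through two image points (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))

def vanishing_point(obs_a, obs_b, eps=1e-9):
    """Vanishing point from one object seen at two locations.

    Each observation is a (head, foot) pair of image points. Returns None
    if the head line and foot line do not intersect at a finite point.
    """
    (head_a, foot_a), (head_b, foot_b) = obs_a, obs_b
    head_line = line_through(head_a, head_b)
    foot_line = line_through(foot_a, foot_b)
    x, y, w = cross(head_line, foot_line)
    if abs(w) < eps:
        return None
    return (x / w, y / w)

def vanishing_line(vp1, vp2):
    """Homogeneous vanishing line (a, b, c) through two vanishing points."""
    return line_through(vp1, vp2)
```

With more than two vanishing points, a least-squares line fit would normally replace the two-point construction in `vanishing_line`. If the person's apparent height changes with posture, the head and foot lines are no longer parallel in the scene and the computed point is biased, which is the drawback noted above.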
In another example, a vertical vanishing point is found at the intersection of vertical lines joining corresponding head and feet locations of walking pedestrians at different locations in a video frame. In order to reduce errors due to changes in posture, this approach selects images with a fixed posture, the fixed posture corresponding to the moment at which the legs are closest to each other during a walking cycle. The fixed posture is determined based on the shape of the segmented region of the walking person. A horizontal vanishing line is determined from pairs of different head and feet locations. Finally, the vertical vanishing point and horizontal vanishing line are used to compute the camera geometry. This approach relies on robust and accurate segmentation of the moving object in order to analyse the posture of the target. However, robust and accurate segmentation is a significant challenge in real surveillance scenarios with arbitrary background and lighting conditions.
In yet another example, camera geometry is estimated from a set of vertical lines estimated from the major axis of segmented regions of many walking pedestrians. In order to deal with errors due to changes in posture or poor segmentation, this approach uses RANSAC to find a subset of reliable vertical lines. The camera geometry is estimated from an inlier set of vertical lines and the known general distribution of heights of people in the population. Further robustness is achieved by computing the relative 3D height of lines in the inlier set, and discarding lines that fall outside a predetermined range. A drawback of this approach is that many vertical lines are required to find a reliable inlier set using RANSAC, which requires a crowded scene or video captured over an extended period.
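The role of RANSAC in selecting reliable vertical lines can be illustrated with a simplified sketch. It is not the approach described above, which also uses the distribution of people's heights. It only hypothesises a vertical vanishing point from two randomly chosen segments and keeps the hypothesis supported by the most segments. The angular tolerance, iteration count and function names are assumptions chosen for illustration.

```python
import math
import random

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def segment_line(seg):
    """Homogeneous line through the two end points of a segment."""
    (x1, y1), (x2, y2) = seg
    return cross((x1, y1, 1.0), (x2, y2, 1.0))

def angular_error(seg, vp):
    """Angle in radians between a segment and the line joining its
    midpoint to the homogeneous point vp = (x, y, w)."""
    (x1, y1), (x2, y2) = seg
    mx, my = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    dx, dy = x2 - x1, y2 - y1
    # Direction from the midpoint to vp; also valid for points at infinity (w = 0).
    vx, vy = vp[0] - vp[2] * mx, vp[1] - vp[2] * my
    return math.atan2(abs(dx * vy - dy * vx), abs(dx * vx + dy * vy))

def ransac_vanishing_point(segments, iters=200, tol=math.radians(2.0), seed=0):
    """Return (vp, inliers), where vp is a homogeneous point (x, y, w)."""
    rng = random.Random(seed)
    best_vp, best_inliers = None, []
    for _ in range(iters):
        s1, s2 = rng.sample(segments, 2)
        vp = cross(segment_line(s1), segment_line(s2))
        if not any(vp):
            continue  # the two segments lie on the same line
        inliers = [s for s in segments if angular_error(s, vp) < tol]
        if len(inliers) > len(best_inliers):
            best_vp, best_inliers = vp, inliers
    return best_vp, best_inliers
```

Because each hypothesis uses only two segments, a useful consensus requires that a reasonable fraction of the segments be reliable, which is why many observations are needed in practice.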