Surveillance cameras, such as Pan-Tilt-Zoom (PTZ) network video cameras, are omnipresent nowadays. The cameras capture more data (video content) than human viewers can process. Automatic analysis of the captured video content is therefore needed.
An important part of automatic analysis of video content is the tracking of objects in a sequence of images captured of a scene. Objects may be separated from a background of the scene and treated as foreground objects by a previous extraction process, such as foreground/background separation. The terms "foreground objects" and "foreground" usually refer to moving objects, e.g. people in a scene. Remaining parts of the scene are considered to be background.
Foreground/background separation allows for analysis, such as detection of specific foreground objects, or tracking of moving objects within a sequence of images. Such further analysis has many applications, including, for example, automated video surveillance and statistics gathering, such as people counting.
One method of foreground/background separation is statistical scene modelling. In one example, a number of Gaussian distributions are maintained for each pixel of an image to model the recent history of the pixel. When a new input image of a sequence of images is received, each pixel from the image is evaluated against the Gaussian distributions maintained by the scene model at the corresponding pixel location. If the input pixel matches one of the Gaussian distributions, then the parameters of the associated Gaussian distribution are updated with an adaptive learning rate. Otherwise, a new Gaussian model for the pixel is created.
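The per-pixel matching and update described above can be sketched as follows. This is a minimal illustrative sketch only, not the specific method of any particular system; the class names, the match threshold of 2.5 standard deviations, the initial variance, and the learning rate are all assumed values chosen for illustration.

```python
# Sketch of per-pixel Gaussian mixture scene modelling.
# All names and numeric parameters are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Gaussian:
    mean: float
    variance: float
    weight: float

@dataclass
class PixelModel:
    modes: list = field(default_factory=list)
    max_modes: int = 3  # assumed cap on distributions per pixel

    def update(self, value, learning_rate=0.05, match_sigmas=2.5):
        # Evaluate the input pixel against each maintained distribution.
        for g in self.modes:
            if (value - g.mean) ** 2 <= (match_sigmas ** 2) * g.variance:
                # Matched: update the distribution's parameters with an
                # adaptive learning rate.
                g.mean += learning_rate * (value - g.mean)
                g.variance += learning_rate * ((value - g.mean) ** 2 - g.variance)
                g.weight += learning_rate * (1.0 - g.weight)
                return True   # pixel explained by the scene model
        # No match: create a new Gaussian distribution for the pixel.
        self.modes.append(Gaussian(mean=value, variance=100.0,
                                   weight=learning_rate))
        if len(self.modes) > self.max_modes:
            # Discard the lowest-weight distribution.
            self.modes.sort(key=lambda g: g.weight, reverse=True)
            self.modes.pop()
        return False  # pixel not explained (foreground candidate)
```

A pixel whose value recurs is absorbed into an existing distribution; a pixel that jumps to a new value spawns a new distribution and is treated as a foreground candidate until that distribution gains weight.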
Foreground/background separation typically detects foreground areas of a scene as blobs, where each blob represents one such foreground area. Blobs have no consistent identity from one image of an image sequence to the next without a later step, such as a video object tracker, to resolve the identities of blobs over time.
Video object tracking provides consistency across images of an image sequence for foreground blobs by associating blobs with each other across multiple images (i.e. over time).
The process of foreground/background separation to produce foreground blobs, which are also called detections, has an ambiguity over the relationship of a given blob to an object. Each blob may correspond to part of an object, to one object, or to more than one object. For example, one object may correspond to multiple foreground blobs. From the point of view of a video object tracker, a blob has no context with regard to real-world objects.
More than one blob may correspond to one object, potentially resulting in more than one track corresponding to that object. Such a situation may arise where there have been partial detections due to detection failures in the foreground/background separation process.
One blob may also correspond to one object where a clear, unobstructed view of the object can be seen by a camera and there are no detection failures for the object.
One blob may correspond to more than one object where the objects are overlapping in the view of a camera. The overlapping objects may be said to exhibit spatial connectedness with regard to the foreground/background separation process, and more generally one of the objects may be said to be occluding one or more other objects.
As objects move through a scene, the objects can be observed to interact continually through merging and splitting of blobs. For example, two humans walking across the scene in opposite directions may cross. Initially, a foreground/background separation process may detect each human as a single blob (i.e., there will be two blobs detected). When the humans cross (i.e., the humans exhibit spatial connectedness from the point of view of a camera), the foreground/background separation process may output only one blob corresponding to both humans. When two or more previously detected blobs are detected as one blob, the blobs are considered to have merged. When the humans separate, the humans may again be detected by the foreground/background separation process as two separate blobs after previously being detected as one blob. One previously detected blob may be detected as more than one blob, because the blob has split. However, detection failures in the foreground/background separation process may also cause multiple blobs to be detected for one object. A video object tracker may not be able to distinguish between a split and a detection failure in the foreground/background separation process.
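The merge and split situations described above can be detected geometrically by counting overlaps between track regions and detected blobs, as in the following sketch. This is an illustrative assumption about how such bookkeeping might be done, not a method taken from the text; boxes are assumed axis-aligned tuples (x1, y1, x2, y2) and all names are hypothetical.

```python
# Sketch: classify merges and splits by counting overlaps between
# predicted track regions and detected blobs. Illustrative only.

def overlaps(a, b):
    """True if two axis-aligned boxes (x1, y1, x2, y2) intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def classify_associations(tracks, blobs):
    """Return per-track blob matches, merged blobs, and split tracks."""
    events = []
    for ti, t in enumerate(tracks):
        matched = [bi for bi, b in enumerate(blobs) if overlaps(t, b)]
        events.append((ti, matched))
    # A blob overlapped by several tracks indicates a merge.
    blob_hits = {}
    for ti, matched in events:
        for bi in matched:
            blob_hits.setdefault(bi, []).append(ti)
    merges = [bi for bi, tis in blob_hits.items() if len(tis) > 1]
    # A track overlapping several blobs may be a split, or a detection
    # failure that fragmented one object; geometry alone cannot tell.
    splits = [ti for ti, matched in events if len(matched) > 1]
    return events, merges, splits
```

For the two-humans example, two track boxes that both overlap a single detected blob yield one entry in the merge list; the converse case, one track overlapping two blobs, is flagged as a split, with the caveat noted in the comments that a detection failure looks identical.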
A conventional method of tracking an object uses a mean shift algorithm and colour distribution of the object being tracked to find the object within the scene based on a visual appearance of the object. The method adds robustness where the object being tracked is partially occluded by one or more other objects. However, the method may not completely support merges and splits due to the lack of context in detections. While some segmentation of occluding objects may be possible using the mean shift algorithm and colour distribution method, occluded objects can easily be lost. Further, a video object tracker using the method can become stuck in a local maximum. Additionally, the lack of context in the detections may create problems when initialising tracks, such as tracks that are initialised on blobs containing more than one object. A Kalman filter may also be used with the mean shift algorithm and colour distribution method for predicting the location of a track in order to reduce search space. However, such iterative visual methods are computationally expensive when compared to a “geometric” tracker which uses foreground blob shapes and positions only. Such visual methods can be too computationally demanding to implement on a low-power device such as a video camera. Thus, a need exists to provide an improved method, apparatus and system for tracking objects in a sequence of images, that is robust to continual interactions between objects and relatively computationally inexpensive.
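The contrast drawn above between iterative visual methods and a "geometric" tracker can be illustrated with a sketch of the geometric side: predict each track's position with a simple constant-velocity model (a deliberately simplified stand-in for a Kalman filter) and associate it with the nearest blob centroid. Every name, the gating distance, and the constant-velocity assumption are illustrative choices, not details from the text.

```python
# Sketch of a "geometric" tracker step: constant-velocity prediction
# plus nearest-centroid association. Illustrative assumptions only.
import math

class GeometricTrack:
    def __init__(self, centroid):
        self.centroid = centroid        # (x, y) of last matched blob
        self.velocity = (0.0, 0.0)      # per-frame displacement

    def predict(self):
        # Constant-velocity prediction (simplified Kalman-style predict).
        return (self.centroid[0] + self.velocity[0],
                self.centroid[1] + self.velocity[1])

    def correct(self, blob_centroid):
        # Update velocity and position from the associated blob.
        self.velocity = (blob_centroid[0] - self.centroid[0],
                         blob_centroid[1] - self.centroid[1])
        self.centroid = blob_centroid

def associate(track, blob_centroids, gate=50.0):
    """Pick the blob closest to the predicted position, within a gate."""
    px, py = track.predict()
    best, best_d = None, gate
    for i, (bx, by) in enumerate(blob_centroids):
        d = math.hypot(bx - px, by - py)
        if d < best_d:
            best, best_d = i, d
    return best  # None if no blob falls inside the gate
```

Per track and per frame this costs one prediction and one pass over the blob centroids, with no iteration over pixel data, which is why such geometric tracking is cheap enough for a low-power device, at the price of discarding the visual appearance cues that mean shift methods exploit.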