There is a long history of video analytics technologies designed to analyse digital video and track objects within it.
Many video tracking systems use some form of foreground separation to determine what is moving in the scene and what is stationary. This can be as simple as examining the pixel differences between consecutive frames (“frame differencing”), or considerably more complex, taking into account confounding factors such as camera movement, shadows, reflections, and background motion such as water ripples, swaying trees, and moving escalators.
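The frame-differencing case mentioned above can be sketched in a few lines. This is a minimal illustration in Python with NumPy; the function name and the threshold value are assumptions for demonstration, not taken from the text:

```python
import numpy as np

def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Classify pixels as foreground where the absolute per-pixel
    difference between consecutive greyscale frames exceeds a
    threshold (illustrative value)."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold

# A static background with one small bright object appearing.
prev_frame = np.zeros((8, 8), dtype=np.uint8)
curr_frame = np.zeros((8, 8), dtype=np.uint8)
curr_frame[2:4, 2:4] = 200  # the object occupies this region now

mask = frame_difference_mask(prev_frame, curr_frame)
```

Real systems replace this comparison with adaptive background models precisely because of the confounding factors listed above.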
Foreground separation can be used as input to a geometric tracker (i.e. a tracker that treats each connected foreground region as an object to be tracked). Point tracking methods such as Kalman filters can then be used to track the objects. Such a tracker works well on individual objects moving through the scene, but performs poorly when tracked objects touch or overlap, as it does not distinguish foreground objects from each other.
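The Kalman-filter point tracking mentioned above can be sketched as a constant-velocity filter over a region centroid. This is an illustrative NumPy sketch under assumed state-transition, process-noise, and measurement-noise matrices; none of these specific values come from the text:

```python
import numpy as np

# Constant-velocity model over state (x, y, vx, vy); all matrices and
# noise magnitudes below are illustrative assumptions.
F = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)   # state transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # we observe position only
Q = np.eye(4) * 1e-2                        # process noise
R = np.eye(2) * 1.0                         # measurement noise

def kalman_step(x, P, z):
    """One predict/update cycle for a measured centroid z = (x, y)."""
    # Predict the next state from the motion model.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the observed centroid.
    y = z - H @ x                    # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x, P

x = np.zeros(4)   # initial state: at origin, stationary
P = np.eye(4)
for t in range(1, 6):  # centroid moving one pixel per frame in x
    x, P = kalman_step(x, P, np.array([float(t), 0.0]))
```

After a few frames the filter's state converges on the object's position and velocity, which is what makes it useful for predicting the track location in the next frame.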
Visual Signature Algorithms (also known as Kernel Trackers) are algorithms capable of tracking objects by analysing the scene for objects of a similar appearance to the known tracks. Existing Visual Signature Algorithms include Mean-Shift, CamShift, and KLT.
The Mean-shift tracker is a Visual Signature algorithm that requires initialisation with an Exemplar View of an object. An Exemplar View is the region of an image representing the object to be tracked. The Exemplar View can be provided either by a geometric tracker or a specialised detector, e.g. a Human Body Detection algorithm. The mean-shift tracker then creates an Exemplar View Histogram, a histogram of the Exemplar View. Many different histogram types are possible, including three dimensional pixel histograms of RGB or YCbCr, one dimensional pixel histograms of Hue (ignoring pixels with brightness or saturation below a fixed threshold), and higher dimensional histograms that take into account such features as luma gradients and textures.
On each subsequent video frame, the mean-shift tracker creates a Back Projection, being a Probability Density Function (PDF) of the video frame, mapping each pixel or area of the current video frame to a corresponding normalised histogram value. Then, starting at the predicted location of the track, a mean-shift procedure (an iterated shifting of the centroid of the object using the first moment of values of the back projection within a bounding box of the object) is used to find a local maximum of the PDF. The predicted location of the track can simply be the same position as in the previous frame, or it can take into account the known behaviour of the track so far (e.g. using a Kalman filter). The local maximum defines the mean-shift calculated Current Frame Track Location, typically represented by a bounding box. The track information is finally updated with the Current Frame Track Location and the system awaits the next video frame.
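The back projection and the iterated centroid shift described above can be sketched as follows. This is an illustrative NumPy implementation over a pre-binned single-channel image; the function names, window size, and convergence tolerance are assumptions:

```python
import numpy as np

def back_projection(frame_bins, hist):
    """Map each pixel's histogram bin index to its normalised histogram
    value, producing a per-pixel probability (the Back Projection)."""
    return hist[frame_bins]

def mean_shift(backproj, cx, cy, w, h, iters=10):
    """Iteratively move the window centre to the centroid (first moment)
    of back-projection values inside the current bounding box."""
    ys, xs = np.mgrid[0:backproj.shape[0], 0:backproj.shape[1]]
    for _ in range(iters):
        x0, x1 = max(int(cx - w // 2), 0), int(cx + w // 2) + 1
        y0, y1 = max(int(cy - h // 2), 0), int(cy + h // 2) + 1
        window = backproj[y0:y1, x0:x1]
        total = window.sum()
        if total == 0:
            break  # no evidence in the window; keep last position
        ncx = (xs[y0:y1, x0:x1] * window).sum() / total
        ncy = (ys[y0:y1, x0:x1] * window).sum() / total
        converged = abs(ncx - cx) < 0.5 and abs(ncy - cy) < 0.5
        cx, cy = ncx, ncy
        if converged:
            break
    return cx, cy

# Illustrative scene: two-bin image where the object's bin index is 1.
frame_bins = np.zeros((40, 40), dtype=int)
frame_bins[17:24, 17:24] = 1           # object occupies this region
hist = np.array([0.0, 1.0])            # normalised two-bin exemplar histogram
backproj = back_projection(frame_bins, hist)
cx, cy = mean_shift(backproj, cx=15, cy=15, w=9, h=9)
```

Starting from the predicted location (15, 15), the window climbs the PDF until it settles on the object's true centre near (20, 20).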
The mean-shift algorithm is also able to give an approximate confidence of the determined tracking, by examining the absolute strength of the PDF within the bounding box, penalised by the strength of the PDF in the immediate area outside the bounding box.
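One way the confidence measure described above might be computed is to compare the mean back-projection strength inside the bounding box against that of a surrounding margin. This sketch is an assumption about the exact formula, which the text does not specify; the margin width and penalty weight are illustrative:

```python
import numpy as np

def track_confidence(backproj, x0, y0, x1, y1, margin=4, penalty=1.0):
    """Approximate tracking confidence: mean back-projection strength
    inside the bounding box, penalised by the mean strength in a
    surrounding margin (margin and penalty values are illustrative)."""
    inside = backproj[y0:y1, x0:x1].mean()
    # Boolean mask selecting the margin ring around the box.
    ring = np.zeros(backproj.shape, dtype=bool)
    ring[max(y0 - margin, 0):y1 + margin,
         max(x0 - margin, 0):x1 + margin] = True
    ring[y0:y1, x0:x1] = False
    outside = backproj[ring].mean() if ring.any() else 0.0
    return inside - penalty * outside

# Strong response exactly inside the true object region.
bp = np.zeros((20, 20))
bp[5:10, 5:10] = 1.0
good = track_confidence(bp, 5, 5, 10, 10)   # box on the object
bad = track_confidence(bp, 12, 12, 17, 17)  # box on background
```

A well-placed box scores high because its interior is strong and its surroundings weak; a misplaced box scores low, and can even score negatively when the object's response lies in its margin.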
The mean-shift tracker has some useful properties. The use of histograms means that the mean-shift tracker is invariant to rotation and (to a lesser degree) scale and deformation of the objects. The mean-shift tracker is also computationally efficient compared with other Visual Signature algorithms.
One limitation of the mean-shift tracker is that the tracked object may gradually change in appearance over time. If updated Exemplar Views for the track are not provided, the track may be lost. Updating the Exemplar View may be done by using mean-shift tracking in conjunction with a Human Body Detection algorithm and a geometric track association algorithm to associate Human Body Detection bounding boxes with existing tracks. Alternatively, using only the mean-shift object positions, the Exemplar View Histogram may be updated if the histogram described by the mean-shift calculated object position is sufficiently similar to the Exemplar View Histogram. One such approach uses a threshold on the Bhattacharyya coefficient between the two histograms to decide whether to update the Exemplar View Histogram.
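The Bhattacharyya-coefficient update rule mentioned above can be sketched as follows. The coefficient itself is standard; the threshold value of 0.8 and the function names are assumptions for illustration:

```python
import numpy as np

def bhattacharyya(h1, h2):
    """Bhattacharyya coefficient between two normalised histograms
    (1.0 = identical distributions, 0.0 = no overlap)."""
    return float(np.sum(np.sqrt(h1 * h2)))

def maybe_update_exemplar(exemplar_hist, candidate_hist, threshold=0.8):
    """Adopt the candidate histogram only if it is sufficiently similar
    to the current Exemplar View Histogram; the threshold value is an
    illustrative assumption."""
    if bhattacharyya(exemplar_hist, candidate_hist) >= threshold:
        return candidate_hist
    return exemplar_hist

exemplar = np.array([0.5, 0.5, 0.0, 0.0])
similar = np.array([0.4, 0.6, 0.0, 0.0])     # slight appearance drift
dissimilar = np.array([0.0, 0.0, 0.5, 0.5])  # likely a different object
```

The threshold trades adaptivity against safety: too low and the exemplar can drift onto an occluder or the background, too high and gradual appearance change is never absorbed.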
A significant limitation of the mean-shift tracker is that if the histogram peaks of an object also appear in nearby background areas of the image, the algorithm can fail to locate the present position of the object, instead including the nearby background areas in its determined location.
A simple way to avoid including background pixels is to centre-weight the histogram data with respect to the bounding box of the Exemplar View. A further improvement is to exclude or penalise nearby background areas, defined as the area immediately outside the Exemplar View bounding box or the foreground area associated with the track, when creating histograms and/or back projections. Background exclusion is applied only to the Exemplar View and not to subsequent mean-shift calculated object positions, because errors in a calculated object position may place parts of the object inside the background exclusion area, which in turn can cause larger errors, leading to tracking failure.
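The centre-weighting idea above can be sketched by weighting each pixel's histogram contribution with an Epanechnikov-style kernel profile, a common choice for mean-shift trackers, though the text does not name a specific kernel; the function name and two-bin demonstration are assumptions:

```python
import numpy as np

def centre_weighted_histogram(region_bins, n_bins):
    """Weight each pixel's histogram contribution by an Epanechnikov-style
    kernel so pixels near the bounding-box centre count more than pixels
    near the edge, which are more likely to be background."""
    h, w = region_bins.shape
    ys = (np.arange(h) - (h - 1) / 2) / (h / 2)  # normalised offsets
    xs = (np.arange(w) - (w - 1) / 2) / (w / 2)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    weights = np.maximum(1.0 - (xx**2 + yy**2), 0.0)  # kernel profile
    hist = np.bincount(region_bins.ravel(), weights=weights.ravel(),
                       minlength=n_bins)
    return hist / hist.sum()

# Object (bin 1) in the centre, background (bin 0) around the edges.
region = np.zeros((9, 9), dtype=int)
region[3:6, 3:6] = 1
weighted = centre_weighted_histogram(region, 2)
uniform = np.bincount(region.ravel(), minlength=2) / region.size
```

Compared with uniform counting, the centre-weighted histogram gives the object's bin a larger share, because the edge pixels that are most likely to be background receive near-zero weight.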
Another significant limitation of the mean-shift tracker is that if the object moves to an area of the scene that has a similar background appearance to the object, it is possible the tracker will get stuck on the background area. One approach for addressing this issue dynamically creates a compensated Exemplar View Histogram using the Exemplar View Histogram and a histogram constructed from a bounding box based on the predicted track location, using the ratios of bin sizes to determine whether background areas share features with the Model, and penalising those bins if they do.
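The bin-ratio compensation described above can be sketched as follows. The exact penalty formula is not given in the text, so the scheme here (down-weighting each exemplar bin by its share relative to the background histogram) is an illustrative assumption:

```python
import numpy as np

def compensated_histogram(exemplar_hist, background_hist, eps=1e-6):
    """Penalise Exemplar View Histogram bins that are also strong in a
    histogram of the nearby background (e.g. built from the predicted
    track location), so shared features contribute less to the back
    projection. The penalty scheme is an illustrative sketch."""
    # Per-bin weight: fraction of the bin's mass owned by the exemplar.
    weights = exemplar_hist / (exemplar_hist + background_hist + eps)
    comp = exemplar_hist * weights
    return comp / comp.sum()   # renormalise

# Bin 0 is shared with the background; bin 1 is distinctive.
exemplar = np.array([0.5, 0.5])
background = np.array([0.5, 0.0])
comp = compensated_histogram(exemplar, background)
```

The shared bin is de-emphasised while the distinctive bin keeps its full weight, which is exactly the behaviour that can distort the calculated object position in the manner described next.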
A common problem with methods that create a compensated Exemplar View Histogram by de-emphasising selected Exemplar View Histogram bins is that the mean-shift calculated object position may change as well. For example, when tracking a person whose trousers are a similar colour to the background, the compensated Exemplar View Histogram may remove that colour, so that the subsequent back projection and mean-shift calculated object position represent only the upper part of the body. The bounding box centroid is shifted and (in the case of the CamShift visual signature algorithm) its size is reduced, so only a portion of the object (the upper half) is now being tracked. If a geometric tracker is being used to assist track predictions in future frames, the track prediction will no longer be accurate. The smaller bounding box also makes it easier to lose the track altogether.
There is a need for a tracker that is more robust to tracking objects when there is visually similar background nearby.