There is a long history of video analytic technologies designed to analyse digital video and track video objects.
Many video tracking systems use some form of foreground separation to work out what is moving in the scene and what is stationary. This can be as simple as looking at the pixel differences between successive frames (“frame differencing”), or it can be considerably more complex, taking into account confounding factors such as camera movement, shadows, reflections, and background movements such as water ripples, tree movement, and escalator movement.
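The simplest case mentioned above, frame differencing, can be sketched as follows. This is a minimal illustration only: it assumes greyscale frames held as NumPy arrays, and the function name `frame_difference_mask` and the threshold value are hypothetical choices, not values taken from any particular system.

```python
import numpy as np

def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Return a boolean foreground mask from two greyscale frames.

    `threshold` is an illustrative value, not one given in the text.
    """
    # Cast to a signed type so the subtraction cannot wrap around in uint8.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold

# Toy example: a bright 2x2 "object" appears in an otherwise static scene.
prev_frame = np.zeros((4, 4), dtype=np.uint8)
curr_frame = prev_frame.copy()
curr_frame[1:3, 1:3] = 200
mask = frame_difference_mask(prev_frame, curr_frame)
```

A sketch of this kind does nothing about shadows, camera movement or moving backgrounds, which is why practical systems use more elaborate background models.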
Foreground separation can be used as input to a geometric tracker (i.e. a tracker that treats each connected foreground region as an object to be tracked). Point tracking methods such as Kalman filters can then be used to track the objects. Such a tracker works well on individual objects moving through the scene but is poor at following tracks that touch each other, as it does not distinguish foreground objects from each other.
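As an illustration of point tracking, the sketch below applies a constant-velocity Kalman filter to the centroid of a foreground region. The class name, the state layout (x, y, vx, vy) and the noise variances are assumptions made for the example; a real tracker would tune them to the scene.

```python
import numpy as np

class ConstantVelocityKalman:
    """Kalman filter over the state (x, y, vx, vy), one time step per frame."""

    def __init__(self, x, y, process_var=1.0, measurement_var=4.0):
        self.state = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 100.0  # large initial uncertainty (assumed)
        self.F = np.array([[1, 0, 1, 0],
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)  # motion model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)  # only position is observed
        self.Q = np.eye(4) * process_var
        self.R = np.eye(2) * measurement_var

    def predict(self):
        """Advance the state by one frame and return the predicted (x, y)."""
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]

    def update(self, measured_xy):
        """Correct the prediction with a measured centroid."""
        innovation = np.asarray(measured_xy, dtype=float) - self.H @ self.state
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.state = self.state + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P

# A centroid moving 2 pixels right and 1 pixel down per frame.
kf = ConstantVelocityKalman(0.0, 0.0)
for t in range(1, 20):
    kf.predict()
    kf.update((2.0 * t, 1.0 * t))
predicted = kf.predict()  # estimate for frame 20
```

The filter tracks a single point per foreground region, so when two regions merge into one connected blob it has no means of telling the objects apart.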
Visual Signature Algorithms (also known as Kernel Trackers) are algorithms capable of tracking objects by analysing the scene for objects of a similar appearance to the known tracks. Existing Visual Signature Algorithms include Mean-Shift, CamShift, and KLT.
The Mean-shift tracker is a Visual Signature algorithm that requires initialisation with an Exemplar Image of an object. An exemplar image is the region of an image representing the object to be tracked. The exemplar image can be provided either by a geometric tracker or a specialised detector, e.g. a Human Body Detection algorithm. The mean-shift tracker then creates a Model Histogram, a histogram of the exemplar image. Many different histogram types are possible, including three-dimensional pixel histograms of RGB or YCbCr, one-dimensional pixel histograms of Hue (ignoring pixels with brightness or saturation below a fixed threshold), and higher-dimensional histograms that take into account such features as luma gradients and textures.
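A one-dimensional Hue model histogram of the kind described above might be built as in the following sketch. It assumes the exemplar image has already been converted to HSV with all channels scaled to [0, 1]; the function name, bin count and thresholds are illustrative assumptions.

```python
import numpy as np

def hue_model_histogram(hsv_exemplar, bins=16, min_saturation=0.2, min_value=0.2):
    """Normalised hue histogram of an exemplar image (H, S, V all in [0, 1])."""
    hue = hsv_exemplar[..., 0]
    sat = hsv_exemplar[..., 1]
    val = hsv_exemplar[..., 2]
    # Hue is unreliable for dark or washed-out pixels, so those are ignored.
    reliable = (sat >= min_saturation) & (val >= min_value)
    hist, _ = np.histogram(hue[reliable], bins=bins, range=(0.0, 1.0))
    hist = hist.astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Toy exemplar: saturated red pixels (hue 0.0) plus one very dark pixel.
hsv_exemplar = np.zeros((4, 4, 3))
hsv_exemplar[..., 1] = 1.0
hsv_exemplar[..., 2] = 1.0
hsv_exemplar[0, 0] = (0.5, 1.0, 0.05)  # dark pixel, excluded by the threshold
model_hist = hue_model_histogram(hsv_exemplar)
```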
Then, on each subsequent video frame, the mean-shift tracker creates a Back Projection, being a Probability Density Function (PDF) of the video frame, mapping each pixel or area of the current video frame to a corresponding normalised histogram value. Then, starting at the predicted location of the track, a mean-shift procedure (an iterated shifting of the centroid of the object using the first moment of values of the back projection within a bounding box of the object) is used to find a local maximum of the PDF. The predicted location of the track can simply be the same position as in the previous frame, or it could take into account known behaviour of the track so far (e.g. using a Kalman filter).
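The back projection and the mean-shift procedure can be sketched as below. The example assumes a single-channel hue image in [0, 1], a fixed-size bounding box given as (x, y, width, height) that starts inside the image, and integer box positions; the function names are hypothetical.

```python
import numpy as np

def back_project(hue_image, model_hist):
    """Map each pixel's hue (in [0, 1]) to its normalised histogram value."""
    bins = len(model_hist)
    bin_index = np.minimum((hue_image * bins).astype(int), bins - 1)
    return model_hist[bin_index]

def mean_shift(back_projection, box, max_iter=20):
    """Shift an (x, y, w, h) box towards a local maximum of the back projection."""
    x, y, w, h = box
    img_h, img_w = back_projection.shape
    ys, xs = np.mgrid[0:h, 0:w]  # pixel coordinates local to the box
    for _ in range(max_iter):
        window = back_projection[y:y + h, x:x + w]
        mass = window.sum()  # zeroth moment
        if mass == 0:
            break  # nothing resembling the model inside the box
        # First moments divided by the zeroth moment give the centroid.
        cx = (xs * window).sum() / mass
        cy = (ys * window).sum() / mass
        # Move the box so that its centre sits on the centroid.
        new_x = int(np.clip(round(x + cx - (w - 1) / 2.0), 0, img_w - w))
        new_y = int(np.clip(round(y + cy - (h - 1) / 2.0), 0, img_h - h))
        if (new_x, new_y) == (x, y):
            break  # converged
        x, y = new_x, new_y
    return x, y, w, h

# Synthetic frame: background hue 0.1, a 5x5 object of hue 0.6 at column 12, row 10.
hue_image = np.full((20, 20), 0.1)
hue_image[10:15, 12:17] = 0.6
model_hist = np.zeros(16)
model_hist[9] = 1.0  # hue 0.6 falls in bin 9 of 16
back_projection = back_project(hue_image, model_hist)
tracked_box = mean_shift(back_projection, (9, 8, 5, 5))
```

Here the starting box only partly overlaps the object, and successive shifts pull it onto the object. The procedure climbs to the nearest local maximum only, so a starting box with no overlap would not move at all.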
The mean-shift algorithm is also able to give an approximate confidence of the determined tracking, by examining the absolute strength of the PDF within the bounding box, penalised by the strength of the PDF in the immediate area outside the bounding box.
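One possible form of such a confidence measure is sketched below. The text does not specify a formula, so the choice made here (mean back-projection value inside the box minus the mean value in a surrounding ring, floored at zero) is an assumption, as are the function name and the `margin` parameter.

```python
import numpy as np

def tracking_confidence(back_projection, box, margin=2):
    """Mean PDF inside the box, penalised by the mean PDF in a ring around it."""
    x, y, w, h = box
    img_h, img_w = back_projection.shape
    inside = back_projection[y:y + h, x:x + w]
    # Enlarged box, clipped to the image, used to define the surrounding ring.
    x0, y0 = max(x - margin, 0), max(y - margin, 0)
    x1, y1 = min(x + w + margin, img_w), min(y + h + margin, img_h)
    outer = back_projection[y0:y1, x0:x1]
    ring_area = outer.size - inside.size
    ring_mean = (outer.sum() - inside.sum()) / ring_area if ring_area > 0 else 0.0
    return max(float(inside.mean() - ring_mean), 0.0)

# Back projection with a 5x5 object at column 12, row 10.
back_projection = np.zeros((20, 20))
back_projection[10:15, 12:17] = 1.0
well_placed = tracking_confidence(back_projection, (12, 10, 5, 5))
misplaced = tracking_confidence(back_projection, (10, 10, 5, 5))
```

A box that sits squarely on the object scores highest. A box that is offset scores lower, both because it contains less of the object and because part of the object falls into the penalised ring.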
The mean-shift tracker has some useful properties. The use of histograms means that the mean-shift tracker is invariant to rotation and (to a lesser degree) scale and deformation of the objects. The mean-shift tracker is also computationally efficient compared with other Visual Signature algorithms.
The mean-shift tracker, however, has a number of limitations.
The mean-shift tracker has a limited ability to deal with occlusions. The mean-shift algorithm does not adjust the histogram to what it expects to see; thus, when a track is partially occluded, there is a greater probability of losing the track. Even if the algorithm continues to track successfully, the predicted position of the track does not take into account the occlusion, which can be a problem for any subsequent process that requires an accurate bounding box for the object.
One attempt to address this issue assumes that, when the predicted bounding boxes of tracks overlap, the track with the lowest low point occludes the other tracks. Then, when calculating the histogram for the occluded track's exemplar image, those pixels that are geometrically expected to be occluded are excluded. This method potentially works well as long as the predicted bounding boxes are accurate. However, if the predicted lowest low point is incorrect, data could be excluded from the wrong histogram, resulting in even more erroneous position estimates.
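The geometric exclusion step can be sketched as follows. The example assumes axis-aligned boxes given as (x, y, width, height) in image coordinates with y increasing downwards, so that the "lowest low point" corresponds to the largest bottom edge; the function name is hypothetical.

```python
import numpy as np

def visible_mask(box, other_box):
    """Boolean mask over `box`: False where `other_box` is predicted to occlude it."""
    x, y, w, h = box
    ox, oy, ow, oh = other_box
    mask = np.ones((h, w), dtype=bool)
    # The track whose bottom edge is lower in the image is assumed to be in front.
    if oy + oh <= y + h:
        return mask  # the other track is not in front, so nothing is excluded
    left, right = max(x, ox), min(x + w, ox + ow)
    top, bottom = max(y, oy), min(y + h, oy + oh)
    if left < right and top < bottom:
        mask[top - y:bottom - y, left - x:right - x] = False
    return mask

box_a = (0, 0, 4, 4)  # bottom edge at y = 4
box_b = (2, 2, 4, 4)  # bottom edge at y = 6, so assumed to occlude box_a
mask_a = visible_mask(box_a, box_b)
mask_b = visible_mask(box_b, box_a)
```

Only the pixels where the mask is True would then contribute to the occluded track's histogram. If the bottom edges are predicted wrongly, the mask is applied to the wrong track, which is the failure mode noted above.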
Another significant limitation of the mean-shift tracker is that if the histogram peaks of an object also appear in other nearby objects or in background areas of the image, the algorithm can incorrectly identify areas of the track.
A simple way to avoid including background pixels is to centre-weight the histogram data. A common improvement is to exclude or penalise nearby background areas, defined as the area immediately outside the bounding box or the foreground area associated with the track, when creating histograms and/or back projections.
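Centre-weighting can be illustrated with the sketch below, which weights each pixel of a hue patch by a kernel that falls off with distance from the patch centre. The particular kernel (one minus the normalised squared distance, clipped at zero) and the function name are assumptions chosen for the example.

```python
import numpy as np

def centre_weighted_histogram(hue_patch, bins=16):
    """Normalised hue histogram in which central pixels count more than edge pixels."""
    h, w = hue_patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Normalised offsets from the patch centre along each axis.
    ny = (ys - (h - 1) / 2.0) / (h / 2.0)
    nx = (xs - (w - 1) / 2.0) / (w / 2.0)
    weights = np.clip(1.0 - (nx ** 2 + ny ** 2), 0.0, None)
    hist, _ = np.histogram(hue_patch, bins=bins, range=(0.0, 1.0), weights=weights)
    total = hist.sum()
    return hist / total if total > 0 else hist

# A 9x9 patch: object hue 0.6 in the central 5x5, background hue 0.1 around it.
hue_patch = np.full((9, 9), 0.1)
hue_patch[2:7, 2:7] = 0.6
weighted_hist = centre_weighted_histogram(hue_patch)
unweighted_object_share = 25.0 / 81.0
```

In this toy patch the object occupies under a third of the pixels, yet it receives most of the weighted histogram mass, because the background pixels lie near the edges where the weights are small.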
One approach to dealing with nearby objects is to assign a probability to each pixel of an ambiguous region, using the relative histogram strengths of the candidate exemplar images. However, the mean-shift step will still be prone to mistakes if there are substantial similarities in appearances between the objects.
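The per-pixel probability assignment can be sketched as below. The example assumes hue values in [0, 1] and one normalised model histogram per candidate track; splitting a pixel evenly when no track explains it is an assumption made for the example, and the function name is hypothetical.

```python
import numpy as np

def ownership_probabilities(hue_region, model_hists, eps=1e-9):
    """Per-pixel probability of belonging to each track; shape (tracks, H, W)."""
    bins = len(model_hists[0])
    bin_index = np.minimum((hue_region * bins).astype(int), bins - 1)
    # Histogram strength of each pixel under each candidate track's model.
    strengths = np.stack([hist[bin_index] for hist in model_hists])
    total = strengths.sum(axis=0)
    # Where no model explains the pixel, share it evenly (assumed behaviour).
    uniform = np.full_like(strengths, 1.0 / len(model_hists))
    return np.where(total > eps, strengths / np.maximum(total, eps), uniform)

# Track A is mostly hue 0.1 with some hue 0.6; track B is entirely hue 0.6.
hist_a = np.zeros(16)
hist_a[1], hist_a[9] = 0.75, 0.25
hist_b = np.zeros(16)
hist_b[9] = 1.0
region = np.array([[0.1, 0.6]])
probs = ownership_probabilities(region, [hist_a, hist_b])
```

The pixel of hue 0.1 is assigned wholly to track A, while the pixel of hue 0.6 is split 0.2 to 0.8 between A and B. The more the two histograms overlap, the closer such splits come to even, which is the weakness noted above.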
There is a need for a tracker that tracks objects more robustly when other visually similar objects are nearby.