Feature tracking is an important task in the field of digital video analysis. Digital video consists of a sequence of two-dimensional arrays, known as frames, of sampled intensity values, known as picture elements or pixels. A feature may be defined as a pattern of pixels in such a frame. Given the location of a feature of interest in one frame, the aim of feature tracking is to determine the location of that feature in other, usually subsequent, frames. That is, a trajectory for the selected feature must be found with respect to the coordinate system of the camera used to capture the sequence of frames.
The feature is typically selected through some intervention by a human user, usually by directing a pointing device at the feature displayed as part of an image on a screen. The feature may also be selected through an automatic detection process which, by using some predefined criteria, selects a feature that corresponds to such criteria.
If the selection is performed in real time, feature tracking may be used for controlling some other variable, for example the pointing direction of a sensor such as a camera, by feeding the results to a control system. In such applications, speed is of the utmost importance. Other applications use feature trajectories in post-processing tasks such as adding dynamic captions or other graphics to the video. Speed is less important in such applications.
There are two broad categories of feature tracking. A first approach, sometimes known as centroid tracking, requires the feature or object to be clearly distinguishable from the background in some sensing modality. An example of this first category is the tracking of movement of people across a fixed, known scene, in a surveillance application. In this case, a detection process may be employed independently in each frame to locate one or more objects. The task of tracking is to associate these locations into coherent trajectories for one or more of the detected objects as they interact with one another.
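The association step of this first category can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that detections arrive as (x, y) centroids and that a greedy nearest-neighbour rule with a distance gate is adequate; the function name `associate` and the parameter `max_dist` are hypothetical, and practical systems may use more elaborate association schemes.

```python
import math

def associate(tracks, detections, max_dist=20.0):
    """Greedily link each track's last position to the nearest unused detection.

    tracks: dict mapping track id -> list of (x, y) positions so far.
    detections: list of (x, y) centroids found independently in the new frame.
    Returns the set of detection indices left unassigned (possible new objects).
    """
    unused = set(range(len(detections)))
    assigned = set()
    # Enumerate every track/detection pair, closest pairs first.
    pairs = sorted(
        (math.dist(path[-1], det), tid, j)
        for tid, path in tracks.items()
        for j, det in enumerate(detections)
    )
    for dist, tid, j in pairs:
        if dist > max_dist:
            break  # remaining pairs are too far apart to be the same object
        if tid in assigned or j not in unused:
            continue  # track or detection has already been matched
        tracks[tid].append(detections[j])
        assigned.add(tid)
        unused.discard(j)
    return unused
```

Tracks that receive no detection within the gate are simply not extended in that frame, and unassigned detections are returned so that the caller can decide whether to start new trajectories.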
The second category may be referred to as motion-based or correlation tracking. In this case there is no separate detection process, and the location of the feature in the current frame must be found by reference to its position in the previous frame. This is a more general category with wider application, since there are fewer restrictions on the nature of the scene. The present disclosure falls into this category.
A critical step in the second approach is motion estimation, in which a region is sought in the current frame that is most similar to the region surrounding the feature in the previous frame. There exist many approaches to motion estimation, including search and match, optical flow, and fast correlation, among others, and all are potentially applicable to motion-based tracking. Because these methods have various limitations in terms of speed and reliability, many systems use some form of predictive tracking, whereby the trajectory over previous frames is extrapolated to predict the location of the feature in the current frame. If the trajectory is accurate, only a small correction to the predicted position need be found by the motion estimation process, potentially reducing computation and increasing reliability. The Kalman filter is an example of a predictive tracking strategy which is optimal under certain estimation error assumptions. An estimated motion vector is the “measurement” which enables correction of the current prediction. If the camera is moving between frames, and this motion may somehow be independently estimated, the camera motion may be compensated for in forming the prediction. This also helps to reduce the reliance on motion vector accuracy.
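The combination of prediction and search-and-match motion estimation can be sketched as follows. This is an illustrative Python sketch only: it assumes frames are lists of rows of integer intensities, uses a sum of absolute differences (SAD) as the similarity measure, and substitutes a simple constant-velocity extrapolation for a full Kalman filter. The names `sad`, `predict`, `track_step`, `half` and `radius` are hypothetical.

```python
def sad(frame_a, frame_b, ax, ay, bx, by, half):
    """Sum of absolute differences between two (2*half+1)-square windows."""
    total = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            total += abs(frame_a[ay + dy][ax + dx] - frame_b[by + dy][bx + dx])
    return total

def predict(trajectory):
    """Constant-velocity extrapolation from the last two known positions."""
    if len(trajectory) < 2:
        return trajectory[-1]
    (x0, y0), (x1, y1) = trajectory[-2], trajectory[-1]
    return (2 * x1 - x0, 2 * y1 - y0)

def track_step(prev, curr, trajectory, half=1, radius=1):
    """Find the small correction to the predicted position in `curr`.

    The window around the last known position in `prev` is compared with
    every candidate window within `radius` pixels of the prediction.
    Returns (cost, x, y), or None if no candidate lies inside the frame.
    """
    px, py = trajectory[-1]
    gx, gy = predict(trajectory)
    height, width = len(curr), len(curr[0])
    best = None
    for cy in range(gy - radius, gy + radius + 1):
        for cx in range(gx - radius, gx + radius + 1):
            if not (half <= cx < width - half and half <= cy < height - half):
                continue  # candidate window would fall outside the frame
            cost = sad(prev, curr, px, py, cx, cy, half)
            if best is None or cost < best[0]:
                best = (cost, cx, cy)
    return best
```

Because only a small neighbourhood of the prediction is searched, the cost per frame is governed by `radius` rather than by the frame size. Note that a best candidate is returned whenever any window fits in the frame, regardless of how poor the match is.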
The main disadvantage of motion-based tracking in complex dynamic scenes with cluttered backgrounds arises from the lack of a separate detection stage. The feature may be occluded by another object, or suddenly change course, so that predictive motion estimation fails and tracking is lost. In these cases, tracking should be halted and the system notified of the “loss-of-track” (LOT) condition. However, the nature of motion estimation is such that a vector is always returned whether or not the feature is still actually visible near the predicted position. Hence, detecting the LOT condition requires some extra checking after the correction to the predicted position.
Most commonly, the region surrounding the current feature position is compared with stored reference data in some domain, and if that region is sufficiently different, an LOT condition is flagged. The reference data is initially derived from the region around the feature in the frame in which the feature was selected. Previous approaches have either kept the reference data fixed while tracking, or updated it continuously with the contents of the previous frame. Using a “goodness of fit” measure supplied by the motion estimation itself (for example, the height of a correlation peak) as the LOT criterion is equivalent to the second approach, that is, comparing the region surrounding the current feature position with the region surrounding the feature position in the previous frame.
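The two previous approaches to reference data can be sketched in Python as follows. The sketch assumes, for illustration only, that the comparison is made directly on pixel intensities using a mean absolute difference and a fixed threshold; the class name `LotDetector` and the parameters `half`, `threshold` and `update` are hypothetical.

```python
def extract(frame, x, y, half):
    """Return the (2*half+1)-square region centred on (x, y)."""
    return [row[x - half : x + half + 1] for row in frame[y - half : y + half + 1]]

def mean_abs_diff(a, b):
    """Mean absolute intensity difference between two equal-sized regions."""
    count = len(a) * len(a[0])
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / count

class LotDetector:
    """Flags loss-of-track by comparing the current region with reference data."""

    def __init__(self, first_frame, x, y, half=1, threshold=10.0, update="fixed"):
        if update not in ("fixed", "continuous"):
            raise ValueError("update must be 'fixed' or 'continuous'")
        self.half = half
        self.threshold = threshold
        self.update = update
        # Reference data comes from the frame in which the feature was selected.
        self.reference = extract(first_frame, x, y, half)

    def check(self, frame, x, y):
        """Return True if an LOT condition is flagged at position (x, y)."""
        region = extract(frame, x, y, self.half)
        lost = mean_abs_diff(region, self.reference) > self.threshold
        if not lost and self.update == "continuous":
            # Replace the reference with the region from the frame just processed.
            self.reference = region
        return lost
```

With `update="fixed"` the reference never changes after selection, whereas with `update="continuous"` each check effectively compares the current region with the region from the previous frame.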
However, both of these approaches, which may be viewed as opposite extremes of adaptivity, have disadvantages. Keeping the reference data fixed means the associated feature tracking system is unable to adapt to gradual but superficial changes in the appearance of the feature as it, for example, rotates in depth or undergoes lighting changes. Consequently, an LOT condition will be flagged prematurely. On the other hand, continual updates of the reference data can make such a feature tracking system too robust, causing it to fail to detect an insidious but fundamental change in the feature surrounds. Such a situation often occurs when a feature is occluded by another object.