Automated visual tracking systems are often used to track a target object appearing in a series of image frames. In general, once a target object is identified, the tracking system determines the target object's position in each successive image frame by distinguishing the target object from background and other non-target image data. Such tracking systems often utilize a motion estimation algorithm to predict movement of the target object's position in a new (current) image frame by analyzing movement patterns of the target object in two or more image frames preceding the new frame.
Although not always described as such, conventional motion estimation and tracking systems embody some form of appearance model that is utilized to identify the target object in each image frame. In general, the appearance model is a description of the target object that can be used by the motion estimation/tracking system to distinguish the target object from non-target image data surrounding the target object in each image frame. As the target object changes location, the motion estimation/tracking system identifies each new location by identifying a region of the new frame that satisfies the previously established description provided by the appearance model.
One of the main factors that limits the performance of motion estimation and tracking systems is the failure of the appearance model to adapt to target object appearance changes. The image conveyed by a three-dimensional (3D) target object located in 3D space onto a two-dimensional image frame is typically subjected to image deformations caused by relative displacement between the target object and the image frame generating device (e.g., camera). For example, the size of the target object typically grows larger or smaller when the distance between the target object and the camera changes. Similarly, the shape of the target object and/or the light reflected from it typically changes due to changes in orientation of the target object relative to the camera (e.g., rotation or translation of the target object or the camera). In addition, image distortion occurs when a non-target object partially or fully occludes the target object (i.e., becomes located between the target object and the camera). Moreover, complex natural objects (i.e., objects whose appearance is subject to changes that are independent of relative displacement between the target object and the camera, such as changes in facial expression) introduce additional appearance variation that must be accounted for by the appearance model. As described in additional detail below, conventional appearance models, such as template-matching models, global statistical models, 2-frame motion estimation, and temporally filtered motion-compensated image models, fail to account for one or more of these deformations, thereby causing a motion estimation and tracking system to eventually lose track of a target object.
Template-matching appearance models are pre-learned, fixed image models (“templates”) of the target object that are used by a tracking system to identify (“match”) the target object in an image frame, thereby determining its location. While such tracking systems can be reliable over short durations (i.e., while the target object's appearance remains consistent with the fixed image model), they cope poorly with the changes in target object appearance that commonly occur over longer durations in most applications. The reliability of these tracking systems may be improved by representing the variability of each pixel in the template (see B. Frey, “Filling in Scenes by Propagating Probabilities Through Layers into Appearance Models”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume I, pages 185–192, Hilton Head, June 2000). However, a learning stage is required prior to tracking in which the variance of the brightness of the image at each pixel over the training image data is estimated.
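By way of illustration only, fixed-template matching can be sketched as an exhaustive search for the image region that minimizes a sum of squared brightness differences (SSD) against the template. The SSD criterion, the function names `ssd` and `match_template`, and the toy pixel values are assumptions chosen for this sketch; they are not taken from any of the cited systems.

```python
def ssd(patch, template):
    """Sum of squared brightness differences between two equal-sized patches."""
    return sum(
        (p - t) ** 2
        for patch_row, template_row in zip(patch, template)
        for p, t in zip(patch_row, template_row)
    )


def match_template(frame, template):
    """Return ((row, col), score) for the top-left corner with the lowest SSD."""
    th, tw = len(template), len(template[0])
    fh, fw = len(frame), len(frame[0])
    best_pos, best_score = None, None
    for r in range(fh - th + 1):
        for c in range(fw - tw + 1):
            patch = [row[c:c + tw] for row in frame[r:r + th]]
            score = ssd(patch, template)
            if best_score is None or score < best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score


# Hypothetical 2x2 template and a frame containing an exact copy of it.
template = [[9, 1],
            [1, 9]]
frame = [[0, 0, 0, 0, 0],
         [0, 0, 9, 1, 0],
         [0, 0, 1, 9, 0],
         [0, 0, 0, 0, 0]]
position, score = match_template(frame, template)

# Same target location, but the target has dimmed (an appearance change).
dimmed_frame = [[0, 0, 0, 0, 0],
                [0, 0, 5, 1, 0],
                [0, 0, 1, 5, 0],
                [0, 0, 0, 0, 0]]
dimmed_position, dimmed_score = match_template(dimmed_frame, template)
```

In this toy case the match is exact in the first frame and the residual becomes nonzero once the target dims. Because the template is fixed, that residual cannot be reduced by adaptation, which is the limitation noted above for appearance changes over longer durations.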
The reliability of a tracking system may be enhanced with the use of subspace models of appearance (see, for example, M. J. Black and A. D. Jepson, “EigenTracking: Robust Matching and Tracking of Articulated Objects using a View-Based Representation”, International Journal of Computer Vision, 26(1):63–84, 1998). Such view-based models, usually learned with Principal Component Analysis, have the advantage of modeling variations in pose and lighting. They can also be used for search as well as incremental tracking. But they also have the disadvantage that they are object specific and they require that training occur before tracking in order to learn the subspace.
Local and global image statistics, such as color histograms, have also been used as coarse appearance models for tracking target objects (see, for example, S. Birchfield, “Elliptical Head Tracking Using Intensity Gradients and Color Histograms”, Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 232–237, Santa Barbara, June 1998). These appearance models offer robustness when image distortions and occlusions are encountered, are fast to learn, and can be used for searching as well as tracking. However, global statistical descriptions lack a spatial structure of appearance; that is, they cannot distinguish a large number of similarly colored pixels (e.g., a large proportion of red and blue pixels) from the spatial relationship of the colored pixels (e.g., a group of red pixels vertically positioned over a group of blue pixels associated with an image of a red shirt located over blue pants). This lack of expressiveness limits the ability of global statistical descriptions to accurately register the appearance model to the target object in many cases. Additionally, these coarse appearance models can fail to accurately track objects in regions of interest that share similar statistics with nearby regions.
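The loss of spatial structure can be seen in a minimal sketch, in which two toy images with opposite layouts yield identical global color histograms. The helper `color_histogram` and the string-valued pixels are assumptions made purely for illustration.

```python
from collections import Counter


def color_histogram(image):
    """Count how often each color occurs, ignoring where it occurs."""
    return Counter(pixel for row in image for pixel in row)


# Red shirt over blue pants, and the vertically flipped arrangement.
shirt_over_pants = [["red", "red"],
                    ["blue", "blue"]]
pants_over_shirt = [["blue", "blue"],
                    ["red", "red"]]

same_histogram = color_histogram(shirt_over_pants) == color_histogram(pants_over_shirt)
same_layout = shirt_over_pants == pants_over_shirt
```

Here `same_histogram` is true while `same_layout` is false: the histogram alone cannot tell the two arrangements apart, so it cannot be used to register the model precisely to the target.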
Motion-based tracking methods integrate motion estimates through time. For 2-frame motion estimation, motion is computed between each consecutive pair of frames. Consequently, the only model of appearance used by the motion-based tracking system is the appearance of the target object within the region of interest in the last frame. As a result, errors in this method can quickly accumulate over time. Although the appearance model in 2-frame motion estimation is able to adapt rapidly to appearance changes, it often drifts away from the target object when the target object changes appearance rapidly. As a result, the region of interest often slides off the target object and onto the background or another object. This is especially problematic when the motions of the target object and background are similar.
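The accumulation of error can be illustrated with a simplified one-dimensional sketch in which each pairwise motion estimate carries a small constant error. The constant bias of 0.25 pixels per frame and the function name `integrate_motion` are hypothetical values chosen for illustration; real estimation errors are neither constant nor known.

```python
def integrate_motion(displacements, start=0.0):
    """Integrate per-frame displacements into a position track."""
    position = start
    track = []
    for d in displacements:
        position += d
        track.append(position)
    return track


# The target truly moves 1 pixel per frame for 8 frames.
true_motion = [1.0] * 8

# Hypothetical constant error in every 2-frame motion estimate.
bias = 0.25
estimated_motion = [m + bias for m in true_motion]

true_track = integrate_motion(true_motion)
tracked = integrate_motion(estimated_motion)

# Position error after each frame; it grows by `bias` every frame.
drift = [t - g for t, g in zip(tracked, true_track)]
```

Because nothing other than the previous frame anchors the estimate, the position error in this sketch grows linearly with the number of frames and reaches 2 pixels after 8 frames.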
Motion-based tracking methods have been improved by accumulating an adaptive appearance model through time. Indeed, optimal motion estimation can be formulated as the estimation of both motion and appearance simultaneously (see Y. Weiss and D. J. Fleet, “Velocity Likelihoods in Biological and Machine Vision”, Probabilistic Models of the Brain: Perception and Neural Function, pages 81–100, Cambridge, 2001. MIT Press). In this sense, like the learned subspace approaches above, optimal motion estimation is achieved by registering the image against an appearance model that is acquired through time. For example, a stabilized image sequence can be formed from the motion estimates to learn the appearance model. This stabilized image sequence may be smoothed with a recursive low-pass filter, such as a linear IIR low-pass filter, to remove some noise and up-weight the most recent frames. However, linear filtering does not provide measures of stability that produce robustness with respect to occlusions and to local distortions of appearance.
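As a minimal sketch of such temporal filtering, a first-order recursive (IIR) low-pass filter can be applied independently to each pixel of the stabilized sequence. The update rule shown, the blending weight `alpha`, and the toy pixel values are assumptions for illustration and are not drawn from the cited work.

```python
def iir_update(model, frame, alpha):
    """Blend a new stabilized frame into the model, pixel by pixel.

    model_new = (1 - alpha) * model_old + alpha * frame, so a frame that is
    k steps old contributes with weight alpha * (1 - alpha) ** k.
    """
    return [
        [(1 - alpha) * m + alpha * f for m, f in zip(model_row, frame_row)]
        for model_row, frame_row in zip(model, frame)
    ]


# Hypothetical stabilized 1x2 frames; in the last one, the second pixel
# is covered by a dark occluder.
stabilized = [
    [[10.0, 20.0]],
    [[10.0, 20.0]],
    [[10.0, 20.0]],
    [[10.0, 0.0]],
]

alpha = 0.5
model = stabilized[0]
for frame in stabilized[1:-1]:
    model = iir_update(model, frame, alpha)
model_before_occlusion = model
model_after_occlusion = iir_update(model, stabilized[-1], alpha)
```

In this sketch a single occluded frame pulls the second pixel of the model halfway toward the occluder. The filter treats every pixel of every frame alike and carries no per-pixel measure of stability, so it cannot distinguish an occlusion from a genuine appearance change.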
What is needed is a robust, adaptive appearance model for motion-based tracking of complex natural objects. The appearance model should adapt to slowly changing appearance while maintaining a natural measure of the stability of the observed image structure during tracking. The appearance model should be robust with respect to occlusions, significant image deformations, and natural appearance changes like those occurring with facial expressions and clothing. The appearance model framework should support tracking and accurate image alignment for a variety of possible applications, such as localized feature tracking, and tracking models for which relative alignment and position are important, such as limbs of a human body.