1. Field of the Invention
The present invention is related to the field of video processing and more particularly concerns a layer-based tracking method for tracking objects within and across video images. This method may also allow improved instantiation of objects and tracking through occlusions.
2. Description of the Related Art
Video representations in terms of dynamic layers and associated algorithms have emerged as powerful tools for motion analysis and object tracking. In this form, the image is represented as a superposition of multiple independently moving regions, also referred to as layers. The decomposition into layers provides a natural way to estimate the motion of independently moving objects. For object tracking applications, a layer is represented by three entities: an appearance model, a shape model, and a motion model. Since the decomposition of an image or video into layers is unknown a priori, layer-based tracking algorithms jointly estimate the layer decomposition, appearance, shape, and motion over time.
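The three entities that represent a layer can be sketched as a simple data structure. The following is a minimal illustration only, not a representation from any particular implementation; the class and field names are assumptions chosen for clarity.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Layer:
    """Illustrative layer representation: the three entities named in
    the text. Names and array layouts are hypothetical."""
    appearance: np.ndarray  # H x W template of expected pixel intensities
    shape: np.ndarray       # H x W soft support (ownership) mask in [0, 1]
    motion: np.ndarray      # 2 x 3 parametric motion, here affine [A | t]

def make_layer(h, w):
    """Instantiate a layer with a flat appearance, full support,
    and identity motion."""
    return Layer(
        appearance=np.zeros((h, w)),
        shape=np.ones((h, w)),
        motion=np.hstack([np.eye(2), np.zeros((2, 1))]),
    )
```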
Layer-based methods use parametric motion models to match the appearance and shape model parameters of the existing layers to the observations in each new frame, and thereby update the motion, appearance and shape models. The underlying assumption of such a procedure is that the layer is rigid (up to a parametric transformation, such as affine). Thus, these methods are unable to handle complex non-rigid motions of layers, unless these motions are explicitly modeled by using separate layers for each of the articulations of an object.
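The rigidity assumption can be seen directly in the form of an affine motion model: every pixel of the layer follows one shared transform x' = Ax + t. A minimal NumPy sketch (illustrative only):

```python
import numpy as np

def affine_warp_points(points, A, t):
    """Apply the shared affine motion x' = A @ x + t to every pixel
    coordinate of a layer (points: N x 2, A: 2 x 2, t: length 2)."""
    return points @ A.T + t

# Example: a pure translation. All points move identically, which is
# exactly the rigidity assumption that non-rigid layers violate.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
A = np.eye(2)
t = np.array([2.0, -1.0])
warped = affine_warp_points(pts, A, t)
```

Because one parameter set governs the whole layer, independently articulating parts (e.g. a person's limbs) cannot be captured without dedicating a separate layer, and hence a separate parameter set, to each part.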
A key problem in layer-based trackers is that of estimating the motion parameters for the layers given their appearance and shape representations. Some existing layer-based methods use an iterative Expectation Maximization (EM) algorithm to estimate the motion parameters. In the Expectation (E) step, a parametric motion estimate is used to warp the appearance and shape models from time t−1 to the current time t. These parametrically warped appearance and shape models are then matched with the new observations to compute the layer ownerships. In the Maximization (M) step, these computed ownerships are used to refine the motion estimates. It should be noted that parametric motion constraints are used to estimate the layer ownerships. With such an E step it is desirable for either the object motion to strictly conform to the parametric motion model employed, or for the object motion to vary slowly enough from the parametric motion model that the resulting ownership estimates remain a sufficiently accurate approximation to allow correct assignment of pixels among various layers. Even if the object motion is rigid, estimating ownerships in this manner uses a pixel-by-pixel match between the parametrically warped appearance models and the observations. In such a match it is desirable for the selected appearance model to be capable of accounting for rapid within-object shade or texture variations. An example of such potential appearance variation occurs when a car that is being tracked moves through the shadow of a leafless tree. The texture of shadow on the car appears to move in the opposite direction to that of the car. Both non-rigid motion and rapid intra-object appearance variations can lead to a poor approximation for ownership estimates, which may cause the EM algorithm to undesirably lock on to local maxima and may result in tracking drift and/or tracking loss.
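The E and M steps described above can be sketched for a single layer. The sketch below makes simplifying assumptions not in the text, namely a 1-D signal, a translation-only motion model, and a flat background likelihood; it shows the structure of the iteration (warp, compute ownerships, refine motion), not any particular published algorithm.

```python
import numpy as np

def e_step(obs, appearance, shift, sigma=1.0):
    """E step: warp the appearance model by the current motion estimate,
    then compute per-pixel layer ownerships against the observation."""
    warped = np.roll(appearance, shift)
    resid = obs - warped
    lik_layer = np.exp(-0.5 * (resid / sigma) ** 2)
    lik_bg = np.full_like(obs, 0.1)          # assumed flat background model
    return lik_layer / (lik_layer + lik_bg)  # ownership in [0, 1]

def m_step(obs, appearance, ownership, search=3):
    """M step: refine the translation by maximizing the
    ownership-weighted match over a small search window."""
    best_shift, best_score = 0, -np.inf
    for s in range(-search, search + 1):
        warped = np.roll(appearance, s)
        score = -np.sum(ownership * (obs - warped) ** 2)
        if score > best_score:
            best_shift, best_score = s, score
    return best_shift

appearance = np.array([0., 0., 5., 5., 0., 0., 0., 0.])
obs = np.roll(appearance, 2)                 # the object moved 2 pixels
own = e_step(obs, appearance, shift=0)
shift = m_step(obs, appearance, own)         # recovers the translation
```

Note that the ownerships are computed under the current (here, stale) motion estimate; when the true motion departs from the parametric model, or the appearance changes rapidly within the object, these ownerships become inaccurate, which is the failure mode the text describes.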
Some template trackers use parametric motion (affine, similarity, etc.) to update both the motion and the shape of the template. However, drift may still occur in these models because there is no explicit updating of template ownership. Alternatively, the appearance model can be chosen to be the previous frame, but such a model is susceptible to drift near occlusions.
Some methods use global statistics, such as color histograms, instead of templates. Because these methods do not enforce pixel-wise appearance constraints, they are robust to rapid shape and appearance changes. However, histograms are relatively weak appearance representations. Therefore, the tracking of objects can drift near occlusions or when nearby regions have statistics similar to those of the object being tracked.
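The trade-off can be illustrated with a simple histogram comparison. The sketch below uses the Bhattacharyya coefficient, one common choice for comparing normalized histograms (an assumption; the text does not name a specific similarity measure). Because the histogram discards the spatial arrangement of pixels, it is insensitive to shape change but cannot distinguish regions with similar statistics.

```python
import numpy as np

def histogram(pixels, bins=8):
    """Normalized intensity histogram over [0, 256)."""
    h, _ = np.histogram(pixels, bins=bins, range=(0, 256))
    return h / h.sum()

def bhattacharyya(p, q):
    """Similarity of two normalized histograms, in [0, 1]."""
    return np.sum(np.sqrt(p * q))

obj = np.array([10, 12, 200, 210, 205])    # tracked region intensities
near = np.array([11, 14, 198, 215, 207])   # nearby region, similar statistics
far = np.array([100, 110, 120, 130, 140])  # region with different statistics
sim_near = bhattacharyya(histogram(obj), histogram(near))
sim_far = bhattacharyya(histogram(obj), histogram(far))
```

Here `sim_near` is close to 1 even though the two regions are distinct objects, which is precisely how histogram-based trackers can drift onto a nearby region with similar statistics.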
The motion model used within each layer is often a single two-dimensional affine model. Such methods may model rigid objects reasonably well, but often have difficulties with non-rigid objects. For non-rigid objects, such as people, proposals include a further decomposition of the tracked object into multiple layers (arms, legs and torso) to account for the independently moving body parts. However, the complexity of the problem increases linearly with the number of layers.
As none of these approaches fully addresses the problems of providing an accurate tracking system that remains robust even in cases such as partial or complete occlusion, including objects passing one another, and the tracking of non-rigid objects, there is a need for such a method and for video processing techniques supporting it. Additionally, difficulties remain in the instantiation of layers, particularly in dealing with issues such as shadow removal and distinguishing overlapping objects. Another area in which improvement is desirable is the tracking of objects between multiple cameras, including the identification of objects in multiple simultaneous images taken from differing viewpoints.