This invention relates to tracking and segmenting an object within a sequence of image frames, and more particularly to methods and apparatus for segmenting and tracking a video object which may move and deform.
When tracking an object among multiple frames of a video sequence, an enclosed boundary of the object is identified in each frame. The object is the area within the boundary. The challenge in identifying the object boundary in a given frame increases as the constraints on a trackable object are relaxed to allow tracking an object which translates, rotates or deforms. For example, tracking non-rigid 3-dimensional objects introduces complexity into the tracking process.
Once the object is identified in one frame, template matching may be used in a subsequent frame to detect translation of the object. The template typically is the object as identified in the prior frame. Deformable models are used to detect objects which translate, rotate or deform. Various methods using deformable models are described below.
Yuille et al. in “Feature Extraction from Faces Using Deformable Templates,” International Journal of Computer Vision, Vol. 8, 1992, disclose a process in which eyes and mouths in an image are identified using a model with a few parameters. For example, an eye is modeled using two parabolas and a circle radius. By changing the shape of the parabolas and the circle radius, eyes can be identified. The Yuille et al. model and other deformable models typically have encompassed only highly constrained deformations. In particular, the object has a generally known shape which may deform in some generally known manner. Processes such as an active contour model have relaxed constraints, but are only effective over a very narrow spatial range of motion. Processes like that disclosed by Yuille et al. are effective for a wider spatial range of motion, but track a very constrained type of motion. Accordingly, there is a need for a more flexible and effective object tracker, which can track more active deformations over a wider spatial range.
Active contour models, also known as snakes, have been used for adjusting image features, in particular image object boundaries. In concept, active contour models involve overlaying an elastic curve onto an image. The curve (i.e., snake) deforms itself from an initial shape to adjust to the image features. An energy minimizing function is used which adapts the curve to image features such as lines and edges. The function is guided by internal constraint forces and external image forces. The best fit is achieved by minimizing a total energy computation of the curve. In effect, continuity and smoothness constraints are imposed to control deformation of the model. An initial estimate for one frame is the derived contour of the object from a prior frame. A shortcoming of the conventional active contour model is that small changes in object position or shape from one frame to the next may cause the boundary identification to fail. In particular, rather than following the object, the estimated boundary instead may latch onto strong false edges in the background, distorting the object contour. Accordingly, there is a need for an improved method for segmenting and tracking a non-rigid 3-dimensional video object.
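By way of illustration only, the total energy of a closed, discretized contour may be sketched as follows. The weights alpha and beta, the function name snake_energy, and the callable edge_strength are hypothetical names introduced for this sketch and are not taken from the disclosure; the continuity and curvature terms follow the commonly used discrete form of the snake energy.

```python
def snake_energy(points, edge_strength, alpha=1.0, beta=1.0):
    """Total energy of a closed contour given as a list of (row, col) points.

    alpha weights the continuity (first-difference) term and beta weights the
    smoothness (second-difference) term; both are illustrative assumptions.
    edge_strength(row, col) returns the image edge magnitude at a point and
    lowers the energy, so the contour is attracted to strong edges.
    """
    n = len(points)
    total = 0.0
    for i in range(n):
        # Neighbors wrap around because the contour is closed.
        (py, px), (cy, cx), (ny, nx) = points[i - 1], points[i], points[(i + 1) % n]
        continuity = (ny - cy) ** 2 + (nx - cx) ** 2
        curvature = (py - 2 * cy + ny) ** 2 + (px - 2 * cx + nx) ** 2
        total += alpha * continuity + beta * curvature - edge_strength(cy, cx)
    return total
```

A minimizer would move each point so as to lower this value; the minimization itself is not shown.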
According to the invention, constraints on the topological changes to an active contour from one frame to the next are relaxed. The contour is derived by minimizing contour energy while also considering normalized background information and motion boundary information. The normalized background information and motion boundary information contribute to defining the object boundary propagation from one frame to the next, so that the constraints on contour topology can be relaxed.
According to one aspect of the invention, a background model is derived to distinguish foreground from a normalized background within each image frame of a sequence of image frames. Such background model is derived for a generally stable background over a sequence of image frames.
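The manner of deriving the background model is not detailed in this passage. As one possible sketch, assuming a per-pixel temporal median is an acceptable estimate for a generally stable background (an assumption of this example, as is the function name background_model), the model may be computed as:

```python
from statistics import median


def background_model(frames):
    """Per-pixel temporal median over equally sized 2-D frames (lists of rows).

    The median is an assumed estimator: a pixel that shows background in most
    frames keeps its background value even if an object passes over it briefly.
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [median(frame[r][c] for frame in frames) for c in range(cols)]
        for r in range(rows)
    ]
```

Any normalization applied to the background data is outside the scope of this sketch.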
According to another aspect of the invention, by comparing successive image frames, a forward frame difference and a backward frame difference are derived for a given frame. Combining the forward frame difference and the backward frame difference removes double image errors and results in a motion boundary for the given image frame.
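The combination of the two frame differences may be sketched as follows. The use of the per-pixel minimum of the two absolute differences, the threshold parameter, and the function name motion_boundary are assumptions made for this example; the passage above states only that the differences are combined.

```python
def motion_boundary(prev, cur, nxt, threshold):
    """Boolean mask of pixels that changed in BOTH adjacent frame pairs.

    backward = |cur - prev| and forward = |nxt - cur|. A single difference
    marks the object at its old and new positions (the double image); taking
    the minimum keeps only pixels that changed in both, i.e. the position of
    the object in the current frame.
    """
    rows, cols = len(cur), len(cur[0])
    mask = []
    for r in range(rows):
        row = []
        for c in range(cols):
            backward = abs(cur[r][c] - prev[r][c])
            forward = abs(nxt[r][c] - cur[r][c])
            row.append(min(backward, forward) > threshold)
        mask.append(row)
    return mask
```

For a bright pixel moving one position per frame, only the position occupied in the current frame is marked.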
According to another aspect of this invention, the image data for the given image frame are allocated among three groups. In one group are image data which are part of the derived motion boundary, along with image data which differ by at least a threshold amount from a corresponding point among the normalized background data. In another group are image data which closely correspond to the normalized background image data. A third group includes the remaining pixels (i.e., pixels which are not part of the motion boundary, which do not closely correspond to the normalized background image data, and which do not differ from such normalized background data by the threshold amount). In some embodiments morphological filtering is performed on the first group of image data, with discarded image data placed in the third group.
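The allocation among the three groups may be sketched as follows. The two thresholds fg_threshold and bg_threshold, the constant names, and the function name classify are hypothetical; the optional morphological filtering is omitted.

```python
FIRST_GROUP, SECOND_GROUP, THIRD_GROUP = 1, 2, 3


def classify(frame, background, motion_mask, fg_threshold, bg_threshold):
    """Label each pixel with group 1, 2 or 3.

    Group 1: on the motion boundary, or differing from the background by at
             least fg_threshold.
    Group 2: within bg_threshold of the background (closely corresponding).
    Group 3: everything else.
    Both thresholds are illustrative assumptions with bg_threshold < fg_threshold.
    """
    labels = []
    for r, row in enumerate(frame):
        out = []
        for c, value in enumerate(row):
            diff = abs(value - background[r][c])
            if motion_mask[r][c] or diff >= fg_threshold:
                out.append(FIRST_GROUP)
            elif diff <= bg_threshold:
                out.append(SECOND_GROUP)
            else:
                out.append(THIRD_GROUP)
        labels.append(out)
    return labels
```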
To derive an object boundary estimate for a tracked object within the given image frame, an initial estimate is the derived object boundary from the preceding image frame. Such initial estimate is adjusted based on object tracking for the current image frame. In some embodiments other adjustments also are introduced, such as for detecting local affine deformations. The revised estimate of the object boundary then is processed based on the background model and the motion boundary information, along with refining using an active contour model.
In a preferred embodiment the background model is derived as a preprocessing step for the entire sequence of image frames. For a given image frame, object tracking and initial boundary estimation processes are performed. The motion boundary derivation then is performed along with an application of the active contour model.
According to another aspect of this invention, the result of the preprocessing and motion boundary processing is a revised estimate of the object boundary. Such revised estimate is processed. Starting from a first point of the revised estimate, a next point on the object boundary is derived by determining whether the adjacent point on the revised object boundary is in the first group, second group or third group of image data. If in the first group, then the contour boundary propagates outward by one pixel (so as to inflate the object boundary at the corresponding location on the revised object boundary). If in the second group, then the contour boundary propagates inward by one pixel (so as to deflate the object boundary at the corresponding location on the revised object boundary). Successive iterations are performed going around the object boundary with the points on the object boundary propagating inward or outward by one pixel according to which group contains the image data point.
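One iteration of the inflate and deflate rule may be sketched as follows. The representation of the boundary as a list of (row, col) points with a matching list of one-pixel outward steps, the lookup of the group at the boundary point itself, and the function name propagate_once are assumptions of this example. Points in the third group are left in place here; their handling depends on edge information and is not part of this sketch.

```python
def propagate_once(points, outward_steps, labels):
    """Move each boundary point one pixel according to its group label.

    points:        list of (row, col) boundary points
    outward_steps: list of (d_row, d_col) one-pixel steps pointing away from
                   the object interior at each point (assumed to be supplied)
    labels:        2-D grid of group labels 1, 2 or 3
    """
    moved = []
    for (r, c), (dr, dc) in zip(points, outward_steps):
        group = labels[r][c]
        if group == 1:
            moved.append((r + dr, c + dc))  # inflate: step outward
        elif group == 2:
            moved.append((r - dr, c - dc))  # deflate: step inward
        else:
            moved.append((r, c))            # third group: unchanged in this sketch
    return moved
```

Repeated calls, with the outward steps recomputed for the moved points, correspond to the successive iterations around the object boundary.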
When a point on the boundary is found in the third group, the object boundary either stays the same for such point or propagates inward depending on image edge information. Edge energy is derived for the image frame to derive a representation of image edges. When an angle between the point on the object boundary and the derived image edge is larger than π/4 and the edge energy at such point is very low (e.g., less than 1% of the maximum along the image edge), then the object boundary propagates inward at such data point. Otherwise the object boundary stands still at such image point.
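The decision for a point in the third group may be sketched as a simple predicate. The function and parameter names are hypothetical, the angle is assumed to be given in radians, and the manner of measuring the angle and the edge energy is not shown.

```python
import math


def third_group_moves_inward(angle, edge_energy, max_edge_energy, low_fraction=0.01):
    """Return True when a third-group boundary point should propagate inward.

    angle:           angle in radians between the boundary at the point and
                     the derived image edge (measurement not shown)
    edge_energy:     edge energy at the point
    max_edge_energy: maximum edge energy along the image edge
    low_fraction:    fraction regarded as very low; 0.01 follows the 1%
                     example given in the text
    """
    return angle > math.pi / 4 and edge_energy < low_fraction * max_edge_energy
```

When the predicate is False, the boundary stands still at that point.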
According to another aspect of the invention, when the revised object boundary is divided into multiple regions, to avoid losing regions, joint pixels are added to form one composite region. Specifically, joint pixels are added where a smaller region is located within the revised object boundary. Such region otherwise would be dropped during processing of the revised estimate of the object boundary. In one embodiment the criteria for adding joint pixels include: a pair of regions within the revised estimate of the object boundary with the narrowest gap are connected first; all joint pixels are to occur within the object boundary and cross the narrowest gap (a maximum gap length may be specified); and the joint pixels are placed in the first group of image data for purposes of processing the revised estimate of the object boundary.
According to another aspect of the invention, an active contour model is applied to the propagated object boundary to refine the object boundary as a final estimate for the given image frame.