The present invention concerns a system and method for tracking moving objects in a sequence of video images and, in particular, a system that represents the moving objects in terms of layers and uses the layered representation to track the objects.
Many methods have been proposed to accurately track moving objects in a sequence of two-dimensional images. Most of these methods can track moving objects only when the motion conforms to predefined conditions. For example, change-based trackers ignore any information concerning the appearance of the object in the image and thus have difficulty dealing with moving objects that overlap or come close to overlapping in the sequence of images. Template-based image tracking systems, such as that disclosed in the article by G. Hager et al. entitled "Real-time tracking of image regions with changes in geometry and illumination," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 403-410, 1996, typically update only motion. The templates used by these systems can drift off or become attached to other objects of similar appearance. Some template trackers, such as that disclosed in the article by M. J. Black et al. entitled "Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion," Proceedings of the Fifth International Conference on Computer Vision, ICCV '95, pp. 374-381, 1995, use parametric motion (affine, similarity, etc.) to update both the motion and the shape of the template. Because, however, there is no explicit updating of template ownership, drift may still occur. A multiple-hypothesis tracking method disclosed, for example, in an article by I. J. Cox et al. entitled "An efficient implementation of Reid's multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 2, pp. 138-150, February 1996, solves some of these problems but only when the image sequence is processed off-line in a batch mode. In addition, the computational complexity of these algorithms limits their state representations to contain only motion information.
The present invention is embodied in a system that tracks one or more moving objects in a sequence of video images. The tracking system employs a dynamic layer representation to represent the objects that are being tracked. This tracking system incrementally estimates the layers in the sequence of video images.
According to one aspect of the invention, the system concurrently estimates three components of the dynamic layer representation (layer segmentation, motion, and appearance) over time in a maximum a posteriori (MAP) framework. In order to enforce a global shape constraint and to maintain the layer segmentation over time, the subject invention imposes a prior constraint on parametric segmentation. In addition, the system uses a generalized Expectation-Maximization (EM) algorithm to compute an optimal solution.
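The alternating estimation described above can be illustrated with a deliberately small one-dimensional sketch. It is not the claimed implementation: the single foreground layer, the scalar appearance, the Gaussian-shaped segmentation prior, the integer search over positions, and every name and parameter value (`track_frame`, `width`, `bg_weight`, `sigma_obs`, `sigma_app`, `search`) are assumptions made only for illustration. Each pass updates segmentation, then motion, then appearance, holding the other two fixed, which is the pattern of a generalized EM iteration.

```python
import math


def gauss(residual, sigma):
    """Unnormalized Gaussian likelihood of an intensity residual."""
    return math.exp(-0.5 * (residual / sigma) ** 2)


def shape_prior(x, center, width, floor=0.01):
    """Assumed parametric segmentation prior: a Gaussian bump plus a small floor."""
    return floor + math.exp(-0.5 * ((x - center) / width) ** 2)


def track_frame(frame, center, appearance, *, width=1.5, bg_value=0.0,
                bg_weight=0.5, sigma_obs=0.2, sigma_app=0.1,
                search=3, iterations=3):
    """Update one foreground layer (position, scalar appearance) for one frame."""
    prior_app = appearance  # previous appearance acts as the prior mean
    xs = range(len(frame))
    own = []
    for _ in range(iterations):
        # 1) Segmentation: posterior probability that each pixel is foreground.
        own = []
        for x in xs:
            fg = shape_prior(x, center, width) * gauss(frame[x] - appearance, sigma_obs)
            bg = bg_weight * gauss(frame[x] - bg_value, sigma_obs)
            own.append(fg / (fg + bg))

        # 2) Motion: choose the candidate position that maximizes the expected
        #    log segmentation prior. The appearance is a single scalar here,
        #    so the likelihood term does not depend on the position.
        def expected_log_prior(c):
            total = 0.0
            for x in xs:
                s = shape_prior(x, c, width)
                total += own[x] * math.log(s / (s + bg_weight))
                total += (1.0 - own[x]) * math.log(bg_weight / (s + bg_weight))
            return total

        candidates = range(center - search, center + search + 1)
        center = max(candidates, key=expected_log_prior)

        # 3) Appearance: MAP estimate combining the prior mean with the
        #    ownership-weighted observations.
        num = prior_app / sigma_app ** 2
        num += sum(o * v for o, v in zip(own, frame)) / sigma_obs ** 2
        den = 1.0 / sigma_app ** 2 + sum(own) / sigma_obs ** 2
        appearance = num / den
    # `own` is the segmentation computed at the start of the last pass.
    return center, appearance, own


# Usage: an object of intensity 0.9 occupies pixels 6..8; tracking starts at 4.
frame = [0.0] * 15
for x in (6, 7, 8):
    frame[x] = 0.9
center, appearance, own = track_frame(frame, center=4, appearance=1.0)
```

Because the current position is always among the motion candidates, the motion step in this sketch can never lower the expected log prior, and the appearance step is the exact maximizer for its own term. That is the sense in which the loop is a generalized, rather than a full, EM iteration.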
According to one aspect of the invention, the system uses an object state that consists of representations of motion, appearance and ownership masks. With an object state represented as a layer, maximum a posteriori (MAP) estimation in a temporally incremental mode is applied to update the object state for tracking.
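One possible way to picture such an object state is a small record holding the three parts, carried forward from one image to the next. The sketch below is only an assumption-laden illustration: the class name `LayerState`, the translation-only motion, and the helper functions `shift` and `predict` are hypothetical and are not taken from the specification. It shows only the prediction half of an incremental update, in which the previous state is moved by its motion before the next image is examined.

```python
from dataclasses import dataclass
from typing import List, Tuple

Grid = List[List[float]]


@dataclass
class LayerState:
    """Hypothetical per-object state: motion, appearance, and ownership mask."""
    velocity: Tuple[int, int]  # (dx, dy) in pixels per frame; translation only
    appearance: Grid           # per-pixel intensity of the layer
    ownership: Grid            # per-pixel probability that the pixel belongs to the layer


def shift(grid: Grid, dx: int, dy: int, fill: float = 0.0) -> Grid:
    """Translate a grid by (dx, dy), filling uncovered cells with `fill`."""
    height, width = len(grid), len(grid[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_x, src_y = x - dx, y - dy
            if 0 <= src_x < width and 0 <= src_y < height:
                out[y][x] = grid[src_y][src_x]
    return out


def predict(state: LayerState) -> LayerState:
    """Carry the state to the next frame using only the previous state."""
    dx, dy = state.velocity
    return LayerState(
        velocity=state.velocity,
        appearance=shift(state.appearance, dx, dy),
        ownership=shift(state.ownership, dx, dy),
    )


# Usage: a single owned pixel at (x=1, y=1) moving one pixel to the right.
mask = [[0.0] * 4 for _ in range(3)]
mask[1][1] = 1.0
look = [[0.0] * 4 for _ in range(3)]
look[1][1] = 0.8
state = LayerState(velocity=(1, 0), appearance=look, ownership=mask)
predicted = predict(state)
```

Keeping all three parts in one record means an incremental tracker needs only the previous state and the current image, rather than the whole sequence.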
According to another aspect of the invention, the system applies a constant appearance model across multiple images in the video stream.
According to another aspect of the invention, the system employs a parametric representation of the layer ownership.
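As a hedged illustration of what a parametric ownership representation might look like, the sketch below describes each layer's ownership by a handful of numbers instead of a free-form per-pixel mask. The rotated elliptical Gaussian, the constant background weight, and the names `shape_prior`, `ownership_priors`, `floor`, and `background` are assumptions chosen for the example; the text above does not specify the parametric form.

```python
import math


def shape_prior(x, y, cx, cy, sx, sy, theta, floor=0.01):
    """Assumed elliptical Gaussian shape prior with center, scales, and rotation."""
    dx, dy = x - cx, y - cy
    # Rotate the offset into the ellipse's own axes.
    u = math.cos(theta) * dx + math.sin(theta) * dy
    v = -math.sin(theta) * dx + math.cos(theta) * dy
    return floor + math.exp(-0.5 * ((u / sx) ** 2 + (v / sy) ** 2))


def ownership_priors(x, y, layers, background=0.5):
    """Normalized prior ownership at (x, y): background first, then each layer."""
    weights = [background] + [shape_prior(x, y, **params) for params in layers]
    total = sum(weights)
    return [w / total for w in weights]


# Usage: one layer, elongated along the x axis and centered at (5, 5).
layers = [dict(cx=5.0, cy=5.0, sx=2.0, sy=1.0, theta=0.0)]
at_center = ownership_priors(5, 5, layers)
far_away = ownership_priors(50, 50, layers)
along_major = ownership_priors(7, 5, layers)
along_minor = ownership_priors(5, 7, layers)
```

With only five parameters per layer, the prior keeps each layer compact and roughly object-shaped. The small `floor` term lets a pixel far from the center still be assigned to the layer when the image evidence is strong.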