The present invention relates to video pattern recognition. In particular, the present invention relates to tracking an object in video data.
Computer vision systems are designed to allow computer systems to extract information from image data. Examples of computer vision systems include 3-D tracking systems that track the three-dimensional movement of an object using successive frames of a video signal, stereo vision systems that build a depth map of a scene using two cameras that provide different perspectives on the scene, and 2-D scene modeling systems that attempt to build a model describing a moving object in a scene.
In 3-D tracking systems, the movement of the object is tracked by a single camera based on a strong prior model of what the object looks like. Such models are usually constructed by hand, which requires a great deal of work and makes it difficult to extend the tracking system to new objects.
Some 3-D tracking systems have relied on particle filtering, in which the possible positions of an object are described as particles. At each frame, each particle in a set of particles is scored based on the amount of alignment between the captured image and the prior model positioned at the particle. High-scoring particles are retained, while low-scoring particles are filtered out. In the next frame, the retained particles are used to propose a new particle set that is grouped around the retained particles. This new particle set is then scored. The high-scoring particles in each frame are then used to identify a sequence of positions for the object. Like other 3-D object tracking systems, particle filtering systems have not been able to learn a model for the appearance of the object. Instead, a strong prior model has been constructed by hand before the system is used for tracking.
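The score, retain, and propose cycle described above can be illustrated with a minimal sketch. It is not taken from any particular prior system: the function names `score` and `track`, the 2-D positions, the parameters `n_particles`, `keep_fraction`, and `spread`, and the initial search range are all assumptions made for illustration. In particular, `score` is a hypothetical stand-in that compares a particle to a known observed position, whereas the systems described above would score the alignment between the captured image and the hand-built prior model.

```python
import random


def score(particle, observation):
    # Hypothetical stand-in for image/model alignment: higher is better.
    # A real tracker would compare the captured image with the prior model
    # placed at the particle; here we use negative squared distance.
    dx = particle[0] - observation[0]
    dy = particle[1] - observation[1]
    return -(dx * dx + dy * dy)


def track(observations, n_particles=200, keep_fraction=0.2, spread=1.0, seed=0):
    rng = random.Random(seed)
    # Assumed initial particle set: positions scattered uniformly over a
    # fixed search range.
    particles = [
        (rng.uniform(-10.0, 10.0), rng.uniform(-10.0, 10.0))
        for _ in range(n_particles)
    ]
    n_keep = max(1, int(n_particles * keep_fraction))
    path = []
    for obs in observations:
        # Score every particle against the current frame.
        ranked = sorted(particles, key=lambda p: score(p, obs), reverse=True)
        # Retain the high-scoring particles; the rest are filtered out.
        retained = ranked[:n_keep]
        # The best particle in this frame contributes to the position sequence.
        path.append(retained[0])
        # Propose a new particle set grouped around the retained particles.
        particles = []
        for _ in range(n_particles):
            base = rng.choice(retained)
            particles.append(
                (base[0] + rng.gauss(0.0, spread), base[1] + rng.gauss(0.0, spread))
            )
    return path
```

Calling `track` with a list of per-frame `(x, y)` observations returns one estimated position per frame. The sketch keeps a fixed top fraction of particles rather than resampling in proportion to weights, which mirrors the retain-and-filter description above but is simpler than many particle filter formulations.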
In stereo vision systems, the images from two cameras are compared to each other to determine the depth position of particular portions of each image. However, such systems do not produce a generative model of the objects in the images and do not track the movement of objects in successive images.
In 2-D scene modeling, a sequence of images from a single camera is used to learn the appearance of an object as it moves relative to a background. Such systems have not performed well because learning the appearance of objects that can occlude each other is a hard problem when using a single camera.
Thus, a system is needed that improves the performance of scene modeling while allowing 3-D tracking of objects without requiring a strong prior model of the objects.