1. Field of the Invention
The present invention relates generally to computer vision, and more particularly to a method and apparatus for tracking an object moving through a sequence of images while identifying the object and changes of view of the object.
2. Description of Related Art
Different techniques have been developed for tracking an object moving through a sequence of images. The motion of an object through a sequence of images can be both rigid and articulated. An object with rigid motion moves cohesively from one position in a frame to another. An object with articulated motion, on the other hand, tends to deform as it moves between frames. For example, the motion of a hand is both articulated and rigid. Besides recognizing the motion of an object in a sequence of images, techniques have been developed for recognizing the changing appearance of an object between image frames. For example, in addition to tracking the position of a hand between image frames, such techniques seek to identify the shape of the hand. Techniques for tracking objects, therefore, attempt not only to track the object but also to recognize any change in appearance of the object between image frames.
Parameterized optical flow estimation is one method for tracking an object as it moves in a sequence of images. As disclosed by Adelson et al. in an article entitled "The Plenoptic Function and The Elements of Early Vision," published in Computational Models of Visual Processing, pp. 1-20, Boston, Mass., 1991, MIT Press (Landy et al., Editors), these techniques treat an image region containing an object as moving "stuff". Consequently, these techniques are unable to distinguish between changes in "viewpoint" or configuration (i.e., appearance) of the object and changes in "position" relative to a recording device. More specifically, these optical flow techniques represent image motion in terms of some low-order polynomial (e.g., an affine transformation). A disadvantage of optical flow techniques is that tracking may fail when the initial viewpoint of an object is used for tracking changes between frames.
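For illustration, the affine case mentioned above is commonly written as a six-parameter model of the horizontal and vertical image velocities. The symbols u, v, x, y, and a_0 through a_5 are notation assumed here for the example and do not appear in the cited article.

```latex
\begin{aligned}
u(x, y) &= a_0 + a_1 x + a_2 y, \\
v(x, y) &= a_3 + a_4 x + a_5 y
\end{aligned}
```

Under this assumed notation, a_0 and a_3 describe translation of the region, while the remaining four parameters describe its linear deformation (e.g., scaling, rotation, and shear). The model describes how the region moves, not what the region depicts.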
Another method for tracking an object through a sequence of images is with template matching techniques. Template matching techniques give rise to a "thing" being tracked through an image sequence. These template matching techniques are typically limited to situations in which the motion of the object through the sequence of images is simple and the viewpoint of the object is either fixed or changes slowly. A disadvantage, therefore, of these template matching techniques is that if the view of the object being tracked changes significantly through the sequence of images, then the "thing" being tracked may no longer be recognizable and the tracking may fail.
Yet another method for tracking an object through a sequence of images is with three dimensional modeling techniques. Three dimensional modeling techniques tend to track rigid objects effectively. For example, three dimensional modeling works well when tracking rigid objects such as cars. However, performance of three dimensional modeling techniques degrades significantly when tracking an articulated object such as a hand because the modeling becomes computationally expensive. Another disadvantage is that it may be difficult to automatically construct a three dimensional model of the object to be tracked. An aspect of three dimensional modeling is that it encodes the structure of an object but not necessarily its appearance. This aspect of three dimensional modeling may be disadvantageous when the pertinent features of an object are not its structure but the object's texture and markings.
Besides the aforementioned methods for tracking an object through a sequence of images, a number of techniques have been used to determine the appearance of an object. These include techniques that focus on an object's structure (i.e., object-centered structural descriptions) and techniques that focus on an object's view (i.e., view-based object representations). One method for making view-based determinations of an object representation is through the use of an eigenspace. In general, an eigenspace defines a set of orthogonal basis vectors. A linear combination of these basis vectors can then be used to approximate an image. Because the basis vectors are orthogonal to each other, each basis vector adds information to the whole as defined by the value of its coefficient.
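The approximation of an image by a linear combination of orthogonal basis vectors can be sketched as follows. This is a minimal illustration only: the function names, the four-pixel "images", and the basis values are hypothetical, and the basis is assumed to be orthonormal so that each coefficient is simply a dot product.

```python
def dot(u, v):
    """Inner product of two flattened images."""
    return sum(a * b for a, b in zip(u, v))

def project(image, basis):
    """Coefficient of `image` along each (assumed orthonormal) basis vector."""
    return [dot(image, b) for b in basis]

def reconstruct(coeffs, basis):
    """Linear combination of the basis vectors weighted by `coeffs`."""
    n = len(basis[0])
    return [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(n)]

# Two orthonormal basis "images", each flattened to 4 pixels (hypothetical values).
basis = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
]
image = [3.0, 3.0, 1.0, 1.0]

coeffs = project(image, basis)        # [4.0, 2.0]
approx = reconstruct(coeffs, basis)   # [3.0, 3.0, 1.0, 1.0]
```

In this toy case the image lies entirely within the span of the two basis vectors, so the reconstruction is exact. An image with content outside that span would only be approximated.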
Eigenspaces have been used to initially locate an object in an image, as disclosed by Turk et al. in U.S. Pat. No. 5,164,992 (also published in "Face Recognition Using Eigenfaces", Proc. Computer Vision and Pattern Recognition, CVPR-91, pp. 586-591, Maui, June 1991). More specifically, Turk et al. disclose a system that uses an eigenspace to perform global searching by comparing an input image with the eigenspace at every image location. Global searching is extended by Moghaddam et al. in "Probabilistic Visual Learning For Object Detection," Proceedings of the International Conference on Computer Vision, pp. 786-793, Boston, Mass., June 1995. Moghaddam et al. extend the global search idea to include scale by matching the input at different scales using a standard eigenspace approach.
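The idea of comparing an input with an eigenspace at every location can be sketched as an exhaustive scan that scores each window by its reconstruction error. The sketch below is not the implementation of Turk et al.; it assumes a one-dimensional signal in place of an image, an orthonormal basis, and hypothetical names and values throughout.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def residual(window, basis):
    """Squared distance between `window` and its projection onto the basis.

    Assumes the basis vectors are orthonormal.
    """
    coeffs = [dot(window, b) for b in basis]
    return dot(window, window) - sum(c * c for c in coeffs)

def global_search(signal, basis):
    """Score every window position and return (best position, all errors)."""
    width = len(basis[0])
    errors = [residual(signal[i:i + width], basis)
              for i in range(len(signal) - width + 1)]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors

# A single orthonormal basis vector and a signal containing a scaled copy of it
# starting at index 3 (hypothetical values).
basis = [[0.5, 0.5, -0.5, -0.5]]
signal = [0.0, 1.0, 0.0, 2.0, 2.0, -2.0, -2.0, 0.0, 1.0]

best, errors = global_search(signal, basis)  # best == 3
```

Every window position is evaluated, which is why this kind of search becomes costly as the image grows. Repeating the scan at several scales multiplies that cost.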
In addition, many eigenspace approaches require that the object be located and cleanly segmented from the background of the image before the image can be matched with the eigenspace. This segmentation is performed so that reconstruction and recognition of the object are more accurate, since they are based on the object and not the image background. Consequently, most eigenspace approaches require that an object be located in the image, segmented from its image background, and transformed into a predetermined form before the object can be matched with an eigenspace. Initially, the predetermined form or view of an object includes its position, orientation and resolution (i.e., scale).
Some eigenspace approaches such as that disclosed by Murase et al., however, have been used to avoid rotating an image into a predetermined orientation in preparation for matching. Specifically, Murase et al. disclose such a technique in "Visual Learning and Recognition of 3-D Objects from Appearance," International Journal of Computer Vision, 14:5-24, 1995. Briefly, Murase et al. disclose the construction of an eigenspace from a training set of images that represent every possible viewpoint of an object. This multiple viewpoint eigenspace eliminates the need for orienting an object before matching it with the eigenspace. In addition, this multiple viewpoint eigenspace can be used to identify changes in view.
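One way such a multiple viewpoint eigenspace could be used to identify a view is to project the input into the eigenspace and select the training view whose coefficients lie closest. The sketch below illustrates that nearest-neighbor idea only and is not the procedure of Murase et al.; the view labels, image values, and function names are hypothetical, and the basis is assumed to be orthonormal.

```python
def dot(u, v):
    """Inner product of two flattened images."""
    return sum(a * b for a, b in zip(u, v))

def project(image, basis):
    """Coefficients of `image` in the (assumed orthonormal) eigenspace."""
    return [dot(image, b) for b in basis]

def identify_view(image, basis, view_coeffs):
    """Return the label of the training view nearest to `image` in eigenspace."""
    coeffs = project(image, basis)

    def dist2(label):
        return sum((c - t) ** 2 for c, t in zip(coeffs, view_coeffs[label]))

    return min(view_coeffs, key=dist2)

# Hypothetical eigenspace and two training views, each flattened to 4 pixels.
basis = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
]
training = {
    "front": [3.0, 3.0, 1.0, 1.0],
    "side": [1.0, 1.0, 3.0, 3.0],
}
view_coeffs = {label: project(img, basis) for label, img in training.items()}

# A slightly perturbed observation of the "front" view.
query = [2.8, 3.1, 1.2, 0.9]
label = identify_view(query, basis, view_coeffs)  # "front"
```

Because every candidate view must be represented in the training set, the accuracy of this kind of matching depends on how densely the viewpoints are sampled.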
Many of the aforementioned view-based matching systems that are used for recognizing objects are limited in certain respects. Some of these view-based systems are affected by image transformations such as translation, scaling, and rotation. Others of these view-based matching systems perform separate operations to segment an object from an image and transform the object into a predetermined form for matching with an eigenspace. Additionally, some of these methods for matching require a large set of views of the object for accurate matching. It would, therefore, be desirable to provide a method and apparatus for tracking an object in a sequence of images using a view-based representation of objects that does not require a large set of views while recognizing both changes in viewpoint and changes in position. Furthermore, it would be advantageous for this method and apparatus to simultaneously perform operations for transforming an object into its predetermined form for matching and operations for matching the object with an eigenspace.
In accordance with the invention there is provided an apparatus, and method and article of manufacture therefor, for identifying and tracking an object recorded in a sequence of images. A memory of the apparatus is used to store a set of training images. Each image in the training set of images records a different view of the object in the sequence of images. A set of basis images is generated for the set of training images stored in the memory. The generated set of basis images is used to characterize variations of the views of the object in the set of training images. Each image in the sequence of images is evaluated to identify changes in view and structure of the object while tracking the object through the sequence of images. Changes in view and structure of the object in an image in the sequence of images are identified by aligning and matching a view of the object in the image with the views of the object represented in the set of basis images.