The present invention is related to the field of digital video processing and analysis, and more specifically, to a method and apparatus for tracking a video object in an ordered sequence of two-dimensional images, including obtaining the trajectory of the video object in the sequence of images.
Reference is made to a patent application entitled Apparatus and Method For Collaborative Dynamic Video Annotation, filed on even date herewith and assigned to the same assignee as the present application, the disclosure of which is herein incorporated by reference to the extent it is not incompatible with the present application.
A situation can arise wherein two or more users wish to communicate in reference to a common object, for example, in reference to a video. An example of this could be where a soccer team coach wishes to consult with a colleague to seek advice. The soccer team coach might wish to show a taped video of a game and ask the colleague to explain, using the video, why one team failed to score in a given attack situation. In addition, the coach might wish to record this discussion and show it later to other coaches to get more opinions.
In another scenario, a student could be taking a training course being given at a remote location from where a course instructor is located. It may be that the student cannot understand a procedure being taught in the course. The student can then call the instructor over the Internet phone to find out how such a procedure should be performed. The instructor can first browse through the training video together with the student to find the clip where the difficulty can be identified. The student may then ask various questions of the instructor about that procedure. For example, the instructor may then decide to show the student another video, which offers more detailed information. The instructor may then annotate this video using collaborative video annotation tools to explain to the student how this procedure should be performed.
Existing methods for object tracking can be broadly classified as feature-point (or token) tracking, boundary (and thus shape) tracking, and region tracking. Kalman filtering and template matching techniques are commonly applied for tracking feature points. A review of other 2-D and 3-D feature-point tracking methods is found in Y.-S. Yao and R. Chellappa, "Tracking a dynamic set of feature points," IEEE Trans. Image Processing, vol. 4, pp. 1382-1395, October 1995.
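By way of illustration only (and not taken from the referenced work), Kalman filtering for feature-point tracking can be sketched as a constant-velocity filter over a state [x, y, vx, vy]; the class name and noise settings below are our own assumptions:

```python
import numpy as np

# Illustrative sketch: a constant-velocity Kalman filter tracking one
# 2-D feature point. State = [x, y, vx, vy]; only position is observed.
class PointTracker:
    def __init__(self, x, y, dt=1.0):
        self.state = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 10.0            # state covariance (assumed)
        self.F = np.eye(4)                   # constant-velocity transition
        self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.eye(2, 4)                # observation: position only
        self.Q = np.eye(4) * 0.01            # process noise (assumed)
        self.R = np.eye(2) * 1.0             # measurement noise (assumed)

    def predict(self):
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]                # predicted position

    def update(self, meas):
        y = np.asarray(meas) - self.H @ self.state          # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)            # Kalman gain
        self.state = self.state + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

# Track a point moving one pixel per frame along x, then predict frame 6.
trk = PointTracker(0.0, 0.0)
for t in range(1, 6):
    trk.predict()
    trk.update([float(t), 0.0])
pred = trk.predict()
print(np.round(pred, 1))
```

The filter's predict step is precisely the kind of prediction mechanism that the boundary tracking methods discussed below lack.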
However, all these feature-point tracking methods fail to track occluded points unless they move linearly. Boundary tracking has been studied using a locally deformable active contour model (snakes). See, for example, F. Leymarie and M. Levine, "Tracking deformable objects in the plane using an active contour model," IEEE Trans. Pattern Anal. Mach. Intel., vol. 15, pp. 617-634, June 1993; K. Fujimura, N. Yokoya, and K. Yamamoto, "Motion tracking of deformable objects by active contour models using multiscale dynamic programming," J. of Visual Comm. and Image Representation, vol. 4, pp. 382-391, December 1993; and M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: active contour models," Int. Journal of Comp. Vision, vol. 1, no. 4, pp. 321-331, 1988. In addition, boundary tracking has been studied using a locally deformable template model. See, for example, C. Kervrann and F. Heitz, "Robust tracking of stochastic deformable models in long image sequences," in IEEE Int. Conf. Image Proc., (Austin, Tex.), November 1994.
These boundary tracking methods, however, lack the ability to track rapidly moving objects because they have no prediction mechanism to initialize the snake. In order to handle large motion, region-based motion prediction has been employed to guide the snake into a subsequent frame. See, for example, B. Bascle et al., "Tracking complex primitives in an image sequence," in Int. Conf. Pattern Recog., (Israel), pp. 426-431, October 1994; and B. Bascle and R. Deriche, "Region tracking through image sequences," in Int. Conf. Computer Vision, pp. 302-307, 1995.
Nevertheless, the prediction relies on a global, that is, not locally varying, motion assumption, and thus may not be satisfactory when there are local deformations within the boundary and the image background is very busy. Region tracking methods can be categorized into those that employ global deformation models and those that allow for local deformations. See, for example, Y. Y. Tang and C. Y. Suen, "New algorithms for fixed and elastic geometric transformation models," IP, vol. 3, pp. 355-366, July 1994.
A method for region tracking using a single affine motion within each object has been proposed that assigns a second-order temporal trajectory to each affine model parameter. See, for example, F. G. Meyer and P. Bouthemy, "Region-based tracking using affine motion models in long image sequences," CVGIP: Image Understanding, vol. 60, pp. 119-140, September 1994. Bascle et al. propose to combine region and boundary tracking. See the aforementioned article by B. Bascle and R. Deriche, "Region tracking through image sequences," in Int. Conf. Computer Vision, pp. 302-307, 1995.
They use a region-based deformable model, which relies on texture matching for its optimization, that allows the tracking approach to handle relatively large displacements, cluttered images, and occlusions. Moscheni et al. suggest using spatio-temporal segmentation for every frame pair, followed by temporal linkage of an object, to track coherently moving regions of the images. See F. Moscheni, F. Dufaux, and M. Kunt, "Object tracking based on temporal and spatial information," in IEEE Int. Conf. Acoust., Speech, and Signal Proc., (Atlanta, Ga.), May 1996.
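The second-order temporal trajectory idea mentioned above can be illustrated as fitting a quadratic in time to each affine parameter and extrapolating it to the next frame; the sketch below uses synthetic data and is not drawn from the cited work:

```python
import numpy as np

# Illustrative sketch: model one affine parameter's evolution over time
# as a second-order (quadratic) trajectory, then predict its value in
# the next frame. The parameter values here are synthetic.
frames = np.arange(6)
param = frames ** 2 / 10.0                  # observed parameter, frames 0..5
coeffs = np.polyfit(frames, param, deg=2)   # fit the quadratic trajectory
pred = np.polyval(coeffs, 6)                # extrapolate to frame 6
print(round(pred, 3))  # -> 3.6
```

In a tracker, the extrapolated parameters would initialize the affine model for the next frame before refinement against the image data.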
Additional background information is provided in the following:
U.S. Pat. No. 5,280,530, entitled METHOD AND APPARATUS FOR TRACKING A MOVING OBJECT and issued Jan. 18, 1994 in the name of Trew et al., discloses a method of tracking a moving object in a scene, for example the face of a person in videophone applications. The method comprises forming an initial template of the face, extracting a mask outlining the face, dividing the template into a plurality (for example, sixteen) of sub-templates, searching the next frame to find a match with the template, searching the next frame to find a match with each of the sub-templates, determining the displacements of each of the sub-templates with respect to the template, using the displacements to determine affine transform coefficients, and performing an affine transform to produce an updated template and updated mask.
U.S. Pat. No. 5,625,715, entitled METHOD AND APPARATUS FOR ENCODING PICTURES INCLUDING A MOVING OBJECT, issued Apr. 29, 1997 in the name of Trew et al., discloses a method of encoding a sequence of images including a moving object. The method comprises forming an initial template, extracting a mask outlining the object, dividing the template into a plurality (for example, sixteen) of sub-templates, searching the next frame to find a match with the template, searching the next frame to find a match with each of the sub-templates, determining the displacements of each of the sub-templates with respect to the template, using the displacements to determine affine transform coefficients, and performing an affine transform to produce an updated template and updated mask. Encoding is performed at a higher resolution for portions within the outline than for portions outside the outline.
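One step common to both Trew et al. patents, determining affine transform coefficients from the sub-template displacements, can be sketched as a least-squares fit; the function name, matrix layout, and synthetic data below are our own assumptions, not taken from the patents:

```python
import numpy as np

# Illustrative sketch: given where matching found each sub-template's
# centre in the next frame, estimate the six affine coefficients
# (a11, a12, a21, a22, tx, ty) by least squares.
def estimate_affine(src_pts, dst_pts):
    """Solve dst ~= A @ src + t for a 2x2 matrix A and translation t."""
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    n = len(src)
    # Two equations per point: dst_x = a11*x + a12*y + tx, and similarly y.
    M = np.zeros((2 * n, 6))
    M[0::2, 0] = src[:, 0]; M[0::2, 1] = src[:, 1]; M[0::2, 4] = 1.0
    M[1::2, 2] = src[:, 0]; M[1::2, 3] = src[:, 1]; M[1::2, 5] = 1.0
    b = dst.reshape(-1)
    p, *_ = np.linalg.lstsq(M, b, rcond=None)
    return p[:4].reshape(2, 2), p[4:]

# Sub-template centres and their matched positions in the next frame:
# a pure translation by (3, -2) in this synthetic example.
src = [(0, 0), (10, 0), (0, 10), (10, 10)]
dst = [(x + 3, y - 2) for x, y in src]
A, t = estimate_affine(src, dst)
print(np.round(A, 3), np.round(t, 3))
```

With more than three sub-templates the system is overdetermined, so outlying sub-template matches are averaged out by the least-squares solution.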
U.S. Pat. No. 5,473,369, entitled OBJECT TRACKING APPARATUS, issued Feb. 23, 1994 in the name of Abe, discloses an object detecting and tracking apparatus which detects a tracking target object from a moving image photographed by a television camera and tracks the same, wherein the movement of the object is detected reliably and with high accuracy to automatically track the target object. When tracking is started after putting the target object in a region designating frame "WAKU" displayed on a screen in such a way as to be variable in size and position, a video signal input from a photographic optical system is Y/C-separated. After that, the target object is specified from a tracking region histogram, and movement vectors are obtained from a color-time-space image, a color-time-space differential image, and/or a luminance-time-space image thereof, thereby making it possible to detect even the movement of varied objects more reliably than with conventional block matching.
U.S. Pat. No. 5,592,228, entitled VIDEO ENCODER USING GLOBAL MOTION ESTIMATION AND POLYGONAL PATCH MOTION ESTIMATION and issued Jan. 7, 1997 in the name of Dachiku et al., discloses a video coding apparatus providing high coding efficiency even at a low bit rate. The apparatus includes a moving object analyzer which extracts a moving part from an input picture signal, analyzes its motion, and outputs motion parameters together with a residual signal relative to a reconstruction image. The apparatus further includes a residual coding device for coding the residual signal, a reconstruction device for reconstructing a picture using the motion parameters, and a device that performs variable length coding of the motion parameters and residual coded information.
Patent document EP 0 805 405 A2, "MOTION EVENT DETECTION FOR VIDEO INDEXING", Courtney et al., discusses a motion segmentation method for object tracking. Given the segmented images and the motion of each segment, the method links the segments in consecutive frames based on their position and estimated velocity to achieve object tracking.
U.S. Pat. No. 5,684,715, entitled INTERACTIVE VIDEO SYSTEM WITH DYNAMIC VIDEO OBJECT DESCRIPTORS, issued Nov. 4, 1997 in the name of Palmer, discloses an interactive video system by which an operator is able to select an object moving in a video sequence and by which the interactive video system is notified which object was selected so as to take appropriate action. Interactive video is achieved through generation and use of video object descriptors which are synchronized to objects in the video sequence. Video object descriptors are generated by a generating tool which decomposes frames of video sequences and tracks movement of objects in those frames so as to generate a frame-sequential file of video object descriptors. The file of video object descriptors is then used by an event interpreter which detects a match between the position of a pointing device on a display containing the video sequence and the position of a video object descriptor. When a match is detected, an interactive video operation is performed, such as jumping to a new video sequence, altering flow of the interactive video program, or the like.
All these region and boundary tracking methods address tracking the shape and location of the object when the object boundary is provided in only one frame, the starting frame, or is not provided in any frame. In the latter case, motion segmentation methods are employed to find the objects in a scene.
Selection of the object boundary in more than one frame is addressed by Z. Chen, S.-M. Tan, D. Xie, A. Sane, Y. Li and R. H. Campbell in http://www.vosaic.com/corp/papers/www5.html, and in J. Kanda, K. Wakimoto, H. Abe, and S. Tanaka, "Video hypermedia authoring using automatic object tracking," in Proceedings of SPIE on Storage and Retrieval for Image and Video Databases VI, (San Jose, Calif.), pp. 108-115, Jan. 18-30, 1998.
Both methods propose using linear interpolation to predict the location of the object in the intermediate frames from the locations given for two frames. In addition, Kanda et al. suggest using the location information of the object given in any two frames to check the validity of template-based tracking results from one frame to another. However, neither of the above methods utilizes the shape, location, and color information of the object in both frames to help improve the template-based tracking performance.
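The linear interpolation both methods describe amounts to blending the two given locations in proportion to the frame index; a minimal sketch, with illustrative frame indices and positions of our own choosing:

```python
# Minimal sketch: predict the object's (x, y) location in intermediate
# frames linearly from its locations in two given frames.
def interpolate_location(frame, f0, loc0, f1, loc1):
    """Linearly interpolate a location for f0 <= frame <= f1."""
    alpha = (frame - f0) / float(f1 - f0)
    return tuple(a + alpha * (b - a) for a, b in zip(loc0, loc1))

# Object at (10, 20) in frame 0 and (50, 60) in frame 10.
preds = [interpolate_location(f, 0, (10, 20), 10, (50, 60))
         for f in range(11)]
print(preds[5])  # -> (30.0, 40.0)
```

Such a prediction ignores acceleration and local deformation, which is why the invention combines it with other cues rather than relying on it alone.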
An object of the present invention is to satisfy the aforementioned needs. In accordance with an aspect of the present invention, a program for use on a computer provides a location and marker of a video object from a plurality of images.
In accordance with another aspect of the invention, a computer readable storage medium has a computer program stored thereon performing the steps of (a) identifying an object to be tracked in a frame, referred to as the first frame, and determining its shape in this frame; (b) selecting another frame, referred to as the last frame, and identifying the shape of the same object in this frame; and (c) finding the location and marker, and hence the trajectory and global motion, of the object in every frame between the first and last frames.
In accordance with another aspect of the invention, step (c) comprises the steps of (1) finding the location, marker, and shape of the object in a subset of frames between the first and last frames; and (2) finding the location, marker, and shape of the object in every frame from those.
In accordance with another aspect of the invention, step (1) further comprises the steps of (A) predicting the marker and location of the object in the frame being processed from those for the neighboring processed frames; (B) predicting the marker and location of the object in the frame being processed based on histogram back-projection; (C) fusing the two predictions for the marker and location in the frame being processed to find the search space for template matching; (D) template matching for finding the marker and location of the object in the frame being processed; and (E) identifying the boundary of the object in the frame being processed.
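The histogram back-projection of step (B) can be sketched as follows: build a histogram of the object's pixel values in a reference frame, replace each pixel of the frame being processed by its histogram count, and take the window of highest back-projected mass as the predicted location. The function names, bin count, window scan, and synthetic grayscale data below are our own assumptions:

```python
import numpy as np

# Illustrative sketch of histogram back-projection for location prediction.
def back_project(frame, obj_hist, bins=8):
    """Replace each pixel with the histogram count of its quantized value."""
    q = np.minimum(frame * bins // 256, bins - 1)
    return obj_hist[q]

def predict_location(frame, obj_pixels, win=5, bins=8):
    """Return the top-left corner of the win x win window with the highest
    back-projected mass, as a crude location prediction."""
    hist = np.bincount(
        np.minimum(obj_pixels * bins // 256, bins - 1), minlength=bins)
    bp = back_project(frame, hist, bins)
    best, best_score = (0, 0), -1.0
    h, w = bp.shape
    for i in range(h - win + 1):
        for j in range(w - win + 1):
            s = bp[i:i + win, j:j + win].sum()
            if s > best_score:
                best_score, best = s, (i, j)
    return best

# Synthetic grayscale frame: dark background with a bright 5x5 object.
frame = np.zeros((20, 20), dtype=np.int64)
frame[8:13, 11:16] = 200
obj = np.full(25, 200, dtype=np.int64)   # object pixels from the first frame
print(predict_location(frame, obj))      # -> (8, 11)
```

In the described method, a prediction of this kind would be fused with the interpolation-based prediction of step (A) to narrow the search space for template matching in step (D).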
It is an object of the present invention to provide an object tracking technique that finds the location and marker of the object being tracked in an image sequence, while permitting partial occlusions by other objects in the video or self-occlusions.