Newer video coding standards, such as MPEG-4, allow arbitrary-shaped objects to be encoded and decoded as separate video object planes (VOPs). These emerging standards are intended to enable multimedia applications, such as interactive video, where access is universal, and where natural and synthetic objects are integrated. For example, one might want to “cut-and-paste” moving persons from one video to another. In order to identify the persons, the persons must first be tracked.
A VOP describes a video object in terms of, for example, shape, motion, and texture. The exact method of producing the VOP from the source images is not defined by the standards. It is assumed that “natural” objects are represented by shape information, in addition to the usual luminance and chrominance components. Because video objects vary extensively with respect to low-level features, such as optical flow, color, and intensity, object tracking is a very difficult problem.
Recent advances in object tracking make it possible to obtain spatio-temporal motion trajectories of moving objects for further analysis of the information concealed in the motion. Although the extraction of trajectories is well known, the precise comparison of the extracted trajectories, and of secondary outputs of the tracking process, is not well understood.
A key issue in evaluating the results of object tracking, i.e., the object trajectories, is a metric that determines the similarity of the trajectories. Any additional analysis, such as action recognition and event detection, depends highly on the accuracy of the similarity assessment.
Most prior art similarity metrics determine a mean distance of the corresponding positions of two equal duration trajectories, C. Jaynes, S. Webb, R. Steele, and Q. Xiong, “An open development environment for evaluation of video surveillance systems,” Proc. of PETS, June 2002, and A. Senior, A. Hampapur, Y. Tian, L. Brown, S. Pankanti, and R. Bolle, “Appearance models for occlusion handling,” Proc. of PETS, December 2001. These are strictly ‘distance’ metrics.
Supplementary statistics such as variance, median, minimum, and maximum distances are also known to extend the description of similarity.
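The mean-distance metric and its supplementary statistics can be sketched as follows, assuming two equal-duration trajectories given as lists of (x, y) coordinates with a one-to-one correspondence between positions; the function names are illustrative, not taken from the cited works.

```python
import math
import statistics

def mean_distance(traj_a, traj_b):
    """Mean Euclidean distance between corresponding coordinates."""
    if len(traj_a) != len(traj_b):
        # Such metrics are defined only for equal-duration trajectories.
        raise ValueError("trajectories must have equal durations")
    dists = [math.dist(p, q) for p, q in zip(traj_a, traj_b)]
    return statistics.mean(dists)

def distance_statistics(traj_a, traj_b):
    """Supplementary statistics extending the similarity description."""
    dists = [math.dist(p, q) for p, q in zip(traj_a, traj_b)]
    return {
        "mean": statistics.mean(dists),
        "variance": statistics.pvariance(dists),
        "median": statistics.median(dists),
        "min": min(dists),
        "max": max(dists),
    }
```

Note that both functions rely on the i-th coordinate of one trajectory corresponding to the i-th coordinate of the other, which is precisely the assumption that fails for unequal-duration or unevenly sampled trajectories.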
An alignment based metric reveals a spatial translation and a temporal shift between given trajectories, C. Needham and R. Boyle, “Performance evaluation metrics and statistics for positional tracker evaluation,” Third International Conference of Computer Vision Systems, pages 278-289, April 2003. That method uses an area based metric that measures a total enclosed area between the trajectories using trajectory intersections.
Other statistical properties of the tracking performance use compensated means and standard deviations, T. Ellis, “Performance metrics and methods for tracking in surveillance,” Proc. of PETS, June 2002.
One main disadvantage of the prior art methods is that those methods are all limited to equal duration (lifetime) trajectories. This means that the numbers of coordinates that constitute the trajectories must be equal.
Typically, the coordinates are sampled at different time instances. Because the conventional similarity metrics depend on mutual coordinate correspondences, those metrics cannot be applied to trajectories that have unequal or varying durations unless the trajectory duration is first normalized or parameterized. However, such a normalization destroys the temporal properties of the trajectories.
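Such a normalization can be sketched as a resampling of the trajectory to a fixed number of coordinates by linear interpolation over the index range; the function name and parameters are illustrative. The original timestamps are discarded in the process, which is why this kind of normalization destroys the temporal properties of the trajectory.

```python
def normalize_duration(traj, n_samples):
    """Resample a list of (x, y) coordinates to n_samples points by
    linear interpolation over the index range. Illustrative sketch;
    note that the original sampling times are lost."""
    m = len(traj)
    out = []
    for i in range(n_samples):
        t = i * (m - 1) / (n_samples - 1)  # fractional source index
        lo = int(t)
        hi = min(lo + 1, m - 1)
        f = t - lo
        x = traj[lo][0] * (1 - f) + traj[hi][0] * f
        y = traj[lo][1] * (1 - f) + traj[hi][1] * f
        out.append((x, y))
    return out
```

After resampling, two trajectories of different lifetimes have equal numbers of coordinates and can be fed to a conventional mean-distance metric, but the comparison no longer reflects when the positions were actually observed.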
Conventional similarity metrics assume that the temporal sampling rates of the trajectories are equal. For instance, a ground truth trajectory labeled at a certain frame rate can be compared only with the trajectory generated by a tracker working at the identical frame rate. Those methods cannot handle uneven sampling instances, i.e., varying temporal distance between the coordinates.
This is a common case, especially for real-time object trackers that process streaming video data. Whenever a real-time tracker finishes processing the current frame, it works on the next available frame, which may not be the immediate temporal successor of the current frame. Thus, the trajectory coordinates obtained have varying temporal distances.
Therefore, there is a need for a method for tracking multiple objects in videos that overcomes the problems of the prior art. Furthermore, there is a need for similarity metrics that can compare a wide variety of trajectory patterns.