1. Field of the Invention
The present invention is in the field of annotating image sequences, where object trajectories are specified by a user on an interactive display system.
2. Background of the Invention
Annotation of images has been used in a number of inventions. U.S. Pat. No. 6,480,186, by L. McCabe and J. Wojciechowski, describes an apparatus and method for annotating single images captured by an ultrasound machine. Characters are placed on the image using an alphanumeric input device. The device is limited to the annotation of single images and does not address ultrasound image sequences. Annotation of single images can be extended to image sequences or video. U.S. Pat. No. 6,452,615, by P. Chiu, L. Wilcox and A. Kapuskar, discloses a notetaking system wherein video streams are annotated. The notes taken during the notetaking session are time-stamped and indexed into the video stream for later playback. U.S. Pat. No. 6,867,880, by K. Silverbrook and P. Lapstun, discloses a method and system for instructing a computer using coded marks. Through a drag and drop mechanism, the user is able to perform image manipulation from a given album or folder. U.S. Pat. No. 5,583,980, by G. Anderson, presents a method of annotating an image and synchronizing this annotation with a time-based program. Pen movements on the image are captured and synchronized with the program, which is played on another screen during annotation. It could be argued that this method could be used for annotating object trajectories, but real-time marking of objects in video requires the user to accurately anticipate where the object is going. Furthermore, manual tracking in real time is virtually impossible if the object exhibits a lot of random movement. U.S. Pat. No. 6,873,993, by Charlesworth, et al., relates to an apparatus and method for indexing sequences of sub-word units, such as sequences of phonemes or the like. This can be seen as a parallel in the text domain. U.S. Pat. No. 6,076,734, by Dougherty, et al., provides a variety of methods and systems for providing computer/human interfaces, but it provides only a generic method for any interface and does not describe means of automating or accelerating data input tasks for any specific domain.
Manual annotation of video images has been used in the quantitative performance evaluation of vision algorithms. Pixel-accurate performance evaluation has been used for object detection algorithms. In the paper by Mariano, et al., “Performance Evaluation of Object Detection Algorithms”, Proceedings of Intl Conference on Pattern Recognition, 2002, images are annotated by marking objects such as text with bounding boxes. Other evaluation methods used similar bounding-box annotations, as in the work of Hua, et al., “An automatic performance evaluation protocol for video text detection algorithms”, IEEE Transactions on Circuits and Systems for Video Technology, 2004. Since the test video data contains many frames to annotate, the inefficient frame-by-frame marking of text blocks makes the task very time consuming. For text that does not move, the bounding box where the text first appeared can be propagated to the subsequent frames. But for moving text, such as movie credits, or scene text, such as the characters on a moving truck, the text blocks have to be tediously tracked from frame to frame.
Manual annotation of image sequences is an important task in the analysis of image sequences. Objects in image sequences can be counted, tracked and marked with metadata such as text and colored tags. For example, in market research of retail stores, stored video can be watched and annotated to count people, track them around the store, and record the time they spend in particular spots.
Another important purpose of manual annotation of image sequences is in the development of object tracking algorithms. Many computer vision methods for tracking require manually-generated data of object trajectories. This data is also called the “ground-truth”. This data can be divided into two types: training data and test data. The tracking methods “learn” from the training data to set their internal parameters. The trajectories in the test data are then used to quantitatively evaluate the performance of the algorithm. The tracking algorithm's internal parameters can then be optimized to maximize the performance measures. Another important use of ground-truth trajectory data is in comparing different tracking algorithms. Two or more tracking algorithms can be run on a single trajectory data set and their results compared by some performance measure.
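By way of illustration only, the comparison of tracking algorithms against ground-truth trajectories can be sketched as follows. The sketch assumes that each trajectory is a list of (x, y) positions, one per frame, and uses the mean Euclidean distance to the ground truth as the performance measure. The function names, the tracker names, the sample coordinates and the choice of measure are all hypothetical and are not taken from any of the cited works.

```python
import math


def mean_position_error(ground_truth, estimate):
    """Average Euclidean distance between two trajectories, frame by frame.

    Each trajectory is a list of (x, y) points, one point per frame.
    """
    if not ground_truth or len(ground_truth) != len(estimate):
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(g, e) for g, e in zip(ground_truth, estimate))
    return total / len(ground_truth)


def rank_trackers(ground_truth, tracker_outputs):
    """Return (name, error) pairs sorted from lowest to highest error."""
    scores = {
        name: mean_position_error(ground_truth, trajectory)
        for name, trajectory in tracker_outputs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical manually annotated trajectory (the ground truth).
ground_truth = [(0, 0), (1, 0), (2, 0), (3, 0)]

# Hypothetical outputs of two tracking algorithms on the same sequence.
tracker_outputs = {
    "tracker_a": [(0, 0), (1, 1), (2, 1), (3, 0)],
    "tracker_b": [(0, 3), (1, 4), (2, 3), (3, 4)],
}

ranking = rank_trackers(ground_truth, tracker_outputs)
for name, error in ranking:
    print(f"{name}: mean error {error:.2f} pixels")
```

Other performance measures, such as bounding-box overlap, could be substituted for the distance function without changing the structure of the comparison.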
A tool for annotating video data is described in the work of David Doermann and David Mihalcik, “Tools and Techniques for Video Performance Evaluation”, Proceedings of Intl Conference on Pattern Recognition, Volume 4, 2000. The system, called ViPEr, provides random access to frames in a video and allows objects to be marked and tracked across consecutive frames. The ViPEr tool was used to evaluate algorithms for detecting objects in video in the work of Mariano, et al., “Performance Evaluation of Object Detection Algorithms”, Proceedings of Intl Conference on Pattern Recognition, 2002. One inefficiency of ViPEr is that an object trajectory is marked by clicking on the object location in each frame of the image sequence. Furthermore, the images in the sequence are displayed one at a time, requiring the user to skip to the next frame after marking the object location in the current frame. The repeated mark-then-skip routine takes a lot of time, especially when annotating hours of video containing many objects to be tracked.
Instead of using the traditional “mark-then-skip-to-next-frame” routine, the present invention shows many consecutive frames on one screen, and the user spatially tracks the object across the displayed consecutive frames.