There is a demand for a technique that defines an object in video and that extracts and tracks the corresponding group of regions. With such a technique, a video sequence could be described in terms of its objects, and that description could be used as a key for the automatic extraction of the definition of the video sequence.
In video content, a person tends to be treated as an important object, and for sports images and applications using a surveillance camera, almost all the objects may be expressed by extracting the regions of persons. However, since the human body has a high degree of freedom of movement, a fixed template of the kind that can effectively extract the region of a rigid object cannot be employed. Thus, region extraction is a difficult operation. In particular, when multiple persons pass each other, the person in front, who is nearer the camera, hides the person behind. This overlapping of objects is called the occlusion state, and separating the overlapping objects is not easy.
When the occlusion state is resolved, regions must continue to be extracted and tracked while the state that preceded the occlusion is maintained. However, since it is currently impossible to fully automate object extraction processing, the trajectories of objects that are automatically extracted must be corrected manually. This requires that, for each of the extracted objects, errors in the result provided by the automatic process be found and corrected. When, on average, ten persons appear in video content, a total of ten corrections is required. Since the person performing the corrections must view the same content repeatedly, the cost involved is very high.
Thus, various object extraction and tracking methods have been proposed and discussed. For example, a method for extracting the image of a person from video content and tracking the image is described in “Pfinder: Real-Time Tracking of the Human Body”, C. Wren, A. Azarbayejani, T. Darrell and A. Pentland, IEEE PAMI, vol. 19, No. 7, pp. 780-785, July 1997 (document 1). In document 1, a background model is prepared using a Gaussian model, and segmentation is performed by using, as a reference, the Mahalanobis distance between the model and an input image. Further, a tracking method based on the prediction of motion using a Kalman filter is proposed.
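The segmentation principle attributed to document 1 may be illustrated by the following minimal sketch. It is not the implementation of document 1. It assumes, for illustration only, one independent Gaussian per pixel with a diagonal covariance over the R, G and B channels, and the names fit_background, mahalanobis_sq and segment, the variance floor min_var and the threshold value are all hypothetical.

```python
def fit_background(frames, min_var=1.0):
    """Fit a per-pixel Gaussian (mean, variance per channel) to background frames.

    frames: list of frames; each frame is a list of (r, g, b) tuples.
    min_var is an assumed floor that keeps the variance away from zero.
    """
    model = []
    for i in range(len(frames[0])):
        # Gather the samples of pixel i, one tuple of values per colour channel.
        channels = list(zip(*(frame[i] for frame in frames)))
        means = tuple(sum(c) / len(c) for c in channels)
        variances = tuple(
            max(sum((x - m) ** 2 for x in c) / len(c), min_var)
            for c, m in zip(channels, means)
        )
        model.append((means, variances))
    return model


def mahalanobis_sq(pixel, means, variances):
    """Squared Mahalanobis distance for a diagonal covariance matrix."""
    return sum((p - m) ** 2 / v for p, m, v in zip(pixel, means, variances))


def segment(frame, model, threshold_sq=9.0):
    """Mark a pixel as foreground when it lies far from its background Gaussian.

    threshold_sq is an assumed value (a distance of 3, squared).
    """
    return [
        mahalanobis_sq(pixel, means, variances) > threshold_sq
        for pixel, (means, variances) in zip(frame, model)
    ]
```

For example, after the model has been fitted to a few frames of an empty scene, segment returns a list of Boolean values in which only those pixels whose color departs strongly from the learned background are marked as foreground.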
A method that improves on the technique described in document 1 is disclosed in “Improved Tracking of Multiple Humans with Trajectory Prediction and Occlusion Modeling”, R. Rosales and S. Sclaroff, Proc. CVPR '98 (document 2). According to this method, the motions of two persons are predicted using an extended Kalman filter, and the occlusion state, in which the objects overlap, is identified.
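The idea of predicting motion in order to anticipate an occlusion may be illustrated by the following simplified sketch. It is not the method of document 2: a linear constant-velocity Kalman filter is substituted for the extended Kalman filter, only the horizontal image axis is modeled, and an occlusion is anticipated when the predicted centers of two persons come closer than an assumed box width. The class AxisKalman, the functions track and occlusion_predicted, and the noise parameters p0, q and r are hypothetical.

```python
class AxisKalman:
    """Linear constant-velocity Kalman filter for one image axis.

    State is (position, velocity); only the position is measured.
    p0, q and r are assumed initial, process and measurement variances.
    """

    def __init__(self, pos, vel=0.0, p0=100.0, q=0.01, r=1.0):
        self.pos, self.vel = float(pos), float(vel)
        self.P = [[p0, 0.0], [0.0, p0]]
        self.q, self.r = q, r

    def predict(self, dt=1.0):
        """Advance the state by dt: x = F x, P = F P F^T + q I."""
        (a, b), (c, d) = self.P
        self.pos += dt * self.vel
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.pos

    def update(self, z):
        """Correct the state with a measured position z."""
        (a, b), (c, d) = self.P
        s = a + self.r              # innovation variance
        k0, k1 = a / s, c / s       # Kalman gain
        y = z - self.pos            # innovation
        self.pos += k0 * y
        self.vel += k1 * y
        self.P = [
            [(1 - k0) * a, (1 - k0) * b],
            [c - k1 * a, d - k1 * b],
        ]


def track(measurements):
    """Run the filter over a sequence of measured centre positions."""
    kf = AxisKalman(measurements[0])
    for z in measurements[1:]:
        kf.predict()
        kf.update(z)
    return kf


def occlusion_predicted(kf_a, kf_b, width, dt=1.0):
    """True when the centres predicted dt ahead are closer than one box width."""
    next_a = kf_a.pos + dt * kf_a.vel
    next_b = kf_b.pos + dt * kf_b.vel
    return abs(next_a - next_b) < width
```

In use, one filter would be run for each person, and occlusion_predicted would be evaluated for each pair of persons in every frame so that an approaching overlap can be flagged before it occurs.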
Another technique is disclosed in “An Automatic Video Parser for TV Soccer Games”, Y. Gong, C. Chuan and L. T. Sin, Proc. ACCV '95, vol. II, pp. 509-513 (document 3); “Soccer Player Recognition by Pixel Classification in a Hybrid Color Space”, N. Vanderbroucke, L. Macaire and J. Postaire, Proc. SPIE, Vol. 3071, pp. 23-33, August 1997 (document 4); “Where are the Ball and Players? Soccer Game Analysis with Color-Based Tracking and Image Mosaick”, Y. Seo, S. Choi, H. Kim and K. Hong, Proc. ICIAP '97, pp. 196-203 (document 5); and “CLICK-IT: Interactive Television Highlighter for Sports Action Replay”, D. Rees, J. I. Agbinya, N. Stone, F. Chen, S. Seneviratne, M. deBurgh and A. Burch, Proc. ICPR '98, pp. 1484-1487 (document 6). According to this technique, based on the histogram backprojection described in “Color Indexing”, M. J. Swain and D. H. Ballard, IJCV, Vol. 7, No. 1, pp. 11-32, 1991 (document 7), a histogram of the target to be tracked is entered in advance, and matching is performed in color space. For the determination of an occlusion, in document 5 the pixels in an occlusion are identified in RGB color space, and in document 4 the pixels are identified in hybrid color space. In document 6, color information is employed for tracking, and a motion prediction method is employed for the determination of an occlusion. As means for also handling information obtained in time space, a method is well known whereby a video sequence is analyzed in the spatio-temporal domain, and the surface of a tracking target is obtained. The obtained surface is tubular in shape, and an occlusion is determined based on its continuity along the time axis.
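The histogram backprojection of document 7 may be illustrated by the following minimal sketch, in which each pixel of the input image is replaced by the ratio of the model histogram count to the image histogram count for its color bin, limited to a maximum of 1. The sketch omits the subsequent step of document 7 in which the backprojected image is smoothed and its peak is located. The function names and the choice of four bins per channel are assumptions made for illustration only.

```python
from collections import Counter


def quantize(pixel, bins=4):
    """Map an 8-bit (r, g, b) pixel to a coarse colour bin (assumed 4 bins per channel)."""
    step = 256 // bins
    return tuple(channel // step for channel in pixel)


def histogram(pixels, bins=4):
    """Count the pixels that fall into each colour bin."""
    return Counter(quantize(p, bins) for p in pixels)


def backproject(image, model_hist, bins=4):
    """Replace each pixel by min(model count / image count, 1) for its bin.

    image: list of (r, g, b) tuples; model_hist: histogram of the target,
    entered in advance. High values mark pixels likely to belong to the target.
    """
    image_hist = histogram(image, bins)
    result = []
    for pixel in image:
        bin_id = quantize(pixel, bins)
        # image_hist[bin_id] is at least 1 because the pixel itself is counted.
        result.append(min(model_hist[bin_id] / image_hist[bin_id], 1.0))
    return result
```

A tracker of this kind would build model_hist once from a sample of the target, for example the uniform color of a player, and would then apply backproject to each new frame to obtain a map of likely target pixels.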
A technique that employs an interactive process (manual correction process) is also well known as a tracking method based on color information. According to this technique, either a user designates a tracking target, or an object is tracked that corresponds to a shape or color (template) that has been entered in advance. For example, a technique for performing template matching based on information concerning the shape (sphere) and the color (white) of a soccer ball is described in “Analysis and Presentation of Soccer Highlights from Digital Video”, D. Yow, B. Yeo, M. Yeung and B. Liu, Proc. ACCV '95. Further, a technique whereby a user employs a mouse to designate a player to be tracked in a soccer game is described in “Determining Motion of Non-Rigid Objects by Active Tubes”, M. Takahata, M. Imai and S. Tsuji, Proc. ICPR '92, pp. 647-650, September, 1992.
However, the technique in document 1 cannot extract the images of multiple persons from a video image and determine an occlusion, and document 2 does not disclose a technique for tracking two or more persons. Further, according to the methods in documents 3 to 6, only the information obtained in image space and color space is processed. As for the method that uses time space, since it is based on the optimization of an energy function, the calculation cost is high.
That is, although it is extremely common for two or more persons to appear and to overlap each other in a video image, the conventional techniques cannot determine an occlusion. To improve on the conventional techniques, not only the information obtained in image space and in color space, but also the information obtained in time space must be employed. However, the cost of performing the required calculations is high, and it is difficult to perform real-time tracking at a low cost.
Further, since it is currently difficult to determine and track an occlusion state in a completely automated manner, an interactive process is indispensable. There is therefore a demand for means that simplify the interactive process and that reduce the amount of manual labor and the operating time required.