The present invention relates to a method of describing object region data such that information about an object region in a video is described, an apparatus for generating object region data such that information about an object region in a video is generated, a video processing apparatus arranged to be given an instruction about an object in a video to perform a predetermined process or retrieve an object in a video, and a video processing method therefor.
Hyper media are configured such that related information, called a hyper link, is given between media such as videos, sounds or texts to permit mutual reference. When videos are mainly used, related information is provided for each object which appears in the video. When the object is specified, the related information (text information or the like) is displayed. The foregoing structure is a representative example of the hyper media. The object in the video is expressed by a frame number or a time stamp of the video and by information for identifying a region in the video, both of which are recorded in the video data or recorded as individual data.
Mask images have frequently been used as means for identifying a region in a video. The mask image is a bit map image constituted by giving different pixel values to the inside portion of an identified region and to the outside portion of the same. The simplest method has an arrangement in which a pixel value of “1” is given to the inside portion of the region and “0” is given to the outside portion of the same. Alternatively, α values, which are employed in computer graphics, are sometimes used. Since the α value is usually able to express 256 levels of gray, only a portion of the levels is used: the inside portion of the specified region is expressed as 255, while the outside portion of the same is expressed as 0. The latter image is called an α map. When the regions in the image are expressed by the mask images, whether or not a pixel in a frame is included in the specified region can easily be determined by reading the value of the corresponding pixel of the mask image and by determining whether the value is 0 or 255. The mask image has the freedom to express a region regardless of the shape of the region, and even a discontinuous region can be expressed. The mask image must, however, have the same number of pixels as the original image. Thus, there arises a problem in that the quantity of data cannot be reduced.
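The mask-image approach described above can be illustrated by the following minimal sketch. The function names, the 8 by 4 frame size, and the sample region are assumptions introduced only for illustration and do not appear in the foregoing description.

```python
# Hypothetical illustration of an alpha map: 255 inside the region, 0 outside.
INSIDE, OUTSIDE = 255, 0

def make_alpha_map(width, height, region_pixels):
    """Build a mask with one entry per frame pixel."""
    mask = [[OUTSIDE] * width for _ in range(height)]
    for x, y in region_pixels:
        mask[y][x] = INSIDE
    return mask

def is_inside(mask, x, y):
    """Membership test: read the mask pixel and compare it with 255."""
    return mask[y][x] == INSIDE

def mask_size_in_pixels(mask):
    """The mask holds as many pixels as the frame, however small the region."""
    return len(mask) * len(mask[0])

# A discontinuous region (two separate parts) in an assumed 8x4 frame.
region = [(1, 1), (2, 1), (6, 3)]
alpha_map = make_alpha_map(8, 4, region)
```

In this sketch the membership test is a single read, and a discontinuous region poses no difficulty, but the mask occupies one entry for every pixel of the frame even though the region covers only three pixels.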
To reduce the quantity of data of the mask image, the mask image is frequently compressed. When the mask image is a binary mask image constituted by 0 and 1, it can be processed as a binary image. Therefore, the compression method employed in facsimile machines or the like is frequently used. In the case of MPEG-4, which has been standardized by ISO/IEC MPEG (Moving Picture Experts Group), an arbitrary shape coding method is employed in which both the mask image constituted by 0 and 1 and the mask image using the α value are compressed. The foregoing compression method uses motion compensation and is capable of improving compression efficiency. On the other hand, complex compression and decoding processes are required.
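The trade-off of compressing a binary mask can be illustrated by a simplified run-length sketch. It is offered only as an assumption-laden illustration: it is neither the actual facsimile coding scheme nor the MPEG-4 arbitrary shape coding method, and the function names and sample row are hypothetical.

```python
def rle_encode_row(row):
    """Encode a row of 0/1 pixels as run lengths, starting with a run of 0s.

    If the row begins with 1, the first run has length 0.
    """
    runs = []
    current, length = 0, 0
    for pixel in row:
        if pixel == current:
            length += 1
        else:
            runs.append(length)
            current, length = pixel, 1
    runs.append(length)
    return runs

def rle_pixel_at(runs, x):
    """Recover the pixel at column x by walking the runs from the left."""
    value, end = 0, 0
    for length in runs:
        end += length
        if x < end:
            return value
        value = 1 - value  # runs alternate between 0 and 1
    raise IndexError("column outside the encoded row")

# One assumed row of a binary mask.
row = [0, 0, 1, 1, 1, 0, 0, 0]
runs = rle_encode_row(row)
```

The encoded row is shorter than the original, but a single pixel can no longer be read directly; the runs must be traversed from the start of the row. A scheme that additionally uses motion compensation would also require reference to other frames before a pixel could be recovered.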
To express a region in a video, the mask image or the compressed mask image has usually been employed. However, data for identifying a region is required to permit easy and quick extraction, to be reduced in quantity and to permit easy handling.
On the other hand, the hyper media, which usually assume an operation for displaying related information of a moving object in a video, make specification of the object somewhat more difficult than in the handling of a still image. A user usually has difficulty in specifying a specific portion of a moving object. Therefore, it can be considered that the user usually aims roughly at, for example, a portion in the vicinity of the center of the object. Moreover, a portion which is adjacent to the object but deviated from it is frequently specified owing to the movement of the object. Therefore, data for specifying a region is desired to be adaptable to the foregoing media. Moreover, an aiding mechanism for facilitating specification of a moving object in a video is required for a system for displaying related information of the moving object in the video.
As described above, the conventional method of expressing a desired object region in a video by using the mask image suffers from a problem in that the quantity of data cannot be reduced. The method arranged to compress the mask image raises a problem in that coding and decoding become too complicated. What is worse, the pixels of a predetermined frame cannot be accessed directly, which makes handling difficult.
There arises another problem in that a device for permitting a user to easily specify a moving object in a video has not been provided.