1. Field of the Invention
The present invention relates to a technique for separating a foreground object and a background sprite by using sprite coding, which is one of the object coding methods in MPEG-4. More particularly, the present invention relates to a technique for separating and extracting the foreground object from the background sprite in support of sprite coding, which represents a background object as a panoramic image. Sprite coding is an object coding method supported by the MPEG-4 Version 1 Main Profile, in which coding is performed for each object.
In addition, the present invention relates to a segmentation mask extraction technique for generating a segmentation mask, which is one of the two object representations in MPEG-4, namely the texture map and the segmentation mask.
2. Description of the Related Art
In this specification, a moving object is referred to as a foreground object, and a background panorama is referred to as a background sprite.
The following techniques are known for separating the foreground object from the background object, that is, for extracting the foreground object.
A first method is as follows. An object such as a person is placed in front of a background which is colored with a uniform color. Then, the foreground object such as the person is extracted by using a chroma key technique.
A second method is as follows. A rough outline is manually specified beforehand. Then, it is determined whether each pixel around the outline belongs to the foreground or the background.
A third method is as follows. The outline of a moving area is specified by obtaining differences between frames of an image taken by a fixed camera. The inside of the outline is judged to be the foreground, and the outside is judged to be the background.
The following techniques are known for extracting the background sprite.
A first method is as follows. As a common preprocess for generating a sprite, the global motion between adjacent frames is calculated, and then the transformation from standard coordinates (the absolute global motion) is calculated. After that, the frames are aligned by using the absolute global motion, and a median or an average value is calculated in the time direction for each pixel of the aligned frames.
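The temporal median step of this first method can be sketched as follows. This is a minimal illustration, not an implementation taken from any reference: it assumes that the frames have already been warped into sprite coordinates by the absolute global motion, that each frame is a two-dimensional list of luminance values, and that None marks a sprite pixel which a frame does not cover. The function name temporal_median_sprite is hypothetical.

```python
from statistics import median

def temporal_median_sprite(aligned_frames):
    """Build a sprite by taking the per-pixel median over time.

    aligned_frames: list of 2D lists already aligned to sprite coordinates
    (assumption); None means the frame does not cover that sprite pixel.
    """
    height = len(aligned_frames[0])
    width = len(aligned_frames[0][0])
    sprite = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect every value observed at this sprite pixel over time.
            samples = [f[y][x] for f in aligned_frames if f[y][x] is not None]
            if samples:
                sprite[y][x] = median(samples)
    return sprite

# Three aligned frames of a one-row scene; background is 50, and a
# foreground value of 200 passes through the second pixel in one frame.
frames = [
    [[50, 50, 50, None]],
    [[50, 200, 50, 50]],
    [[None, 50, 50, 50]],
]
sprite = temporal_median_sprite(frames)
```

In this example the transient value 200 is rejected by the median because the background is visible at that pixel in most frames. Replacing median with an average would instead blend the foreground value into the sprite.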
A second method is as follows. After the above-mentioned preprocess is performed, the frames are aligned by using the absolute global motion, and then the frames are either overwritten or underwritten (in underwriting, only an area where a pixel value is not yet decided is filled).
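The difference between overwriting and underwriting can be sketched as follows. This is an illustrative sketch only: it assumes frames already aligned to sprite coordinates, with None marking pixels that a frame does not cover, and the function name composite_sprite and its mode argument are hypothetical.

```python
def composite_sprite(aligned_frames, mode="overwrite"):
    """Compose a sprite by writing aligned frames one after another.

    mode="overwrite": a later frame always replaces earlier values.
    mode="underwrite": a value is written only where none is decided yet.
    """
    if mode not in ("overwrite", "underwrite"):
        raise ValueError("mode must be 'overwrite' or 'underwrite'")
    height = len(aligned_frames[0])
    width = len(aligned_frames[0][0])
    sprite = [[None] * width for _ in range(height)]
    for frame in aligned_frames:
        for y in range(height):
            for x in range(width):
                value = frame[y][x]
                if value is None:
                    continue  # this frame does not cover the pixel
                if mode == "overwrite" or sprite[y][x] is None:
                    sprite[y][x] = value
    return sprite

# Background is 50; the first frame contains a foreground value of 200.
frames = [
    [[50, 200, None]],
    [[None, 50, 50]],
]
overwritten = composite_sprite(frames, mode="overwrite")
underwritten = composite_sprite(frames, mode="underwrite")
```

In either mode, each sprite pixel is copied from exactly one frame, so no blurring occurs, but any foreground contained in the frame that supplies the pixel is copied as well. In the example, underwriting keeps the value 200 from the first frame.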
However, the above-mentioned first method for extracting the foreground object has two problems. The first problem is that the method cannot be applied to an existing image. The second problem is that the method requires a large-scale apparatus for the chroma key.
The second method for extracting the foreground object has a problem in that it is not suitable for a real-time application since it requires manual processing.
The third method for extracting the foreground object has a problem in that the outline information of the foreground object cannot be obtained when the camera moves (such as by panning or tilting), since the third method is based on calculating the differences between frames. In addition, even when the frames are aligned such that the camera movement is canceled before the differences are calculated, the camera movement cannot be canceled completely. Thus, difference values appear in areas other than the foreground object. Therefore, the third method has a problem in that the outline cannot be specified.
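The following sketch illustrates why residual misalignment defeats the inter-frame difference approach. It is a toy example under stated assumptions: frames are two-dimensional lists of luminance values, the threshold of 20 is arbitrary, and the function name frame_difference_mask is hypothetical.

```python
def frame_difference_mask(prev_frame, curr_frame, threshold=20):
    """Mark with 1 every pixel whose inter-frame difference exceeds threshold."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]

# A textured background row seen by a fixed camera.
background = [[0, 100, 0, 100, 0, 100]]

# Fixed camera: only the object (value 250) changes between frames.
with_object = [[0, 100, 250, 100, 0, 100]]
fixed_mask = frame_difference_mask(background, with_object)

# Moving camera: a residual misalignment of one pixel, with no object at all.
shifted = [[100, 0, 100, 0, 100, 0]]
shifted_mask = frame_difference_mask(background, shifted)
```

With the fixed camera, only the object pixel is marked. With a one-pixel residual shift, every pixel of the textured background is marked even though no foreground object is present, so no outline can be recovered from the mask.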
The first method for extracting the background sprite has a problem in that, when the global motion contains a certain degree of error, the quality of the sprite is degraded since the frames are slightly misaligned.
The second method for extracting the background sprite has a problem in that, even though the quality of the sprite is good, the foreground of the frame which is placed frontmost remains in the sprite.
In the following, techniques for generating a foreground object shape as a segmentation mask will be described. The segmentation mask is one of the two object representations in MPEG-4, namely the texture map and the segmentation mask.
As a conventional foreground object generation method, there is a technique in which a threshold operation is applied to the differences between a background image and an arbitrary original image, and coordinates where the difference is larger than the threshold are regarded as belonging to a moving object, that is, a foreground image. First, the object coding in MPEG-4 which is used for the technique will be described.
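The threshold operation of this conventional method can be sketched as follows. This is a minimal sketch assuming that the background image and the original image are two-dimensional lists of luminance values of equal size; the function name foreground_coordinates and the threshold of 30 are hypothetical.

```python
def foreground_coordinates(background, original, threshold=30):
    """Return the set of (x, y) coordinates regarded as foreground.

    A coordinate is regarded as foreground when the absolute difference
    between the original image and the background image exceeds threshold.
    """
    return {
        (x, y)
        for y, (bg_row, img_row) in enumerate(zip(background, original))
        for x, (bg, px) in enumerate(zip(bg_row, img_row))
        if abs(px - bg) > threshold
    }

background = [[80, 80, 80],
              [80, 80, 80]]
original = [[80, 200, 80],
            [80, 80, 90]]
foreground = foreground_coordinates(background, original)
```

Here the pixel at (1, 0) differs from the background by 120 and is regarded as foreground, while the pixel at (2, 1) differs by only 10 and is regarded as background.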
In MPEG-4, a foreground object of an arbitrary shape can be encoded. A foreground object can be represented by a pair of the texture map and the segmentation mask. There are two kinds of segmentation masks, that is, a multiple-valued shape which also represents transparency and a binary shape which does not represent transparency. Only the binary shape is considered here. In the texture map, a brightness signal (Y signal) and color-difference signals (Cb and Cr signals), which are used in conventional methods (MPEG-1, MPEG-2 and the like), are assigned to an area where an object exists. In the segmentation mask, 255 is assigned to an object area and 0 is assigned to other areas.
At each pixel (coordinates), three kinds of pixel values are assigned for the texture and one kind of pixel value (which will be called an alpha value) is assigned for the shape, that is, four kinds of pixel values are assigned in total. In order to distinguish the kinds, the pixel for the texture will be called a texture pixel and the pixel for the shape will be called a shape pixel. The texture pixel can take values ranging from 0 to 255. The shape pixel can take a value of either 0 or 255. FIG. 1A shows an example of the texture representation, and FIG. 1B shows an example of the segmentation mask representation.
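The four values assigned at each pixel can be sketched as a small data structure. This is an illustration of the value ranges described above, not a structure defined by MPEG-4; the class name ObjectPixel and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectPixel:
    """Texture pixel (Y, Cb, Cr) plus binary shape pixel (alpha)."""
    y: int      # brightness signal, 0 to 255
    cb: int     # color-difference signal, 0 to 255
    cr: int     # color-difference signal, 0 to 255
    alpha: int  # shape pixel: 255 inside the object, 0 elsewhere

    def __post_init__(self):
        for name in ("y", "cb", "cr"):
            if not 0 <= getattr(self, name) <= 255:
                raise ValueError(f"{name} must be in the range 0 to 255")
        if self.alpha not in (0, 255):
            raise ValueError("alpha must be 0 or 255 for a binary shape")

    @property
    def is_object(self):
        """True when the pixel belongs to the object area."""
        return self.alpha == 255

inside = ObjectPixel(y=120, cb=128, cr=128, alpha=255)
outside = ObjectPixel(y=0, cb=128, cr=128, alpha=0)
```

A multiple-valued shape would allow intermediate alpha values to express transparency; the check on alpha above reflects the binary shape only.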
In the following, shape coding in MPEG-4 will be described. The following description is known to a person skilled in the art as the shape coding in MPEG-4. (A reference book, “All of MPEG-4”, pp. 38–116, kougyou chousakai, can be referred to for detailed information.)
Coding of a shape is performed in units of macro-blocks of s pixels×s pixels. The macro-block can take any size, such as 8 pixels×8 pixels or 16 pixels×16 pixels. There are two kinds of shape coding, which are lossless (reversible) and lossy (nonreversible). In the most lossy coding, the amount of coding bits is smallest since the shape is approximated in macro-block units. More specifically, when half or more of the pixels in a macro-block have the value of 255, that is, when half or more of the area of the macro-block is filled by the object shape, 255 is assigned to all pixels in the macro-block. In other cases, 0 is assigned to all pixels in the macro-block.
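The macro-block approximation of the most lossy coding can be sketched as follows. This is a minimal sketch of the rule stated above, not an MPEG-4 encoder: the mask is assumed to be a two-dimensional list of 0 and 255 values, the function name approximate_to_macroblocks is hypothetical, and the treatment of partial blocks at the image border (counting only the pixels that exist) is an assumption.

```python
def approximate_to_macroblocks(mask, s):
    """Approximate a binary mask (values 0 or 255) in s x s macro-block units.

    A macro-block becomes all 255 when half or more of its pixels are 255,
    and all 0 otherwise.
    """
    height = len(mask)
    width = len(mask[0])
    result = [[0] * width for _ in range(height)]
    for by in range(0, height, s):
        for bx in range(0, width, s):
            ys = range(by, min(by + s, height))
            xs = range(bx, min(bx + s, width))
            filled = sum(1 for y in ys for x in xs if mask[y][x] == 255)
            total = len(ys) * len(xs)
            value = 255 if 2 * filled >= total else 0
            for y in ys:
                for x in xs:
                    result[y][x] = value
    return result

mask = [
    [255, 255,   0, 0],
    [255,   0,   0, 0],
    [255, 255, 255, 0],
    [  0,   0,   0, 0],
]
approximated = approximate_to_macroblocks(mask, s=2)
```

In this example the object pixel in the third row and third column is lost, because only one of the four pixels in its macro-block belongs to the object, while the two left macro-blocks are filled completely.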
FIGS. 2A and 2B show an example of the above-mentioned macro-block approximation. FIG. 2A shows an original shape and FIG. 2B shows a typical example of the macro-block approximation for the foreground object extraction using the most lossy coded background image.
In the following, an example using the MPEG-4 object coding will be described. An original image is divided into foreground objects and background objects. In addition, the background object is represented by a panoramic static image which is called a sprite (which is the above-mentioned background sprite). Then, the shape and the texture of the foreground object are encoded, and the MPEG-4 sprite coding is performed on the background sprite. (The above-mentioned “All of MPEG-4” can be referred to for detailed information.) Accordingly, in comparison with MPEG-4 simple profile coding (conventional coding based on MC+DCT), in which an image is not divided into the foreground object and the background sprite, the same level of image quality can be achieved with a smaller amount of coding bits.
However, the above-mentioned MPEG-4 shape coding has the following problems.
First, the amount of shape coding bits becomes large in the lossless coding and in the lossy coding having a high degree of precision when the shape is complex. This tendency is especially strong when a foreground object is automatically generated.
Second, a process for supplying texture pixels which is called “padding” is necessary for decoding a shape in the lossless coding and in the lossy coding having a high degree of precision, which makes decoding costly. This causes a problem in realizing real-time decoding by software.
Third, when the lossy coding having the least amount of coding bits is used, even though the above-mentioned two problems can be avoided, the shape is eroded into the inside of the object such that the shape is visually unsatisfactory, as shown in FIG. 2B.
Fourth, when the MPEG-4 object coding is used for the foreground and the sprite coding is used for the background, the amount of coding bits can be decreased dramatically only when the area ratio of the foreground part to the entire image is equal to or smaller than a certain value. Thus, there is a problem in that the amount of coding bits increases when the area ratio is more than the certain value.