1. Field of the Invention
The present invention relates to an image processing apparatus and method and, more particularly, to an image processing apparatus and method of extracting an object from an image and processing the extracted object.
2. Description of the Related Art
Coding systems such as MPEG-1, MPEG-2, and H.261 are known as conventional motion image coding systems. These coding systems can efficiently encode motion images by using an interframe correlation.
As a coding system having higher efficiency than the above conventional coding systems, standardization has been examined for a system, such as MPEG-4, which cuts out an object from an image, encodes the object separately from its background, and transmits only this object. When this coding system is used, the image region to be transmitted can be decreased, so motion images can be transmitted even over a low-bit-rate channel. Additionally, a receiving side can display suitable images by selectively displaying objects or changing the arrangement or sizes of objects. Furthermore, editing such as synthesizing an object with another background can be easily performed.
As a method of extracting an object from a motion image, an extraction technique using a chromakey used in broadcasting stations and the like is generally known. This technique is a method of photographing an object such as a person before a blue background and cutting out this person object from the image signal. Object photographing using the chromakey is usually performed in photographing studios and the like under well-ordered illuminating conditions, so no object shades are formed.
Also, the method can automatically separate an image region from a still image. To extract an object from a still image, a desired image region is cut out by manually designating the region by a user or uniting regions having similar colors.
Unfortunately, objects extractable by the chromakey are limited to relatively small ones which can be photographed before a blue background. Extraction from motion images of natural images is one possible method of extracting relatively large objects. Known examples are a method of previously inputting a background image and cutting out an object from a difference image between the background image and an input image, and a method of previously acquiring color information and the like constituting a background and extracting, from an input image, a region whose color differs from that color information (Picture Coding Symposium of Japan PCSJ97I-3.15).
A method of cutting out a helicopter 1051 as an object from an image 1050 as shown in FIG. 21 will be described below. That is, a difference between the image 1050 shown in FIG. 21 and a previously photographed background image 1052 shown in FIG. 22 is obtained. Processing such as noise reduction is performed for this difference to extract an object 1053 shown in FIG. 23. A new image shown in FIG. 25 can be edited by synthesizing the object 1053 and a background image 1054 shown in FIG. 24.
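The difference-based cut-out described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited reference: images are grayscale lists of rows, the names `extract_object_mask` and `remove_isolated_pixels` and the threshold value are hypothetical, and the noise reduction is reduced to clearing isolated pixels.

```python
def extract_object_mask(frame, background, threshold=30):
    """Mark pixels whose difference from the background exceeds the threshold.

    frame and background are equally sized lists of rows of gray levels.
    The threshold of 30 is an assumed value chosen for illustration.
    """
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frow, brow)]
        for frow, brow in zip(frame, background)
    ]


def remove_isolated_pixels(mask):
    """Crude noise reduction: clear set pixels that have no set 4-neighbour."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            neighbours = [
                mask[ny][nx]
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                if 0 <= ny < h and 0 <= nx < w
            ]
            if not any(neighbours):
                out[y][x] = 0
    return out


# Background of uniform gray level 100; the frame adds a two-pixel object
# and one isolated noisy pixel in the bottom-right corner.
background = [[100] * 5 for _ in range(4)]
frame = [row[:] for row in background]
frame[1][1] = frame[1][2] = 200   # object
frame[3][4] = 180                 # noise

raw_mask = extract_object_mask(frame, background)
clean_mask = remove_isolated_pixels(raw_mask)
```

Because the mask depends only on the difference from the background, a shade cast by the object onto the background would be extracted together with the object, which is the difficulty discussed next.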
Unlike in object photographing for general chromakey synthesis, however, an object in a motion image of a natural image often has a shade because the object is photographed in natural light. Therefore, the shade of the helicopter appears in the sky in the image shown in FIG. 25, resulting in unnatural synthesis.
Likewise, an object 1057 shown in FIG. 27 is obtained by cutting out a cattle 1056 grazing herbage on a lawn from an image 1055 shown in FIG. 26. When this object 1057 is synthesized on an image of an urban district as shown in FIG. 28, the shade of the cattle contained in the object 1057 is also synthesized. An image of this cattle shade contains an image of the herbage on the lawn in the original image 1055. Therefore, when this object 1057 is synthesized on the asphalt background as shown in FIG. 28, a considerably unnatural image results.
To obtain an object having no shade by using the chromakey or the like and synthesize this object, a three-dimensional positional relationship between the object and the synthesized background, a light source, and the like can be set by, e.g., computer processing. This setting is effective in a limited environment such as a studio. However, no shade can be formed for an object once two-dimensionally input as a motion image, so an unnatural synthetic image having no shade is formed. FIG. 29 shows an image formed by synthesizing the cattle 1056 with no shade. As is evident from FIG. 29, this image is very unnatural. This unnaturalness of an image resulting from shadeless synthesis becomes conspicuous when the image is synthesized on a background image having large amounts of shades of other objects.
As described above, it is difficult to independently and appropriately process a main object extracted from an image and a secondary object attached to this main object.
The abovementioned MPEG-4 coding system is a method of separating a motion image into a background and a subject to be encoded, which is called an "object", and separately encoding the background and the object. Unlike in encoding performed in units of frames such as in conventional MPEG-1, MPEG-2, H.261, and H.263, a background having no (or little) motion is encoded only once, so low-bit-rate encoding is possible. Additionally, a decoding side can easily perform editing such as selection, enlargement or reduction, and rotation of an object. This allows a user to perform desired decoding.
An example of coding in the MPEG-4 coding system will be described below. Note that a method of extracting a background and an object from an image is not a standard subject of MPEG-4, so any arbitrary method can be used. For example, the method described in "Morphological Segmentation Using Advance Knowledge Information in Sports Programs" (1997 Image Media Processing Symposium (IMPS97) I-3, 15, Oct. 8th, 1997, Naemura et al.) can be used. This is a method of previously acquiring information of, e.g., a ground where no players as objects exist, as a background and, on the basis of this background information, extracting objects (players) from a motion image.
FIG. 47 is a block diagram showing the arrangement of a conventional motion image input apparatus. Referring to FIG. 47, motion image data obtained by image sensing by a TV camera 1001 is stored in a background memory 1002. Assume that an image of a yacht and a battle ship cruising on the sea as shown in FIG. 48 is to be processed. When a background region is extracted from this image by forming a color histogram, the sky in the upper half and the sea in the lower half can be detected as shown in FIG. 49. This background image shown in FIG. 49 is encoded by a background image encoder 1003.
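One way a color histogram could pick out a background such as sky and sea is sketched below. The text does not specify how the histogram is evaluated, so the rule used here (a color is background when it covers at least an assumed share of the frame) and the names `dominant_background_colors`, `background_mask`, and `min_share` are hypothetical.

```python
from collections import Counter


def dominant_background_colors(image, min_share=0.2):
    """Return the colors that each cover at least min_share of the pixels.

    image is a list of rows of color labels (already quantized).
    The 20% share is an assumed cut-off, not a value from the text.
    """
    counts = Counter(pixel for row in image for pixel in row)
    total = sum(counts.values())
    return {color for color, n in counts.items() if n / total >= min_share}


def background_mask(image, min_share=0.2):
    """Mark with 1 every pixel whose color is a dominant (background) color."""
    bg = dominant_background_colors(image, min_share)
    return [[1 if pixel in bg else 0 for pixel in row] for row in image]


# Upper half sky, lower half sea, with one pixel belonging to a yacht.
image = [
    ["sky", "sky", "sky", "sky"],
    ["sky", "sky", "sky", "sky"],
    ["sea", "sea", "yacht", "sea"],
    ["sea", "sea", "sea", "sea"],
]
bg_colors = dominant_background_colors(image)
bg_mask = background_mask(image)
```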
Subsequently, an image containing the yacht and the battle ship as objects is sensed as a motion image. An object extractor 1004 extracts the objects by calculating the difference from the background image or extracting regions having different colors from that of the background image. The extracted objects are as shown in FIG. 50. An object encoder 1005 encodes these objects.
FIG. 51 is a detailed block diagram of the object encoder 1005. Shape information of an object is input from a terminal 1020 and stored in a shape memory 1022. This shape information is represented by a binary image in which pixels indicating an object are white and other pixels are black. A boundary extractor 1023 extracts, as a boundary, pixels of this image where black and white are switched and inputs this boundary to a motion compensator 1024. If the frame mode is an I frame, the motion compensator 1024 does not operate; the input shape information is stored in a boundary memory 1025 and arithmetically encoded by an arithmetic encoder 1026. If the frame mode is a P frame or a B frame, the motion compensator 1024 performs motion compensation by comparing the input boundary with the contents of the boundary memory 1025 storing boundary conditions in past frames. The result of compensation is arithmetically encoded. After that, the boundary memory 1025 stores information of the boundary of the input shape information.
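The boundary extraction step can be illustrated with the following sketch. It assumes the binary shape image is a list of rows with 1 for white (object) and 0 for black, and it treats an object pixel as a boundary pixel when any 4-neighbour is black or lies outside the image; the text does not state the neighbourhood used, so that choice and the name `extract_boundary` are assumptions.

```python
def extract_boundary(shape):
    """Return the (row, column) positions of object pixels on the boundary.

    A pixel is on the boundary when it is white (1) and at least one of its
    four neighbours is black (0) or falls outside the image.
    """
    h, w = len(shape), len(shape[0])
    boundary = []
    for y in range(h):
        for x in range(w):
            if not shape[y][x]:
                continue
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                outside = not (0 <= ny < h and 0 <= nx < w)
                if outside or shape[ny][nx] == 0:
                    boundary.append((y, x))
                    break
    return boundary


# A 3x3 white square centred in a 5x5 black image.
shape = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
boundary = extract_boundary(shape)
```

Only the centre pixel of the square is surrounded by white on all four sides, so it is the only object pixel left out of the boundary.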
Meanwhile, image data (texture data) of the objects shown in FIG. 50 is input from a terminal 1021 and stored in an object memory 1027. The data stored in the object memory 1027 and the shape memory 1022 are input to a padding unit 1028. This padding unit 1028 pads pixels outside an object, i.e., pixels indicated by black in the shape information, in units of macro blocks in accordance with pixel values in a nearby object. FIG. 52 shows the result of padding. This padding is repeatedly performed vertically and horizontally. The padding unit 1028 inputs padding pixels to macro blocks apart from the objects.
A subtracter 1029 calculates the difference of the image data thus padded from the output data from the motion compensator 1037. A DCT unit 1030 performs DCT for the difference data. A quantizer 1031 quantizes the transformed data by using a predetermined quantization matrix. A coefficient encoder 1032 performs Huffman encoding for the quantized data. An inverse quantizer 1033 inversely quantizes the quantized data. An inverse DCT unit 1034 returns the inversely quantized data to the predicted difference value. This value is added to the output from the motion compensator 1037 to decode the pixel values. The decoded pixel values are stored in an object memory 1036 and used in the next motion compensation. In the P frame or the B frame, the motion compensator 1037 performs motion compensation by comparing the contents in the object memories 1036 and 1027, thereby calculating a predicted value and a motion vector. This motion vector is encoded and input to a synthesizer 1038. The synthesizer 1038 adds a header and the like to the outputs from the arithmetic encoder 1026, the motion compensator 1037, and the coefficient encoder 1032 to form MPEG-4 encoded data. This data is output from a terminal 1039.
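The prediction loop formed by the subtracter, DCT unit, quantizer, inverse quantizer, and inverse DCT unit can be sketched in one dimension as follows. Several simplifications are assumed: an 8-sample orthonormal DCT replaces the two-dimensional block transform, a single quantization step replaces the quantization matrix, and entropy coding is omitted. The names `dct`, `idct`, and `encode_block` are hypothetical.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1 / n)
        s += sum(
            c[k] * math.sqrt(2 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def encode_block(pixels, prediction, step=4):
    """Return (quantized levels, locally decoded pixels).

    The encoder decodes its own output so that the next prediction is made
    from the same pixel values the decoder will have.
    """
    residual = [p - q for p, q in zip(pixels, prediction)]      # subtracter
    levels = [round(c / step) for c in dct(residual)]           # DCT + quantizer
    recon_residual = idct([lv * step for lv in levels])         # inverse path
    reconstructed = [q + round(r) for q, r in zip(prediction, recon_residual)]
    return levels, reconstructed


pixels = [52, 55, 61, 66, 70, 61, 64, 73]
prediction = [50] * 8
levels, reconstructed = encode_block(pixels, prediction, step=1)
```

A larger `step` yields smaller levels and fewer bits at the cost of a larger reconstruction error; when the prediction equals the input, every level is zero.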
Referring back to FIG. 47, a synthesizer 1006 synthesizes outputs from the background encoder 1003 and the object encoder 1005 and adjusts the synthetic data into the form of MPEG-4 encoded data by adding a header and the like. After that, the synthesizer 1006 transmits the data to a communication channel 1008 via a communication interface 1007 or stores the data in a storage device 1009.
In the object extracting method as described above, however, a portion except for a background is processed as an object. Hence, a plurality of objects supposed to be different from each other, e.g., the yacht and the battle ship shown in FIG. 50, are processed as one object. That is, it is very difficult to process these objects as individual objects.
Also, if objects move away from each other in the next frame as shown in FIG. 53, the size of two objects as a whole collected as one object increases. On a decoding side, therefore, the sizes of shape information and texture data of the decoded objects increase to occupy a large area in a memory. This decreases the processing efficiency.
Furthermore, since objects are extracted in units of frames, the interframe relationship between the extracted objects is unknown. This makes encoding using an interframe correlation difficult to perform.
As described above, if an object extracted from an image contains a plurality of objects, these objects are difficult to process as independent objects.
Accordingly, it is an object of the present invention to provide an image processing apparatus and method capable of performing suitable image processing for an object containing a main object and a secondary object attached to the main object.
According to the present invention, the foregoing object is attained by providing an image processing apparatus comprising: extracting means for extracting at least one object from an image; classifying means for classifying the extracted object into a main object and a secondary object attached to the main object; main object processing means for performing image processing for the main object; and secondary object processing means for performing image processing for the secondary object.
With this arrangement, suitable image processing can be performed for each of a main object and a secondary object, such as a shade, attached to the main object.
It is another object of the present invention to provide an image processing apparatus and method capable of efficiently encoding an object containing a main object and a secondary object attached to the main object.
According to another aspect of the present invention, the foregoing object is attained by providing an image processing apparatus comprising: extracting means for extracting at least one object from an image; classifying means for classifying the extracted object into a main object and a secondary object attached to the main object; main object encoding means for encoding the main object; and secondary object encoding means for encoding only shape information of the secondary object.
This arrangement improves the encoding efficiency.
It is another object of the present invention to provide an image processing apparatus and method capable of giving a main object an arbitrary secondary object to be attached to the main object.
In still another aspect of the present invention, the foregoing object is attained by providing an image processing apparatus comprising: object extracting means for extracting at least one main object from an image; secondary object generating means for generating a secondary object to be attached to the main object; main object processing means for performing image processing for the main object; and secondary object processing means for performing image processing for the secondary object.
With this arrangement, a secondary object to be attached to a main object can be appropriately generated and given.
It is another object of the present invention to provide an image processing apparatus and method capable of extracting an object from a motion image by dividing the object and efficiently encoding the divided objects.
In still another aspect of the present invention, the foregoing object is attained by providing an image processing apparatus comprising: input means for inputting motion image data; object extracting means for extracting at least one object from the motion image and outputting shape information of the object; shape dividing means for dividing the shape information; and object dividing means for dividing the object on the basis of the result of division by said shape dividing means.
Since an object can be properly divided, the size of an image to be encoded can be minimized.
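The effect of dividing shape information on the size of the image to be encoded can be illustrated as follows. How the shape dividing means operates is not specified in this passage, so 4-connected component labeling is used purely as an assumed stand-in, and the names `split_shape`, `bounding_box`, and `box_area` are hypothetical.

```python
from collections import deque


def split_shape(mask):
    """Divide a binary shape mask into 4-connected regions.

    Returns a list of regions, each a list of (row, column) positions.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            region = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def bounding_box(pixels):
    """Return (top, left, bottom, right) of a list of (row, column) positions."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return min(ys), min(xs), max(ys), max(xs)


def box_area(box):
    top, left, bottom, right = box
    return (bottom - top + 1) * (right - left + 1)


# Two objects that have moved to opposite corners of the frame.
mask = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1],
]
regions = split_shape(mask)
united_area = box_area(bounding_box([p for r in regions for p in r]))
divided_area = sum(box_area(bounding_box(r)) for r in regions)
```

In this example the single united object needs a bounding rectangle covering the whole frame, whereas the two divided objects together need only a quarter of that area.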
It is another object of the present invention to provide an image processing apparatus and method capable of efficiently dividing an object in an encoded motion image.
In still another aspect of the present invention, the foregoing object is attained by providing an image processing apparatus comprising: input means for inputting encoded motion image data; separating means for separating the motion image data into encoded data of a background and encoded data of an object; extracting means for extracting shape information from the separated encoded data of the object; shape dividing means for dividing the shape information; and object dividing means for dividing the encoded data of the object on the basis of the result of division by said shape dividing means.
With this arrangement, an encoded object can be efficiently divided without decoding it.
The invention is particularly advantageous since a secondary object such as a shade attached to a main object can be freely controlled.
For example, natural-looking image synthesis is realized with a small amount of information by reflecting the image processing result of a main object in the image processing result of a shade.
Additionally, even for an object having no shade, a shade can be generated on the basis of the shape of the object and added. This allows more natural image synthesis.
Also, according to the present invention, a plurality of objects in a motion image can be divided while an interframe correlation is checked. This makes efficient encoding possible.
Accordingly, editing and the like can be efficiently performed in units of objects in an image, and data can be efficiently transferred and stored.