Processing for separating and combining images on a per-object basis by utilizing digital techniques has attracted interest in recent years. In particular, MPEG-4 has been established as an international standard for the coding of moving images. MPEG-4 coding makes it possible to perform coding/decoding object by object and opens up various applications that have been difficult to achieve heretofore, examples being an improvement in coding efficiency, data distribution conforming to the transmission path and re-manipulation of images.
A technique referred to as the background subtraction method is known generally as a method of extracting an object in the processing of moving images. This is a method in which points where changes occur are detected by comparing a previously captured background image and an actual input image. The principles of this method will now be described in simple terms.
First, let Pc(x,y) and Pb(x,y) represent a pixel value of an input image and a pixel value of a background image, respectively, at coordinates (x,y) in the plane of an image. The difference between Pc(x,y) and Pb(x,y) is calculated and the absolute value thereof is compared with a certain threshold value Th.
An example of a criterion formula is as follows:
|Pc(x,y) − Pb(x,y)| ≤ Th   (1)
If the absolute value of the difference in Equation (1) is equal to or less than the threshold value Th, this indicates that there is no change at the point (x,y) and, hence, it is decided that Pc(x,y) is background. On the other hand, if the absolute value of the difference in Equation (1) is greater than the threshold value Th, then this indicates that the value has changed and that the point is one that should be extracted. By performing the above discrimination process at all points in an image, extraction of the object for one frame is achieved.
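The discrimination based on Equation (1) can be sketched as follows. This is only an illustrative sketch: the function name background_subtraction, the use of nested lists of single-channel pixel values, and the sample values are assumptions made for the example and do not come from the system described here.

```python
def background_subtraction(current, background, th):
    """Return a binary shape mask for one frame.

    A pixel is marked 1 (object) when |Pc - Pb| > th and
    0 (background) when |Pc - Pb| <= th, as in Equation (1).
    """
    mask = []
    for row_c, row_b in zip(current, background):
        mask.append([1 if abs(pc - pb) > th else 0
                     for pc, pb in zip(row_c, row_b)])
    return mask


# Hypothetical 2x3 single-channel images.
background = [[10, 10, 10],
              [10, 10, 10]]
current = [[10, 12, 90],
           [11, 80, 10]]
mask = background_subtraction(current, background, th=5)
print(mask)
```

With these sample values, only the two pixels whose difference exceeds the threshold are marked as belonging to the object.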
FIG. 13 is a block diagram illustrating the configuration of a conventional system in which the background subtraction method and MPEG-4 coding are combined. An image input unit 1201 in FIG. 13 is for inputting a moving image and is exemplified by the image sensing unit of a camera. Since the background subtraction method requires a background image for reference, the background image is generated by a background image generating unit 1202. Generating the background image by capturing one frame of an image beforehand under conditions in which an object does not yet appear is the simplest method.
A background subtraction processor 1203 generates shape data representing the shape of an object from the input image from the image input unit 1201 and the reference image from the background image generating unit 1202. The image from the image input unit 1201 and the shape data from the background subtraction processor 1203 are input to an encoder 1204 frame by frame and the encoder proceeds to apply coding processing. The encoder 1204 will be described as one which executes coding in accordance with the MPEG-4 coding scheme.
If an object is to be coded, it is necessary to code the object shape and position information. To accomplish this, first a rectangular area that encompasses the object is set and the coordinates of the upper left corner of the rectangle and the size of the rectangular area are coded. The rectangular area is referred to as a “bounding box”. The area within an object expressed by an image signal and shape signal is referred to as a “VOP” (Video Object Plane).
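As an illustration of how the bounding box might be derived from the shape data, the sketch below finds the smallest rectangle that encloses all object pixels of a binary mask. The function name bounding_box and the (x, y, width, height) return convention are assumptions for this example; any alignment of the rectangle to macroblock boundaries that an actual encoder may perform is not shown.

```python
def bounding_box(mask):
    """Return (x, y, width, height) of the smallest rectangle that
    encloses every non-zero pixel of the mask, or None if the mask
    contains no object pixels. (x, y) is the upper left corner.
    """
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    left, top = min(xs), min(ys)
    return left, top, max(xs) - left + 1, max(ys) - top + 1


# Hypothetical 3x4 shape mask.
shape = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 0, 1, 0]]
box = bounding_box(shape)
print(box)
```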
FIG. 14 is a block diagram illustrating in detail the structure of the encoder 1204, which executes VOP coding according to the prior art. It should be noted that the inputs to the encoder 1204 are image luminance and color difference signals as well as the shape signal. These signals are processed in macroblock units.
First, in an intra-mode, a DCT unit 1301 applies a discrete cosine transform (DCT) to each block and a quantizer 1302 quantizes the resultant signal. Quantized DCT coefficients and the quantization width are subjected to variable-length coding by a variable-length encoder 1312.
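The intra-mode path of transform followed by quantization can be sketched as below. This is a simplified illustration under stated assumptions: an 8x8 block size, an orthonormal DCT-II computed directly from its definition, and uniform quantization with a single step size standing in for the quantization width. The names dct_2d and quantize are hypothetical, and the actual quantization rules of the MPEG-4 scheme are not reproduced.

```python
import math

N = 8  # assumed block size


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct definition)."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, step):
    """Uniform quantization: divide by the step and round to an integer."""
    return [[int(round(c / step)) for c in row] for row in coeffs]


# A flat block has energy only in the DC coefficient.
flat = [[100] * N for _ in range(N)]
levels = quantize(dct_2d(flat), step=16)
print(levels[0][0])
```

For the flat block, all AC coefficients quantize to zero, which is why smooth regions are cheap to code once the result is passed to a variable-length encoder.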
In an inter-mode, on the other hand, a motion detector 1307 detects motion with respect to another VOP that is adjacent in terms of time, using a motion detection method of which block matching is a primary example. A motion-vector prediction unit 1308 detects a macroblock that is predicted to exhibit the smallest error relative to a macroblock of interest. The signal indicating motion toward the macroblock that is predicted to exhibit the smallest error is the motion vector. An image to which reference is made in order to generate the predicted macroblock is referred to as a “reference VOP”.
A motion compensator 1306 applies motion compensation to the reference VOP based upon the detected motion vector, thereby acquiring the optimum predicted macroblock. Next, the difference between the next macroblock of interest and the corresponding predicted macroblock is obtained, DCT is applied to the resulting difference signal by the DCT unit 1301 and the DCT coefficients are quantized by the quantizer 1302.
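Block matching and the formation of the difference signal can be sketched as follows. The sketch assumes a full search over a small window, the sum of absolute differences (SAD) as the error measure, and a motion vector defined as the displacement from the block position in the current image to the matched position in the reference image. These choices, and the names sad, block_match and residual, are assumptions for illustration rather than details of the encoder 1204.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between the current block at (bx, by)
    and the reference block displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def block_match(cur, ref, bx, by, size, search):
    """Full search: return (dx, dy, cost) of the displacement with the
    smallest SAD among candidates that lie inside the reference image."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= by + dy and by + dy + size <= height and
                      0 <= bx + dx and bx + dx + size <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


def residual(cur, ref, bx, by, dx, dy, size):
    """Difference between the current block and its motion-compensated
    prediction; this is the signal that would be passed to the DCT."""
    return [[cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx]
             for x in range(size)] for y in range(size)]


# Hypothetical data: the current image is the reference shifted right by one pixel.
ref = [[x * x + 17 * y * y for x in range(8)] for y in range(8)]
cur = [[ref[y][max(x - 1, 0)] for x in range(8)] for y in range(8)]
dx, dy, cost = block_match(cur, ref, bx=2, by=2, size=4, search=2)
res = residual(cur, ref, 2, 2, dx, dy, 4)
print(dx, dy, cost)
```

When the prediction is exact, as in this sample, the difference signal is zero everywhere, which is the case in which interframe coding is most efficient.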
The shape data, on the other hand, is coded by a shape coding CAE (context-based arithmetic encoding) unit 1309. What actually undergoes CAE coding here are boundary blocks only. With regard to a block inside a VOP (all data within the block lies within the object) and a block outside a VOP (all data within the block lies outside the object), only header information is sent to the variable-length encoder 1312. A boundary block that undergoes CAE coding is processed in a manner similar to that of the image data. Specifically, in the inter-mode, the boundary block undergoes motion detection by the motion detector 1307 and motion-vector prediction is performed by the motion-vector prediction unit 1308. CAE coding is applied to the difference value between the motion-compensated shape data and the shape data of the preceding frame.
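The distinction between blocks inside a VOP, blocks outside a VOP and boundary blocks can be sketched as below. The function name classify_shape_blocks, the string labels and the default block size of 16 are assumptions for illustration; the CAE coding itself is not shown.

```python
def classify_shape_blocks(mask, size=16):
    """Label each size x size block of a binary shape mask.

    "inside":   every pixel of the block lies within the object
    "outside":  every pixel of the block lies outside the object
    "boundary": the block contains both kinds of pixel
    Returns a dict keyed by the (x, y) of each block's upper left corner.
    """
    labels = {}
    for by in range(0, len(mask), size):
        for bx in range(0, len(mask[0]), size):
            pixels = [v for row in mask[by:by + size]
                      for v in row[bx:bx + size]]
            if all(pixels):
                labels[(bx, by)] = "inside"
            elif not any(pixels):
                labels[(bx, by)] = "outside"
            else:
                labels[(bx, by)] = "boundary"
    return labels


# Hypothetical 2x6 mask split into 2x2 blocks for brevity.
shape_mask = [[1, 1, 0, 0, 1, 0],
              [1, 1, 0, 0, 0, 0]]
block_labels = classify_shape_blocks(shape_mask, size=2)
print(block_labels)
```

Only blocks labeled "boundary" would need their shape data coded; for the other two labels, header information alone suffices.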
However, the two problems described below arise with the background subtraction method set forth above.
The first problem involves the fact that this method presumes that there is no change in the background image. Specifically, the problem is that if a value in the background changes owing to a change in illumination or the like, a stable result of extraction will not be obtained. A method of detecting a change in the background image by a statistical technique and updating the background image appropriately has been disclosed in the specification of Japanese Patent Application Laid-Open No. 7-302328 as a solution for dealing with this problem.
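The first problem can be illustrated with a small sketch: when the whole scene becomes uniformly brighter, every pixel differs from the stored background by more than the threshold, so the entire frame is judged to be object even though nothing has moved. The function name changed_fraction and the sample values are assumptions for this example; the updating method of the cited application is not reproduced here.

```python
def changed_fraction(current, background, th):
    """Fraction of pixels for which |Pc - Pb| > th."""
    total = 0
    changed = 0
    for row_c, row_b in zip(current, background):
        for pc, pb in zip(row_c, row_b):
            total += 1
            if abs(pc - pb) > th:
                changed += 1
    return changed / total


# Hypothetical background, and the same scene with illumination raised by 20.
stored_background = [[50, 52, 51],
                     [49, 50, 53]]
brighter_scene = [[p + 20 for p in row] for row in stored_background]

print(changed_fraction(stored_background, stored_background, th=5))
print(changed_fraction(brighter_scene, stored_background, th=5))
```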
The second problem is how to deal with an instance in which a flash is fired in the middle of a scene or in which one object crosses in front of another object. These instances will be described with reference to the drawings. FIG. 15 is a diagram useful in describing shape data representing the shape of an object in a case where a flash is fired in the middle of a scene according to an example of the prior art. Reference numerals 1401 and 1402 denote frame data at a certain time and frame data at the next instant in time, respectively. The scene is illuminated by a flash in the second of these frames. Reference numeral 1403 denotes frame data at the instant in time that follows the frame data 1402. It will be understood that the frame 1402 illuminated by the flash differs greatly from the other results of extraction (1401, 1403).
A change in background illumination, which is the first problem mentioned above, is primarily just a change in luminance value. In the case of a flash, however, which is the second problem mentioned above, hue also changes. As a consequence, accurate correction of the background is difficult to achieve. Further, even if accurate shape data of an object has been obtained, the image data of the object itself also undergoes a major change. With a method such as MPEG-4, therefore, which uses an interframe difference, coding efficiency cannot be raised and the image appears visually unnatural.
FIG. 16 is a diagram useful in describing shape data representing the shape of an object in a case where one object crosses in front of another object in an example of the prior art. Reference numeral 1501 in FIG. 16 denotes frame data immediately before one object (e.g., a vehicle) crosses in front of another object (a person), reference numerals 1502 and 1503 denote frame data when the vehicle is crossing in front of the person, and reference numeral 1504 denotes frame data immediately after the vehicle has crossed in front of the person. If a second object thus happens to be extracted together with a first object, there is a major increase in the amount of information ascribable to a major change in the shape data and image data. This leads to a major decline in image quality. Since this case represents a phenomenon that is entirely different from a change in background illumination, it is difficult to deal with it by the method of updating the background image.