Techniques for extracting a background image and an object image, which is a foreground image, from plural images include a method that utilizes three-dimensional information. In this method, the three-dimensional information of an image scene is obtained using a stereo camera, a range finder, or the like, and the background image and the object image are separated from each other based on the obtained three-dimensional information. However, such a technique requires a device for measuring three-dimensional information.
A technique has also been proposed for extracting the background image and an object image, which is a foreground image, without using three-dimensional information. For example, the technique described in Non-patent document 1 (Chris Stauffer and Eric Grimson, “Adaptive Background Mixture Models for Real-time Tracking,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 246-252, 1999) probabilistically models temporal variations in pixels so as to perform background differentiation that flexibly accommodates such variations. This technique enables the background image and an object image to be separated from each other in a reliable manner.
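The idea of probabilistically modeling the temporal variation of each pixel can be illustrated with the following minimal sketch. It is not the method of Non-patent document 1, which maintains a mixture of Gaussians per pixel; as a simplifying assumption, a single running Gaussian is kept per pixel. The class name `RunningGaussianBackground` and the parameters `learning_rate`, `threshold_sigmas`, and `initial_variance` are hypothetical and chosen only for illustration.

```python
import numpy as np


class RunningGaussianBackground:
    """Per-pixel single-Gaussian background model.

    Assumption: one Gaussian per pixel, a simplification of the
    per-pixel mixture model described in Non-patent document 1.
    """

    def __init__(self, first_frame, learning_rate=0.05,
                 threshold_sigmas=2.5, initial_variance=25.0):
        # Hypothetical parameters, chosen for illustration only.
        self.mean = first_frame.astype(np.float64)
        self.var = np.full(self.mean.shape, initial_variance, dtype=np.float64)
        self.learning_rate = learning_rate
        self.threshold_sigmas = threshold_sigmas

    def apply(self, frame):
        """Return a boolean foreground mask and update the model."""
        frame = frame.astype(np.float64)
        diff = frame - self.mean
        # A pixel is foreground when it deviates from the modeled mean
        # by more than threshold_sigmas standard deviations.
        foreground = diff ** 2 > (self.threshold_sigmas ** 2) * self.var

        # Only pixels judged to be background update the model, so that
        # a passing object does not contaminate the background estimate.
        background = ~foreground
        a = self.learning_rate
        self.mean[background] += a * diff[background]
        self.var[background] = ((1.0 - a) * self.var[background]
                                + a * diff[background] ** 2)
        # Variance floor so that a perfectly static pixel does not make
        # the test infinitely sensitive.
        self.var = np.maximum(self.var, 1.0)
        return foreground
```

In such a sketch, the model is initialized with a first frame and `apply` is called once per subsequent frame. Because only background pixels update the model, an object that stops moving is never absorbed into the background; a mixture model handles that case more gracefully, which is one reason the simplification here should be read as illustrative only.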
Furthermore, Non-patent document 2 (John Winn and Andrew Blake, “Generative Affine Localisation and Tracking”, Neural Information Processing Systems, No. 17, pp. 1505-1512, 2004) and Non-patent document 3 (John Winn and Christopher Bishop, “Variational Message Passing”, Journal of Machine Learning Research, Vol. 6, pp. 661-694, 2005) propose a technique for simultaneously extracting, from plural images, the following elements defined as hidden parameters: a background image, one object image, the shape of the object image, and the motion of the object image. In this technique, the plural hidden parameters are extracted through joint optimization, using images as inputs. This technique enables robust extraction of the parameters, since the plural hidden parameters act in a complementary manner even in the case where noise has occurred or the shape of an object has changed. This technique has another advantage in that there is no need to perform parameter tuning, such as setting a threshold or weighting an energy function, in the background differentiation process.
However, the techniques described in the above-mentioned Non-patent documents 1 to 3 have a problem in that they cannot simultaneously extract plural objects and the motion of each of such objects in a reliable manner.
The image processing method represented by Non-patent document 1 is a technique for separating the background image from everything else in the image, and thus, when plural objects exist in the image, it cannot extract them as individual objects. In order to do so, this technique requires the additional use of a segmentation technique that utilizes information about the objects, such as their colors and motion.
Meanwhile, the image processing method represented by Non-patent documents 2 and 3 is capable of simultaneously extracting plural hidden parameters from image information only. However, the larger the number of objects included in an image, the larger the number of hidden parameters that must be solved for. There are also other causes that increase the number of hidden parameters. For example, the number of hidden parameters increases when camera motion is taken into account, when a motion parameter is added for adapting to complexity in motion, and when an image degradation parameter is added for modeling degradation in image quality for the purpose of improving image quality. The use of these parameters further expands the search space. As a result, the optimization is more likely to fall into a local minimum, and thus there is a higher risk of being unable to obtain the desired solution. For example, the use of this technique to extract two or more object images ends up extracting plural objects as a single object, which corresponds to a local minimum. Thus, local minima must be avoided. One of the most important means for avoiding local minima is to impose a constraint on the extensive search space made up of the hidden parameters. However, while providing knowledge about an image scene in advance as a constraint is an effective means for avoiding local minima, there is a drawback in that the applicable image scenes are limited. Therefore, it is not preferable to conduct supervised learning that utilizes previously given knowledge in relation to input images.