1. Field of the Invention
This invention relates to an image processing apparatus and an image processing method, applied to e.g. a TV conference system or a TV telephone system, in which an image to be transmitted or received is captured and reconstructed into a virtual view point image that appears as if it had been captured by a virtual camera.
2. Description of Related Art
There has so far been proposed a system in which plural users at remote places may hold a dialog while viewing images of the counterpart side users, as typified by a TV telephone system or a teleconference system. In such a system, an image of the counterpart side user is shown on a display, the user viewing the display is imaged as an object, and the resulting image signals are sent over a network, such as a public switched telephone network or a dedicated network, to an image processing apparatus of the counterpart side user, thereby imparting an on-the-spot feeling to both users.
In a conventional teleconference system, the user viewing the image of the counterpart party, shown in the vicinity of the center of the display, is imaged by a camera mounted on the top of the display. Hence, the image shown on the display unit of the counterpart party is that of the user with his/her head bent slightly downward. The result is that the dialog is carried out with the lines of sight of the users not directed to each other, which imparts an uncomfortable feeling to both users.
Ideally, the dialog could be carried out with the lines of sight of the users directed to each other if the cameras were mounted in the vicinity of the center of the display units on which the images of the counterpart parties are shown. However, it is physically difficult to install a camera in the vicinity of the center of the display.
For overcoming the problem that the lines of sight of the parties having a dialog do not coincide with one another, there has been proposed an image processing apparatus in which three-dimensional information of an object is extracted from input images captured by multiple cameras arranged on both sides of the display, and an output image of the object is reconstructed from the extracted three-dimensional information and from information on the view point position of the receiving party, so that the output image is shown on a display of the counterpart user (see Patent Publication 1, as an example). In this image processing apparatus, a virtual view point camera image is synthesized at the center of the image surface, using an epipolar planar image generated from images of multiple cameras arranged on a straight line, so as to realize communication with a high on-the-spot feeling, with the lines of sight of the users coinciding with one another.
In order for the parties to the TV conference to look at one another, with the lines of sight of the users coinciding with one another, an image communication apparatus has also been proposed in which three-dimensional position information is generated on the basis of images picked up by two cameras placed on the left and right sides of the image surface (see for example Patent Publication 2).
For reconstructing an output image of the object, as described above, the correspondence between the respective images, obtained on imaging the object from different view points by at least two cameras, is found for each pixel position. The reason is that, once the correspondence is known, the shape of the object and its distance from the respective cameras may be found by the principle of triangulation, and hence it becomes possible to generate a highly accurate virtual view point image, as if captured by a virtual camera imaginarily mounted in the vicinity of the display.
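By way of illustration only, the principle of triangulation referred to above may be sketched as follows. The sketch assumes a simplified rectified arrangement (two identical cameras with parallel optical axes, pixel coordinates measured from the principal point), which is not the converging arrangement of FIG. 1. The function name and the parameters `focal_px` and `baseline` are hypothetical and do not appear in the cited publications.

```python
def triangulate_rectified(x_left, x_right, y, focal_px, baseline):
    """Recover (X, Y, Z) in the left-camera frame from one pixel correspondence.

    Assumptions (illustrative only): rectified cameras with parallel optical
    axes, pixel coordinates relative to the principal point, focal length
    given in pixels, baseline given in the desired output unit.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    depth = focal_px * baseline / disparity   # similar triangles: Z = f * B / d
    x3d = x_left * depth / focal_px           # back-project the pixel to depth Z
    y3d = y * depth / focal_px
    return (x3d, y3d, depth)
```

Under these assumptions, the distance to the object is inversely proportional to the disparity, so that an error of one pixel in the correspondence translates into a larger distance error for remote objects than for near ones.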
As a basic structure, the case of taking stereoscopic correspondence between two images, captured by two cameras mounted on the left and right sides of the image surface (screen), is explained by referring to FIG. 1.
If images are captured from different view points by the two cameras, having the optical centers C1, C2, with the optical axes of the cameras directed to a point M being imaged, the normal vectors p1, p2 of the images Ps1, Ps2, obtained on the image pickup surfaces of the cameras, point in different directions. That is, although the direction of the straight line interconnecting each camera and the point M coincides with the normal vector p1 or p2 of the corresponding image Ps1 or Ps2, the two normal vectors point in different directions.
Meanwhile, the correspondence is taken by extracting, from the images Ps1, Ps2, the pixel positions and the luminance components of the same location on the object P, and by coordinating these with each other. For example, the point corresponding to a pixel m1 of the image Ps1 lies on an epipolar line L1′ of the image Ps2, so that, by searching along the line L1′, the pixel m1′ most analogous to the pixel m1 may be detected as the corresponding point. The position of the object P in three-dimensional space may readily be estimated by exploiting the pixels m1, m1′ so coordinated.
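By way of illustration only, the restriction of the search to an epipolar line may be sketched as follows. The sketch assumes that a 3x3 fundamental matrix F relating the two images is known, with the convention that a corresponding pair satisfies m1′ᵀ F m1 = 0. The function names and the example matrix, which describes an idealized rectified pair, are assumptions and are not taken from the cited publications.

```python
def epipolar_line(F, m):
    """Return (a, b, c) of the line a*x + b*y + c = 0 in the second image
    on which the point corresponding to pixel m = (x, y) must lie."""
    x, y = m
    return tuple(F[r][0] * x + F[r][1] * y + F[r][2] for r in range(3))


def point_line_distance(line, m):
    """Perpendicular distance, in pixels, from pixel m = (x, y) to the line."""
    a, b, c = line
    x, y = m
    return abs(a * x + b * y + c) / (a * a + b * b) ** 0.5


# Assumed fundamental matrix of an idealized rectified pair: the epipolar
# line of a pixel is the scanline of the same height in the other image.
F_RECTIFIED = [[0, 0, 0],
               [0, 0, -1],
               [0, 1, 0]]

line = epipolar_line(F_RECTIFIED, (12, 7))       # line y = 7
on_line = point_line_distance(line, (30, 7))     # candidate on the line
off_line = point_line_distance(line, (30, 9))    # candidate two rows below
```

Only the candidates at zero distance from the line, or within a small tolerance of it, need be compared with the pixel m1, which reduces the search from two dimensions to one.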
As concrete techniques for taking the correspondence, pixel-based matching, area-based matching and feature-based matching, for example, have so far been proposed. The pixel-based matching is a method of directly searching the other image for the point corresponding to a pixel in one image (see for example Non-Patent Publication 1). The area-based matching is a method of searching the other image for the point corresponding to a pixel in one image by referring to the local image pattern around that pixel (see for example Non-Patent Publications 2 and 3). In the feature-based matching, a variable density edge, for example, is extracted from each image, and only the feature portions of the images are referenced in taking the correspondence (see for example Non-Patent Publications 4 and 5).
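By way of illustration only, area-based matching may be sketched as follows. The sketch assumes rectified images held as lists of rows of luminance values, so that the search runs along a single scanline, and it uses the sum of absolute differences (SAD) as the measure of similarity. The function names, the window size and the search range are assumptions and are not taken from the cited publications.

```python
def window_sad(left, right, row, col_left, col_right, half):
    """Sum of absolute luminance differences between two square windows."""
    total = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            total += abs(left[row + dy][col_left + dx]
                         - right[row + dy][col_right + dx])
    return total


def match_pixel(left, right, row, col, half=1, max_disparity=4):
    """Search the same scanline of the right image for the window most
    similar to the window centered on (row, col) in the left image.
    Returns (best_column, best_cost)."""
    best_col, best_cost = None, None
    for disparity in range(max_disparity + 1):
        candidate = col - disparity
        if candidate - half < 0:          # window would leave the image
            break
        cost = window_sad(left, right, row, col, candidate, half)
        if best_cost is None or cost < best_cost:
            best_col, best_cost = candidate, cost
    return best_col, best_cost


# Synthetic example: the right image is the left image shifted by two pixels.
base = [5, 9, 1, 7, 3, 8, 2, 6, 4, 0]
left_image = [[value + r for value in base] for r in range(3)]
right_image = [row[2:] + [0, 0] for row in left_image]
best = match_pixel(left_image, right_image, row=1, col=5)
```

If every pixel of the window has the same luminance, all candidates yield the same cost and the sketch simply returns the first of them, so that the result is arbitrary in a textureless area.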
However, these techniques specify, as the corresponding point, the most strongly analogous pixel among the pixels lying on the epipolar line, so that coordination between the images Ps1, Ps2, obtained on picking up an image of the user as an object, is difficult to achieve in an area of a repetitive pattern, such as both eyes of the user, or at a so-called non-feature point where the luminance scarcely changes, such as a wall portion, as shown in FIG. 2.
On the other hand, in the images Ps1, Ps2, obtained on imaging from different view points, the displayed contents differ in areas such as the cheeks or ears, shown in FIG. 2, due to the disparity ascribable to the separation between the object and the cameras. These areas are referred to below as occlusion areas. In an occlusion area, a point of the object shown in one image Ps1 is hidden in the other image Ps2, so that no corresponding point exists and the correspondence cannot be taken properly.
Moreover, the images Ps1, Ps2, obtained on capturing from different view points, differ in luminance or chroma components in e.g. an area whose brightness depends on the viewing direction, such as a window portion, or an area producing specular reflection, such as the nose of the user, with the result that coordination is difficult to achieve in these areas.
For taking the correspondence between these images flexibly and robustly, a variety of techniques based on global optimization have so far been proposed. Image-to-image matching by the dynamic programming method is taught in, for example, Non-Patent Publications 6 and 7. This image-to-image matching method teaches that the aforementioned problems of an object with only small changes in texture, or of a repetitive pattern, can be successfully coped with by coordination or extension/contraction matching between the feature points.
[Patent Publication 1] Japanese Patent Application Laid-Open No. 2001-52177
[Patent Publication 2] Japanese Patent Application Laid-Open No. 2002-300602
[Non-Patent Publication 1] C. Lawrence Zitnick and Jon A. Webb: Multi-Baseline Stereo Using Surface Extraction, Technical Report, CMU-CS-96-196 (1996)
[Non-Patent Publication 2] M. Okutomi and T. Kanade: A locally adaptive window for signal matching, Int. Journal of Computer Vision, 7(2), pp. 143-162 (1992)
[Non-Patent Publication 3] M. Okutomi and T. Kanade: Stereo matching exploiting plural base line lengths, Journal of Electronic Information Communication Soc. D-II, Vol. J75-D-II, No. 8, pp. 1317-1327 (1992)
[Non-Patent Publication 4] H. Baker and T. Binford: Depth from edge and intensity based stereo, In Proc. IJCAI '81 (1981)
[Non-Patent Publication 5] W. E. L. Grimson: Computational experiments with a feature based stereo algorithm, IEEE Trans. PAMI, Vol. 7, No. 1, pp. 17-34 (1985)
[Non-Patent Publication 6] Y. Ohta and T. Kanade: Stereo by intra- and inter-scanline search using dynamic programming, IEEE Trans. PAMI, 7(2), pp. 139-154 (1985)
[Non-Patent Publication 7] I. J. Cox et al.: A maximum likelihood stereo algorithm, Computer Vision and Image Understanding, 63(3), pp. 542-567 (1996)
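By way of illustration only, matching of one pair of scanlines by dynamic programming may be sketched as follows. The sketch is a simplified formulation in the spirit of the cited publications and is not the algorithm of either publication. It assumes rectified scanlines of luminance values, an absolute-difference matching cost and a fixed penalty for each pixel left unmatched (occluded); the function name and the parameter `occlusion_cost` are hypothetical.

```python
def dp_scanline_match(left, right, occlusion_cost=20):
    """Find the order-preserving matching of two scanlines with minimum
    total cost. Returns (total_cost, [(left_index, right_index), ...])."""
    n, m = len(left), len(right)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                 # only left pixels consumed
        cost[i][0] = i * occlusion_cost
        move[i][0] = "skip_left"
    for j in range(1, m + 1):                 # only right pixels consumed
        cost[0][j] = j * occlusion_cost
        move[0][j] = "skip_right"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (cost[i - 1][j - 1] + abs(left[i - 1] - right[j - 1]), "match"),
                (cost[i - 1][j] + occlusion_cost, "skip_left"),
                (cost[i][j - 1] + occlusion_cost, "skip_right"),
            ]
            cost[i][j], move[i][j] = min(options)
    # Trace the optimal path back from the end of both scanlines.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "skip_left":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


# The pixel of luminance 50 is visible only in the left scanline.
total, pairs = dp_scanline_match([10, 50, 90, 30], [10, 90, 30])
```

Because the whole scanline is optimized at once, a pixel lacking a counterpart may be left unmatched at a fixed penalty instead of being forced onto the most similar pixel, which is the sense in which such methods extend or contract one scanline relative to the other.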
Meanwhile, in the above-described image-to-image matching, there are occasions where the face and the hands of the user lie at different distances from the image pickup surface of the camera. In particular, some users make body or hand gestures during a dialog, and the accuracy of coordination needs to be maintained even in such a case.
However, since the difference between the face position and the hand position of the user appears as differing disparities in the images Ps1, Ps2, obtained on capturing from different view points, the mismatch between the images cannot be reduced for all image patterns, and there is room for improvement, particularly in the accuracy of coordination.