A fundamental task in computer vision and computational photography is the estimation of a depth map of a real world visual scene on the basis of a 4D light field thereof, i.e. a plurality of 2D images of the real world visual scene captured on a regular grid of camera positions. As plenoptic cameras become more popular and are expected to replace conventional digital cameras in the near future, the need for computationally efficient depth map estimation algorithms will increase further.
However, the task of estimating a depth map from a 4D light field still faces various challenges, such as the accurate depth map estimation of the visual scene at textureless, i.e. uniform color, areas and/or at depth discontinuities. Indeed, at uniform color areas, identifying corresponding points of the visual scene across multiple views/images is extremely difficult. In addition, current algorithmic solutions tend to over-smooth the estimated depth map. Unfortunately, this over-smoothing occurs in particular at object boundaries and where depth discontinuities are strong. This results in an inaccurate depth map estimation of the visual scene at those locations.
The article “Globally Consistent Depth Labeling of 4D Light Fields”, S. Wanner and B. Goldluecke, Computer Vision and Pattern Recognition (CVPR), 2012 describes a method for estimating the depth map of a visual scene via an orientation analysis (based on the so-called structure tensor) of the epipolar images, each of which is a 2D cut of the 4D light field. The structure tensor analysis provides an initial depth map estimation, i.e. a fast local solution, which can be implemented in real time on standard graphics processing units (GPUs). This initial estimation can then be further improved by applying a global optimization approach, i.e. a slow global solution, at the cost of added computational complexity. For estimating the depth map of the visual scene, a first depth map is obtained from the images whose centers are positioned regularly along the horizontal line passing through the center of the reference image, and a second depth map is obtained from the images positioned along the vertical direction. The first and the second depth maps are merged to obtain a final depth map, wherein the merging is based on their confidence maps in that, for each pixel, the depth value with the higher confidence value of the two candidates is chosen.
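The confidence-based merging described above can be illustrated with a minimal sketch. It is not taken from the cited article: the function name merge_depth_maps, the representation of depth and confidence maps as nested lists, and the rule that a tie goes to the horizontal candidate are all assumptions made for illustration.

```python
def merge_depth_maps(depth_h, conf_h, depth_v, conf_v):
    """Merge two depth maps pixel by pixel.

    depth_h / conf_h: depth and confidence from the horizontal views.
    depth_v / conf_v: depth and confidence from the vertical views.
    All four arguments are equally sized lists of rows (hypothetical layout).
    For each pixel the candidate with the higher confidence is kept;
    on a tie the horizontal candidate is kept (an assumption).
    """
    merged = []
    for dh_row, ch_row, dv_row, cv_row in zip(depth_h, conf_h, depth_v, conf_v):
        merged.append([
            dh if ch >= cv else dv
            for dh, ch, dv, cv in zip(dh_row, ch_row, dv_row, cv_row)
        ])
    return merged


# Toy 2x2 example with made-up values.
depth_h = [[1.0, 2.0], [3.0, 4.0]]
conf_h = [[0.9, 0.1], [0.5, 0.2]]
depth_v = [[1.5, 2.5], [3.5, 4.5]]
conf_v = [[0.3, 0.8], [0.5, 0.7]]

merged = merge_depth_maps(depth_h, conf_h, depth_v, conf_v)
print(merged)
```

In this toy example the horizontal candidate is kept in the top-left pixel, where its confidence is higher, and in the bottom-left pixel, where the confidences are equal, while the vertical candidate is kept in the two right-hand pixels.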
The article “Scene Reconstruction from High Spatio-Angular Resolution Light Fields”, SIGGRAPH, 2013 describes an alternative solution for visual scene reconstruction from 4D light fields which can deal better with uniform color areas while still preserving depth map discontinuities. However, the computational complexity of this solution is high, so that a real time implementation is not possible. Moreover, the input 4D light field must be sampled densely enough, which in the case of plenoptic cameras is generally not possible. As in the previously mentioned method, a first depth map is obtained from the images whose centers are positioned regularly along the horizontal line passing through the center of the reference image, and a second depth map is obtained from the images positioned along the vertical direction.
Thus, there is a need for an improved image processing apparatus and method, in particular an image processing apparatus and method allowing for an improved depth estimation.