High-quality, computationally tractable 3D from images is a critical enabler for many application markets. Two human eyes see a scene from different positions, which gives us a sense of the depth of the scene. The differences between the two views of the scene, called binocular disparity, allow our brain to calculate the depth of every point in the scene visible to both eyes. A similar result could be achieved by using two cameras that capture the scene simultaneously and then comparing the two resulting images in order to compute depth information. This could be accomplished by moving individual pixels of one image to match pixels in the other image. The degree of movement necessary, called disparity, depends on the distance from the cameras to the object that produces the particular pixel pair, and also on the distance between the two cameras. The goal is to find the best match of pixels from different cameras in order to calculate the most accurate depths.
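The relationship between disparity, object distance, and camera separation can be illustrated with a minimal sketch. It assumes an idealized pair of rectified pinhole cameras, which the text above does not specify; the function names and the parameters `focal_px` (focal length in pixels) and `baseline_m` (distance between the cameras in meters) are hypothetical and chosen only for illustration.

```python
def disparity_from_depth(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Disparity in pixels of a point at depth_m.

    Assumes two rectified pinhole cameras (an idealization, not the
    document's stated model): disparity = focal * baseline / depth.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Invert the same relation: depth = focal * baseline / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive (zero means infinitely far)")
    return focal_px * baseline_m / disparity_px


if __name__ == "__main__":
    # A nearer object or a wider baseline yields a larger disparity.
    for z in (1.0, 2.0, 4.0):
        print(z, disparity_from_depth(800.0, 0.5, z))
```

Under this model, halving the distance to the object or doubling the camera separation doubles the disparity, which matches the qualitative dependence described above.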
There are several implementations that use a large number of cameras organized in two-dimensional arrays. One example implementation is the Stanford Multi-Camera Array. These arrays capture light fields, defined as a set of two-dimensional (2D) images capturing light from different directions for the whole scene. Using a larger number of cameras increases the accuracy of the depth map obtained. Another example implementation of camera arrays is the Pelican Imaging system, which uses a set of low-resolution R, G, and B cameras positioned directly on top of the image sensor chip. Both of these systems use lower-resolution depth maps in order to obtain higher-resolution RGB images, sometimes called super-resolution images.
For traditional cameras, depth of field depends on the so-called F ratio of the lens, which is the ratio of the focal length of the camera lens to the width of the lens aperture. Depending on the F ratio, there is a particular range of distances from the camera on either side of the focal plane in which the image is sharp. A camera set produces three-dimensional (3D) images: 2D color images plus a depth for every pixel of the image, which is called a depth map. Using the depth map and a color image that is close to all-in-focus, it is possible to generate an all-in-focus image. It is also possible to produce images with different synthetic apertures (levels of blurring outside of the in-focus area), and also to control which areas of the image are in focus (synthetic depth of field). This could be accomplished at any selected depth after the image has been shot. This feature is called dynamic refocusing. The maximum synthetic aperture could be defined by the size of the camera set, the synthetic apertures of the individual cameras, and the accuracy of the generated depth map.
Generally, camera arrays use multiple cameras of the same resolution, and as a set the cameras contain information that allows generating an output image at a resolution higher than that of the original cameras in the camera array; such output images are typically called super-resolution images. Generation of super-resolution images by camera arrays has to overcome a number of challenges. The most important challenges are the handling of occlusion areas, holes, the accuracy and resolution of the depth map, and the total number of computations to be performed (computational complexity).
Occlusions are one of the fundamental complications in the generation of super-resolution images using camera arrays. Occlusions are areas that are seen by some of the cameras but are not visible from the view of the other cameras because they are in the “shadow” of other parts of the image (other objects in the image). Depth calculation requires at least two cameras seeing the same pixel. Special handling of occluded zones requires determining which cameras see a particular pixel and discarding information from the camera or cameras for which this pixel is occluded. It is possible that some of the pixels are seen by only one camera, and for such pixels depth cannot be determined.
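One simple way to picture the discarding step is sketched below. It is an illustrative heuristic only, not a method described in this text: it assumes that each non-reference camera yields a matching error against the reference pixel, and that an error above a chosen threshold indicates occlusion. The names `select_visible`, `pixel_cost`, `errors`, and `threshold` are hypothetical.

```python
def select_visible(errors, threshold):
    """Indices of non-reference cameras assumed to see the pixel.

    errors[i] is the matching error of camera i against the reference
    pixel; a large error is treated here as a sign of occlusion
    (an assumption made for this sketch).
    """
    return [i for i, e in enumerate(errors) if e <= threshold]


def pixel_cost(errors, threshold):
    """Mean matching error over cameras assumed to see the pixel.

    Returns None when every non-reference camera is discarded, i.e. the
    pixel is effectively seen by the reference camera alone and its
    depth cannot be determined.
    """
    visible = select_visible(errors, threshold)
    if not visible:
        return None
    return sum(errors[i] for i in visible) / len(visible)


if __name__ == "__main__":
    print(pixel_cost([2.0, 4.0, 90.0], threshold=10.0))  # third camera discarded
    print(pixel_cost([50.0, 90.0], threshold=10.0))      # no usable camera
```

A fixed threshold is the crudest possible visibility test; it is used here only to make the "determine which cameras see the pixel, then discard the rest" step concrete.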
Holes are parts of the image where it is not possible to determine the depth map. One example is flat areas of the image that have no discernible texture: there is no specific information within such an area that would allow matching of pixels from different cameras, and therefore depth cannot be determined. Another case is related to special occlusion cases in which there could be pixels that are visible only to the central camera. In both of these cases, generation of super-resolution images will fail for some areas of the image and will create holes, which could be filled later, with some level of success, by quite sophisticated heuristic interpolation methods.
Traditional camera array techniques include using one of the cameras as a reference camera and then, for each pixel of the reference camera, performing a parallax shift operation on the other cameras in order to determine the depth at this pixel. The parallax shift for any given pixel depends on the actual 3D position of this pixel and the distance between the cameras. This process usually involves performing the parallax shift for a number of depths. Conceptually, the parallax shift is performed for each of these depths for all participating cameras in the camera array, and then a so-called “cost function” for this depth is generated. The depth with the minimal cost function is then defined as the depth for this pixel. Different implementations use a number of additional techniques for the final determination of pixel depth. One of the objectives of these techniques is to find the absolute minimum of the cost function and to avoid using a local minimum of the cost function as the final depth for a given pixel.
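The search over candidate depths can be sketched in one dimension. The following is a minimal illustration under several assumptions that do not come from the text: cameras lie along a single axis, images are reduced to single scanlines, each candidate is expressed as a disparity per unit baseline rather than as a depth, a point at position x in the reference appears at x + d * b in a camera with signed baseline b, and the cost function is a mean absolute difference over a small window. All names are hypothetical.

```python
def best_disparity(ref, others, baselines, x, candidates, half_window=1):
    """Return (disparity, cost) with the lowest cost at reference pixel x.

    ref        : scanline (list of intensities) from the reference camera
    others     : scanlines from the other cameras
    baselines  : signed baseline of each other camera relative to the reference
    candidates : disparities per unit baseline to test (stand-ins for depths)
    """
    best, best_cost = None, float("inf")
    for d in candidates:
        total, count = 0.0, 0
        for line, b in zip(others, baselines):
            shift = round(d * b)  # parallax shift for this camera at this candidate
            for k in range(-half_window, half_window + 1):
                xr, xs = x + k, x + k + shift
                if 0 <= xr < len(ref) and 0 <= xs < len(line):
                    total += abs(ref[xr] - line[xs])
                    count += 1
        if count == 0:
            continue  # no overlap with any camera for this candidate
        cost = total / count  # mean absolute difference as the cost function
        if cost < best_cost:
            best, best_cost = d, cost
    return best, best_cost


if __name__ == "__main__":
    ref = [0, 0, 0, 0, 0, 10, 50, 90, 20, 0, 0, 0, 0, 0, 0, 0]
    right = [0, 0, 0] + ref[:-3]   # camera at baseline +1, true disparity 3
    left = ref[3:] + [0, 0, 0]     # camera at baseline -1, true disparity 3
    print(best_disparity(ref, [right, left], [1, -1], x=6, candidates=range(6)))
```

Taking the plain minimum over the candidates, as done here, is exactly the step that can settle on a local minimum in repetitive or weakly textured regions; the additional techniques mentioned above exist to guard against that.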
The initial depth set could be selected to minimize computations, and the final depth could be refined by repeating the depth search for a new set of depths close to the initial pixel depth. At the end of this process, the final depth for every pixel at the reference camera position (excluding holes) is determined and the depth map is formed. The resolution of this final depth map is typically the resolution of the reference camera.
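The coarse-then-refined search described above might look like the following sketch. It assumes that a per-pixel cost is available as a callable of depth; the function name `refine_depth` and the parameters `samples` and `rounds` are hypothetical, and the narrowing rule (one coarse step on either side of the current best) is one possible choice rather than a prescribed one.

```python
def refine_depth(cost, near, far, samples=9, rounds=3):
    """Search [near, far] for the depth with minimal cost, coarse to fine.

    Each round evaluates `samples` evenly spaced depths, keeps the best,
    and then restricts the next round to one step on either side of it.
    """
    best = near
    for _ in range(rounds):
        step = (far - near) / (samples - 1)
        candidates = [near + i * step for i in range(samples)]
        best = min(candidates, key=cost)
        # Narrow the interval around the current best for the next round.
        near, far = max(near, best - step), min(far, best + step)
    return best


if __name__ == "__main__":
    # Toy cost with a single minimum at depth 2.3 (illustrative only).
    print(refine_depth(lambda z: abs(z - 2.3), near=1.0, far=5.0))
```

With these defaults the sketch evaluates 27 depths instead of the much denser uniform sampling needed for the same final spacing. The saving relies on the coarse round landing near the true minimum; if the coarse sampling falls into a local minimum, the refinement will only sharpen the wrong answer.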
The importance of getting an accurate depth map for the generation of super-resolution images cannot be overstated. The depth map is used to superimpose all images from the camera array onto the super-resolution grid. Any error in the depth map will cause pixels from cameras other than the reference camera to be placed in the wrong position, causing image artifacts. Usually such artifacts are more visible for areas that are closer to the cameras, because the parallax shift (disparity) for the corresponding pixels is larger. This is especially true when a camera array consists of mono-color R, G, and B cameras, because placing a color pixel in the wrong place can be highly visible to the human eye.
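The stronger effect of depth errors on nearby areas can be made explicit under a simplifying assumption that the text does not state: two rectified pinhole cameras with focal length f (in pixels) and baseline B observing a point at depth Z. The symbols f, B, Z, d, and the error terms are introduced here for illustration only.

```latex
d = \frac{f B}{Z}
\qquad\Longrightarrow\qquad
\delta d \;\approx\; \left|\frac{\partial d}{\partial Z}\right| \delta Z
\;=\; \frac{f B}{Z^{2}}\,\delta Z
```

For a fixed depth error δZ, the resulting pixel misplacement δd grows as 1/Z², so under this model the same depth error displaces pixels of nearby objects far more than pixels of distant ones.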
However, determining the final depth map using existing techniques produces a depth map having the same resolution as the cameras in the camera array, which is typically lower than the super-resolution of the output image. Producing even such low-resolution depth maps may be computationally intensive and could be very expensive, both in terms of the total number of parallax computations for a large number of depths and because the large number of images from different cameras being used simultaneously puts a lot of pressure on efficient memory use. Further, the use of high-resolution camera arrays may significantly increase hardware costs as well. Furthermore, existing techniques may require using laser or time-of-flight (TOF) systems that may be expensive and too big, and that may result in inflexible industrial design constraints.