In many applications of image capture, it can be advantageous to determine the distance from the image capture device to objects within the field of view of the image capture device. A collection of such distances to objects in an imaged scene is sometimes referred to as a depth map. A depth map of an imaged scene may be represented as an image, which may be of a different pixel resolution to the image of the scene itself, in which the distance to objects corresponding to each pixel of the depth map is represented by a greyscale or colour value.
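As an illustration of representing a depth map as an image, the following minimal sketch encodes distances as 8-bit greyscale values. The linear near-to-far mapping, the function names, and the sample distances are assumptions made for illustration only; other encodings (for example inverse depth or a colour scale) are equally possible.

```python
def depth_to_grey(depth_m, near_m, far_m):
    """Map a distance in metres to an 8-bit grey level (near = 255, far = 0)."""
    # Clamp to the representable range so out-of-range depths saturate.
    clamped = min(max(depth_m, near_m), far_m)
    # Linear encoding (an assumption); inverse depth is another common choice.
    fraction = (far_m - clamped) / (far_m - near_m)
    return round(255 * fraction)


def encode_depth_map(depth_rows, near_m, far_m):
    """Encode a 2D list of distances as a 2D list of grey levels."""
    return [[depth_to_grey(d, near_m, far_m) for d in row] for row in depth_rows]


# A hypothetical 2x3 depth map in metres, coarser than the photograph it describes.
depth_map = [[1.0, 1.0, 5.0],
             [1.0, 3.0, 5.0]]
grey = encode_depth_map(depth_map, near_m=1.0, far_m=5.0)
```

In this sketch the nearest objects appear white and the farthest appear black, and the depth map has its own pixel resolution independent of the photograph.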
A depth map can be useful in the field of consumer photography, as it enables several desirable post-capture image processing capabilities for photographs. For example, a depth map can be used to segment foreground and background objects to allow manual post-processing, or the automated application of creative photographic effects. A depth map can also be used to apply depth-related photographic effects such as simulating the aesthetically pleasing graduated blur of a high-quality lens using a smaller and less expensive lens.
Several features are desirable in any method of acquiring a depth map of a photographic scene. Depth accuracy is important; otherwise the resulting depth map may suggest that objects are at distances significantly different to their true distances. Depth resolution is important to allow the separation of objects that may be spatially close to one another in the scene and also to allow for accurate post-processing operations such as depth-dependent blurring. Spatial resolution of the depth map is also important in many applications. In particular, depth maps approaching the resolution of the photographic images themselves are useful for pixel-wise segmentation and for avoiding visually obvious object boundary errors in many post-processing operations. Depth mapping methods should ideally be independent of the physical properties of the objects in the scene, such as reflectance, colour, texture, and orientation. This property is often referred to as scene independence. It is also desirable that depth mapping methods be tolerant of motion of objects in the scene and of motion of the image capture device. Finally, it is desirable that depth mapping methods can be realised in practical devices such as consumer cameras with minimal additional cost, bulk, weight, image capture and processing time, and power consumption.
Several methods are known for determining a depth map from images of a scene. These can be classified into active and passive methods. Active depth mapping methods involve projecting beams or patterns of light or other radiation on to a scene. Distances can be measured either by timing the return of reflected rays, or by analysing the geometrical distortions of the patterns as they reflect off three-dimensional structures in the scene. Active methods require projection optics, which creates significant cost, weight, and power problems for applications such as consumer photography. In addition, active methods have limited range. For these reasons, passive depth mapping methods are more suitable than active methods for photography applications.
A known class of passive depth mapping methods involves capturing images of the scene from different viewpoints. The images of the scene can then be analysed to determine the apparent shifts in position of objects in the images of the scene caused by the stereoscopic effect. In general, stereoscopic methods suffer from the disadvantage of requiring multiple viewpoints. This necessitates either capturing images sequentially and moving the camera between shots, or capturing images using either multiple cameras or a camera with multiple lenses. In the case of capturing images sequentially, the time taken to move the camera may be problematic, especially for moving subjects, and precise alignment or calibration of the camera motion is needed. In the case of simultaneous capture, the requirement of multiple cameras or lenses increases the expense and difficulty of construction of the capture device.
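The relationship between the apparent shift (disparity) and distance can be sketched for the simplest stereoscopic arrangement. The sketch below assumes two identical, parallel, rectified pinhole cameras, which is an idealisation not stated in the text above; the function name and the sample values are hypothetical.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Distance to a point seen by two parallel, rectified pinhole cameras.

    focal_length_px: focal length expressed in pixels.
    baseline_m:      separation between the two viewpoints in metres.
    disparity_px:    horizontal shift of the point between the two images.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point at finite depth")
    # Similar triangles: depth is inversely proportional to disparity.
    return focal_length_px * baseline_m / disparity_px


# Hypothetical values: halving the disparity doubles the estimated depth.
near_depth = depth_from_disparity(1000.0, 0.1, 20.0)
far_depth = depth_from_disparity(1000.0, 0.1, 10.0)
```

Because depth varies inversely with disparity, a fixed error in the measured shift produces a larger depth error for distant objects, which is one reason the baseline and its calibration matter.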
Another class of passive depth mapping methods uses multiple shots taken by a single camera from a single viewpoint. These methods can be further split into two classes, named depth from focus (DFF), and depth from defocus (DFD). DFF methods use multiple shots taken of the scene at a large range of different focus positions. Analysis of image patches from each shot can then determine which shot corresponds to the best focus position for the object shown in a given image patch, which can in turn be associated with a calibrated depth. The main disadvantage of DFF methods is the requirement of taking a large number of images, resulting in long capture times, significant alignment problems for moving scenes, and long processing times.
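The DFF selection step described above can be sketched in one dimension. This is a simplified illustration rather than any particular published method: the box blur standing in for defocus, the second-difference focus measure, and the calibrated depth table are all assumptions.

```python
def box_blur(signal, radius):
    """Blur a 1D signal with a box filter, clamping indices at the borders."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def focus_measure(patch):
    """Sum of squared second differences: large for sharp detail, small for blur."""
    return sum((patch[i - 1] - 2 * patch[i] + patch[i + 1]) ** 2
               for i in range(1, len(patch) - 1))


def best_focus_index(patches):
    """Index of the shot whose patch has the highest focus measure."""
    scores = [focus_measure(p) for p in patches]
    return scores.index(max(scores))


# Hypothetical focus stack: one sharp-edged patch imaged with varying defocus.
scene = [0.0] * 8 + [1.0] * 8
blur_radii = [3, 2, 0, 1, 4]                      # shot 2 is in focus
stack = [box_blur(scene, r) for r in blur_radii]
calibrated_depth_m = [0.5, 1.0, 2.0, 4.0, 8.0]    # assumed depth per focus position
estimated_depth = calibrated_depth_m[best_focus_index(stack)]
```

The sketch also makes the stated disadvantage visible: the depth estimate can only take one of the calibrated values, so fine depth resolution requires many shots.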
DFD techniques attempt to measure the depths to objects in a scene by capturing a small number of images using different camera or capture parameters such as focus or aperture, and then comparing the images to analyse the difference in the amount of blurring of scene objects. Existing techniques then attempt to relate some measure of this blur difference to the depth of the imaged object by various theoretical calculations or empirical calibrations. DFD methods can estimate depths from as few as two images.
In addition to the desirable features for all depth mapping methods already mentioned (namely depth accuracy, depth resolution, spatial resolution, scene independence, motion tolerance, and low cost, weight, bulk, processing time, and power consumption), DFD methods in particular have further desirable features. DFD methods rely on quantification of blur difference to establish depth. Therefore it is desirable for DFD methods to operate well when the amount of blur difference achievable is limited by practical considerations of camera design. In particular, compact cameras typically have small lenses and sensors in order to keep costs low and produce a conveniently sized product. These constraints on the imaging system result in relatively small differences in blur (compared to larger cameras) because of the large depth of field of small optical systems. For example, typical blur differences achievable between two shots taken with a compact camera are of the order of a pixel or less. Another desirable feature of DFD methods is that they are based on a realistic model of the camera optical system. This allows a clear theoretical connection to be made between the measure of blur difference and the parameters of the image capture optical system. This further allows a thorough understanding of the connection between the blur difference measure and object depth so that appropriate consideration may be given to different imaging scenarios or difficult imaging conditions.
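The dependence of blur on lens size can be illustrated with the geometric thin-lens model, under which a point at distance d imaged with focus distance d_f produces a blur circle of diameter A·f·|d − d_f| / (d·(d_f − f)) for aperture diameter A and focal length f. The sketch below is a first-order approximation that ignores diffraction and aberrations; the lens parameters are hypothetical and are not taken from any particular camera.

```python
def blur_circle_diameter(aperture_diam, focal_length, focus_dist, object_dist):
    """Geometric blur circle diameter on the sensor under the thin-lens model.

    All lengths share one unit (millimetres here). Diffraction and lens
    aberrations are ignored, so this is only a first-order approximation.
    """
    return (aperture_diam * focal_length * abs(object_dist - focus_dist)
            / (object_dist * (focus_dist - focal_length)))


def blur_difference(aperture_diam, focal_length, focus_a, focus_b, object_dist):
    """Change in blur circle diameter between two shots at different focus."""
    return abs(
        blur_circle_diameter(aperture_diam, focal_length, focus_a, object_dist)
        - blur_circle_diameter(aperture_diam, focal_length, focus_b, object_dist))


# Hypothetical lenses (mm): a small compact-style lens and a larger lens,
# both focused at 1 m and then 2 m, viewing an object at 1.5 m.
compact_blur = blur_difference(2.0, 6.0, 1000.0, 2000.0, 1500.0)
large_blur = blur_difference(18.0, 50.0, 1000.0, 2000.0, 1500.0)
```

Under these assumed parameters the smaller lens yields a much smaller change in blur between the two shots, consistent with the large depth of field of small optical systems.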
An example DFD method is given by Pentland in a paper titled “A New Sense for Depth of Field”, published in July 1987 in IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, No. 4, pp. 523-531, hereafter “Pentland”. This method attempts to quantify the difference in amount of blur between two images taken with different apertures by estimating a blur radius for each image based on the assumption of a symmetrical Gaussian point spread function (PSF). This assumption implies that the lens optical transfer function (OTF) is a real Gaussian function, which is unrealistic for typical camera lenses, and consequently causes errors in the depth estimate. In addition, the Pentland method of calculating the blur radius is very sensitive to variations in scene texture and imaging noise. This sensitivity makes the method unsuitable for use with cameras taking photos of natural scenes.
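The consequence of the Gaussian assumption can be sketched as follows. A circular Gaussian PSF with standard deviation σ has the real Gaussian OTF exp(−2π²σ²f²), so the ratio of the spectra of two images of the same patch depends only on σ₁² − σ₂². The sketch below illustrates this idealised relationship only; it is not Pentland's actual procedure, and the function names and sample values are hypothetical.

```python
import math


def gaussian_otf(sigma, freq):
    """OTF of a unit-volume circular Gaussian PSF with standard deviation sigma.

    sigma is in pixels; freq is radial spatial frequency in cycles per pixel.
    """
    return math.exp(-2.0 * math.pi ** 2 * sigma ** 2 * freq ** 2)


def sigma_sq_difference(spectral_ratio, freq):
    """Recover sigma1^2 - sigma2^2 from the ratio |F1| / |F2| at one frequency."""
    return -math.log(spectral_ratio) / (2.0 * math.pi ** 2 * freq ** 2)


# Hypothetical blur radii for two shots of the same patch.
sigma1, sigma2, freq = 1.5, 1.0, 0.1
# The unknown scene spectrum is common to both shots and cancels in the ratio.
ratio = gaussian_otf(sigma1, freq) / gaussian_otf(sigma2, freq)
recovered = sigma_sq_difference(ratio, freq)
```

The inversion is exact only when the true OTF is Gaussian. For a real lens, whose OTF is not Gaussian, the same calculation returns a biased blur estimate, which is the source of the depth errors noted above.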
Another example DFD method is given in U.S. Pat. No. 5,231,443 (Subbarao), granted in 1993. This method attempts to quantify the difference in amount of blur between two images taken with different camera parameters by summing rows or columns within an image region, performing a one-dimensional (1D) Fourier transform and then examining a small subset of the Fourier components. By the projection-slice theorem, this method is equivalent to examining a 1D slice through the two-dimensional (2D) Fourier transform of the image region. In photographs of a natural scene, there is usually a wide variety of textures. Two-dimensional Fourier transforms of these textures will have a variety of dominant orientations. Typical textures will have low energy along the spatial frequency axes, which means that the method of Subbarao will be sensitive to imaging noise and produce large errors in the depth estimate. This variation of errors with orientation of scene texture is highly undesirable for a depth mapping method.
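The projection-slice equivalence can be checked numerically on a small patch. The sketch below uses a naive discrete Fourier transform and a hypothetical 4x4 patch; it illustrates the theorem itself and is not an implementation of the Subbarao method.

```python
import cmath


def dft_1d(values):
    """Naive 1D discrete Fourier transform."""
    n = len(values)
    return [sum(values[x] * cmath.exp(-2j * cmath.pi * u * x / n)
                for x in range(n))
            for u in range(n)]


def dft_2d(image):
    """Naive 2D DFT: transform each row, then each column."""
    height, width = len(image), len(image[0])
    rows = [dft_1d(row) for row in image]
    cols = [dft_1d([rows[y][u] for y in range(height)]) for u in range(width)]
    # cols[u][v] holds F(u, v); re-index as result[v][u].
    return [[cols[u][v] for u in range(width)] for v in range(height)]


# Hypothetical 4x4 image patch.
patch = [[1, 2, 0, 3],
         [4, 0, 1, 2],
         [2, 2, 5, 1],
         [0, 3, 1, 1]]

# Sum down each column to project the patch onto the horizontal axis.
projection = [sum(patch[y][x] for y in range(4)) for x in range(4)]
slice_from_projection = dft_1d(projection)
slice_from_2d = dft_2d(patch)[0]   # the zero vertical-frequency row
max_diff = max(abs(a - b) for a, b in zip(slice_from_projection, slice_from_2d))
```

The two slices agree to within rounding error, which shows that the summation discards every Fourier component off the chosen frequency axis, including any dominant texture energy at other orientations.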
An example DFD method using a different theoretical principle is given by McCloskey et al. in a paper titled “The Reverse Projection Correlation Principle for Depth from Defocus”, published by the IEEE Computer Society in June 2006 in the Proceedings of the Third International Symposium on 3D Data Processing, Visualization, and Transmission, pp. 607-614. This method attempts to quantify the blur difference between two images by independently estimating the amount of blur in each image using a measure based on pixel auto-correlations, and then comparing this measure between the images. In this method there is no clear theoretical connection between the correlation measures and the physical optics principles that produce the blurring. It is therefore difficult to establish the accuracy of the method under a wide range of imaging conditions. Sample depth results from this method are noisy.
Another example DFD method is given by Aydin & Akgul in a paper titled “An occlusion insensitive adaptive focus measurement method”, published in June 2010 in Optics Express, Vol. 18, pp. 14212-14224. This method attempts to quantify the blur difference between two images by calculating a cross-correlation between corresponding patches of the images. This produces a measure of similarity between the image patches, which is then related to object depth. A problem here is that an object with low contrast can appear more similar at high blur differences than an object with high contrast at a lower blur level, resulting in spurious depth assignments.
An example of a spatial-domain DFD method using Gabor filters is given by Xiong and Shafer in a paper titled “Variable window Gabor filters and their use in focus and correspondence”, published by the IEEE Computer Society in June 1994 in the Proceedings of the International Conference on Computer Vision and Pattern Recognition, pp. 668-671. Gabor filters are known to produce a response of the input image at and around a corresponding tuning frequency. Xiong and Shafer used a set of 120 Gabor filters over 12 orientations and 10 radial frequencies that covered most of the image spectrum. The difference in the amount of blur between two input images was then estimated from the difference in the logarithm of the amplitudes of the corresponding Gabor filter responses. Because the large number of Gabor filters used by Xiong and Shafer was implemented using non-separable 2D filters, the Xiong and Shafer method is computationally expensive.
Another example of a spatial-domain DFD method using Gabor filters is given by Gokstorp in a paper titled “Computing depth from out-of-focus blur using a local frequency representation”, published by the IEEE in October 1994 in the Proceedings of the 12th International Conference on Pattern Recognition, pp. 153-158. Gokstorp used a set of Gabor filters along 0- and 90-degree orientations. Although these axis-aligned Gabor filters can be implemented very efficiently due to their separability, they respond mainly to vertical and horizontal structures. This limits the usefulness of Gokstorp's method in natural images where the scene content can appear in any orientation. Both Xiong and Shafer and Gokstorp used local phase stability to weight the confidence of the Gabor responses. Local phase is unstable around texture boundaries. Unfortunately, these boundaries often coincide with edges where the blur difference across the two input images is most discernible. By down-weighting the filter response around edges, both methods discount the majority of useful information for DFD.
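The separability of an axis-aligned Gabor filter can be illustrated with a short sketch. It assumes a complex Gabor kernel with an isotropic Gaussian envelope tuned along the horizontal axis; the kernel size and parameters are hypothetical and are not those used by Gokstorp.

```python
import cmath
import math


def gabor_1d(radius, sigma, freq):
    """1D complex Gabor kernel: Gaussian envelope times a complex sinusoid."""
    return [math.exp(-k * k / (2.0 * sigma ** 2))
            * cmath.exp(2j * math.pi * freq * k)
            for k in range(-radius, radius + 1)]


def gabor_2d_horizontal(radius, sigma, freq):
    """2D complex Gabor kernel tuned to frequency freq along the x axis."""
    return [[math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
             * cmath.exp(2j * math.pi * freq * x)
             for x in range(-radius, radius + 1)]
            for y in range(-radius, radius + 1)]


radius, sigma, freq = 3, 1.5, 0.2           # hypothetical filter parameters
row_kernel = gabor_1d(radius, sigma, freq)  # modulated, applied along x
col_kernel = gabor_1d(radius, sigma, 0.0)   # plain Gaussian, applied along y
# The outer product of the two 1D kernels reproduces the full 2D kernel.
outer = [[cy * cx for cx in row_kernel] for cy in col_kernel]
full = gabor_2d_horizontal(radius, sigma, freq)
size = 2 * radius + 1
max_error = max(abs(outer[i][j] - full[i][j])
                for i in range(size) for j in range(size))
```

Because the 2D kernel factors into two 1D kernels, filtering costs roughly 2 × 7 multiplications per pixel for this kernel size instead of 7 × 7, which is the efficiency advantage referred to above.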
These examples are illustrative of the shortcomings of existing DFD approaches. A disadvantage of DFD methods in general is the fact that depth estimates are prone to error because of the relatively small amount of data used, the effects of scene texture variations and imaging noise, any misalignment between objects in the images caused by camera or subject motion, and the fact that the relationship between object distance and blur is complicated. For many DFD algorithms there is a poor link between the quantitative measure extracted from analysing the images and the actual depth in the scene, because of camera calibration methods which use inaccurate models of camera lenses, weak or absent theoretical connections between the depth estimate and physical optics theory, and high depth estimation sensitivities to one or more of imaging noise, image misalignment, exposure difference, and the variation of textures of objects in the scene.
DFD methods are particularly problematic when applied to images taken with compact cameras. The small lens and sensor size restricts the amount of blur difference that can be achieved between two shots of a scene, and a small sensor is more prone to imaging noise than a larger sensor. Together, these factors make it difficult to quantify the blur difference accurately.