Optical imaging systems are designed to create a focused image of scene objects over a specified range of distances. The image is in sharpest focus in a two dimensional (2D) plane in space, called the focal or image plane. From geometrical optics, a perfect focal relationship between a scene object and the image plane exists only for pairs of object and image distances that obey the thin lens equation:
$$\frac{1}{f} = \frac{1}{s} + \frac{1}{s'} \qquad (1)$$
where f is the focal length of the lens, s is the distance from the object to the lens, and s′ is the distance from the lens to the image plane. This equation holds for a single thin lens, but it is well known that thick lenses, compound lenses and more complex optical systems can be modeled as a single thin lens with an effective focal length f. Alternatively, complex systems can be modeled using the construct of principal planes, with the object and image distances s, s′ measured from these planes and the effective focal length used in the above equation, hereafter referred to as the lens equation.
It is also known that once a system is focused on an object at distance s1, in general only objects at this distance are in sharp focus at the corresponding image plane located at distance s1′. An object at a different distance s2 produces its sharpest image at the corresponding image distance s2′, determined by the lens equation. If the system is focused at s1, an object at s2 produces a defocused, blurred image at the image plane located at s1′. The degree of blur depends on the difference between the two object distances, s1 and s2, the focal length f of the lens, and the aperture of the lens as measured by the f-number, denoted f/#. For example, FIG. 1 shows a single lens 10 of focal length f and clear aperture of diameter D. The on-axis point P1 of an object located at distance s1 is imaged at point P1′ at distance s1′ from the lens. The on-axis point P2 of an object located at distance s2 is imaged at point P2′ at distance s2′ from the lens. Tracing rays from these object points, axial rays 20 and 22 converge on image point P1′, while axial rays 24 and 26 converge on image point P2′ and then intercept the image plane of P1′, where they are separated by a distance d. In an optical system with circular symmetry, the distribution of rays emanating from P2 over all directions results in a blur circle of diameter d at the image plane of P1′.
As the on-axis point P1 moves farther from the lens, tending towards infinity, it is clear from the lens equation that s1′ approaches f. This leads to the usual definition of the f-number as f/#=f/D. At finite distances, the working f-number can be defined as (f/#)w=s1′/D. In either case, the f-number is an angular measure of the cone of light reaching the image plane, which in turn is related to the diameter of the blur circle d. In fact, it is known that
$$d = \frac{f}{(f/\#)\, s_2'} \left| s_2' - s_1' \right| \qquad (2)$$
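As an illustration of how the lens equation and the blur circle relation combine, the following minimal sketch computes the geometrical blur diameter for a given pair of object distances. The function names and the numerical values (a 50 mm lens at f/2, focused at 2 m) are assumptions chosen for illustration, and the sketch uses the infinity-focus definition f/#=f/D.

```python
def image_distance(f, s):
    """Solve the lens equation 1/f = 1/s + 1/s' for the image distance s'."""
    if s <= f:
        raise ValueError("object must lie beyond the focal length for a real image")
    return 1.0 / (1.0 / f - 1.0 / s)


def blur_circle_diameter(f, f_number, s1, s2):
    """Blur circle diameter d of Eq. (2) for an object at s2 when focused at s1.

    All lengths share one unit (millimetres in the example below).
    """
    s1_img = image_distance(f, s1)  # image plane the system is focused on
    s2_img = image_distance(f, s2)  # plane where the object at s2 is sharp
    return f / (f_number * s2_img) * abs(s2_img - s1_img)


# Hypothetical example: 50 mm lens at f/2, focused at 2 m, object at 1 m.
d = blur_circle_diameter(f=50.0, f_number=2.0, s1=2000.0, s2=1000.0)
print(f"blur circle diameter: {d:.3f} mm")
```

With these assumed values the blur circle is about 0.64 mm in diameter, and it shrinks to zero as s2 approaches s1.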
Given that the focal length f and f-number of a lens or optical system are accurately measured, and given that the diameter of the blur circle d is measured for various objects in a two dimensional image plane, in principle one can obtain depth information for objects in the scene by inverting the above blur circle equation, and applying the lens equation to relate the object and image distances. Unfortunately, such an approach is limited by the assumptions of geometrical optics. It predicts that the out-of-focus image of each object point is a uniform circular disk. In practice, diffraction effects, combined with lens aberrations, lead to a more complicated light distribution that is more accurately characterized by a point spread function (psf), a 2D function that specifies the intensity of the light in the image plane due to a point light source at a corresponding location in the object plane. Much attention has been devoted to the problems of measuring and reversing the effects of the psf on images captured from scenes containing objects spread over a variety of distances from the camera.
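The inversion described above can be sketched under the same geometrical-optics assumptions. Because Eq. (2) contains an absolute value, a measured blur diameter is generally consistent with two object distances, one nearer and one farther than the focused distance. The function name and example values below are hypothetical, and the sketch again assumes f/#=f/D.

```python
def depth_candidates_from_blur(f, f_number, s1, d):
    """Invert Eq. (2): return (near, far) object distances that give blur d.

    The system is focused at object distance s1. `far` is None when no real
    object beyond s1 can produce a blur circle this large.
    """
    D = f / f_number  # clear aperture diameter
    if not 0.0 <= d < D:
        raise ValueError("blur diameter must satisfy 0 <= d < D")
    s1_img = 1.0 / (1.0 / f - 1.0 / s1)

    # Case s2' > s1' (object nearer than s1): d = D * (1 - s1'/s2')
    near_img = s1_img / (1.0 - d / D)
    # Case s2' < s1' (object farther than s1): d = D * (s1'/s2' - 1)
    far_img = s1_img / (1.0 + d / D)

    # Apply the lens equation to convert image distances to object distances.
    near = 1.0 / (1.0 / f - 1.0 / near_img)
    far = 1.0 / (1.0 / f - 1.0 / far_img) if far_img > f else None
    return near, far


# Hypothetical example: 50 mm lens at f/2 focused at 2 m, measured blur 0.2137 mm.
near, far = depth_candidates_from_blur(50.0, 2.0, 2000.0, 0.2137)
print(near, far)
```

For these assumed values the two candidates are roughly 1.5 m and 3 m, which illustrates that the blur diameter alone does not say on which side of the focused distance the object lies.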
For example, Bove (V. M. Bove, Pictorial Applications for Range Sensing Cameras, SPIE vol. 901, pp. 10-17, 1988) models the defocusing process as a convolution of the image intensities with a depth-dependent psf:
$$i_{def}(x,y) = i(x,y) * h(x,y,z) \qquad (3)$$
where idef(x,y) is the defocused image, i(x,y) is the in-focus image, h(x,y,z) is the depth-dependent psf and * denotes convolution. In the spatial frequency domain, this is written:
$$I_{def}(v_x,v_y) = I(v_x,v_y)\, H(v_x,v_y,z) \qquad (4)$$
where Idef(vx,vy) is the Fourier transform of the defocused image, I(vx,vy) is the Fourier transform of the in-focus image, and H(vx,vy,z) is the Fourier transform of the depth-dependent psf. Bove assumes that the psf is circularly symmetric, i.e. h(x,y,z)=h(r,z) and H(vx,vy,z)=H(ρ,z), where r and ρ are radii in the spatial and spatial frequency domains, respectively. In Bove's method of extracting depth information from defocus, two images are captured, one with a small camera aperture (long depth of field, DOF) and one with a large camera aperture (short DOF). The Discrete Fourier Transform (DFT) is applied to corresponding blocks of pixels in the two images, followed by a radial average of the resulting power spectra within each block. That is, the average value of the spectrum is computed at a series of radial distances from the origin in frequency space, over the full 360 degree angle. The radially averaged power spectra of the long and short DOF images are then used to compute an estimate for H(ρ,z) at corresponding blocks. This assumes that each block represents a scene element at a different distance z from the camera. The system is calibrated using a scene containing objects at known distances [z1, z2, . . . zn] to characterize H(ρ,z), which is then taken as an estimate of the rotationally-symmetric frequency spectrum of the spatially varying psf. This spectrum is then applied in a regression equation to solve for the local blur circle diameter.
A regression of the blur circle diameter vs. distance z then leads to a depth or range map for the image, with a resolution corresponding to the size of the blocks chosen for the DFT. Although this method applies knowledge of the measured psf, in the end it relies on a single parameter, the blur circle diameter, to characterize the depth of objects in the scene.
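The block-wise radial averaging step can be illustrated with a short sketch. It is only a loose illustration of the idea and not Bove's published procedure: the function names, the ring binning, and the use of a simple power-spectrum ratio as the estimate of |H(ρ,z)| are all assumptions, and NumPy is assumed to be available.

```python
import numpy as np


def radial_power_spectrum(block, n_bins=8):
    """Radially averaged power spectrum of a square block of pixels."""
    block = np.asarray(block, dtype=float)
    power = np.abs(np.fft.fftshift(np.fft.fft2(block))) ** 2
    center = block.shape[0] // 2              # zero frequency after fftshift
    fy, fx = np.indices(power.shape)
    rho = np.hypot(fx - center, fy - center)  # radial frequency of each sample
    edges = np.linspace(0.0, center, n_bins + 1)
    ring = np.digitize(rho.ravel(), edges) - 1
    flat = power.ravel()
    profile = np.zeros(n_bins)
    for b in range(n_bins):
        in_ring = ring == b
        if in_ring.any():                     # average over the full 360 degrees
            profile[b] = flat[in_ring].mean()
    return profile


def estimate_mtf(long_dof_block, short_dof_block, n_bins=8, eps=1e-12):
    """Assumed estimator: |H(rho)| ~ sqrt(P_short / P_long), from Eq. (4)."""
    p_long = radial_power_spectrum(long_dof_block, n_bins)
    p_short = radial_power_spectrum(short_dof_block, n_bins)
    return np.sqrt(p_short / np.maximum(p_long, eps))


# Hypothetical 16x16 blocks: random texture and a 2x2 box-blurred copy.
rng = np.random.default_rng(0)
sharp = rng.random((16, 16))
blurred = 0.25 * (sharp + np.roll(sharp, 1, 0) + np.roll(sharp, 1, 1)
                  + np.roll(sharp, (1, 1), (0, 1)))
print(estimate_mtf(sharp, blurred))
```

In a calibration of the kind described above, a profile like this would be recorded for blocks at each known distance and then related to the blur circle diameter by regression.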
Other methods which infer depth from defocus seek to control the behavior of the psf as a function of defocus, i.e. the behavior of h(x,y,z) as a function of z, in a predictable way. A controlled, depth-dependent blurring function provides information that is used to deblur the image and to infer the depth of scene objects from the results of the deblurring operations. There are two main parts to this problem: the control of the psf behavior and the deblurring of the image, given the psf as a function of defocus.
The psf behavior is controlled by placing a mask into the optical system, typically at the plane of the aperture stop. For example, FIG. 2 shows a schematic of an optical system from the prior art with two lenses 30 and 34, and a binary transmittance mask 32, including an array of holes, placed in between. In most cases, the mask is the element in the system that limits the bundle of light rays that propagate from an axial object point and is therefore, by definition, the aperture stop. If the lenses are reasonably free from aberrations, the mask, combined with diffraction effects, will largely determine the psf and the optical transfer function (OTF) (see J. W. Goodman, Introduction to Fourier Optics, McGraw-Hill, San Francisco, 1968, pp. 113-117). This observation is the working principle behind the encoded blur or coded aperture methods. In one example of the prior art, Veeraraghavan et al. (Dappled Photography: Mask Enhanced Cameras for Heterodyned Light Fields and Coded Aperture Refocusing, ACM Transactions on Graphics 26 (3), July 2007, paper 69) demonstrate the use of a defocus psf which is approximately a scaled image of the aperture mask, a valid assumption for large amounts of defocus, to obtain depth information by deblurring. This requires solving the deconvolution problem, i.e. inverting Eq. (3) to obtain h(x,y,z) for the relevant values of z. In principle, it is easier to invert the spatial frequency domain counterpart of Eq. (3), i.e. Eq. (4), which is possible at frequencies for which H(vx,vy,z) is nonzero.
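The frequency-domain inversion of Eq. (4) can be sketched as a thresholded inverse filter for a psf assumed known at one trial depth. This is a deliberately simple stand-in, not the deconvolution used by Veeraraghavan et al.: it assumes circular (periodic) convolution, no noise, and NumPy, and the function name and threshold are hypothetical.

```python
import numpy as np


def inverse_filter(blurred, psf, threshold=1e-3):
    """Estimate the in-focus image by dividing spectra wherever |H| > threshold.

    Frequencies at which H is (nearly) zero cannot be recovered and are
    left at zero in the estimate.
    """
    blurred = np.asarray(blurred, dtype=float)
    H = np.fft.fft2(psf, s=blurred.shape)  # psf zero-padded to the image size
    G = np.fft.fft2(blurred)               # spectrum of the defocused image
    usable = np.abs(H) > threshold
    F = np.zeros_like(G)
    F[usable] = G[usable] / H[usable]      # Eq. (4) solved for I(vx, vy)
    return np.fft.ifft2(F).real


# Hypothetical example: blur a random 8x8 image with a small psf, then invert.
rng = np.random.default_rng(0)
image = rng.random((8, 8))
psf = np.array([[0.6, 0.2],
                [0.2, 0.0]])               # its spectrum has no zeros
blurred = np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(psf, s=image.shape)).real
restored = inverse_filter(blurred, psf)
print(np.max(np.abs(restored - image)))
```

With real images, noise is amplified wherever |H| is small, which is one reason a unique and stable solution is hard to obtain in practice.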
In practice, finding a unique solution for deconvolution is well known as a challenging problem. Veeraraghavan et al. solve the problem by first assuming the scene is composed of discrete depth layers, and then forming an estimate of the number of layers in the scene. Then, the scale of the psf is estimated for each layer separately, using the model
$$h(x,y,z) = m\left(k(z)x/w,\; k(z)y/w\right) \qquad (5)$$
where m(x,y) is the mask transmittance function, k(z) is the number of pixels in the psf at depth z, and w is the number of cells in the 2D mask. The authors apply a model for the distribution of image gradients, along with Eq. (5) for the psf, to deconvolve the entire image once for each assumed depth layer in the scene. The results of the deconvolutions are acceptable only in those regions where the scale of the assumed psf matches the actual blur, thereby indicating the corresponding depth of the region. These results are limited in scope to systems behaving according to the mask scaling model of Eq. (5), and masks composed of uniform, square cells.
Levin et al. (Image and Depth from a Conventional Camera with a Coded Aperture, ACM Transactions on Graphics 26 (3), July 2007, paper 70) follow a similar approach to Veeraraghavan; however, Levin et al. rely on direct photography of a test pattern at a series of defocused image planes to infer the psf as a function of defocus. Also, Levin et al. investigated a number of different mask designs in an attempt to arrive at an optimum coded aperture. They assume a Gaussian distribution of sparse image gradients, along with a Gaussian noise model, in their deconvolution algorithm. Therefore, the optimized coded aperture solution is dependent on assumptions made in the deconvolution algorithm.
The solutions proposed by both Veeraraghavan and Levin have the feature that they proceed by performing a sequence of deconvolutions with a single kernel over the entire image area, followed by subsequent image processing to combine the results into a depth map.
Hiura and Matsuyama in Depth Measurement by the Multi-Focus Camera, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1998, pp. 953-959, disclose digital camera-based methods for depth measurement using identification of edge points and coded aperture techniques. The coded aperture techniques employ Fourier or deconvolution analysis. In all cases, the methods employed require multiple digital image captures by the camera sensor at different focal planes.