Optical imaging systems are designed to create a focused image of scene objects over a specified range of distances. The image is in sharpest focus in a two dimensional (2D) plane in the image space, called the focal or image plane. From geometrical optics, a perfect focal relationship between a scene object and the image plane exists only for combinations of object and image distances that obey the thin lens equation:
$$\frac{1}{f} = \frac{1}{s} + \frac{1}{s'} \qquad (1)$$
where f is the focal length of the lens, s is the distance from the object to the lens, and s′ is the distance from the lens to the image plane. This equation holds for a single thin lens, but it is well known that thick lenses, compound lenses and more complex optical systems are modeled as a single thin lens with an effective focal length f. Alternatively, complex systems are modeled using the construct of principal planes, with the object and image distances s, s′ measured from these planes, using the effective focal length in the above equation, hereafter referred to as the lens equation.
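As a minimal sketch of the lens equation, the following Python functions solve Eq. (1) for either conjugate distance. The function names and the millimeter example values are illustrative assumptions and do not come from the original description.

```python
import math

def image_distance(f, s):
    """Solve 1/f = 1/s + 1/s' for the image distance s' (same units as f)."""
    if s == f:
        return math.inf  # object at the front focal point images at infinity
    return 1.0 / (1.0 / f - 1.0 / s)

def object_distance(f, s_prime):
    """Solve the same equation for the object distance s."""
    if s_prime == f:
        return math.inf  # image plane at the focal plane: object at infinity
    return 1.0 / (1.0 / f - 1.0 / s_prime)

# Hypothetical example: a 50 mm lens focused on an object 1000 mm away.
s_prime = image_distance(50.0, 1000.0)
```

Passing s = math.inf returns f, since the 1/s term vanishes.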
It is also known that once a system is focused on an object at distance s1, in general only objects at this distance are in sharp focus at the corresponding image plane located at distance s1′. An object at a different distance s2 produces its sharpest image at the corresponding image distance s2′, determined by the lens equation. If the system is focused at s1, an object at s2 produces a defocused, blurred image at the image plane located at s1′. The degree of blur depends on the difference between the two object distances, s1 and s2, the focal length f of the lens, and the aperture of the lens as measured by the f-number, denoted f/#. For example, FIG. 1 shows a single lens 10 with clear aperture of diameter D. The on-axis point P1 of an object located at distance s1 is imaged at point P1′ at distance s1′ from the lens. The on-axis point P2 of an object located at distance s2 is imaged at point P2′ at distance s2′ from the lens. Tracing rays from these object points, axial rays 20 and 22 converge on image point P1′, while axial rays 24 and 26 converge on image point P2′, then intercept the image plane of P1′ where they are separated by a distance d. In an optical system with circular symmetry, the distribution of rays emanating from P2 over all directions results in a circle of diameter d at the image plane of P1′, which is called the blur circle or circle of confusion.
As the on-axis point P1 moves farther from the lens, tending towards infinity, it is clear from the lens equation that s1′=f. This leads to the usual definition of the f-number as f/#=f/D. At finite distances, the working f-number is defined as (f/#)w=s1′/D. In either case, it is clear that the f-number is an angular measure of the cone of light reaching the image plane, which in turn is related to the diameter of the blur circle d. In fact, it can be shown that
$$d = \frac{f}{(f/\#)\, s_2'} \left| s_2' - s_1' \right| . \qquad (2)$$
By accurate measurement of the focal length and f-number of a lens, and of the diameter d of the blur circle for various objects in a two dimensional image plane, it is possible in principle to obtain depth information for objects in the scene by inverting Eq. (2) and applying the lens equation to relate the object and image distances. This requires careful calibration of the optical system at one or more known object distances, at which point the remaining task is the accurate determination of the blur circle diameter d.
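The inversion described above can be sketched in a few lines of Python. The sketch assumes an ideal thin lens, object distances greater than f, and purely geometric blur; all function names are hypothetical. Because Eq. (2) contains an absolute value, a measured diameter d is consistent with two object distances, one nearer and one farther than the focused distance s1, so the sketch returns both candidates.

```python
def image_distance(f, s):
    """Image distance s' from the lens equation 1/f = 1/s + 1/s' (assumes s > f)."""
    return 1.0 / (1.0 / f - 1.0 / s)

def blur_diameter(f, f_number, s1, s2):
    """Eq. (2): blur circle diameter for an object at s2 when focused at s1."""
    s1p = image_distance(f, s1)
    s2p = image_distance(f, s2)
    return f / (f_number * s2p) * abs(s2p - s1p)

def depth_candidates(f, f_number, s1, d):
    """Invert Eq. (2); return (near, far) object distances, None if unphysical."""
    s1p = image_distance(f, s1)
    k = d * f_number / f  # blur diameter divided by aperture diameter D
    near = far = None
    if k < 1.0:
        # Object closer than s1: s2' > s1', so 1/s2' = (1 - k)/s1'.
        near = 1.0 / (1.0 / f - (1.0 - k) / s1p)
    # Object beyond s1: s2' < s1', so 1/s2' = (1 + k)/s1'.
    far_inv = 1.0 / f - (1.0 + k) / s1p
    if far_inv > 0.0:
        far = 1.0 / far_inv
    return near, far

# Hypothetical example: 50 mm f/2 lens focused at 2000 mm, object at 1000 mm.
d = blur_diameter(50.0, 2.0, 2000.0, 1000.0)
near, far = depth_candidates(50.0, 2.0, 2000.0, d)
```

Choosing between the two candidates requires additional information, such as a second capture at a different focus setting.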
The above discussion establishes the principles behind passive optical ranging methods based on focus, that is, methods that rely on existing illumination (hence passive), analyze the degree of focus of scene objects, and relate this to their distance from the camera. Such methods are divided into two broad categories: depth from defocus methods assume that the camera is focused once and that a single image is captured and analyzed for depth, whereas depth from focus methods assume that multiple images are captured at different focus positions and that the parameters of the different camera settings are used to infer the depth of scene objects.
The method presented above provides insight into the problem of depth recovery, but unfortunately is oversimplified and not robust in practice. Based on geometrical optics, it predicts that the out-of-focus image of each object point is a uniform circular disk or blur circle. In practice, diffraction effects and lens aberrations lead to a more complicated light distribution, characterized by a point spread function (psf), specifying the intensity of the light at any point (x,y) in the image plane due to a point light source in the object plane. As explained by Bove (V. M. Bove, Pictorial Applications for Range Sensing Cameras, SPIE vol. 901, pp. 10-17, 1988), the defocusing process is more accurately modeled as a convolution of the image intensities with a depth-dependent psf:
$$i_{def}(x,y;z) = i(x,y) * h(x,y;z), \qquad (3)$$
where idef(x,y; z) is the defocused image, i(x,y) is the in-focus image, h(x,y; z) is the depth-dependent psf and * denotes convolution. In the Fourier domain, this is written:
$$I_{def}(v_x,v_y) = I(v_x,v_y)\,H(v_x,v_y;z), \qquad (4)$$
where Idef(vx,vy) is the Fourier transform of the defocused image, I(vx,vy) is the Fourier transform of the in-focus image, H(vx,vy; z) is the Fourier transform of the depth-dependent psf, and vx, vy are two dimensional spatial frequencies. Note that the Fourier transform of the psf is the Optical Transfer Function, or OTF. Bove describes a depth-from-focus method, in which it is assumed that the psf is circularly symmetric, i.e. h(x,y; z)=h(r; z) and H(vx,vy; z)=H(ρ; z), where r and ρ are radii in the spatial and spatial frequency domains, respectively. Two images are captured, one with a small camera aperture (long depth of focus) and one with a large camera aperture (small depth of focus).
The Discrete Fourier Transform (DFT) is taken of corresponding windowed blocks in the two images, followed by a radial average of the resulting power spectra, meaning that an average value of the spectrum is computed at a series of radial distances from the origin in frequency space, over the full 360 degree angle. The radially averaged power spectra of the long and short depth of field (DOF) images are then used to compute an estimate for H(ρ; z) at corresponding windowed blocks, assuming that each block represents a scene element at a different distance z from the camera. The system is calibrated using a scene containing objects at known distances [z1, z2, . . . zn] to characterize H(ρ; z), which is then related to the blur circle diameter. A regression of the blur circle diameter vs. distance z then leads to a depth or range map for the image, with a resolution corresponding to the size of the blocks chosen for the DFT.
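The radial averaging step can be illustrated with a short NumPy sketch. The Hann window, the integer radial binning, and the square root of the spectral ratio used as an estimate of |H(ρ; z)| are assumptions made for illustration; they are one plausible reading of the procedure, not necessarily the estimator Bove used. The function names are hypothetical.

```python
import numpy as np

def radial_power_spectrum(block):
    """Radially averaged power spectrum of a windowed 2D image block."""
    block = np.asarray(block, dtype=float)
    rows, cols = block.shape
    # Assumed window choice: separable Hann window to reduce edge leakage.
    window = np.outer(np.hanning(rows), np.hanning(cols))
    spectrum = np.fft.fftshift(np.fft.fft2(block * window))
    power = np.abs(spectrum) ** 2
    # Integer radial distance of every frequency sample from the origin.
    y, x = np.indices(block.shape)
    radius = np.hypot(y - rows // 2, x - cols // 2).astype(int)
    sums = np.bincount(radius.ravel(), weights=power.ravel())
    counts = np.bincount(radius.ravel())
    return sums / np.maximum(counts, 1)

def estimate_otf_magnitude(long_dof_block, short_dof_block, eps=1e-12):
    """Assumed estimator: |H(rho)| ~ sqrt(P_short(rho) / P_long(rho))."""
    p_long = radial_power_spectrum(long_dof_block)
    p_short = radial_power_spectrum(short_dof_block)
    return np.sqrt(p_short / (p_long + eps))
```

In this reading, the long depth of field block stands in for the in-focus image I of Eq. (4) and the short depth of field block for Idef, so their ratio isolates the depth-dependent transfer function for that block.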
Methods based on blur circle regression have been shown to produce reliable depth estimates. Depth resolution is limited by the fact that the blur circle diameter changes rapidly near focus, but very slowly away from focus, and the behavior is asymmetric with respect to the focal position. Also, despite the fact that the method is based on analysis of the point spread function, it relies on a single metric (blur circle diameter) derived from the psf.
Other depth from defocus methods seek to engineer the behavior of the psf as a function of defocus in a predictable way. A controlled depth-dependent blurring function provides information that is used to deblur the image and to infer the depth of scene objects from the results of the deblurring operations. There are two main parts to this problem: the control of the psf behavior, and the deblurring of the image, given the psf as a function of defocus.
The psf behavior is controlled by placing a mask into the optical system, typically at the plane of the aperture stop. For example, FIG. 2 shows a schematic of an optical system from the prior art with two lenses 30 and 34, and a binary transmittance mask 32, including an array of holes, placed in between. In most cases, the mask is the element in the system that limits the bundle of light rays that propagate from an axial object point and is therefore, by definition, the aperture stop. If the lenses are reasonably free from aberrations, the mask, combined with diffraction effects, will largely determine the psf and OTF (see J. W. Goodman, Introduction to Fourier Optics, McGraw-Hill, San Francisco, 1968, pp. 113-117). This observation is the working principle behind the encoded blur or coded aperture methods. In one example of the prior art, Veeraraghavan et al. (Dappled Photography: Mask Enhanced Cameras for Heterodyned Light Fields and Coded Aperture Refocusing, ACM Transactions on Graphics 26 (3), July 2007, paper 69) demonstrate that a broadband frequency mask composed of square, uniformly transmitting cells can preserve high spatial frequencies during defocus blurring. By assuming that the defocus psf is a scaled version of the aperture mask, a valid assumption when diffraction effects are negligible, the authors show that depth information is obtained by deblurring. This requires solving the deconvolution problem, i.e. inverting Eq. (3) to obtain h(x,y; z) for the relevant values of z. In principle, it is easier to invert the spatial frequency domain counterpart of Eq. (3), i.e. Eq. (4), which is done at frequencies for which H(vx,vy; z) is nonzero.
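To illustrate why the frequency domain form is easier to invert, the following NumPy sketch divides the spectrum of a blurred image by H only at frequencies where |H| exceeds a threshold. It is a plain inverse filter under assumed circular (periodic) convolution and a known psf, not the deconvolution algorithm of the cited authors; the function names and the threshold value are hypothetical.

```python
import numpy as np

def circular_blur(image, psf):
    """Eq. (4) applied directly: multiply spectra (periodic boundaries assumed)."""
    H = np.fft.fft2(psf, s=image.shape)  # psf zero-padded to the image size
    return np.real(np.fft.ifft2(np.fft.fft2(image) * H))

def inverse_filter(blurred, psf, threshold=1e-3):
    """Recover I = Idef / H at frequencies where |H| is above the threshold."""
    H = np.fft.fft2(psf, s=blurred.shape)
    G = np.fft.fft2(blurred)
    usable = np.abs(H) > threshold
    F = np.zeros_like(G)          # frequencies with H near zero stay at zero
    F[usable] = G[usable] / H[usable]
    return np.real(np.fft.ifft2(F))
```

Frequencies at which H vanishes carry no recoverable information, which is the motivation for broadband masks that keep H away from zero over a wide range of spatial frequencies.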
In practice, finding a unique solution for deconvolution is well known to be a challenging problem. Veeraraghavan et al. solve the problem by first assuming the scene is composed of discrete depth layers, and then forming an estimate of the number of layers in the scene. Then, the scale of the psf is estimated for each layer separately, using the model
$$h(x,y;z) = m\left(k(z)\,x/w,\; k(z)\,y/w\right), \qquad (5)$$
where m(x,y) is the mask transmittance function, k(z) is the number of pixels in the psf at depth z, and w is the number of cells in the 2D mask. The authors apply a model for the distribution of image gradients, along with Eq. (5) for the psf, to deconvolve the image once for each assumed depth layer in the scene. The results of the deconvolutions are desirable only for those psfs whose scale they match, thereby indicating the corresponding depth of the region. These results are limited in scope to systems behaving according to the mask scaling model of Eq. (5), and masks composed of uniform, square cells.
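The mask scaling idea can be sketched as follows. The sketch interprets Eq. (5) as stretching a mask of w by w cells across a psf of k by k pixels using nearest-neighbor sampling, and normalizes the result to unit sum; both the coordinate convention and the normalization are assumptions made for illustration, and the function name is hypothetical.

```python
def scaled_mask_psf(mask, k):
    """Stretch a square w-by-w mask over k-by-k pixels and normalize to unit sum."""
    w = len(mask)
    # Pixel (x, y) of the psf takes the transmittance of the mask cell it falls in.
    psf = [[float(mask[(y * w) // k][(x * w) // k]) for x in range(k)]
           for y in range(k)]
    total = sum(sum(row) for row in psf)
    if total == 0.0:
        raise ValueError("mask transmits no light")
    return [[value / total for value in row] for row in psf]

# Hypothetical 2x2 binary mask evaluated at two psf scales (two depth layers).
mask = [[1, 0],
        [0, 1]]
psf_small = scaled_mask_psf(mask, 4)
psf_large = scaled_mask_psf(mask, 8)
```

Each candidate depth layer corresponds to one value of k(z), so a set of such psfs, one per layer, is what the per-layer deconvolution would be run against.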
Levin et al. (Image and Depth from a Conventional Camera with a Coded Aperture, ACM Transactions on Graphics 26 (3), July 2007, paper 70) follow a similar approach to Veeraraghavan et al.; however, Levin et al. rely on direct photography of a test pattern at a series of defocused image planes to infer the psf as a function of defocus. Also, Levin et al. investigated a number of different mask designs in an attempt to arrive at an optimum coded aperture. They assume a Gaussian distribution of sparse image gradients, along with a Gaussian noise model, in their deconvolution algorithm. Therefore, the optimized coded aperture solution is dependent on assumptions made in the deconvolution analysis.
Other techniques that rely on the circular symmetry of the psf, but do not use a coded aperture, include the approach described by Nayar et al. in Real-Time Focus Range Sensor, IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (12), 1186-1198 (1996). This method uses two captures of the scene at two focus positions, along with a focus measure, to infer the depth of scene objects. In another example, Aslantas et al. in Depth from Automatic Defocusing, Optics Express 15(3), 1011-1023 (2007) describe a technique in which a focused or defocused image is captured at certain camera parameters or settings. One or more camera parameters are then changed, which alters the sharpness of the original image. Next, one or more other camera parameters (that is, parameters that were not altered previously) are changed with the aim of restoring the image to its original sharpness, and a second image is captured. The camera parameters are altered iteratively until a sharpness match with respect to the original capture is achieved. Equations are given that relate the camera parameters to the change in depth.