Digital cameras, including digital single-lens reflex (DSLR) cameras and digital cameras integrated into mobile devices, often have sophisticated hardware and software that enable a user to capture digital images using a combination of different user-defined and camera-defined configuration settings. A digital image provides a digital representation of a particular scene. A digital image may subsequently be processed, by itself or in combination with other images, to derive additional information from the image. For example, one or more images may be processed to estimate the depths of the objects depicted within the scene, i.e., the distance of each object from the location from which the picture was taken. The depth estimates for each object in a scene, or possibly each pixel within an image, are included in a file referred to as a “depth map.” Among other things, depth maps may be used to improve existing image editing techniques (e.g., cutting, hole filling, copy to layers of an image, etc.).
Depth from defocus is a conventional technique used to estimate the depth of a scene using out-of-focus blur (i.e., to generate depth maps). Depth estimation using such techniques is possible because imaged scene locations will have different amounts of out-of-focus blur (i.e., depth information) based on the configuration settings of the camera (e.g., aperture setting and focus setting) used to take the images. Estimating depth, therefore, involves estimating the amount of depth information at the different scene locations, whether the depth information is derived from one image or from multiple images. Conventionally, the accuracy of such estimates depends on the number of images used and on the amount of depth information those images contain. This is because the greater the number of input images, the greater the amount of depth information that can be compared for any one position (e.g., pixel) in the scene.
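The dependence of out-of-focus blur on depth and on the camera configuration settings can be illustrated with a minimal sketch. It assumes an ideal thin-lens model, which is a standard textbook approximation and is not taken from the description above; the function name `blur_diameter_mm`, its parameters, and the numeric settings are hypothetical and chosen only for illustration.

```python
def blur_diameter_mm(object_distance_mm, focus_distance_mm, focal_length_mm, f_number):
    """Diameter of the out-of-focus blur circle on the sensor.

    Assumption: ideal thin lens. All names and units here are illustrative.
    """
    if focus_distance_mm <= focal_length_mm:
        raise ValueError("focus distance must exceed the focal length")
    if object_distance_mm <= 0 or f_number <= 0:
        raise ValueError("object distance and f-number must be positive")
    # Aperture diameter follows from the focal length and the f-number.
    aperture_diameter_mm = focal_length_mm / f_number
    # Magnification of the plane that is in focus.
    magnification = focal_length_mm / (focus_distance_mm - focal_length_mm)
    # Relative distance of the object from the in-focus plane.
    defocus = abs(object_distance_mm - focus_distance_mm) / object_distance_mm
    return aperture_diameter_mm * magnification * defocus


# Hypothetical settings: a 50 mm lens imaging an object 4 m away.
for f_number, focus_mm in [(2.0, 2000.0), (8.0, 2000.0), (2.0, 4000.0)]:
    blur = blur_diameter_mm(4000.0, focus_mm, 50.0, f_number)
    print(f"f/{f_number:g}, focus {focus_mm:g} mm -> blur {blur:.4f} mm")
```

Under this model, an object located at the focus distance produces no blur, and a wider aperture (smaller f-number) produces more blur for the same object depth. Images of the same scene location taken with different settings therefore carry different blur values that can be compared to estimate depth.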
One conventional depth from defocus technique compares blurry patches in a single image with certain assumptions about the scene derived from prior image models. While these assumptions may hold true for certain scenes, they fail when the underlying image does not have sufficient depth information to fit the assumptions. Another conventional technique estimates depth by processing multiple images captured as a focal stack (i.e., same aperture, different focus) and fitting those images to an image model. The number of images typically corresponds to the number of available focus settings for the digital camera. This can lead to inefficiencies because more images are often taken than would otherwise be required to provide sufficient depth information. In addition, this technique requires that the images be fitted to an image model, which can lead to imprecise depth estimates. Yet another conventional technique estimates depth by processing multiple images captured as an aperture stack (i.e., same focus, different aperture). Similar to the focal stack technique, this conventional technique requires that many images be taken, which can be inefficient when fewer images would provide sufficient depth information. And even though this aperture stack technique often captures more images than would otherwise be required, the resulting images often have areas where the depth information is insufficient, because the camera configuration settings are not predetermined to preserve optimal depth information. Thus, the depth maps output from processing images captured using this aperture stack technique are often very coarse. Finally, another conventional technique processes a dense set of images (i.e., hundreds of images) with varying aperture and focus settings and compares each pixel in the dense set of images to estimate depth. This conventional technique outputs a precise depth map, but still requires a dense set of images, which can be inefficient to capture and process.
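The capture patterns described above can be sketched as lists of (aperture, focus) pairs, which makes the number of images required by each pattern explicit. This is an illustrative sketch only: the function names, the f-number values, and the count of 30 focus positions are hypothetical and do not describe any particular camera.

```python
from itertools import product


def focal_stack(aperture, focus_settings):
    """Same aperture, different focus: one image per focus setting."""
    return [(aperture, focus) for focus in focus_settings]


def aperture_stack(aperture_settings, focus):
    """Same focus, different aperture: one image per aperture setting."""
    return [(aperture, focus) for aperture in aperture_settings]


def dense_set(aperture_settings, focus_settings):
    """Every combination of aperture and focus setting."""
    return list(product(aperture_settings, focus_settings))


# Hypothetical camera: 8 aperture settings (f-numbers) and 30 focus positions.
apertures = [2.0, 2.8, 4.0, 5.6, 8.0, 11.0, 16.0, 22.0]
focus_positions = list(range(30))

print("focal stack images:   ", len(focal_stack(apertures[0], focus_positions)))
print("aperture stack images:", len(aperture_stack(apertures, focus_positions[0])))
print("dense set images:     ", len(dense_set(apertures, focus_positions)))
```

In each pattern the image count is fixed by the number of available settings rather than by how much depth information the scene actually requires, which is the inefficiency noted above.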
Thus, conventional depth from defocus techniques rely on assumptions about the scene, require the user to capture a large number of input images, require the user to capture images using patterns of camera settings that are not predetermined to preserve sufficient depth information, and/or are capable of outputting only a low-quality depth map. Accordingly, it is desirable to provide improved solutions for predictively determining a minimum number of images to be captured and the camera configuration settings used for capturing them, such that the images collectively provide sufficient depth information from which a quality depth map can be generated.