Since the early 1970s, estimating the three-dimensional (3D) structure of a scene from two-dimensional (2D) imagery has been one of the most actively researched areas in the fields of digital image processing and computer vision. Thirty years of active research attests to the difficulty of developing efficient computational stereo techniques that are able to reconstruct dense scene structure estimates from stereo or monocular imagery.
The first step in bounding the problem is to define which sensors are considered imaging or visual sensors that generate the “visual images” for input. Sensors are typically categorized as active or passive. Active sensors include radar, synthetic aperture radar, ladar, sonar, and sonograms, which recover 3D information directly by sending out energy and analyzing the timing and/or content of reflections or returns. Also considered active are structured light sensors, which actively transmit a known lighting pattern to illuminate the target scene and then analyze images of the augmented scene. Active sensors measure depth directly and thus do not need to estimate it from image content. Active sensors stand in contrast with passive staring visual sensors that analyze incoming energy that they did not generate. The common visible-light camera and video camera are passive visual sensors, as are electro-optic (EO) sensors that operate on other wavelengths like infrared, ultraviolet, multispectral, or hyperspectral.
Even within the more constrained realm of passive visual sensors, a wide variety of sensor configurations have been explored. Some approaches use a single camera or viewpoint (monocular), while others use two or more synchronized cameras capturing images from different viewpoints (binocular). Some approaches use video sequences (again, monocular, binocular, or even trinocular) with incremental changes in camera position between frames, and others operate on sets of a few images captured from widely varying viewpoints.
A number of different approaches can be followed to extract information on 3D structure from one or more images of a scene. “Shape from focus” techniques estimate depth by varying a camera's focal length or other intrinsic parameters, and identifying which parts of the image are sharply in focus at which set of parameters. “Shape from shading” techniques analyze changes in image intensity over space in a single image to infer the gradient of the surface being imaged. “Semantic information” can also be used—if the real-world size and geometry of an Abrams M-1 tank is known and an Abrams M-1 tank is recognized in an image, the known size and appearance of its projection can be used to infer its distance and pose relative to the camera. Finally, direct per-pixel depth estimates can be extracted by using “structure from stereo” and “structure from motion” techniques, collectively known as computational stereo techniques. “Structure from stereo” refers to approaches based on two or more cameras, and “structure from motion” refers to approaches that use a single camera and the motion of that camera relative to the scene to simulate the existence of two cameras. The final output of these passive structure recovery systems is almost always depth (or range).
Computational stereo approaches generate depth estimates at some set of locations (or directions) relative to a reference frame. For two-camera approaches, these estimates are often given relative to the first camera's coordinate system. Sparse reconstruction systems generate depth estimates at a relatively small subset of possible locations, whereas dense reconstruction systems attempt to generate estimates for most or all pixels in the imagery.
Computational stereo techniques estimate a range metric such as depth by determining corresponding pixels in two images that show the same entity (scene object, element, location or point) in the 3D scene. Given a pair of corresponding pixels and knowledge of the relative position and orientation of the cameras, depth can be estimated by triangulation to find the intersecting point of the two camera rays. Once depth estimates are computed, knowledge of intrinsic and extrinsic camera parameters for the input image frame is used to compute equivalent 3D positions in an absolute reference frame (e.g., global positioning system (GPS) coordinates), thereby producing, for example, a 3D point cloud for each frame of imagery, which can be converted into surface models for further analysis using volumetric tools.
While it is “depth” which provides the intuitive difference between a 2D and a 3D image, it is not necessary to measure or estimate depth directly. “Disparity” is another range metric that is analytically equivalent to depth when other parameters are known. Disparity refers, generally, to the difference in pixel locations (i.e., row and column positions) between a pixel in one image and the corresponding pixel in another image. More precisely, a disparity vector L(i,j) stores the difference in pixel indices between matching pixels in image IA and image IB. If pixel IA(10,20) matches pixel IB(15,21), then the disparity is L(10,20)=(15,21)−(10,20)=(5,1), assuming L is computed relative to reference frame IA. Zero disparity means that pixel IA(m,n) corresponds to pixel IB(m,n), so L(m,n)=(0,0). If camera position and orientation are known for two frames being processed, then quantities such as correspondences, disparity, and depth hold equivalent information: depth can be calculated from disparity by triangulation.
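The disparity arithmetic above is simple enough to state directly in code. The sketch below is illustrative only; the function name and tuple convention are not from the original text, but the values reproduce the example given (pixel IA(10,20) matching pixel IB(15,21)).

```python
def disparity(pix_a, pix_b):
    """Disparity vector between a pixel in reference image IA and its
    corresponding pixel in image IB, as (row, col) index differences."""
    return (pix_b[0] - pix_a[0], pix_b[1] - pix_a[1])

# Example from the text: IA(10, 20) matches IB(15, 21)
d = disparity((10, 20), (15, 21))   # -> (5, 1)

# Zero disparity: the match lies at the same indices in both images
z = disparity((7, 7), (7, 7))       # -> (0, 0)
```

Note that the sign convention matters: here the vector is computed relative to reference frame IA, matching the text's definition L(10,20) = (15,21) − (10,20) = (5,1).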
A disparity vector field stores a disparity vector at each pixel, and thus tells how to find the match (or correspondences) for each pixel in the two images. When intrinsic and extrinsic camera parameters are known, triangulation converts those disparity estimates into depth estimates and thus 3D positions relative to the camera's frame of reference.
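For the common special case of a rectified stereo pair (a geometry the text does not assume, so this is a hedged simplification), disparity reduces to a horizontal pixel offset and triangulation collapses to the scalar relation Z = f·B/d, where f is the focal length in pixels and B is the baseline between the cameras. The function and parameter names below are illustrative:

```python
def depth_from_disparity(d_pixels, focal_px, baseline_m):
    """Triangulated depth Z = f * B / d for a rectified stereo pair.

    Assumes horizontal-only disparity after rectification, with the
    focal length expressed in pixels and the baseline in meters.
    Larger disparity means the point is closer to the cameras.
    """
    if d_pixels <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / d_pixels

# e.g., a 5-pixel disparity with f = 1000 px and B = 0.5 m gives Z = 100 m,
# consistent with the large absolute ranges discussed later in the text.
```

This inverse relationship also explains why distant scenes (hundreds or thousands of meters) yield small disparities unless baselines are large, which is relevant to the aerial modeling application discussed below.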
An important problem in dense computational stereo is to determine the correspondences between all the pixels in the two (or more) images being analyzed. This computation, which at its root is based on a measure of local match quality between pixels, remains a challenge, and accounts for the majority of complexity and runtime in computational stereo approaches.
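One of the simplest local match-quality measures is the sum of absolute differences (SAD) over a small window, searched along a range of candidate disparities. The sketch below is a minimal illustration under assumed rectified grayscale images, not a production matcher; all names and the search convention (matches shifted left in the second image) are assumptions for the example.

```python
import numpy as np

def best_match_sad(left, right, row, col, win=3, max_disp=16):
    """Estimate the horizontal disparity at left-image pixel (row, col)
    by minimizing the sum of absolute differences (SAD) between a
    (2*win+1) x (2*win+1) window in the left image and candidate
    windows in the right image, over disparities 0..max_disp."""
    patch = left[row - win:row + win + 1,
                 col - win:col + win + 1].astype(np.float64)
    best_d, best_cost = 0, np.inf
    for d in range(max_disp + 1):
        c = col - d                       # candidate column in the right image
        if c - win < 0:                   # window would fall off the image edge
            break
        cand = right[row - win:row + win + 1,
                     c - win:c + win + 1].astype(np.float64)
        cost = np.abs(patch - cand).sum() # local match quality: lower is better
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d

# Synthetic check: shift a random image left by 5 pixels (with wraparound)
rng = np.random.default_rng(0)
left = rng.integers(0, 256, size=(32, 32))
right = np.roll(left, -5, axis=1)         # right[r, c-5] == left[r, c]
d = best_match_sad(left, right, row=10, col=12)
```

Even this toy version hints at why correspondence dominates runtime: the window comparison must be repeated for every pixel and every candidate disparity, and real systems face large disparity search ranges of the kind described below.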
Academia and industry have provided many advances in automated stereo reconstruction, but the domain still lacks a general solution that is robust and deployable in real-world scenarios. A number of facets of potential general solutions remain open research problems. Runtime and efficiency continue to be challenges, as does finding match quality metrics that are robust to low-quality imagery or changing scene conditions. Robustness to camera path changes and scene orientation is also an issue.
Calibrated monocular aerial modeling is an application that has received somewhat less attention than other areas in computational stereo, and it lacks a generally applicable solution. In these applications, the camera typically follows an aerial platform's known but independently controlled path, with position and orientation changing incrementally between frames. A standard stereo geometry is not available, and stereo groupings must be selected from within a set of buffered frames. Intrinsic and extrinsic camera parameters are typically known to a high degree of accuracy. In contrast with many other applications, expected characteristics include large absolute ranges to the scene (hundreds or thousands of meters), large absolute disparities (tens or hundreds of pixels), and large disparity search ranges. Approaches encounter complex and uncontrolled outdoor scenes that may contain moving objects and are imaged under uncontrolled outdoor lighting. Images may also contain various other artifacts.
Reliable solutions in these areas would enable a wide variety of applications in the commercial, military, and government domains. Rapid passive modeling of urban or rural areas is valuable in itself for virtual training and virtual tourism, but that capability also enables improved tracking, surveillance, and change detection, supports disaster response, and facilitates more robust autonomous systems through visually-aided navigation, object recognition, and other follow-on processing.