Imaging systems in the field of the invention generally rely on the basic principle of triangulation. The most basic implementation of this principle involves images from only two locations, where the effective aperture for the pixels in the two images is small relative to the separation between the two locations. (Herein the effective aperture is considered to be the portion of the physical aperture that contains all of the rays that reach the active part of the sensing pixel.) This implementation with two images from different locations is called stereo vision and is often implemented with two separate cameras and lenses. To perform triangulation, a correspondence problem for the images from different locations needs to be solved to determine the location of an object in both images. The location of the object within each image determines a line from the position of the corresponding camera toward the object. The intersection of these two lines determines the object's location in a scene, which gives the depth of the object. (The depth of an object in the scene is the distance from the imaging system to the object, and the scene is the part of the three-dimensional world outside the camera that is visible to the camera. Typically the camera captures a two-dimensional representation of the three-dimensional scene, namely an image.) In other words, the disparity, which is the shift in the object's position between the two images, is used to determine the depth of the object.
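By way of illustration only, the relationship between disparity and depth can be sketched for an idealized two-view system. The sketch assumes rectified views with parallel optical axes and a pinhole model, for which depth equals focal length times baseline divided by disparity; this model, the function name, and the parameter names are assumptions made for the example and are not taken from the description above.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth of an object for an idealized, rectified two-view system.

    Assumed pinhole model: depth = focal_length * baseline / disparity.
    disparity_px    -- shift of the object between the two images, in pixels
    focal_length_px -- focal length expressed in pixels (hypothetical parameter)
    baseline_m      -- separation between the two view locations, in meters
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive to give a finite depth")
    return focal_length_px * baseline_m / disparity_px


# A larger disparity corresponds to a closer object.
near = depth_from_disparity(disparity_px=20, focal_length_px=1000, baseline_m=0.1)
far = depth_from_disparity(disparity_px=10, focal_length_px=1000, baseline_m=0.1)
print(near, far)
```

Under this assumed model, halving the disparity doubles the estimated depth, which is why small disparities correspond to distant objects.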
When the geometry of the imaging system is known, only certain matches, referred to as feasible matches, should be considered. These are the matches for which the associated lines into the scene from the cameras' locations intersect each other. For an imaging system with two cameras or two view images, this means that for a given region in a first image, the set of possible matches in the second image lies along a straight line through the second image. Solving the correspondence problem accurately requires that the region in the first image closely resemble only one region along this line of possible matches, that is, a region centered at a single point on the line.
Because of the geometry of triangulation, the disparity increases with the distance between the locations of the views, called the baseline. For imaging systems, the disparity is inherently measured in units of pixels in an image. A disparity of one pixel between two images from different viewpoints may be considered the minimum disparity necessary to reliably estimate depth from the two images. Therefore, depth accuracy increases as the baseline increases. However, for baselines larger than the diameter of a single lens, this principle may not hold, because the scene appears different from different locations in a manner that cannot be approximated by local translations of objects. For example, near occlusions, objects may be visible in only one image. However, this effect is negligible for monocular imaging systems due to the limited baseline imposed by the dimensions of the lens relative to the distance to objects in the scene.
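As a hedged numerical sketch of the baseline relationship, the following assumes the same idealized rectified pinhole model, in which disparity in pixels equals focal length in pixels times baseline divided by depth. The function names and the sample values are hypothetical and chosen only for illustration.

```python
def disparity_px(depth_m, focal_length_px, baseline_m):
    """Disparity in pixels under an assumed rectified pinhole model."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_length_px * baseline_m / depth_m


def max_depth_for_min_disparity(focal_length_px, baseline_m, min_disparity_px=1.0):
    """Largest depth at which the disparity still reaches min_disparity_px."""
    return focal_length_px * baseline_m / min_disparity_px


# Doubling the baseline doubles the disparity observed at a fixed depth,
# and doubles the depth at which a one-pixel disparity is still reached.
small = disparity_px(depth_m=2.0, focal_length_px=1000.0, baseline_m=0.004)
large = disparity_px(depth_m=2.0, focal_length_px=1000.0, baseline_m=0.008)
print(small, large)
print(max_depth_for_min_disparity(1000.0, 0.004))
```

The millimeter-scale baselines used here are meant only to suggest the limited baseline available within a single lens; they are not values from the description.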
Since every pixel in a traditional camera has an effective aperture equal to the physical aperture of the camera, disparity cannot be observed using traditional cameras. FIG. 1 shows an example of such a basic camera setup, including an optical axis 100, a main lens 101, a micro-lens array 102 and an image sensing unit 103.
Imaging systems in the general field of the invention compare different view images to determine the disparity and in turn estimate the depth of an object. Some approaches use a small percentage of the pixels to obtain, at a few locations, two view images in which the effective aperture is half of the physical aperture, typically the left and right halves of the aperture. For simplicity of description, consider only the design that uses the left and right halves, which is functionally equivalent to the use of the top and bottom halves. These depth sensing pixels are often placed adjacent to each other in a section of a few rows of the sensor, so that within any local region of the sensor all of the depth sensing pixels occupy a single row. Therefore, the depth may only be estimated at a small number of locations in the scene. Knowing the depth at a small number of locations may be sufficient for autofocus detection, which is the intended use of these pixels. However, it is insufficient for many applications where an entire depth image is needed.
The effective apertures of the depth sensing pixels in these sensors are generally implemented by one of two designs. The first design includes placing the depth sensing pixels behind micro-lenses that are horizontally approximately twice as wide as the pixel pitch. Generally, all of the light that falls on the micro-lens from the left or right half of the physical aperture is directed to the appropriate pixel behind the micro-lens. The second design includes a light mask, so that the light falling on the pixel from the undesired part of the physical aperture is either blocked before reaching the pixel or not measured by the pixel. Although the two designs achieve nearly equivalent effective apertures, there are a few differences. The light mask blocks light, which reduces the signal-to-noise ratio of the resultant measurements. Light masks can be built for a single pixel, whereas the micro-lens must apply to two adjacent pixels to achieve complementary effective apertures.
These designs, which acquire only two view images, do not offer robust depth estimation. Consider part of a scene that contains a flat surface whose primary feature is a horizontal line, for example a uniformly colored plane with a horizontal line, viewed by such an imaging system. It is impossible to accurately solve the correspondence problem for this scene. Image regions near the horizontal line in one image match equally well with all similar regions along the line in the other image. Since the imaging system only offers a horizontal change in viewpoint, due to the horizontal baseline between the two effective apertures, and the scene contains only a horizontal feature, the depth is impossible to accurately estimate. This problem applies not only to lines in the scene that are parallel to the baseline but also to the component of any line in the scene that is parallel to the baseline. The inability of the imaging system to use this clearly defined feature in the scene reduces the accuracy of any subsequent depth estimation.
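The ambiguity described above can be illustrated with a small synthetic sketch. It assumes a horizontal baseline, so that candidate matches lie along the same rows, and uses a sum of absolute differences as the matching cost; the cost measure, the image sizes, and all function names are assumptions made for this example rather than part of any described system.

```python
def make_image(height, width, line, orientation):
    """Uniform image (0.0) containing a single bright line (1.0)."""
    img = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if (orientation == "horizontal" and r == line) or (
                orientation == "vertical" and c == line
            ):
                img[r][c] = 1.0
    return img


def shift_right(img, shift):
    """Second view: shift the image horizontally, replicating the edge column."""
    return [[row[max(c - shift, 0)] for c in range(len(row))] for row in img]


def sad_costs(first, second, row, col, half, max_shift):
    """Matching cost of one patch against candidates along the same rows."""
    costs = []
    for d in range(max_shift + 1):
        cost = 0.0
        for r in range(row - half, row + half + 1):
            for c in range(col - half, col + half + 1):
                cost += abs(first[r][c] - second[r][c + d])
        costs.append(cost)
    return costs


TRUE_SHIFT = 3  # hypothetical disparity between the two views, in pixels

horizontal = make_image(9, 16, line=4, orientation="horizontal")
vertical = make_image(9, 16, line=6, orientation="vertical")

# Horizontal line: every candidate shift has the same cost (ambiguous).
print(sad_costs(horizontal, shift_right(horizontal, TRUE_SHIFT), 4, 6, 1, 5))
# Vertical line: the cost has a single minimum at the true shift.
print(sad_costs(vertical, shift_right(vertical, TRUE_SHIFT), 4, 6, 1, 5))
```

In this sketch the line parallel to the baseline yields identical costs for all candidate shifts, so no disparity can be selected, whereas the line perpendicular to the baseline yields a unique best match.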
An alternative design is to have all or nearly all pixels of the sensor have an effective aperture of the left or right half of the physical aperture, as described above. This design overcomes the limitation of the previously described approach, which can only estimate depth at a small number of locations in the scene. However, such sensors with all or nearly all pixels as depth sensing pixels suffer a significant loss in spatial resolution. They can only output optical intensity images with half of the total pixels that exist in the sensor, because each output pixel is the average of two sensor pixels. By averaging pixels with left and right half effective apertures, a traditional pixel with a complete effective aperture is simulated (herein a traditional pixel is a pixel with an effective aperture approximately centered at the center of the physical aperture). The significant loss of spatial resolution is a serious limitation of this design.
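The loss of resolution from averaging can be sketched as follows. The sketch assumes, purely for illustration, that left-half and right-half pixels alternate along each sensor row; the layout, the function name, and the sample values are hypothetical.

```python
def split_and_combine(sensor_row):
    """Split one sensor row into two view images and one intensity row.

    Assumed layout: even positions hold left-half-aperture pixels and odd
    positions hold right-half-aperture pixels.
    """
    left_view = sensor_row[0::2]
    right_view = sensor_row[1::2]
    # Averaging each left/right pair simulates one traditional pixel.
    intensity = [(l + r) / 2.0 for l, r in zip(left_view, right_view)]
    return left_view, right_view, intensity


row = [10, 12, 20, 22, 30, 34]
left, right, intensity = split_and_combine(row)
print(left, right, intensity)
print(len(row), len(intensity))
```

The intensity row has half as many samples as the sensor row, which corresponds to the halving of spatial resolution noted above.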
There is a need for systems and methods of depth estimation that can provide accurate depth estimation over a wide area of the scene, without sacrificing spatial imaging resolution.