1. Field of the Invention
This invention relates in general to an apparatus and a method to determine the depth of an object in a scene, and, more particularly, to a multi-image, single sensor depth recovery system.
2. Description of the Related Art
Previously, many depth recovery systems attempted to extract depth from a scene by mimicking the human binocular stereo depth perception system to produce two or more images, each image being viewed from a different environmental, i.e., spatially separated, point. In order for these depth recovery systems to operate properly, two separate but interconnected aspects must be considered and implemented. These aspects are the available hardware and software components that are to be used in constructing the depth recovery system.
On the hardware side, many of these systems require the use of multiple sensors, such as CCD cameras, and typically employ one sensor per desired image. Other systems replace the multiple sensor array with a single sensor that is moved with respect to the scene in a precisely known manner with sequential data being taken and associated with the different environmental points of the known sensor movement.
Both the multiple sensor array and single moving sensor systems attempt to simulate the apparently simple human binocular system, but each results in considerable hardware and computational complexity when actually implemented.
Among the common problems of such multiple or moving sensor systems is the difficulty in maintaining real-time computational calibration of each of the sensors forming the system. This is an especially critical problem since depth analysis processing of the system data often requires precise sensor location and known movement to provide useful data. When possibly chaotic and unsteady movement of the sensor platform is considered, the time required for depth analysis processing of the data stream from the sensors increases to the point of rendering real-time analysis of a scene extremely difficult if not practically impossible. Likewise, such chaotic and unsteady movement of the sensor platform often produces large correlation errors between the computed depth and the physical scene.
Likewise, on the software or computational algorithm implementation side, numerous methods for computation of depth from multiple images derived from multiple sensors have been developed over the last 20 or so years.
Typically, as a necessary step in extracting depth from a scene, these depth analysis computational algorithms assume some sort of constraint on the possible locations for matching points or features between multiple images viewed from different environmental points. In the simplest case, two images are used (see FIG. 1), which enables accurate depth recovery in many cases, but raises difficult problems in finding corresponding points between images, aligning sensors, calibrating algorithms, and digitizing images. For repetitive and highly textured surfaces, depth recovery from two stereo images obtained from two cameras becomes completely ambiguous. Multiple camera methods can simplify the problem of finding corresponding image points and eliminate any ambiguity from repetitive textures, but at the expense of increasing the quantity of data to process (and potentially the computer processing time) while simultaneously increasing the difficulty of calibrating the multiple sensors.
A primary barrier to the use of a multi-image lens in conjunction with a single imaging sensor is the small baseline separation between image centers, normally a few millimeters. With such a small baseline separation, poor accuracy of depth measurements is likely. Consequently, some previous attempts have used large, cumbersome mirror systems to reflect two images of a scene to the same sensor. These systems are prone to error in calibration, and are both large and difficult to move, rendering them useless for moving sensor platforms such as those on vehicles.
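The sensitivity of depth accuracy to baseline can be illustrated with the standard first-order error estimate for an idealized pinhole stereo pair, in which the depth uncertainty grows as Z squared times the disparity error, divided by the product of focal length and baseline. The following is a minimal sketch under that assumption; the function name and all numeric values are hypothetical and are not taken from the systems described above.

```python
def depth_error(depth_m, focal_length_px, baseline_m, disparity_error_px=1.0):
    """First-order depth uncertainty for an idealized pinhole stereo pair.

    Assumes depth Z = f * B / d, so an error of disparity_error_px in the
    disparity d maps to roughly Z**2 * disparity_error_px / (f * B) in depth.
    """
    return depth_m ** 2 * disparity_error_px / (focal_length_px * baseline_m)


# Hypothetical values: object at 2 m, focal length of 800 pixels.
wide = depth_error(2.0, 800.0, 0.10)     # 10 cm baseline
narrow = depth_error(2.0, 800.0, 0.003)  # 3 mm baseline

print(f"10 cm baseline: about {wide:.3f} m of depth error per pixel of disparity error")
print(f"3 mm baseline: about {narrow:.3f} m of depth error per pixel of disparity error")
```

Under these assumed values, shrinking the baseline from 10 cm to 3 mm increases the estimated depth error by the same factor of roughly 33, which is why a baseline of a few millimeters demands correspondingly finer disparity measurement.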
Recent multi-image stereo algorithms, however, have been developed with the capability to use very small baseline separations of the order of a few millimeters and still obtain accurate depth to near objects. These systems use multi-camera arrangements or accurate camera positioning to achieve sufficient sensor, and hence baseline, separation of the images for processing purposes.
A traditional two sensor stereo system using cameras as sensors is shown in FIG. 1. In FIG. 1 two cameras 10 and 12 are positioned in a known spatial relationship to one another and accurately held in this spatial relationship to maintain integrity of the data stream during processing to determine depth of object 14 in the scene being viewed by the cameras.
Typically in this structure, the optical axes 16 and 18 of the two cameras 10 and 12 respectively, are accurately aligned to be both parallel and vertical to one another. The translational separation between the aligned optical axes of the two cameras is called the baseline.
The light rays 22 and 24 from the surface of object 14 are respectively directed by imaging arrays 26 and 28 to cameras 10 and 12 where each camera forms a data stream corresponding to the view of object 14 as would be seen by an observer located at the environmental point occupied by each respective camera.
Once the cameras are aligned, the difference in position of the same environmental point between the two images (known as disparity or parallax), together with the known geometry of the two camera system, can be used to compute the depth to the environmental point using known geometric principles.
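For a pair of cameras with parallel optical axes, the geometric principle referred to above reduces, under an idealized pinhole model, to the familiar relation Z = f * B / d, where f is the focal length in pixels, B is the baseline, and d is the disparity in pixels. The sketch below illustrates that relation only; the function name and the numeric values are assumptions chosen for illustration.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Depth Z = f * B / d for an idealized parallel-axis stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point at finite depth")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical values: 800-pixel focal length, 0.10 m baseline, 16-pixel disparity.
depth_m = depth_from_disparity(800.0, 0.10, 16.0)
print(f"estimated depth: {depth_m:.2f} m")
```

Because depth is inversely proportional to disparity, nearby points produce large disparities and distant points produce small ones.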
This approach, however, requires not only an accurate initial and maintained alignment of the sensors, but further suffers from the fact that it is extremely difficult to search the produced sensor data stream to find corresponding points in the two environmentally displaced images.
A three or more sensor system using cameras is shown in FIG. 2 where elements similar to those described above for FIG. 1 are indicated by a prime appearing on a similar reference numeral.
In FIG. 2, multiple cameras 10' are positioned in a known spatial relationship to one another and accurately held in this spatial relationship to maintain integrity of the data stream during processing to determine depth of the object 14' in the scene being viewed by the cameras.
Typically in this structure, the optical axes 16' of each of the cameras 10' are accurately aligned to be both parallel and vertical to one another. The translational separation between the aligned optical axes of the cameras is called the baseline.
The light rays 22' from the surface of object 14' are respectively directed by imaging arrays 26' to each of the cameras 10' so that each camera 10' forms a data stream corresponding to the view of object 14' as would be seen by an observer located at the environmental point occupied by each respective camera 10'.
Once the cameras are aligned, the difference in position of the same environmental point between each of the images (known as disparity or parallax), together with the known geometry of the multiple camera system, can be used to compute the depth to the environmental point using known geometric principles.
The advantage of this three or more sensor system over the two sensor system shown in FIG. 1 is that the data from three or more environmentally displaced sensors imposes additional constraints on where corresponding points may be found in the images and in the data stream produced by the sensors. This somewhat reduces the processing time required to find the corresponding points, bringing it to more manageable proportions than in the two camera system described above.
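One way to picture the additional constraint is that, for parallel cameras, the disparity of a given point is proportional to the baseline, so a candidate match is plausible only if every camera pair implies the same depth. The toy sketch below uses a periodic texture, for which each pair alone yields several equally good candidates; all names and numbers are hypothetical, and the exact integer comparison stands in for the tolerance-based test a real system would need.

```python
def candidate_disparities(true_disparity, period, max_disparity):
    """Disparities in [0, max_disparity] that match equally well on a
    texture repeating every `period` pixels."""
    first = true_disparity % period
    return list(range(first, max_disparity + 1, period))


def consistent_pairs(cands_a, cands_b, baseline_a, baseline_b):
    """Keep candidate pairs whose disparity-to-baseline ratios agree,
    i.e. pairs that imply the same depth for both camera pairs."""
    return [(da, db) for da in cands_a for db in cands_b
            if da * baseline_b == db * baseline_a]


# Hypothetical setup: texture period of 8 pixels, search range of 32 pixels,
# two camera pairs with baselines in the ratio 2:3.
PERIOD = 8
MAX_DISPARITY = 32
BASELINE_A, BASELINE_B = 2, 3
TRUE_A, TRUE_B = 10, 15  # true disparities, proportional to the baselines

cands_a = candidate_disparities(TRUE_A, PERIOD, MAX_DISPARITY)
cands_b = candidate_disparities(TRUE_B, PERIOD, MAX_DISPARITY)
matches = consistent_pairs(cands_a, cands_b, BASELINE_A, BASELINE_B)

print("pair A candidates:", cands_a)
print("pair B candidates:", cands_b)
print("consistent with a single depth:", matches)
```

In this toy case each pair alone leaves four candidates, while requiring agreement between the pairs leaves one within the search range. The ambiguity is pushed out to a larger disparity rather than removed outright, so the result depends on the chosen search range and baseline ratio.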
The disadvantage of this three or more sensor system relative to the two sensor system shown in FIG. 1 is that each of the sensors must be accurately aligned and positioned, or, if a single sensor is translated along a known direction, its instantaneous position must be continuously known to a high degree of accuracy for useful processing of the produced data stream.
In an article appearing in IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 14, No. 2, February 1992, pp. 99 to 106, titled "Single Lens Stereo with a Plenoptic Camera", E. H. Adelson and J. Y. A. Wang described a single sensor method for obtaining depth to environmental points in a scene using a device which they call a plenoptic camera. The plenoptic camera uses an optical system called a lenticular array consisting of an array of microlenses. Each microlens forms an image of the main lens aperture onto the sensor plane (see FIG. 6 on page 102 of this article).
The system proposed here uses a multi-image lens instead of a lenticular array as proposed in the article. Multi-image lenses are commercially available with a wide range in the number of facets they possess. In the present system, multiple images are formed through the multi-image lens onto whatever optics the original sensor uses.