1. Field of Invention
The present invention relates to the field of catadioptric cameras. More specifically, the present invention relates to the use of a catadioptric camera to capture a light field of a real-world scene.
2. Description of Related Art
Computer generated three dimensional, i.e. 3D, images are desirable in various applications, such as research and entertainment. There are various methods of rendering 3D images. For example, a scene consisting of geometric primitives composed of different materials and a set of lights may be input to a three-dimensional graphics system, which then computes and renders an output image based on this information. This approach, however, is both computationally and labor intensive.
An alternate approach, which may be termed image-based rendering, generates different, i.e. new, views of an environment based on a set of existing, pre-acquired images. This approach may be used to render 3D images from a collection of two dimensional, i.e. 2D, images. Indeed, reconstruction of 3D scenes from multiple 2D views is probably one of the most explored problems in computer vision. This typically requires that the processing device be able to match corresponding objects in two or more images of a common scene. Classical stereo matching algorithms then use the pinhole camera model to infer depth.
In the field of computer vision, this matching of objects (or object features or feature points) common to two or more images is often termed correspondence matching (or the correspondence problem). Correspondence matching tries to determine which parts of a first image correspond to (i.e. are matched to) which parts of a second image, assuming that the second image was taken after the camera that took the first image had moved, time had elapsed, and/or the pictured objects had moved. For example, the first image may be of a real-world scene taken from a first view angle with a first field-of-vision, FOV, and the second image may be of the same scene taken from a second view angle with a second FOV. Assuming that the first and second FOVs at least partially overlap, correspondence matching refers to the matching of common feature points in the overlapped portions of the first and second images.
Thus, correspondence matching is an essential problem in computer vision, especially in stereo vision, view synthesis, and 3D (or perspective) reconstruction. Assuming that a number of image features, or objects, in two images taken from two view angles have been matched, epipolar geometry may then be used to identify the positional relationship between the matched image features to achieve stereo view synthesis, or 3D reconstruction.
Epipolar geometry is basically the geometry of stereo vision. For example in FIG. 1, two cameras 11 and 13 create two 2D images 15 and 17, respectively, of a common 3D scene 10 consisting of a larger sphere 19 and a smaller sphere 21. 2D images 15 and 17 are taken from two distinct view angles 23 and 25. Epipolar geometry describes the geometric relations between points in 3D scene 10 (for example spheres 19 and 21) and their relative projections in 2D images 15 and 17. These geometric relationships lead to constraints between the image points, which are the basis for epipolar constraints, or stereo constraints, described more fully below.
FIG. 1 illustrates a horizontal parallax where, from the view point of camera 11, smaller sphere 21 appears to be in front of larger sphere 19 (as shown in 2D image 15), but from the view point of camera 13, smaller sphere 21 appears to be some distance to the side of larger sphere 19 (as shown in 2D image 17). Nonetheless, since both 2D images 15 and 17 are of the same 3D scene 10, both are truthful representations of the relative positions of larger sphere 19 and smaller sphere 21. The positional relationships between camera 11, camera 13, smaller sphere 21 and larger sphere 19 thus establish geometric constraints on 2D images 15 and 17 that permit one to reconstruct 3D scene 10 given only 2D images 15 and 17, as long as the epipolar, or stereo, constraints are known.
Epipolar geometry is based on the pinhole camera model, a simplified representation of which is shown in FIG. 2. In the pinhole camera model, cameras are represented by a point, such as left point OL and right point OR, at each respective camera's focal point. Point PO represents the point of interest (i.e. an object) in the 3D scene being imaged, which in the present example is represented by two crisscrossed lines.
Typically, the image plane (i.e. the plane on which a 2D representation of the imaged 3D scene is captured) is behind a camera's focal point and is inverted. For ease of explanation, and to avoid the complications of an inverted captured image, two virtual image planes, ImgL and ImgR, are shown in front of their respective focal points, OL and OR, to illustrate non-inverted representations of captured images. One may think of these virtual image planes as windows through which the 3D scene is being viewed. Point PL is the 2D projection of point PO onto left virtual image ImgL, and point PR is the 2D projection of point PO onto right virtual image ImgR. This conversion from 3D to 2D may be termed a perspective projection, or image projection, and is described by the pinhole camera model, as it is known in the art. It is common to model this projection operation by rays that emanate from a camera and pass through its focal point. Each modeled emanating ray would correspond to a single point in the captured image. In the present example, these emanating rays are indicated by dotted lines 27 and 29.
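By way of illustration only, the perspective projection described above may be sketched as follows. The sketch assumes a camera coordinate frame with the focal point at the origin and the virtual image plane at a distance focal_length along the z axis; the function name project_pinhole and the sample coordinates are hypothetical and are not taken from the figures.

```python
def project_pinhole(point, focal_length=1.0):
    """Project a 3D point (x, y, z), given in camera coordinates, onto the
    virtual image plane located at z = focal_length in front of the focal point.

    Assumption: the focal point is the origin and the camera looks along +z.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the focal point (z > 0)")
    scale = focal_length / z  # similar triangles: image size shrinks with depth
    return (scale * x, scale * y)


# Hypothetical point of interest PO, expressed in the left camera's frame.
p_o = (0.5, 0.25, 2.0)
p_l = project_pinhole(p_o)
print(p_l)  # (0.25, 0.125)
```

Every 3D point along the same ray through the focal point maps to the same 2D image point, which is why a single image does not by itself determine depth.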
Epipolar geometry also defines the constraints relating the positions of each camera relative to each other. This may be done by means of the relative positions of focal points OL and OR. The focal point of a first camera would project onto a distinct point on the image plane of a second camera, and vice versa. In the present example, focal point OR projects onto image point EL on virtual image plane ImgL, and focal point OL projects onto image point ER on virtual image plane ImgR. Image points EL and ER are termed epipoles, or epipole points. The epipoles and the focal points they project from lie on a single line, i.e. line 31.
Line 27, from focal point OL to point PO, is seen as a single point PL in virtual image plane ImgL, because point PO is directly in front of focal point OL. This is similar to how in image 15 of FIG. 1, smaller sphere 21 appears to be in front of larger sphere 19. However, from focal point OR, the same line 27 from OL to point PO is seen as a displacement line 33 from image point ER to point PR. This is similar to how in image 17 of FIG. 1, smaller sphere 21 appears to be displaced to the side of larger sphere 19. This displacement line 33 may be termed an epipolar line. Conversely, from focal point OR, line 29 is seen as a single point PR in virtual image plane ImgR, but from focal point OL line 29 is seen as a displacement line, or epipolar line, 35 on virtual image plane ImgL.
Epipolar geometry thus forms the basis for triangulation. For example, assuming that the relative translation and rotation of cameras OR and OL are known, if projection point PL on left virtual image plane ImgL is known, then the epipolar line 33 on the right virtual image plane ImgR is known by epipolar geometry. Furthermore, point PO must project onto the right virtual image plane ImgR at a point PR that lies on this specific epipolar line, 33. Essentially, for each point observed in one image plane, the same point must be observed in another image plane on a known epipolar line. This provides an epipolar constraint that corresponding image points on different image planes must satisfy.
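The epipolar constraint may be illustrated numerically with the following minimal sketch. It assumes normalized image coordinates (unit focal length), and assumes that a point X_L in the left camera frame maps to the right camera frame as X_R = R X_L + t for a known rotation R and translation t. Under those assumptions the epipolar line in the right image is given by the cross product of t with R x_L, which is equivalent to multiplying by the essential matrix commonly used in the art. All function names and numeric values are hypothetical.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def mat_vec(m, v):
    """Multiply a 3x3 matrix (tuple of rows) by a 3-vector."""
    return tuple(dot(row, v) for row in m)


def rotation_y(angle):
    """Rotation matrix about the y axis, angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, 0.0, s), (0.0, 1.0, 0.0), (-s, 0.0, c))


def normalize(p):
    """Perspective-divide a 3D point into homogeneous image coordinates."""
    return (p[0] / p[2], p[1] / p[2], 1.0)


def to_right_camera(point_left, rotation, translation):
    """Express a left-frame 3D point in the right camera frame (X_R = R X_L + t)."""
    rotated = mat_vec(rotation, point_left)
    return tuple(r + t for r, t in zip(rotated, translation))


def epipolar_line(x_left, rotation, translation):
    """Coefficients (a, b, c) of the line a*u + b*v + c = 0 in the right image
    on which the match for left image point x_left must lie."""
    return cross(translation, mat_vec(rotation, x_left))


# Hypothetical relative pose and scene point.
R = rotation_y(0.1)
t = (-1.0, 0.0, 0.0)
P_O = (0.3, -0.2, 4.0)

line = epipolar_line(normalize(P_O), R, t)
x_right = normalize(to_right_camera(P_O, R, t))
residual = dot(x_right, line)  # zero, up to rounding, for a true correspondence
print(abs(residual) < 1e-9)
```

In a correspondence search, the same residual can be used to restrict candidate matches in the right image to points near the epipolar line, instead of searching the whole image.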
Another epipolar constraint may be defined as follows. If projection points PL and PR are known, their corresponding projection lines 27 and 29 are also known. Furthermore, if projection points PL and PR correspond to the same 3D point PO, then their projection lines 27 and 29 must intersect precisely at 3D point PO. This means that the three dimensional position of 3D point PO can be calculated from the 2D coordinates of the two projection points PL and PR. This process is called triangulation.
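A minimal sketch of triangulation follows, under the assumption that each projection line is available as a focal point plus a direction vector in a common world frame. Because measured rays rarely intersect exactly, the sketch returns the midpoint of the shortest segment joining the two rays, which coincides with the intersection when the rays do meet. This midpoint method is one common choice and is used here only for illustration; the names and coordinates are hypothetical.

```python
def triangulate_midpoint(o1, d1, o2, d2):
    """Estimate the 3D point nearest to two rays.

    Ray 1 is o1 + s*d1 and ray 2 is o2 + t*d2. The function solves for the
    values of s and t that minimize the distance between the rays and returns
    the midpoint of the two closest points.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = tuple(p - q for p, q in zip(o1, o2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth cannot be triangulated")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = tuple(o + s * k for o, k in zip(o1, d1))
    p2 = tuple(o + t * k for o, k in zip(o2, d2))
    return tuple((u + v) / 2.0 for u, v in zip(p1, p2))


# Hypothetical focal points OL and OR and ray directions toward point PO.
O_L = (0.0, 0.0, 0.0)
O_R = (1.0, 0.0, 0.0)
dir_left = (0.5, 0.25, 2.0)    # direction of projection line 27
dir_right = (-0.5, 0.25, 2.0)  # direction of projection line 29

print(triangulate_midpoint(O_L, dir_left, O_R, dir_right))
```

For these inputs the two rays intersect, so the returned point is the common 3D point at approximately (0.5, 0.25, 2.0).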
Epipolar geometry also forms the basis for homography, i.e. projective transformation. Homography describes what happens to the perceived positions of observed objects when the point of view of the observer changes. An example of this is illustrated in FIG. 3, where the shape of a square 12 is shown distorted in two image projections 14 and 16 as viewed from two different points of view V1 and V2, respectively. Like before, image planes 14 and 16 may be thought of as windows through which the square 12 is viewed.
Homography would identify the points in common between image projections 14 and 16 and square 12 (i.e. point registration). For example, the four corners A, B, C and D of square 12 correspond to points A′, B′, C′ and D′ in image projection 14, and correspond to points A″, B″, C″ and D″ in image projection 16. Thus, points A′, B′, C′ and D′ in image projection 14 correspond respectively to points A″, B″, C″ and D″ in image projection 16.
Assuming that the pinhole model applies, epipolar geometry permits homography to relate any two images of the same planar surface in space, which permits image rectification, image registration, or computation of camera motion (rotation and translation) between two images. Once camera rotation and translation have been extracted from an estimated homography matrix, this information may be used for navigation, or to insert models of 3D objects into an image or video, so that they are rendered with the correct perspective and appear to have been part of the original scene.
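For illustration, the effect of a homography on the corners of a square may be sketched as below. The 3x3 matrix H is a hypothetical example chosen to include a perspective term; it is not derived from FIG. 3, and estimating such a matrix from point correspondences is outside the scope of this sketch.

```python
def apply_homography(h, point):
    """Map a 2D point through a 3x3 homography given as a tuple of rows.

    The point is lifted to homogeneous coordinates (x, y, 1), multiplied by h,
    and divided by the resulting third coordinate.
    """
    x, y = point
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return (xh / w, yh / w)


# Hypothetical homography with a perspective term in the bottom row.
H = ((1.0, 0.0, 0.0),
     (0.0, 1.0, 0.0),
     (0.5, 0.0, 1.0))

# Corners A, B, C and D of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
projected = [apply_homography(H, corner) for corner in square]
print(projected)
```

With this matrix the two corners at x = 1 are pulled toward the origin while the corners at x = 0 are unchanged, so the square is rendered as a trapezoid, similar in kind to the distortions shown in image projections 14 and 16.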
For example in FIG. 4, cameras 22 and 24 each take a picture of a 3D scene of a cube 26 from different points of view. From the view point of camera 22, cube 26 looks as shown in 2D image 28, and from the view point of camera 24, cube 26 looks as shown in 2D image 30. Homography permits one to identify correlating points, some of which are shown by dotted lines for illustration purposes. This permits both 2D images 28 and 30 to be stitched together to create a 3D image, as shown in image 32. Thus, automatically finding correspondence between pairs of images is the classic problem of stereo vision, but unfortunately the available algorithms to achieve this task may not always find the correct correspondences.
Another method of creating and manipulating 3D images (particularly in the area of computer vision) is the use of voxels. A voxel (i.e. volumetric pixel or volumetric picture element) is a volume element, representing a value on a regular grid in three dimensional space similar to how a pixel represents a value on a two dimensional space (i.e. a bitmap). Voxels are frequently used in the visualization and analysis of medical and scientific data, as well as representation of terrain in video games and computer simulations.
An example of a voxel representation of a 3D image is shown in FIG. 5A. Teapot TB is a voxel representation of teapot TA. A volume containing voxels can be visualized either by direct volume rendering or by the extraction of polygon iso-surfaces which follow the contours of given threshold values. Irrespective of how the voxels are defined, voxels generally contain volumetric information that facilitates the manipulation of 3D images. The resolution of a voxel representation is determined by the size of the voxel. For example, FIG. 5B shows a higher resolution voxel image TC of teapot TA. Some volumetric displays thus use voxels to describe their resolution. For example, a display might be able to show 512×512×512 voxels. The higher the voxel resolution, the more detailed the 3D representation.
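The relationship between voxel size and resolution can be sketched with a simple solid in place of the teapot. The example below, offered only as an assumption-laden illustration, marks a voxel as occupied when its center lies inside a sphere, and compares the volume recovered at a coarse and a fine grid. The function name voxelize_sphere and the chosen resolutions are hypothetical.

```python
import math


def voxelize_sphere(resolution, radius=1.0):
    """Voxelize a sphere centered in a cube of side 2*radius.

    The cube is divided into resolution**3 voxels. A voxel is considered
    occupied when its center lies inside the sphere. Returns the set of
    occupied (i, j, k) indices and the edge length of one voxel.
    """
    size = 2.0 * radius / resolution
    occupied = set()
    for i in range(resolution):
        x = -radius + (i + 0.5) * size
        for j in range(resolution):
            y = -radius + (j + 0.5) * size
            for k in range(resolution):
                z = -radius + (k + 0.5) * size
                if x * x + y * y + z * z <= radius * radius:
                    occupied.add((i, j, k))
    return occupied, size


true_volume = 4.0 / 3.0 * math.pi
for resolution in (8, 32):
    voxels, size = voxelize_sphere(resolution)
    estimate = len(voxels) * size ** 3
    print(resolution, len(voxels), abs(estimate - true_volume) / true_volume)
```

The finer grid uses many more voxels and follows the curved surface more closely, at the cost of memory that grows with the cube of the resolution.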
Another method of rendering perspective representations of 3D objects is through direct capture of the light field around an object. The light field is a function that describes the amount of light traveling in every direction through every point in space. With reference to FIG. 6, if the concept is restricted to geometric optics, i.e. to incoherent light and to objects larger than the wavelength of light, then the fundamental carrier of light is a light ray, or ray. The measure for the amount of light traveling along a ray is radiance, denoted by L and measured in watts (W) per steradian (sr) per meter squared (m²). The steradian is a measure of solid angle, and meters squared are a measure of cross-sectional area.
The radiance along all such rays in a region of three-dimensional space illuminated by an unchanging arrangement of lights is called the plenoptic function. The plenoptic illumination function is an idealized function used in computer vision and computer graphics to express the image of a scene from any possible viewing position at any viewing angle at any point in time. Since rays in space can be parameterized by three coordinates, x, y, and z and two angles θ and φ, as illustrated in FIG. 6B, it is a five-dimensional function, although higher-dimensional functions may be obtained if one considers time, wavelength, and polarization angle as additional variables.
The light field may also be treated as an infinite collection of vectors, one per direction impinging on a point, with lengths proportional to their radiances. Integrating these vectors over any collection of lights, or over the entire sphere of directions, produces a single scalar value, the total irradiance at that point, and a resultant direction. For example, FIG. 6C shows two light rays rA and rB emanating from two light sources IA and IB, and impinging on point P′. Light rays rA and rB produce vectors DA and DB, and these vectors combine to define vector D′, which specifies the total irradiance at point P′. The vector-valued function in a 3D space may be called the vector irradiance field, and the vector direction at each point in the field can be interpreted as the orientation in which one would face a flat surface placed at that point to most brightly illuminate it.
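The combination of vectors DA and DB into D′ may be sketched as a simple vector sum. The sketch assumes a finite set of rays, each described by a unit direction of travel and a scalar weight standing in for its radiance, which replaces the integral over directions with a discrete sum. The numeric values are hypothetical and are not read from FIG. 6C.

```python
import math


def vector_irradiance(rays):
    """Sum weighted direction vectors for rays impinging on a point.

    rays is a list of (direction, weight) pairs, where direction is a unit
    3-vector and weight stands in for the radiance carried by the ray.
    Returns the resultant vector, its magnitude, and its unit direction.
    """
    total = [0.0, 0.0, 0.0]
    for direction, weight in rays:
        for axis in range(3):
            total[axis] += weight * direction[axis]
    magnitude = math.sqrt(sum(component * component for component in total))
    if magnitude == 0.0:
        return tuple(total), 0.0, None  # contributions cancel; no direction
    unit = tuple(component / magnitude for component in total)
    return tuple(total), magnitude, unit


# Hypothetical rays rA and rB arriving at point P'.
r_a = ((1.0, 0.0, 0.0), 3.0)
r_b = ((0.0, 1.0, 0.0), 4.0)

d_prime, magnitude, facing = vector_irradiance([r_a, r_b])
print(d_prime, magnitude, facing)  # (3.0, 4.0, 0.0) 5.0 (0.6, 0.8, 0.0)
```

The returned unit vector points along the net direction of light travel; a flat surface placed at the point would be most brightly illuminated when facing against that direction.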
For practical application in the field of computer graphics, however, it is beneficial to reduce the number of dimensions used to describe a light field. If locations in a 3D scene are restricted to outside a convex hull of an object (i.e. the subject under study), as if the object were shrink-wrapped, the light function would then contain redundant information because the radiance along a ray remains constant from point to point along its length until it collides with the object. It has been found that the redundant information is one dimension, leaving a four-dimensional function. This function is sometimes termed the photic field, 4D light field or Lumigraph. Formally, the 4D light field is defined as radiance along rays in empty space. Using this reduced dimensional definition, the plenoptic function can be measured using a digital camera. A fuller explanation of this is provided in U.S. Pat. No. 6,097,394 to Levoy, herein incorporated in its entirety by reference.
Levoy explains that the set of rays in a light field may be parameterized using two-plane parameterization, as illustrated in FIG. 6D. This parameterization has the advantage of relating closely to the analytic geometry of perspective imaging, as explained above. Indeed, a simple way to think about a two-plane light field is as a collection of perspective images of the st plane (and any objects that may lie astride or beyond it), each taken from an observer position on the uv plane. A light field parameterized this way is sometimes called a light slab. An example of this is shown in FIG. 7A. In this case, a plurality of cameras C1 to Cn on the uv plane create a light slab LS by providing multiple views (i.e. perspective images) of the st plane.
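A minimal sketch of the two-plane parameterization is given below. It assumes, purely for illustration, that the uv plane is the plane z = 0 and the st plane is the plane z = 1, and that a ray is supplied as an origin and a direction. The function name ray_to_uvst and the plane positions are hypothetical.

```python
def ray_to_uvst(origin, direction, z_uv=0.0, z_st=1.0):
    """Convert a ray into two-plane coordinates (u, v, s, t).

    (u, v) is where the ray crosses the plane z = z_uv, and (s, t) is where it
    crosses the plane z = z_st. Rays parallel to the planes cannot be
    represented in this parameterization.
    """
    ox, oy, oz = origin
    dx, dy, dz = direction
    if dz == 0:
        raise ValueError("ray is parallel to the uv and st planes")
    k_uv = (z_uv - oz) / dz  # ray parameter at the uv plane
    k_st = (z_st - oz) / dz  # ray parameter at the st plane
    u, v = ox + k_uv * dx, oy + k_uv * dy
    s, t = ox + k_st * dx, oy + k_st * dy
    return (u, v, s, t)


# Two different starting points on the same ray give the same four numbers,
# which reflects why one of the five plenoptic dimensions is redundant in
# empty space.
first = ray_to_uvst((0.0, 0.0, -1.0), (0.5, 0.0, 1.0))
second = ray_to_uvst((0.25, 0.0, -0.5), (0.5, 0.0, 1.0))
print(first, second)  # (0.5, 0.0, 1.0, 0.0) (0.5, 0.0, 1.0, 0.0)
```

Fixing (u, v) and varying (s, t) enumerates the rays seen by one observer position on the uv plane, which corresponds to one perspective image within the light slab.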
In computer graphics, light fields are typically produced either by rendering a 3D model or by photographing a real scene. In either case, to produce a light field, multiple views must be obtained from a large collection of viewpoints. Depending on the parameterization employed, this collection will typically span some portion of a line, circle, plane, sphere, or other shape. For example in FIG. 7B, four light slabs LS1 to LS4 are used to capture a light field around a cylinder C at its center. Thus, capturing a light field photographically requires many images from various view angles and intricate setups. This often complicates the creation of light fields, especially for everyday use.
As discussed above, there are multiple approaches towards rendering 3D images in computer applications. But because of the versatility of light fields (such as the ability to change the view point and the focal point of a rendered 3D image) and their ability to be created by use of captured digital images, light fields are of particular interest. However, the use of light fields is complicated by their need for a plurality of digital images of a 3D subject taken from various view angles.
One method of reducing the number of imaging devices (or the number of times a single imaging device is repeatedly used) to generate multiple images from various view angles is the use of catadioptric cameras. Catadioptric cameras, or systems, can image a subject from a wider field of vision than pinhole cameras and thus reduce the need for multiple images from different FOVs. Catadioptric camera systems, however, do not fall under the pinhole camera model. Consequently, they are not subject to epipolar geometry, upon which the above described 3D rendering methods are based. This makes catadioptric camera systems ill-suited for the above-described methods of generating 3D images. One may attempt applying pinhole model methods directly, as described above, to catadioptric cameras, but the results will contain inherent errors, tend to exhibit distortions, and be suboptimal.
An object of the present invention is to provide a simple and economic method of capturing light fields.
Another object of the present invention is to reduce the number of cameras needed for rendering perspective images from 2D captured images.
Still another object of the present invention is to provide a method for utilizing catadioptric systems in the capturing of light fields and in the creation of 3D images.