The present invention pertains to the areas of image processing and machine processing of images. In particular, the present invention relates to a method for performing photo-realistic 3-D content creation from 2-D sources such as photographs or video.
It is often desirable to generate a three-dimensional ("3-D") model of a 3-D object or scene. A 3-D representation of an object may be utilized in a computer graphics context for presenting 3-D content to users to increase the effectiveness and reality of images. One method for generating 3-D information is achieved utilizing the techniques of projective geometry and perspective projection: a 2-D projective space is a perspectively projected image of a 3-D space.
Typically, image data is obtained utilizing a digital imaging system, which captures a plurality of 2-D digitally sampled representations of a scene or object (scene data sources). In the alternative, an analog source (such as an image obtained from a traditional camera) may be utilized and digitized/sampled using a device such as a scanner. The 2-D digital information representing the scene or object may then be processed using the techniques of computational projective geometry and texture mapping to generate the structure of a 3-D image.
FIG. 1a is a flowchart that depicts a general paradigm for generating a 3-D model of a scene or object from a plurality of 2-D scene data sources. The process is initiated in step 151. In step 154, a plurality of 2-D scene data sources are generated by generating respective 2-D images of the scene utilizing a variety of perspectives (i.e., camera positions). In step 155, using the techniques of computational projective geometry, the shape of the image is deduced by determining a 3-D feature set associated with the scene or object. For example, if the prominent features determined in step 153 are points, a point cloud {X} may be generated. A point cloud is 3-D information {X} for a set of pre-determined points on a desired image or object. In step 157, a texture mapping process is applied to the point cloud solution to generate a 3-D model of the scene or object. The process ends in step 159.
Generating a 3-D point cloud set from a plurality of 2-D sources depends upon two interrelated sub-problems: the camera calibration problem and the point-matching problem. Specifically, the camera calibration problem requires calculating the relative camera rotations R and translations T associated with the plurality of 2-D scene data sources. The point matching problem requires determining a correspondence of image points in at least two scene data sources.
One known technique for determining the 3-D shape of an object (the point cloud) relies upon the use of input from two or more photographic images of the object, taken from different points of view. This problem is known as the shape from motion ("SFM") problem, with the motion being either the camera motion, or equivalently, the motion of the object. In the case of two images, this problem is known also as the stereo-vision problem. The process of extracting a 3-D point cloud from stereo pairs is known as photogrammetry. Once the shape is determined, it is then possible to map the textures of the object from the photographs to the 3-D shape and hence create a photo-realistic virtual model of the object that can then be displayed in standard 3-D viewers such as a VRML ("Virtual Reality Modeling Language") browser.
In order to solve the SFM problem, the relative positions and orientations of the cameras must be known or calculated. This is known as solving the camera calibration problem. The camera calibration problem can be solved if at least 5-8 points can be matched on each of the images, corresponding to the same physical 3-D points on the actual object or scene (in practice, the number of points required for robust estimation is typically far greater than 8). Instead of points, line segments or complete object sub-shapes may be matched.
Once the cameras have been calibrated, or simultaneously, object shape can be calculated. Knowing the point correspondences together with the camera calibration provides a sufficient set of constraints to allow calculation of the 3-D positions of all corresponding points utilizing the techniques of projective geometry. If enough points are matched, then the object shape emerges as a point cloud. These points can be connected to define surfaces and hence determine the complete 3-D surface shape of objects.
Automatic point-matching algorithms have been developed to match image points from one 2-D image source to another (see [Lew et al] for a brief survey of the better known feature matching algorithms). These automatic point matching algorithms, however, have difficulties when the camera points of view differ significantly. In that case, there is significant distortion in the patterns required for point matching, due to perspective foreshortening, lighting variations, and other causes. For this reason, such approaches tend to work best when the camera points of view are very close to each other, but this limits their applicability.
Another approach, which has been exploited, is to assume that a set of prominent points on the object in the images can be determined, but that the point correspondence between images is not known. In this case, the 3-D constraints are used to solve not only the calibration and shape problem, but also the point-matching problem. If this can be done, then there are many practical cases where automation can be used to find the object shape. This approach is referred to herein as the method of uncalibrated point matching through perspective constraints ("UPMPC").
Relatively little work has been done to solve the UPMPC problem, partly because it has appeared to be more difficult than solving the point matching directly, and partly because it appears to require extremely time-consuming calculations, proportional to the square of the factorial of the number of points: (N!)(N!). For example, Dellaert et al. propose a UPMPC solution relying upon a statistical sampling method. Jain proposes a UPMPC solution utilizing a set of constraints, assumptions and a minimization methodology.
However, known UPMPC solutions are generally deficient in two respects. First, these UPMPC methods are typically very slow because of their (N!)(N!) computational complexity, which limits their application. Second, known UPMPC methods are typically not tolerant of noise and are therefore not robust enough for use in industry.
Computational Projective Geometry Mathematical Background and Definitions
The following mathematical background and definitions were taken from Kanatani, Geometric Computation for Machine Vision (Clarendon Press, 1993). FIG. 1b depicts an illustration of a camera model, which provides a conventional model for the 3-D interpretation of perspective projection. Lens 114 projects 3-D object 130 onto film 125 as image 120. Known constant f (focal length) is the distance between lens center 105 and film surface 125.
FIG. 1c depicts a perspective projection of a scene and a relationship between a space point and an image point. Points on image plane 135 are typically designated by a triplet (m1, m2, m3) of real numbers and are referred to as homogeneous coordinates. If m3 ≠ 0, point (m1, m2, m3) is identified with the point:
$$x = f\,\frac{m_1}{m_3}, \qquad y = f\,\frac{m_2}{m_3}$$
on image plane 135. A line is also defined by a triplet (n1, n2, n3) of real numbers, not all of them being 0. These three numbers are referred to as the homogeneous coordinates of the line.
By definition, homogeneous coordinates can be multiplied by an arbitrary nonzero number, and the point or line that they represent is still the same. Thus, they are represented as normalized vectors, or N-vectors, such that:
$$N[u] = \frac{u}{\lVert u \rVert}$$
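The two definitions above can be illustrated with a minimal Python sketch. The function names `n_vector` and `image_point` are hypothetical and chosen only for illustration; they do not come from the source.

```python
import math


def n_vector(u):
    """Return N[u] = u / ||u|| for a nonzero 3-vector u."""
    norm = math.sqrt(sum(c * c for c in u))
    if norm == 0.0:
        raise ValueError("the zero vector has no N-vector")
    return tuple(c / norm for c in u)


def image_point(m, f):
    """Map homogeneous coordinates (m1, m2, m3), m3 != 0, to image coordinates (x, y)."""
    m1, m2, m3 = m
    if m3 == 0.0:
        raise ValueError("m3 = 0 does not correspond to a finite image point")
    return (f * m1 / m3, f * m2 / m3)
```

Scaling the triplet by any nonzero number leaves `image_point` unchanged, and scaling by any positive number leaves `n_vector` unchanged, which is the scale invariance described in the text.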
Space point 150 having coordinates (X,Y,Z) in space is projected onto the intersection of image plane 135 (Z=f) with the ray starting from the viewpoint O and passing through space point 150 such that space point 150 is projected onto corresponding image point 160 on image plane 135. Image point 160, having image coordinates (x,y), is associated with space point 150 (X,Y,Z) by the perspective projection equation
$$x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}.$$
Thus, scene coordinates (e.g., coordinates (X,Y,Z) of space point 150) can be identified with homogeneous coordinates (X, Y, Z) on image plane 135. The x and y-coordinates are called image coordinates or inhomogeneous coordinates. Thus, space point P 150 is mapped to inhomogeneous image point 160 as follows:
$$P(X, Y, Z) \rightarrow f \begin{pmatrix} X/Z \\ Y/Z \end{pmatrix}$$
The N-vector m of a point P on image plane 120 refers to the unit vector starting from viewpoint O 115 and pointing toward or away from the image point (e.g., 160); that is, the N-vector indicates the orientation of the line connecting that point and viewpoint O. The N-vector of a space point (e.g., 150) is defined to be the N-vector of its projection on image plane 135. Images may be analyzed in terms of N-vectors by regarding the images as perspective projections of 3-D scenes utilizing the camera model shown in FIG. 1b.
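As a rough illustration of the perspective projection equation and of the N-vector of a space point, the following sketch projects a space point onto the image plane Z = f and normalizes a 3-vector. The helper names `project` and `n_vector` are hypothetical, and the sign of the N-vector is simply inherited from the input vector, which is an assumption since the text allows either sign.

```python
import math


def project(P, f):
    """Perspective projection of a space point (X, Y, Z) onto the plane Z = f."""
    X, Y, Z = P
    if Z == 0.0:
        raise ValueError("points with Z = 0 have no finite projection")
    return (f * X / Z, f * Y / Z)


def n_vector(v):
    """Unit vector along v, taken from the viewpoint O."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


# The image point (x, y) sits at (x, y, f) in camera coordinates, so its
# N-vector has the same direction as the space point it was projected from.
P = (2.0, 4.0, 8.0)
f = 2.0
x, y = project(P, f)
m_image = n_vector((x, y, f))
m_space = n_vector(P)
```

Here `m_image` and `m_space` agree up to rounding, which is why the N-vector of a space point can be defined through its projection.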
3-D content may be generated from a plurality of 2-D scene data sources (images) obtained using photographic methods. The 3-D data is generated utilizing computational geometry based upon perspective constraints associated with the 2-D scene data sources, having different camera positions and orientations.
Thus, for example, for each 2-D scene data source (image) from which 3-D information is to be deduced, the camera is associated with a translation and rotation parameter. In particular, the 3-D motion of a camera is specified by the motion parameters {R, T}, where R is a 3×3 rotation matrix satisfying $R^{T}R = I$ (that is, $R^{T} = R^{-1}$), representing the rotation between camera 1 (pertaining to the first image) and camera 2 (pertaining to the second image), and T is a 3-D translation vector representing the translation between camera 1 and camera 2. The X′Y′Z′ camera coordinate system after the motion is obtained from the XYZ camera coordinate system before the motion by (i) rotating the coordinate axes around the origin O by R and (ii) translating the axes by T, where R and T are defined with respect to the original XYZ coordinate system. If the camera is rotated around the center of the lens, a new image is observed. Specifically, for the group of camera rotation transformations, a space point P 150 associated with N-vector m is transformed into N-vector m′ after the camera rotation. If the camera rotation is specified by rotation matrix R, then
$$m' = \pm R^{T} m, \qquad n' = \pm R^{T} n$$
where m is the N-vector of a space point and n is the N-vector of a space line, because rotating the camera relative to the scene by R is equivalent to rotating the scene relative to the camera by $R^{-1}$ ($= R^{T}$).
Similarly, a translation of the camera causes an effective translation of the object relative to the camera, and the resulting image motion of points and lines reveals their 3-D geometries. If P is a space point 150 with associated N-vector m, then $\overrightarrow{OP} = r\,m$, where r is the depth of P. Thus, the 3-D position of a space point P 150 is completely specified by the pair {m, r}. 3-D reconstruction of an image point based upon 2-D data therefore means computation of the depth of the space point represented by {m, r}.
If the camera is translated by T, the representation {m, r} of a space point changes into {mxe2x80x2, rxe2x80x2} in the form:
$$m' = N[r\,m - T], \qquad r' = \lVert r\,m - T \rVert$$
For a camera motion {R, T}, if R is known, the camera rotation transformation can be applied to undo the effect of the rotation by rotating the second frame by $R^{-1}$ ($= R^{T}$). In particular, if m′ and n′ are the N-vectors of a space point and a space line in the second frame respectively, they are replaced by Rm′ and Rn′, respectively. Thus, after a camera motion {R, T}, the representation {m, r} of a space point changes into {m′, r′} in the form:
$$m' = R^{T} N[r\,m - T], \qquad r' = \lVert r\,m - T \rVert$$
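The transformation of {m, r} under a camera motion {R, T} can be sketched in a few lines of Python. This is only an illustrative reading of the formula above; the function name `apply_motion` and the representation of R as a tuple of rows are assumptions made for the example.

```python
import math


def apply_motion(m, r, R, T):
    """Return (m', r') with m' = R^T N[r m - T] and r' = ||r m - T||.

    m is a unit 3-vector, r its depth, R a 3x3 rotation given as rows,
    and T a 3-vector, all expressed in the first camera's coordinates.
    """
    v = [r * m[i] - T[i] for i in range(3)]           # r m - T
    r_new = math.sqrt(sum(c * c for c in v))          # r' = ||r m - T||
    if r_new == 0.0:
        raise ValueError("the space point coincides with the new viewpoint")
    n = [c / r_new for c in v]                        # N[r m - T]
    # Multiply by R^T: component i is the dot product of column i of R with n.
    m_new = tuple(sum(R[k][i] * n[k] for k in range(3)) for i in range(3))
    return m_new, r_new
```

For example, with a 90 degree rotation about the Z axis, `m = (0.6, 0.0, 0.8)`, `r = 5.0` and `T = (1.0, 0.0, 1.0)`, the space point is (3, 0, 4) and the new depth is the length of (2, 0, 3).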
FIG. 2a depicts a camera motion with respect to the camera coordinate system according to one embodiment of the present invention. As shown in FIG. 2a, space point P 150 is associated with N-vector m and depth r such that $\overrightarrow{OP} = r\,m$.
FIG. 2b depicts a camera motion with respect to the scene coordinate system according to one embodiment of the present invention. Camera motion {R, T} is applied such that, with respect to the camera coordinate system, $\overrightarrow{O'P} = r' R\,m'$ because the second camera coordinate system is rotated by R.
As shown in FIG. 2b, the following geometric relationship is obtained:
$$r\,m = T + r' R\,m', \quad \text{or} \quad r\,m - r' R\,m' = T$$
Thus, given two sets of unit vectors {m_α} and {m′_α}, α = 1, ..., N, the following relationship is obtained:
$$r_\alpha m_\alpha - r'_\alpha R\,m'_\alpha = T, \qquad \alpha = 1, \ldots, N.$$
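For each matched pair, the relationship above gives three scalar equations in the two unknown depths. One possible way to recover them, assumed here purely for illustration and not prescribed by the source, is a least-squares solve of the 3×2 system. The name `recover_depths` is hypothetical, and both N-vectors are assumed to have unit length.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mat_vec(R, v):
    """Product of a 3x3 matrix (given as rows) with a 3-vector."""
    return tuple(dot(row, v) for row in R)


def recover_depths(m, m_prime, R, T, eps=1e-12):
    """Least-squares solution (r, r') of r*m - r'*(R m') = T for unit m, m'."""
    b = mat_vec(R, m_prime)          # R m', the second ray in first-camera axes
    c = dot(m, b)                    # cosine of the angle between the two rays
    det = 1.0 - c * c                # determinant of the 2x2 normal equations
    if det < eps:
        raise ValueError("rays are parallel; the depths are not determined")
    r = (dot(m, T) - c * dot(b, T)) / det
    r_prime = (c * dot(m, T) - dot(b, T)) / det
    return r, r_prime
```

Given the recovered depth, the 3-D position of the point in the first camera's coordinates is simply `r` times `m`.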
FIG. 3 depicts a relationship between epipolar lines and corresponding points within a 3-D scene. In particular, as shown in FIG. 3, m is the N-vector of a 2-D point on a first image, and m′ is the N-vector of a corresponding 2-D point on a second image. u is the N-vector of the correct epipole on the first image, and u′ = Ru is the epipole on the second image. Note that image line 335a, which is an epipolar, connects the epipole associated with u with the image point associated with N-vector m. The N-vector associated with epipolar 335a may be expressed as the N-vector of the cross-product m×u. FIG. 3 also shows a second image point associated with N-vector m′. The image line defined by the image point associated with N-vector m′ and the image point associated with u′ defines epipolar 335b. Epipolar 335b may be expressed as R(m×u).
FIG. 3 further shows N-vector m″ = Rm′, which is associated with the image point generated by rotating the image point associated with N-vector m′ back to the first image. If the point correspondence m, m′ is correct, the image point associated with m″ = Rm′ should lie on the epipolar line, m×u, formed by the epipole u and the image point m.
FIG. 4 depicts three correctly matching sets of points (m1, m1″), (m2, m2″), (m3, m3″) (335a, 335b and 335c). Note that epipolar lines 335a-335c intersect at common epipole u.
FIG. 5 depicts three incorrectly matching sets of points (m1, m1″), (m2, m2″) and (m3, m3″) (335a, 335b and 335c). Note that a common intersection is highly unlikely in the case of incorrect point correspondences or choice of incorrect R matrix. This is a key observation, since the common crossing can therefore be used to select out lines corresponding to correct pairs and correct R.
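The common-crossing observation can be checked numerically with cross products of N-vectors. The sketch below is illustrative only: the helper names are hypothetical, the second-image points are assumed to have already been rotated back to the first image (they play the role of m″), and the epipole is taken to be the N-vector of the translation T.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(c / n for c in v)


def epipolar_line(m, m2):
    """N-vector of the image line through the points with N-vectors m and m''."""
    return normalize(cross(m, m2))


def crossing_point(n1, n2):
    """N-vector (up to sign) of the intersection of two image lines."""
    return normalize(cross(n1, n2))


def epipolar_residual(m, m2, u):
    """|(m x u) . m''|: zero when m'' lies on the epipolar line through m and u."""
    return abs(dot(normalize(cross(m, u)), m2))


# Synthetic scene (assumed values): three space points seen before and after
# a pure translation T, so that m'' = N[X - T] needs no further rotation.
T = (3.0, 0.0, 0.0)
X = [(3.0, 0.0, 4.0), (0.0, 3.0, 4.0), (1.0, 1.0, 5.0)]
m = [normalize(x) for x in X]
m2 = [normalize(tuple(a - b for a, b in zip(x, T))) for x in X]
lines = [epipolar_line(a, b) for a, b in zip(m, m2)]
crossings = [crossing_point(lines[i], lines[j]) for i, j in [(0, 1), (0, 2), (1, 2)]]
```

With correct matches, every entry of `crossings` points along T (up to sign), whereas pairing a first-image point with the wrong second-image point gives a clearly nonzero `epipolar_residual`.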
The present invention provides a UPMPC method and system for generation of 3-D content from 2-D sources. In particular, the present invention provides a UPMPC method for determining a point cloud from one or more 2-D sources. The method of the present invention utilizes a search process for locating a point of maximal crossing of line segments, wherein a line segment is generated for each possible point match between an image point in a first 2-D scene data source and an image point in a second 2-D scene data source. The search is performed over all possible camera rotation angles. The process defined by the present invention may be applied to each possible pair of 2-D scene data sources to generate a highly robust solution. Use of the method provided by the present invention allows significantly reduced computational complexity for generating a 3-D point cloud (order N^2 vs. (N!)(N!) for conventional approaches), which results in a significant speed improvement in the 3-D content generation process. The process defined by the present invention is also highly noise tolerant. The point cloud solution generated by the methods of the present invention may then be utilized as input to perform texture mapping or further 3-D processing.
A plurality of 2-D scene data sources of the image are obtained either from traditional photographic imaging techniques or from digital imaging techniques. Each of the 2-D scene data sources is associated with a translation parameter and a rotation parameter respectively describing a camera translation and rotation. Each 2-D scene data source includes a plurality of image points, which are associated with the previously identified space points in the 3-D scene. The correspondence of image points between any two 2-D scene data sources is unknown. Furthermore, an epipole is implicitly defined between any two of the 2-D scene data sources, wherein the epipole is a function of the translation parameter relating the 2-D scene image sources.
The present invention utilizes a search algorithm to find a point of maximal crossing of a plurality of the line segments across all possible camera rotation angles. The present invention relies upon the observation that corresponding points between two 2-D scene data sources will lie on the same epipolar line (when the points are considered with respect to a common rotation angle) if the rotation angle and point correspondence are both correct (i.e., there is a zero rotation angle of the points with respect to one another; the points are rotationally neutral with respect to one another). The point of maximal crossing of the line segments defines the epipole.
In order to significantly reduce computational complexity, a motion constraint is defined, wherein the motion constraint defines a maximum value of a ratio (r/r′) and a minimum value of the ratio (r/r′), wherein the ratio (r/r′) represents a ratio between a first distance (r) to an image point and a second distance (r′) to the image point after a camera translation (T). Effectively the motion constraint defines a continuous range of values of possible camera translations, the range of values defining the length of each line segment. If the chosen camera angle and point correspondence are correct, the true epipole will lie somewhere on the line segment (that is, the line segment will span a portion of an epipolar line).
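One way to see how the motion constraint yields a line segment is to divide the relation r m − r′ R m′ = T by r′, which gives T/r′ = (r/r′) m − R m′. Under the assumption that the epipole is the N-vector of T, a candidate match and a ratio ρ = r/r′ therefore imply the epipole N[ρ m − m″], where m″ = R m′. The sketch below is this reading expressed in Python, not a construction stated verbatim in the source; the function names are hypothetical.

```python
import math


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("rho * m equals m''; no epipole direction is implied")
    return tuple(c / n for c in v)


def candidate_epipole(m, m2, rho):
    """N[rho * m - m'']: epipole implied by the depth ratio rho = r / r'."""
    return normalize(tuple(rho * a - b for a, b in zip(m, m2)))


def segment_endpoints(m, m2, rho_min, rho_max):
    """End points of the segment swept out as rho runs over [rho_min, rho_max]."""
    return (candidate_epipole(m, m2, rho_min),
            candidate_epipole(m, m2, rho_max))
```

For `m = (0.6, 0.0, 0.8)` and `m2 = (0.0, 0.0, 1.0)`, the ratio 1.25 implies an epipole along the X axis, and the bounds 0.5 and 2.0 give end points on either side of it within the same epipolar plane.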
By searching over all possible point correspondences and rotation angles, a point of maximal crossing is determined. The rotation parameter generating the maximal crossing point defines the desired rotation parameter. The point of maximal crossing defines the epipole and directly from this, the translation parameter is determined. Finally, point correspondence information is determined by determining the point correspondences, which produced the line segments meeting at the point of maximal crossing. A 3-D point cloud {X} is determined as a function of the best rotation parameter, the best translation parameter and the point correspondence information using projective geometry relations.
According to one embodiment, the present invention provides a method for generating 3-D content from a plurality of 2-D sources, herein referred to as the "T-Crossing method," a UPMPC method that generates a point cloud solution. The T-Crossing method provides significantly reduced computational complexity by utilizing the motion constraint described above.
In general, the T-Crossing method operates by performing the following: (i) generating a line segment for each possible point correspondence between an image point associated with a first 2-D scene data source and an image point associated with a second 2-D scene data source as a function of an associated motion constraint; (ii) for each of a plurality of possible rotation parameters, determining a maximal crossing point associated with the line segments generated in (i); (iii) determining a best rotation parameter, wherein the best rotation parameter is associated with a best maximal crossing point; (iv) determining a best translation parameter, wherein the best translation parameter is associated with an N-vector of the best maximal crossing point determined in (iii); (v) determining point correspondence information, wherein the point correspondence information is obtained by finding all point correspondences that produced line segments meeting at the maximal crossing point.
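Steps (i) through (v) can be sketched as an exhaustive search. The code below is a simplified illustration under stated assumptions, not the patented implementation: candidate rotations and candidate epipoles are supplied as finite lists (the source does not specify how the search space is discretized), a segment is taken to be the set N[ρ m − R m′] for ρ between the two ratio bounds, and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mat_vec(R, v):
    return tuple(dot(row, v) for row in R)


def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(c / n for c in v)


def passes_through(m, m2, u, rho_min, rho_max, tol=1e-9):
    """True if u = N[rho*m - m2] for some rho in [rho_min, rho_max].

    Fits m2 = rho*m + s*u by least squares; the segment passes through u
    when the fit is exact, rho is within bounds, and s is negative.
    """
    c = dot(m, u)
    det = 1.0 - c * c
    if det < tol:                      # m parallel to u: degenerate, skip
        return False
    p, q = dot(m, m2), dot(u, m2)
    rho = (p - c * q) / det
    s = (q - c * p) / det
    residual = math.sqrt(sum((rho * a + s * b - w) ** 2
                             for a, b, w in zip(m, u, m2)))
    return residual < tol and rho_min <= rho <= rho_max and s < 0.0


def crossings(u, R, pts1, pts2, rho_min, rho_max):
    """Index pairs (i, j) whose segments pass through u under rotation R."""
    rotated = [mat_vec(R, mp) for mp in pts2]          # m'' = R m'
    return [(i, j)
            for i, m in enumerate(pts1)
            for j, m2 in enumerate(rotated)
            if passes_through(m, m2, u, rho_min, rho_max)]


def t_crossing_search(rotations, epipoles, pts1, pts2, rho_min, rho_max):
    """Return (R, u, matches) for the candidate pair with the most crossings."""
    best = None
    for R in rotations:
        for u in epipoles:
            matches = crossings(u, R, pts1, pts2, rho_min, rho_max)
            if best is None or len(matches) > len(best[2]):
                best = (R, u, matches)
    return best
```

The returned rotation plays the role of the best rotation parameter, the returned N-vector gives the direction of the translation, and the returned index pairs are the point correspondences whose segments meet at that crossing. Each candidate (R, u) costs one pass over all point pairs, which is where the order N^2 behaviour comes from in this sketch.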
According to one embodiment, the T-Crossing algorithm receives as input a plurality of scene data sources (set of image points on two images). Based upon this input, the T-Crossing method determines a best rotation matrix R, translation vector T and index matrix I′ for a first 2-D scene data source and a second 2-D scene data source, wherein the index matrix I′ represents point correspondences between the first and second scene data sources. Use of the T-Crossing method obviates the need for determining in advance which points match, and there is no need for all points on the first image to have matches on the second image.
According to one alternative embodiment, the T-Crossing method is applied multiple times, respectively to multiple image pairs to generate multiple point cloud solutions {X}1-{X}N. The multiple point cloud solutions {X}1-{X}N are combined to determine a best solution {X}best. To compare the multiple point cloud solutions {X}1-{X}N, the respective calibrations R and T are utilized to transform the {X} from each pair to a common coordinate system where the comparison may be performed. {X}best is then calculated by averaging, determining the median or using some other method.
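The combination step can be illustrated with a per-coordinate median. This minimal sketch assumes that every point cloud has already been transformed to the common coordinate system and that point i denotes the same space point in each cloud; the function name `combine_point_clouds` is hypothetical, and the median is only one of the options the text mentions.

```python
from statistics import median


def combine_point_clouds(clouds):
    """Per-coordinate median of corresponding points across several clouds."""
    n_points = len(clouds[0])
    if any(len(cloud) != n_points for cloud in clouds):
        raise ValueError("all clouds must contain the same number of points")
    return [tuple(median(cloud[i][k] for cloud in clouds) for k in range(3))
            for i in range(n_points)]
```

The median is less sensitive than the mean to a single badly reconstructed pair, which fits the stated aim of a robust combined solution.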
In some situations a degeneracy arises that cannot be resolved utilizing the T-Crossing method alone. This degeneracy is indicated where I′ is a block diagonal matrix. According to one alternative embodiment, a method herein referred to as the Epipolar Clustering Method ("EPM") is applied to reduce I′ from a block diagonal matrix to a diagonal matrix.
The T-Crossing method also works well when the point matches are known. In this case, the problem is the more traditional one, and the difficulty being solved is the noise coming from poor automatic or hand matching. The T-Crossing method deals with this noise in a different way and is quite noise tolerant. Also, the algorithm can be used for any of the intermediate problems, where some, but not all, point matches are known, or where there are simply some constraints on the possible point matches.
A significant advantage of the T-Crossing algorithm is its ability to easily exploit any known constraints on possible point matches. For example, an unreliable feature-matching algorithm might be good enough to limit possible point matches to small sub-sets. Or a known restriction on camera movement might imply that a point in one image must have its match within some restricted region on the second image.
Moreover, the T-Crossing algorithm can also exploit constraints on the 3-D points themselves. Such constraints may come from the use of other approaches that provide crude initial results. They may also come from having applied the T-Crossing algorithm to other image pairs. That is, an imperfect solution from one pair can be bootstrapped to improve the solutions for successive pairs.