When an item moves without any constraints (freely) in a three-dimensional environment with respect to stationary objects, knowledge of the item's distance and inclination to one or more of such stationary objects can be used to derive a variety of the item's parameters of motion, as well as its complete pose. The latter includes the item's three position parameters, usually expressed by three coordinates (x, y, z), and its three orientation parameters, usually expressed by three angles (α, β, γ) in any suitably chosen rotation convention (e.g., Euler angles (ψ, θ, φ) or quaternions). Particularly useful stationary objects for pose recovery purposes include ground planes, fixed points, lines, reference surfaces and other known features.
Many mobile electronics items are now equipped with advanced optical apparatus such as on-board cameras with photo-sensors, including high-resolution CMOS arrays. These devices typically also possess significant on-board processing resources (e.g., CPUs and GPUs) as well as network connectivity (e.g., connection to the Internet, Cloud services and/or a link to a Local Area Network (LAN)). These resources enable many techniques from the fields of robotics and computer vision to be practiced with the optical apparatus on-board such virtually ubiquitous devices. Most importantly, vision algorithms for recovering the camera's extrinsic parameters, namely its position and orientation, also frequently referred to as its pose, can now be applied in many practical situations.
An on-board camera's extrinsic parameters in the three dimensional environment are typically recovered by viewing a sufficient number of non-collinear optical features belonging to the known stationary object or objects. In other words, the on-board camera first records on its photo-sensor (which may be a pixelated device or even a position sensing device (PSD) having one or just a few “pixels”) the images of space points, space lines and space planes belonging to one or more of these known stationary objects. A computer vision algorithm to recover the camera's extrinsic parameters is then applied to the imaged features of the actual stationary object(s). The imaged features usually include points, lines and planes of the actual stationary object(s) that yield a good optical signal. In other words, the features are chosen such that their images exhibit a high degree of contrast and are easy to isolate in the image taken by the photo-sensor. Of course, the imaged features are recorded in a two-dimensional (2D) projective plane associated with the camera's photo-sensor, while the real or space features of the one or more stationary objects are found in the three-dimensional (3D) environment.
Certain 3D information is necessarily lost when projecting an image of actual 3D stationary objects onto the 2D image plane. The mapping between the 3D Euclidean space of the three-dimensional environment and the 2D projective plane of the camera is not one-to-one. Many properties of Euclidean geometry are lost during such mapping (sometimes also referred to as projectivity). Notably, lengths, angles and parallelism are not preserved. Euclidean geometry is therefore insufficient to describe the imaging process. Instead, projective geometry, and specifically perspective projection, is deployed to recover the camera's pose from images collected by the photo-sensor residing in the camera's 2D image plane.
Fortunately, projective transformations do preserve certain properties. These properties include type (that is, points remain points and lines remain lines), incidence (that is, when a point lies on a line it remains on the line), as well as an invariant measure known as the cross ratio. For a review of projective geometry the reader is referred to H. S. M. Coxeter, Projective Geometry, Toronto: University of Toronto, 2nd Edition, 1974; O. Faugeras, Three-Dimensional Computer Vision, Cambridge, Mass.: MIT Press, 1993; L. Guibas, “Lecture Notes for CS348a: Computer Graphics—Mathematical Foundations”, Stanford University, Autumn 1996; Q.-T. Luong and O. D. Faugeras, “Fundamental Matrix: Theory, algorithms and stability analysis”, International Journal of Computer Vision, 17(1): 43-75, 1996; J. L. Mundy and A. Zisserman, Geometric Invariance in Computer Vision, Cambridge, Mass.: MIT Press, 1992 as well as Z. Zhang and G. Xu, Epipolar Geometry in Stereo, Motion and Object Recognition: A Unified Approach. Kluwer Academic Publishers, 1996.
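By way of illustration only, the invariance of the cross ratio can be checked numerically. The following minimal Python sketch is not taken from any of the cited references. It applies an arbitrarily chosen one-dimensional projective map (the matrix M and the four sample points are assumptions made for this example) to four collinear points and compares the cross ratio before and after the map. Exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction


def cross_ratio(a, b, c, d):
    """Cross ratio (a, b; c, d) of four collinear points given by 1D coordinates."""
    return ((c - a) * (d - b)) / ((c - b) * (d - a))


def projective_map(x, m):
    """Apply the 1D projective transformation x -> (m00*x + m01) / (m10*x + m11)."""
    (m00, m01), (m10, m11) = m
    return (m00 * x + m01) / (m10 * x + m11)


# Four collinear points, parameterized by their position along the line
# (hypothetical values chosen for this example).
points = [Fraction(0), Fraction(1), Fraction(3), Fraction(7)]

# An arbitrary non-singular projective map (determinant 2*5 - 1*1 = 9).
M = ((Fraction(2), Fraction(1)), (Fraction(1), Fraction(5)))

mapped = [projective_map(x, M) for x in points]

before = cross_ratio(*points)
after = cross_ratio(*mapped)

# Lengths are not preserved by the map, but the cross ratio is.
dist_before = points[1] - points[0]
dist_after = mapped[1] - mapped[0]

print("cross ratio before:", before, "after:", after)
print("distance before:", dist_before, "after:", dist_after)
```

In this sketch the distance between the first two points changes under the map while the cross ratio of the four points does not, which is the sense in which the cross ratio is an invariant measure.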
At first, many practitioners deployed concepts from perspective geometry directly to pose recovery. In other words, they would compute vanishing points, horizon lines and cross ratios, and apply Desargues' theorem directly. Although mathematically simple on their face, in many practical situations such approaches end up in tedious trigonometric computations. Furthermore, experience teaches that such computations are not sufficiently compact and robust in practice. This is due to many real-life factors including, among others, limited computation resources, restricted bandwidth and various sources of noise.
Modern computer vision has thus turned to more computationally efficient and robust approaches to camera pose recovery. An excellent overall review of this subject is found in Kenichi Kanatani, Geometric Computation for Machine Vision, Clarendon Press, Oxford University Press, New York, 1993. A number of important foundational aspects of computational geometry relevant to pose recovery via machine vision are reviewed below to the benefit of those skilled in the art and in order to better contextualize the present invention.
To this end, we will now review several relevant concepts in reference to FIGS. 1-3. FIG. 1 shows a stable three-dimensional environment 10 that is embodied by a room with a wall 12 in this example. A stationary object 14, in this case a television, is mounted on wall 12. Television 14 has certain non-collinear optical features 16A, 16B, 16C and 16D that in this example are the corners of its screen 18. Corners 16A, 16B, 16C and 16D are used by a camera 20 for recovery of extrinsic parameters (up to complete pose recovery when given a sufficient number and type of non-collinear features). Note that the edges of screen 18 or even the entire screen 18 and/or anything displayed on it (i.e., its pixels) are suitable non-collinear optical features for these purposes. Of course, other stationary objects in room 10 besides television 14 can be used as well.
Camera 20 has an imaging lens 22 and a photo-sensor 24 with a number of photosensitive pixels 26 arranged in an array. A common choice for photo-sensor 24 in today's consumer electronics devices is a CMOS array, although other technologies can also be used depending on application (e.g., CCD, PIN photodiode, position sensing device (PSD) or still other photo-sensing technology). Imaging lens 22 has a viewpoint O and a certain focal length f. Viewpoint O lies on an optical axis OA. Photo-sensor 24 is situated in an image plane at focal length f behind viewpoint O along optical axis OA.
Camera 20 typically works with electromagnetic (EM) radiation 30 that is in the optical or infrared (IR) wavelength range (note that deeper sensor wells are required in cameras working with IR and far-IR wavelengths). Radiation 30 emanates or is reflected (e.g., reflected ambient EM radiation) from non-collinear optical features such as screen corners 16A, 16B, 16C and 16D. Lens 22 images EM radiation 30 on photo-sensor 24. Imaged points or corner images 16A′, 16B′, 16C′, 16D′ thus imaged on photo-sensor 24 by lens 22 are usually inverted when using a simple refractive lens. Meanwhile, certain more compound lens designs, including designs with refractive and reflective elements (catadioptrics) can yield non-inverted images.
A projective plane 28 conventionally used in computational geometry is located at focal length f away from viewpoint O along optical axis OA but in front of viewpoint O rather than behind it. Note that a virtual image of corners 16A, 16B, 16C and 16D is also present in projective plane 28 through which the rays of electromagnetic radiation 30 pass. Because any rays in projective plane 28 have not yet passed through lens 22, the points representing corners 16A, 16B, 16C and 16D are not inverted. The methods of modern machine vision are normally applied to points in projective plane 28, while taking into account the properties of lens 22.
An ideal lens is a pinhole and the most basic approaches of machine vision make that assumption. Practical lens 22, however, introduces distortions and aberrations (including barrel distortion, pincushion distortion, spherical aberration, coma, astigmatism, chromatic aberration, etc.). Such distortions and aberrations, as well as methods for their correction or removal, are understood by those skilled in the art.
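The pinhole assumption can be sketched in a few lines of Python. The example below is a minimal illustration, not a model of practical lens 22: it ignores all distortions and aberrations and projects points given in camera coordinates onto a plane located at focal length f in front of the viewpoint. The function name, the screen dimensions, the distance and the focal length are hypothetical values chosen for this example.

```python
def project_pinhole(point_c, f):
    """Project a 3D point (Xc, Yc, Zc), given in camera coordinates, onto the
    projective plane located at Zc = f in front of the viewpoint."""
    X, Y, Z = point_c
    if Z <= 0:
        raise ValueError("point must lie in front of the viewpoint (Zc > 0)")
    return (f * X / Z, f * Y / Z)


# Hypothetical numbers: a 1.0 m x 0.6 m screen centered on the optical axis,
# 2.0 m in front of the viewpoint, imaged with focal length f = 0.004 m.
f = 0.004
corners = [(-0.5, 0.3, 2.0), (0.5, 0.3, 2.0), (0.5, -0.3, 2.0), (-0.5, -0.3, 2.0)]

images = [project_pinhole(p, f) for p in corners]

for corner, image in zip(corners, images):
    print(corner, "->", image)
```

With the screen perpendicular to the optical axis, every corner is scaled by the same factor f/Zc, so the projected figure remains a rectangle that is merely de-magnified.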
In the simple case shown in FIG. 1, image inversion between projective plane 28 and image plane on the surface of photo-sensor 24 is rectified by a corresponding matrix (e.g., a reflection and/or rotation matrix). Furthermore, any offset between a center CC of camera 20 where optical axis OA passes through the image plane on the surface of photo-sensor 24 and the origin of the 2D array of pixels 26, which is usually parameterized by orthogonal sensor axes (Xs, Ys), involves a shift.
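A minimal sketch of such a correction is given below. It rests on several assumptions that are not stated in the text: the real image is taken to be a point reflection of the virtual image (a rotation by 180 degrees about the optical axis), the pixels are taken to be square with a known pitch, and the sensor axes are taken to be parallel to the camera axes. The function names and numerical values are hypothetical.

```python
def plane_to_sensor(pt, pixel_pitch, center_px):
    """Map a point (x, y) in the projective plane to sensor pixel coordinates.

    Assumptions of this sketch: the real image is a point reflection of the
    virtual image, pixels are square with side pixel_pitch, and the sensor
    axes are parallel to the camera axes.
    """
    x, y = pt
    xs = -x / pixel_pitch + center_px[0]  # inversion, then shift by the offset
    ys = -y / pixel_pitch + center_px[1]
    return (xs, ys)


def sensor_to_plane(pt_px, pixel_pitch, center_px):
    """Inverse mapping: undo the offset first, then undo the inversion."""
    xs, ys = pt_px
    return (-(xs - center_px[0]) * pixel_pitch,
            -(ys - center_px[1]) * pixel_pitch)


# Hypothetical values: 5 micrometer pixels, camera center at pixel (320, 240).
pixel_pitch = 5e-6
center_px = (320.0, 240.0)

corner_plane = (-0.001, 0.0006)  # a point in the projective plane, in meters
corner_px = plane_to_sensor(corner_plane, pixel_pitch, center_px)
round_trip = sensor_to_plane(corner_px, pixel_pitch, center_px)

print("sensor coordinates:", corner_px)
print("back in the projective plane:", round_trip)
```

In a calibrated system the offset and the scale would come from the calibration procedure rather than from assumed constants.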
Persons skilled in the art are familiar with camera calibration techniques. These include finding offsets, computing the effective focal length feff (or the related parameter k) and ascertaining distortion parameters (usually denoted by α's). Collectively, these parameters are called intrinsic and they can be calibrated in accordance with any suitable method. For teachings on camera calibration the reader is referred to the textbook entitled “Multiple View Geometry in Computer Vision” (Second Edition) by R. Hartley and Andrew Zisserman. Another useful reference is provided by Robert Haralick, “Using Perspective Transformations in Scene Analysis”, Computer Graphics and Image Processing 13, pp. 191-221 (1980). For still further information the reader is referred to Carlo Tomasi and John Zhang, “How to Rotate a Camera”, Computer Science Department Publication, Stanford University and Berthold K. P. Horn, “Tsai's Camera Calibration Method Revisited”, which are herein incorporated by reference.
Additionally, image processing is required to discover corner images 16A′, 16B′, 16C′, 16D′ on sensor 24 of camera 20. Briefly, image processing includes image filtering, smoothing, segmentation and feature extraction (e.g., edge/line or corner detection). Corresponding steps are usually performed by segmentation and the application of mask filters such as Gaussian/Laplacian/Laplacian-of-Gaussian (LoG)/Marr and/or other convolutions with suitable kernels to achieve desired effects (averaging, sharpening, blurring, etc.). The most common feature extraction tools in image processing libraries include Canny edge detectors as well as Hough/Radon transforms and many others. Once again, all the relevant techniques are well known to those skilled in the art. A good review of image processing is afforded by “Digital Image Processing”, Rafael C. Gonzalez and Richard E. Woods, Prentice Hall, 3rd Edition, Aug. 31, 2007; “Computer Vision: Algorithms and Applications”, Richard Szeliski, Springer, Edition 2011, Nov. 24, 2010; Tinne Tuytelaars and Krystian Mikolajczyk, “Local Invariant Feature Detectors: A Survey”, Journal of Foundations and Trends in Computer Graphics and Vision, Vol. 3, Issue 3, January 2008, pp. 177-280. Furthermore, a person skilled in the art will find all the required modules in standard image processing libraries such as OpenCV (Open Source Computer Vision), a library of programming functions for real time computer vision. For more information on OpenCV the reader is referred to G. R. Bradski and A. Kaehler, “Learning OpenCV: Computer Vision with the OpenCV Library”, O'Reilly, 2008.
In FIG. 1 camera 20 is shown in a canonical pose. World coordinate axes (Xw,Yw,Zw) define the stable 3D environment with the aid of stationary object 14 (the television) and more precisely its screen 18. World coordinates are right-handed with their origin in the middle of screen 18 and Zw-axis pointing away from camera 20. Meanwhile, projective plane 28 is parameterized by camera coordinates with axes (Xc,Yc,Zc). Camera coordinates are also right-handed with their origin at viewpoint O. In the canonical pose Zc-axis extends along optical axis OA away from the image plane found on the surface of image sensor 24. Note that camera Zc-axis intersects projective plane 28 at a distance equal to focal length f away from viewpoint O at point o′, which is the center (origin) of projective plane 28. In the canonical pose, the axes of camera coordinates and world coordinates are thus aligned. Hence, optical axis OA that always extends along the camera Zc-axis is also along the world Zw-axis and intersects screen 18 of television 14 at its center (which is also the origin of world coordinates). In the application shown in FIG. 1, a marker or pointer 32 is positioned at the intersection of optical axis OA of camera 20 and screen 18.
In the canonical pose, the rectangle defined by space points representing screen corners 16A, 16B, 16C and 16D maps to an inverted rectangle of corner images 16A′, 16B′, 16C′, 16D′ in the image plane on the surface of image sensor 24. Also, space points defined by screen corners 16A, 16B, 16C and 16D map to a non-inverted rectangle in projective plane 28. Therefore, in the canonical pose, the only apparent transformation performed by lens 22 of camera 20 is a scaling (de-magnification) of the image with respect to the actual object. Of course, mostly correctable distortions and aberrations are also present in the case of practical lens 22, as remarked above.
Recovery of poses (positions and orientations) assumed by camera 20 in environment 10 from a sequence of corresponding projections of space points representing screen corners 16A, 16B, 16C and 16D is possible because the absolute geometry of television 14, and in particular of its screen 18 and possibly other 3D structures providing optical features in environment 10, is known and can be used as reference. In other words, after calibrating lens 22 and observing the image of screen corners 16A, 16B, 16C, 16D and any other optical features from the canonical pose, the challenge of recovering parameters of absolute pose of camera 20 in three-dimensional environment 10 is solvable. Still more precisely put, as camera 20 changes its position and orientation (a.k.a. extrinsic parameters) and its viewpoint O travels along a trajectory 34 in world coordinates parameterized by axes (Xw,Yw,Zw), only the knowledge of corner images 16A′, 16B′, 16C′, 16D′ in camera coordinates parameterized by axes (Xc,Yc,Zc) can be used to recover the changes in pose or extrinsic parameters of camera 20. This exciting problem in computer and robotic vision has been explored for decades.
Referring to FIG. 2, we now review a typical prior art approach to camera pose recovery in world coordinates (a.k.a. absolute pose, since world coordinates defined by television 14 sitting in room 10 are presumed stable for the purposes of this task). In this example, camera 20 is mounted on-board item 36, which is a mobile device and more specifically a tablet computer with a display screen 38. The individual parts of camera 20 are not shown explicitly in FIG. 2, but non-inverted image 18′ of screen 18 as found in projective plane 28 is illustrated on display screen 38 of tablet computer 36 to aid in the explanation. The practitioner is cautioned here, that although the same reference numbers refer to image points in the image plane on sensor 24 (see FIG. 1) and in projective plane 28 to limit notational complexity, a coordinate transformation exists between image points in the actual image plane and projective plane 28. As remarked above, this transformation typically involves a reflection/rotation matrix and an offset between camera center CC and the actual center of sensor 24 discovered during the camera calibration procedure (also see FIG. 1).
A prior location of camera viewpoint O along trajectory 34 and an orientation of camera 20 at time t=t−i are indicated by camera coordinates using camera axes (Xc,Yc,Zc) whose origin coincides with viewpoint O. Clearly, at time t=t−i camera 20 on-board tablet 36 is not in the canonical pose. The canonical pose, as shown in FIG. 1, obtains at time t=to. Given unconstrained motion of viewpoint O along trajectory 34 and including rotations in three-dimensional environment 10, all extrinsic parameters of camera 20 and correspondingly the position and orientation (pose) of tablet 36 change between time t=t−i and t=to. Still differently put, all six degrees of freedom (6 DOFs or the three translational and the three rotational degrees of freedom inherently available to rigid bodies in three-dimensional environment 10) change along trajectory 34.
Now, at time t=t1 tablet 36 has moved further along trajectory 34 from its canonical pose at time t=to to an unknown pose where camera 20 records corner images 16A′, 16B′, 16C′, 16D′ at the locations displayed on screen 38 in projective plane 28. Of course, camera 20 actually records corner images 16A′, 16B′, 16C′, 16D′ with pixels 26 of its sensor 24 located in the image plane defined by lens 22 (see FIG. 1). As indicated above, a known transformation exists (based on camera calibration of intrinsic parameters, as mentioned above) between the image plane of sensor 24 and projective plane 28 that is being shown in FIG. 2.
In the unknown camera pose at time t=t1 a television image 14′ and, more precisely screen image 18′ based on corner images 16A′, 16B′, 16C′, 16D′ exhibits a certain perspective distortion. By comparing this perspective distortion of the image at time t=t1 to the image obtained in the canonical pose (at time t=to or during camera calibration procedure) one finds the extrinsic parameters of camera 20 and, by extension, the pose of tablet 36. By performing this operation with a sufficient frequency, the entire rigid body motion of tablet 36 along trajectory 34 of viewpoint O can be digitized.
The corresponding computation is traditionally performed in projective plane 28 by using homogeneous coordinates and the rules of perspective projection as taught in the references cited above. For a representative prior art approach to pose recovery with respect to rectangles, such as presented by screen 18 and its corners 16A, 16B, 16C and 16D the reader is referred to T. N. Tan et al., “Recovery of Intrinsic and Extrinsic Camera Parameters Using Perspective Views of Rectangles”, Dept. of Computer Science, The University of Reading, Berkshire RG6 6AY, UK, 1996, pp. 177-186 and the references cited by that paper. Before proceeding, it should be stressed that although in the example chosen we are looking at rectangular screen 18 that can be analyzed by defining vanishing points and/or angle constraints on corners formed by its edges, pose recovery does not need to be based on corners of rectangles or structures that have parallel and orthogonal edges. In fact, the use of vanishing points is just the elementary way to recover pose. There are more robust and practical prior art methods that can be deployed in the presence of noise and when tracking more than four reference features (sometimes also referred to as fiducials) that do not need to form a rectangle or even a planar shape in real space. Indeed, the general approach applies to any set of fiducials defining an arbitrary 3D shape, as long as that shape is known.
For ease of explanation, however, FIG. 3 highlights the main steps of an elementary prior art approach to the recovery of extrinsic parameters of camera 20 based on the rectangle defined by screen 18 in world coordinates parameterizing room 10 (also see FIG. 2). Recovery is performed with respect to the canonical pose shown in FIG. 1. The solution is a rotation expressed by a rotation matrix R and a translation expressed by a translation vector h, or {R, h}.
In other words, the application of inverse rotation matrix R−1 and subtraction of translation vector h return camera 20 from the unknown recovered pose to its canonical pose. The canonical pose at t=to is marked and the unknown pose at t=t1 is to be recovered from image 18′ found in projective plane 28 (see FIG. 2), as shown on display screen 38. In solving the problem we need to find vectors pA, pB, pC and pD from viewpoint O to space points 16A, 16B, 16C and 16D through corner images 16A′, 16B′, 16C′ and 16D′. Then, information contained in computed conjugate vanishing points 40A, 40B can be used for the recovery. In cases where the projection is almost orthographic (little or no perspective distortion in screen image 18′) and vanishing points 40A, 40B become unreliable, angle constraints demanding that the angles between adjoining edges of candidate recovered screen 18 be 90° can be used, as taught by T. N. Tan et al., op. cit.
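The computation of vanishing points from four corner images can be sketched with homogeneous coordinates, in which the line through two points and the intersection of two lines are both obtained as cross products. The Python example below is a minimal illustration under assumed inputs; the corner image coordinates are hypothetical and the helper names are not drawn from T. N. Tan et al.

```python
def cross(u, v):
    """Cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def line_through(p, q):
    """Homogeneous line through two image points p = (x, y) and q = (x, y)."""
    return cross((p[0], p[1], 1), (q[0], q[1], 1))


def intersect(l1, l2):
    """Intersection of two homogeneous lines, or None if they are parallel
    in the image (the vanishing point then lies at infinity)."""
    x, y, w = cross(l1, l2)
    if w == 0:
        return None
    return (x / w, y / w)


# Hypothetical corner images in the projective plane (arbitrary units).
A, B, C, D = (0, 0), (4, 0), (3, 2), (1, 2)

# Each pair of opposite edges of the imaged rectangle defines one vanishing point.
vp_sides = intersect(line_through(A, D), line_through(B, C))
vp_top_bottom = intersect(line_through(A, B), line_through(D, C))

print("vanishing point of edges AD and BC:", vp_sides)
print("vanishing point of edges AB and DC:", vp_top_bottom)
```

In this example one pair of opposite edges meets at a finite vanishing point, while the other pair remains parallel in the image. The second case corresponds to the situation in which a vanishing point recedes toward infinity and becomes unreliable, so that angle constraints are used instead.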
FIG. 3 shows that without explicit information about the size of screen 18, the length of one of its edges (or other scale information), only relative lengths of vectors pA, pB, pC and pD can be found. In other words, when vectors pA, pB, pC and pD are expressed by corresponding unit vectors n̂A, n̂B, n̂C, n̂D times scale constants λA, λB, λC, λD such that pA=n̂AλA, pB=n̂BλB, pC=n̂CλC, and pD=n̂DλD, then only relative values of scale constants λA, λB, λC, λD can be obtained. This is clear from looking at a small dashed candidate for screen 18* with corner points 16A*, 16B*, 16C*, 16D*. These present the correct shape for screen 18* and lie along vectors pA, pB, pC and pD, but they are not the correctly scaled solution.
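The scale ambiguity can be demonstrated numerically. In the minimal Python sketch below, a hypothetical rectangle is scaled toward the viewpoint by a common factor s, and both the original and the scaled candidate are projected with an ideal pinhole model. The rectangle coordinates, the focal length f and the factor s are assumptions made for this example.

```python
import math


def project(p, f):
    """Ideal pinhole projection of a point (X, Y, Z) onto the plane Z = f."""
    X, Y, Z = p
    return (f * X / Z, f * Y / Z)


def split_ray(p):
    """Split vector p into a unit vector n_hat and a scale constant lam
    such that p = n_hat * lam."""
    lam = math.sqrt(sum(c * c for c in p))
    return tuple(c / lam for c in p), lam


# Hypothetical rectangle, rotated about the vertical axis, in camera coordinates.
f = 0.004
corners = [(-0.4, 0.3, 1.7), (0.4, 0.3, 2.3), (0.4, -0.3, 2.3), (-0.4, -0.3, 1.7)]

# A smaller candidate whose corners lie along the same vectors from the viewpoint.
s = 0.4
candidate = [tuple(s * c for c in p) for p in corners]

images_true = [project(p, f) for p in corners]
images_cand = [project(p, f) for p in candidate]

lams_true = [split_ray(p)[1] for p in corners]
lams_cand = [split_ray(p)[1] for p in candidate]

# Only the ratios of the scale constants are fixed by the images.
ratios_true = [lam / lams_true[0] for lam in lams_true]
ratios_cand = [lam / lams_cand[0] for lam in lams_cand]

print("images (true):     ", images_true)
print("images (candidate):", images_cand)
print("relative scale constants:", ratios_true)
```

Both sets of corners produce the same images and the same relative scale constants, so the images alone cannot distinguish the true screen from the scaled candidate.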
Also, if space points 16A, 16B, 16C and 16D are not identified with image points 16A′, 16B′, 16C′ and 16D′ then the in-plane orientation of screen 18 cannot be determined. This labeling or correspondence problem is clear from examining a candidate for recovered screen 18*.
Its recovered corner points 16A*, 16B*, 16C* and 16D* do not correspond to the correct ones of actual screen 18 that we want to find. The correspondence problem can be solved by providing information that uniquely identifies at least some of points 16A, 16B, 16C and 16D. Alternatively, additional space points that provide more optical features at known locations in room 10 can be used to break the symmetry of the problem. Otherwise, the space points can be encoded by any suitable methods and/or means. Of course, space points that present intrinsically asymmetric space patterns could be used as well.
Another problem is illustrated by candidate for recovered screen 18**, where candidate points 16A**, 16B**, 16C**, 16D** do lie along vectors pA, pB, pC and pD but are not coplanar. This structural defect is typically resolved by requiring not only that the dot products of the vectors representing adjoining edges of candidate screen 18** be zero (to ensure orthogonal corners), but also that the triple product of these edge vectors be zero. The latter condition holds for a rectangle because its edge vectors are coplanar, and the triple product of coplanar vectors vanishes. Still another way to remove the structural defect involves the use of cross ratios.
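The two structural constraints can be sketched as follows. The Python example below is a minimal illustration with hypothetical candidate coordinates and an assumed tolerance; it evaluates the dot products at the four corners and the triple product of three consecutive edge vectors.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def triple(u, v, w):
    """Scalar triple product u . (v x w); zero when u, v, w are coplanar."""
    return dot(u, cross(v, w))


def check_candidate(A, B, C, D, tol=1e-9):
    """Return (orthogonal_corners, coplanar) for a candidate quadrilateral."""
    ab, bc, cd, da = sub(B, A), sub(C, B), sub(D, C), sub(A, D)
    corner_dots = [dot(ab, bc), dot(bc, cd), dot(cd, da), dot(da, ab)]
    orthogonal = all(abs(d) <= tol for d in corner_dots)
    coplanar = abs(triple(ab, bc, cd)) <= tol
    return orthogonal, coplanar


# Hypothetical candidates in camera coordinates.
good = [(-1, -1, 4), (1, -1, 4), (1, 1, 4), (-1, 1, 4)]

# Same vectors from the viewpoint, but the fourth corner has slid along its ray.
defective = [(-1, -1, 4), (1, -1, 4), (1, 1, 4), (-1.5, 1.5, 6)]

good_result = check_candidate(*good)
defective_result = check_candidate(*defective)

print("good candidate (orthogonal, coplanar):", good_result)
print("defective candidate (orthogonal, coplanar):", defective_result)
```

In practice the tolerance would have to reflect the measurement noise, since noisy candidates never satisfy either constraint exactly.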
In addition to the above problems, there is noise. Thus, the practical challenge is not only in finding the right candidate based on structural constraints, but also distinguishing between possible candidates and choosing the best one in the presence of noise. In other words, the real-life problem of pose recovery is a problem of finding the best estimate for the transformation encoded by {R, h} from the available measurements. To tackle this problem, it is customary to work with the homography or collineation matrix A that expresses {R, h}. In this form, the well-known methods of linear algebra can be brought to bear on the problem of estimating A. Once again, the reader should remember that these tools can be applied for any set of optical features (fiducials) and not just rectangles as formed by screen 18 used for explanatory purposes in this case. In fact, any set of fiducials defining any 3D shape in room 10 can be used, as long as that 3D shape is known. Additionally, such 3D shape should have a geometry that produces a sufficiently large image from all vantage points (see definition of convex hull).
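To illustrate how linear algebra is brought to bear, the following minimal Python sketch estimates a 3x3 planar homography from exactly four point correspondences by solving an 8x8 linear system. Several aspects are assumptions of this example and are not taken from the text: the normalization that fixes the last matrix element to 1, the use of plain Gaussian elimination, and the hypothetical point coordinates. The sketch does not decompose the result into {R, h} and does not handle noise.

```python
def solve_linear(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("degenerate configuration (e.g., collinear points)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def estimate_homography(src, dst):
    """Estimate a homography (last element fixed to 1) from four correspondences."""
    M, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), and similarly for v.
        M.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        M.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(M, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(A, pt):
    """Map point (x, y) through homography A using homogeneous coordinates."""
    x, y = pt
    w = A[2][0] * x + A[2][1] * y + A[2][2]
    return ((A[0][0] * x + A[0][1] * y + A[0][2]) / w,
            (A[1][0] * x + A[1][1] * y + A[1][2]) / w)


# Hypothetical corner images in the canonical pose and in an unknown pose.
canonical = [(-1, -1), (1, -1), (1, 1), (-1, 1)]
observed = [(0, 0), (4, 0), (3, 2), (1, 2)]

A_est = estimate_homography(canonical, observed)
mapped = [apply_homography(A_est, p) for p in canonical]
center = apply_homography(A_est, (0, 0))

print("mapped corners:", mapped)
print("mapped center:", center)
```

With more than four correspondences the system becomes overdetermined, and a least squares solution would take the place of the exact solve used here.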
FIGS. 4A & 4B illustrate realistic situations in which estimates of collineation matrices A are computed in the presence of noise for our simple example. FIG. 4A shows on the left a full field of view 42 (F.O.V.) of lens 22 centered on camera center CC while camera 20 is in the canonical pose (also see FIG. 1). Field of view 42 is parameterized by sensor coordinates of photo-sensor 24 using sensor axes (Xs,Ys). Note that pixelated sensors like sensor 24 usually take the origin of array of pixels 26 to be in the upper corner. Also note that camera center CC has an offset (xsc,ysc) from the origin. In fact, (xsc,ysc) is the location of viewpoint O and origin o′ of projective plane 28 in sensor coordinates (previously shown in camera coordinates (Xc,Yc,Zc); see FIG. 1). Working in sensor coordinates is initially convenient because screen image 18′ is first recorded along with noise by pixels 26 of sensor 24 in the image plane that is parameterized by sensor coordinates. Note the inversion of real screen image 18′ on sensor 24 in comparison to virtual screen image 18′ in projective plane 28 (again see FIG. 1).
On the right, FIG. 4A illustrates screen image 18′ after viewpoint O has moved along trajectory 34 and camera 20 assumed a pose corresponding to an unknown collineation A1 with respect to the canonical pose shown on the left. Collineation A1 consists of an unknown rotation and an unknown translation {R, h}. Due to noise, there are a number of measured image points p̂i=(x̂i,ŷi), indicated by crosses, for corner images 16A′, 16B′, 16C′ and 16D′. (Here the “hat” denotes measured values, not unit vectors.) The best estimate of collineation A1, referred to as Θ (estimation matrix), yields the best estimate of the locations of corner images 16A′, 16B′, 16C′ and 16D′ in the image plane. The value of estimation matrix Θ is usually found by minimizing a performance criterion through mathematical optimization. Suitable methods include the application of least squares, weighted average or other suitable techniques to process measured image points p̂i=(x̂i,ŷi). Note that many prior art methods also include outlier rejection of certain measured image points p̂i=(x̂i,ŷi) that could “skew” the average. Various voting algorithms including RANSAC can be deployed to solve the outlier problem prior to averaging.
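The idea of rejecting outliers prior to averaging can be sketched for a single corner image. The Python example below is a deliberately simple stand-in and is not RANSAC: it discards measured points that lie far from the median point, using a threshold k that is an assumption of this example, and then averages the remaining points. The mean of the retained points is the least squares estimate of a single point location. The measured values are hypothetical.

```python
import math
import statistics


def robust_estimate(measurements, k=3.0):
    """Estimate one image point from noisy measurements (x, y).

    Points farther than k times the median distance from the median point
    are treated as outliers; the remaining points are averaged.
    """
    mx = statistics.median(p[0] for p in measurements)
    my = statistics.median(p[1] for p in measurements)
    dists = [math.hypot(p[0] - mx, p[1] - my) for p in measurements]
    scale = statistics.median(dists) or 1e-12  # guard against zero spread
    inliers = [p for p, d in zip(measurements, dists) if d <= k * scale]
    n = len(inliers)
    estimate = (sum(p[0] for p in inliers) / n, sum(p[1] for p in inliers) / n)
    return estimate, inliers


# Hypothetical measured image points for one corner image, with one outlier.
measured = [(10.1, 5.0), (9.9, 5.1), (10.0, 4.9),
            (10.2, 5.0), (9.8, 5.0), (14.0, 9.0)]

estimate, inliers = robust_estimate(measured)

print("estimate:", estimate)
print("rejected:", [p for p in measured if p not in inliers])
```

A full estimator would apply such a criterion to the collineation itself rather than to each corner image independently.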
FIG. 4B shows screen image 18′ as recorded in another pose of camera 20. This one corresponds to a different collineation A2 with respect to the canonical pose. Notice that the composition of collineations behaves as follows: collineation A1 followed by collineation A2 is equivalent to composition A1A2. Once again, measured image points p̂i=(x̂i,ŷi) for the estimate computation are indicated.
The distribution of measured image points p̂i=(x̂i,ŷi) normally obeys a standard noise statistic dictated by environmental conditions. When using high-quality camera 20, that distribution is thermalized based mostly on the illumination conditions in room 10, the brightness of screen 18 and edge/corner contrast (see FIG. 2). This is indicated in FIG. 4B by a dashed outline indicating a normal error region or typical deviation 44 that contains most possible measured image points p̂i=(x̂i,ŷi) excluding outliers. An example outlier 46 is indicated well outside typical deviation 44.
In some situations, however, the distribution of points p̂i=(x̂i,ŷi) does not fall within typical error region 44 accompanied by a few outliers 46. In fact, some cameras introduce persistent or even inherent structural uncertainty into the distribution of points p̂i=(x̂i,ŷi) found in the image plane on top of typical deviation 44 and outliers 46.
One typical example of such a situation occurs when the optical system of a camera introduces multiple reflections of bright light sources (which are prime candidates for space points to track) onto the sensor. This may be due to the many optical surfaces that are typically used in the imaging lenses of camera systems. In many cases, these multiple reflections can cause a number of ghost images along radial lines extending from the center of the sensor or camera center CC as shown in FIG. 1 to the point where the optical axis OA of the lens intersects with the sensor. This condition results in a large inaccuracy when using the image to measure the radial distance of the primary image of a light source. The prior art teaches no suitable formulation of the homography or collineation to nonetheless recover parameters of camera pose under such conditions.