1. Field of the Invention
This invention relates generally to a method and apparatus for linear fitting with missing data: applications to structure-from-motion and to characterize intensity images.
2. Discussion of the Prior Art
Several vision problems can be reduced to the problem of fitting a linear surface of low dimension to data. For example, C. Tomasi et al. (1992) "Shape and Motion From Image Streams Under Orthography: A Factorization Method" Int. J. of Comp. Vis. 9(2): 137-154 show how to determine the structure and motion of a set of 3-D points tracked in a sequence of 2-D images by fitting a rank three matrix to a data matrix constructed from the noisy images, assuming an affine imaging model. P. Belhumeur et al. (1996) "What is the Set of Images of an Object Under All Possible Lighting Conditions?" IEEE Conf. of Comp. Vis. and Pat. Rec.: 270-277 show how to characterize the set of intensity images produced by a convex, Lambertian object, also by fitting a rank three matrix to a matrix constructed from noisy data. One serious issue for solutions based on this work is how to deal with a data matrix with some missing elements. In structure from motion, missing elements will occur in the data matrix if some point features are not visible in the image throughout the motion sequence, an almost inevitable occurrence. To construct the intensity manifold as suggested by P. Belhumeur et al. (1996), missing matrix elements will to arise when the surface normals of some scene points do not face the light source in some images; this must happen for any scene containing smooth objects, and will probably happen in most real scenes.
Recently, several important vision problems have been addressed by using affine imaging models, which reduce the problems to ones of finding a low-dimensional linear or affine surface that represents some important scene structure. For example, in the linear combinations work of S. Ullman et al. (1991) "Recognition by Linear Combinations of Models" IEEE Trans. PAMI 13(10): 992-1007 and R. Basri et al. (1988) "The Alignment of Objects With Smooth Surfaces" Computer Graphics, Vision and Image Processing: Image Understanding 57(3): 331-345, the set of images produced by any 3-D structure from different viewpoints is represented as a low-dimensional affine surface in the space of all possible images. This structure is then used for object recognition. C. Tomasi et al. (1992) use a related insight to efficiently perform structure-from-motion in long motion sequences. Since the noise-free images should all lie in a low-dimensional linear surface, such a surface can be fit to the data and then used to derive structure and motion. Y. Moses (1993) Face recognition: generalization to novel images Ph.D. Thesis, Weizmann Institute of Science and A. Shashua (1992) Geometry and Photometry in 3D Visual Recognition MIT TR-1401 show that the set of intensity images produced by a 3-D Lambertian structure from a single viewpoint, but with a point light source of varying position and intensity can also be described as a 3-D linear surface in a space of possible images. These form a 3-D linear surface in a space of possible images (the intensity manifold). P. Belhumeur et al. (1996) use this result to describe the set of intensity images produced by a convex object, allowing for multiple light sources and self-shadowing. In all of this work, the problem is greatly simplified by finding a reasonable set of assumptions to make it linear.
All of these methods require fitting a matrix of low rank r to a data set. The linear space spanned by this matrix corresponds to scene structure. In several approaches, this is done by acquiring only the minimal amount of data needed to construct an r.times.n matrix [see, S. Ullman et al. (1991), Y. Moses (1993), A. Shashua (1992) and P. Belhumeur et al. (1996)], which has exactly rank r even in the presence of noise. This method is simple, but can be quite sensitive to noise. S. Ullman et al. (1991) also propose using singular value decomposition (SVD) to fit a low rank matrix to a larger matrix built when extra images are used to over determine the problem. C. Tomasi et al. (1992) use a long motion sequence to build a large matrix, and then find the nearest low rank matrix using SVD. This method finds the low rank matrix that minimizes the sum of square differences between each element in the new matrix and the data matrix. By over determining the problem and finding a least squares fit to the data, this method can reduce the effects of noise.
A significant limitation of these methods is that they require a data matrix with no missing elements. In the case of structure-from-motion, this means that every point feature must be visible in every image. This is a strong assumption because often points enter or leave the field of view during a motion sequence, or may be blocked from view by other objects in the middle of a sequence. The methods that derive the intensity manifold produced by a scene use a set of images in which illumination in each image comes from a single point light source. The data matrix will have missing elements when scene points are not illuminated by the light source; therefore, these methods require that every visible scene point be illuminated in every image. However, for scenes with smooth objects, this assumption will be met only when the light source direction is the same as the viewing direction, i.e., for one possible image. More generally, most visible points will be illuminated only when the light source is near the viewing direction. However, if images are generated with similar light source directions, the resulting solution may be unstable.
C. Tomasi et al. (1992) describe a simple heuristic method for fitting a matrix that has missing elements. H. Shum et al. (1995) "Principal Component Analysis With Missing Data and Its Applications to Polyhedral Object Modeling" IEEE Trans. PAMI 17(9): 854-867 apply an iterative method due to T. Wiberg (1976) "Computation of Principal Components When Data Are Missing" Proc. Second Symp. Computational Statistics: 229-236 to a related structure from motion problem. This method minimizes the sum of square differences between the fitted, low rank matrix and the elements that are not missing in the data matrix. This approach has the virtue of converging to a locally optimal solution, however, it is not guaranteed to find the globally optimal one. While this iterative method has its advantages, it can often converge to the wrong answer if it does not have a good starting point.
The methods proposed by S. Ullman et al. (1991) and C. Tomasi et al. (1992) assume that a sequence of matched image points is given, and that these were projected from 3-D to 2-D using a camera model that is affine, or a subset of affine (e.g., orthographic, weak perspective or paraperspective, as set forth in R. Basri (1996) "Paraperspective.ident.Affine" Int. J. of Comp. Vis. 19(2): 169-179). The translation component of motion is estimated separately, by estimating the translation of the center of mass of the points we therefore assume for simplicity that there is no in-plane translation. Let p.sub.i =(x.sub.i, y.sub.i, z.sub.i).sup.T be the i'th 3-D point feature. Then the model to image transformation may be written as a 2.times.3 matrix S.sub.i. Letting q.sub.i =(X.sub.i, Y.sub.i) stand for the image point produced by p.sub.i, this gives: ##EQU1## For the case of scaled-orthographic projection, S.sub.i is the first two rows of a scaled rotation matrix.
We can now represent an entire image sequence in simple matrix notation. Let P be a matrix whose i'th column is p.sub.i : ##EQU2## Let T stand for a matrix containing all rows of all transformations S.sub.i, that is, row 2i-1 of T is the first row of S.sub.i, and row 2i of T is its second row. ##EQU3## Finally, let Q stand for a single matrix containing all tracked image points. Each column of Q contains all the image coordinates of a single point. That is Q.sub.2i-1,j gives the x coordinate of q.sub.j in image i, and Q.sub.2i,j gives the y coordinate of q.sub.j, in image i. We write: ##EQU4## where (X.sub.i,j) is the X coordinate of the j'th noisy image point in the i'th image, and Y.sub.i,j is similarly defined. Let E be a matrix of the same dimension as Q, whose elements indicate the noise encountered in sensing the corresponding elements of Q. Then we may simply describe the generation of the whole image sequence as: EQU TP+E=Q
We further define: EQU Q.ident.Q-E=TP
Although we have no direct access to the error-free data Q, we wish to estimate it from Q. Because T has three columns, it can only have rank three, and consequently, Q can have only rank three. (That is, the columns, or rows, of Q, regarded as points, must occupy no more than a 3-D linear subspace of R.sup.m (R.sup.n). Any column, or row, will be a linear combination of any other three linearly independent columns, or rows.)
The matrix containing the coordinates of sensed image points Q will be a noisy version of Q, and will have full rank. However, if we find the Q matrix of rank three that is closest to Q, we can obtain a maximum likelihood estimate of Q. In particular, we can use SVD to find the Q matrix that minimizes .SIGMA..sub.i .SIGMA..sub.j (Q.sub.ij -Q.sub.ij).sup.2. We can also use SVD to factor this Q matrix into T and P, providing us with estimates of the structure and motion of the 3-D points, up to an affine transformation. See C. Tomasi et al. (1992) for further details.
When some points are not visible in some images, Q will contain missing entries for these points. We still know that the error free data, were it all visible, would give rise to a Q matrix of rank three (we can still remove the effects of translation between adjacent images from our data by computing the translation of the center of mass of those points present in adjacent images. For this we need only assume that some points will always be present from one image to the next). We can, therefore, find the affine structure and motion that minimizes error by finding the Q matrix that is closest to Q where we now compare only those elements actually present in Q to the corresponding elements of Q. The problem of finding structure from motion in the presence of missing data, then, depends on finding the nearest rank three matrix to a data matrix with missing elements.
A similar formulation has been used by A. Shashua (1992), Y. Moses (1993) and P. Belhumeur et al. (1996) to describe the set of intensity images produced by a 3-D Lambertian structure exposed to a point light source at infinity. Let the direction of the source scaled by its intensity be the 3-D vector: -h. In this case, let q.sub.j be the intensity of image pixel j. Finally, assume that this intensity is produced by a 3-D planar patch of the scene with a surface normal n.sub.j, and albedo p.sub.j. Then, intensities are produced according to the equation: EQU q.sub.j =h.multidot.p.sub.j n.sub.j
provided that this produces a positive value (i.e., each surface normal is facing the light source). Since we can never recover the absolute magnitude of h or p.sub.j from a set of the equations, we may without loss of generality assume that p.sub.j n.sub.j is simply an unconstrained 3-D vector (that is, the constraint that albedo is less than one has no effect without some limitation on the scale of h). Next, assume we take a set of images in which the viewpoint is fixed, but the lighting intensity and direction may vary. Let N be a scene matrix in which the j'th column is p.sub.j n.sub.j. Let H be a lighting matrix in which the i'th row is the h vector for the i'th image. And let Q be a data matrix in which the j'th column gives the intensity of the j'th pixel in every image. Then we have: EQU Q=HN+E EQU Q.ident.Q-E=HN
where E is again a matrix of sensing errors. Again, Q has rank three. Again, our noisy images provide a full rank matrix Q, and again, we can estimate Q, H and N from Q by using SVD to find the nearest rank three matrix to Q, along with its decomposition. In previous work, A. Shashua (1992) and P. Belhumeur et al. (1996) have used the simpler, but less robust, method of using only three images to form Q, in which case Q has only rank three, and is the best estimate of Q. A. Shashua (1992) uses the matrix N to determine whether a novel image could be generated by the same scene, by determining whether the novel image is a linear combination of the rows of N.
For this problem, Q will have missing elements when some of the scene surface normals do not face the light source. For if h.multidot.p.sub.j n.sub.j is negative, the scene point is self-shadowed, and we perceive zero intensity plus noise, not negative intensity. We can tell from the fact that a scene point is dark that it does not face the light source, but we have no further information about its orientation. P. Belhumeur et al. (1996) use N to determine the more complicated illumination cone that describes the set of images that a scene can produce allowing for this self-shadowing effect, but they must still build N using images in which there is no self-shadowing. However, if we have many images containing self-shadowing, we can construct a Q matrix with missing elements. Again, the key problem to estimating the quantities of interest will be to find the best rank three approximation to a matrix with missing elements.
C. Tomasi et al. (1992) propose one method for finding the best rank three approximation to a matrix with missing data. In their method, one first locates a rectangular subset of the matrix with no missing data. SVD is used to find the best rank three approximation to this submatrix. Then the missing elements of an additional column (or row) are filled in by finding the linear combination of a basis for the columns (or rows) of the submatrix that best fits the non-missing elements of the new column (or row). In this way, the solution is iteratively extended one column (or row) at a time. The resulting solution is then refined using a steepest descent minimization to find the rank three matrix that is nearest to all the data present in the Q matrix.
This solution seems like a reasonable heuristic because missing data is filled in using over constraining information, and C. Tomasi et al. (1992) demonstrate its effectiveness on two motion sequences. However, the proposed method for initially filling in the missing matrix elements has several potential disadvantages. First, the problem of finding the largest full submatrix of a matrix with missing elements is NP-hard (max-clique can be easily reduced to this problem) so heuristics must be used to find a starting point. Second, the data is not used symmetrically. A small subset of the data may be used to compute the first missing elements, which may lead to significant inaccuracies in these elements. It is not clear how these inaccuracies will propagate in the computation of additional missing elements. Third, for an m.times.n matrix, the method requires O(max(m,n)) applications of SVD. Finally, it is not certain that the final application of steepest descent will cause the method to converge to a globally optimal solution.
H. Shum et al. (1995) have also done recent work relevant to the missing data problem. They consider the problem of deriving structure and motion based on 3-D data whose significance can be weighted, leading to a weighted least squares problem of fitting a low rank matrix to a data matrix. For this problem, they suggest the use of an iterative method due to the findings set forth in T. Wiberg (1976); this iterative method clearly can also be applied to the missing data problem considered in this paper. This method appears likely to have better convergence properties than a steepest descent approach, and we use it in our experiments. This method is described in detail in H. Shum et al. (1995), however, for completeness, we will synopsize their description here, since we use their method in our experiments. We can briefly describe the method as formulating the least squares problem as a bilinear optimization, and then iteratively holding one set of variables constant while the others are optimized, so that each optimization is linear.
In more detail, suppose first that we wish to estimate Q=TP from a noisy Q matrix with no missing elements, where Q is an m.times.n matrix of rank r, meaning that T is m.times.r and P is r.times.n. We will recursively update estimates of T and P, letting T.sub.i and P.sub.i denote our i'th estimate of each value. We first fix T.sub.1, using some initial estimate. Then, we wish to find P.sub.i, such that T.sub.i P.sub.i is nearest to Q in a least squares sense. This is just an over constrained linear estimation problem, which may be solved by taking: EQU P.sub.i =T.sub.i.sup.+ Q
where T.sub.i.sup.+ is the psuedoinverse of T.sub.i. Next we can fix P.sub.i and compute EQU T.sub.i+1 =QP.sub.i.sup.+
Clearly, this produces a series of estimates of Q, T and P in which the magnitude of Q-Q is non-increasing.
We can apply this same algorithm when Q has missing elements, after a slight modification. To do this, we can rewrite Q=TP as: ##EQU5## where we have written Q and P as vectors, and written T as a big, block-diagonal matrix, with T repeated n times. Again, we can apply the same procedure, fixing T, and multiplying the pseudo inverse of the block-diagonal matrix of it by a vector constructed from Q. This will give a new estimate of P. The second part of the iteration can be performed by simply writing T as a vector, and building a similar block-diagonal matrix of P.
We can now readily modify this procedure when Q has missing elements. We just delete all rows of the vectors and block-diagonal matrix corresponding to missing elements of Q, and perform the procedure as before. Note that taking the pseudoinverse of the block-diagonal matrix is not too expensive, since the pseudoinverse of each block may be taken separately.