The relation between point correspondences in an optical flow and the shape of a three-dimensional (3D) rigid body, for the purpose of modeling, has been extensively described, see, for example, Barron et al., “The feasibility of motion and structure from noisy time-varying image velocity information,” IJCV, 5(3):239–270, December 1990; Heyden et al., “An iterative factorization method for projective structure and motion from image sequences,” IVC, 17(13):981–991, November 1999; Stein et al., “Model-based brightness constraints: On direct estimation of structure and motion,” PAMI, 22(9):992–1015, September 2000; Sugihara et al., “Recovery of rigid structure from orthographically projected optical flow,” CVGIP, 27(3):309–320, September 1984; and Waxman et al., “Surface structure and three-dimensional motion from image flow kinematics,” IJRR, 4(3):72–94, 1985.
Most modern methods for extracting 3D information from image sequences (e.g., a video) are based on the Tomasi & Kanade “rank theorem” as described by Tomasi et al. in “Shape and motion from image streams under orthography: A factorization method,” International Journal of Computer Vision, 9(2):137–154, 1992. The matrices of point-tracking data for orthographically projected rigid-body motion have rank 3. That is, each such matrix can be expressed in terms of three linearly independent vectors. It is well known that the matrices can be factored into shape and projection via a thin singular value decomposition (SVD). Bregler et al. in “Recovering non-rigid 3D shape from image streams,” Proc. CVPR, 2000, describe an extension to k-mode non-rigid motion via rank-3k double-SVD. To date, all such factorization methods require successful point tracking data as input.
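The rank-3 property and the thin-SVD factorization can be illustrated with a minimal sketch in Python using NumPy. This is not code from any of the cited works. The synthetic data, the function names (`random_rotation`, `synthetic_tracks`, `factor_rank3`), and the even split of the singular values between the two factors are assumptions made only for illustration. The recovered factors are determined only up to an invertible 3×3 transform, and the metric upgrade that resolves this ambiguity is omitted.

```python
import numpy as np

def random_rotation(rng):
    # Orthogonal matrix from the QR decomposition of a random Gaussian matrix;
    # flip one column if needed so that the determinant is +1 (a proper rotation).
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def synthetic_tracks(num_frames=10, num_points=20, seed=0):
    # Hypothetical test data: a rigid 3D point cloud seen under orthography.
    rng = np.random.default_rng(seed)
    shape = rng.standard_normal((3, num_points))
    rows = []
    for _ in range(num_frames):
        rotation = random_rotation(rng)
        translation = rng.standard_normal((2, 1))
        # Orthographic projection keeps the first two rows of the rotation.
        rows.append(rotation[:2] @ shape + translation)
    return np.vstack(rows)  # measurement matrix, size (2 * num_frames, num_points)

def factor_rank3(measurements):
    # Subtracting each row's mean removes the per-frame 2D translation.
    registered = measurements - measurements.mean(axis=1, keepdims=True)
    # Thin SVD: only the leading singular triplets are computed and kept.
    u, s, vt = np.linalg.svd(registered, full_matrices=False)
    root = np.sqrt(s[:3])
    motion = u[:, :3] * root           # (2F, 3) stacked projection rows
    shape = root[:, None] * vt[:3]     # (3, N) shape, up to a 3x3 transform
    return motion, shape, s, registered

W = synthetic_tracks()
motion, shape, singular_values, W0 = factor_rank3(W)
print("fourth / first singular value:", singular_values[3] / singular_values[0])
print("max reconstruction error:", np.abs(motion @ shape - W0).max())
```

For noise-free tracks, the fourth and later singular values vanish to within floating-point precision, and the product of the two factors reproduces the registered measurement matrix. With real tracking data the trailing singular values are nonzero, and truncating to rank 3 gives a least-squares approximation.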
Non-rigid two-dimensional (2D) modeling methods for object matching or tracking are also known. These are based either on eigenspace representations of variability of shape, see Black and Yacoob, “Eigentracking: Robust matching and tracking of articulated objects using a view-based representation,” IJCV, pages 63–84, 1998; Cootes et al., “Active appearance models,” Proc. ECCV, volume 2, pages 484–498, 1998; and Covell, “Eigen-points: Control-point location using principal component analysis,” Proc. 2nd IWAFGR, 1996, or on parametric representations of variability, see Black and Jepson, “Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion,” Proc. ICCV, 1995; and Sclaroff et al., “Active blobs,” Proc. ICCV, 1998.
Most of these methods require a large number of hand-marked images for training the model. Covell's eigenpoint tracker employs an eigen-basis to relate affine-warped images of individual facial features to hand-marked fiducial points on those features. Black and Yacoob describe parametric 2D models of flow for non-rigid facial features, and Black and Jepson also use an eigen-basis of views for 2D tracking of non-rigid objects. Cootes et al. employ statistical models of 2D shape to handle variation in facial images due to pose and identity, but not expression. Many of these approaches require robustizing methods to discard outliers. Clearly, there is a price to pay for using 2D models of what is essentially 3D variability.
Bascle et al. in “Separability of pose and expression in facial tracking and animation,” Proc. ICCV, 1998, describe an interesting compromise between 2D and 3D tracking by factoring the motion of tracked contours into flexion and 2D affine-with-parallax warps via SVD.
None of the prior art addresses the full problem of tracking a non-rigid 3D object in video and recovering its 3D motion and flexion parameters, or of recovering such parameters directly from variations in pixel intensities. It is desired to provide an improved method for acquiring models and their motions from a sequence of images. The method should determine 3D motion and flexion directly from intensities in the images without losing information while determining intermediate results. The method should minimize uncertainty, and prior probabilities should give confidence measures.