Motion is an important cue for image segmentation because parts of a rigid object often exhibit similar motions over time. In particular, it is often desirable to segment objects having different motions in an image, or in a sequence of images (video), acquired of a scene.
Epipolar plane image (EPI) analysis assumes that an image is composed of homogeneous regions bounded by straight lines, no matter what shape, texture or intensity changes are contained in the image. Such observations have been utilized to construct a 3D geometric description of a static scene from a video sequence, see Bolles et al., “Epipolar-plane image analysis: A technique for analyzing motion sequences,” Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, page 26, 2014.
Epipolar geometry is an intrinsic projective geometry between two images that can be used for motion segmentation, see Micusik et al., “Estimation of omnidirectional camera model from epipolar geometry,” Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, pages 1-485. IEEE, 2003. One limitation of using two images is that the motion within the epipolar plane cannot be detected. To overcome this limitation, epipolar constraints can be extended to three images. For example, a three-view epipolar constraint called “parallax-based multiplanar constraint” can be used to classify each image pixel as either belonging to a static background or to objects moving in the foreground, see Xu et al., “Motion segmentation by new three-view constraint from a moving camera,” Mathematical Problems in Engineering, 2015.
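The two-view limitation noted above can be illustrated numerically. The following is a minimal sketch, not the method of any cited reference, and it rests on simplifying assumptions: identity camera intrinsics, a camera that only translates between the two images (no rotation), and hypothetical helper names (`skew`, `project`, `observe`, `epipolar_residual`). Under those assumptions the fundamental matrix reduces to the skew-symmetric matrix of the translation, and a point is consistent with a static scene when the residual x2^T F x1 is zero.

```python
def skew(t):
    """Skew-symmetric matrix [t]_x such that [t]_x v equals the cross product t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def project(X):
    """Pinhole projection with identity intrinsics (an assumption of this sketch)."""
    return (X[0] / X[2], X[1] / X[2], 1.0)

def epipolar_residual(F, x1, x2):
    """Value of x2^T F x1; zero for a point that obeys the two-view constraint."""
    Fx1 = [sum(F[i][j] * x1[j] for j in range(3)) for i in range(3)]
    return sum(x2[i] * Fx1[i] for i in range(3))

# Assumed camera motion: pure translation along x, so F = [t]_x.
t = (1.0, 0.0, 0.0)
F = skew(t)

def observe(X, motion):
    """Image the 3D point X in view 1, then image X + motion in the translated view 2."""
    x1 = project(X)
    moved = [X[i] + motion[i] for i in range(3)]
    x2 = project([moved[i] + t[i] for i in range(3)])
    return x1, x2

X = (0.5, 1.0, 4.0)

# Static point: satisfies the constraint.
residual_static = epipolar_residual(F, *observe(X, (0.0, 0.0, 0.0)))

# Point moving out of its epipolar plane: violates the constraint.
residual_out_of_plane = epipolar_residual(F, *observe(X, (0.0, 0.5, 0.0)))

# Point moving inside its epipolar plane (here, along the baseline):
# satisfies the constraint although the point is not static.
residual_in_plane = epipolar_residual(F, *observe(X, (0.8, 0.0, 0.0)))
```

In this sketch the in-plane motion yields the same zero residual as the static point, which is why a third image is needed to separate the two cases.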
Another approach for motion segmentation uses dynamic texture analysis based on a spatio-temporal generative model for video, which represents video sequences as observations from a linear dynamical system. A related method uses mixtures of dynamic textures as a representation for both appearance and dynamics of a variety of visual processes, see Chan et al., “Modeling, clustering, and segmenting video with mixtures of dynamic textures,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(5):909-926, 2008. However, that approach suffers in the presence of strong perspective effects because it does not account for the epipolar geometry of the scene.
Sparse subspace clustering (SSC) has been used for motion segmentation, see Elhamifar et al., “Sparse subspace clustering: Algorithm, theory, and applications,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(11):2765-2781, 2013. In SSC, trajectories of feature points are extracted from video frames. Sparse optimization is used to find trajectory associations by estimating each feature trajectory as a sparse linear combination of the other feature trajectories. The sparse weights are used to construct a graph that relates the features, and graph spectral clustering is used to segment the features that occupy the same subspace. One limitation of that approach is its reliance on computing trajectories across multiple images. Moreover, the computational complexity of the sparse optimization problem increases quickly as the number of feature points increases.
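The self-expression step of SSC described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the algorithm of Elhamifar et al.: the sparse optimization is approximated by a plain coordinate-descent Lasso, the final graph spectral clustering is replaced by connected components of the affinity graph for brevity, and the six "trajectories" are hypothetical 4-dimensional vectors drawn from two exactly orthogonal subspaces, which real feature trajectories are not. All function names and the parameters `lam` and `sweeps` are assumptions of this sketch.

```python
def soft_threshold(v, lam):
    """Shrink v toward zero by lam (the proximal operator of the L1 norm)."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

def sparse_code(target, dictionary, lam=0.05, sweeps=200):
    """Coordinate descent for 0.5*||target - sum_j w[j]*dictionary[j]||^2 + lam*||w||_1."""
    w = [0.0] * len(dictionary)
    residual = list(target)
    for _ in range(sweeps):
        for j, atom in enumerate(dictionary):
            norm_sq = sum(a * a for a in atom)
            if norm_sq == 0.0:
                continue
            # Correlation of the atom with the residual, with atom j's own term added back.
            rho = sum(a * r for a, r in zip(atom, residual)) + w[j] * norm_sq
            new_w = soft_threshold(rho, lam) / norm_sq
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - delta * a for r, a in zip(residual, atom)]
                w[j] = new_w
    return w

def self_expressive_affinity(trajectories, lam=0.05):
    """Express each trajectory by the others and symmetrise the absolute weights."""
    n = len(trajectories)
    W = []
    for i, target in enumerate(trajectories):
        others = [t for k, t in enumerate(trajectories) if k != i]
        w = sparse_code(target, others, lam)
        w.insert(i, 0.0)  # a trajectory is never used to represent itself
        W.append(w)
    return [[abs(W[i][j]) + abs(W[j][i]) for j in range(n)] for i in range(n)]

def connected_components(affinity, eps=1e-8):
    """Stand-in for spectral clustering: label the connected parts of the graph."""
    n = len(affinity)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and affinity[i][j] > eps:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Hypothetical toy data: three vectors in span(e1, e2), three in span(e3, e4).
trajectories = [
    (1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0), (1.0, 1.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 0.0, 1.0), (0.0, 0.0, 1.0, 1.0),
]
affinity = self_expressive_affinity(trajectories)
labels = connected_components(affinity)
```

Each trajectory requires its own sparse optimization over all the other trajectories, so the work grows quickly with the number of feature points, which is the complexity concern noted above.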
In a related approach, a “hypergraph” is constructed based on similarities defined on higher order tuples, rather than pairs of nodes, see Ochs et al., “Higher order motion models and spectral clustering,” Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 614-621. IEEE, 2012.
Yet another approach for motion segmentation relies on a variation of robust principal component analysis (RPCA) where a moving background is separated from moving foreground objects, see U.S. 20150130953, “Method for Video Background Subtraction Using Factorized Matrix Completion,” Mansour et al. Motion vectors can be used to align images to the same perspective before applying RPCA to extract a low-rank background from sparse moving foreground objects. One limitation of that scheme is that the background alignment assumes that objects are in the same depth plane, which may not necessarily be true. Another limitation is that the technique requires multiple images to produce an accurate segmentation.
In summary, a common limitation observed with conventional methods is their inability to deal with complex motion, especially when strong perspective effects appear in the scene.
Based on an assumption that the optical flow of an object shares one focus of expansion (FOE) point, one motion segmentation method extracts feature points, e.g., Kanade-Lucas-Tomasi (KLT) feature points, generates an optical flow (motion field), e.g., using template matching, and then groups the motion, e.g., using a Random Sample Consensus (RANSAC) approach, see U.S. Pat. No. 8,290,209, Akita et al. In particular, the RANSAC approach can be applied in an x-MVx plane and a y-MVy plane, where any two points are connected with a straight line, the number of points lying within a tolerance of the straight line is counted, and the straight line having the greatest number of points is selected. The points lying within the tolerance of the selected straight line are segmented as one group of motion.
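The line-based grouping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of U.S. Pat. No. 8,290,209: every pair of points is tried instead of a random sample of pairs, so that the result is reproducible; repeating the search on the remaining points to extract further groups is an assumption of this sketch; and the function names, the tolerance `tol`, the threshold `min_inliers`, and the synthetic data are all hypothetical. The sketch relies on the observation that, if the flow of an object radiates from an FOE at x0 with rate k, then MVx = k(x - x0), so the points of that object fall on one straight line in the x-MVx plane.

```python
from itertools import combinations
from math import hypot

def best_line(points, tol):
    """Indices of the points within tol of the best-supported line through any two points."""
    best = []
    for (x1, y1), (x2, y2) in combinations(points, 2):
        # Line through both points in the form a*x + b*y + c = 0.
        a, b = y2 - y1, x1 - x2
        norm = hypot(a, b)
        if norm == 0.0:
            continue  # coincident points do not define a line
        c = -(a * x1 + b * y1)
        inliers = [i for i, (x, y) in enumerate(points)
                   if abs(a * x + b * y + c) / norm <= tol]
        if len(inliers) > len(best):
            best = inliers
    return best

def group_motions(points, tol=0.1, min_inliers=3):
    """Repeatedly peel off the best-supported line; each line is one motion group."""
    remaining = list(range(len(points)))
    groups = []
    while len(remaining) >= min_inliers:
        subset = [points[i] for i in remaining]
        inliers = best_line(subset, tol)
        if len(inliers) < min_inliers:
            break
        groups.append([remaining[i] for i in inliers])
        taken = set(inliers)
        remaining = [idx for k, idx in enumerate(remaining) if k not in taken]
    return groups

# Synthetic (x, MVx) samples: two objects with different FOEs, plus one stray vector.
object_a = [(x, 0.5 * (x - 10.0)) for x in (0.0, 4.0, 8.0, 12.0, 16.0, 20.0)]
object_b = [(x, 0.2 * (x + 20.0)) for x in (0.0, 5.0, 10.0, 15.0)]
stray = [(7.0, 9.0)]
points = object_a + object_b + stray

groups = group_motions(points)
```

The same routine would be applied to (y, MVy) samples in the y-MVy plane. The stray vector is left ungrouped because no line through it gathers `min_inliers` points.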