Relevant previous work is mainly in the area of video segmentation. However, very few video segmentation algorithms are intended for the very general context discussed here. Most were developed in the context of a stationary camera (e.g., [P. Kornprobst and G. Medioni. Tracking segmented objects using tensor voting. In Proceedings of the 2000 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 118-125, 2000], [N. Paragios and R. Deriche. A PDE-based level-set approach for detection and tracking of moving objects. In Proceedings of the 6th IEEE International Conference on Computer Vision, pages 1139-1145, 1998], [H. Y. Wang and K. K. Ma. Automatic video object segmentation via 3D structure tensor. In Proceedings of the 2003 IEEE International Conference on Image Processing, volume 1, pages 153-156, 2003]) or under the assumption that the background has a global, parametric motion (e.g., affine [F. Precioso, M. Barlaud, T. Blu, and M. Unser. Robust real-time segmentation of images and videos using a smooth-spline snake-based algorithm. Image Processing, 14(7):910-924, 2005] or projective [H. Tao, H. S. Sawhney, and R. Kumar. Object tracking with Bayesian estimation of dynamic layer representations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):75-89, 2002], [Y. Tsaig and A. Averbuch. Automatic segmentation of moving objects in video sequences: a region labeling approach. IEEE Transactions on Circuits, Systems, and Video, 12(7):597-612, 2002].) Recently, the latter restriction was relaxed to a planar scene with parallax [J. Kang, I. Cohen, G. Medioni, and C. Yuan. Detection and tracking of moving objects from a moving platform in presence of strong parallax. In Proceedings of the 10th IEEE International Conference on Computer Vision, pages 10-17, 2005]. Other algorithms were constrained to track video objects modeled well by parametric shapes (e.g., active blobs [S. Sclaroff and J. Isidoro. Active blobs: region-based, deformable appearance models. Computer Vision and Image Understanding, 89(2):197-225, 2003]) or motion (e.g., translation [R. Cucchiara, A. Prati, and R. Vezzani. Real-time motion segmentation from moving cameras. Real-Time Imaging, 10(3):127-143, 2004], 2D rigid motion [H. Tao, H. S. Sawhney, and R. Kumar. Object tracking with Bayesian estimation of dynamic layer representations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):75-89, 2002], affine [M. Gelgon and P. Bouthemy. A region-level motion-based graph representation and labeling for tracking a spatial image partition. Pattern Recognition, 33(4):725-740, 2000], [I. Patras, E. A. Hendriks, and R. L. Lagendijk. Video segmentation by MAP labeling of watershed segments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):326-332, 2001], projective [C. Gu and M. C. Lee. Semiautomatic segmentation and tracking of semantic video objects. IEEE Transactions on Circuits, Systems, and Video, 8(5):572-584, 1998], small 3D rigid motion [T. Papadimitriou, K. I. Diamantaras, M. G. Strintzis, and M. Roumeliotis. Video scene segmentation using spatial contours and 3-D robust motion estimation. IEEE Transactions on Circuits, Systems, and Video, 14(4):485-497, 2004] and normally distributed optical flow [S. Khan and M. Shah. Object based segmentation of video using color, motion and spatial information. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 746-751, 2001], [Y. P. Tsai, C. C. Lai, Y. P. Hung, and Z. C. Shih. A Bayesian approach to video object segmentation via 3-D watershed volumes. IEEE Transactions on Circuits, Systems, and Video, 15(1):175-180, 2005]). These algorithms are suitable only for tracking rigid objects or specific preset types of deformations.
The algorithm of the invention, however, addresses the tracking of potentially non-rigid objects in 3D scenes from an arbitrarily moving camera, without prior knowledge other than the object's bitmap in the first frame.
There are algorithms that address video segmentation and successfully track objects under general conditions as an aftereffect. That is, they do not perform explicit tracking in the sense of estimating a current state conditional on the previous one or on the previous frames. For example, in [J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In Proceedings of the 6th IEEE International Conference on Computer Vision, pages 1154-1160, 1998] each set of a few (five) consecutive frames is spatiotemporally segmented without considering the previous results (other than saving calculations.) In [Y. Liu and Y. F. Zheng. Video object segmentation and tracking using ψ-learning classification. IEEE Transactions on Circuits, Systems, and Video, 15(7):885-899, 2005] each frame is segmented into object/background without considering previous frames or classifications. (Furthermore, the classification requires a training phase, upon which the classification is performed, prohibiting major changes in the target's appearance.) In the contour tracking performed in [S. Jehan-Besson, M. Barlaud, and G. Aubert. DREAM2S: Deformable regions driven by an eulerian accurate minimization method for image and video segmentation. International Journal of Computer Vision, 53(1):45-70, 2003], an active contour is run in each frame separately, while the only information taken from previous frames is the previously estimated contour for initialization in the current frame. According to the invention, the state (target's bitmap) is explicitly tracked by approximating a PDF of the current state, which is conditional on the previous state and on the current and previous frames, and by estimating the MAP state.
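The explicit tracking step described in the last sentence can be written compactly. The notation below is assumed for illustration only and does not appear in the text: X_t denotes the target's bitmap in frame t, and I_t denotes frame t. Under that notation, the tracked estimate is the mode of the conditional PDF:

```latex
\hat{X}_t \;=\; \arg\max_{X_t} \; P\!\left(X_t \,\middle|\, I_{t-1},\, I_t,\, X_{t-1}\right)
```

By contrast, the per-frame methods cited above would correspond to an estimate that depends on I_t alone (or on a short window of frames), with X_{t-1} used at most for initialization.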
Optical flow is an important cue for visually tracking objects, especially under general conditions. Most video segmentation algorithms make a point estimate of the optical flow, usually prior to segmentation (e.g., [R. Cucchiara, A. Prati, and R. Vezzani. Real-time motion segmentation from moving cameras. Real-Time Imaging, 10(3):127-143, 2004], [M. Gelgon and P. Bouthemy. A region-level motion-based graph representation and labeling for tracking a spatial image partition. Pattern Recognition, 33(4):725-740, 2000], [C. Gu and M. C. Lee. Semiautomatic segmentation and tracking of semantic video objects. IEEE Transactions on Circuits, Systems, and Video, 8(5):572-584, 1998], [S. Khan and M. Shah. Object based segmentation of video using color, motion and spatial information. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 746-751, 2001], [V. Mezaris, I. Kompatsiaris, and M. G. Strintzis. Video object segmentation using Bayes-based temporal tracking and trajectory-based region merging. IEEE Transactions on Circuits, Systems, and Video, 14(6):782-795, 2004], [H. T. Nguyen, M. Worring, R. van den Boomgaard, and A. W. M. Smeulders. Tracking nonparameterized object contours in video. Image Processing, 11(9):1081-1091, 2002], [T. Papadimitriou, K. I. Diamantaras, M. G. Strintzis, and M. Roumeliotis. Video scene segmentation using spatial contours and 3-D robust motion estimation. IEEE Transactions on Circuits, Systems, and Video, 14(4):485-497, 2004], [I. Patras, E. A. Hendriks, and R. L. Lagendijk. Semi-automatic object-based video segmentation with labeling of color segments. Signal Processing: Image Communications, 18(1):51-65, 2003], [Y. P. Tsai, C. C. Lai, Y. P. Hung, and Z. C. Shih. A Bayesian approach to video object segmentation via 3-D watershed volumes. IEEE Transactions on Circuits, Systems, and Video, 15(1):175-180, 2005], [Y. Tsaig and A. Averbuch. Automatic segmentation of moving objects in video sequences: a region labeling approach. IEEE Transactions on Circuits, Systems, and Video, 12(7):597-612, 2002]) and seldom in conjunction with it (e.g., [I. Patras, E. A. Hendriks, and R. L. Lagendijk. Video segmentation by MAP labeling of watershed segments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):326-332, 2001]). An exception is [M. Nicolescu and G. Medioni. Motion segmentation with accurate boundaries - a tensor voting approach. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 382-389, 2003], where each pixel may be assigned multiple flow vectors of equal priority. However, the segmentation there is only applied to consecutive image pairs. Furthermore, the objects in all three experiments were rigid and either the camera or the entire scene was static. Since optical flow estimation is prone to error, other algorithms avoid it altogether (e.g., [S. Jehan-Besson, M. Barlaud, and G. Aubert. DREAM2S: Deformable regions driven by an eulerian accurate minimization method for image and video segmentation. International Journal of Computer Vision, 53(1):45-70, 2003], [Y. Liu and Y. F. Zheng. Video object segmentation and tracking using ψ-learning classification. IEEE Transactions on Circuits, Systems, and Video, 15(7):885-899, 2005], [A. R. Mansouri. Region tracking via level set PDEs without motion computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):947-961, 2002], [S. Sun, D. R. Haynor, and Y. Kim. Semiautomatic video object segmentation using Vsnakes. IEEE Transactions on Circuits, Systems, and Video, 13(1):75-82, 2003]), but these algorithms tend to fail when the target is in proximity to areas of similar texture, and may erroneously classify newly appearing regions with different textures. This is shown in an example in [A. R. Mansouri. Region tracking via level set PDEs without motion computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):947-961, 2002], where occlusions and newly appearing areas are prohibited due to the modeling of image domain relations as bijections. Another exception to the optical flow point-estimation is [J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In Proceedings of the 6th IEEE International Conference on Computer Vision, pages 1154-1160, 1998], where a motion profile vector that captures the probability distribution of image velocity is computed per pixel, and motion similarity of neighboring pixels is approximated from the resemblance of their motion profiles. In the work here, the optical flow is neither estimated as a single hypothesis nor discarded; instead, the bitmap's PDF is constructed through a marginalization over all possible pixel motions (under a maximal flow assumption).
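The idea of marginalizing over pixel motions, rather than committing to one flow vector, can be illustrated with a minimal Python sketch. It is not the formulation of the invention: the function name `object_probability`, the Gaussian color-similarity weight, the grayscale frames, and the parameters `max_flow` and `sigma` are all assumptions made for illustration. Each current pixel considers every previous-frame pixel within `max_flow` as a possible origin, weights that hypothesis by color similarity, and averages the previous bitmap labels under those weights.

```python
import numpy as np

def object_probability(prev_frame, prev_bitmap, cur_frame, max_flow=2, sigma=10.0):
    """Hypothetical sketch: per-pixel probability of belonging to the object,
    marginalized over all candidate displacements up to max_flow pixels."""
    h, w = cur_frame.shape
    p = max_flow
    # Pad so every displacement hypothesis has a source pixel (edges are replicated).
    prev_pad = np.pad(prev_frame.astype(float), p, mode="edge")
    bitmap_pad = np.pad(prev_bitmap.astype(float), p, mode="edge")
    cur = cur_frame.astype(float)

    weighted_labels = np.zeros((h, w))
    total_weight = np.zeros((h, w))
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            # Hypothesis: current pixel (y, x) originated at previous pixel (y+dy, x+dx).
            src_color = prev_pad[p + dy:p + dy + h, p + dx:p + dx + w]
            src_label = bitmap_pad[p + dy:p + dy + h, p + dx:p + dx + w]
            # Assumed likelihood: Gaussian in the intensity difference.
            weight = np.exp(-((cur - src_color) ** 2) / (2.0 * sigma ** 2))
            weighted_labels += weight * src_label
            total_weight += weight
    # Sum over all hypotheses instead of picking the single best one.
    return weighted_labels / total_weight
```

A point-estimate method would instead keep only the displacement with the largest weight. Summing over all displacements keeps ambiguous matches from forcing a hard, possibly wrong, decision. With very small `sigma` and large intensity differences the weights could underflow, so a practical version would need to guard the final division.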
One class of video segmentation and tracking algorithms copes with general object shapes and motions in the context of an arbitrarily moving camera by tracking a nonparametric contour influenced by intensity/color edges (e.g., [S. Sun, D. R. Haynor, and Y. Kim. Semiautomatic video object segmentation using Vsnakes. IEEE Transactions on Circuits, Systems, and Video, 13(1):75-82, 2003]) and motion edges (e.g., [H. T. Nguyen, M. Worring, R. van den Boomgaard, and A. W. M. Smeulders. Tracking nonparameterized object contours in video. Image Processing, 11(9):1081-1091, 2002].) However, this kind of algorithm does not deal well with cluttered objects and partial occlusions, and may cling to irrelevant features in the face of color edges or additional moving edges in proximity to the tracked contour.
Many video segmentation and tracking algorithms perform spatial segmentation of each frame as a preprocessing step. The resulting segments of homogeneous color/intensity are then used as atomic regions composing objects (e.g., [R. Cucchiara, A. Prati, and R. Vezzani. Real-time motion segmentation from moving cameras. Real-Time Imaging, 10(3):127-143, 2004], [M. Gelgon and P. Bouthemy. A region-level motion-based graph representation and labeling for tracking a spatial image partition. Pattern Recognition, 33(4):725-740, 2000], [I. Patras, E. A. Hendriks, and R. L. Lagendijk. Video segmentation by MAP labeling of watershed segments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):326-332, 2001].) These algorithms also assign a parametric motion per segment. Rather than confining the final solution in a preprocessing step and making assumptions regarding the type of motion the segments undergo, the algorithm proposed here uses the aforementioned spatial color coherence assumption and works directly at pixel level.