One goal of visual motion analysis is to compute representations of image motion that allow one to infer the presence, structure, and identity of moving objects in an image sequence. Often, image sequences are depictions of three-dimensional (3D) “real world” events (scenes) that are recorded, for example, by a digital camera (other image sequences might include, for example, infra-red imagery or X-ray images). Such image sequences are typically stored as digital image data such that, when transmitted to a liquid crystal display (LCD) or other suitable playback device, the image sequence generates a series of two-dimensional (2D) image frames that depict the recorded 3D event. Visual motion analysis involves utilizing a computer and associated software to “break apart” the 2D image frames by identifying and isolating portions of the image data associated with moving objects appearing in the image sequence. Once isolated from the remaining image data, the moving objects can be, for example, tracked throughout the image sequence, or manipulated such that the moving objects are, for example, selectively deleted from the image sequence.
In order to obtain a stable description of an arbitrary number of moving objects in an image sequence, it is necessary for a visual motion analysis tool to identify the number and positions of the moving objects at a point in time (i.e., in a particular frame of the image sequence), and then to track the moving objects through the succeeding frames of the image sequence. This process requires detecting regions exhibiting the characteristics of moving objects, determining how many separate moving objects are in each region, determining the shape, size, and appearance of each moving object, and determining how fast and in what direction each object is moving. The process is complicated by objects that are, for example, rigid or deformable, smooth or highly textured, opaque or translucent, Lambertian or specular, active or passive. Further, the depth ordering of the objects must be determined from the 2D image data, and dependencies among the objects, such as the connectedness of articulated bodies, must be accounted for. This process is further complicated by appearance distortions of 3D objects due to rotation, orientation, or size variations resulting from changes in position of the moving object relative to the recording instrument. Accordingly, finding a stable description (i.e., a description that accurately accounts for each of the arbitrary moving objects) from the vast number of possible descriptions that can be generated by the image data can present an intractable computational task, particularly when the visual motion analysis tool is utilized to track objects in real time.
Many current approaches to motion analysis over relatively long image sequences are formulated as model-based tracking problems in which a user provides the number of objects, the appearance of objects, a model for object motion, and perhaps an initial guess about object position. These conventional model-based motion analysis techniques include people trackers for surveillance or human-computer interaction (HCI) applications in which detailed kinematic models of shape and motion are provided, and for which initialization usually must be done manually (see, for example, “Tracking People with Twists and Exponential Maps”, C. Bregler and J. Malik, Proc. Computer Vision and Pattern Recognition, CVPR-98, pages 8-15, Santa Barbara, June 1998). Recent success with curve-tracking of human shapes also relies on a user-specified model of the desired curve (see, for example, “Condensation—Conditional Density Propagation for Visual Tracking”, M. Isard and A. Blake, International Journal of Computer Vision, 29(1):2-28, 1998). For even more complex objects under differing illumination conditions it has been common to learn a model of object appearance from a training set of images prior to tracking (see, for example, “Efficient Region Tracking with Parametric Models of Geometry and Illumination”, G. D. Hager and P. N. Belhumeur, IEEE Trans. PAMI, 27(10):1025-1039, 1998). Whether a particular method tracks blobs to detect activities like football plays (see “Recognizing Planned, Multi-Person Action”, S. S. Intille and A. F. Bobick, Computer Vision and Image Understanding, 1(3):1077-3142, 2001), or specific classes of objects such as blood cells, satellites or hockey pucks, it is common to constrain the problem with a suitable model of object appearance and dynamics, along with a relatively simple form of data association (see, for example, “A Probabilistic Exclusion Principle for Tracking Multiple Objects”, J. MacCormick and A. Blake, Proceedings of the IEEE International Conference on Computer Vision, volume I, pages 572-578, Corfu, Greece, September 1999).
Other conventional visual motion analysis techniques address portions of the analysis process, but in each case fail both to identify an arbitrary number of moving objects and to track those objects in a manner that is efficient and accounts for occlusions. Current optical flow techniques provide reliable estimates of moving object velocity for smooth textured surfaces (see, for example, “Performance of Optical Flow Techniques”, J. L. Barron, D. J. Fleet, and S. S. Beauchemin, International Journal of Computer Vision, 12(1):43-77, 1994), but do not readily identify the moving objects of interest for generic scenes. Layered image representations provide a natural way to describe different image regions moving with different velocities (see, for example, “Mixture Models for Optical Flow Computation”, A. Jepson and M. J. Black, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 760-761, New York, June 1993), and they have been effective for separating foreground objects from backgrounds. However, in most approaches to layered motion analysis, the assignment of pixels to layers is done independently at each pixel, without an explicit model of spatial coherence (although see “Smoothness in Layers: Motion Segmentation Using Nonparametric Mixture Estimation”, Y. Weiss, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 520-526, Puerto Rico, June 1997). By contrast, in most natural scenes of interest the moving objects occupy compact regions of space.
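The per-pixel independence noted above can be illustrated with a minimal sketch. It is not drawn from any of the cited references: the function names `ownership` and `assign_layers`, the Gaussian error model, and the parameter `sigma` are all assumptions made for illustration. Each pixel is assigned to whichever layer best explains its motion residual, and neighboring pixels are never consulted.

```python
import math

def ownership(residuals, mixing, sigma=1.0):
    """Posterior probability that a single pixel belongs to each layer.

    Hypothetical helper: residuals[k] is the pixel's motion error under
    layer k, mixing[k] is the prior weight of layer k, and sigma is an
    assumed noise scale for a Gaussian error model.
    """
    likes = [m * math.exp(-(r * r) / (2.0 * sigma ** 2))
             for r, m in zip(residuals, mixing)]
    total = sum(likes)
    return [like / total for like in likes]

def assign_layers(residual_map, mixing, sigma=1.0):
    """Label every pixel independently with its most probable layer."""
    labels = []
    for residuals in residual_map:
        post = ownership(residuals, mixing, sigma)
        labels.append(max(range(len(post)), key=post.__getitem__))
    return labels

# Three adjacent pixels, two layers; the middle pixel has a noisy residual.
residual_map = [[0.1, 2.0], [2.0, 0.1], [0.1, 2.0]]
labels = assign_layers(residual_map, mixing=[0.5, 0.5])
```

In this toy input the middle pixel is assigned to a different layer than both of its neighbors, because nothing in the assignment rule rewards spatially compact regions.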
Another approach, taught by H. Tao, H. S. Sawhney, and R. Kumar in “Dynamic Layer Representation with Applications to Tracking”, Proc. IEEE Conference on Computer Vision and Pattern Recognition, Volume 2, pages 134-141, Hilton Head (June 2000), and referred to herein as the “Gaussian method”, addresses the analysis of multiple moving image regions utilizing a relatively simple parametric model for the spatial occupancy (support) of each layer. However, the spatial support of the parametric models used in the Gaussian method decays exponentially from the center of the object, and therefore fails to encourage the spatiotemporal coherence intrinsic to most objects (i.e., these parametric models do not represent region boundaries, and they do not explicitly represent and allow the estimation of relative depths). Accordingly, the Gaussian method fails to address occlusions (i.e., objects at different depths along a single line of sight will occlude one another). Without taking occlusion into account in an explicit fashion, motion analysis falls short of the expressiveness needed to separate changes in object size and shape from uncertainty regarding the boundary location. Moreover, when occlusions are not addressed, data association can be a significant problem when tracking multiple objects in close proximity, such as the parts of an articulated body.
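To make the limitation of a smoothly decaying support model concrete, the following is a minimal sketch under stated assumptions. It does not reproduce the model of the cited paper: the helper `gaussian_support`, the isotropic scale `sigma`, and the sample positions are hypothetical. It evaluates an unnormalized Gaussian occupancy weight at increasing distances from a layer center.

```python
import math

def gaussian_support(x, y, cx, cy, sigma):
    """Unnormalized isotropic Gaussian occupancy weight at pixel (x, y).

    Hypothetical helper: (cx, cy) is the assumed layer center and sigma
    the assumed spatial scale of the layer.
    """
    d2 = (x - cx) ** 2 + (y - cy) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))

# Sample the weight at x = 0, 4, 8, 12 along a line through the center.
weights = [gaussian_support(x, 0, 0, 0, sigma=4.0) for x in range(0, 13, 4)]
```

The sampled weights fall off steadily and never reach zero, so no position along the line marks where the object ends. This is the sense in which such a model does not represent a region boundary.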
What is needed is an efficient visual motion analysis method for detecting an arbitrary number of moving objects in an image sequence, and for reliably tracking the moving objects throughout the image sequence even when they occlude one another. In particular, what is needed is a compositional layered motion model with a moderate level of generic expressiveness that allows the analysis method to move from pixels to objects within an expressive framework that can resolve salient motion events of interest, and detect regularities in space-time that can be used to initialize models, such as a 3D person model. What is also needed is a class of representations that capture the salient structure of the time-varying image in an efficient way and facilitate the generation and comparison of different explanations of the image sequence, together with a method for detecting best-available models for image processing functions.