Detecting and tracking independent moving objects in video sequences are two fundamental computer vision tasks that have broad applications in video analysis and processing. Most of the current moving object detection algorithms are based on analysis of a sequence of individual video images in the spatial domain on a frame-by-frame basis. Object tracking algorithms typically require the use of object detection algorithms or human input to initialize the objects that should be tracked, and are also generally applied on a frame-by-frame basis.
One of the most common approaches to moving object detection is background subtraction, in which differences are calculated between the current frame and a reference background frame. Large pixel differences are taken to indicate a high probability of motion. This approach can work well in controlled settings, such as a static camera position with constant or slowly changing illumination. However, background subtraction methods break down when these conditions are not satisfied.
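As an illustration only, the background subtraction idea described above can be sketched in a few lines of NumPy. The function name `background_subtraction_mask`, the threshold value of 25, and the synthetic frames are assumptions introduced for this sketch; they are not taken from any of the cited references.

```python
import numpy as np

def background_subtraction_mask(frame, background, threshold=25):
    """Flag pixels whose absolute difference from the background exceeds a threshold.

    The threshold of 25 gray levels is an arbitrary illustrative choice.
    """
    # Widen the integer type first so the subtraction cannot wrap around for uint8 input.
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    return diff > threshold

# Synthetic example: a uniform background with a bright 2x3 "object" in the current frame.
background = np.full((8, 8), 100, dtype=np.uint8)
frame = background.copy()
frame[2:4, 3:6] = 200
mask = background_subtraction_mask(frame, background)
```

A sketch of this kind presumes that the background frame remains valid. Any camera motion or illumination change appears in the mask as spurious foreground, which is the limitation noted above.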
A variation of this approach involves computing differences between successive frames of the video sequence. Typically, the differences are determined after a stabilization or frame registration process has been applied in order to distinguish between background motion and foreground object motion. Both the background subtraction and frame differencing strategies provide difference images indicating image pixels that have changed. However, the identification of the moving object regions from these difference images remains a difficult problem.
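The registration-then-differencing strategy can likewise be sketched under strong simplifying assumptions. In the hypothetical example below, the background motion is taken to be a small integer translation, which is found by an exhaustive search, and `np.roll` is used for the shift so that image borders wrap around. Practical stabilization methods use richer motion models and handle borders explicitly. All function names and parameter values are illustrative.

```python
import numpy as np

def estimate_integer_shift(prev, curr, max_shift=3):
    """Exhaustively search for the (dy, dx) shift that best aligns prev with curr.

    The median absolute difference is used as the error so that a small
    foreground object does not dominate the match.
    """
    prev_f = prev.astype(np.float64)
    curr_f = curr.astype(np.float64)
    best_shift, best_err = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(prev_f, (dy, dx), axis=(0, 1))  # circular shift (assumption)
            err = np.median(np.abs(shifted - curr_f))
            if err < best_err:
                best_shift, best_err = (dy, dx), err
    return best_shift

def registered_difference_mask(prev, curr, threshold=50, max_shift=3):
    """Register prev to curr, then threshold the absolute frame difference."""
    dy, dx = estimate_integer_shift(prev, curr, max_shift)
    registered = np.roll(prev.astype(np.int32), (dy, dx), axis=(0, 1))
    return np.abs(curr.astype(np.int32) - registered) > threshold

# Synthetic example: a textured background translated by (1, 2) pixels,
# plus a bright 4x4 foreground patch that appears only in the current frame.
rng = np.random.default_rng(0)
prev = rng.integers(0, 101, size=(32, 32)).astype(np.uint8)
curr = np.roll(prev, (1, 2), axis=(0, 1))
curr[10:14, 10:14] = 255
shift = estimate_integer_shift(prev, curr)
mask = registered_difference_mask(prev, curr)
```

Even in this idealized case the output is only a per-pixel change mask. Grouping the changed pixels into object regions is a separate step, which is the difficulty noted above.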
Another popular approach to detecting moving objects applies an optical flow estimation process to the video sequence. A flow field segmentation algorithm is then used to identify regions of coherent motion. While optical flow algorithms provide pixel-level motion vectors, they are computationally intensive and are inevitably sensitive to noise.
Akutsu et al., in the article “Video tomography: an efficient method for camerawork extraction and motion analysis” (Proc. Second ACM International Conference on Multimedia, pp. 349-356, 1994), teach a method to extract lens zoom, camera pan and camera tilt information from a video sequence using a motion analysis technique. According to this method, the video is represented as a three-dimensional (3-D) spatiotemporal function. Cross-sections are taken through the 3-D spatiotemporal function to provide a two-dimensional (2-D) spatiotemporal representation with one spatial dimension and one time dimension. A Hough transform is applied to the 2-D spatiotemporal representation to extract zooming and panning parameters. This approach does not provide a means to separate, within the 2-D spatiotemporal representation, the motion pattern of foreground objects from the motion pattern of the background caused by the zooming and panning of the video camera.
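To make the notion of a 2-D spatiotemporal representation concrete, the following sketch stores a video as a NumPy array indexed by (time, row, column) and takes a cross-section at a fixed image row. The array layout, the function name `xt_slice`, and the synthetic panning video are assumptions made for illustration. The sketch does not reproduce the Hough-transform step of the cited method.

```python
import numpy as np

def xt_slice(video, row):
    """Return the 2-D slice of a (T, H, W) video at a fixed image row.

    The result has shape (T, W): one time dimension and one spatial dimension.
    """
    return video[:, row, :]

# Synthetic "camera pan": a fixed texture translated horizontally by
# pan_speed pixels per frame (with wrap-around, for simplicity).
rng = np.random.default_rng(0)
base = rng.integers(0, 256, size=(16, 32)).astype(np.uint8)
pan_speed = 2
video = np.stack([np.roll(base, t * pan_speed, axis=1) for t in range(10)])

slice_xt = xt_slice(video, row=8)
```

In such a slice, each scene point traces a straight streak whose slope is set by the pan speed, which is why a line detector such as a Hough transform can recover panning parameters from it.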
U.S. Pat. No. 6,411,339 to Akutsu et al., entitled “Method of spatio-temporally integrating/managing a plurality of videos and system for embodying the same, and recording medium for recording a program for the method,” uses a similar motion analysis method to estimate video camera motion. The determined camera motion is then used to align the video frame backgrounds so that foreground objects can be identified by computing differences between the aligned video frames.
Joly et al., in the article “Efficient automatic analysis of camera work and microsegmentation of video using spatiotemporal images” (Signal Processing: Image Communication, Vol. 8, pp. 295-307, 1996), teach a method for analyzing a video sequence to characterize camera motion. The method involves determining a 2-D spatiotemporal representation similar to the one described by Akutsu et al. Trace lines are determined by quantizing the 2-D spatiotemporal representation and finding boundaries between the quantized regions. Camera motion is then inferred by analyzing the pattern of the trace lines using Hough transforms. This method does not provide a means to separate the motion pattern of foreground objects from the motion pattern of the background caused by the zooming and motion of the video camera.
Ngo et al., in the article “Motion analysis and segmentation through spatio-temporal slices processing” (IEEE Trans. Image Processing, Vol. 12, pp. 341-355, 2003), describe a method for analyzing motion in a video image sequence using spatiotemporal slices. As with the method of Akutsu et al., the video is represented as a 3-D spatiotemporal function. The method involves using tensor analysis to determine motion information by analyzing the orientation of local structures in a 2-D spatiotemporal slice through the 3-D spatiotemporal space. Because any one slice intersects each video frame along only a single line, it is necessary to consider a large number of slices, which adds computational complexity. A clustering algorithm based on color similarity is applied to segment the video frames into background and foreground objects, so that objects with different colors can be separated from the background. Another approach proposed in the same article for separating moving objects from the background uses background subtraction in the spatial domain. The background image is reconstructed based on a detected dominant motion in spatiotemporal slices.
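The relationship between structure orientation in a 2-D spatiotemporal slice and motion can be illustrated with a simplified structure tensor computation. The sketch below accumulates a single tensor over the whole slice and converts its dominant orientation to a velocity in pixels per frame. This global simplification, the function name `dominant_velocity`, and the sinusoidal test pattern are assumptions made for illustration and are not the procedure of the cited article.

```python
import numpy as np

def dominant_velocity(slice_xt):
    """Estimate one dominant velocity (pixels/frame) from a (T, W) slice.

    The structure tensor [[Jxx, Jxt], [Jxt, Jtt]] is summed over the slice.
    For a pattern I(x, t) = f(x - v*t) the gradient is parallel to (1, -v),
    so the dominant gradient angle phi satisfies v = -tan(phi).
    """
    d_t, d_x = np.gradient(slice_xt.astype(np.float64))  # axis 0 is t, axis 1 is x
    # Drop the border, where np.gradient falls back to one-sided differences.
    d_t, d_x = d_t[1:-1, 1:-1], d_x[1:-1, 1:-1]
    j_xx = np.sum(d_x * d_x)
    j_xt = np.sum(d_x * d_t)
    j_tt = np.sum(d_t * d_t)
    phi = 0.5 * np.arctan2(2.0 * j_xt, j_xx - j_tt)
    return -np.tan(phi)

# Synthetic slice: a sinusoidal pattern moving at 0.5 pixels per frame.
t = np.arange(40)[:, None]
x = np.arange(128)[None, :]
true_velocity = 0.5
slice_xt = np.sin(2.0 * np.pi * (x - true_velocity * t) / 64.0)
estimated = dominant_velocity(slice_xt)
```

A single global tensor yields only one orientation per slice. A slice containing both background and foreground motion would require local tensors and some means of grouping them, and each slice still covers only one line of every frame.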
Niyogi et al., in the article “Analyzing gait with spatiotemporal surfaces” (IEEE Workshop on Motion of Non-Rigid and Articulated Objects, pp. 64-69, 1994), describe a method for analyzing patterns in spatiotemporal representations of a video sequence to evaluate gait of a walking individual. A stationary camera position is used and moving objects are identified by detecting changes in the captured images. Hough transforms are used in the process of determining a spatiotemporal surface associated with the moving object.
Sarkar et al., in the article “Perceptual organization based computational model for robust segmentation of moving object” (Computer Vision and Image Understanding, Vol. 86, pp. 141-170, 2002), teach a method for analyzing a video based on forming a 3-D spatiotemporal volume to find perceptual organizations. The method involves applying a 3-D edge detection process to the 3-D spatiotemporal volume and then using a Hough transform to detect planar structures in the 3-D data.
There remains a need for a computationally efficient method for analyzing video sequences to determine foreground and background motion estimates.