The present disclosure relates to a system and method for creating a spatiotemporal image representation with information content that enables accurate and rapid processing. The disclosure finds particular application in connection with spatiotemporal (ST) image representation of a sequence of video images. However, it is to be appreciated that the present exemplary embodiments are also amenable to other like applications.
In the computer vision community, spatiotemporal processing techniques have recently gained more attention. Different methods have been proposed and studied by researchers for various applications.
Videos are usually viewed and processed by humans sequentially, with each 2-D video frame (having two spatial dimensions) played in temporal order. Traditional spatiotemporal representations of videos are usually obtained by stacking 1-D (spatial dimension) signals, one obtained from each 2-D video frame, in time order. The resulting 2-D ST images are therefore composed of one spatial dimension and one time dimension. One traditional way to obtain the 1-D spatial signals is to extract a row or column from the same location in each video frame, where the axis of the row or column becomes the spatial dimension. Successive frames yield successive extractions, which are stacked in time order to form a 2-D ST representation. Another known method for producing an ST representation from a video sequence projects (sums or averages) the whole or part of each 2-D video frame along one spatial dimension, reducing the frame from two spatial dimensions to a signal of one spatial dimension; the sequence of 1-D signals resulting from the projections is then stacked to form the 2-D ST image.
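The two traditional constructions described above can be sketched in a few lines of NumPy. This is a minimal illustration only, assuming the video is available as an array of shape (T, H, W); the function names `st_by_extraction` and `st_by_projection` and the synthetic moving-bar video are hypothetical and do not come from the disclosure.

```python
import numpy as np

def st_by_extraction(frames, index, line="row"):
    """Take the same row (or column) from every frame and stack them in time order."""
    frames = np.asarray(frames)            # expected shape: (T, H, W)
    if line == "row":
        return frames[:, index, :]         # shape (T, W): time x horizontal position
    return frames[:, :, index]             # shape (T, H): time x vertical position

def st_by_projection(frames, collapse="rows"):
    """Average each frame along one spatial axis and stack the 1-D results in time order."""
    frames = np.asarray(frames, dtype=float)
    axis = 1 if collapse == "rows" else 2  # axis 1 is height, axis 2 is width
    return frames.mean(axis=axis)

# Synthetic video (assumed for illustration): a vertical bar moving right 1 pixel per frame.
T, H, W = 8, 6, 10
video = np.zeros((T, H, W))
for t in range(T):
    video[t, :, t] = 1.0

st_row = st_by_extraction(video, index=H // 2, line="row")
st_proj = st_by_projection(video, collapse="rows")
```

In both resulting ST images the moving bar appears as a diagonal trace, whose slope reflects the bar's horizontal speed.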
Many existing applications using spatiotemporal representations of videos are focused on characterizing camera motion. One approach is directed at extracting the motion velocity, which corresponds to an orientation in spatiotemporal space, using a set of quadrature pairs of linear filters. Its examples consist of rigid patterns moving constantly in one direction with no background clutter. Other approaches have relied on different algorithms to estimate camera motion in the spatiotemporal domain. In one approach, video tomography is used to extract lens zoom, camera pan and camera tilt information from a video sequence, using the Hough transform to compute a linear camera model in the spatiotemporal images. A similar method analyzes a video sequence to characterize camera motion and involves determining a 2-D spatiotemporal representation in which trace lines are determined by quantizing the 2-D spatiotemporal representation and finding boundaries between the quantized regions. Camera motion is then inferred by analyzing the pattern of the trace lines using Hough transforms.
Other attempts have been directed at detecting moving objects in constrained situations, where objects usually move at a constant velocity in front of a static camera. These algorithms often involve detecting straight lines or planes using the Hough transform. In one example, the gait patterns generated by walking humans are analyzed using XT (spatiotemporal) slices. The Hough transform is then used to locate the straight-line patterns in the XT slices. In another example, a perceptual organization-based method is used to describe the motion in terms of compositions of planar patches in the 3-D spatiotemporal domain.
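To illustrate why constant-velocity motion lends itself to Hough-based detection, the following sketch runs a bare-bones Hough accumulator over a binary XT slice and converts the strongest line into a velocity. It is an assumed, simplified illustration and not the implementation used in any of the cited approaches; the function name `strongest_line`, the line parameterization and the synthetic slice are all hypothetical.

```python
import numpy as np

def strongest_line(binary_st, n_theta=180):
    """Return (rho, theta) of the accumulator peak for x*cos(theta) + t*sin(theta) = rho."""
    t_idx, x_idx = np.nonzero(binary_st)                 # rows are time, columns are x
    thetas = np.deg2rad(np.arange(n_theta) * 180.0 / n_theta)
    diag = int(np.ceil(np.hypot(*binary_st.shape)))      # upper bound on |rho|
    acc = np.zeros((2 * diag + 1, n_theta), dtype=int)
    for j, theta in enumerate(thetas):
        rho = np.rint(x_idx * np.cos(theta) + t_idx * np.sin(theta)).astype(int)
        acc[:, j] = np.bincount(rho + diag, minlength=2 * diag + 1)  # votes per rho bin
    r, j = np.unravel_index(np.argmax(acc), acc.shape)
    return r - diag, thetas[j]

# Synthetic XT slice (assumed): a point moving 1 pixel per frame traces the line x = t.
xt = np.zeros((20, 20), dtype=bool)
xt[np.arange(20), np.arange(20)] = True

rho, theta = strongest_line(xt)
velocity = -np.tan(theta)   # along the line, dx/dt = -tan(theta)
```

Because both angle and offset are quantized, the recovered velocity is only approximately 1 pixel per frame; a finer angular grid trades computation for precision.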
Yet another approach is directed at analyzing object motion in less constrained videos using spatiotemporal slices. This method involves using structure tensor analysis to first estimate the local orientation in each spatiotemporal slice. A 7-bin 2-D tensor histogram is then formed and the detected dominant motion is used as the background motion to reconstruct the background image in the spatial domain. Background subtraction is then used to roughly detect the foreground objects, and the results are further refined using color information.
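The orientation-estimation step of such a method can be sketched as follows. This is a minimal illustration under stated assumptions: it pools the structure tensor over a whole patch rather than computing it locally, and it omits the histogram, background reconstruction and color refinement stages; the function name `dominant_orientation` and the synthetic drifting pattern are hypothetical.

```python
import numpy as np

def dominant_orientation(patch):
    """Angle (radians) of the dominant intensity gradient in a (time, x) patch."""
    d_t, d_x = np.gradient(np.asarray(patch, dtype=float))  # derivatives along time, then x
    # Structure tensor entries, summed over the patch.
    j_xx = (d_x * d_x).sum()
    j_tt = (d_t * d_t).sum()
    j_xt = (d_x * d_t).sum()
    # Orientation of the principal eigenvector of [[j_xx, j_xt], [j_xt, j_tt]].
    return 0.5 * np.arctan2(2.0 * j_xt, j_xx - j_tt)

# Synthetic slice (assumed): a sinusoidal pattern drifting at 1 pixel per frame.
t, x = np.mgrid[0:32, 0:32]
patch = np.sin(0.5 * (x - 1.0 * t))

phi = dominant_orientation(patch)
speed = -np.tan(phi)   # brightness constancy: d_x * v + d_t = 0, so v = -d_t / d_x
```

Applying the same estimate to many small patches and binning the resulting angles would give an orientation histogram in the spirit of the method described above.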
More recently, researchers have started to use spatiotemporal features obtained directly from the 3-D video volume to assist action/activity/behavior detection and recognition.
Beyond characterizing camera motion, detecting or tracking moving objects, and representing local volume features, spatiotemporal methods are also used in, for example, video stabilization, visual attention extraction, block matching, parked car detection and human counting.
Current spatiotemporal processing methods and applications use one or multiple spatiotemporal signals/images/slices and process them separately, as shown in method 10 of FIG. 1. Integration, if any, may happen only after each spatiotemporal signal has been processed individually.