The analysis of motion information in video sequences has typically addressed two largely non-overlapping applications: video retrieval and video coding. In video retrieval systems, the dominant motion, motion trajectories and tempo are computed to identify particular video clips or sequences that are similar in terms of motion characteristics or belong to a distinct class (e.g., commercials). In video coding systems, global motion parameters are estimated for global motion compensation and for constructing sprites. In both video retrieval and video coding systems, it is desirable to identify pan and zoom global motion. For video retrieval systems, pan and zoom detection enables classification of video sequences (e.g., documentary movies) for efficient retrieval from video databases. For video coding systems, pan and zoom detection enables the adaptive switching of coding parameters (e.g., the selection of temporal and spatial Direct Modes in H.264).
Previous methods for detecting pan and zoom global motion in video sequences require estimating parameters of global motion, i.e., motion such that most of the image points are displaced in a uniform manner. Because the motion of many image points in a video frame is described by a small set of parameters related to camera parameters, estimating global motion parameters is a more constrained case than the estimation of motion parameters in all image points. The number of parameters obtained depends on the global motion model that is assumed to best describe the motion in the video sequence, for example, translational, affine, perspective, quadratic, etc., yielding 2, 6, 8 and 12 parameters, respectively. In particular, a perspective motion model yields the estimated coordinates $\hat{x}_i$, $\hat{y}_i$ using the old coordinates $x_i$, $y_i$ and the equations:

$$\hat{x}_i = \frac{a_0 + a_2 x_i + a_3 y_i}{a_6 x_i + a_7 y_i + 1} \qquad (1)$$

$$\hat{y}_i = \frac{a_1 + a_4 x_i + a_5 y_i}{a_6 x_i + a_7 y_i + 1} \qquad (2)$$

where $a_0, \ldots, a_7$ are the motion parameters. Other models can be obtained as particular cases of the perspective model. For example, if $a_6 = a_7 = 0$, the affine model (six parameters) is obtained, if $a_2 = a_5$, $a_3 = a_4 = a_6 = a_7 = 0$, the translation-zoom model (three parameters) is obtained, and if $a_2 = a_5 = 1$, $a_3 = a_4 = a_6 = a_7 = 0$, the translational model (two parameters) is obtained.
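By way of illustration only, the perspective model of equations (1) and (2) and its special cases can be sketched in a few lines of Python. The function names (`perspective_warp`, `affine`, `translation_zoom`, `translational`) are hypothetical and are not taken from any of the cited techniques; the sketch merely evaluates the two equations for a given parameter tuple.

```python
def perspective_warp(x, y, a):
    """Apply the eight-parameter perspective model of equations (1) and (2).

    `a` is the tuple (a0, ..., a7); the return value is the estimated
    coordinate pair for the old coordinates (x, y).
    """
    a0, a1, a2, a3, a4, a5, a6, a7 = a
    denom = a6 * x + a7 * y + 1.0  # shared denominator of (1) and (2)
    return (a0 + a2 * x + a3 * y) / denom, (a1 + a4 * x + a5 * y) / denom


def affine(a0, a1, a2, a3, a4, a5):
    """Affine model (six parameters): a6 = a7 = 0."""
    return (a0, a1, a2, a3, a4, a5, 0.0, 0.0)


def translation_zoom(tx, ty, s):
    """Translation-zoom model (three parameters): a2 = a5 = s, a3 = a4 = a6 = a7 = 0."""
    return (tx, ty, s, 0.0, 0.0, s, 0.0, 0.0)


def translational(tx, ty):
    """Translational model (two parameters): a2 = a5 = 1, a3 = a4 = a6 = a7 = 0."""
    return (tx, ty, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
```

For example, under these assumptions `perspective_warp(10, 20, translational(3, -2))` simply shifts the point by (3, -2), while `translation_zoom(0, 0, 2)` scales both coordinates by two.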
Global motion estimation can be formulated as an optimization problem, where the error between a current frame and a motion compensated previous frame is minimized. Techniques such as gradient descent and second order optimization procedures have been applied iteratively to solve the optimization problem. In Hirohisa Jozawa, et al., “Two-stage Motion Compensation Using Adaptive Global MC and Local Affine MC,” IEEE Trans. on Circuits and Systems for Video Tech., Vol. 7, No. 1, pp. 75-82, February 1997, global motion parameters are estimated using a two-stage motion compensation process. In the first stage, global motion is estimated and a global motion compensated picture is obtained. In the second stage, the global motion compensated picture is used as a reference for local motion compensation. The local motion compensation is performed both for the global motion compensated reference image and for the image without global motion compensation using an affine motion model in the framework of the H.263 standard.
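The optimization formulation can be illustrated with a deliberately simplified sketch. It assumes the two-parameter translational model and replaces the gradient descent and second order procedures mentioned above with an exhaustive search over integer displacements; it is not the method of Jozawa et al. The names `compensation_error` and `estimate_translation`, the `search` range, and the list-of-rows frame layout are all assumptions made for the example.

```python
def compensation_error(prev, curr, dx, dy):
    """Mean squared error between `curr` and `prev` displaced by (dx, dy).

    Content at (px, py) in `prev` is assumed to appear at (px + dx, py + dy)
    in `curr`. Only pixels whose source lies inside `prev` are counted.
    """
    h, w = len(curr), len(curr[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            px, py = x - dx, y - dy
            if 0 <= px < w and 0 <= py < h:
                d = curr[y][x] - prev[py][px]
                total += d * d
                count += 1
    return total / count if count else float("inf")


def estimate_translation(prev, curr, search=2):
    """Return (dx, dy, error) minimizing the compensation error.

    Exhaustive search over integer displacements in [-search, search];
    a stand-in for the iterative optimizers discussed in the text.
    """
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            err = compensation_error(prev, curr, dx, dy)
            if best is None or err < best[0]:
                best = (err, dx, dy)
    return best[1], best[2], best[0]
```

Even for this smallest model, every candidate displacement requires a pass over the frame, which gives a sense of why estimating six, eight or twelve parameters is computationally demanding.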
Other techniques for estimating global motion in video sequences have also been proposed. A technique proposed in Frederic Dufaux et al., “Efficient, Robust and Fast Global Motion Estimation for Video Coding,” IEEE Trans. on Image Processing, Vol. 9, No. 3, pp. 497-510, March 2000, includes a three-stage process. In a first stage, a low pass image pyramid is constructed by successive decompositions of the original picture. In a second stage, an initial estimation is performed. In a third stage, the initial estimate is refined using a gradient descent-based procedure. A perspective model with eight parameters is used in this technique to model camera motion.
In Gagan B. Rath, et al., “Iterative Least Squares and Compression Based Estimation for a Four-Parameter Linear Global Motion Model and Global Motion Compensation,” IEEE Trans. on Circuits and Systems for Video Tech., Vol. 9, No. 7, pp. 1075-1099, October 1999, a four-parameter model for global motion is employed for pan and zoom motion estimation. This technique uses iterative least squares estimation to accurately estimate parameters.
In Patrick Bouthemy, et al., “A Unified Approach to Shot Change Detection and Camera Motion Characterization,” IEEE Trans. on Circuits and Systems for Video Tech., Vol. 9, No. 7, pp. 1030-1040, October 1999, a unified approach to shot change detection and camera motion characterization is proposed. By using an affine motion model, global motion parameters are estimated and at the same time, the evolution of scene cuts and transitions is evaluated.
In Yap-Peng Tan, et al., “Rapid Estimation of Camera Motion from Compressed Video With Application to Video Annotation,” IEEE Trans. on Circuits and Systems for Video Tech., Vol. 10, No. 1, pp. 133-146, February 2000, camera motion parameters are estimated from compressed video, where macroblocks from P frames are used to estimate the unknown parameters of a global motion model.
All of the conventional methods described above require estimating global motion parameters to identify a specific type of global motion (e.g., pan, zoom or other). To estimate global motion, however, these conventional methods employ a generic motion model having global motion parameters that must be estimated. These global motion parameters are not necessary, however, for retrieving video sequences from databases. Nor are these global motion parameters necessary for parameter switching in video coding systems. Therefore, the conventional methods described above for estimating global motion unnecessarily increase the computational complexity of the application systems that employ such techniques.
Video retrieval systems can benefit from pan and zoom detection, which would allow identification of documentary movies and other sequences in video databases. Documentary movies include, for example, long panning clips that have a typical length of at least 10 seconds (i.e., approximately 240 frames at a frame rate of 23.976 fps). These long panning clips are often preceded or followed by zooms on scenes or objects of interest. Pan and zoom clips are also present in numerous other types of sequences, from cartoons and sports games to home videos. It is therefore of interest to retrieve video clips and sequences having common pan or zoom characteristics.
Pan and zoom detection in video sequences can also enhance the capabilities of an encoder in a standards-compliant system. It is well known that encoders that are compliant with the MPEG and ITU standards may be unconstrained in terms of analysis methods and parameter value selections, as well as various coding scenarios for given applications, as long as the resulting compressed bit streams are standards-compliant (i.e., can be decoded by any corresponding standardized decoder). The objective of performing various enhancements at the encoder side is bit rate reduction of the compressed streams while maintaining high visual quality in the decoded pictures. An example of such an enhancement is the selection of temporal and spatial Direct Modes described in the H.264 video coding standard.
In H.264, each frame of a video sequence is divided into pixel blocks having varying size (e.g., 4×4, 8×8, 16×16). These pixel blocks are coded using motion compensated predictive coding. A predicted pixel block may be an Intra (I) pixel block that uses no information from preceding pictures in its coding, a Unidirectionally Predicted (P) pixel block that uses information from one preceding picture, or a Bidirectionally Predicted (B) pixel block that uses information from one preceding picture and one future picture. The details of H.264 can be found in the publicly available MPEG and ITU-T, “Joint Final Committee Draft of Joint Video Specification ISO/IEC/JTC1/SC29/WG11 (MPEG) 14496-10 and ITU-T Rec. H.264,” Geneva, October 2002, which is incorporated by reference herein in its entirety.
For each pixel block in a P picture, a motion vector is computed. Using the motion vector, a prediction pixel block can be formed by translation of pixels in the aforementioned previous picture. The difference between the actual pixel block in the P picture and the prediction block is then coded for transmission. Each motion vector may also be transmitted via predictive coding. That is, a prediction is formed using nearby motion vectors that have already been sent, and then the difference between the actual motion vector and the prediction is coded for transmission. For each B pixel block, two motion vectors are typically computed, one for the aforementioned previous picture and one for the future picture. From these motion vectors, two prediction pixel blocks are computed, which are then averaged together to form the final prediction. The difference between the actual pixel block in the B picture and the prediction block is then coded for transmission. Each motion vector of a B pixel block may be transmitted via predictive coding. That is, a prediction is formed using nearby motion vectors that have already been transmitted, then the difference between the actual motion vector and the prediction is coded for transmission.
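The predictive coding of motion vectors and the averaging of two prediction blocks can be sketched as follows. The sketch assumes a component-wise median of an odd number of neighboring motion vectors as the predictor and a rounded integer average for the B prediction; the text above only states that a prediction is formed from nearby motion vectors and that the two predictions are averaged. All function names are hypothetical.

```python
def median_mv(neighbors):
    """Component-wise median of neighboring motion vectors.

    Assumes an odd number of neighbors, each an (mvx, mvy) pair.
    """
    xs = sorted(mv[0] for mv in neighbors)
    ys = sorted(mv[1] for mv in neighbors)
    mid = len(neighbors) // 2
    return xs[mid], ys[mid]


def mv_residual(actual, neighbors):
    """Encoder side: difference between the actual vector and its prediction."""
    px, py = median_mv(neighbors)
    return actual[0] - px, actual[1] - py


def reconstruct_mv(residual, neighbors):
    """Decoder side: the same prediction plus the transmitted difference."""
    px, py = median_mv(neighbors)
    return residual[0] + px, residual[1] + py


def bi_predict(block0, block1):
    """Average two prediction blocks (lists of rows) with rounding."""
    return [[(p0 + p1 + 1) // 2 for p0, p1 in zip(row0, row1)]
            for row0, row1 in zip(block0, block1)]
```

Because the decoder forms the same prediction from vectors it has already received, only the residual needs to be transmitted, and that residual is small whenever neighboring blocks move coherently, as they do during a pan.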
With B pixel blocks, however, the opportunity exists for interpolating the motion vectors from those in the co-located or nearby pixel blocks of the stored pictures. Note that when decoding a B slice, there exist two lists (list 0 and list 1) of reference pictures stored in the decoded picture buffer. For a pixel block in a B slice, the motion vectors may be interpolated either from the co-located pixel block, which is defined as the pixel block that resides in the same geometric location in the first reference picture in list 1, or from nearby pixel blocks of the stored pictures. The former case is known as the temporal direct mode. The latter case is known as the spatial direct mode. In both of these cases, the interpolated value may then be used as a prediction, and the difference between the actual motion vector and the prediction is coded for transmission. Such interpolation is carried out at both the coder and the decoder. In some cases, the interpolated motion vector is good enough to be used without any correction, in which case no motion vector data need be sent. This is referred to as Direct Mode in H.264 (and H.263). Note that the prediction error of a pixel block or subblock, which is computed as the mean square error between the original pixel block and the decoded pixel block after encoding using Direct Mode, is still transformed, quantized and entropy encoded prior to transmission. Direct Mode selection is particularly effective when the camera is slowly panning across a stationary background. Indeed, the interpolation may be good enough to be used as is, which means that no differential information need be transmitted for these B pixel block motion vectors. Therefore, for such sequences that allow good motion vector predictions using neighboring temporal or spatial information, the Direct Mode can provide important bit rate savings.
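As a rough illustration of the temporal case, the motion vector of the co-located pixel block can be scaled by the ratio of temporal distances to obtain the two vectors of the B pixel block. The sketch below is a simplification: it uses exact fractions, whereas H.264 specifies fixed-point scaling and clipping that are omitted here. The names `temporal_direct_mvs`, `tb` and `td` are chosen for the example, with `tb` taken to be the distance from the list 0 reference picture to the current B picture and `td` the distance from the list 0 reference picture to the list 1 reference picture.

```python
from fractions import Fraction


def temporal_direct_mvs(mv_col, tb, td):
    """Interpolate list 0 and list 1 motion vectors from a co-located vector.

    mv_col: (mvx, mvy) of the co-located block, spanning temporal distance td.
    tb:     temporal distance from the list 0 reference to the current picture.
    td:     temporal distance from the list 0 reference to the list 1 reference.
    Simplified linear scaling; the standard's integer arithmetic is omitted.
    """
    if td == 0:
        raise ValueError("td must be non-zero")
    scale = Fraction(tb, td)
    mv_l0 = (mv_col[0] * scale, mv_col[1] * scale)          # forward vector
    mv_l1 = (mv_l0[0] - mv_col[0], mv_l0[1] - mv_col[1])    # backward vector
    return mv_l0, mv_l1
```

For a B picture midway between its two references (`tb=1`, `td=2`), a co-located vector of (8, -4) yields a forward vector of (4, -2) and a backward vector of (-4, 2). Under a slow, steady pan the true motion is close to this linear interpolation, which is consistent with the observation above that no differential motion vector information need be sent in such cases.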
Accordingly, there is a need for a system and method for pan and zoom detection in video sequences that enable classification of video sequences (e.g., documentary movies) in video retrieval systems and adaptive switching of coding parameters (e.g., selection of temporal and spatial Direct Modes in H.264) in video coding systems, without performing the computationally intensive task of estimating all the parameters of a global motion model.