Recent advances in communication-based applications such as videophone and video conferencing systems have concentrated mainly on minimizing the size and cost of the coding equipment. A low cost for the final product is essential in the current age of technology. Most current real-time systems include video applications that must process huge amounts of data and require large communication bandwidth. Real-time video applications therefore include strategies to compress the data to a feasible size.
Among the different data compression techniques, object-based video representation, as addressed by MPEG-4, allows for content-based authoring, coding, distribution, search and manipulation of video data. In MPEG-4, the Video Object (VO) refers to spatio-temporal data pertinent to a semantically meaningful part of the video. A 2-D snapshot of a Video Object is referred to as a Video Object Plane (VOP). A 2-D triangular mesh is designed on the first appearance of a VOP as an extension of 3-D modeling. The vertices of the triangular patches are referred to as the nodes. The nodes of the initial mesh are then tracked from VOP to VOP. Therefore, the motion vectors of the node points in the mesh represent the 2-D motion of each VO. Motion compensation is achieved by warping each triangle from VOP to VOP using an affine transform.
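As an illustration of the triangle warping described above, the following is a minimal sketch, assuming the six-parameter affine form x' = a1*x + a2*y + a3, y' = a4*x + a5*y + a6. The function names (`affine_from_nodes`, `warp_point`, `track_triangle`) are hypothetical and are not taken from MPEG-4 or from any cited work. The three node motion vectors of one triangular patch determine the six parameters, which can then be applied to any point inside the patch.

```python
def affine_from_nodes(src, dst):
    """Return (a1, ..., a6) such that x' = a1*x + a2*y + a3 and
    y' = a4*x + a5*y + a6 maps the three src nodes onto the dst nodes."""
    (x1, y1), (x2, y2), (x3, y3) = src
    # Determinant of [[x1, y1, 1], [x2, y2, 1], [x3, y3, 1]].
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if det == 0:
        raise ValueError("degenerate triangle: nodes are collinear")

    def solve(u1, u2, u3):
        # Cramer's rule for one output coordinate (x' or y').
        a = (u1 * (y2 - y3) - y1 * (u2 - u3) + (u2 * y3 - u3 * y2)) / det
        b = (x1 * (u2 - u3) - u1 * (x2 - x3) + (x2 * u3 - x3 * u2)) / det
        c = (x1 * (y2 * u3 - y3 * u2) - y1 * (x2 * u3 - x3 * u2)
             + u1 * (x2 * y3 - x3 * y2)) / det
        return a, b, c

    xs = [p[0] for p in dst]
    ys = [p[1] for p in dst]
    return solve(*xs) + solve(*ys)


def warp_point(params, point):
    """Apply the affine parameters to one (x, y) position."""
    a1, a2, a3, a4, a5, a6 = params
    x, y = point
    return (a1 * x + a2 * y + a3, a4 * x + a5 * y + a6)


def track_triangle(nodes, motion_vectors):
    """Derive the affine parameters of a patch from its node motion vectors."""
    moved = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(nodes, motion_vectors)]
    return affine_from_nodes(nodes, moved)
```

For example, `track_triangle([(0, 0), (4, 0), (0, 4)], [(1, 2), (1, 2), (1, 2)])` describes a pure translation of the patch, while unequal motion vectors at the three nodes produce scaling, rotation or shear of the patch.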
Recently, hierarchical mesh representation has attracted attention because it provides rendering at various levels of detail. It also allows scalable/progressive transmission of the mesh geometry and motion vectors. Hierarchical mesh coding is used for transmission scalability, where the mesh can be coded at different resolutions to satisfy bandwidth constraints and/or quality of service (QoS) requirements.
Hierarchical 2-D mesh-based modeling of video sources has previously been addressed for the case of uniform topology only. The mesh is defined as a coarse-to-fine hierarchy, which is trivially achieved by subdividing each triangle or quadrangle into three or four subtriangles or quadrangles, as in C. L. Huang and C.-Y. Hsu, “A new motion compensation method for image sequence coding using hierarchical grid interpolation,” IEEE Trans. Circuits Syst. Video Technol., vol. 4, pp. 42-51, February 1994.
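The uniform coarse-to-fine subdivision mentioned above can be sketched as follows. This is only one possible scheme, assuming each triangle is split into four subtriangles at its edge midpoints; the helper names `midpoint`, `subdivide` and `refine` are hypothetical and do not come from the cited paper.

```python
def midpoint(p, q):
    """Midpoint of the edge joining nodes p and q."""
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)


def subdivide(tri):
    """Split one triangle into four subtriangles at its edge midpoints."""
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    # Three corner triangles plus the central one.
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]


def refine(mesh, levels):
    """Apply the uniform subdivision `levels` times to every triangle."""
    for _ in range(levels):
        mesh = [sub for tri in mesh for sub in subdivide(tri)]
    return mesh
```

Each refinement level multiplies the triangle count by four, so a coder could transmit a coarse level first and finer levels only when bandwidth allows.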
A basic requirement of an active tracking system is the ability to fixate on or track video objects using an active camera. Several real-time object tracking systems have been developed recently. T. Darrell's system combines stereo, color, and face detection modules into a single robust system. Pfinder (person finder) uses a multi-class statistical model of color and shape to obtain a 2-D representation of the head and hands in a wide range of viewing conditions. KidsRoom is a tracking system based on closed-world regions, which are regions of space and time in which the specific context of what is in the regions is assumed to be known. These regions are tracked in real-time domains where object motions are not smooth or rigid, and where multiple objects are interacting. Multivariate Gaussian models are applied to find the most likely match of human subjects between consecutive frames taken by cameras mounted in various locations. Lipton's system extracts moving targets from a real-time video stream, classifies them into pre-defined categories and tracks them. Because it uses correlation matching, it is primarily targeted at the tracking of rigid objects. Birchfield proposed an algorithm for tracking a person's head by modeling the head as an ellipse whose position and size are continually updated by a local search. The search combines the output of one module that concentrates on the intensity gradient around the ellipse's perimeter with that of another module that focuses on the color histogram of the ellipse's interior. Reid and Murray introduced monocular fixation using affine transfer as a way of fixating on a cluster of features, while at the same time respecting the transient nature of the individual features.
Mesh-based frame processing is disclosed and discussed in papers by Badawy, one of the inventors of this invention: “A low power VLSI architecture for mesh-based video motion tracking”, Badawy, W., and Bayoumi, M. A., IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 49, no. 7, pp. 488-504, July 2002; “On Minimizing Hierarchical Mesh Coding Overhead: (HASM) Hierarchical Adaptive Structured Mesh Approach”, Badawy, W., and Bayoumi, M., Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Istanbul, Turkey, June 2000, pp. 1923-1926; and “Algorithm Based Low Power VLSI Architecture for 2-D mesh Video-Object Motion Tracking”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 4, April 2002. The present invention is directed towards an improvement over the technology disclosed in the Badawy papers.
Coding of frame data requires a motion estimation kernel. Perhaps the most popular motion estimation kernel used for inter-frame video compression is the block matching model. This model is often preferred over others in video codec implementations because it does not involve complicated arithmetic operations, unlike other kernels such as the optical flow model. However, the block matching model has major limitations in the accuracy of the estimated motion, since it only allows inter-frame dynamics to be described as a series of block translations. As a result, any inter-frame dynamics related to the reshaping of video objects will be inaccurately represented.
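To make the translation-only nature of block matching concrete, the following is a minimal sketch of an exhaustive (full-search) block matcher, assuming the sum of absolute differences (SAD) as the matching cost and frames stored as lists of rows of integer pel values. The function names and the `search` range parameter are illustrative assumptions, not part of any particular codec.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n-by-n block at (bx, by) in
    the current frame and the block displaced by (dx, dy) in the reference."""
    return sum(abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
               for j in range(n) for i in range(n))


def block_match(cur, ref, bx, by, n, search):
    """Full search over displacements in [-search, search] on both axes.
    Returns (dx, dy, cost) for the lowest-cost candidate inside the frame."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x0, y0 = bx + dx, by + dy
            # Skip candidates that fall partly outside the reference frame.
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    if best is None:
        raise ValueError("no candidate block lies inside the reference frame")
    return best[1], best[2], best[0]
```

The only arithmetic involved is subtraction, absolute value and accumulation, which is why the kernel is inexpensive. The result is a single displacement per block, so rotation, scaling or shear of an object within the block cannot be expressed.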
The mesh-based motion analysis disclosed by Badawy addresses the shortcomings of block matching. In this model, an affine transform procedure is used to describe inter-frame dynamics, so that the reshaping of objects between video frames can be accounted for and the accuracy of the estimated motion can be improved. Since this model also does not require complicated arithmetic procedures, many developers have adopted it as a replacement for the block matching model in inter-frame codec implementations. Indeed, MPEG-4 has included the mesh-based motion analysis model as part of its standard.
The efficacy of the mesh-based motion analysis model, even with the improvements disclosed in this patent document, is often limited by the domain disparity between the affine transform function and the pel domain. In particular, since the affine transform is a continuous mapping function while the pel domain is discrete in nature, a discretization error results when the affine transform is applied to the pel domain. As pel values are not uniformly distributed in a video frame, even a minor discretization error may lead to totally incorrect pel value mappings. Hence, the quality of frames reconstructed by the mesh-based motion analysis model is often affected. The poor frame reconstruction quality becomes even more prominent in the later frames of a group-of-pictures (GOP), since these later frames are reconstructed from earlier reconstructed frames in the same GOP, and thus all prior losses in frame reconstruction quality are carried over.
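The discretization error described above can be illustrated with a small sketch, assuming the six-parameter affine form x' = a1*x + a2*y + a3, y' = a4*x + a5*y + a6 and nearest-pel rounding. The function name `map_pel` and the rounding rule are assumptions made for illustration; an actual codec may interpolate between neighbouring pels instead.

```python
import math


def map_pel(params, x, y):
    """Map integer pel (x, y) through the affine transform, snap the result
    to the nearest pel, and report the offset introduced by the snapping."""
    a1, a2, a3, a4, a5, a6 = params
    xc = a1 * x + a2 * y + a3          # continuous x position
    yc = a4 * x + a5 * y + a6          # continuous y position
    xq = math.floor(xc + 0.5)          # nearest pel column
    yq = math.floor(yc + 0.5)          # nearest pel row
    return (xq, yq), (xq - xc, yq - yc)
```

The second return value is the per-pel discretization error. When neighbouring pels differ sharply in value, for example across an object edge, an offset of a fraction of a pel is enough to select a value from the wrong side of the edge.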
To resolve the frame reconstruction quality problem in the mesh-based motion analysis model, residual coding techniques can be employed. These techniques apply Discrete Cosine Transform (DCT) encoding to the prediction difference between the original frame and the reconstructed frame, and thus achieve better frame reconstruction quality. However, the use of these techniques reduces the compression efficiency of the video sequence, since residual coding brings about a significant increase in the size of the compressed data stream. Residual coding via the matching pursuit technique may be used instead. This approach offers high quality residual coding with only very small additions to the compressed data stream. Nevertheless, it is not yet a feasible coding solution because of its high computational requirements.