This invention relates to a digital video compression technology and, more particularly, to novel systems and methods for determining contour-based motion estimation for compressing and transmitting video images so that they may be accurately reconstructed for providing quality images.
Since digital computers were introduced in the 1930s, they have been used in many areas of industry, including communications and the video industry. One of the significant recent developments using digital computers involves data storage on appropriate storage media and data communication (involving video data) through local area networks and other communication networks such as wide area networks, the Internet, the World Wide Web, and others.
Video images as they can be seen on television or a computer screen are actually a series of still pictures. Each of the still pictures is called a frame. By showing the frames at a rapid rate, such as approximately 30 frames per second, human eyes can recognize the pictures as a moving scene. This invention concerns efficiently encoding and transmitting and accurately reconstructing and displaying video images.
For the purposes of this document, it will be useful to introduce terms with which the reader will need to be familiar in order to fully comprehend the disclosure contained herein. These terms are as follows:
B-frame: bidirectional predicted frame. A frame that is encoded with a reference to a past frame, a future frame or both.
Bitrate: the rate at which a device delivers a compressed bitstream to an input of another device.
I-frame: intra coded frame. A frame coded using information only from its own frame and without reference to any other frame.
I-VOP: intra coded video object plane. A video object plane coded using information only from the video object plane and not from any other video object plane.
IEC: International Electrotechnical Commission.
ISO: International Organization for Standardization.
Motion estimation: a process of estimating motion vectors for a video image.
MPEG: Moving Picture Experts Group. A group of representatives from major companies throughout the world working to standardize technologies involved in transmission of audio, video, and system data. Video coding standards are developed by the MPEG video group. The current official designation of the group is ISO/IEC JTC1/SC29/WG11.
MPEG-1: a standard for storage and retrieval of moving pictures and associated audio on storage media.
MPEG-2: a standard for digital television at data rates below 10 Mbit/sec. The study began in 1990 and the standard for video was issued in early 1994.
MPEG-3: a standard initially intended for coding of high-definition TV (HDTV). MPEG-3 was later merged into MPEG-2.
MPEG-4: a standard for multimedia applications. This phase of standardization started in 1994 to accommodate the telecommunications, computer and TV/film industries.
MPEG-7: a content representation standard for various types of multimedia information.
P-frame: forward predictive frame. A frame that has been compressed by encoding the difference between the current frame and the past reference frame.
P-VOP: forward predictive video object plane. A video object plane that has been compressed by encoding the difference between the video object plane and the past reference video object plane.
Pel: picture element in a digital sense. A pel is the digital version of a pixel in analog technology.
Video image: an image containing a video object, multiple video objects, a video object plane, an entire frame, or any other video data of interest.
VOP: video object plane as defined in MPEG-4. An image or video content of interest.
With the general meaning of this terminology in mind, a description of the general problems of the prior art and a detailed description of the operation of the invention are provided below.
Generally, when a video signal is digitized, a large amount of data is generated. For example, if a frame of a video image in a sequence of such frames is digitized as discrete grids or arrays with 360 pels (or pixels) per raster line and 288 lines/frame, approximately 311 Kbytes of memory capacity is necessary to store that one frame, assuming each pixel uses 8 bits for each of three color components. On a screen, a moving picture needs at least 30 frames per second to provide a realistic image. The raw data rate for a picture is therefore about 72 Mbits per second, or 4,320 Mbit (540 Mbyte) per minute. Consequently, it is almost impractical to store digital video data on a storage medium or to send several minutes of digital video data to another location.
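The arithmetic behind these figures can be checked with a short sketch. The function name `raw_video_rate` is hypothetical. Reproducing the 311 Kbyte figure requires 24 bits per pel (8 bits for each of three color components), which is an assumption made here; with 2**20 bits counted per Mbit the rate comes out slightly below the rounded 72 Mbit per second quoted above.

```python
def raw_video_rate(width, height, bits_per_pel, fps):
    """Return (bytes per frame, bits per second) for uncompressed video."""
    bytes_per_frame = width * height * bits_per_pel // 8
    bits_per_second = bytes_per_frame * 8 * fps
    return bytes_per_frame, bits_per_second


# Assumption: 24 bits per pel (three 8-bit color components).
frame_bytes, rate_bits = raw_video_rate(360, 288, 24, 30)

print(frame_bytes)                 # bytes needed for one frame
print(rate_bits / 2**20)           # Mbit per second (2**20 bits per Mbit)
print(rate_bits * 60 / 8 / 2**20)  # Mbyte per minute
```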
Moreover, real-time transmission of uncompressed video signals is impossible since no hardware currently available can provide the speed required to process the massive amount of data. Therefore, it is essential to compress the digital video data in order to generate moving pictures that are manageable using current hardware technology.
A number of attempts have been made in the prior art to accomplish video data compression. Researchers discovered that the compression ratios of conventional lossless methods, such as Huffman, arithmetic, and LZW coding, are not high enough for image and video compression. Fortunately, consecutive video pictures are usually quite similar from one to the next. Taking advantage of this, the prior art typically exploits common video characteristics, such as spatial redundancy, temporal redundancy, uniform motion, spatial masking, and others, to compress video picture data, as in Joint Photographic Experts Group (JPEG), H.261, Moving Picture Experts Group (MPEG), and other compression schemes.
One attempt to solve the problems of the prior art was made by a group called the Moving Picture Experts Group (MPEG) under the auspices of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). Formed in 1988, this group has accomplished standardization of compression techniques for video, audio, and system data which can be used throughout the world. Some of the standardization efforts of this group have resulted in the standards known as MPEG-1 and MPEG-2 (into which MPEG-3 was merged). At the present time, the Group is working on an MPEG-4 standard.
A typical compression technique, as adopted in the MPEG standard series, is based on the Discrete Cosine Transform (DCT) and motion compensation. The DCT-based compression is used to reduce spatial redundancy, and motion compensation is used to exploit temporal redundancy. Even though the Group working on MPEG-4 has adopted the Shape Adaptive Discrete Cosine Transform (SADCT), the basic concept behind both DCT and SADCT is the same.
In MPEG-1 and MPEG-2, a frame is usually encoded as one of three types: intra-frame (I-frame), forward predictive frame (P-frame), or bi-directional predicted frame (B-frame). An I-frame is a frame that has been encoded independently as a single image without reference to other frames. A P-frame is a frame that has been compressed by encoding the difference between the frame and a past reference frame, which is typically an I-frame or P-frame. A B-frame is a frame that has been encoded relative to a past reference frame, a future reference frame, or both. A typical group of encoded frames has a series of these types of frames in combination.
Each frame is typically divided into macroblocks. A macroblock consists of a 16×16 array of luminance (grayscale) samples together with one 8×8 block of samples for each of two chrominance (color) components. Macroblocks are the units of motion-compensated compression, and blocks are used for DCT compression.
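The relationship between a macroblock and its DCT blocks can be sketched as follows. This is only an illustration under the assumption that the 16×16 luminance array is divided into four 8×8 blocks for the DCT; the function name `split_macroblock` and the sample values are hypothetical.

```python
def split_macroblock(luma):
    """Split a 16x16 luminance array into four 8x8 blocks (row-major order)."""
    assert len(luma) == 16 and all(len(row) == 16 for row in luma)
    blocks = []
    for top in (0, 8):
        for left in (0, 8):
            blocks.append([row[left:left + 8] for row in luma[top:top + 8]])
    return blocks


# Hypothetical sample data: each value encodes its own position.
luma = [[16 * r + c for c in range(16)] for r in range(16)]
luma_blocks = split_macroblock(luma)

# One macroblock carries 16x16 luminance samples plus two 8x8 chrominance blocks.
samples_per_macroblock = 16 * 16 + 2 * 8 * 8
blocks_per_macroblock = len(luma_blocks) + 2
```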
When DCT compression is used, blocks are first transformed from the spatial domain into a frequency domain using the technique provided by DCT compression. Generally, DCT is a method of decomposing a block of data into a weighted sum of spatial frequencies. For example, an analog signal is sampled by discrete cosine functions with different spatial frequencies. Each of these spatial frequency patterns has a corresponding coefficient, which is the amplitude representing the contribution of that spatial frequency pattern to the block of data being analyzed. In an 8×8 DCT, each spatial frequency pattern is multiplied by its coefficient and the resulting 64 amplitude arrays (8×8) are summed, each pel separately, to reconstruct the 8×8 block. In the DCT compression technique, quantization is performed after the frequency conversion to significantly reduce the amount of data by reducing many of the non-zero coefficient values to zero.
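The weighted-sum view of the 8×8 DCT described above can be illustrated with a minimal sketch. It uses the orthonormal DCT-II definition and a single uniform quantization step; both choices, and all function names, are assumptions made for illustration and are not taken from any MPEG specification, which uses quantization matrices.

```python
import math

N = 8


def c(k):
    """Normalization factor of the orthonormal DCT-II."""
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)


def basis(u, v):
    """Spatial frequency pattern (u, v) as an 8x8 array."""
    return [[c(u) * c(v)
             * math.cos((2 * x + 1) * u * math.pi / (2 * N))
             * math.cos((2 * y + 1) * v * math.pi / (2 * N))
             for y in range(N)] for x in range(N)]


def dct2(block):
    """Forward DCT: one coefficient per spatial frequency pattern."""
    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            b = basis(u, v)
            coeffs[u][v] = sum(block[x][y] * b[x][y]
                               for x in range(N) for y in range(N))
    return coeffs


def idct2(coeffs):
    """Inverse DCT: sum the 64 patterns, each weighted by its coefficient."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            b = basis(u, v)
            for x in range(N):
                for y in range(N):
                    out[x][y] += coeffs[u][v] * b[x][y]
    return out


def quantize(coeffs, step):
    """Uniform quantization (assumed); small coefficients become zero."""
    return [[round(value / step) for value in row] for row in coeffs]


block = [[(3 * x + 5 * y) % 17 + 100 for y in range(N)] for x in range(N)]
restored = idct2(dct2(block))

flat = [[128] * N for _ in range(N)]
flat_levels = quantize(dct2(flat), 16)
nonzero = sum(1 for row in flat_levels for level in row if level != 0)
```

In this sketch a uniformly gray block keeps only its lowest-frequency coefficient after quantization, which is the data reduction the paragraph refers to.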
When macroblocks are reconstructed from I-frame, P-frame, and/or B-frame information, the macroblocks usually overlap each other. Macroblocks reconstructed by frame prediction, for example, do not form a clean frame because predicted macroblocks are usually shifted from their original positions. Therefore, a motion estimation for each macroblock is necessary to compensate for the shift. The prior art method of motion estimation is performed by comparing each pel of a macroblock array against a corresponding array of the next frame within a certain range. Motion-compensated coding is an example of inter-frame encoding. When a best matching array is found, a motion vector is calculated by comparing the current position with the previous position of the macroblock. The process of finding a motion vector for each macroblock has to be repeated for all macroblocks in the frame. As can be seen from this discussion, the necessary computations are complex, use significant computing resources, and result in an inaccurate image that must be corrected before it is displayed.
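The block-matching search described above can be sketched as an exhaustive comparison over a small search range. The text does not name a matching criterion, so the sum of absolute differences (SAD) is assumed here; the function names, the search range, and the sign convention of the vector (displacement from the block in the current frame to its best match in the reference frame) are likewise assumptions.

```python
def sad(cur, cy, cx, ref, ry, rx, size):
    """Sum of absolute pel differences between two size x size arrays."""
    return sum(abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
               for i in range(size) for j in range(size))


def best_motion_vector(ref, cur, top, left, size=16, search=4):
    """Compare the block of cur at (top, left) with every candidate
    position in ref within +/- search pels; return ((dy, dx), cost)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate array falls outside the frame
            cost = sad(cur, top, left, ref, y, x, size)
            if best is None or cost < best[1]:
                best = ((dy, dx), cost)
    return best


def frame_with_square(top, left, side=8, size=32):
    """Hypothetical test frame: a bright square on a dark background."""
    return [[200 if top <= r < top + side and left <= c < left + side else 0
             for c in range(size)] for r in range(size)]


ref_frame = frame_with_square(10, 10)
cur_frame = frame_with_square(12, 13)   # square moved down 2, right 3
vector, cost = best_motion_vector(ref_frame, cur_frame, top=8, left=8)
```

Even for this single macroblock the search evaluates 81 candidate positions of 256 pels each, which illustrates why repeating it for every macroblock in a frame is costly.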
In a relatively new development in the prior art, MPEG-4 supports content-based video functionality, which requires introduction of the concept of video object planes (VOPs). A frame can be segmented into a number of arbitrarily shaped image regions, which are video object planes. A VOP can be an image or a video content of interest. Unlike the video source format used in MPEG-1 and MPEG-2, video input is not necessarily a rectangular region. Since the MPEG-4 standard uses the VOP concept, the terms used for the encoding types of MPEG-4 are I-VOP, P-VOP, and B-VOP, instead of I-frame, P-frame, and B-frame as used for MPEG-1 and MPEG-2.
MPEG-4 uses both binary shape encoding and greyscale shape encoding. A video object of interest can be differentiated from the background. In binary shape encoding, a video object can be defined as either opaque or transparent to the background. In greyscale shape encoding, however, the relatedness of the video object to the background can also be defined on a scale from 0 (transparent) to 255 (opaque). MPEG-4 uses modified MMR coding for binary shape information, and motion compensated DCT coding for greyscale shape information.
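The difference between binary and greyscale shape values can be illustrated with a small sketch. Linear blending of an object pel over a background pel is a common way to interpret a 0 to 255 transparency scale and is assumed here for illustration; the function name `composite` and the sample values are hypothetical.

```python
def composite(object_pel, background_pel, shape_value):
    """Blend an object pel over a background pel.

    shape_value is 0 (transparent) to 255 (opaque). Binary shape
    information uses only the two extremes; greyscale shape
    information may use any value in between.
    """
    return (shape_value * object_pel
            + (255 - shape_value) * background_pel) // 255


opaque = composite(200, 50, 255)       # object fully hides the background
transparent = composite(200, 50, 0)    # background shows through unchanged
half = composite(200, 50, 128)         # greyscale: partial mix of both
```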
The prior art has some disadvantages which are generally recognized in the industry. First, when a frame is reconstructed by assembling macroblocks from the previous frame, realization of the original frame is usually impossible unless an accurate motion estimation is performed for each macroblock. This may result in a serious problem. For example, a shape divided into several macroblocks cannot be restored to its original picture with one continuous edge. An image segmented into several macroblocks does not align smoothly when it is reassembled. As a consequence, a continuous edge in the original shape becomes broken at the borders of each macroblock. The resulting poor image quality is a serious concern and can be improved by implementing a concept as disclosed in the present invention.
Moreover, the processing time involved in performing DCT compression, quantization, and motion estimation is very substantial. For example, an 8×8 block typically requires at least 1024 multiplications and 896 additions to perform DCT compression. In particular, the time required for motion estimation is great because motion estimation has to be performed for every macroblock defined in a frame. Implementing a system of this complexity also requires significant coding, introducing a substantial possibility of programming error.
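One way to arrive at these operation counts is to assume a direct row-column evaluation, in which a one-dimensional 8-point transform is applied to each of the 8 rows and then to each of the 8 columns. The text does not state how the figures were derived, so this derivation and the function name `row_column_dct_cost` are assumptions.

```python
def row_column_dct_cost(n=8):
    """Count operations for a direct row-column n x n DCT.

    Each 1-D n-point transform is a matrix-vector product:
    n outputs, each needing n multiplications and n - 1 additions.
    """
    mults_per_1d = n * n
    adds_per_1d = n * (n - 1)
    transforms = 2 * n          # n rows followed by n columns
    return transforms * mults_per_1d, transforms * adds_per_1d


multiplications, additions = row_column_dct_cost()
print(multiplications, additions)
```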
In view of the foregoing, it is a primary object of the present invention to provide a method and a system for contour-based motion estimation, which is capable of reconstructing a better quality video image without discontinuous contours. It is a feature of the invention that contour recognition is utilized, which results in an accurate image, unlike the prior art.
It is also an object of the present invention to provide a method and a system for determining a contour-based motion estimation, which is capable of efficiently reducing processing time for motion estimation of a video image. The invention includes an important feature of computational and processing simplicity, resulting in only moderate use of computing resources.
Further, it is an object of the present invention to provide a system for contour-based motion estimation which is capable of transmitting a relatively low number of motion vectors for a video image compared to the prior art. The invention, being capable of selecting data for transmission based on contour location changes, omits the redundant data transmission of the prior art and only transmits data required for an accurate image reconstruction.
It is still further an object of the present invention to provide a method and a system of determining motion estimation which is capable of generating overhead information to be transmitted that reduces the total amount of data transmitted. The overall information to be transmitted can be substantially less than in the prior art because relevant information can be transmitted as overhead information, not as actual picture data. Therefore, the present invention can reduce the redundancy of transmitted data.
Consistent with the foregoing objects, and in accordance with the invention as embodied and broadly described herein, a method and a system for contour-based motion estimation is disclosed in one embodiment of the present invention.
The foregoing and other objects and features of the present invention will become more fully apparent from the following description and appended claims, taken in conjunction with the accompanying drawings.