Digital video products and services, such as digital satellite service and video streaming over the Internet, are becoming increasingly popular and drawing significant attention in the marketplace. Because of limitations in digital signal storage capacity and in network and broadcast transmission bandwidth, there has been a need for compression of digital video signals for efficient storage and transmission of video images. For this reason, many standards for compression and encoding of digital video signals have been developed. For example, the International Telecommunication Union (ITU) has promulgated the H.261, H.263 and H.26L standards for digital video encoding. Additionally, the International Organization for Standardization (ISO) has promulgated the Moving Picture Experts Group (MPEG) MPEG-1 and MPEG-2 standards for digital video encoding.
These standards specify with particularity the form of encoded digital video signals and how such signals are to be decoded for presentation to a viewer. However, significant discretion is allowed in selecting how digital video signals are transformed from an uncompressed format to a compressed, or encoded, format. For this reason, there are many different digital video signal encoders available today. These various digital video signal encoders may achieve varying degrees of compression.
It is desirable for a digital video signal encoder to achieve a high degree of compression without significant loss of image quality. Video signal compression is generally achieved by representing identical or similar portions of an image as infrequently as possible to avoid redundancy. A digital motion video image, which may be referred to as a “video stream”, may be organized hierarchically into groups of pictures, each of which includes one or more frames, and each frame may represent a single image of a sequence of images of the video stream. All frames may be compressed by reducing redundancy of image data within a single frame. Motion-compensated frames may be further compressed by reducing redundancy of image data within a sequence of frames.
Motion video compression may be based on the assumption that little change occurs between frames. This is frequently the case for many video signals. This assumption may be used to improve motion video compression because a significant quantity of picture information may be obtained from the previous frame. In this way, only the portions of the picture that have changed need to be stored or transmitted.
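As a rough illustration of the idea that only changed portions of a picture need to be stored or transmitted, the following sketch compares two frames block by block and reports which blocks differ. The function name `changed_blocks`, the 4×4 block size, and the representation of a frame as a list of rows of pixel values are assumptions made for this example only; they are not taken from any standard.

```python
def changed_blocks(previous, current, block=4):
    """Return the top-left (x, y) corner of every block that differs
    between two equally sized frames (lists of rows of pixel values)."""
    height, width = len(current), len(current[0])
    changed = []
    for by in range(0, height, block):
        for bx in range(0, width, block):
            # A block counts as changed if any pixel inside it differs.
            differs = any(
                previous[y][x] != current[y][x]
                for y in range(by, min(by + block, height))
                for x in range(bx, min(bx + block, width))
            )
            if differs:
                changed.append((bx, by))
    return changed


# Hypothetical 8x8 frames: a single pixel changes between them.
previous = [[0] * 8 for _ in range(8)]
current = [row[:] for row in previous]
current[5][6] = 200  # row 5, column 6

print(changed_blocks(previous, current))  # [(4, 4)]
```

In this toy case only one of the four 4×4 blocks would need to be sent; the other three can be taken unchanged from the previous frame.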
Each video frame may include a number of macroblocks that define respective portions of the video image of the video frame. The term macroblock refers to a 16×16 pixel region. Other block sizes, namely 8×16, 16×8, 8×8, 4×8, 8×4 and 4×4, are derived by subdividing the 16×16 macroblock. A motion vector may be used in mapping blocks from one video frame to corresponding blocks of a temporally displaced video frame. A motion vector maps a spatial displacement within the temporally displaced frame of a relatively closely correlated block of picture elements, or pixels. In frames in which subject matter is moving, motion vectors representing spatial displacement may identify a corresponding block that matches a previous block rather closely.
This is also true when the video sequence includes a camera pan, i.e., a generally uniform spatial displacement of the entirety of the subject matter of the motion video image. In a camera pan, most of the picture information from the previous frame may still be the same, but it may be at a new location in the current picture frame. It is important to know where objects in the current video frame have moved relative to the previous video frame so that as much information can be carried forward from the previous frame as possible. A search to determine where motion has taken place from a reference frame to a current frame is known as “motion estimation”.
Motion estimation may be performed by calculating the similarity between two identically placed regions in the previous and current video frames. To calculate the difference, the sum of absolute differences (SAD) may be used. The result of the SAD is often called “distortion”, as it measures how different two areas of the previous and current frames are. Distortion may be computed as:

$$\text{distortion} = \sum_{x,y} \left| \text{previous}(x,y) - \text{current}(x,y) \right| \qquad (1)$$

where previous(x,y) is the pixel value at location (x,y) of a previous frame of video and current(x,y) is the pixel value at location (x,y) of a current frame of video. Rate-distortion means considering not only the similarity of the picture regions but also how large a vector the motion has, i.e., how far an object has traveled. This vector must be stored, and therefore is a cost that must be considered. For this reason, motion estimation is usually performed by a motion search over many nearby locations (i.e., the motion vector is not too long). The optimal solution is found by comparing the rate-distortions of all possible choices.
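The SAD computation and the nearby-location search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular standard: the names `sad` and `motion_search`, the weighting factor `lam`, and the use of |mvx| + |mvy| as a stand-in for the cost of storing a motion vector are all hypothetical choices made for the example.

```python
def sad(previous, current, px, py, cx, cy, size):
    """Sum of absolute differences between the size x size block of
    `previous` at (px, py) and the block of `current` at (cx, cy)."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(previous[py + dy][px + dx] - current[cy + dy][cx + dx])
    return total


def motion_search(previous, current, cx, cy, size, search_range, lam=1):
    """Exhaustively test every motion vector within +/- search_range and
    return (cost, (mvx, mvy)) with the lowest rate-distortion cost."""
    height, width = len(previous), len(previous[0])
    best = None
    for mvy in range(-search_range, search_range + 1):
        for mvx in range(-search_range, search_range + 1):
            px, py = cx + mvx, cy + mvy
            # Skip candidates that fall outside the previous frame.
            if px < 0 or py < 0 or px + size > width or py + size > height:
                continue
            distortion = sad(previous, current, px, py, cx, cy, size)
            rate = abs(mvx) + abs(mvy)  # assumed proxy for vector cost
            cost = distortion + lam * rate
            if best is None or cost < best[0]:
                best = (cost, (mvx, mvy))
    return best


def make_frame(patch_x, patch_y, width=8, height=8):
    """Hypothetical test frame: black with a bright 2x2 patch."""
    frame = [[0] * width for _ in range(height)]
    for y in range(patch_y, patch_y + 2):
        for x in range(patch_x, patch_x + 2):
            frame[y][x] = 255
    return frame


previous = make_frame(2, 2)
current = make_frame(3, 2)  # the patch has moved one pixel to the right

cost, mv = motion_search(previous, current, cx=3, cy=2, size=2, search_range=2)
print(cost, mv)  # 1 (-1, 0)
```

Here the zero-motion candidate has a distortion of 510, while the vector (-1, 0) points back to the patch's old position with zero distortion, so its total cost is only the assumed vector cost of 1.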
Of course, change in the picture from frame to frame will not only happen because of camera motion. Objects within a video frame can also move, e.g., a stationary camera recording a person who is walking past the frame of view. In cases such as this, it is possible that only small regions of the picture have moved, and other small regions have remained in place. Further, for video content such as sports, it's possible for many small objects to be moving in different directions.
Motion estimation must be capable of dealing with both coarse-grain motion (large objects moving or camera pan) and fine-grain motion (small objects moving). For this reason, H.26L uses 7 different sizes of regions to estimate motion. These are usually called blocks. These sizes include: 16×16, 8×16, 16×8, 8×8, 4×8, 8×4 and 4×4. The larger block sizes are for coarse-grain motion, the smaller block sizes for fine-grain motion. These sizes are in terms of pixels (individual color dots in the picture). However, performing a motion search for all of these block sizes is very expensive. H.26L states that a motion search should be performed for all of them, but we have discovered a better way.
It is important to note that smaller block sizes are more expensive to store than larger block sizes because each block has a motion vector. In other words, an entire 16×16 region can be described with a single motion vector, whereas the same region divided into 4×4 blocks needs 16 motion vectors. Because of this and the fact that most motion in video is coarse-grain, the 16×16 block size is usually selected as the best or preferred block size.
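The storage argument can be made concrete by counting how many motion vectors a single 16×16 macroblock needs for each of the seven block sizes. The short sketch below does only that arithmetic; the names `BLOCK_SIZES` and `vectors_per_macroblock` are chosen for this example.

```python
MACROBLOCK = 16  # macroblock width and height in pixels

# The seven block sizes listed above, as (width, height) in pixels.
BLOCK_SIZES = [(16, 16), (8, 16), (16, 8), (8, 8), (4, 8), (8, 4), (4, 4)]


def vectors_per_macroblock(width, height):
    """Number of blocks, and hence motion vectors, needed to cover one
    16x16 macroblock with blocks of the given size."""
    return (MACROBLOCK // width) * (MACROBLOCK // height)


for width, height in BLOCK_SIZES:
    print(f"{width}x{height}: {vectors_per_macroblock(width, height)} motion vector(s)")
```

The count grows from 1 vector for a 16×16 block to 16 vectors for 4×4 blocks, which is the cost that weighs against small block sizes when motion is coarse-grain.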
While there are sophisticated methods for performing image compression, they tend to be expensive. Thus, there still exists a need in the art for a method and system for image compression that reduces computational complexity and increases speed of motion video image compression.