1. Field of the Invention
The present invention relates to digital video encoding, such as MPEG and AVI. More specifically, the present invention relates to methods of motion estimation for encoding digital images of a digital video stream.
2. Discussion of Related Art
Due to the advancement of semiconductor processing technology, integrated circuits (ICs) have greatly increased in functionality and complexity. With increasing processing and memory capabilities, many formerly analog tasks are being performed digitally. For example, images, audio and even full motion video can now be produced, distributed, and used in digital formats.
FIG. 1(a) is an illustrative diagram of a digital video stream 100. Digital video stream 100 comprises a series of individual digital images 100_0 to 100_N. Each digital image of a video stream is often called a frame. For full motion video, a frame rate of 60 images per second is desired. As illustrated in FIG. 1(b), a digital image 100_Z comprises a plurality of picture elements (pixels). Specifically, digital image 100_Z comprises Y rows of X pixels. For clarity, pixels in a digital image are identified using a 2-dimensional coordinate system. As shown in FIG. 1(b), pixel P(0,0) is in the top left corner of digital image 100_Z. Pixel P(X−1,0) is in the top right corner of digital image 100_Z. Pixel P(0,Y−1) is in the bottom left corner and pixel P(X−1, Y−1) is in the bottom right corner. Typical image sizes for digital video streams include 720×480, 640×480, 320×240 and 160×120.
FIG. 2 shows a typical digital video system 200, which includes a video capture device 210, a video encoder 220, a video channel 225, a video decoder 230, a video display 240, and an optional video storage system 250. Video capture device 210, typically a video camera, provides a video stream to video encoder 220. Video encoder 220 digitizes and encodes the video stream and sends the encoded digital video stream over channel 225 to video decoder 230. Video decoder 230 decodes the encoded video stream from channel 225 and displays the video images on video display 240. Channel 225 could be, for example, a local area network, the internet, telephone lines with modems, or any other communication connection. Video decoder 230 could also receive a video data stream from video storage system 250. Video storage system 250 can be, for example, a video compact disk system, a hard disk storing video data, or a digital video disk system.
A major problem with digital video system 200 is that channel 225 is typically limited in bandwidth. As explained above, a full motion digital video stream can comprise 60 images a second. Using an image size of 640×480, a full motion video stream would have approximately 18.4 million pixels per second. In a full color video stream each pixel comprises three bytes of color data. Thus, a full motion video stream would require a transfer rate in excess of 52 megabytes a second over channel 225. For internet applications, most users can only support a bandwidth of approximately 56 kilobits per second. Thus, to facilitate digital video over computer networks, such as the internet, digital video streams must be compressed.
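The bandwidth figures above can be reproduced with a few lines of arithmetic. The following Python sketch is illustrative only; the variable names are not taken from the original text, and the 56 kilobit figure is assumed to mean 56,000 bits per second.

```python
# Illustrative arithmetic only; names are hypothetical.
WIDTH, HEIGHT = 640, 480          # image size in pixels
FRAMES_PER_SECOND = 60            # full motion frame rate
BYTES_PER_PIXEL = 3               # full color: three bytes per pixel
MODEM_BITS_PER_SECOND = 56_000    # assumption: 56 kilobits = 56,000 bits

pixels_per_second = WIDTH * HEIGHT * FRAMES_PER_SECOND
bytes_per_second = pixels_per_second * BYTES_PER_PIXEL

# Ratio between the uncompressed bit rate and the assumed channel bandwidth.
required_ratio = bytes_per_second * 8 / MODEM_BITS_PER_SECOND

print(pixels_per_second)   # 18432000 pixels per second
print(bytes_per_second)    # 55296000 bytes per second
```

Under these assumptions, the uncompressed stream exceeds the channel bandwidth by a factor of roughly 7,900.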
One way to reduce the bandwidth requirement of a digital video stream is to avoid sending redundant information across channel 225. For example, as shown in FIG. 3, a digital video stream includes digital images 301 and 302. Digital image 301 includes a video object 310_1 and a video object 340_1 on a blank background. Digital image 302 includes a video object 310_2, which is the same as video object 310_1, and a video object 340_2, which is the same as video object 340_1. Rather than sending data for all the pixels of digital image 301 and digital image 302, a digital video stream could be encoded to simply send the information that video object 310_1 from digital image 301 has moved three pixels to the left and two pixels down and that video object 340_1 from digital image 301 has moved one pixel down and four pixels to the left. Thus, rather than sending all the pixels of image 302 across channel 225, video encoder 220 can send digital image 301 and the movement information, usually encoded as a two-dimensional motion vector, regarding the objects in digital image 301 to video decoder 230. Video decoder 230 can then generate digital image 302 using digital image 301 and the motion vectors supplied by video encoder 220. Similarly, additional digital images in the digital video stream containing digital images 301 and 302 can be generated from additional motion vectors.
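As a rough illustration of how a decoder might regenerate a second image from a first image plus a motion vector, consider the following Python sketch. The function name, the representation of an object as a list of pixel coordinates, and the sample image are all assumptions made for this example and do not appear in the figures. The sketch also assumes that the shifted object remains entirely inside the image.

```python
def move_object(image, object_pixels, dx, dy, background=0):
    """Return a copy of image with the listed pixels shifted by (dx, dy).

    image is a list of rows; object_pixels is a list of (x, y) tuples.
    Positive dx is to the right and positive dy is down.
    Assumes every shifted pixel stays inside the image.
    """
    moved = [row[:] for row in image]
    # Erase the object from its old position.
    for x, y in object_pixels:
        moved[y][x] = background
    # Draw the object at its new position using the original pixel values.
    for x, y in object_pixels:
        moved[y + dy][x + dx] = image[y][x]
    return moved


# A blank 8x6 image containing a two-pixel object with value 9.
first_image = [[0] * 8 for _ in range(6)]
first_image[1][5] = 9
first_image[1][6] = 9

# Three pixels to the left and two pixels down, as in the example above.
second_image = move_object(first_image, [(5, 1), (6, 1)], dx=-3, dy=2)
```

Only the object's pixel list and one two-component vector are needed to derive the second image, rather than every pixel of that image.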
However, most full motion video streams do not contain simple objects such as video objects 310_1 and 340_1. Object recognition in real life images is a very complicated and time-consuming process. Thus, motion vectors based on video objects are not well suited for encoding digital video data streams. However, it is possible to use motion vector encoding with artificial video objects. Rather than finding distinct objects in a digital image, the digital image is divided into a plurality of macroblocks. A macroblock is a number of adjacent pixels with a predetermined shape and size. Typically, a rectangular shape is used so that a rectangular digital image can be divided into an integer number of macroblocks. FIG. 4 illustrates a digital image 410 that is divided into a plurality of square macroblocks. For clarity, macroblocks are identified using a 2-dimensional coordinate system. As shown in FIG. 4, macroblock MB(0,0) is in the top left corner of digital image 410. Macroblock MB(X−1,0) is in the top right corner of digital image 410. Macroblock MB(0,Y−1) is in the bottom left corner and macroblock MB(X−1, Y−1) is in the bottom right corner. As illustrated in FIG. 5(a), a typical size for a macroblock 510 is eight pixels by eight pixels. As illustrated in FIG. 5(b), another typical size for a macroblock is 16 pixels by 16 pixels. For convenience, macroblocks and digital images are illustrated with bold lines after every four pixels in both the vertical and horizontal directions. These bold lines are for convenience only and have no bearing on the actual implementation of embodiments of the present invention.
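The division of an image into macroblocks can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes square macroblocks and an image whose dimensions are exact multiples of the macroblock size.

```python
def macroblock_grid(width, height, size):
    """Return (columns, rows) of size x size macroblocks in an image.

    Assumes square macroblocks; raises ValueError if the image cannot be
    divided into an integer number of macroblocks.
    """
    if width % size or height % size:
        raise ValueError("image is not an integer number of macroblocks")
    return width // size, height // size


columns, rows = macroblock_grid(640, 480, 16)
print(columns, rows, columns * rows)   # 40 30 1200
```

With 8×8 macroblocks the same 640×480 image would instead contain 80 columns and 60 rows of macroblocks.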
To encode a digital image using macroblocks and motion vectors, each macroblock MB(x, y) of a digital image is compared with the preceding digital image to determine which area of the preceding image best matches macroblock MB(x, y). For convenience, the area of the preceding image best matching a macroblock is called an origin block OB. Typically, an origin block has the same size and shape as the macroblock. To determine the best matching origin block, a difference measure is used to measure the amount of difference between the macroblock and each possible origin block. Typically, a value such as the luminance of each pixel in the macroblock is compared to the luminance of a corresponding pixel in the origin block. The sum of absolute differences (SAD) of all the values (such as luminance) is the difference measure. Other embodiments of the present invention may use other difference measures. For example, one embodiment of the present invention uses the sum of square differences as the difference measure. For clarity, only SAD is described in detail; those skilled in the art can easily adapt other difference measures for use with different embodiments of the present invention. The lower the difference measure, the better the match between the origin block and the macroblock.
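The two difference measures mentioned above can be sketched in Python as follows. The function names and the representation of a block as a list of rows of luminance values are assumptions made for illustration.

```python
def sum_of_absolute_differences(block_a, block_b):
    """SAD between two equally sized blocks given as lists of rows."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def sum_of_square_differences(block_a, block_b):
    """Sum of square differences between two equally sized blocks."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


# Two small 2x2 blocks of hypothetical luminance values.
macroblock = [[10, 20], [30, 40]]
candidate = [[12, 18], [30, 45]]

print(sum_of_absolute_differences(macroblock, candidate))  # 2 + 2 + 0 + 5 = 9
print(sum_of_square_differences(macroblock, candidate))    # 4 + 4 + 0 + 25 = 33
```

For either measure, a value of zero indicates that the two blocks are identical.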
The motion vector for macroblock MB(x, y) is simply the two-dimensional vector which defines the difference in location between a reference pixel on the origin block and a corresponding reference pixel on the macroblock. For convenience, the examples contained herein use the top left pixel of the macroblock and of the origin block as the reference pixel. Thus, for example, the reference pixel of macroblock MB(0,0) of FIG. 4 is pixel P(0,0). Similarly, the reference pixel of macroblock MB(X−1, Y−1), assuming an 8×8 macroblock, is pixel P(8*(X−1), 8*(Y−1)).
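The relationship between macroblock coordinates, reference pixels, and motion vectors can be sketched in Python as follows. The function names are hypothetical. The sign convention, in which the vector points from the origin block to the macroblock, is an assumption; the text above does not fix a direction.

```python
def reference_pixel(mb_x, mb_y, size=8):
    """Top left pixel (x, y) of macroblock MB(mb_x, mb_y)."""
    return (size * mb_x, size * mb_y)


def motion_vector(macroblock_ref, origin_ref):
    """Displacement from the origin block to the macroblock.

    Assumed convention: positive x is to the right, positive y is down.
    """
    return (macroblock_ref[0] - origin_ref[0],
            macroblock_ref[1] - origin_ref[1])


mb_ref = reference_pixel(3, 2)          # (24, 16) for an 8x8 macroblock
vector = motion_vector(mb_ref, (27, 14))
print(vector)                           # (-3, 2): three left, two down
```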
FIGS. 6(a)-6(f) illustrate a conventional matching method to find the origin block in preceding image 601 for a macroblock MB(x, y). The method in FIGS. 6(a)-6(f) is to compare each macroblock of a digital image with each pixel block, i.e., a block of pixels of the same size and same shape as the macroblock, in the preceding image to determine the difference measure for each pixel block. The pixel block with the lowest difference measure is determined to be the origin block of the macroblock. As illustrated in FIG. 6(a), a group of pixels 610, with reference pixel RP(0,0), in preceding image 601 is compared to an 8×8 macroblock MB(x, y) to determine a difference measure for the group of pixels 610. Then, as illustrated in FIG. 6(b), the group of pixels 620, with reference pixel RP(1,0), is compared to macroblock MB(x, y) to determine a difference measure for pixel block 620. Each pixel block having a reference pixel RP(j,0), where j is an integer from 0 to 19, inclusive, is compared to macroblock MB(x, y) to determine a difference measure for the pixel block. Finally, as illustrated in FIG. 6(c), pixel block 630, with reference pixel RP(19,0), is compared with macroblock MB(x, y) to find a difference measure for pixel block 630. In the method of FIGS. 6(a)-6(f), the last pixel block in a row must have at least half as many columns of pixels as macroblock MB(x, y). However, in some methods the last pixel block in a row may contain as few as one column. Thus, in these methods a pixel block having reference pixel RP(22,0) (not shown) may be used.
As illustrated in FIG. 6(d), after pixel block 630, pixel block 640, with reference pixel RP(0,1), is compared with macroblock MB(x, y) to determine a difference measure for pixel block 640. Each pixel block to the right of the group of pixels 650 is then in turn compared to macroblock MB(x, y) to determine a difference measure for each pixel block. Eventually, as illustrated in FIG. 6(e), the group of pixels 660 having reference pixel RP(19,1) is compared with macroblock MB(x, y) to determine a difference measure for pixel block 660. This process continues until finally, as illustrated in FIG. 6(f), pixel block 690, with reference pixel RP(19,11), is compared with macroblock MB(x, y) to determine a difference measure for pixel block 690. In some methods, the process continues until a pixel block having only one pixel within preceding image 601, e.g., a pixel block having reference pixel RP(22,15), is compared with macroblock MB(x, y). Furthermore, some methods may start with pixel blocks having reference pixels with negative coordinates. For example, rather than starting with pixel block 610 having reference pixel RP(0,0), some methods would start with a pixel block having reference pixel RP(−7,−7). Conventional padding techniques can be used to fill the pixel blocks which require pixels that are outside of preceding image 601. A common padding technique is to use a copy of the closest pixel from preceding image 601 for each pixel outside of preceding image 601.
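The exhaustive search described above, in the variant that allows pixel blocks extending past the image edges, can be sketched in Python as follows. The function names are hypothetical. The sketch assumes SAD as the difference measure, closest-pixel padding, a row by row scan order, and a search range covering every pixel block that overlaps the preceding image by at least one pixel.

```python
def padded_pixel(image, x, y):
    """Return the pixel at (x, y), copying the closest image pixel
    when (x, y) lies outside the image (closest-pixel padding)."""
    height, width = len(image), len(image[0])
    cx = min(max(x, 0), width - 1)
    cy = min(max(y, 0), height - 1)
    return image[cy][cx]


def block_sad(image, macroblock, rx, ry):
    """SAD between macroblock and the pixel block with reference pixel (rx, ry)."""
    return sum(abs(padded_pixel(image, rx + i, ry + j) - value)
               for j, row in enumerate(macroblock)
               for i, value in enumerate(row))


def full_search(image, macroblock):
    """Compare macroblock with every candidate pixel block of image.

    Returns ((rx, ry), sad) for the candidate with the lowest SAD; the
    first candidate found wins when several share the lowest value.
    """
    block_h, block_w = len(macroblock), len(macroblock[0])
    height, width = len(image), len(image[0])
    best = None
    for ry in range(-(block_h - 1), height):
        for rx in range(-(block_w - 1), width):
            score = block_sad(image, macroblock, rx, ry)
            if best is None or score < best[1]:
                best = ((rx, ry), score)
    return best


# Hypothetical 24x16 preceding image in which every pixel value is distinct.
preceding = [[x + 100 * y for x in range(24)] for y in range(16)]
# An 8x8 macroblock copied from reference pixel (5, 3) of that image.
block = [row[5:13] for row in preceding[3:11]]
origin, score = full_search(preceding, block)
```

In this test case the search examines 31 × 23 candidate positions and recovers reference pixel (5, 3) with a difference measure of zero.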
For large digital images, the method illustrated in FIGS. 6(a)-6(f) would require a very large number of calculations to encode a digital image. For example, a 640×480 image comprises 1200 16×16 macroblocks of 256 pixels each. Encoding a digital image from a preceding digital image would require comparing each of the 1200 16×16 macroblocks with each of the 298,304 pixel blocks (16×16 blocks) in the preceding image. Each comparison would require calculating 256 absolute differences. Thus, encoding a digital image requires calculating approximately 91.6 billion absolute differences. For many applications this large number of calculations is unacceptable. For example, real time digital video data may need to be encoded for live broadcasts over a computer network, such as the internet. Since a digital video sequence ideally has 60 frames per second, the method of FIGS. 6(a)-6(f) would require calculating approximately 5.5 trillion absolute differences per second. The computing power required to perform the calculations would be cost prohibitive for most applications. Hence there is a need for a method or structure to efficiently perform motion estimation for encoding digital video streams using motion vectors of macroblocks.
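The operation counts above can be checked with the following Python sketch. The variable names are hypothetical. The candidate count of 298,304 is taken from the text; it happens to equal (640 − 8) × (480 − 8), which would be consistent with requiring each pixel block to overlap the image by at least half a macroblock, but that derivation is an inference and is not stated in the text.

```python
# Illustrative arithmetic only; names are hypothetical.
macroblocks = (640 // 16) * (480 // 16)       # 1200 macroblocks per image
candidate_blocks = 298_304                    # pixel blocks per macroblock, from the text
differences_per_comparison = 16 * 16          # 256 absolute differences
frames_per_second = 60

differences_per_image = (macroblocks * candidate_blocks
                         * differences_per_comparison)
differences_per_second = differences_per_image * frames_per_second

print(differences_per_image)    # 91638988800, about 91.6 billion
print(differences_per_second)   # 5498339328000, about 5.5 trillion
```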