This invention relates to digital-video compression, and more particularly to motion estimation for video compression.
Digital video is the format commonly used with personal computers, digital-video cameras, and other electronic systems. Since a huge amount of memory or storage space is required to fully store all 30 or more frames per second of video, the images are usually compressed. Often sequential images in the video sequence differ only slightly. The difference from a previous (or following) image in the sequence can be detected and encoded, rather than the entire picture. Lossy compression techniques of this kind, such as MPEG encoding, are widely used.
During compression or encoding, each frame or image is divided into a grid of macroblocks. Each macroblock contains 16×16 pixels. A macroblock from a current frame or picture is compared to a range of macroblocks in a previous picture in the video sequence. Often a match or near-match is found at a different location. The difference in locations is known as a motion vector, since it indicates the movement of the macroblock between the two pictures. The motion vector rather than the entire macroblock can then be encoded for the new picture, saving storage space.
FIG. 1 illustrates motion estimation for a pair of digital-video pictures. A current picture 10 is compared to an old picture 12 in a video sequence. Old picture 12 could occur either before or after current picture 10 in the sequence when backward and forward estimation are used.
Macroblock 16 in current picture 10 is selected and compared to all macroblocks within search range 14 in old picture 12. A match or near-match is found with macroblock 16′. Because of movement of subjects in the pictures, macroblock 16′ from old picture 12 has moved to a new location in current picture 10. The difference in locations of macroblocks 16, 16′ is indicated by motion vector 18.
Rather than store all 16×16 pixels of macroblock 16 in the encoded video stream, only motion vector 18 and an identifier for macroblock 16′ need to be included. This reduces or compresses the size of the video stream.
Although the video stream is compressed, a large number of calculations are needed for motion estimation. The various macroblocks within a search range are usually evaluated by a sum of the absolute difference (SAD) method. The macroblock 16′ with the smallest SAD is the closest match to the macroblock 16 being searched.
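The SAD comparison described above can be sketched as follows. This is a minimal illustration, assuming blocks are given as lists of rows of integer luminance values; the function names `sad` and `best_match` are hypothetical and do not come from the specification.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def best_match(current_block, candidates):
    """Return (index, SAD) of the candidate block with the smallest SAD."""
    scores = [sad(current_block, candidate) for candidate in candidates]
    best_index = min(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]
```

The candidate with the minimum SAD is treated as the closest match; ties are resolved in favor of the first candidate, which is a choice made for this sketch only.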
For a search range of +/-127 and +/-63 pixels, a total of 32K macroblocks are evaluated, requiring 32K SAD operations. Each SAD operation requires 256 (16×16) subtractions, 256 absolute-value operations, and 255 2-input additions, a total of 767 arithmetic operations. A full search for one macroblock thus requires 32K×767 or about 24M calculations.
A 720×480-pixel picture has 1350 macroblocks, each of which may move independently and thus must be motion-estimated. So a total of 1350×24M or 32 G calculations are needed per picture. For a video having 30 frames per second, about 1 trillion operations per second are needed (1 T ops/sec). Thus full motion estimation requires large computing resources.
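The full-search cost can be reproduced with a few lines of arithmetic. The sketch below assumes the search range includes the zero displacement, so a +/-127 by +/-63 range gives 255×127 candidate positions; the exact products therefore come out slightly above the rounded 24M and 32 G figures quoted in the text. All names are illustrative.

```python
def sad_ops(block_w, block_h):
    """Arithmetic operations in one SAD: subtractions + absolute values + adds."""
    n = block_w * block_h
    return n + n + (n - 1)


def search_positions(range_x, range_y):
    """Candidate positions in a +/-range_x by +/-range_y search window."""
    return (2 * range_x + 1) * (2 * range_y + 1)


positions = search_positions(127, 63)           # 255 * 127 = 32385 (about 32K)
ops_per_macroblock = positions * sad_ops(16, 16)  # about 24.8 million
macroblocks = (720 // 16) * (480 // 16)         # 45 * 30 = 1350
ops_per_picture = ops_per_macroblock * macroblocks
ops_per_second = ops_per_picture * 30           # 30 frames per second

print(f"{ops_per_macroblock:,} ops per macroblock")
print(f"{ops_per_picture:,} ops per picture")
print(f"{ops_per_second:,} ops per second")
```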
Hierarchical Motion Estimation - FIG. 2
Computing requirements can be reduced by using a pyramid or hierarchical motion-estimation search. Pixels are averaged together to reduce the number of pixels in the picture, so that smaller search ranges and smaller macroblocks are used. This reduces the number of calculations.
FIG. 2 shows pyramid motion estimation. The term "pyramid" is used since successively smaller pictures are used for motion estimation searches. These smaller pictures are at higher levels of the "pyramid". For example, picture 22 represents the full-size picture of 720×480 pixels. The next level (level-2) of the pyramid is a reduced-size picture having only 360×240 pixels, about one-quarter the size of full picture 22. Level-2 picture 24 is generated from full-size picture 22 by averaging each 2×2 square of 4 pixels into a single pixel. The top of the pyramid is level-3 picture 26, which is created by 2×2 averaging of level-2 picture 24. Level-3 picture 26 has 180×120 pixels, only 1/16th the size of full-size picture 22.
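The 2×2 averaging that builds each higher pyramid level can be sketched as below. This is an illustration only: pictures are assumed to be lists of rows of integer luminance values, the average is rounded to the nearest integer (the text does not state a rounding rule), and the names `average_2x2` and `build_pyramid` are hypothetical.

```python
def average_2x2(picture):
    """Halve width and height by averaging each 2x2 square into one pixel.

    An odd trailing row or column, if any, is ignored in this sketch.
    """
    height, width = len(picture), len(picture[0])
    return [
        [
            (picture[y][x] + picture[y][x + 1]
             + picture[y + 1][x] + picture[y + 1][x + 1] + 2) // 4
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]


def build_pyramid(picture, levels=3):
    """Return [level-1, level-2, ...], each level one-quarter the size of the last."""
    pyramid = [picture]
    for _ in range(levels - 1):
        pyramid.append(average_2x2(pyramid[-1]))
    return pyramid
```

Applied to a 720×480 picture with `levels=3`, this yields 360×240 and 180×120 pictures, matching the level-2 and level-3 sizes given above.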
The macroblock size also becomes smaller with each higher level of the pyramid. For example, the 16×16 macroblock 20 of full picture 22 is reduced to an 8×8 macroblock in level-2 picture 24, and reduced further to a 4×4 macroblock in level-3 picture 26. The image 20′ in the selected macroblock from the old picture also becomes smaller with each higher level.
The search ranges are also reduced from +/-127, 63 to +/-63, 31 in level-2, and to +/-31, 15 in level-3. The smaller search ranges and the smaller macroblocks in higher levels require fewer arithmetic operations during a search within a higher level picture. For example, a search of the +/-31, 15 range of level-3 requires comparison of a 4×4 macroblock. Each SAD operation of a 4×4 macroblock requires 16 subtractions, 16 absolute-value operations, and 15 2-input adds, a total of only 47 operations (rather than 767). Only 63×31 (1953) SAD operations are required for the reduced search range at level-3. Thus a total of about 91K operations are needed for the level-3 search.
Multi-Level Search - FIG. 3
FIG. 3 shows multiple levels of motion-estimation search. A full-resolution picture 22 of a current frame is compared to a full-resolution picture 22′ of a prior (old) frame using multi-level searching.
The full picture 22 is reduced by 2×2 pixel averaging by reducer 32 to produce level-2 picture 24, which is one-quarter the size of full picture 22. This level-2 picture 24 is again reduced by 2×2 pixel averaging by reducer 33 to produce level-3 picture 26. Level-3 picture 26 is one-sixteenth the size of full picture 22.
Similar pixel-averaging operations occurred when the old picture was being processed, and the full, quarter, and sixteenth-size pictures 22′, 24′, 26′ were saved.
First, a coarse motion estimation search is performed at the top level by motion estimator 38. Motion estimator 38 selects a 4×4 macroblock in current level-3 picture 26 and compares it to all 4×4 pixel groupings in the search range of level-3 old picture 26′. The best four matches are sent to the next lower level, to motion estimator 36. The best four matches rather than the single-best match are sent to improve accuracy, allowing for averaging distortions. Rather than 4, the best n matches, where n is typically between 2 and 4, can be sent to the next lower level.
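The top-level search that keeps the best n matches can be sketched as follows. This is an illustrative reading of the step, not the patented circuit: pictures are assumed to be lists of rows of integers, candidate positions falling outside the old picture are skipped (the text does not say how borders are handled), and the names `block_sad` and `best_n_vectors` are hypothetical.

```python
import heapq


def block_sad(cur, old, cx, cy, ox, oy, size):
    """SAD between the size x size block at (cx, cy) in cur and (ox, oy) in old."""
    return sum(
        abs(cur[cy + j][cx + i] - old[oy + j][ox + i])
        for j in range(size)
        for i in range(size)
    )


def best_n_vectors(cur, old, cx, cy, size, range_x, range_y, n=4):
    """Search the whole window and return the n lowest (SAD, dx, dy) tuples."""
    height, width = len(old), len(old[0])
    scored = []
    for dy in range(-range_y, range_y + 1):
        for dx in range(-range_x, range_x + 1):
            ox, oy = cx + dx, cy + dy
            # Skip candidates that would extend past the picture edge.
            if 0 <= ox <= width - size and 0 <= oy <= height - size:
                scored.append((block_sad(cur, old, cx, cy, ox, oy, size), dx, dy))
    return heapq.nsmallest(n, scored)
```

For the level-3 search described above, the call would use `size=4`, `range_x=31`, `range_y=15`, and `n=4`.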
Motion estimator 36 then compares the selected macroblock for level-2 pictures 24, 24′. Rather than search over the entire search range, only the four best-match macroblocks from the level-3 search and their nearest neighbors are compared. Thus a search range of only 9 macroblocks is compared for each of the 4 best-fit vectors from level-3. A total of 9×4 or 36 macroblocks are compared by level-2 motion estimator 36. Less than 7K operations are required by level-2 motion estimator 36 per selected macroblock.
Finally, the best 4 motion vectors from level-2 motion estimator 36 are sent to level-1 motion estimator 34. Motion estimator 34 then compares each of the four best-match macroblocks and their 8 surrounding neighbors, or 9×4 macroblocks. These are 16×16 macroblocks, so a total of about 28K operations (36×767 = 27,612) are required by level-1 motion estimator 34. The motion vector for the best-fit macroblock is then output as the motion vector for that selected macroblock. Then the motion estimation can continue for other selected macroblocks in the current picture until all macroblocks have been processed.
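The refinement performed at the lower levels can be sketched as below. The sketch assumes that a motion vector from the level above is doubled when moving down one level (since each level has twice the resolution in each dimension), which the text implies but does not state; it then evaluates each scaled vector and its 8 neighbors. The names `block_sad` and `refine_vectors` are hypothetical.

```python
def block_sad(cur, old, cx, cy, ox, oy, size):
    """SAD between the size x size block at (cx, cy) in cur and (ox, oy) in old."""
    return sum(
        abs(cur[cy + j][cx + i] - old[oy + j][ox + i])
        for j in range(size)
        for i in range(size)
    )


def refine_vectors(cur, old, cx, cy, size, coarse_vectors, n=4):
    """Evaluate each scaled coarse vector and its 8 neighbors; keep the n best.

    coarse_vectors holds (vx, vy) pairs from the next higher pyramid level.
    Returns up to n (SAD, mx, my) tuples sorted by increasing SAD.
    """
    height, width = len(old), len(old[0])
    scored = set()  # a set avoids counting overlapping neighborhoods twice
    for vx, vy in coarse_vectors:
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                mx, my = 2 * vx + dx, 2 * vy + dy  # assumed 2x vector scaling
                ox, oy = cx + mx, cy + my
                if 0 <= ox <= width - size and 0 <= oy <= height - size:
                    scored.add((block_sad(cur, old, cx, cy, ox, oy, size), mx, my))
    return sorted(scored)[:n]
```

With four coarse vectors, at most 9×4 = 36 blocks are evaluated, matching the count given in the text. At level-1 the call would use `n=1` to obtain the single output motion vector.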
The total number of operations is 91K for level-3, 7K for level-2, and 28K for level-1, or 126K operations. This is a 99.5% reduction over the full search method.
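The operation totals above can be checked with the short calculation below. It is a sketch under the same counting convention as the text (one subtraction, one absolute value, and one add per pixel, less one add per block), and it uses 255×127 positions as the full-search count, which is an assumption about how the "32K" figure was derived.

```python
def sad_ops(size):
    """Operations in one SAD over a size x size block."""
    n = size * size
    return n + n + (n - 1)


level3_ops = 63 * 31 * sad_ops(4)    # full search of the level-3 range
level2_ops = 9 * 4 * sad_ops(8)      # 36 blocks of 8x8 pixels
level1_ops = 9 * 4 * sad_ops(16)     # 36 blocks of 16x16 pixels
hierarchical_ops = level3_ops + level2_ops + level1_ops

full_search_ops = 255 * 127 * sad_ops(16)
reduction = 1 - hierarchical_ops / full_search_ops

print(level3_ops, level2_ops, level1_ops, hierarchical_ops)
print(f"reduction: {reduction:.1%}")
```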
Each of the operations is an 8-bit operation, since the pixels are stored as the 8-bit luminance Y values of a YUV pixel. The U and V chromatic values can be ignored for motion estimation, so that the motion estimation is essentially performed on a simplified mono-color picture.
One variation is to convert each 8-bit pixel (Y) value to a 1-bit value before motion estimation in the lower levels. This further reduces computational requirements, since 1-bit (Boolean) logical operations can be used rather than 8-bit arithmetic operations. The top level (level-3) remains at 8 bits, so that the initial search is still accurate. See Song, Zhang, and Chiang, "Hierarchical motion estimation using binary pyramid with 3-scale tilings", SPIE Vol. 3309, 1997, pp. 80-87.
While such hierarchical motion estimation schemes are useful at reducing computational complexity, significant storage space is needed to store the upper level pictures, even though these are reduced in size. The variation using single-bit pixel values is too rough since so much of the pixel data is discarded.
What is desired is a motion estimation method that reduces storage requirements for reduced-resolution pictures. A hierarchical motion estimator is desired that operates on reduced-width pixels. It is desired to vary the number of bits per pixel for the different levels of the pyramid. A flexible motion estimator is desired that operates on picture levels with reduced-width pixels.
A motion estimator for compressing digital-video images has a memory for storing images containing digital pixels. A pixel averager receives a 2×2 group of pixels. It outputs one pixel as an average of four pixels. A width reducer receives a full-width pixel. It outputs a reduced-width pixel having fewer digital bits than the full-width pixel.
The memory temporarily stores a first image input to the motion estimator. It also stores a reduced-width level-1 image, generated by the width reducer from the first image. The reduced-width level-1 image contains reduced-width pixels having fewer bits per pixel than full-width pixels in the first image. The first image is deleted from the memory once the reduced-width level-1 image is generated.
The memory also temporarily stores a level-2 image generated by the pixel averager. The level-2 image has one-quarter of a number of pixels of the first image. A reduced-width level-2 image is generated by the width reducer from the level-2 image. The reduced-width level-2 image contains reduced-width pixels. The level-2 image is deleted from the memory once the reduced-width level-2 image is generated.
The memory also temporarily stores a level-3 image generated by the pixel averager. The level-3 image has one-quarter of a number of pixels of the level-2 image. A reduced-width level-3 image is generated by the width reducer from the level-3 image. The reduced-width level-3 image contains reduced-width pixels. The level-3 image is deleted from the memory once the reduced-width level-3 image is generated.
The memory also temporarily stores a level-4 image generated by the pixel averager. The level-4 image has one-quarter of a number of pixels of the level-3 image. A reduced-width level-4 image is generated by the width reducer from the level-4 image. The reduced-width level-4 image contains reduced-width pixels. The level-4 image is deleted from the memory once the reduced-width level-4 image is generated.
A calculator receives the reduced-width level-4 image and an old reduced-width level-4 image. It finds a matching block of reduced-width pixels that most-closely matches a selected block of pixels in the old reduced-width level-4 image. The calculator generates a level-4 motion vector identifying the matching block. The calculator also receives the reduced-width level-3 image and an old reduced-width level-3 image. It finds a matching block of reduced-width pixels within a search range determined by the level-4 motion vector. The matching block is a block within the search range that most-closely matches a selected block of pixels in the old reduced-width level-3 image. The calculator generates a level-3 motion vector identifying the matching block.
The calculator also receives the reduced-width level-2 image and an old reduced-width level-2 image. It finds a matching block of reduced-width pixels within a search range determined by the level-3 motion vector. The matching block is a block within the search range that most-closely matches a selected block of pixels in the old reduced-width level-2 image. The calculator generates a level-2 motion vector identifying the matching block.
The calculator also receives the reduced-width level-1 image and an old reduced-width level-1 image. It finds a matching block of reduced-width pixels within a search range determined by the level-2 motion vector. The matching block is a block within the search range that most-closely matches a selected block of pixels in the old reduced-width level-1 image. The calculator generates a level-1 motion vector identifying the matching block.
The level-1 motion vector is output to an encoded video stream as a substitute for the selected block. Thus reduced-width pixels are stored for motion estimation.
In further aspects the calculator determines a sum-of-absolute-difference (SAD) of the selected block of pixels and one of several target blocks of pixels in different images. The calculator generates the motion vector from a target block having a minimum SAD. The reduced-width pixels in the reduced-width level-2 and level-3 images have at least 2 fewer bits than the reduced-width pixels in the reduced-width level-1 and level-4 images. Thus wider pixels are used in the top and bottom levels.
In further aspects the reduced-width pixels in the reduced-width level-1 and level-4 images have 2 fewer bits than the full-width pixels. The reduced-width pixels in the reduced-width level-2 and level-3 images have 4 fewer bits than the full-width pixels.
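The effect of the per-level pixel widths on storage can be sketched as below. Two assumptions are made that the text does not confirm: full-width pixels are taken to be 8 bits (as in the 8-bit Y values discussed earlier), and width reduction is modeled as dropping least-significant bits with a right shift. The names `reduce_width` and `pyramid_storage_bits` are hypothetical.

```python
FULL_WIDTH_BITS = 8  # assumed full pixel width (8-bit luminance)

# Bits removed at each pyramid level, per the aspect described above.
BITS_DROPPED = {1: 2, 2: 4, 3: 4, 4: 2}


def reduce_width(pixel, bits_dropped):
    """Assumed width reduction: discard the least-significant bits."""
    return pixel >> bits_dropped


def pyramid_storage_bits(width, height, bits_dropped_per_level):
    """Total bits needed to store every level of the pyramid for one picture."""
    total = 0
    for level in sorted(bits_dropped_per_level):
        bits_per_pixel = FULL_WIDTH_BITS - bits_dropped_per_level[level]
        total += width * height * bits_per_pixel
        width, height = width // 2, height // 2  # next level is quarter-size
    return total


reduced = pyramid_storage_bits(720, 480, BITS_DROPPED)
full = pyramid_storage_bits(720, 480, {1: 0, 2: 0, 3: 0, 4: 0})
print(reduced, full, f"{1 - reduced / full:.1%} saved")
```

Under these assumptions a four-level pyramid for a 720×480 picture needs 2,538,000 bits with reduced-width pixels, compared with 3,672,000 bits at full width.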
In still further aspects the blocks are macroblocks having 16 by 16 pixels at level-1, but only 8 by 8 pixels at level-2, 4 by 4 pixels at level-3, and 2 by 2 pixels at level-4. The calculator generates at least four motion vectors for level-4, but only one motion vector for level-1. The calculator searches four search ranges in level-3 determined by four motion vectors from level-4. Thus multiple search ranges are searched in a level.