The present invention relates generally to video compression, and specifically to fast determination of relative block motion between successive video frames.
Current video compression standards, such as MPEG-1/2/4 and H.261/263, involve estimation of motion of image blocks between consecutive frames in a video sequence being compressed. Most of the images in the sequence are represented by the movements of their blocks relative to earlier frames. During compression, a current block in a current frame is compared to blocks in a preceding frame, to find a block in the preceding frame most closely correlated with the current block. In the compressed frame, the block is indicated by the position of the most closely-correlated block in the preceding frame. The correlation is ordinarily measured by a sum of the absolute differences or squared differences between the values of the respective pixels of the blocks.
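The sum-of-absolute-differences measure mentioned above can be sketched in portable C as follows. This is a minimal illustration rather than code from any particular encoder; the function name `sad16x16`, the stride parameters, and the assumption of one byte per pixel are all hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

enum { BLOCK = 16 };  /* assumed block size: 16x16 pixels, one byte each */

/* Sum of absolute differences between two 16x16 blocks.
 * cur and ref point at the top-left pixel of each block; each stride is
 * the number of bytes between vertically adjacent pixels in that frame.
 * A smaller result means a closer match. */
unsigned sad16x16(const uint8_t *cur, size_t cur_stride,
                  const uint8_t *ref, size_t ref_stride)
{
    unsigned sum = 0;
    for (int y = 0; y < BLOCK; y++) {
        for (int x = 0; x < BLOCK; x++)
            sum += (unsigned)abs((int)cur[x] - (int)ref[x]);
        cur += cur_stride;   /* advance both pointers to the next row */
        ref += ref_stride;
    }
    return sum;
}
```

Each call touches 256 pixel pairs, which is why the number of candidate blocks examined dominates the cost of motion estimation.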
Comparison of the blocks, which normally comprise 16×16 pixels each, requires large amounts of processing time. In order to reduce the required time, comparisons are performed only with blocks in a neighborhood of the current block, for example those blocks displaced by up to 15 pixels in either or both of the x and y directions.
To further reduce the number of comparisons, various methods have been introduced in which the current block is compared to a selected number of the blocks in the preceding frame. A number of these methods are described in “Image and Video Compression Standards,” by V. Bhaskaran and K. Konstantinides, Kluwer Academic Publishers, 1995, which is incorporated herein by reference. One such method is a two-dimensional logarithmic search method, in which the current block (x,y) (wherein x,y are the coordinates of a fixed point in the block, for example, the top left pixel of the block) is compared iteratively to groups of nine blocks in the preceding frame. In a first iteration, the nine blocks of the preceding frame include a center block, having the same coordinates (x,y) as the current block, and the eight surrounding blocks displaced by 8 pixels in one or both of the x and y directions from the center block. The block with the best correlation among the nine blocks is used as a new center block in the second iteration. In the second iteration, the current block is compared to the new center block and the eight surrounding blocks displaced by 4 pixels from the center block. This process is repeated for distances of two pixels and one pixel.
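The iterative nine-block search described above can be sketched as follows. This is an illustrative scalar version under stated assumptions: both frames share the same width and height with one byte per pixel, candidates that fall outside the frame are simply skipped, and ties go to the first candidate examined. The names `MotionVector`, `sad16` and `log_search` are hypothetical.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { int dx, dy; } MotionVector;   /* hypothetical result type */

/* SAD between two 16x16 blocks that share the same row stride. */
static unsigned sad16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sum = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sum += (unsigned)abs((int)cur[y * stride + x] -
                                 (int)ref[y * stride + x]);
    return sum;
}

/* Logarithmic search for the block whose top-left pixel is (bx, by).
 * Step sizes are 8, 4, 2 and 1 pixels; each iteration compares the
 * current block to the center candidate and its eight neighbors. */
MotionVector log_search(const uint8_t *cur_frame, const uint8_t *prev_frame,
                        int width, int height, int bx, int by)
{
    const uint8_t *cur = cur_frame + by * width + bx;
    int cx = bx, cy = by;                      /* center of the search */

    for (int step = 8; step >= 1; step /= 2) {
        unsigned best = UINT_MAX;
        int best_x = cx, best_y = cy;
        for (int j = -1; j <= 1; j++) {
            for (int i = -1; i <= 1; i++) {
                int x = cx + i * step, y = cy + j * step;
                if (x < 0 || y < 0 || x + 16 > width || y + 16 > height)
                    continue;                  /* candidate leaves the frame */
                unsigned s = sad16(cur, prev_frame + y * width + x, width);
                if (s < best) { best = s; best_x = x; best_y = y; }
            }
        }
        cx = best_x;                           /* best block becomes new center */
        cy = best_y;
    }
    MotionVector mv = { cx - bx, cy - by };
    return mv;
}
```

With four iterations of nine candidates each, this sketch performs at most 36 block comparisons, against 31 × 31 = 961 for an exhaustive search over displacements of up to 15 pixels in each direction.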
However, even with this and other methods, motion estimation is still one of the most CPU-consuming tasks in video compression. Therefore, further saving of processing time is desired.
In order to efficiently perform tasks, such as block comparisons, which involve performing identical commands on large sets of data, processors with SIMD (Single Instruction, Multiple Data) processing units have been introduced. For example, the AltiVec technology of the PowerPC chip, described in the AltiVec web pages at http://www.mot.com/SPS/PowerPC/AltiVec/AvecHome.html, which is incorporated herein by reference, allows performing one command on 16 different 1-byte variables concurrently. Use of the AltiVec technology may therefore substantially speed up block comparison. However, since large amounts of data are involved, the data cannot be held constantly in registers, and the main memory external to the processor must be accessed repeatedly in order to fetch the blocks of the preceding frame.
In order to simplify the hardware of the SIMD processing unit, memory access of the SIMD unit is ordinarily limited to words consisting of multiple bytes, the first of which begins at a memory location which is aligned to the size of the registers of the unit, i.e., at locations whose addresses are integral multiples of 16 bytes. When it is necessary to compare a block that is not properly aligned, two load commands must be performed to load each row of the block. For each row, the contents of two 16-byte memory locations containing between them the 16 bytes of the row are loaded, and then a permute command is performed to put the desired 16 bytes in a single register. Since the memory accesses take up most of the execution time in implementing the logarithmic search method, it would be highly desirable to reduce the number of memory accesses required.
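The two-loads-plus-permute sequence described above can be modeled in portable C as follows. This is a scalar emulation for illustration only, not AltiVec code: `Vec16`, `load_aligned`, `permute` and `load_unaligned` are hypothetical names, and a real implementation would use the vector unit's own load and permute instructions.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } Vec16;   /* stand-in for a 16-byte register */

/* Models an aligned-only load: the low four address bits are ignored,
 * so the 16-byte word containing p is returned. */
static Vec16 load_aligned(const uint8_t *p)
{
    Vec16 v;
    const uint8_t *base = (const uint8_t *)((uintptr_t)p & ~(uintptr_t)15);
    memcpy(v.b, base, 16);
    return v;
}

/* Models a permute: each output byte is selected by index from the
 * 32-byte concatenation of lo followed by hi. */
static Vec16 permute(Vec16 lo, Vec16 hi, Vec16 idx)
{
    Vec16 out;
    for (int i = 0; i < 16; i++) {
        unsigned k = idx.b[i] & 31u;
        out.b[i] = (k < 16) ? lo.b[k] : hi.b[k - 16];
    }
    return out;
}

/* Loads 16 bytes starting at an arbitrary address p using two aligned
 * loads and one permute whose control vector depends on the offset. */
Vec16 load_unaligned(const uint8_t *p)
{
    unsigned off = (unsigned)((uintptr_t)p & 15u);
    Vec16 lo = load_aligned(p);
    Vec16 hi = load_aligned(p + 15);   /* same word as lo when off == 0 */
    Vec16 idx;
    for (int i = 0; i < 16; i++)
        idx.b[i] = (uint8_t)(off + i); /* control vector built at run time */
    return permute(lo, hi, idx);
}
```

In this model the control vector has to be built from the run-time address, which is the extra work that a layout with a known, fixed offset would avoid.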
It is an object of some aspects of the present invention to provide methods for fast block comparison using SIMD units.
It is another object of some aspects of the present invention to provide methods for block comparison which require a reduced number of memory accesses.
It is another object of some aspects of the present invention to provide methods for reducing the percentage of time spent on memory access in motion estimation.
In preferred embodiments of the present invention, a processor and a plurality of SIMD registers, which are preferably included in the processor, are used for motion estimation of a block in a current frame. The block in the current frame (hereinafter referred to as the current block) is compared to a plurality of background blocks in a preceding frame, and the block in the preceding frame most closely resembling the current block is chosen. The processor is capable of reading into each of the registers a word of 16 bytes beginning at an address in a main memory which is aligned to the size of the registers, i.e., which is an integral multiple of 16 bytes. In order to facilitate efficient loading of the blocks of the preceding frame into the registers, the preceding frame is loaded into the main memory in a predetermined known alignment. Although in the predetermined alignment the beginning address may be aligned with the 16-byte registers, preferably the beginning address is deliberately out of alignment with the 16-byte registers.
When loading into the registers a desired word which is not precisely aligned in the memory or whose alignment is not known, two load commands are required to load two 16-byte aligned words that include all the bytes of the desired word. In addition, a permute command is required to bring the desired word into a single register. The use of the predetermined alignment, in accordance with the principles of the present invention, allows the use of a relatively fast shift command instead of the slower permute command, since the amount of shift required is known from the predetermined alignment. The time required to perform a shift command is about one-half to one-third of the time required to perform a permute command.
Preferably, as noted hereinabove, the preceding frame is loaded beginning at an address which is not aligned with respect to the SIMD registers. Preferably, the preceding frame is aligned in the memory so that for each block of the current frame on which motion estimation is to be performed, the rows of an area in the preceding frame that includes the blocks required for comparison in a first iteration of a logarithmic method used in the motion estimation are loaded directly into respective SIMD registers in a minimal number of steps of the processor. Preferably, the preceding frame is aligned on a half-word boundary, i.e., on an 8-byte boundary which is not a 16-byte boundary. By misaligning the preceding frame when it is loaded into the main memory, the method of the present invention enables motion estimation to be carried out faster and more efficiently.
Preferably, the current frame is loaded into the main memory in alignment with the registers. Preferably, the current frame is aligned so that its first row and column fall on full 16-byte boundaries. Thus, the preceding and current frames are misaligned with respect to each other in the main memory, with an 8-byte gap therebetween.
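One way to obtain such a layout is sketched below, assuming a C11 environment that provides `aligned_alloc` and a frame width that is a multiple of 16 bytes, so that every row keeps the alignment of the first. The `FrameBuffer` type and the functions `frame_alloc` and `frame_free` are hypothetical names introduced for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    void    *raw;     /* pointer returned by the allocator, passed to free() */
    uint8_t *pixels;  /* address of the first pixel of the frame */
} FrameBuffer;

/* Allocates a width x height frame (one byte per pixel) whose first pixel
 * lies at an address equal to `remainder` modulo 16.
 * remainder == 0 gives a register-aligned frame (the current frame);
 * remainder == 8 gives the half-word offset (the preceding frame).
 * Returns 0 on success, -1 on allocation failure. */
int frame_alloc(FrameBuffer *fb, size_t width, size_t height, size_t remainder)
{
    size_t bytes = width * height + 16;     /* spare room for the offset */
    bytes = (bytes + 15) & ~(size_t)15;     /* size must be a multiple of 16 */
    fb->raw = aligned_alloc(16, bytes);
    if (fb->raw == NULL)
        return -1;
    fb->pixels = (uint8_t *)fb->raw + (remainder & 15);
    return 0;
}

void frame_free(FrameBuffer *fb)
{
    free(fb->raw);
    fb->raw = NULL;
    fb->pixels = NULL;
}
```

Calling `frame_alloc` with a remainder of 0 for the current frame and 8 for the preceding frame reproduces the 8-byte relative misalignment described above.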
In some preferred embodiments of the present invention, during motion estimation the blocks of the current frame are loaded sequentially into the registers, and each block is kept in the registers for its entire processing term. Preferably, in the first iteration of the logarithmic method, each of the blocks of the current frame is compared to nine neighboring blocks of the previous frame, in three successive stages. In each stage the rows of the current block are compared in parallel to the rows of three blocks located on common rows of the previous frame. The three blocks are partially overlapping so that together they include 32 pixels in each of the rows, which are represented by 32 bytes. The 32 bytes are properly aligned in the memory so that they can be loaded into two SIMD registers in only two load commands. Preferably, for each of the 16 rows of the three blocks, two load commands are performed to load the data values of the pixels in the row into the registers. The 16 bytes in the two registers belong respectively to the block 8 pixels to the left of the current block and the block 8 pixels to the right of it. A shift command is used to move the middle 16 of the 32 bytes into a third SIMD register, and these 16 bytes are compared to the respective row of the current block.
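One stage of this first iteration can be modeled in scalar C as follows. This is a sketch under stated assumptions rather than the embodiment itself: `Vec16` stands in for a SIMD register, the row of the block 8 pixels to the left is assumed to start on a 16-byte boundary, the row stride of the preceding frame is assumed to be a multiple of 16, and the current block is re-read from memory for simplicity instead of being held in registers. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { uint8_t b[16]; } Vec16;   /* stand-in for a 16-byte register */

/* Models an aligned 16-byte load. */
static Vec16 load16(const uint8_t *p)
{
    Vec16 v;
    assert(((uintptr_t)p & 15u) == 0);      /* layout assumption */
    memcpy(v.b, p, 16);
    return v;
}

/* Models the fixed shift: bytes 8..23 of the 32 bytes lo followed by hi. */
static Vec16 middle16(Vec16 lo, Vec16 hi)
{
    Vec16 m;
    memcpy(m.b, lo.b + 8, 8);
    memcpy(m.b + 8, hi.b, 8);
    return m;
}

static unsigned row_sad(Vec16 a, Vec16 b)
{
    unsigned sum = 0;
    for (int i = 0; i < 16; i++)
        sum += (unsigned)abs((int)a.b[i] - (int)b.b[i]);
    return sum;
}

/* Compares the current block to the three blocks displaced by -8, 0 and
 * +8 pixels in x on the same rows of the preceding frame.
 * ref_left points at the top-left pixel of the -8 block.
 * sad[0], sad[1], sad[2] receive the results for -8, 0 and +8. */
void sad_three_blocks(const uint8_t *cur, size_t cur_stride,
                      const uint8_t *ref_left, size_t ref_stride,
                      unsigned sad[3])
{
    sad[0] = sad[1] = sad[2] = 0;
    for (int row = 0; row < 16; row++) {
        Vec16 c;
        memcpy(c.b, cur + row * cur_stride, 16);
        Vec16 left  = load16(ref_left + row * ref_stride);       /* load 1 */
        Vec16 right = load16(ref_left + row * ref_stride + 16);  /* load 2 */
        Vec16 mid   = middle16(left, right);                     /* shift  */
        sad[0] += row_sad(c, left);
        sad[1] += row_sad(c, mid);
        sad[2] += row_sad(c, right);
    }
}
```

Each row of the three candidate blocks costs two loads and one constant shift in this model, and each byte of the preceding frame is read once per stage.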
There is therefore provided in accordance with a preferred embodiment of the present invention, a method of comparing a current block in a current frame to a plurality of background blocks in a preceding frame, each block including a matrix of data values arranged in a given number of columns, using a processor which has a plurality of computational registers, each capable of receiving a number of the data values at least equal to the given number by loading the data values from a memory beginning at an address in the memory evenly divisible by the given number, including storing the preceding frame in the memory beginning at an address that is divisible by the given number with a predetermined remainder, loading at least some of the data values of the current block into one or more of the plurality of registers, loading at least some of the data values of one or more of the background blocks into another one or more of the plurality of registers, and comparing the background blocks to the current block using the registers.
Preferably, the method includes storing the current frame in the main memory beginning at an address divisible by the given number.
Preferably, loading the current block to the registers includes loading the current block only once for each block.
Preferably, loading the data values of the background blocks into the registers includes loading respective first and second rows of first and second ones of the background blocks and shifting the first and second rows to obtain in one of the registers a row of a third block overlapping the first and second blocks.
Preferably, the given number is sixteen.
Preferably, comparing the blocks includes comparing in accordance with a logarithmic iterative algorithm.
Further preferably, comparing the blocks includes comparing the current block to a plurality of partially overlapping background blocks in the preceding frame.
Alternatively or additionally, comparing the blocks includes loading and comparing all the data values of the background blocks situated on a common row of the preceding frame, in each iteration, before loading data values from another row.
Preferably, the data values of the preceding frame are loaded into the registers at most once in each iteration.
Preferably, comparing the blocks includes estimating motion of the current block relative to the preceding frame based on the comparison.
Preferably, the predetermined remainder is not equal to zero.
Further preferably, the predetermined remainder is equal to half the given number.
There is further provided in accordance with a preferred embodiment of the present invention, apparatus for comparing a current block in a current frame to a plurality of background blocks in a preceding frame, each block including a matrix of data values arranged in a given number of columns, including a main memory, in which the data values of the preceding frame are stored beginning at an address that is not evenly divisible by the given number, and a processor, including a plurality of computational registers, each capable of receiving the given number of data values, by loading the data values from the main memory beginning at an address evenly divisible by the given number.
Preferably, data values describing the current frame are stored in the main memory beginning at an address evenly divisible by the given number.
Preferably, the processor includes an AltiVec PowerPC.
Preferably, the processor loads the current block to the registers only once during a comparison process.
Preferably, the processor loads respective first and second rows of first and second ones of the background blocks and shifts the first and second rows to obtain in one of the registers a row of a third block overlapping the first and second blocks.
Preferably, the processor compares the background blocks to the current block by loading and comparing all the data values of the background blocks situated on a common row of the preceding frame, in each of a plurality of iterations, before loading data values from another row.
Preferably, the data values of the preceding frame are loaded into the registers at most once in each of a plurality of iterations in an iterative comparison method.
There is further provided in accordance with a preferred embodiment of the present invention, a digital video processing system, including apparatus for compressing video frames, the compressing including comparing a current block in a current frame to a plurality of background blocks in a preceding frame, each block including a matrix of data values arranged in a given number of columns, including a main memory, in which the data values of the preceding frame are stored beginning at an address that is not evenly divisible by the given number, and a processor, having a plurality of computational registers, each capable of receiving the given number of data values, by loading the data values from the main memory beginning at an address evenly divisible by the given number, and a network interface for transmission of the compressed video frames.
Preferably, the system includes a video on demand system.
There is further provided in accordance with a preferred embodiment of the present invention, a digital video camera, including a camera for acquiring a plurality of video frames, and apparatus for compressing at least some of the plurality of video frames, the compressing including comparing a current block in a current frame to a plurality of background blocks in a preceding frame, each block including a matrix of data values arranged in a given number of columns, including a main memory, in which the data values of the preceding frame are stored beginning at an address that is not evenly divisible by the given number; and a processor, having a plurality of computational registers, each capable of receiving the given number of data values, by loading the data values from the main memory beginning at an address evenly divisible by the given number.