1. Field of Art
The disclosure generally relates to video processing, and more particularly, to optimizing memory access for motion compensation and motion estimation within a video processing system.
2. Description of the Related Art
Motion compensation is often an important consideration for high video compression performance within a video processing system. For example, many existing video coding standards use a block-based hybrid motion compensated and transform video coding method. In a hybrid motion compensated and transform video coder, inter-picture motion compensated prediction reduces temporal redundancies between successive video pictures (or frames). Each block within a current picture, which is a B-type or P-type picture, is normally predicted by the encoder of the motion-compensated video coder from one or more previous pictures, each of which can be an I-type, a P-type, or a B-type picture. Motion compensated prediction also reduces spatial redundancies within a picture. For example, the H.264 video coding standard uses intra-picture motion-compensated prediction to reduce spatial redundancies within an I-type picture itself. Each current block of the I-type picture is predicted from one or more reference blocks found within the same picture.
A motion vector, MV (x, y), obtained by motion compensated prediction represents the spatial displacement between a current block in the current picture and a reference block in the reference picture(s). A motion vector is a translation vector, indicating the motion that aligns a reference block in the reference picture(s) with the predicted block. The prediction errors from motion compensated prediction indicate the content difference between the current block and the reference block. Once the motion vector and prediction errors have been decoded at the decoder of the video coder, the decoder performs motion compensation to reconstruct the current block. Specifically, the decoder copies the content of the best matched block from the reference picture indicated by the motion vector of the current block and adds the prediction errors to the best matched block to reconstruct the current block. As such, motion-compensated prediction can greatly improve video compression compared to coding without such processing.
To copy the content of the reference block for motion compensation of a single block, a decoder needs to fetch the content from a computer memory, such as DRAM. Recently emerging video coding standards require support for variable block-size motion compensation with small block sizes, and their implementation requires heavier use of memory. For example, the H.264 video coding standard supports more flexibility in the selection of motion compensation block sizes and shapes than any previous standard, with a luma motion compensation block size as small as 4×4 pixels. Compared with the 4 motion vectors per 16×16 pixels macroblock required by earlier standards, the H.264 standard supports as many as 16 independent motion vectors for the same 16×16 pixels macroblock. This higher motion vector flexibility results in a larger number of memory fetches, where each fetch comprises fewer reference pixels. In the case of H.264, each row fetched from memory may contain as few as 4 pixels useful for motion compensation.
Compounding the memory requirement imposed by smaller and variable block sizes, the memory read location of the reference block for a block being motion compensated often does not align perfectly with the memory unit configuration of the memory. For example, a 4×4 pixels reference block may straddle two neighboring memory read units, each of which is 8×64 pixels in size. In the horizontal direction alone, a non-aligned memory read fetches unused pixels on both the left and right sides of the reference block along the memory grid. This complication translates into more motion compensation related memory bandwidth waste.
FIG. 3A is a block diagram illustrating memory bandwidth waste due to an unaligned memory fetch related to motion compensation of a single block. FIG. 3A includes a memory consisting of a plurality of memory units. Each memory unit is defined by its two-dimensional coordinates, i.e., xi and yi. In one embodiment, xi is in units of 8 pixels and yi is in units of 64 pixels. In this case, a memory fetch unit is 8×64 pixels. Block 302 is a reference block identified by the decoded motion vector and reference information of a current block being motion compensated. The location of the reference block does not always align perfectly with the memory unit configuration of the memory. Thus, a non-aligned memory read fetches unused pixels on both sides of the reference block. Taking block 302 in FIG. 3A as an example, the memory read of block 302, which straddles memory unit (x2, y3) and memory unit (x3, y3), needs to fetch both memory unit (x2, y3) and memory unit (x3, y3) to reconstruct the block being motion compensated. However, due to the non-alignment described above, the pixels (i.e., 302L) between the left boundary of memory unit (x2, y3) and block 302 are unused and thus wasted. Similarly, the pixels (i.e., 302R) between block 302 and the right boundary of memory unit (x3, y3) are wasted. Thus, in the horizontal direction alone, the memory fetch of reference block 302 wastes (302L+302R) pixels due to the non-aligned memory read.
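The horizontal waste described for block 302 can be illustrated with a minimal sketch. The function name, the pixel coordinates, and the choice of a starting position of x = 22 are hypothetical and chosen only for illustration; the 8-pixel unit width follows the embodiment above.

```python
def horizontal_fetch_waste(block_x, block_width, unit_width=8):
    """Return (first_unit, last_unit, wasted_left, wasted_right) for one pixel row.

    block_x is the leftmost pixel column of the reference block (hypothetical
    coordinate system); unit_width is the width of a memory unit in pixels.
    """
    first_unit = block_x // unit_width
    last_unit = (block_x + block_width - 1) // unit_width
    fetch_start = first_unit * unit_width        # first pixel column fetched
    fetch_end = (last_unit + 1) * unit_width     # one past the last column fetched
    wasted_left = block_x - fetch_start          # corresponds to 302L
    wasted_right = fetch_end - (block_x + block_width)  # corresponds to 302R
    return first_unit, last_unit, wasted_left, wasted_right


# A 4-pixel-wide block starting at column 22 straddles units 2 and 3.
print(horizontal_fetch_waste(22, 4))  # (2, 3, 6, 6)
```

In this example 16 pixels are fetched per row while only 4 are used, so 12 pixels per row are wasted horizontally.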
Another source of memory bandwidth waste related to the memory fetch of a motion vector reference block is the overlapping of pixels between multiple reference blocks. Very often, the motion vectors of multiple blocks to be motion compensated point to the same or overlapping memory locations for their reference blocks, so that fetching each reference block separately reads the shared pixels more than once.
FIG. 2 illustrates a simplified motion compensation of multiple neighboring blocks which have the same or similar motion. For example, in FIG. 2, a moving football in the current picture 200 is located in two neighboring blocks, block 222 and its right neighboring block 224. The moving football is a rigid moving object whose motion spreads over multiple blocks. Therefore, the motion prediction process at the encoder side of a video coder finds that, within the search range 250, the corresponding motion vectors 230 and 240 for the blocks 222 and 224 are identical in both amount and direction of motion. A conventional memory fetch for motion compensation of blocks 222 and 224 requires two separate memory fetches: one for reference block 222R identified by motion vector 230 and one for reference block 224R identified by motion vector 240. However, one of these memory fetches may be saved: because blocks 222 and 224 share the same motion information, reference blocks 222R and 224R can be fetched together with one memory fetch.
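As a rough sketch of the saving described for blocks 222 and 224, the following assumes each block is given as a hypothetical (x, y, width, height) tuple and each motion vector as (dx, dy). The coordinates and function names are illustrative only and are not taken from the figures.

```python
def reference_rect(block, mv):
    """Translate a block (x, y, w, h) by its motion vector (dx, dy)."""
    x, y, w, h = block
    return (x + mv[0], y + mv[1], w, h)


def merged_reference(block_a, mv_a, block_b, mv_b):
    """Return one bounding rectangle covering both reference blocks when the
    motion vectors are identical, or None when they differ.

    For neighboring blocks with equal motion vectors the reference blocks are
    also neighbors, so the bounding rectangle contains no extra pixels.
    """
    if mv_a != mv_b:
        return None
    ax, ay, aw, ah = reference_rect(block_a, mv_a)
    bx, by, bw, bh = reference_rect(block_b, mv_b)
    x0, y0 = min(ax, bx), min(ay, by)
    x1, y1 = max(ax + aw, bx + bw), max(ay + ah, by + bh)
    return (x0, y0, x1 - x0, y1 - y0)


# Two horizontally adjacent 16x16 blocks with the same motion vector.
print(merged_reference((32, 16, 16, 16), (-5, 3), (48, 16, 16, 16), (-5, 3)))
# (27, 19, 32, 16)
```

One 32×16 fetch then replaces the two separate 16×16 fetches.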
Referring back to FIG. 3A, FIG. 3A also illustrates the memory bandwidth waste due to overlapping of multiple reference blocks in motion compensation. In the top right corner of FIG. 3A, the decoded motion vectors for two blocks being motion compensated identify their corresponding reference blocks 302 and 304 in the memory. Due to similar motion, the two reference blocks 302 and 304 share some overlapping pixels, e.g., the pixels in an overlapping block 306. Fetching reference block 302 for its corresponding block being motion compensated fetches the overlapping pixels 306 once. A separate memory fetch of reference block 304 then fetches the overlapping pixels 306 a second time. As such, the overlapping pixels 306 are unnecessarily fetched twice for motion compensation.
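The double fetch of the overlapping pixels 306 can be quantified with a simple rectangle intersection. The sketch below is illustrative only: the Rect type, the function names, and the block positions are assumptions, not values taken from FIG. 3A.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x: int  # left pixel column
    y: int  # top pixel row
    w: int  # width in pixels
    h: int  # height in pixels


def overlap_area(a, b):
    """Number of pixels shared by two reference blocks (0 if disjoint)."""
    ow = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    oh = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return max(ow, 0) * max(oh, 0)


def pixels_fetched(a, b, merged):
    """Pixels read for both blocks, fetched separately or as one merged region."""
    separate = a.w * a.h + b.w * b.h
    return separate - overlap_area(a, b) if merged else separate


block_302 = Rect(0, 0, 16, 16)   # hypothetical positions
block_304 = Rect(12, 4, 16, 16)
print(overlap_area(block_302, block_304))                  # 48
print(pixels_fetched(block_302, block_304, merged=False))  # 512
print(pixels_fetched(block_302, block_304, merged=True))   # 464
```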
Additional memory bandwidth waste related to motion compensation comes from the more accurate motion compensation required by recently emerging coding standards. For example, the MPEG-2 standard supports half-pixel motion vector accuracy, while H.264 supports quarter-pixel-accurate motion compensation, which allows a motion vector to point to a reference location between pixels at quarter-pixel granularity. In such cases, e.g., half-pixel or quarter-pixel granularity, neighboring pixels can be interpolated by a variable tap sub-pixel filter, such as the widely used 6-tap sub-pixel filter, to form prediction pixels. However, when a sub-pixel filter is used for more accurate motion compensation, a larger block needs to be fetched for a reference block. For example, using a 6-tap sub-pixel filter for a 16×16 pixels macroblock, a block of size 21×21 pixels needs to be fetched for the motion compensation. For a memory unit often having a size of 2^n (where n is a positive integer), a 21×21 pixels memory read translates to a memory fetch of at least 32×24 pixels of memory content, thus resulting in fetching 768 bytes of data instead of the 441 bytes of data needed. The memory bandwidth waste gets worse for motion compensating a 4×4 block because a 9×9 reference block must be fetched from memory, requiring a 16×12 fetch of 192 bytes of data instead of the 81 bytes of data needed.
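The byte counts in this paragraph can be checked with a short calculation. The sketch assumes one byte per pixel (as the 441-byte and 81-byte figures imply) and that an N-tap filter enlarges an N×N-independent block side by taps − 1 pixels; the fetched byte counts (768 and 192) are taken from the text rather than derived. Function names are hypothetical.

```python
def support_block_side(block_side, filter_taps):
    """Side length of the block needed for sub-pixel interpolation."""
    return block_side + filter_taps - 1


def needed_and_wasted_bytes(block_side, filter_taps, fetched_bytes):
    """Return (bytes actually needed, bytes wasted), assuming 1 byte per pixel."""
    needed = support_block_side(block_side, filter_taps) ** 2
    return needed, fetched_bytes - needed


print(support_block_side(16, 6))                # 21
print(needed_and_wasted_bytes(16, 6, 768))      # (441, 327)
print(support_block_side(4, 6))                 # 9
print(needed_and_wasted_bytes(4, 6, 192))       # (81, 111)
```

Under these assumptions about 43% of the fetched data is unused for the 16×16 case and about 58% for the 4×4 case, which is consistent with the statement that the waste is worse for smaller blocks.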
FIG. 3B (top center) illustrates the memory bandwidth waste due to the sub-pixel filtering support requirement. Two reference blocks 308 and 310 in the memory are to be fetched separately for their corresponding blocks to be motion compensated. Due to the sub-pixel accuracy motion compensation requirement, a larger block for each reference block, i.e., 308F for reference block 308 and 310F for reference block 310, needs to be fetched from the memory. The larger blocks, e.g., 308F and 310F, are hereinafter referred to as sub-pixel accuracy motion compensation support blocks. The size of a sub-pixel accuracy motion compensation support block is determined by the type of sub-pixel interpolation filter being used. The overlapping block 312 between the two larger blocks, 308F and 310F, represents the pixels that are unnecessarily fetched twice from the memory for the motion compensation process. For example, assuming the reference blocks 308 and 310 are each a 16×16 pixels macroblock, using a 6-tap sub-pixel filter for blocks 308 and 310 requires fetching 308F and 310F, each of size 21×21 pixels, for the motion compensation. The overlapping block 312 is at least 5 pixels wide in the horizontal direction, and these pixels are unnecessarily fetched twice.
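The 5-pixel horizontal overlap can be reproduced with the following sketch. It assumes, for illustration, that a 6-tap filter needs 2 extra pixels before and 3 extra pixels after the reference block in each dimension, and that the two reference blocks are horizontally adjacent; the function names and coordinates are hypothetical.

```python
def support_span(x, width, taps=6):
    """Horizontal extent [start, end) of the sub-pixel support block.

    Assumption: the filter reaches taps // 2 - 1 pixels before the block
    and taps // 2 pixels after it (2 and 3 for a 6-tap filter).
    """
    before = taps // 2 - 1
    after = taps // 2
    return x - before, x + width + after


def horizontal_overlap(x_a, x_b, width, taps=6):
    """Width in pixels shared by the support blocks of two reference blocks."""
    a_start, a_end = support_span(x_a, width, taps)
    b_start, b_end = support_span(x_b, width, taps)
    return max(0, min(a_end, b_end) - max(a_start, b_start))


# Two adjacent 16-pixel-wide reference blocks at columns 0 and 16.
print(support_span(0, 16))            # (-2, 19), i.e. 21 pixels wide
print(horizontal_overlap(0, 16, 16))  # 5
```

If the reference blocks themselves overlap, the shared width grows beyond 5 pixels, which is why the text describes 5 pixels as a lower bound for neighboring blocks.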
Motion vector refinement often occurs after a video transcoder finishes the decoding of the motion vector information, and prepares for encoding the decoded video stream into the required destination video format. Often the video transcoder needs to refine a decoded motion vector by searching the neighboring pixels of the reference block identified by the motion vector. As such, for two neighboring reference blocks, the overlapping between the motion vector refinement blocks represents the memory bandwidth waste due to motion vector refinement support.
FIG. 3B (bottom center) also illustrates the memory bandwidth waste due to the motion vector refinement support described above. In FIG. 3B, two reference blocks 308 and 310 need the motion vector refinement process. The dotted region 320 around the two reference blocks represents the block for sub-pixel filtering support, and the solid region 330 represents the block for motion vector refinement support. The solid region 330 is hereinafter referred to as a motion vector refinement support block. The size of a motion vector refinement support block is determined by a configurable threshold, which is a design choice of the implementation. The overlapping block 340 between two motion vector refinement support blocks represents the pixels that would be wasted by fetching the motion vector refinement support blocks separately.
Motion estimation involves searching a region within a reference picture for a close match of the current block in a current picture. Referring to FIG. 5A, to estimate the motion of the block 601, the complete search range and the region of support, represented by 601S, need to be fetched from the memory. The fetched block 601S may overlap with another block's search range and region of support as shown in FIG. 5B, where the overlap is represented by the shaded region 603. In FIG. 5B, 602 represents the other block and 602S represents the search range and region of support of block 602. Conventionally, the regions 601S and 602S would be fetched separately, resulting in fetching the overlap region 603 twice. This either degrades the effective memory bandwidth or requires a memory with much higher bandwidth. Such overlapping memory fetches are unnecessary and can be avoided by fetching the region 601S first and then fetching the region 602S minus the overlap region 603. As an example, searching for a 4×4 block requires a search range and region of support of 18×18 pixels. For a cluster of sixteen 4×4 blocks in a 16×16 block, fetching the search range and region of support of every 4×4 block separately results in fetching 16×18×18 = 5184 bytes. However, if an intelligent memory fetch is carried out, only the combined 30×30 region needs to be fetched, so that the total number of bytes fetched from the memory is 900 bytes, which is significantly smaller than 5184 bytes. This improves the overall memory efficiency because the bandwidth requirement drops significantly.
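The 5184-byte and 900-byte figures can be verified with the sketch below. It assumes one byte per pixel and that every 4×4 block uses the same 18×18 region placed at the same offset relative to the block, so that the union of all regions is a single square; function and parameter names are hypothetical.

```python
def separate_fetch_bytes(cluster_side, block_side, region_side):
    """Bytes fetched when each small block's search region is read separately."""
    blocks = (cluster_side // block_side) ** 2
    return blocks * region_side ** 2


def merged_fetch_bytes(cluster_side, block_side, region_side):
    """Bytes fetched when the union of all search regions is read once."""
    margin = region_side - block_side  # extra pixels per dimension around a block
    return (cluster_side + margin) ** 2


print(separate_fetch_bytes(16, 4, 18))  # 5184
print(merged_fetch_bytes(16, 4, 18))    # 900
```

Under these assumptions the merged fetch reads roughly 17% of the data read by the separate fetches.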
The combination of smaller and variable reference block sizes, non-aligned memory reads, overlapping reference blocks, and motion compensation with sub-pixel accuracy results in a large amount of memory bandwidth waste related to motion compensation. In the case of transcoding, further motion compensation related memory bandwidth waste may arise when an encoder needs to refine a decoded motion vector or fetch overlapping search ranges and regions of support. Thus, there is lacking, inter alia, a system and method for optimized memory access of motion compensation in a video processing system.