Codecs that compress frames of video may use a graphics processing unit (GPU) to accelerate the task. Most codecs perform compression by generating a bitstream consisting of many image macro-blocks, where each macro-block may be encoded with a different number of bits. Typically, in order to decode a given macro-block, the decoder must first process its predecessor (except where macro-blocks reside in separate, independent streams). This constraint poses a problem for parallel decoding, since ideally a large number of macro-blocks would be processed in parallel.
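The dependency described above can be sketched with a toy, hypothetical bitstream format (the length-header layout and function names below are illustrative assumptions, not any real codec's syntax): because each macro-block occupies a variable number of bytes, block k's start offset is unknown until blocks 0..k-1 have been parsed, whereas separate streams can be decoded concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def encode_stream(blocks):
    """Pack variable-length macro-blocks into one stream:
    a 1-byte length header precedes each payload (toy format)."""
    out = bytearray()
    for b in blocks:
        out.append(len(b))   # length header
        out.extend(b)        # variable-length payload
    return bytes(out)

def decode_stream(stream):
    """Sequential decode: each block's offset depends on its predecessor,
    so blocks within one stream cannot be located independently."""
    blocks, pos = [], 0
    while pos < len(stream):
        n = stream[pos]
        blocks.append(stream[pos + 1 : pos + 1 + n])
        pos += 1 + n
    return blocks

def decode_slices(streams):
    """Macro-blocks placed in separate, independent streams remove the
    cross-stream dependency, so the streams can decode in parallel."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(decode_stream, streams))
```

The intra-stream loop in `decode_stream` is inherently serial; parallelism is only recovered at the granularity of independent streams, which is why codecs partition frames into independently decodable regions.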
For most known codecs that support the YUV420 input pixel format, the luma and chroma channels (or “planes”) are divided into squares of pixels that are subsequently transformed into the 2D frequency domain. The size of these pixel squares typically differs by channel type: because the chroma planes are subsampled by two in each dimension, where the Y plane is divided into 8×8 blocks of samples, the U and V planes would be divided into 4×4 blocks.
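A minimal sketch of this tiling, assuming a hypothetical 16×16 YUV420 frame (the `tile` helper and plane contents are illustrative, not taken from any particular codec): the chroma planes are half resolution in each dimension, so an 8×8 luma tiling pairs with a 4×4 chroma tiling and both yield the same number of blocks per plane.

```python
import numpy as np

def tile(plane, size):
    """Split a 2D plane into non-overlapping (size x size) blocks."""
    h, w = plane.shape
    return (plane.reshape(h // size, size, w // size, size)
                 .swapaxes(1, 2)
                 .reshape(-1, size, size))

# YUV420: chroma is subsampled 2x horizontally and vertically.
y = np.arange(16 * 16).reshape(16, 16)    # luma, full resolution
u = np.arange(8 * 8).reshape(8, 8)        # chroma, half resolution
v = np.zeros((8, 8), dtype=int)

y_blocks = tile(y, 8)   # 8x8 luma blocks
u_blocks = tile(u, 4)   # 4x4 chroma blocks, same count as luma blocks
v_blocks = tile(v, 4)
```

Keeping the block count equal across planes lets each luma block be processed alongside its co-located chroma blocks before the frequency-domain transform.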