Current advances in display manufacturing will increase the use of high-quality video such as high definition (HD) and Ultra HD (4K UHD and 8K UHD). Delivering video streams at these resolutions demands new and improved video coding technology. The latest hybrid video compression standard, H.265/HEVC, was developed by the Joint Collaborative Team on Video Coding (JCT-VC) established by ISO/IEC MPEG and ITU-T VCEG. Although HEVC doubles the compression ratio of the popular H.264/AVC standard at the same video quality, its computational complexity is considerably higher. Most of its coding complexity is due to rate-constrained motion estimation (RCME).
Like its predecessor standard, H.264/AVC, HEVC is based on a hybrid architecture. However, numerous improvements have been made in the frame splitting, inter and intra prediction modes, transformation, in-loop filtering and entropy coding of the new design. The coding improvement of HEVC is obtained at the expense of higher computational complexity in the encoder structure. This means that coding a video sequence for real-time applications needs more powerful hardware. In addition, a single processor with existing technology is not able to meet such a computational demand. However, during the last few years, highly parallel processing devices such as graphics processing units (GPUs) or many-core central processing units (CPUs) have been developed and utilized to accelerate such complex tasks.
High-level parallelization tools in HEVC, like wavefront parallel processing (WPP) and tiles, allow several Coding Tree Units (CTUs) to be processed in parallel. For example, the maximum number of concurrent processes is equal to the number of CTU rows when WPP is used to encode one frame. This number increases significantly when a variant of WPP, called overlapped wavefront (OWF), is used to encode several frames simultaneously. At the cost of a lower coding efficiency, the degree of parallelism can be increased by using tiles or slices in addition to WPP/OWF. Hence, the parallel encoding of CTUs is usually sufficient to keep a multi-core CPU fully occupied most of the time, especially for high resolutions. However, it cannot provide enough parallelization for a many-core CPU or a heterogeneous architecture having a CPU and a GPU.
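The row-count bound on WPP parallelism can be illustrated with a minimal sketch. It assumes an idealized schedule in which every CTU takes one time step and a CTU may start once its left neighbour and the top-right CTU of the row above are finished; the 64x64 CTU size and all function names are assumptions made for illustration only.

```python
from collections import Counter
from math import ceil


def ctu_grid(width, height, ctu_size=64):
    """Return (columns, rows) of CTUs needed to cover a frame."""
    return ceil(width / ctu_size), ceil(height / ctu_size)


def wpp_start_times(cols, rows):
    """Earliest start step of every CTU under an idealized WPP schedule.

    Assumption: each CTU takes exactly one step. CTU (r, c) waits for
    its left neighbour (r, c-1) and for the top-right CTU (r-1, c+1),
    or for the CTU directly above when it is the last one in its row.
    """
    start = {}
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(start[(r, c - 1)])
            if r > 0:
                deps.append(start[(r - 1, min(c + 1, cols - 1))])
            start[(r, c)] = max(deps) + 1 if deps else 0
    return start


def wpp_peak_parallelism(cols, rows):
    """Largest number of CTUs that share the same start step."""
    return max(Counter(wpp_start_times(cols, rows).values()).values())


if __name__ == "__main__":
    cols, rows = ctu_grid(1920, 1080)
    peak = wpp_peak_parallelism(cols, rows)
    print(f"{cols}x{rows} CTUs, peak concurrent CTUs: {peak} (bound: {rows})")
```

Under these assumptions the peak never exceeds the number of CTU rows, and for frames that are not wide enough the two-CTU lag between consecutive rows keeps it below that bound, which is consistent with CTU-level parallelism being too coarse for a many-core device.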
In order to increase the degree of parallelization, prior art methods perform RCME on several prediction units (PUs) in parallel. The main challenge of these methods is to determine the best motion vector (MV) for a PU without knowing its motion vector predictors (MVPs). Most prior art methods estimate these MVPs from the MVs of already encoded neighboring CTUs, using spatial information.
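The dependency of the best MV on the MVP can be made concrete with a minimal sketch. It assumes the usual Lagrangian form of RCME, in which the cost of a candidate MV is its SAD plus a multiplier `lam` times the bits needed to code the difference between the MV and the MVP. The signed Exp-Golomb code length used as the rate estimate, the integer-pel full search, and all function names are illustrative assumptions and are not taken from any of the methods discussed here.

```python
def se_golomb_bits(v):
    """Length in bits of the signed Exp-Golomb code of integer v
    (used here only as a simple stand-in for the MV rate)."""
    k = 2 * abs(v) - (1 if v > 0 else 0)
    return 2 * (k + 1).bit_length() - 1


def sad(cur, ref, bx, by, size, mv):
    """Sum of absolute differences between a size x size block of the
    current frame at (bx, by) and the reference block displaced by mv."""
    dx, dy = mv
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(size)
        for x in range(size)
    )


def rcme_full_search(cur, ref, bx, by, size, search, mvp, lam):
    """Return the MV minimizing SAD + lam * bits(mv - mvp).

    The caller must keep the whole search window inside the frame.
    """
    best_cost, best_mv = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rate = se_golomb_bits(dx - mvp[0]) + se_golomb_bits(dy - mvp[1])
            cost = sad(cur, ref, bx, by, size, (dx, dy)) + lam * rate
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv


if __name__ == "__main__":
    # Flat frames: every candidate has the same distortion, so the
    # rate term alone decides and the result follows the MVP.
    flat = [[0] * 16 for _ in range(16)]
    for mvp in [(0, 0), (2, -1)]:
        print(mvp, rcme_full_search(flat, flat, 4, 4, 4, 3, mvp, lam=4))
```

In the flat-frame usage example the distortion is identical for all candidates, so changing only the MVP changes the selected MV. This is the reason a PU cannot be searched independently of its neighbors without either guessing the MVP or deferring the rate term.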
One prior art method performs parallel motion estimation (ME) on heterogeneous architectures for a whole frame. This method calculates the motion vectors of all blocks of a frame in parallel. However, the MVP is ignored, resulting in poor rate-distortion (RD) performance. Another prior art method utilizes the collocated motion vectors of the previous frame and extrapolates them into the frame being encoded. Although these methods achieve fine-grained parallelism suitable for a GPU, the prediction of MVPs can induce extra overhead for the CPU without significantly improving the RD performance.
An additional prior art method is directed toward a parallel implementation of RCME which uses the GPU as a pre-processor that calculates the sums of absolute differences (SADs) for the whole search region and transfers the results back to the CPU. This prior art method achieves better RD performance because it preserves MVP dependencies. However, due to the high bandwidth usage for transferring an excessive amount of data, the time reduction is smaller than that of other methods.
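A rough back-of-the-envelope sketch indicates why returning every SAD is costly. All parameters below (8x8 blocks, a search range of ±32 integer positions, 2 bytes per SAD, 4 bytes per MV) are hypothetical values chosen for illustration and do not describe the configuration of any particular prior art method.

```python
from math import ceil


def block_count(width, height, block):
    """Number of block x block units covering the frame."""
    return ceil(width / block) * ceil(height / block)


def sad_table_bytes(width, height, block, search, bytes_per_sad):
    """Bytes needed to return one SAD per candidate position of a
    +/-search integer-pel window, for every block of one frame."""
    candidates = (2 * search + 1) ** 2
    return block_count(width, height, block) * candidates * bytes_per_sad


def mv_only_bytes(width, height, block, bytes_per_mv):
    """Bytes needed to return a single MV per block instead."""
    return block_count(width, height, block) * bytes_per_mv


if __name__ == "__main__":
    full = sad_table_bytes(1920, 1080, block=8, search=32, bytes_per_sad=2)
    mvs = mv_only_bytes(1920, 1080, block=8, bytes_per_mv=4)
    print(f"all SADs: {full / 1e6:.1f} MB per frame and reference")
    print(f"MVs only: {mvs / 1e3:.1f} kB per frame and reference")
```

With these assumed values the full SAD table is several orders of magnitude larger than a one-MV-per-block result, which illustrates the bandwidth cost that limits the speedup of this approach.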
The above-mentioned prior art methods have deficiencies. They either transfer the distortion values (SADs) for the whole search region back to the CPU, which requires very high bandwidth and leads to a reduced speedup, or they attempt to predict the MVPs, and that prediction is frequently inaccurate, with a negative impact on RD performance.
There is a need for a method that provides a high degree of parallelization well-suited for massively parallel architectures, while significantly improving RD performance, with similar time reduction.