The continuous increase in the speed of available networks and the use of effective wireless technology are making it possible to provide rich content, including both audio and video, to a wide variety of end devices that include mobile terminals, cell phones, computers and other electronic devices. In order to maintain its competitive advantage, a service provider needs to maintain video quality and reliability as well as be able to process a high volume of traffic in a timely manner. Video encoding, an important step in the process, is notoriously CPU-intensive. Effective techniques for high-speed encoding of video are thus necessary for meeting the objectives of the service provider.
Video compression is an important step in the encoding of video images. H.264 is the most recent and efficient standard for video compression, described in ITU-T Rec. H.264, “Advanced Video Coding (AVC) for generic audiovisual services”, March 2009. The new tools introduced by this standard improve on the compression ratio of its predecessors by a factor of two, but at the cost of higher complexity. H.264 is also one of the most commonly used formats in processing high definition (HD) videos. However, this standard is so complex that some video encoders face difficulties encoding HD video sequences in real time, even on current state-of-the-art processors.
To reduce the encoding time, most H.264 video encoders developed for general-purpose processors exploit parallel approaches. For example, Intel's IPP H.264 video encoder, provided as part of the Intel® Integrated Performance Primitives 6.1 code samples (Intel, 2010, available from http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-code-samples/), is based on a multi-slice parallel approach, where the slices of a frame are encoded in parallel. The multi-core era does offer the promise of a great future for well-designed parallel approaches. A parallel approach must achieve speedups close to the number of cores used, support a variable number of cores with high scalability, reduce processing latency, preserve the visual quality of the encoded video without significant degradation, and should not change or force encoding parameters.
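As an illustration of the multi-slice idea mentioned above, the following is a minimal sketch, not Intel's implementation. It assumes a frame is represented as a list of macroblock rows, and the names `split_into_slices`, `encode_slice` and `encode_frame_multislice` are hypothetical. The slice "encoder" is a placeholder that only sums values, so that the structure of the parallel loop is visible.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_slices(mb_rows, num_slices):
    """Split mb_rows macroblock rows into contiguous (start, end) row ranges."""
    num_slices = min(num_slices, mb_rows)
    base, extra = divmod(mb_rows, num_slices)
    ranges, start = [], 0
    for i in range(num_slices):
        # The first `extra` slices receive one additional row.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def encode_slice(frame, row_range):
    """Hypothetical stand-in for a slice encoder: sums the values in its rows."""
    start, end = row_range
    return sum(sum(row) for row in frame[start:end])


def encode_frame_multislice(frame, num_workers):
    """Encode one slice per worker; results come back in slice order."""
    ranges = split_into_slices(len(frame), num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(lambda r: encode_slice(frame, r), ranges))
```

In standard CPython builds, threads do not execute pure-Python code on several cores at once, so this sketch shows the work partitioning only and would not by itself deliver a speedup close to the number of cores.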
Prior art addressing the parallelization of H.264 encoders is briefly described below. An H.264 encoder can be parallelized into multiple threads by using either functional decomposition or data domain decomposition. Functional decomposition divides an encoder into multiple tasks (motion estimation, transformation, entropy coding, etc.). Tasks are grouped into balanced processing groups, and each group is associated with a thread. As discussed by S. Chien et al. in “Hardware architecture design of video compression for multimedia communication systems”, IEEE Communications Magazine, vol. 43, p. 123, 2005, this type of decomposition can deliver good speedup, but rarely allows good scalability or flexibility to change.
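Functional decomposition can be pictured as a pipeline in which each task group runs in its own thread. The sketch below is an assumption-laden illustration only: the three stage functions are hypothetical placeholders that tag a frame instead of processing it, and the names `stage` and `run_pipeline` are not taken from any cited work.

```python
import queue
import threading

SENTINEL = None  # marks the end of the frame stream


def stage(func, inbox, outbox):
    """Apply func to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            break
        outbox.put(func(item))


# Hypothetical placeholder tasks: each one only appends a tag to the frame.
def motion_estimation(frame):
    return frame + ["ME"]


def transform(frame):
    return frame + ["T"]


def entropy_coding(frame):
    return frame + ["EC"]


def run_pipeline(frames):
    """Run one thread per task group, connected by FIFO queues."""
    tasks = [motion_estimation, transform, entropy_coding]
    queues = [queue.Queue() for _ in range(len(tasks) + 1)]
    threads = [
        threading.Thread(target=stage, args=(task, queues[i], queues[i + 1]))
        for i, task in enumerate(tasks)
    ]
    for thread in threads:
        thread.start()
    for frame in frames:
        queues[0].put(frame)
    queues[0].put(SENTINEL)
    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

The number of threads in this sketch is fixed by the number of task groups rather than by the number of available cores, which is one way to see why this kind of decomposition scales poorly.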
Data domain decomposition exploits the H.264 hierarchical structure to parallelize an encoder. This hierarchical structure has six levels: sequence, groups of pictures (GOPs), pictures, slices, macroblocks (MBs) and blocks. Each GOP starts with an instantaneous decoder refresh (IDR) frame. Frames following an IDR frame may not refer to frames preceding the IDR frame, meaning that GOPs are independent of each other and can be encoded in parallel. Each frame belongs to one of the following types: intra (I) frames, predicted (P) frames or bidirectional (B) frames. I frames are coded without reference to any other frame, while P frames are coded with reference to a past frame, and B frames are commonly coded with reference to a past and a future frame. Although a B frame may be used as a reference, this is rarely the case in practice. Therefore, it is safe to assume that no frame depends on a B frame. A slice represents a spatial region covering all or part of a coded video frame and includes a number of coded macroblocks, each containing compressed data corresponding to a 16×16 block of displayed pixels in a video frame. In general, the macroblocks are situated at predetermined positions in the frame. In many systems, each slice comprises one or more MB lines, wherein each MB line comprises the consecutive MBs of an entire row of a frame. Similar to frames, there are three types of slices: I slices, P slices and B slices. Slices belonging to the same frame are also independent of one another, and can therefore be encoded in parallel. A macroblock covers an area of 16×16 luminance pixels. Processing of a macroblock can depend on its three upper neighbors (top-left, top and top-right) and on the one on its left (left neighbor). In particular, processing of a P or B macroblock is dependent on its top neighbor, top-right neighbor and left neighbor.
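The macroblock dependencies described above determine how much parallelism is available inside a frame. As a minimal sketch, assuming every macroblock takes one time step and waits only for its left, top and top-right neighbors, the earliest step at which each macroblock can start can be computed as follows. The function names are hypothetical.

```python
from collections import Counter


def wavefront_steps(mb_rows, mb_cols):
    """Earliest step at which each macroblock can be processed, given that it
    waits for its left, top and top-right neighbors (when they exist)."""
    steps = [[0] * mb_cols for _ in range(mb_rows)]
    for r in range(mb_rows):
        for c in range(mb_cols):
            deps = []
            if c > 0:
                deps.append(steps[r][c - 1])          # left neighbor
            if r > 0:
                deps.append(steps[r - 1][c])          # top neighbor
                if c + 1 < mb_cols:
                    deps.append(steps[r - 1][c + 1])  # top-right neighbor
            steps[r][c] = 1 + max(deps) if deps else 0
    return steps


def max_parallel_macroblocks(steps):
    """Largest number of macroblocks that share the same step."""
    counts = Counter(step for row in steps for step in row)
    return max(counts.values())
```

For a frame more than one macroblock wide, the macroblock in row r and column c becomes available at step c + 2r under these assumptions, so each row trails the one above it by two macroblocks.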
Changying et al. describe a parallel algorithm at the GOP level for H.264 in “The Research of H.264/AVC Video Encoding Parallel Algorithm”, Intelligent Information Technology Application, 2008, IITA '08, Second International Symposium, 2008, pp. 201-205. The authors use a Master-Worker model and a dynamic scheduling algorithm to distribute the next GOP unit to be processed to an unused node, and obtain good scalability and speedup. However, this pure GOP-level parallel approach exhibits high latency, and is therefore not applicable to real-time encoding. Rodriguez et al. propose an approach based on GOP-level and slice-level parallelism in “Hierarchical Parallelization of an H.264/AVC Video Encoder”, Parallel Computing in Electrical Engineering, 2006, PAR ELEC 2006, International Symposium, 2006, pp. 363-368. In this approach, the system is composed of homogeneous clusters of workstations. Each cluster dynamically receives a GOP unit to process. A master workstation divides each frame into slices and distributes their processing among workstations. While this approach provides a good trade-off between speedup and latency, it can only be used when the system has access to all the frames of the video sequence to be compressed, making it unacceptable for real-time telecommunications.
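The GOP-level Master-Worker scheme with dynamic scheduling can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of the cited authors: fixed-size GOPs, frames identified by integer index, worker threads in place of cluster nodes, and a hypothetical `encode_gop` placeholder that only labels frames.

```python
import queue
import threading


def split_into_gops(num_frames, gop_size):
    """Group frame indices into consecutive GOPs of at most gop_size frames."""
    return [
        list(range(start, min(start + gop_size, num_frames)))
        for start in range(0, num_frames, gop_size)
    ]


def encode_gop(frames):
    """Hypothetical stand-in for a GOP encoder: IDR frame first, then P frames."""
    return ["IDR%d" % frames[0]] + ["P%d" % f for f in frames[1:]]


def master_worker_encode(num_frames, gop_size, num_workers):
    """The master queues the GOPs; each idle worker pulls the next one."""
    gops = split_into_gops(num_frames, gop_size)
    work = queue.Queue()
    for item in enumerate(gops):
        work.put(item)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                index, frames = work.get_nowait()
            except queue.Empty:
                return  # no GOP left to process
            encoded = encode_gop(frames)
            with lock:
                results[index] = encoded

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # Reassemble the output in display order of the GOPs.
    return [unit for i in range(len(gops)) for unit in results[i]]
```

In this sketch a worker needs every frame of a GOP before it can start, and output is only assembled once the GOPs are complete, which illustrates where the latency of a pure GOP-level approach comes from.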
Chen et al. present a parallel approach at the macroblock level in “Implementation of H.264 encoder and decoder on personal computers”, Journal of Visual Communication and Image Representation, vol. 17, pp. 509-532, 2006. This algorithm exploits dependencies between macroblocks in a wavefront manner. This kind of approach produces a good, but not excellent, speedup. Ge et al. describe a more efficient approach based on frame-level and slice-level parallelism in “Efficient multithreading Implementation of H.264 encoder on Intel hyper-threading architectures”, Information, Communications and Signal Processing and the Fourth Pacific Rim Conference on Multimedia, Proceedings of the 2003, Vol. 1, pp. 469-473. This approach uses five threads on a system with four processors. One thread is used to load the input file (or stream), to save the output file (or stream) and to fill two lists of slices: a priority list of I and P slices and a secondary list of B slices. The other four threads are used to encode slices in parallel. Each thread retrieves and processes the next free slice in the priority list. If this list is empty, the thread retrieves the next free slice in the secondary list. This approach provides excellent speedup, but forces the use of B frames, which is not always possible (for example, the H.264 baseline profile does not allow B frames), and also increases latency. Low latency, however, is required in real-time video applications such as videoconferencing.
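The two-list scheduling described above can be sketched as follows. This is a simplified illustration, not the implementation of Ge et al.: the class name `SliceScheduler` and the helper `drain` are hypothetical, slices are represented as (type, id) tuples, and the tracking of reference-frame dependencies that decides when a slice is actually "free" is omitted.

```python
import threading
from collections import deque


class SliceScheduler:
    """Holds a priority list of I/P slices and a secondary list of B slices."""

    def __init__(self):
        self._priority = deque()   # I and P slices
        self._secondary = deque()  # B slices
        self._lock = threading.Lock()

    def add_slice(self, slice_type, slice_id):
        """Called by the I/O thread to queue a slice in the matching list."""
        if slice_type not in ("I", "P", "B"):
            raise ValueError("unknown slice type: %r" % (slice_type,))
        with self._lock:
            target = self._secondary if slice_type == "B" else self._priority
            target.append((slice_type, slice_id))

    def next_slice(self):
        """Return the next I/P slice, else the next B slice, else None."""
        with self._lock:
            if self._priority:
                return self._priority.popleft()
            if self._secondary:
                return self._secondary.popleft()
            return None


def drain(scheduler, num_workers=4):
    """Run worker threads that repeatedly take and 'encode' the next slice."""
    done = []
    done_lock = threading.Lock()

    def worker():
        while True:
            item = scheduler.next_slice()
            if item is None:
                return
            with done_lock:
                done.append(item)  # placeholder for the actual encoding

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return done
```

The B slices act as filler work that keeps the encoding threads busy when no I or P slice is available, which is why a scheme of this kind depends on B frames being present in the stream.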
Therefore, there is a strong need for a method and system for parallel encoding of video that can handle commonly used video compression standards and profiles and produce a high speedup leading to low latency, which is important for many real-time applications.