To store and transmit digital video signals efficiently, it is often desirable to “compress” the signals. H.264 Advanced Video Coding (AVC) is a video compression standard that achieves higher compression efficiency than most signal compression standards. The AVC standard provides good video quality at bit rates that are substantially lower than those of previous standards, such as MPEG-2, H.263, or MPEG-4 Part 2, without being impractical to implement. The AVC standard is also flexible enough to be applied to a wide variety of applications and to work well on a wide variety of networks and systems.
The coding efficiency gains of advanced video standards such as AVC come at the price of increased computational requirements. The demand for computing power also increases with the shift towards HD resolutions. As a result, current high-performance uniprocessor computer architectures are not capable of providing the performance required for real-time processing. One way to speed up the video encoding process is to use a multi-core architecture. A related and powerful approach is to exploit the parallelism available in the encoding process itself. AVC may be parallelized by either task-level or data-level decomposition.
In order to exploit parallel processing power in video compression applications, conventional methods involve splitting a picture in a video sequence into “slices.” Some video compression applications require a single-slice approach (one slice per picture). With the single-slice approach, there are many dependency issues in the syntax and semantics around the block boundary, especially in the AVC specification.
One method of parallel processing video compression on a multi-core system with the single-slice approach is to separate a picture horizontally into a top half and a bottom half, with each half further divided into macroblocks (MBs). One thread in the processor processes the top half of the picture and another thread processes the bottom half. Both threads process the same picture. The bottom thread ignores dependencies around the boundary and also handles conformance of syntax around the boundary. When the bottom thread processes the first row of MBs in its half of the picture, it selects an MB mode that is independent of the mode used for the MB directly above. However, this methodology may achieve lower compression efficiency than the standard single-slice raster scan approach.
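The top-half and bottom-half split described above can be sketched as follows. This is a minimal illustration, not an encoder: the function names (`mb_grid`, `split_halves`, `process_half`, `encode_picture`) are hypothetical, the 16x16 macroblock size is the AVC luma macroblock size, and the per-macroblock "work" is reduced to recording whether the macroblock above may be used for prediction.

```python
from concurrent.futures import ThreadPoolExecutor

MB_SIZE = 16  # an AVC macroblock covers 16x16 luma samples


def mb_grid(width, height):
    """Return (mb_cols, mb_rows), rounding up for partial macroblocks."""
    mb_cols = (width + MB_SIZE - 1) // MB_SIZE
    mb_rows = (height + MB_SIZE - 1) // MB_SIZE
    return mb_cols, mb_rows


def split_halves(mb_rows):
    """Split the macroblock rows into a top range and a bottom range."""
    mid = mb_rows // 2
    return range(0, mid), range(mid, mb_rows)


def process_half(rows, mb_cols, ignore_upper_at):
    """Hypothetical stand-in for encoding one half of the picture.

    For each macroblock, record whether its upper neighbour may be used.
    The row given by ignore_upper_at must not depend on the row above it.
    """
    out = []
    for r in rows:
        upper_ok = r > 0 and r != ignore_upper_at
        out.extend((r, c, upper_ok) for c in range(mb_cols))
    return out


def encode_picture(width, height):
    """Process both halves of the same picture with two threads."""
    mb_cols, mb_rows = mb_grid(width, height)
    top, bottom = split_halves(mb_rows)
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_top = pool.submit(process_half, top, mb_cols, None)
        # The bottom thread ignores the dependency across the boundary row.
        f_bot = pool.submit(process_half, bottom, mb_cols, bottom.start)
        return f_top.result() + f_bot.result()
```

In this sketch, the first macroblock row of the bottom half is treated like the first row of the picture, which is what removes the dependency between the two threads and is also where the loss in compression efficiency would arise.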
A multiple-slice approach has also been proposed. However, multi-slice methods may suffer from many problems. For example, it may be difficult or impossible to validate the correctness of a parallel-processing methodology that incorporates multiple slices. In addition, the video quality decreases at the boundaries of slices. Video compression using horizontal multi-slice encoding may suffer from workload imbalance if the complexity of the video content differs from slice to slice. Moreover, the results of the individual slices in horizontal multi-slice encoding need to be concatenated to form a single result. This is additional work that does not exist in single-slice encoding.
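The workload imbalance mentioned above can be illustrated with a small calculation. The sketch below assumes, for illustration only, that each slice is encoded by its own core and that the picture is finished when the slowest slice finishes; the function name `parallel_speedup` and the cost values are hypothetical.

```python
def parallel_speedup(slice_costs):
    """Speedup over serial encoding when each slice runs on its own core.

    Serial time is the sum of the slice costs; parallel time is the cost
    of the most expensive slice, since all cores wait for it to finish.
    """
    return sum(slice_costs) / max(slice_costs)


balanced = parallel_speedup([10, 10, 10, 10])   # equal complexity per slice
imbalanced = parallel_speedup([25, 5, 5, 5])    # one complex slice dominates
```

With four equally complex slices the speedup is 4, while the same total work concentrated in one slice yields a speedup of only 1.6 under these assumptions.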
All of the processes discussed above divide a frame into slices for encoding. If a system could pass an entire frame, rather than a slice, to a multi-core encoder, it would greatly reduce the communication load between the central processor and encoder. Additionally, the communication load would be further reduced if the encoding process occurred in a single-command multiple-data fashion. Hence, there remains a need for an efficient implementation and scalable methodology for processing, in parallel, groups of pictures (GOPs) at the frame level where each frame is an encoding unit.
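One way to picture frame-level processing of GOPs in parallel is sketched below. This is an illustrative sketch under stated assumptions, not the claimed method: it assumes the GOPs can be encoded independently of one another, `encode_frame` is a hypothetical placeholder for a real frame encoder, and threads stand in for the cores of a multi-core encoder.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_gops(frames, gop_size):
    """Group consecutive frames into GOPs of at most gop_size frames."""
    return [frames[i:i + gop_size] for i in range(0, len(frames), gop_size)]


def encode_frame(frame):
    """Hypothetical placeholder; a real encoder would emit a bitstream."""
    return f"enc({frame})"


def encode_gop(gop):
    """Encode one GOP frame by frame; each whole frame is one encoding unit."""
    return [encode_frame(frame) for frame in gop]


def encode_sequence(frames, gop_size=4, workers=4):
    """Dispatch whole GOPs to workers and return results in input order."""
    gops = split_into_gops(frames, gop_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order the GOPs were submitted.
        encoded = list(pool.map(encode_gop, gops))
    return [frame for gop in encoded for frame in gop]
```

Because each worker receives entire frames, no picture is divided into slices and no per-slice results need to be concatenated within a frame.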