Compressed video is widely used nowadays in various applications, such as video broadcasting, video streaming, and video storage. The video compression technologies adopted by newer video standards are becoming more sophisticated and require more processing power. At the same time, the resolution of the underlying video is growing to match high-resolution display devices and to meet the demand for higher quality. For example, compressed video in High Definition (HD) is becoming a trend. The processing power required for HD content increases with the spatial resolution. Accordingly, providing sufficient processing power for higher-resolution video is becoming a challenging issue for both hardware-based and software-based implementations.
In order to fulfill the computational requirements of high-resolution video and/or more sophisticated coding standards, high-speed processors and/or multiple processors have been used to perform real-time video decoding. For example, in the personal computer (PC) and consumer electronics environments, a multi-core Central Processing Unit (CPU) may be used to decode the video bitstream. The multi-core system may be in the form of an embedded system for cost saving and convenience. Existing single-threaded software intended for a single processor may not be suitable for a multi-core system due to the data-dependent nature of video coding. In order to improve performance and reduce the system cost of a multi-core system based on existing single-threaded software, parallel algorithms need to take into account the characteristics of the underlying video coding technology. Common issues encountered in parallel algorithm design include load balancing, synchronization overhead, and memory access contention. In embedded systems, the communication buffer size is also an important issue because embedded systems have limited memory and bandwidth.
Most existing video standards are not designed for parallel processing. Often, data dependencies exist in various stages of processing. Several parallel algorithms for video decoders have been proposed to avoid these data dependencies. For example, in the MPEG-2 and H.264/AVC standards, motion compensation for a current macroblock (MB) relies on one or two reference macroblocks from one or two previously processed pictures. Furthermore, in H.264/AVC, Intra prediction and Motion Vector Prediction (MVP) rely on neighboring blocks. Frame-level partitioning is not suitable due to Inter prediction and buffer size limitations in embedded systems. The open source software FFmpeg supports the MPEG-2, MPEG-4 and H.264 coding standards and also supports slice-level partitioning. However, only the MPEG-2 format in FFmpeg is useful for multi-core systems, since only the MPEG-2 standard defines each MB row as an individual slice. Other video standards, such as MPEG-4 and H.264, do not mandate multiple slices within a frame, and most commercial bitstreams are configured to code each frame as a single slice. Therefore, slice-level partitioning does not help to speed up decoding in a multi-core system, since one slice is equivalent to one frame in most cases. Furthermore, frame-based partitioning does not help to alleviate the large buffer requirement.
To take potential advantage of a multi-core system, it is desirable to process the video in a smaller unit, such as macroblock-level processing. For macroblock-level partitioning, the system still needs to handle the data dependencies on spatially neighboring blocks, i.e., the left, top-left, top and top-right macroblocks. Local storage may be needed to buffer information associated with these neighboring blocks for bandwidth efficiency and lower power consumption. The reference picture buffer associated with Inter prediction will still be needed, since on-chip frame memory may cause a substantial cost increase. Nevertheless, the data dependencies of macroblock-level processing still pose a challenge to parallel algorithm design for a video decoder.
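The spatial dependency pattern described above can be sketched as follows. This is an illustrative model only, not part of the disclosure: the grid width `mb_w` and the helper names are assumptions, and the wavefront step shown is one common way such dependencies can be satisfied in parallel schedules.

```python
# Hypothetical sketch: spatial macroblock dependencies and a wavefront
# schedule that respects them.  All names are illustrative assumptions.

def spatial_deps(mb_x, mb_y, mb_w):
    """Left, top-left, top and top-right neighbors of macroblock
    (mb_x, mb_y), clipped to the picture boundary (mb_w MBs per row)."""
    deps = []
    if mb_x > 0:
        deps.append((mb_x - 1, mb_y))              # left
    if mb_y > 0:
        if mb_x > 0:
            deps.append((mb_x - 1, mb_y - 1))      # top-left
        deps.append((mb_x, mb_y - 1))              # top
        if mb_x < mb_w - 1:
            deps.append((mb_x + 1, mb_y - 1))      # top-right
    return deps

def wavefront_step(mb_x, mb_y):
    """Earliest time step at which (mb_x, mb_y) may start in a wavefront
    schedule; every spatial dependency lands on a strictly earlier step."""
    return mb_x + 2 * mb_y
```

For example, macroblock (2, 1) depends on (1, 1), (1, 0), (2, 0) and (3, 0), all of which have a smaller wavefront step, so they can be decoded before it on other cores.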
In FIG. 1, an exemplary system block diagram for an H.264 decoder is shown. The video bitstream is first processed by an entropy decoder or Variable Length Decoder (VLD) 110 to recover code symbols. The entropy decoder 110 may also include a bitstream parser to extract respective information for various processing modules. For example, the spatial prediction mode and associated information may be extracted and provided to Intra Prediction 160, and the motion vector and mode information may be extracted and provided to Motion Compensation (MC) 170. The coded residue information is provided to Inverse Quantization (IQ) 120 and further processed by Inverse Transform (IT) 130. The recovered residues from IT 130 are combined with the predictor selected from Intra Prediction 160 or MC 170 to form reconstructed video data. The selection between Intra Prediction 160 and MC 170 is controlled by switch 190 based on whether the underlying video data is processed in Intra mode or Inter mode. The reconstructed video data is further processed by an in-loop filter, i.e., Deblocking Filter 150, to enhance visual quality. The deblocked video data is stored in Reference Picture Buffer 180 for motion compensation of subsequent video data. The H.264 decoder system shown in FIG. 1 may be configured differently.
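The reconstruction step of FIG. 1 can be sketched as follows. This is a minimal illustrative model, not the actual decoder: the function names are assumptions, and the clipping to an 8-bit sample range reflects common practice for 8-bit video.

```python
# Hypothetical sketch of the reconstruction path in FIG. 1: the predictor
# selected by switch 190 (Intra Prediction 160 or MC 170) is added to the
# residues recovered by IQ/IT, then clipped to the 8-bit sample range.

def select_predictor(is_intra, intra_pred, mc_pred):
    """Models switch 190: Intra mode uses spatial prediction,
    Inter mode uses motion compensation."""
    return intra_pred if is_intra else mc_pred

def reconstruct(predictor, residues):
    """Per-sample sum of predictor and residue, clipped to [0, 255]."""
    return [max(0, min(255, p + r)) for p, r in zip(predictor, residues)]
```

The clipping matters because residues are signed: a predictor sample of 250 plus a residue of 20 must saturate at 255 rather than wrap around.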
In a single-core system as shown in FIG. 2, the decoding process can be implemented using single-threaded software. The decoding process is divided into multiple processing modules including VLD 220, IQ/IT 230, PRED (Prediction) 240, and DBK 250. The PRED 240 is responsible for both Intra Prediction 160 and Motion Compensation 170 of FIG. 1. As mentioned before, the buffers required for the incoming Bitstream 210 and the Reference Picture 270 are usually sizeable, and off-chip memory is used to reduce chip cost. The deblocking process (DBK 250) for a current macroblock also relies on neighboring macroblocks. An output buffer 260 is used, as shown in FIG. 2, to store the decoded video frame for display on a display screen after decoding. While separate buffers are shown in FIG. 2, these buffers may be aggregated so that a single memory device may be used.
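The single-threaded organization of FIG. 2 can be sketched as a sequential per-macroblock loop. This is an illustrative stand-in only: the stage functions are passed in as placeholders rather than real VLD, IQ/IT, prediction or deblocking implementations.

```python
# Hypothetical sketch of the single-threaded decoding loop of FIG. 2:
# each macroblock passes through VLD 220, IQ/IT 230, PRED 240 and DBK 250
# in sequence on one core.  Stage functions are illustrative stand-ins.

def decode_single_thread(bitstream, vld, iq_it, pred, dbk):
    decoded = []
    for mb_bits in bitstream:        # one entry per macroblock
        symbols = vld(mb_bits)       # entropy decoding
        residues = iq_it(symbols)    # inverse quantization / transform
        recon = pred(residues)       # predictor + residues
        decoded.append(dbk(recon))   # in-loop deblocking
    return decoded
```

Because every stage for every macroblock runs on the same core, the throughput is bounded by the sum of all stage costs, which motivates the partitioning schemes discussed next.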
Various parallel algorithms exist and can be divided into three categories: functional partitioning, data partitioning and mixed partitioning. Functional partitioning assigns each thread a distinct function in a pipelined fashion and communicates between tasks in an explicit way. Each function is decomposed into tasks, which can be executed in parallel on a multiprocessor system. Data partitioning uses different threads to execute the same function on different parts of the input data simultaneously. Mixed partitioning combines functional partitioning and data partitioning, applying functional parallelism at coarse granularity and data parallelism within those granularities. In embedded systems, a major problem of data partitioning and mixed partitioning is communication buffer size. Data partitioning needs a VLD buffer with a size corresponding to two full frames, and mixed partitioning needs a VLD buffer and prediction buffer with a size corresponding to three full frames. In a non-patent literature by van der Tol et al. (E. B. van der Tol, E. G. T. Jaspers, and R. H. Gelderblom, “Mapping of H.264 decoding on a multiprocessor architecture”, Proc. of the SPIE, vol. 5022, pp. 707-718, May 2003), and a non-patent literature by Meenderinck et al. (C. Meenderinck, A. Azevedo, M. Alvarez, B. Juurlink and A. Ramirez, “Parallel Scalability of H.264”, Proc. of workshop on Programmability Issues for Multicore Computers (MULTIPROG), January 2008), the buffer requirement for data partitioning is discussed. In a non-patent literature by Nishihara et al. (K. Nishihara, A. Hatabu and T. Moriyoshi, “Parallelization of H.264 video decoder for embedded multicore processor”, Proc. of IEEE Intl. Conf. on Multimedia and Expo (ICME), pp. 329-332, April 2008), and a non-patent literature by Sihn et al. (K.-H. Sihn, H. Baik, J.-T. Kim, S. Bae and H. J. Song, “Novel approaches to parallel H.264 decoder on symmetric multicore systems”, Proc. of IEEE Intl. 
Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 2017-2020, April 2009), the buffer requirement for mixed partitioning is discussed.
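The functional partitioning category above can be sketched as a two-stage pipeline on two cores: one thread performs parsing (VLD) and a second thread performs reconstruction, communicating through an explicit bounded buffer. This is an illustrative model only; the bounded queue stands in for the limited communication buffer of an embedded system, and the stage bodies are placeholders.

```python
# Hypothetical sketch of functional partitioning: a VLD (parsing) thread
# feeds parsed macroblocks to a reconstruction thread through a bounded
# queue that models a small communication buffer (a few macroblocks,
# not full frames).  All names are illustrative assumptions.
import queue
import threading

def vld_stage(bitstream, out_q):
    for mb in bitstream:                 # stand-in for entropy decoding
        out_q.put(mb)                    # blocks when the buffer is full
    out_q.put(None)                      # end-of-stream sentinel

def recon_stage(in_q, decoded):
    while True:
        mb = in_q.get()
        if mb is None:
            break
        decoded.append(mb)               # stand-in for IQ/IT, PRED and DBK

def decode_two_core(bitstream, buffer_mbs=4):
    q = queue.Queue(maxsize=buffer_mbs)  # explicit, bounded communication
    decoded = []
    t1 = threading.Thread(target=vld_stage, args=(bitstream, q))
    t2 = threading.Thread(target=recon_stage, args=(q, decoded))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return decoded
```

The `maxsize` bound is the key design point for embedded systems: a small macroblock-granularity buffer keeps the two stages overlapped without requiring the one-to-three-frame buffers that data and mixed partitioning incur.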
For functional partitioning, a method to divide the video decoder functionality into a parsing part and a reconstruction part for a dual-core environment is disclosed by Seitner et al. in two separate non-patent literatures (F. H. Seitner, R. M. Schreier, M. Bleyer and M. Gelautz, “A macroblock-level analysis on the dynamic behaviour of an H.264 decoder”, Proc. of IEEE Intl. Symp. on Consumer Electronics (ISCE), pp. 1-5, June 2007, and F. H. Seitner, M. Bleyer, M. Gelautz and R. M. Beuschel, “Development of a High-Level Simulation Approach and Its Application to Multicore Video Decoding”, IEEE Trans. on Circuits and Systems for Video Technology, vol. 19, no. 11, pp. 1667-1679, November 2009). A high-level simulation is also proposed by Seitner et al. to estimate the performance. Y. Kim et al. (Y. Kim, J.-T. Kim, S. Bae, H. Baik and H. J. Song, “H.264 decoder on embedded dual core with dynamically load-balanced functional partitioning”, Proc. of IEEE Intl. Conf. on Multimedia and Expo (ICME), pp. 1001-1004, April 2008) proposed dynamic load balancing in functional partitioning, which off-loads the motion compensation to another core if the previous macroblock is in a bi-directional Inter prediction mode. However, a quarter-frame buffer size is required for their pipeline structure. M. Kim et al. (M. Kim, J. Song, D. Kim and S. Lee, “H.264 decoder on embedded dual core with dynamically load-balanced functional partitioning”, Proc. of IEEE Intl. Conf. on Image Processing (ICIP), pp. 3749-3752, September 2010) disclosed a dynamic load balancing method for functional partitioning based on a dual-core DSP. The system consists of a dual-core DSP, VLD hardware, MC hardware, deblocking hardware and an SDRAM. The dual-core DSP only handles some selected parts of the video decoder flow. It is desirable to develop an improved functional-partitioning based scheme that can further enhance performance and reduce the buffer requirement.
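The load-balancing rule attributed to Y. Kim et al. above can be sketched as a simple core-assignment decision. This is a minimal illustration of only the rule stated in the text; the mode labels and two-core model are assumptions, and the actual scheme in the paper is more involved.

```python
# Hypothetical sketch of the dynamic load-balancing rule described above:
# motion compensation (MC) for the current macroblock is off-loaded to the
# second core (core 1) when the previous macroblock used bi-directional
# Inter prediction ("BI").  Labels and the two-core model are illustrative.

def assign_mc_core(mb_modes):
    """Return, per macroblock, which core (0 or 1) performs its MC."""
    cores = []
    prev_mode = None
    for mode in mb_modes:
        cores.append(1 if prev_mode == "BI" else 0)
        prev_mode = mode
    return cores
```

The rationale is that a bi-directionally predicted macroblock roughly doubles the MC workload, so the following macroblock's MC is moved to the other core to even out the pipeline stages.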