In recent years, H.264 (NPL 1), a standard for highly efficient video compression, has become increasingly popular and has entered many application areas such as HDTV, portable video, multimedia, video conferencing, and video and digital cameras.
One reason for this success is its high compression efficiency combined with high output picture quality, which is partly a result of the use of the de-blocking filter (PTL 1, NPL 2, NPL 3).
The H.264 de-blocking filter 101 is a closed-loop filter that operates inside the decoding loop 108, together with an inter prediction unit 103, an intra prediction unit 105, an addition unit 107, a selection unit 106, and the memory areas for the actual frame 102 and the reference frames 104, as shown in FIG. 8, which is a block diagram of the decoding loop of an H.264 video decoder.
The de-blocking filter is necessary for block-wise lossy coding at high compression ratios.
FIG. 9 shows the structure of a macro block (MB) 200, together with the 4 pixels 203 on each side of an edge 202 that are needed to implement an H.264 in-loop de-blocking filter.
Two neighboring image pixels that are coded in two different MBs or sub blocks (SB) 201 may describe the same image content.
However, the independent prediction and coding of the two pixels may result in different reconstruction values on the two sides of the block edge.
The de-blocking filter 101 alleviates such reconstruction differences at block boundaries adaptively according to their estimated magnitude.
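The adaptive behavior described above can be illustrated with a minimal sketch of filtering one pixel pair across an edge. It is modeled on the general form of the H.264 normal-strength luma filter; the function name `filter_edge` and the threshold values `alpha`, `beta`, and `tc` are assumptions for illustration, whereas a real decoder derives the thresholds from tables indexed by the quantization parameter and boundary strength.

```python
def clip(lo, hi, v):
    """Clamp v to the range [lo, hi]."""
    return max(lo, min(hi, v))

def filter_edge(p1, p0, q0, q1, alpha, beta, tc):
    """Smooth the two pixels p0 | q0 next to a block edge (8-bit samples).

    p1, p0 lie on one side of the edge and q0, q1 on the other.
    alpha, beta, tc are illustrative thresholds, not standard table values.
    """
    # A large step is treated as real image content and left untouched.
    if abs(p0 - q0) >= alpha or abs(p1 - p0) >= beta or abs(q1 - q0) >= beta:
        return p0, q0
    # Correction grows with the estimated step size but is limited to +/- tc.
    delta = clip(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    return clip(0, 255, p0 + delta), clip(0, 255, q0 - delta)
```

With these assumed thresholds, a small step of 8 between `p0` and `q0` is reduced, while a step of 180 is left as it is.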
Because filtering at a block boundary needs pixels from both sides of the boundary, dependencies exist between neighboring MBs.
As shown in FIG. 10, in addition to the data of the currently processed macro block (MBC) 400, the last SB row 43 of the upper macro block (MBU) 403 and the last SB column 41 of the left macro block (MBL) 401 are needed to process the MBC 400.
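The data needed for one MB can be written down as pixel rectangles. The following is a minimal sketch, assuming 16x16 luma MBs and 4x4 SBs; the function name `regions_to_load` and the rule that a missing neighbor at the picture border is simply skipped are assumptions for illustration.

```python
MB = 16  # assumed luma macro block size in pixels
SB = 4   # assumed sub block size in pixels

def regions_to_load(mb_row, mb_col):
    """Return the pixel rectangles needed to filter one MB.

    Each rectangle is (x0, y0, x1, y1) with exclusive end coordinates.
    """
    x, y = mb_col * MB, mb_row * MB
    regions = {"current MB": (x, y, x + MB, y + MB)}
    if mb_col > 0:
        # Last SB column of the left MB (MBL).
        regions["left SB column"] = (x - SB, y, x, y + MB)
    if mb_row > 0:
        # Last SB row of the upper MB (MBU).
        regions["upper SB row"] = (x, y - SB, x + MB, y)
    return regions
```

For an MB in the interior of the picture, all three regions are returned; for the top-left MB, only the current MB itself remains.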
Some implementations, such as the one described in [NPL 4], neglect these dependencies in a first pass over the image to increase the parallelism and then correct the resulting errors in a second pass.
However, most other implementations make only one pass over the picture and take these dependencies into account in that pass.
Other implementations decrease the processing time by exploiting either task-level parallelism [PTL 2] [PTL 3] or data-level parallelism [NPL 5, NPL 6, NPL 7, NPL 8].
In the related art [NPL 5-NPL 8], first all data necessary for processing an MB is loaded, then the MB is processed, and finally the data is stored back. This results in dependencies when concurrently processing the current macro block (MBC) 400 and the left macro block (MBL) 401, the MBC 400 and the upper macro block (MBU) 403, and the MBC 400 and the upper right macro block (MBUR) 405, as shown in FIGS. 11, 12, and 13, respectively.
FIG. 11 shows the necessary data when processing the MBL 401 and the MBC 400.
If the MBC 400 and the MBL 401 are processed in parallel, the shaded SB column data 41 on the right side of the MBL 401 is also needed for processing the MBC 400.
FIG. 12 shows the necessary data when processing the MBU 403 and the MBC 400.
If the MBU 403 and the MBC 400 are processed in parallel, the shaded lower SB row 43 of the MBU 403 is also needed for processing the MBC 400.
FIG. 13 shows the necessary data when processing MBUR 405 and the MBC 400.
Here, the shaded area 46 from MBU 403 is needed and updated in both MB filter tasks.
FIG. 14 shows the necessary data when processing the upper left macro block (MBUL) 407 and the MBC 400.
Here, the two MB filter processes need no common data, so the MBUL 407 and the MBC 400 can be processed in parallel.
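The four cases of FIGS. 11 to 14 can be reproduced with a simple overlap test. The sketch below is a simplified model, assuming 16x16 MBs and 4x4 SBs and assuming that each MB filter task touches its own MB, the last SB column of its left neighbor, and the last SB row of its upper neighbor; the names `footprint`, `overlaps`, and `shares_data` are hypothetical.

```python
MB, SB = 16, 4  # assumed macro block and sub block sizes in pixels

def footprint(mb_row, mb_col):
    """Pixel rectangles (x0, y0, x1, y1) touched by one MB filter task."""
    x, y = mb_col * MB, mb_row * MB
    return [
        (x, y, x + MB, y + MB),   # the MB itself
        (x - SB, y, x, y + MB),   # last SB column of the left MB
        (x, y - SB, x + MB, y),   # last SB row of the upper MB
    ]

def overlaps(a, b):
    """True if two end-exclusive rectangles share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def shares_data(mb_a, mb_b):
    """True if the two MB filter tasks touch any common pixel."""
    return any(overlaps(ra, rb)
               for ra in footprint(*mb_a)
               for rb in footprint(*mb_b))
```

Under this model, an MB at row 5, column 5 shares data with its left, upper, and upper right neighbors, but not with its upper left neighbor, which matches the figures.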
FIG. 15 shows a schematic block diagram of a filtering task.
A filter apparatus 1000 includes a plurality of macro block filters (MB Filters) 1100, 1101, 1102.
First, picture data 10 is sliced into a plurality of macro block rows.
The macro block rows are processed in parallel by the respective MB Filters.
However, since filtering a macro block requires data from neighboring macro blocks as mentioned above, inter-processor synchronization is required between the MB Filters.
FIG. 16 shows MB filter tasks in the related art.
First, the data for one MB is loaded (ST51), then processed (ST53), and finally stored back (ST55).
The “Data load” task (ST51) includes “Loading current MB” (ST511), “Loading left SB column” (ST512) and “Loading upper SB row” (ST513).
The “Data process” task (ST53) includes “Processing vertical edges” (ST531) and “Processing horizontal edges” (ST532).
The “Data store” task (ST55) includes “Storing current MB” (ST551), “Storing left SB column” (ST552) and “Storing upper SB row” (ST553).
Further, the synchronization “Inter sync” (ST57) is performed at the end of the MB processing. This is necessary, for example, when operating on a multi-processor unit in which each MB Filter works on one MB row and multiple MB Filters run at the same time.
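The task sequence of FIG. 16 can be summarized as a small data structure. This is only an illustrative sketch: the step identifiers and names are taken from the description above, while the list layout and the helper `flatten_steps` are assumptions.

```python
# Related-art MB filter task: (step id, name, list of sub steps).
RELATED_ART_MB_TASK = [
    ("ST51", "Data load", [
        ("ST511", "Loading current MB"),
        ("ST512", "Loading left SB column"),
        ("ST513", "Loading upper SB row"),
    ]),
    ("ST53", "Data process", [
        ("ST531", "Processing vertical edges"),
        ("ST532", "Processing horizontal edges"),
    ]),
    ("ST55", "Data store", [
        ("ST551", "Storing current MB"),
        ("ST552", "Storing left SB column"),
        ("ST553", "Storing upper SB row"),
    ]),
    ("ST57", "Inter sync", []),
]

def flatten_steps(task=RELATED_ART_MB_TASK):
    """Return the step ids in execution order for one MB."""
    steps = []
    for step_id, _name, sub_steps in task:
        if sub_steps:
            steps.extend(sub_id for sub_id, _ in sub_steps)
        else:
            steps.append(step_id)
    return steps
```

Calling `flatten_steps()` lists the load steps first, then the processing steps, then the store steps, and ends with the synchronization step.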
Inside an MB row, the fact that the MBs are all processed by the same MB Filter guarantees that the dependencies between MBs are observed.
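The interplay of per-row processing and inter-row synchronization can be sketched with threads. The example below is a minimal illustration, assuming one thread per MB row and a shared progress counter per row; the function name `run_mb_filters` and the use of a condition variable are assumptions and are not taken from the related art.

```python
import threading

def run_mb_filters(mb_rows, mb_cols):
    """Run one thread ("MB Filter") per MB row; return the completion order."""
    done = [0] * mb_rows            # done[r] = number of finished MBs in row r
    cond = threading.Condition()
    order = []

    def mb_filter(r):
        for c in range(mb_cols):
            if r > 0:
                # The upper and upper right MBs must be finished first.
                need = min(c + 2, mb_cols)
                with cond:
                    cond.wait_for(lambda: done[r - 1] >= need)
            # Load, process and store of MB (r, c) would happen here.
            with cond:
                # "Inter sync": publish the progress of this row.
                done[r] = c + 1
                order.append((r, c))
                cond.notify_all()

    threads = [threading.Thread(target=mb_filter, args=(r,))
               for r in range(mb_rows)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

The dependency on the left MB needs no explicit synchronization in this sketch, because each row is handled sequentially by a single thread.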
Given the execution procedure of the filtering process as mentioned above, during the start-up phase, the parallelism increases step by step until all processing units (PUs) have started the filter task.
As shown in FIG. 17, before processing of the MBC 400 can start, processing of the MBL 401, the MBU 403 and the MBUR 405 must have finished.
The resulting delay is 2 synchronous intervals (SI), which corresponds to 2 MBs.
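The delay of 2 SI between adjacent MB rows can be checked with a small scheduling sketch. It assumes that every MB takes exactly one SI and that a processing unit is available for every MB row; the function name `start_intervals` is hypothetical.

```python
def start_intervals(mb_rows, mb_cols):
    """Earliest start interval of each MB under the related-art dependencies."""
    start = [[0] * mb_cols for _ in range(mb_rows)]
    for r in range(mb_rows):
        for c in range(mb_cols):
            deps = []
            if c > 0:
                deps.append((r, c - 1))          # left MB
            if r > 0:
                deps.append((r - 1, c))          # upper MB
                if c + 1 < mb_cols:
                    deps.append((r - 1, c + 1))  # upper right MB
            # An MB starts once all of its dependencies have finished.
            start[r][c] = max((start[dr][dc] + 1 for dr, dc in deps),
                              default=0)
    return start
```

For a picture of 3 MB rows and 4 MB columns, the rows start at intervals 0, 2, and 4, so each row lags the row above it by 2 SI.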