Blocking effect and ringing noise are two well-known artifacts in low bit rate coded video. The blocking effect is the grid noise along block boundaries mainly visible in smooth areas, and the ringing noise shows along object borders. Traditionally, de-blocking filters try to remove the unwanted boundaries between adjacent blocks by low-pass filtering applied to pixels on both sides of the block borders. However, this type of filtering may introduce undesirable blurring effects when applied to pixels which belong to real image edges. The decision between edge and non-edge block borders relies on the assumption that real edges have higher amplitude than borders produced by the quantization of DCT coefficients. One method used to remove the ringing noise along object borders is to detect the edges in each frame, and apply a smoothing filter along these edges.
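The edge versus non-edge decision described above can be illustrated with a minimal sketch. It is not taken from any particular standard: the function name `deblock_boundary`, the `threshold` parameter and the quarter-step smoothing rule are all assumptions chosen for illustration. A step across the block border that is smaller than the threshold is treated as a quantization artifact and smoothed; a larger step is assumed to be a real image edge and left untouched.

```python
def deblock_boundary(row, boundary, threshold):
    """Smooth one row of pixels across a block border (illustrative only).

    row       -- list of integer pixel values
    boundary  -- index of the first pixel of the right-hand block
    threshold -- hypothetical amplitude above which a step is a real edge
    """
    p0 = row[boundary - 1]   # last pixel of the left block
    q0 = row[boundary]       # first pixel of the right block
    step = q0 - p0

    out = list(row)
    if abs(step) >= threshold:
        # Large step: assumed to be a real image edge, so do not blur it.
        return out

    # Small step: assumed to be a blocking artifact. Move each border
    # pixel a quarter of the step toward the other side (simple low-pass).
    out[boundary - 1] = p0 + step // 4
    out[boundary] = q0 - step // 4
    return out


artifact = [100, 100, 100, 100, 108, 108, 108, 108]
real_edge = [50, 50, 50, 50, 200, 200, 200, 200]
smoothed = deblock_boundary(artifact, 4, 16)
preserved = deblock_boundary(real_edge, 4, 16)
```

In this sketch the small step of 8 is softened at the two border pixels, while the step of 150 is preserved, which is how the blurring of real edges is avoided.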
De-blocking and de-ringing are important video processing techniques used to remove coding artifacts and improve visual quality when rendering low bit rate coded video. Overlap smoothing and in-loop de-blocking are inter-block filtering techniques applied in video encoding standards to offset the effect of block encoding. Because these video filtering algorithms are computationally intensive, few of them are able to run in real time without customized hardware or a high-speed processor.
FIG. 13 is a schematic view of the architecture of a known deblocking filter. The deblocking filter 1 comprises an External RAM 2, a System bus 3, a Ram1 4, a Ram0 5, a Ram2 6, a deblocking filter 7, control parameters 8, and a controller 9. In operation, the input pixel data of the deblocking filter come from two modules: the Ram2 6 for unprocessed pixel data of the current MB from the prior modules in the pipeline (e.g., inverse transformation, motion compensation and intra prediction), and the Ram0 5 for the adjacent pixel data of the top and left MBs of the current MB.
When required, the Ram0 5 is loaded in advance with the necessary pixel data from the External RAM 2 via the System bus 3. The control parameters 8 and the controller 9 provide instructions for the filtering process. The processing results are then sent to the Ram1 4, which receives the pixel data of the current MB, and to the Ram0 5, which receives the adjacent pixel data of the top and left MBs. After deblocking filtering, the processed data in the Ram1 4 and the Ram0 5 have to be stored back to the External RAM 2.
To add to the complexity of the filtering algorithms, some of these filtering techniques are applied concurrently during video encoding or decoding. For example, overlap smoothing and in-loop de-blocking can occur during the decoding of VC-1 bitstreams; both de-blocking and de-ringing can be applied after the decoding of MPEG-4 bitstreams; and de-blocking can be performed as a post-processing technique in addition to in-loop de-blocking in H.264. Conventional solutions that have an individual hardware block for each filtering application are costly in terms of area and bandwidth.
One known multi-DSP system has a main DSP operating concurrently with an auxiliary DSP for implementing a filter algorithm. The DSPs have separate program memories, and the main DSP downloads filter process instructions to the auxiliary program memory. The DSPs share the same data memory, but priority is given to the main DSP.
Another known system implements de-blocking and de-ringing by splitting the frame into rectangular slices which are processed simultaneously by four processing elements, each of which has data-level and instruction-level parallelism. Data transfer between the local processing element data memory and the external memory is performed in the background by a powerful DMA engine. Both of these known systems require additional high-speed processors operating in parallel to provide a software solution. Depending on the complexity of the filtering, a number of additional processors may be required. This increases the area needed for the additional cores and program memories, and adds complexity for arbitration.
One known filter accelerator, connected in parallel with a conventional DSP, enhances the speed of filtering operations in the DSP by calculating and maintaining partial results based on selected prior data samples, freeing the DSP to perform other operations. However, this will not meet real-time requirements for both sets of video post-filtering techniques.
Another known hardware architecture may be embedded in a DSP with special instructions to accelerate the adaptive de-blocking filter of H.264/AVC video coding. Its building blocks include a dedicated data buffer, an instruction decoder and controller, a transpose model, and an edge filter with compact data access. However, this architecture is not generic enough to support filtering algorithms other than de-blocking filtering.
Another known digital signal processing arrangement comprises a memory area, a signal processing module and a direct memory access controller for coordinating data transmission between the signal processing module and memory area.
For the implementation of a filter that supports post or in-loop filtering in different standards, the complexity of the filtering algorithms, beyond the arithmetic itself, is further aggravated by excessive I/O overheads in loading the data required by each standard for processing. Several factors, discussed below, contribute to these overheads, which impair the ability of the filter co-processor to accelerate the filtering process and reduce the continuity between consecutive filtering processes.
One of the factors is that the de-blocking and de-ringing algorithms differ in nature and therefore require different data handling for efficient filtering. De-blocking is performed across block boundaries while de-ringing is block-based. The data access pattern used for de-blocking may not be suitable for de-ringing. Conventional block boundary filtering typically arranges the data as two 4×4 blocks, one on each side of the block boundary. For block-based filtering that requires surrounding pixels, this arrangement brings about excessive read and write operations.
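The mismatch between the two access patterns can be made concrete with a rough count. The sketch below assumes that pixel data is fetched in whole 4×4 tiles and that a block-based filter needs a one-pixel margin around the block. The helper `tiles_touched` and the margin value are hypothetical and are not drawn from any standard, and frame-edge handling is ignored.

```python
def tiles_touched(x0, y0, width, height, margin, tile=4):
    """Count the tile-by-tile storage units overlapped by a pixel region.

    The region starts at (x0, y0), is width x height pixels, and is
    grown by `margin` pixels on every side (the surrounding pixels a
    block-based filter would need). Frame borders are not considered.
    """
    first_col = (x0 - margin) // tile
    last_col = (x0 + width - 1 + margin) // tile
    first_row = (y0 - margin) // tile
    last_row = (y0 + height - 1 + margin) // tile
    return (last_col - first_col + 1) * (last_row - first_row + 1)


# Boundary filtering: two 4x4 blocks side by side, no margin needed.
boundary_tiles = tiles_touched(0, 0, 8, 4, margin=0)

# Block-based filtering of one 4x4 block with a 1-pixel surround.
block_tiles = tiles_touched(4, 4, 4, 4, margin=1)

# Pixels fetched versus pixels actually used in the block-based case.
pixels_fetched = block_tiles * 16
pixels_used = (4 + 2) * (4 + 2)
```

Under these assumptions the boundary filter touches two tiles, while the block-based filter touches nine tiles (144 pixels) in order to use a 6×6 region of 36 pixels.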
In addition, de-blocking consists of two one-dimensional filtering passes, one vertical and one horizontal, performed one after the other, whereas de-ringing is usually a two-dimensional operation. Current digital signal processors have an efficient data interface and efficient filtering in one dimension only. For two-dimensional filtering, data has to be rearranged before it is input into the filter function and rearranged again before it is stored to memory.
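The rearrangement overhead can be sketched as follows, assuming a filter routine that only works along rows. The names `filter_rows`, `transpose` and `filter_columns`, the 3-tap kernel and the edge replication are illustrative assumptions, not part of any described system. Filtering in the other direction requires transposing the block before the call and again afterwards.

```python
def filter_rows(block, taps):
    """Apply a 1-D filter along each row, replicating edge pixels."""
    norm = sum(taps)
    half = len(taps) // 2
    out = []
    for row in block:
        n = len(row)
        new_row = []
        for x in range(n):
            acc = 0
            for k, t in enumerate(taps):
                # Clamp the index so border pixels are repeated.
                idx = min(max(x + k - half, 0), n - 1)
                acc += t * row[idx]
            new_row.append(acc // norm)
        out.append(new_row)
    return out


def transpose(block):
    """Swap rows and columns (the data rearrangement step)."""
    return [list(col) for col in zip(*block)]


def filter_columns(block, taps):
    """Column filtering built from the row-only routine.

    The two transposes are pure data movement: they do no filtering
    work but are needed because the routine is one-dimensional.
    """
    return transpose(filter_rows(transpose(block), taps))


taps = [1, 2, 1]
block = [[0, 0, 0],
         [0, 16, 0],
         [0, 0, 0]]
horizontal = filter_rows(block, taps)
vertical = filter_columns(block, taps)
both = filter_columns(filter_rows(block, taps), taps)
```

Every pass in the second direction costs two extra traversals of the block, which is the kind of pre- and post-arrangement overhead described above.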
Finally, there is a trend in video consumer products towards supporting multiple video standards for video encoding and decoding applications. Thus, in addition to the traditional hardware solutions, it is more desirable to have a software solution that is flexible enough to support different video standards.