Motion estimation is an effective inter-frame coding technique to exploit temporal redundancy in video sequences. Motion-compensated inter-frame coding has been widely used in various international video coding standards. The motion estimation adopted in various coding standards is often a block-based technique, where motion information such as coding mode and motion vector is determined for each macroblock or similar block configuration. In addition, intra-coding is also adaptively applied, where the picture is processed without reference to any other picture. The inter-predicted or intra-predicted residues are usually further processed by transformation, quantization, and entropy coding to generate a compressed video bitstream. During the encoding process, coding artifacts are introduced, particularly in the quantization process. In order to alleviate the coding artifacts, additional processing has been applied to reconstructed video to enhance picture quality in newer coding systems. The additional processing is often configured in an in-loop operation so that the encoder and decoder may derive the same reference pictures to achieve improved system performance.
In the High Efficiency Video Coding (HEVC) standard being developed, Deblocking Filter (DF), Sample Adaptive Offset (SAO) and Adaptive Loop Filter (ALF) have been developed to enhance picture quality. The in-loop filter information may have to be incorporated in the bitstream so that a decoder can properly recover the required information. Therefore, in-loop filter information from SAO and ALF is usually provided to the entropy encoder for incorporation into the bitstream. In HEVC, DF is applied to the reconstructed video, where the horizontal deblocking filter is first applied to the reconstructed video data across vertical block boundaries and the vertical deblocking filter is then applied to the horizontally DF-processed (also referred to as H-DF-processed or horizontal-deblocked) video data across horizontal block boundaries. After both horizontal DF and vertical DF filtering are applied, the fully DF-processed video data is processed by SAO. ALF is then applied to the SAO-processed video data. While the loop processing order for HEVC is from DF to SAO, and then to ALF, the processing order may be different among various loop filters in other video systems.
FIG. 1 illustrates an exemplary adaptive video coding system incorporating in-loop processing. The video bitstream is decoded by Video Decoder 142 to recover the transformed and quantized residues, SAO/ALF information and other system information. The coded residues are processed by Inverse Quantization (IQ) 124 and Inverse Transformation (IT) 126 to recover the residues. The recovered residues are then added back to prediction data 136 at Reconstruction (REC) 128 to form reconstructed video data. Depending on whether the underlying video data is coded in an Intra mode or Inter mode, a switch 114 selects the prediction data from either the Intra prediction block 110 or the Motion Compensation (MC) block 113. The reconstructed video is further processed by DF 130 (deblocking filter), SAO 131 and ALF 132 to produce the final enhanced decoded video. The enhanced and reconstructed video data is stored in Reference Picture Buffer 134 and used for prediction of other frames.
The coding process in HEVC is applied according to Largest Coding Unit (LCU). The LCU is adaptively partitioned into coding units using quadtree. In each leaf CU, DF is performed for each 8×8 block, and in HEVC Test Model Version 4.0 (HM-4.0), the DF is applied to 8×8 block boundaries. For each 8×8 block, horizontal filtering across vertical block boundaries is first applied, and then vertical filtering across horizontal block boundaries is applied. For generality, the coding process may divide a picture into image units and select coding parameters adaptively for each image unit, where the image unit may be an LCU, a macroblock, a slice, a tile, or other image structure. During processing of a luma block boundary, n pixels on each side are involved in filter parameter derivation, and up to m pixels on each side may be changed after filtering. For HEVC, n is set to 4 for the luma component. For horizontal filtering across vertical block boundaries, reconstructed pixels (i.e., pre-DF pixels) are used for filter parameter derivation and also used as source pixels for filtering. In FIG. 1, deblocking 130 is shown as the first in-loop filtering applied to the reconstructed video data. Nevertheless, a video system may process the reconstructed video data to generate processed reconstructed video data before deblocking is applied. In this case, the deblocking is applied to the processed reconstructed video data. In this disclosure, the pre-DF video data refers to the video data immediately before the DF process, which may be the reconstructed video data or the processed reconstructed video data. For convenience, the term for the reconstructed video data also includes the processed reconstructed video data in this disclosure.
FIG. 2A illustrates an example of a vertical block boundary 210 with n (n=4) boundary pixels on each side of the block boundary. The n boundary pixels on the right side are designated as q0, q1, q2 and qn−1, where q0 is the pixel immediately next to the boundary. The n (n=4) boundary pixels on the left side are designated as p0, p1, p2 and pn−1, where p0 is the pixel immediately next to the boundary. For horizontal filtering across vertical block boundaries, reconstructed pixels (i.e., pre-DF pixels) are used for filter parameter derivation, and horizontal-deblocked pixels (i.e., pixels after horizontal filtering) are used for vertical filtering. FIG. 2B illustrates an example of a horizontal block boundary 220 with n boundary pixels on each side of the block boundary. The n boundary pixels on the lower side are designated as q0, q1, q2 and qn−1, where q0 is the pixel immediately next to the boundary. The n (n=4) boundary pixels on the upper side are designated as p0, p1, p2 and pn−1, where p0 is the pixel immediately next to the boundary. While n pixels on each side of the block boundary are used for filter parameter derivation and filtering operation, deblocking only alters m pixels on each side of the block boundary, where m is equal to or smaller than n. In HEVC, m is set to 3 for the luma component. Accordingly, only boundary pixels (q0 to qm−1) or (p0 to pm−1) may be altered after DF filtering. For DF processing of a chroma block boundary, two pixels (i.e., n=2) on each side, i.e., (p0, p1) or (q0, q1), are involved in filter parameter derivation, and at most one pixel (i.e., m=1) on each side, i.e., p0 or q0, may be altered after filtering. For horizontal filtering across vertical block boundaries, reconstructed pixels are used for filter parameter derivation and are used as source pixels for filtering. For vertical filtering across horizontal block boundaries, horizontal DF processed intermediate pixels (i.e., pixels after horizontal filtering) are used for filter parameter derivation and also used as source pixels for filtering.
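The relationship between the n pixels read and the m pixels that may be altered on each side of a boundary can be illustrated with a minimal sketch. The function name df_footprint and the convention that the boundary is identified by the index of q0 are assumptions made for illustration only; the sketch lists sample positions and does not implement the deblocking filter itself.

```python
def df_footprint(boundary, n, m):
    """Return (read, written) sample positions along one line crossing a boundary.

    `boundary` is the index of q0, the first sample on the right (or lower)
    side; p0 sits at `boundary - 1`. This indexing is an illustrative
    assumption, not something defined by the text.
    """
    if not 0 < m <= n:
        raise ValueError("expected 0 < m <= n")
    # p(n-1) .. p0 followed by q0 .. q(n-1): used for parameter derivation
    read = list(range(boundary - n, boundary + n))
    # p(m-1) .. p0 followed by q0 .. q(m-1): may be altered by filtering
    written = list(range(boundary - m, boundary + m))
    return read, written


# Luma uses n=4, m=3 and chroma uses n=2, m=1, as stated above.
luma_read, luma_written = df_footprint(boundary=8, n=4, m=3)
chroma_read, chroma_written = df_footprint(boundary=8, n=2, m=1)
print(luma_read, luma_written)      # [4, ..., 11] and [5, ..., 10]
print(chroma_read, chroma_written)  # [6, 7, 8, 9] and [7, 8]
```

The outermost pixel on each side (p3 and q3 for luma, p1 and q1 for chroma) is read but never modified, which is why fewer lines are altered than are needed as input.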
Sample Adaptive Offset (SAO) 131 is also adopted in HM-4.0, as shown in FIG. 1. SAO is regarded as a special case of filtering where the processing only applies to one pixel. SAO can divide one picture into multiple LCU-aligned regions, and each region can select one SAO type among two Band Offset (BO) types, four Edge Offset (EO) types, and no processing (OFF). For each to-be-processed (also called to-be-filtered) pixel, BO uses the pixel intensity to classify the pixel into a band. The pixel intensity range is equally divided into 32 bands, as shown in FIG. 3. After pixel classification, one offset is derived for all pixels of each band, and the offsets of the center 16 bands or the outer 16 bands are selected and coded. In SAO, pixel classification is first done to classify pixels into different groups (also called categories or classes). For EO, the pixel classification for each pixel is based on a 3×3 window, as shown in FIG. 4, where four configurations corresponding to 0°, 90°, 135°, and 45° are used for classification. Upon classification of all pixels in a picture or a region, one offset is derived and transmitted for each group of pixels. In HM-4.0, SAO is applied to luma and chroma components, and each of the components is independently processed. Similar to BO, one offset is derived for all pixels of each EO category except for category 0, where category 0 is forced to use zero offset. Table 1 below lists the EO pixel classification, where “C” denotes the pixel to be classified.
TABLE 1
Category    Condition
1           C < two neighbors
2           C < one neighbor && C == one neighbor
3           C > one neighbor && C == one neighbor
4           C > two neighbors
0           None of the above
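The BO band classification and the EO conditions of Table 1 can be sketched as follows. This is an illustrative sketch rather than the HM-4.0 implementation: the function names, the default 8-bit sample depth, and the neighbor offsets assumed for the 0°, 90°, 135° and 45° configurations (FIG. 4 is not reproduced here) are all assumptions.

```python
def bo_band(pixel, bit_depth=8):
    """Band Offset: map a sample to one of 32 equal-width intensity bands."""
    return pixel >> (bit_depth - 5)


def eo_category(c, a, b):
    """Edge Offset category of sample c given its two neighbors a and b (Table 1)."""
    if c < a and c < b:
        return 1  # C < two neighbors
    if (c < a and c == b) or (c < b and c == a):
        return 2  # C < one neighbor && C == one neighbor
    if (c > a and c == b) or (c > b and c == a):
        return 3  # C > one neighbor && C == one neighbor
    if c > a and c > b:
        return 4  # C > two neighbors
    return 0      # none of the above, forced to zero offset


# Assumed (dy, dx) positions of the two neighbors inside the 3x3 window.
EO_NEIGHBORS = {
    0: ((0, -1), (0, 1)),
    90: ((-1, 0), (1, 0)),
    135: ((-1, -1), (1, 1)),
    45: ((-1, 1), (1, -1)),
}


def classify_eo(picture, y, x, angle):
    """Classify the interior sample picture[y][x] for one EO configuration."""
    (dy0, dx0), (dy1, dx1) = EO_NEIGHBORS[angle]
    return eo_category(picture[y][x],
                       picture[y + dy0][x + dx0],
                       picture[y + dy1][x + dx1])


picture = [[10, 10, 10],
           [12, 8, 12],
           [10, 10, 10]]
print(classify_eo(picture, 1, 1, 0))  # 1: the center is below both horizontal neighbors
print(bo_band(200))                   # 25
```

Because every EO configuration reads a neighbor on the line above or below (except 0°), classifying a line requires the DF output of its adjacent lines, which is the source of the SAO line buffer requirement discussed later.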
Adaptive Loop Filtering (ALF) 132 is another in-loop filtering in HM-4.0 to enhance picture quality, as shown in FIG. 1. Multiple types of luma filter footprints and chroma filter footprints are used. For example, an 11×5 cross-shaped filter is shown in FIG. 5A and a 5×5 snowflake-shaped filter is shown in FIG. 5B. Each picture can select one filter shape for the luma signal and one filter shape for the chroma signal. In HM-4.0, up to sixteen luma ALF filters and at most one chroma ALF filter can be used for each picture. In order to allow localization of ALF, there are two modes for luma pixels to select filters. One is a Region-based Adaptation (RA) mode, and the other is a Block-based Adaptation (BA) mode. In addition to the RA and BA for adaptation mode selection at picture level, Coding Units (CUs) larger than a threshold can be further controlled by filter usage flags to enable or disable ALF operations locally. As for the chroma components, since they are relatively flat, no local adaptation is used in HM-4.0, and the two chroma components of a picture share the same filter.
The RA mode simply divides one luma picture into sixteen regions. Once the picture size is known, the sixteen regions are determined and fixed. The regions can be merged, and one filter is used for each region after merging. Therefore, up to sixteen filters per picture are transmitted for the RA mode. On the other hand, the BA mode uses edge activity and direction as properties for each 4×4 block. Calculating properties of a 4×4 block may require neighboring pixels. For example, a 5×5 window 610 is used for an associated 4×4 window 620 in HM-4.0 as shown in FIG. 6. After properties of 4×4 blocks are calculated, the blocks are classified into fifteen categories. The categories can be merged, and one filter is used for each category after merging. Therefore, up to fifteen filters are transmitted for the BA mode.
In the exemplary decoder implementation for HM-4.0 as shown in FIG. 1, the decoding process is divided into two parts. One is LCU-based processing including Intra prediction (IP) 110, Motion Compensation (MC) 113, Inverse Transform (IT) 126, Inverse Quantization (IQ) 124, and Reconstruction (REC) 128, and the other is picture-based processing including DF 130, SAO 131, and ALF 132. Entropy decoding (ED) 142 belongs to the picture-based processing when SPS, PPS, or slice-level syntax elements are parsed, and ED 142 belongs to the LCU-based processing when syntax elements of LCUs are parsed. In a PC-based software environment, picture-based processing is easier to implement than LCU-based processing for DF, SAO, and ALF. However, if the decoder is implemented in hardware or embedded software, picture-based processing would require picture buffers, which results in high system cost if the picture buffers are on-chip. On the other hand, the use of off-chip picture buffers will significantly increase system bandwidth due to external memory access. Furthermore, power consumption and data access latency will also increase accordingly. Therefore, it is preferred to implement DF, SAO, and ALF using an LCU-based decoding configuration.
When LCU-based processing is used for DF, SAO, and ALF, the encoding and decoding process can be done LCU by LCU in a raster scan order for parallel processing of multiple LCUs. In this case, line buffers are required for DF, SAO, and ALF because processing one LCU row requires pixels from the upper LCU row. If off-chip line buffers (e.g., DRAM) are used, the result is a substantial increase in external memory bandwidth and power consumption. On the other hand, if on-chip line buffers (e.g., SRAM) are used, the chip area will increase and accordingly the chip cost will increase. Though line buffers for LCU-based processing are already much smaller than picture buffers, it is desirable to further reduce line buffers to reduce cost.
FIG. 7 illustrates an example of the line buffer requirement for processing the luma component associated with DF, SAO, and ALF in an LCU-based encoding or decoding system. Bold lines 710 and 712 indicate horizontal and vertical LCU boundaries respectively, where the current LCU is located on the upper side of the horizontal LCU boundary 710 and the right side of the vertical LCU boundary 712. Pixel lines A through J are first processed by horizontal DF and then by vertical DF. For convenience, a pixel line X is referred to as a line X. Vertical DF processing for lines K through N above the horizontal LCU boundary 710 needs to wait until the four lines below the horizontal LCU boundary become available. The horizontal filtering for lines K through N can be delayed until the lower LCU becomes available in order to avoid line buffers for the horizontal-deblocked pixels. Therefore, four lines (i.e., lines K through N) of pre-DF pixels (i.e., reconstructed pixels) have to be stored for DF to be performed at a later time. The pre-DF pixels refer to reconstructed pixels that are not yet processed by DF at all. Accordingly, in a typical system, four (n=4) lines (lines K through N) are used to store pre-DF pixels for subsequent DF processing. Based on the system configuration shown in FIG. 1, SAO is then applied to DF output pixels. Since DF has processed lines A through J, SAO can process lines A through I. The SAO processing can be applied only up to DF output line I since the SAO processing with the EO type is based on a 3×3 window as indicated by box 730. The 3×3 window for line J will require DF output pixels for line K, which are not available yet. After SAO processes lines A through I, the properties for a 4×4 block 740 still cannot be calculated since line J is not yet processed by SAO. Therefore, ALF can only process lines A through F at this time. ALF processing using the 5×5 snowflake filter for line F is shown in FIG. 7, where the filter footprint 750 for an underlying pixel 752 is shown. After this point, no further processing can be done for the current LCU until the lower LCU becomes available.
When the lower LCU becomes available, lines K through N of the current LCU (after the lower LCU arrives, the LCU located at the upper-right quadrant of LCU boundaries 710 and 712 is still referred to as the “current LCU”) are read from line buffers and processed by horizontal DF to generate horizontal-deblocked lines K through N. Horizontal DF processing can be applied to reconstructed lines of the neighboring LCU below. Only two lines (lines O and P) of the neighboring LCU below are shown to illustrate the in-loop processing of reconstructed video data above the bottom LCU boundary line 710. Vertical DF processing is applied to the horizontal-deblocked lines K through N. Vertical DF processing operating on four boundary pixels associated with lines K through N of the current LCU is indicated by box 720 as one example in FIG. 7. After lines K through N are processed by vertical DF, lines J through P can be processed by SAO. When SAO processes line J, line I is required for determining the EO classification. Therefore, two lines (i.e., lines I and J) of DF output pixels have to be stored in line buffers for SAO. Next, the properties of the 4×4 blocks for lines G through P can be calculated and lines G through P can be filtered by ALF accordingly. When line G is processed by ALF, it requires SAO-processed pixel data from lines E through I. Through further analysis, it can be shown that five lines (i.e., lines E through I) of SAO output pixels have to be stored in line buffers for ALF. Accordingly, the total in-loop filtering requires 11 luma line buffers (4 pre-DF lines, 2 DF-processed lines and 5 SAO-processed lines).
FIG. 8 illustrates an example of the chroma line buffer requirement associated with DF, SAO, and ALF for LCU-based decoding. Bold lines 810 and 812 indicate horizontal and vertical LCU boundaries respectively, where the current LCU is located on the upper side of the horizontal LCU boundary 810 and the right side of the vertical LCU boundary 812. When the current LCU is processed, lines A through L are first processed by DF. However, lines M and N cannot be vertically filtered by DF because the lower LCU is not yet available and DF needs two horizontal-deblocked lines below the horizontal boundary 810. Similar to the case of luma in-loop processing, the horizontal filtering for lines M and N is delayed until the lower LCU becomes available in order to avoid buffering of horizontal-deblocked video data. Accordingly, two lines (i.e., lines M and N) of pre-DF video data (i.e., reconstructed video data) need to be stored for DF. SAO is applied to DF output pixels, and the processing for each pixel is based on a 3×3 window as illustrated by box 820 in FIG. 8. Since DF has processed lines A through L, SAO can process lines A through K. After SAO processes lines A through K, ALF can process lines A through I. Since a 5×5 snowflake filter is used, ALF cannot process lines beyond line I, as indicated by the filter footprint 830 for an underlying pixel 832 in line I. After this point, no further processing can be done for the current LCU until the lower LCU becomes available. When the lower LCU arrives, lines M through P are first processed by DF, and then lines L through P are processed by SAO. Only two lines (lines O and P) of the neighboring LCU below are shown to illustrate the in-loop processing of reconstructed video data above the bottom LCU boundary line 810. When SAO processes line L, the neighboring line K is required. Therefore, two lines (i.e., lines K and L) of DF output pixels have to be stored for SAO.
After lines L through P are processed by SAO, lines J through P can be filtered by ALF. When line J is filtered, it requires neighboring lines H through L. Through further analysis, it can be shown that four lines (i.e., lines H through K) of SAO output pixels have to be stored in the line buffers for ALF. Accordingly, the total in-loop filtering requires eight chroma line buffers (2 pre-DF lines, 2 DF-processed lines and 4 SAO-processed lines).
In the above analysis of an exemplary coding system, it is shown that the line buffer requirements of DF, SAO and ALF processing for the luma and chroma components are 11 and 8 lines respectively. For HDTV signals, each line may have nearly two thousand pixels. The total line buffers required for the system therefore become sizeable. It is desirable to reduce the required line buffers for in-loop processing.
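The line counts derived above can be tallied and converted into a rough storage estimate. The sketch below is a back-of-the-envelope calculation under assumptions that are not stated in the text: one byte per sample, 4:2:0 sampling (chroma lines half as wide as luma lines), and the eight chroma lines applying to each of the two chroma planes. The names LUMA_LINES, CHROMA_LINES and line_buffer_bytes are hypothetical.

```python
# Line counts per stage taken from the luma (FIG. 7) and chroma (FIG. 8) analysis.
LUMA_LINES = {"pre_df": 4, "df_out": 2, "sao_out": 5}
CHROMA_LINES = {"pre_df": 2, "df_out": 2, "sao_out": 4}


def line_buffer_bytes(width, bytes_per_sample=1, chroma_width_div=2, chroma_planes=2):
    """Estimate line buffer storage for a picture `width` luma samples wide.

    The defaults (8-bit samples, half-width chroma, two chroma planes) are
    assumptions for illustration, not values given in the text.
    """
    luma = sum(LUMA_LINES.values()) * width * bytes_per_sample
    chroma = (sum(CHROMA_LINES.values()) * (width // chroma_width_div)
              * chroma_planes * bytes_per_sample)
    return luma, chroma, luma + chroma


luma, chroma, total = line_buffer_bytes(1920)
print(sum(LUMA_LINES.values()), sum(CHROMA_LINES.values()))  # 11 8
print(luma, chroma, total)  # 21120 15360 36480
```

Under these assumptions a 1920-sample-wide picture needs roughly 36 kilobytes of line buffer storage for in-loop filtering alone, which illustrates why reducing the number of buffered lines is attractive for on-chip implementations.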