Three-dimensional (3D) television has been a technology trend in recent years that aims to bring viewers a sensational viewing experience. Various technologies have been developed to enable 3D viewing. Among them, multi-view video is a key technology for 3DTV applications. Traditional video is a two-dimensional (2D) medium that only provides viewers a single view of a scene from the perspective of the camera. In contrast, multi-view video is capable of offering arbitrary viewpoints of dynamic scenes and provides viewers the sensation of realism.
The multi-view video is typically created by capturing a scene using multiple cameras simultaneously, where the multiple cameras are properly located so that each camera captures the scene from one viewpoint. Accordingly, the multiple cameras will capture multiple video sequences corresponding to multiple views. In order to provide more views, more cameras have been used to generate multi-view video with a large number of video sequences associated with the views. Accordingly, the multi-view video will require a large storage space to store and/or a high bandwidth to transmit. Therefore, multi-view video coding techniques have been developed in the field to reduce the required storage space or the transmission bandwidth.
A straightforward approach may be to simply apply conventional video coding techniques to each single-view video sequence independently and disregard any correlation among different views. Such a coding system would be very inefficient. In order to improve the efficiency of multi-view video coding, typical multi-view video coding exploits inter-view redundancy. Therefore, most 3D Video Coding (3DVC) systems take into account the correlation of video data associated with multiple views and depth maps.
FIG. 1 illustrates generic prediction structure for 3D video coding. The incoming 3D video data consists of images (110-0, 110-1, 110-2, . . . ) corresponding to multiple views. The images collected for each view form an image sequence for the corresponding view. Usually, the image sequence 110-0 corresponding to a base view (also called an independent view) is coded independently by a video coder 130-0 conforming to a video coding standard such as H.264/AVC or HEVC (High Efficiency Video Coding). The video coders (130-1, 130-2, . . . ) for image sequences associated with dependent views (i.e., views 1, 2, . . . ) further utilize inter-view prediction in addition to temporal prediction. The inter-view predictions are indicated by the short-dashed lines in FIG. 1.
In order to support interactive applications, depth maps (120-0, 120-1, 120-2, . . . ) associated with a scene at respective views are also included in the video bitstream. In order to reduce data associated with the depth maps, the depth maps are compressed using depth map coders (140-0, 140-1, 140-2, . . . ) and the compressed depth map data is included in the bitstream as shown in FIG. 1. A multiplexer 150 is used to combine compressed data from image coders and depth map coders. The depth information can be used for synthesizing virtual views at selected intermediate viewpoints. To increase the coding efficiency, information sharing or prediction between textures (i.e., image sequences) and depths can also be utilized as indicated by long-dashed lines in FIG. 1. Furthermore, inter-view predictions as indicated by dotted-dashed lines can be used to code the depth maps for dependent views. An image corresponding to a selected view may be coded using inter-view prediction based on an image corresponding to another view. In this case, the image for the selected view is referred to as a dependent view. Similarly, a depth map for a selected view may be coded using inter-view prediction based on a depth map corresponding to another view. In this case, the depth map for the selected view is referred to as a dependent depth map. For color video, the depth maps associated with the chrominance (chroma) component may not need the same resolution as the luminance (luma) component. Therefore, the depth maps for color video may be coded in 4:0:0 sampling format, where the 4:0:0 sampling format means that only luma is sampled. The depth maps may be non-linearly scaled.
High-Efficiency Video Coding (HEVC) is a new international video coding standard that is developed by the Joint Collaborative Team on Video Coding (JCT-VC). In the HEVC, sample-adaptive offset (SAO) is a technique designed to reduce the distortion (intensity offset) of reconstructed pictures. SAO can be applied to individual color components such as luma and chroma components. FIGS. 2A-B illustrate exemplary system diagrams for an HEVC encoder and decoder respectively, where SAO 231 is applied to reconstructed video data processed by deblocking filter (DF) 230.
FIG. 2A illustrates an exemplary adaptive inter/intra video encoding system incorporating in-loop processing. For inter-prediction, Motion Estimation (ME)/Motion Compensation (MC) 212 is used to provide prediction data based on video data from other picture or pictures. Mode decision 214 selects Intra Prediction 210 or inter-prediction data 212 and the selected prediction data is supplied to Adder 216 to form prediction errors, also called residues. The prediction error is then processed by Transformation (T) 218 followed by Quantization (Q) 220. The transformed and quantized residues are then coded by Entropy Encoder 222 to form a video bitstream corresponding to the compressed video data. The bitstream associated with the transform coefficients is then packed with side information such as motion, mode, and other information associated with the image area. The side information may also be subject to entropy coding to reduce required bandwidth. Accordingly, the data associated with the side information are provided to Entropy Encoder 222 as shown in FIG. 2A. When an inter-prediction mode is used, a reference picture or pictures have to be reconstructed at the encoder end as well. Consequently, the transformed and quantized residues are processed by Inverse Quantization (IQ) 224 and Inverse Transformation (IT) 226 to recover the residues. The residues are then added back to prediction data 236 at Reconstruction (REC) 228 to reconstruct video data. The reconstructed video data may be stored in Reference Picture Buffer 234 and used for prediction of other frames.
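The reason the encoder must run its own reconstruction path can be illustrated with a minimal sketch. It assumes an identity transform (T 218 and IT 226 are omitted for brevity) and a uniform scalar quantizer with a hypothetical step size `step`; the function names are illustrative and do not come from any codec implementation. The point shown is that the encoder reconstructs from the quantized levels, not from the original samples, so its reference data matches what a decoder would obtain.

```python
def quantize(residue, step):
    """Uniform scalar quantizer with rounding to the nearest level (Q)."""
    sign = -1 if residue < 0 else 1
    return sign * ((abs(residue) + step // 2) // step)


def dequantize(level, step):
    """Inverse quantization (IQ): map a level back to a residue value."""
    return level * step


def encode_block(original, prediction, step):
    """Form residues, quantize them, and reconstruct as the encoder would.

    Returns the quantized levels (what would be entropy coded) and the
    reconstructed samples (what would go to the reference picture buffer).
    """
    levels = [quantize(o - p, step) for o, p in zip(original, prediction)]
    reconstructed = [p + dequantize(l, step) for p, l in zip(prediction, levels)]
    return levels, reconstructed


def decode_block(levels, prediction, step):
    """Decoder-side reconstruction from the received levels."""
    return [p + dequantize(l, step) for p, l in zip(prediction, levels)]


if __name__ == "__main__":
    original = [100, 104, 97, 120]
    prediction = [98, 98, 98, 98]
    levels, recon = encode_block(original, prediction, step=4)
    print(levels, recon)
    print(decode_block(levels, prediction, step=4) == recon)
```

Because quantization is lossy, the reconstructed samples differ from the original ones; this is the distortion that in-loop processing subsequently tries to reduce.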
As shown in FIG. 2A, incoming video data undergoes a series of processing steps in the encoding system. The reconstructed video data from REC 228 may be subject to various impairments due to this series of processing steps. Accordingly, various in-loop processing is applied to the reconstructed video data before the reconstructed video data are stored in the Reference Picture Buffer 234 in order to improve video quality. In the High Efficiency Video Coding (HEVC) standard being developed, Deblocking Filter (DF) 230 and Sample Adaptive Offset (SAO) 231 have been developed to enhance picture quality. Both Deblocking Filter (DF) 230 and Sample Adaptive Offset (SAO) 231 are referred to as loop filters. SAO 231 is further referred to as an adaptive filter since the filter operation is adaptive and filter information may have to be incorporated in the bitstream so that a decoder can properly recover the required information. Therefore, loop filter information from SAO is provided to Entropy Encoder 222 for incorporation into the bitstream. In FIG. 2A, DF 230 is applied to the reconstructed video first and SAO 231 is then applied to the DF-processed video.
A corresponding decoder for the encoder of FIG. 2A is shown in FIG. 2B. The video bitstream is decoded by Video Decoder 242 to recover the transformed and quantized residues, SAO information and other system information. At the decoder side, only Motion Compensation (MC) 213 is performed instead of ME/MC. Switch 215 is used to select Intra prediction 210 or Inter prediction (i.e., motion compensation 213) according to the mode information. The decoding process is similar to the reconstruction loop at the encoder side. The recovered transformed and quantized residues, SAO information and other system information are used to reconstruct the video data. The reconstructed video is further processed by DF 230 and SAO 231 to produce the final enhanced decoded video. As shown in FIG. 2A and FIG. 2B, the DF and SAO are applied to reconstructed video data in a video encoder as well as a video decoder. In this case, when the SAO is applied to the reconstructed video data, the reconstructed video data may have been processed by the DF. For convenience, the reconstructed video data with or without additional processing (such as the DF) are referred to as processed video data (such as processed multi-view images and processed multi-view depth maps). In a software or hardware based implementation, the processed multi-view images and processed multi-view depth maps can be received from a medium, such as memory, disk, or network. The processed multi-view images and processed multi-view depth maps may also be received from another processor.
The coding process in HEVC is applied to each image region named Largest Coding Unit (LCU). The LCU is adaptively partitioned into coding units (CUs) using a quadtree. The LCU is also called Coding Tree Block (CTB). For each leaf CU, DF is performed for each 8×8 block in HEVC. For each 8×8 block, horizontal filtering across vertical block boundaries is first applied, and then vertical filtering across horizontal block boundaries is applied.
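The quadtree partitioning of an LCU into leaf CUs can be sketched as a simple recursion. This is an illustrative sketch only: the split decision is supplied by the caller through a hypothetical `should_split` callback (a real encoder would typically base it on rate-distortion cost), and the minimum CU size of 8 is an assumption chosen for the example.

```python
def partition_quadtree(x, y, size, min_size, should_split):
    """Return the leaf CUs of a quadtree as (x, y, size) tuples.

    A block is split into four equal quadrants while it is larger than
    min_size and the caller-supplied should_split(x, y, size) says so.
    """
    if size > min_size and should_split(x, y, size):
        half = size // 2
        leaves = []
        # Visit quadrants in raster order: top-left, top-right,
        # bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                leaves.extend(
                    partition_quadtree(x + dx, y + dy, half, min_size, should_split)
                )
        return leaves
    return [(x, y, size)]


if __name__ == "__main__":
    # Hypothetical decision: split the 64x64 LCU, then split only its
    # top-left 32x32 quadrant once more.
    def decide(x, y, size):
        return size == 64 or (x == 0 and y == 0 and size == 32)

    for leaf in partition_quadtree(0, 0, 64, 8, decide):
        print(leaf)
```

The leaves tile the LCU exactly, so the sum of their areas equals the LCU area regardless of the split decisions.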
SAO can be regarded as a special case of filtering, where the processing only applies to one pixel. In SAO, pixel classification is first done to classify pixels into different groups (also called categories or classes). The pixel classification for each pixel is based on a 3×3 window. Upon the classification of all pixels in a picture or a region, one offset is derived and transmitted for each group of pixels. In HEVC Test Model Version 4.0 (HM-4.0) or newer versions, SAO is applied to luma and chroma components, and each of the color components is independently processed. For SAO, one picture is divided into multiple LCU-aligned regions. Each region can select one SAO type among two Band Offset (BO) types, four Edge Offset (EO) types, and no processing (OFF). For each to-be-processed (also called to-be-filtered) pixel, BO uses the pixel intensity to classify the pixel into a band. The pixel intensity range is equally divided into 32 bands as shown in FIG. 3. After pixel classification, one offset is derived for pixels of each band, and the offsets of either the center 16 bands or the outer 16 bands are selected and coded. EO, in contrast, uses two neighboring pixels of a to-be-processed pixel to classify the pixel into a category. The four EO types correspond to 0°, 90°, 135°, and 45° as shown in FIG. 4. Similar to BO, one offset is derived for all pixels of each category except for category 0, which is forced to use zero offset. Table 1 shows the EO pixel classification, where “C” denotes the pixel to be classified. Therefore, four offset values are coded for each coding tree block (CTB) or Largest Coding Unit (LCU) when EO types are used.
TABLE 1
Category    Condition
1           C < two neighbors
2           C < one neighbor && C == one neighbor
3           C > one neighbor && C == one neighbor
4           C > two neighbors
0           None of the above
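The BO band classification and the EO classification of Table 1 can be sketched as follows. This is a simplified illustration, not the HM reference implementation: the function names are hypothetical, pixels whose neighbors fall outside the block are left unchanged as a simplification, and the offset values in the usage example are made up.

```python
# Neighbor offsets (d_row, d_col) for the four EO directions.
EO_NEIGHBORS = {
    0: ((0, -1), (0, 1)),      # 0 degrees: left and right
    90: ((-1, 0), (1, 0)),     # 90 degrees: above and below
    135: ((-1, -1), (1, 1)),   # 135 degrees: upper-left and lower-right
    45: ((-1, 1), (1, -1)),    # 45 degrees: upper-right and lower-left
}


def band_index(pixel, bit_depth=8):
    """BO: map a pixel intensity to one of 32 equal-width bands."""
    return pixel >> (bit_depth - 5)


def eo_category(c, a, b):
    """EO: classify pixel c against its two neighbors a and b (Table 1)."""
    if c < a and c < b:
        return 1
    if (c < a and c == b) or (c == a and c < b):
        return 2
    if (c > a and c == b) or (c == a and c > b):
        return 3
    if c > a and c > b:
        return 4
    return 0


def apply_edge_offset(block, direction, offsets, bit_depth=8):
    """Add the per-category offset to every pixel; category 0 gets zero.

    offsets maps categories 1..4 to integer offsets. Pixels with a neighbor
    outside the block are skipped (a simplification for this sketch).
    """
    (dr1, dc1), (dr2, dc2) = EO_NEIGHBORS[direction]
    height, width = len(block), len(block[0])
    max_val = (1 << bit_depth) - 1
    out = [row[:] for row in block]
    for r in range(height):
        for c in range(width):
            r1, c1, r2, c2 = r + dr1, c + dc1, r + dr2, c + dc2
            if not (0 <= r1 < height and 0 <= c1 < width
                    and 0 <= r2 < height and 0 <= c2 < width):
                continue
            cat = eo_category(block[r][c], block[r1][c1], block[r2][c2])
            # Clip the result to the valid sample range.
            out[r][c] = min(max(block[r][c] + offsets.get(cat, 0), 0), max_val)
    return out


if __name__ == "__main__":
    block = [[10, 10, 10], [10, 5, 10], [10, 10, 10]]
    print(apply_edge_offset(block, 0, {1: 2, 2: 1, 3: -1, 4: -2}))
    print(band_index(64), band_index(191))
```

In the usage example the center pixel is a local minimum along the 0° direction (category 1), so it is raised by the category 1 offset, while flat pixels fall into category 0 and are untouched.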
For each color component (luma or chroma), one picture is divided into CTB-aligned regions, and each region can select one SAO type among BO (with starting band position), four EO types (classes), and no processing (OFF). The total number of offset values in one picture depends on the number of region partitions and the SAO type selected by each CTB. The SAO syntax table and parameters are shown in FIG. 5A and FIG. 5B, where the syntax in FIG. 5B represents a continuation of the syntax from FIG. 5A. The SAO merge Left and Up flags (i.e., sao_merge_left_flag and sao_merge_up_flag) are shown in lines 510 and 520 respectively. The syntax element sao_merge_left_flag indicates whether the current CTB reuses the parameters of the left CTB. The syntax element sao_merge_up_flag indicates whether the current CTB reuses the parameters of the upper CTB. The SAO type indices for luma and chroma (i.e., sao_type_idx_luma and sao_type_idx_chroma) are shown in lines 530 and 540 respectively. The SAO offset value is represented by magnitude and sign (i.e., sao_offset_abs and sao_offset_sign) as shown in lines 550 and 560 respectively. The band position (i.e., sao_band_position) of BO is indicated in line 570. The EO classes for luma and chroma (i.e., sao_eo_class_luma and sao_eo_class_chroma) are indicated in lines 580 and 590 respectively. The syntax element cIdx indicates one of three color components. While FIG. 5A and FIG. 5B illustrate the syntax design to convey the SAO information according to the HEVC standard, a person skilled in the art may use other syntax designs to convey the SAO information.
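How the merge flags let a CTB inherit SAO parameters from a neighbor can be sketched as below. The dictionary layout, the `params` key and the function name are assumptions made for illustration; only the flag names sao_merge_left_flag and sao_merge_up_flag come from the syntax described above. The sketch also assumes that a merge flag is only set when the corresponding neighbor exists.

```python
def resolve_sao_params(ctb_syntax, ctbs_per_row):
    """Resolve the effective SAO parameters for CTBs given in raster order.

    Each entry of ctb_syntax is a dict that may carry 'sao_merge_left_flag',
    'sao_merge_up_flag', and, when neither merge applies, explicit 'params'.
    """
    resolved = []
    for idx, syntax in enumerate(ctb_syntax):
        row, col = divmod(idx, ctbs_per_row)
        if syntax.get("sao_merge_left_flag") and col > 0:
            # Reuse the parameters of the CTB to the left.
            resolved.append(resolved[idx - 1])
        elif syntax.get("sao_merge_up_flag") and row > 0:
            # Reuse the parameters of the CTB above.
            resolved.append(resolved[idx - ctbs_per_row])
        else:
            # Explicitly signalled parameters for this CTB.
            resolved.append(syntax["params"])
    return resolved


if __name__ == "__main__":
    # A 2x2 grid of CTBs; parameter contents are made-up placeholders.
    a = {"type": "EO", "eo_class": 0, "offsets": [2, 1, -1, -2]}
    b = {"type": "BO", "band_position": 8, "offsets": [1, 1, 0, -1]}
    syntax = [
        {"params": a},
        {"sao_merge_left_flag": 1},
        {"sao_merge_up_flag": 1},
        {"params": b},
    ]
    for p in resolve_sao_params(syntax, ctbs_per_row=2):
        print(p["type"])
```

Merging saves bits because a merged CTB carries only a flag instead of a type index, class or band position, and offset values.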
Multi-view video may result in a large amount of data for storage or transmission. It is desirable to further improve the efficiency of three-dimensional video coding. The SAO coding tool has been shown to improve video quality for conventional video compression. It is therefore desirable to apply SAO to multi-view video to improve the coding efficiency.