To provide good quality video images and to enable improved, new-generation video codec applications, some desirable attributes of a video codec include (1) the ability to maximize the perceived video quality of important regions in an image given a limited video coding bitrate or bandwidth, and (2) the ability to support object-based video coding, where objects in an image are detected so that codec precision can be adjusted accordingly. The first attribute can be addressed, at least to some extent, by normal video coding (meaning coding without region segmentation-based coding) using video coding standards such as H.264, scalable video coding (SVC), High Efficiency Video Coding (HEVC), or scalable HEVC (SHVC), or using non-standard alternative video codecs such as VP8 and VP9, to name a few examples. However, to get the best results with these standards, awareness of important regions (region segmentation) may be necessary. Further, in principle, a standard such as MPEG-4 that supports explicit coding of objects is necessary to achieve the second attribute. However, the standards, be it MPEG-4, H.264, or HEVC, only describe bitstream syntax and decoding semantics, and only loosely mandate details of an encoder, much less details of segmentation. Further, segmentation of video, though desirable for enabling advanced applications, can be computationally complex and very context dependent. This is further complicated because the standards do not cover segmentation of video.
In limited-bandwidth video coding, quantization adapted to human perceptual and/or visual requirements can be used to achieve improved video quality as perceived by users. Specifically, in video encoding, luma and chroma pixel values may be transformed into frequency coefficients, such as discrete cosine transform coefficients, that are then quantized or rounded to certain values in order to remove precision in the coefficients beyond what is detectable by the human eye. For example, the human eye is less sensitive to color than to brightness, and the human eye can only notice a certain level of difference in brightness and color. Thus, to improve perceived image quality, several processes may be exploited, such as, but not limited to, (1) identifying highly textured areas where more noise can be added without adding visually noticeable artifacts, (2) identifying areas of very high or very low brightness, where somewhat higher quantization artifacts can be hidden, (3) identifying frames just before or just after scene cuts, where more quantization noise can be introduced without it being very visible, and (4) identifying areas of focus, such as human faces and other objects within a video, that are likely of higher interest (region of interest (ROI)), such that ROI areas, such as a foreground, can be coded with finer quantization and better quality, while other areas, such as a background, are coded with relatively lower quality.
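As an illustration of process (4), the following minimal sketch shows how a coarser or finer quantization step changes the reconstruction error of a set of transform coefficients. It is not taken from any codec or standard: the function names, the step sizes `fg_step` and `bg_step`, and the uniform rounding quantizer are all assumptions chosen for clarity, and a real encoder would use the quantization parameter and scaling rules of its own standard.

```python
def quantize(coeffs, step):
    """Map each transform coefficient to an integer level (uniform quantizer)."""
    return [int(round(c / step)) for c in coeffs]


def dequantize(levels, step):
    """Reconstruct approximate coefficients from integer levels."""
    return [level * step for level in levels]


def max_error(coeffs, step):
    """Largest absolute reconstruction error for the given step size."""
    recon = dequantize(quantize(coeffs, step), step)
    return max(abs(a - b) for a, b in zip(coeffs, recon))


def block_steps(roi_mask, fg_step=4.0, bg_step=16.0):
    """Assign a finer step to ROI (foreground) blocks, a coarser one elsewhere.

    roi_mask is a 2-D list of booleans, one entry per block (hypothetical layout).
    """
    return [[fg_step if is_roi else bg_step for is_roi in row] for row in roi_mask]


# Hypothetical coefficients of one block, and a 2x3 block-level ROI mask.
coeffs = [100.0, -37.5, 12.25, 3.0, -1.5, 0.5]
roi_mask = [[False, True, False],
            [False, True, True]]
steps = block_steps(roi_mask)
```

With a uniform rounding quantizer the reconstruction error is bounded by half the step size, so blocks marked as ROI keep more precision, while background blocks produce smaller levels and more zero levels, which typically cost fewer bits to code.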
This last technique is especially relevant in the context of certain applications such as video conferencing, video chats, and other applications, including applications that use foreground overlays on a background. For these examples, the segmentation of a usually static, or at least more static, background (BG) from a usually moving human head and shoulders, or other overlay objects, in a foreground (FG) is used to concentrate the fine coding on the foreground and thereby improve the perceived quality of the coded video. While many general techniques for segmentation of foreground from background are available, most of the techniques are either compute intensive, or perform the segmentation of background from foreground poorly, or both. With better quality practical segmentation, the available limited coding bandwidth can be better directed at the ROI, such as to improve the quality of the human or other foreground objects in the scene, thereby giving a perceived overall improvement in image quality.
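To make the foreground/background split concrete, the sketch below shows a deliberately naive baseline: per-pixel differencing against a background image that is assumed to be known and static. It is not the segmentation approach of this document, and it illustrates the weaknesses noted above, since a fixed threshold is sensitive to noise, lighting changes, and camera motion. The function names and the `threshold` value are hypothetical.

```python
def segment_foreground(frame, background, threshold=20):
    """Return a boolean mask that is True where a pixel differs from the background.

    frame and background are 2-D lists of luma values with the same dimensions.
    """
    return [
        [abs(p - b) > threshold for p, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def foreground_fraction(mask):
    """Fraction of pixels classified as foreground."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Hypothetical 4x6 luma frames: a flat background and a bright 2x2 object.
background = [[50] * 6 for _ in range(4)]
frame = [row[:] for row in background]
frame[0][0] = 55                      # small noise, below the threshold
for y in (1, 2):
    for x in (2, 3):
        frame[y][x] = 200             # moving foreground object

mask = segment_foreground(frame, background)
```

The resulting mask could then drive a per-region quality decision, for example by giving foreground regions finer quantization than background regions.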