When video is streamed over the Internet and played back through a Web browser or media player, the video is delivered in digital form. Digital video is also used when video is delivered through many broadcast services, satellite services and cable television services. Real-time videoconferencing often uses digital video, and digital video is used during video capture with most smartphones, Web cameras and other video capture devices.
Digital video can consume an extremely large number of bits. The number of bits used per second of represented video content is known as the bit rate. Engineers use compression (also called source coding or source encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video information by converting the information into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original information from the compressed form. A “codec” is an encoder/decoder system.
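To give a sense of scale, the bit rate of uncompressed video can be estimated from its resolution, frame rate and sample format. The following is a minimal sketch; the function name and the example figures (1920×1080 pictures, 30 pictures per second, an average of 12 bits per pixel for 8-bit 4:2:0 content) are illustrative assumptions, not values taken from the text above.

```python
def raw_bit_rate(width, height, frames_per_second, bits_per_pixel):
    """Return the bit rate, in bits per second, of uncompressed video."""
    return width * height * frames_per_second * bits_per_pixel


# Assumed example: 1920x1080, 30 frames per second, 12 bits per pixel on average.
rate = raw_bit_rate(1920, 1080, 30, 12)
print(rate)  # 746496000 bits per second, roughly 746 Mbit/s
```

Under these assumptions the uncompressed stream needs roughly 746 megabits per second, which illustrates why compression is used before storage or transmission.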
Over the last 25 years, various video codec standards have been adopted, including the ITU-T H.261, H.262 (MPEG-2 or ISO/IEC 13818-2), H.263, H.264 (MPEG-4 AVC or ISO/IEC 14496-10), and H.265 (ISO/IEC 23008-2) standards, the MPEG-1 (ISO/IEC 11172-2) and MPEG-4 Visual (ISO/IEC 14496-2) standards, and the SMPTE 421M standard. A video codec standard typically defines options for the syntax of an encoded video bitstream, detailing parameters in the bitstream when particular features are used in encoding and decoding. In many cases, a video codec standard also provides details about the decoding operations a video decoder should perform to achieve conforming results in decoding. Aside from codec standards, various proprietary codec formats define options for the syntax of an encoded video bitstream and corresponding decoding operations.
In general, video compression techniques include “intra-picture” compression and “inter-picture” compression. Intra-picture compression compresses a given picture using information within that picture, whereas inter-picture compression compresses a given picture with reference to one or more preceding and/or following pictures (often called reference or anchor pictures).
Inter-picture compression techniques often use motion estimation and motion compensation to reduce bit rate by exploiting temporal redundancy in a video sequence. Motion estimation is a process for estimating motion between pictures. In one common technique, an encoder using motion estimation attempts to match a current block of sample values in a current picture with a candidate block of the same size in a search area in another picture, the reference picture. A reference picture is, in general, a picture that contains sample values that may be used for prediction in the encoding and decoding process of other pictures.
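The block-matching technique described above can be sketched as an exhaustive search over a square search area. This is only an illustration under stated assumptions: the sum of absolute differences (SAD) is assumed as the matching cost, pictures are plain lists of rows of sample values, and the names `sad` and `full_search` are hypothetical. Practical encoders commonly use faster search patterns and other cost measures.

```python
def sad(current, reference, cur_x, cur_y, ref_x, ref_y, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(current[cur_y + dy][cur_x + dx]
                         - reference[ref_y + dy][ref_x + dx])
    return total


def full_search(current, reference, cur_x, cur_y, size, search_range):
    """Try every integer displacement within +/- search_range samples.

    Returns (best_mv, best_cost), where best_mv is (mv_x, mv_y).
    Candidate blocks that fall outside the reference picture are skipped.
    """
    height, width = len(reference), len(reference[0])
    best_mv, best_cost = None, None
    for mv_y in range(-search_range, search_range + 1):
        for mv_x in range(-search_range, search_range + 1):
            ref_x, ref_y = cur_x + mv_x, cur_y + mv_y
            if ref_x < 0 or ref_y < 0:
                continue
            if ref_x + size > width or ref_y + size > height:
                continue
            cost = sad(current, reference, cur_x, cur_y, ref_x, ref_y, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (mv_x, mv_y), cost
    return best_mv, best_cost


# Toy data: a 16x16 reference picture in which every sample value is unique.
W = H = 16
reference = [[y * W + x for x in range(W)] for y in range(H)]

# The 4x4 current block at (8, 8) is a copy of the reference block at (5, 9).
current = [[0] * W for _ in range(H)]
for y in range(4):
    for x in range(4):
        current[8 + y][8 + x] = reference[9 + y][5 + x]

print(full_search(current, reference, 8, 8, 4, 4))  # ((-3, 1), 0)
```

Because every sample in the toy reference picture is unique, only one candidate block gives a cost of zero, so the search recovers the displacement exactly.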
For a current block, when the video encoder finds an exact or “close enough” match in the search area in the reference picture, the video encoder parameterizes the change in position between the current and candidate blocks as motion data such as a motion vector (“MV”). An MV is conventionally a two-dimensional value, having a horizontal MV component that indicates left or right spatial displacement and a vertical MV component that indicates up or down spatial displacement. An MV can indicate a spatial displacement in terms of an integer number of samples starting from a co-located position in a reference picture for a current block. For example, for a current block at position (32, 16) in a current picture, the MV (−3, 1) indicates a block at position (29, 17) in the reference picture. In general, motion compensation is a process of reconstructing pictures from reference picture(s) using motion data.
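The arithmetic in the example above, and the use of an MV to fetch a prediction block, can be sketched as follows. The function names are hypothetical, the MV is assumed to have integer-sample precision, and the indicated block is assumed to lie entirely inside the reference picture.

```python
def referenced_position(block_x, block_y, mv):
    """Return the top-left position that an integer-sample MV points to."""
    mv_x, mv_y = mv
    return block_x + mv_x, block_y + mv_y


def motion_compensate(reference, block_x, block_y, mv, size):
    """Copy the size x size prediction block that the MV indicates."""
    ref_x, ref_y = referenced_position(block_x, block_y, mv)
    height, width = len(reference), len(reference[0])
    if ref_x < 0 or ref_y < 0 or ref_x + size > width or ref_y + size > height:
        raise ValueError("MV points outside the reference picture")
    return [row[ref_x:ref_x + size] for row in reference[ref_y:ref_y + size]]


# Current block at (32, 16) with MV (-3, 1), as in the example above.
print(referenced_position(32, 16, (-3, 1)))  # (29, 17)
```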
When encoding a block using motion estimation and motion compensation, an encoder often computes the sample-by-sample differences (also called residual values or error values) between the sample values of the block and its motion-compensated prediction. The residual values may then be encoded. For the residual values, encoding efficiency depends on the complexity of the residual values and how much loss or distortion is introduced as part of the compression process. In general, a good motion-compensated prediction closely approximates a block, such that the residual values include few significant values, and the residual values can be efficiently encoded. On the other hand, a poor motion-compensated prediction often yields residual values that include many significant values, which are more difficult to encode efficiently.
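The relationship between prediction quality and residual values can be illustrated with a small sketch. The sample values, the function names and the choice of "non-zero" as the test for a significant value are all illustrative assumptions; the transform, quantization and entropy coding that an encoder would typically apply to the residual are omitted.

```python
def residual(block, prediction):
    """Sample-by-sample differences between a block and its prediction."""
    return [[b - p for b, p in zip(block_row, pred_row)]
            for block_row, pred_row in zip(block, prediction)]


def count_significant(residual_values, threshold=0):
    """Count residual values whose magnitude exceeds the threshold."""
    return sum(1 for row in residual_values for v in row if abs(v) > threshold)


def reconstruct(prediction, residual_values):
    """Add the residual back to the prediction (lossless case)."""
    return [[p + r for p, r in zip(pred_row, res_row)]
            for pred_row, res_row in zip(prediction, residual_values)]


block = [[10, 12], [14, 16]]
good_prediction = [[10, 12], [14, 15]]
poor_prediction = [[0, 20], [30, 5]]

good_residual = residual(block, good_prediction)
poor_residual = residual(block, poor_prediction)

print(good_residual, count_significant(good_residual))  # [[0, 0], [0, 1]] 1
print(poor_residual, count_significant(poor_residual))  # [[10, -8], [-16, 11]] 4
```

When no loss is introduced, adding the residual back to the prediction reproduces the original block in both cases; the difference lies in how many significant values have to be encoded.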
Encoders typically spend a large proportion of encoding time performing motion estimation, attempting to find good matches and thereby improve rate-distortion performance. Encoder-side decisions about motion estimation are not made effectively, however, in certain encoding scenarios. In particular, motion estimation decisions are not made effectively in various situations when encoding screen capture content for remote screen presentation (also called “screen remoting”). For example, when screen capture video shows a user scrolling through a text document or dragging a window that includes text content around a graphical user interface, conventional block-based motion estimation for 16×16 blocks, 8×8 blocks, 4×4 blocks, etc. is typically complex and time-consuming. In addition to using a significant amount of processing resources, which is problematic for low-complexity devices, this can add delay, which is problematic for real-time screen remoting. Also, block-based motion estimation often fails to detect scrolling activity and window movement activity of large magnitude in screen capture video. When such scrolling activity and window movement activity are not efficiently encoded, overall compression efficiency suffers, which is especially problematic in low-bandwidth scenarios.
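A rough sense of the cost and the reach of block-based motion estimation can be obtained with simple arithmetic. The sketch below assumes an exhaustive integer-sample search with a fixed square search range, which is only one possible strategy; the picture size, block size, search range and function names are hypothetical, and the comparison count is an upper bound because candidates outside the picture would be skipped.

```python
import math


def exhaustive_search_cost(width, height, block_size, search_range):
    """Upper bound on the work for an exhaustive search over one picture.

    Returns (blocks, candidates_per_block, sample_comparisons).
    """
    blocks = math.ceil(width / block_size) * math.ceil(height / block_size)
    candidates = (2 * search_range + 1) ** 2
    comparisons = blocks * candidates * block_size ** 2
    return blocks, candidates, comparisons


def mv_reachable(mv, search_range):
    """True if both MV components fit within +/- search_range samples."""
    return abs(mv[0]) <= search_range and abs(mv[1]) <= search_range


# Assumed example: 1920x1080 picture, 16x16 blocks, search range of +/- 16.
print(exhaustive_search_cost(1920, 1080, 16, 16))  # (8160, 1089, 2274877440)

# A vertical scroll of 200 samples lies outside that search range.
print(mv_reachable((0, -200), 16))  # False
```

Under these assumptions a single picture requires on the order of two billion sample comparisons, and a displacement larger than the search range is never evaluated at all, which is consistent with the difficulty described above for scrolling and window movement of large magnitude.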