The present application relates to transmission and storage of video streams and the like, and more particularly to optimizing use of bandwidth and compute cycles in video transmission and/or storage.
Note that the points discussed below may reflect the hindsight gained from the disclosed inventions, and are not necessarily admitted to be prior art.
Video encoding is widely used to convert images and video streams into forms suitable for transmission over limited-bandwidth communications channels. Various video encoding schemes have been developed in attempts to minimize encoding and decoding computational complexity, optimize bandwidth use, improve compressed video image quality and increase energy efficiency.
FIG. 2A shows a generalized example of a video encoding system. A video stream source 20, such as a video camera or rendering program, produces a stream of frames 30. The stream of frames 30 can consist of any visual content 10 that has been recorded or otherwise converted into or generated as a bitstream, such as a sporting event, a movie or television show, a conversation, or a computer graphics demonstration.
The stream of frames 30 enters a video encoder 40, which can use any video encoding/decoding scheme, such as AVC, MPEG-2, VC-1 or Ogg Theora. The video encoder 40 produces an encoded bitstream 50, which can be transmitted over any communications channel 60, including storage, wired, wireless or other means such as a USB stick, DVD, Internet, home network or wireless phone network.
Ultimately, an encoded bitstream 50 is received at a decoder 70, which decodes the encoded bitstream 50 and sends the resulting decoded video stream 80 to a display device 90 to be displayed 100. While the choice of display device 90 may depend on the particular visual content 10, generally, any display device 90 can be used, such as a video-capable mobile phone, a tablet computer, a laptop or desktop computer screen or a television.
H.264/MPEG-4 Part 10, also called Advanced Video Coding or AVC, is one example of a standardized video encoding/decoding scheme, and is used for recording, compression and distribution of high definition video. The AVC specification can be obtained from the International Telecommunication Union web site.
FIG. 2B shows a conventional implementation of AVC encoding. The video encoder 40 initially receives a frame 110 from a stream of frames 30.
A frame 110 is composed of pixels. A pixel is a single point in a recorded image, and is composed of one or more samples 250. A sample 250 is the intersection of a channel and a pixel; that is, a sample 250 is a portion of a pixel that describes an attribute of the pixel, such as color (also called “chroma”) or brightness (also called “luminance” or “luma”). Pixels encoded in AVC can include luma samples, chroma samples, monochrome samples or single-color samples, depending on the type of picture. Samples 250 are composed of bits. Different samples 250 can be composed of different numbers of bits.
FIG. 2C shows the composition of an individual frame 110 of a stream of frames 30 in a generalized video encoding system. A frame 110 has a given height and width in pixels. Generally, a frame 110 has the same height and width in luma samples as it does in pixels (this may not be true with respect to chroma samples). The frame size is its height multiplied by its width. For groups of samples 230 of size P×Q, a frame 110 contains (frame size)/(P×Q) groups of samples 230. For patches of samples 240 of size M×N, a group of samples 230 contains (P×Q)/(M×N) patches of samples 240.
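As a non-limiting illustration, the counting relationships above can be sketched in Python. The function name count_units and the example dimensions (a 1920×1088 frame, 16×16 groups, 4×4 patches) are assumptions chosen for illustration only; the sketch further assumes that groups tile the frame exactly and that patches tile each group exactly.

```python
def count_units(frame_w, frame_h, group_w, group_h, patch_w, patch_h):
    """Return (groups per frame, patches per group).

    Hypothetical helper: assumes groups tile the frame exactly and
    patches tile each group exactly.
    """
    if frame_w % group_w or frame_h % group_h:
        raise ValueError("groups must tile the frame exactly in this sketch")
    if group_w % patch_w or group_h % patch_h:
        raise ValueError("patches must tile each group exactly in this sketch")
    frame_size = frame_w * frame_h              # height multiplied by width
    group_size = group_w * group_h              # P x Q
    patch_size = patch_w * patch_h              # M x N
    return frame_size // group_size, group_size // patch_size


# Illustrative dimensions only.
groups, patches = count_units(1920, 1088, 16, 16, 4, 4)
```

With 16×16 groups, the first returned value is the same quantity as frame size/256.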
FIG. 2D shows the composition of an individual frame 110 of a stream of frames 30 encoded using AVC. In AVC, samples 250 are arranged into two-dimensional arrays called macroblocks 260. A macroblock 260 can contain a 16×16 block of luma samples and two corresponding blocks of chroma samples (of varying sizes) of a picture that has three sample arrays, or a 16×16 block of samples of a monochrome picture or a picture that is coded using three separate color planes.
From here on, for convenience, a 16×16 block of luma samples encoded or decoded using AVC will be called a “macroblock” 260, and references to samples 250 in that macroblock 260 will mean luma samples 250. A frame 110 contains frame size/256 macroblocks 260. A subset of a macroblock 260 (an array of samples equal in size to or smaller than a macroblock 260) will be called a “sub-block” 270. A (luma) sub-block 270 can be a 16×16, 16×8, 8×16, 8×8, 4×8, 8×4 or 4×4 subset of the macroblock 260 (subsets of blocks of chroma samples can be different sizes). Sub-blocks 270 described hereinbelow for exemplary purposes will be size 4×4 unless stated otherwise.
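For illustration, partitioning a 16×16 macroblock into 4×4 sub-blocks can be sketched as follows. The function name split_into_sub_blocks and the list-of-rows representation of a macroblock are assumptions made for this sketch, which handles only uniform partitions whose dimensions divide the macroblock evenly.

```python
def split_into_sub_blocks(macroblock, sub_h=4, sub_w=4):
    """Split a 2-D list of samples into sub-blocks, in raster order.

    Hypothetical helper: only uniform partitions are handled.
    """
    rows, cols = len(macroblock), len(macroblock[0])
    if rows % sub_h or cols % sub_w:
        raise ValueError("sub-block size must divide the macroblock size")
    sub_blocks = []
    for top in range(0, rows, sub_h):
        for left in range(0, cols, sub_w):
            sub_blocks.append(
                [row[left:left + sub_w] for row in macroblock[top:top + sub_h]]
            )
    return sub_blocks


# A 16x16 macroblock whose sample values simply number the positions.
macroblock = [[r * 16 + c for c in range(16)] for r in range(16)]
sub_blocks = split_into_sub_blocks(macroblock)   # sixteen 4x4 sub-blocks
```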
As stated above, FIG. 2B shows an implementation of a conventional AVC video encoder 40. A frame 110 from a stream of frames 30 is received by the video encoder 40. Each frame 110 is encoded as one of three types: an I-frame, a P-frame or a B-frame. Generally, an I-frame is used to begin a stream of frames 30 or a new scene, or to provide a reference frame 210 that, when encoded, will be minimally distorted. P-frames use a prior I-frame or P-frame from the stream of frames 30 as a reference frame 210. B-frames can use both a prior and a later frame in the stream of frames 30 as reference frames 210.
Intra Prediction 120 or Inter Prediction 130 is performed on each macroblock 260 in the frame 110 depending on factors including whether the frame 110 is an I-type, P-type or B-type frame. Partitioning into sub-blocks 270 occurs in both types of prediction. Mode Select 140 picks from an available set of pre-defined rules that Intra Prediction 120 or Inter Prediction 130 uses to recreate each sub-block 270 as nearly as possible based on the contents of nearby sub-blocks 270. Subtracting 150 the predicted contents of the sub-block 270 from the actual contents of the sub-block 270 is intended to produce a difference whose values are as close to zero as possible, because smaller numbers take fewer bits to encode than larger numbers. The chosen prediction rule is encoded along with the Subtraction 150 result so that the sub-block 270 can be recreated by the decoder 70. Prediction is used so that fewer bits are needed to encode the frame 110.
In Intra Prediction 120, a macroblock 260 in the frame 110 that is being predicted is partitioned into sub-blocks 270.
Intra Prediction 120 is performed on each sub-block 270, generating a prediction based on samples 250 adjacent to the sub-block 270. The adjacent samples 250 used in Intra Prediction 120 can, for example, consist of previously decoded and reconstructed samples 250—that is, samples 250 that have already been through the Inverse Quantize 180, Inverse Transform 190 and Add 200 stages to recreate decoded versions of encoded samples 250.
There are twenty-two Intra Prediction 120 modes defined by the AVC specification. Each mode is a set of rules describing how to construct a sub-block 270 from adjacent samples 250. Mode Select 140 attempts to determine the Intra Prediction 120 mode that, based on the adjacent samples 250, can be used to construct a predicted sub-block 270 that most closely resembles the actual sub-block 270.
Once Mode Select 140 has been completed, the predicted sub-block 270 is Subtracted 150 from the actual sub-block 270 and the result is passed to the Transform 160 stage.
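The mode selection and subtraction described above can be sketched, in simplified form, as follows. This is a minimal illustration under stated assumptions, not the AVC procedure: only three example modes (vertical, horizontal and DC) are shown for a 4×4 sub-block, and the sum of absolute differences is assumed as the measure of how closely a prediction resembles the actual sub-block. All function names are hypothetical.

```python
def predict_vertical(above, left):
    # Each column repeats the reconstructed sample directly above it.
    return [list(above) for _ in range(4)]


def predict_horizontal(above, left):
    # Each row repeats the reconstructed sample directly to its left.
    return [[left[r]] * 4 for r in range(4)]


def predict_dc(above, left):
    # Every sample is the rounded mean of the eight adjacent samples.
    dc = (sum(above) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]


MODES = {"vertical": predict_vertical,
         "horizontal": predict_horizontal,
         "dc": predict_dc}


def sad(block_a, block_b):
    """Sum of absolute differences (assumed cost measure)."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b)
               for a, b in zip(ra, rb))


def select_mode_and_subtract(actual, above, left):
    """Pick the lowest-cost mode, then subtract its prediction."""
    best_name = min(MODES, key=lambda n: sad(actual, MODES[n](above, left)))
    predicted = MODES[best_name](above, left)
    difference = [[a - p for a, p in zip(ra, rp)]
                  for ra, rp in zip(actual, predicted)]
    return best_name, difference


above = [10, 20, 30, 40]          # adjacent samples above the sub-block
left = [10, 10, 10, 10]           # adjacent samples to the left
actual = [[10, 20, 30, 40] for _ in range(4)]
mode, difference = select_mode_and_subtract(actual, above, left)
```

In this example the sub-block repeats the row above it, so the vertical mode predicts it exactly and the difference passed onward is all zeros.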
In Inter Prediction 130, each macroblock 260 is partitioned into sub-blocks 270 and prediction is performed based on comparison of the frame 110 currently to be encoded to a reference frame 210 nearby in the stream of frames 30 (or, for B-frames, two nearby reference frames 210, potentially a previous frame and a later frame). The reference frame 210 can consist of a previously decoded and reconstructed frame 110—that is, a frame 110 that has already been through the Inverse Quantize 180, Inverse Transform 190 and Add 200 stages to recreate an encoded and then decoded version of the frame 110. Mode Select 140 determines which Inter Prediction 130 mode to use, including how to partition the macroblock 260, in order to most efficiently encode the macroblock 260.
In Inter Prediction 130, a current motion vector is generated for each sub-block 270 by finding a corresponding sub-block 270 in the reference frame 210 that is near the location of, and contains similar visual content to, the sub-block 270 currently being encoded. An offset is then determined between the sub-block 270 currently being encoded and the corresponding sub-block 270. A predicted motion vector is generated from previously generated current motion vectors of neighboring sub-blocks 270 in the frame 110. The predicted motion vector is Subtracted 150 from the current motion vector and the result is passed to the Transform 160 stage.
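A simplified sketch of the motion vector steps above follows. It assumes an exhaustive search over a small window using the sum of absolute differences as the similarity measure, and assumes that the predicted motion vector is the component-wise median of three neighboring motion vectors; practical encoders may use other search strategies and predictors. All function names, and the 8×8 test frames, are hypothetical.

```python
def get_block(frame, top, left, size):
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a, block_b):
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b)
               for a, b in zip(ra, rb))


def motion_search(current, reference, top, left, size=4, search_range=2):
    """Return the offset (dx, dy) of the best match in the reference frame."""
    target = get_block(current, top, left, size)
    best_cost, best_mv = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            r_top, r_left = top + dy, left + dx
            if (r_top < 0 or r_left < 0 or
                    r_top + size > len(reference) or
                    r_left + size > len(reference[0])):
                continue  # candidate falls outside the reference frame
            cost = sad(target, get_block(reference, r_top, r_left, size))
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv


def median_mv(mv_a, mv_b, mv_c):
    """Component-wise median of three neighboring motion vectors."""
    return (sorted([mv_a[0], mv_b[0], mv_c[0]])[1],
            sorted([mv_a[1], mv_b[1], mv_c[1]])[1])


def mv_difference(current_mv, predicted_mv):
    return (current_mv[0] - predicted_mv[0], current_mv[1] - predicted_mv[1])


# Reference frame with distinct values; current frame is the same content
# shifted one sample to the right.
reference = [[r * 8 + c for c in range(8)] for r in range(8)]
current = [[reference[r][max(c - 1, 0)] for c in range(8)] for r in range(8)]

current_mv = motion_search(current, reference, top=2, left=2)
predicted_mv = median_mv((-1, 0), (-2, 1), (0, 0))
difference = mv_difference(current_mv, predicted_mv)
```

Here the neighbors move similarly to the sub-block being encoded, so the difference is small, which is the purpose of predicting the motion vector.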
At the Transform 160 stage, an integer block transform is performed on each macroblock 260 resulting from Subtraction 150. The output of the Transform 160 stage is then Quantized 170.
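By way of example, an integer block transform on a 4×4 block can be sketched as a pair of integer matrix multiplications. The matrix below is the commonly cited 4×4 forward core transform associated with AVC and should be verified against the AVC specification before being relied upon; the scaling that is usually combined with quantization is omitted, and the function names are hypothetical.

```python
# Commonly cited 4x4 forward core transform matrix (to be verified
# against the AVC specification).
CORE = [[1, 1, 1, 1],
        [2, 1, -1, -2],
        [1, -1, -1, 1],
        [1, -2, 2, -1]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_transform(block):
    """Compute CORE * block * CORE^T using integer arithmetic only."""
    return matmul(matmul(CORE, block), transpose(CORE))


flat = [[1, 1, 1, 1] for _ in range(4)]     # a featureless 4x4 block
coefficients = forward_transform(flat)
```

For a flat block, all of the energy lands in the top-left coefficient and every other coefficient is zero, which illustrates why transformed data is typically cheaper to encode.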
Quantizing 170 consists of multiplying the output of the Transform 160 stage by a multiplication factor and then performing a bitwise right-shift in order to deliberately implement a chosen level of lossiness, thus allocating a particular number of bits to encode each macroblock 260. The purposes of Quantizing 170 include attempting to achieve a desired ratio of visual quality to compression and to match imposed bandwidth limitations.
The amount of bitwise right-shift is determined by a variable called QP. Choice of QP determines how much detail is retained in the frame 110 and how many bits will be required to encode the frame 110. QP is chosen by rate control, which is part of Quantizing 170.
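The multiply-and-shift operation and its dependence on QP can be sketched as follows. This is an illustrative model only: the multiplication factor of 8192, the base shift of 15, and the rule that the shift increases by one for every six steps of QP are assumptions made for this sketch rather than values taken from the AVC specification, and rounding offsets are omitted.

```python
def shift_for_qp(qp, base_shift=15):
    # Assumed relationship: one additional bit of right-shift per six QP steps.
    return base_shift + qp // 6


def quantize(coefficient, qp, mult=8192):
    """Multiply by a factor, then right-shift by an amount set by QP.

    The shift is applied to the magnitude so that negative coefficients
    are truncated toward zero, and the sign is restored afterwards.
    """
    sign = -1 if coefficient < 0 else 1
    level = (abs(coefficient) * mult) >> shift_for_qp(qp)
    return sign * level


low_qp_level = quantize(1000, qp=0)     # more detail retained
high_qp_level = quantize(1000, qp=12)   # fewer bits, more loss
```

A larger QP yields a larger shift, so the same transform output is represented by a smaller number that requires fewer bits but carries less detail.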
Once the frame 110 is Quantized 170, the resulting bitstream passes through a Bitstream Encode 220 stage, which typically includes a reordering stage and a lossless entropy encoding stage. The frame 110 is then output by the encoder 40.
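A reordering stage followed by a lossless entropy encoding stage can be sketched as below. The sketch assumes a zigzag scan of a 4×4 block for reordering and uses Exp-Golomb codes as a simple stand-in for the entropy encoder; practical AVC encoders use more elaborate entropy coding for quantized data. The function names are hypothetical.

```python
# Zigzag scan order for a 4x4 block, as raster-order indices (assumed scan).
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def zigzag(block):
    """Reorder a 4x4 block so low-frequency values come first."""
    flat = [v for row in block for v in row]
    return [flat[i] for i in ZIGZAG_4X4]


def exp_golomb_unsigned(value):
    """Unsigned Exp-Golomb code, returned as a string of bits."""
    bits = bin(value + 1)[2:]
    return "0" * (len(bits) - 1) + bits


def exp_golomb_signed(value):
    """Map a signed value to an unsigned code number, then encode it."""
    code_num = 2 * value - 1 if value > 0 else -2 * value
    return exp_golomb_unsigned(code_num)


def encode_block(block):
    return "".join(exp_golomb_signed(v) for v in zigzag(block))


quantized = [[3, -1, 0, 0],
             [1, 0, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
bitstream = encode_block(quantized)
```

Because the scan groups the trailing zeros together and zero maps to a single bit, a block whose quantized values are mostly zero encodes into very few bits.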
The Quantizing 170 result also is sent to an Inverse Quantizing stage 180, where Quantizing 170 is reversed; an Inverse Transform 190 stage, where the Transform 160 is reversed; and an Add 200 stage, where the prediction that was originally Subtracted 150 is Added 200 back in. The result of the Add 200 stage is a decoded version of the frame 110, which can then be used by the encoder 40 as a reference frame 210.
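The reconstruction path above can be illustrated for a single sample. In this sketch the Transform and Inverse Transform stages are omitted for brevity, and quantization is modeled as integer division by a hypothetical step size; the function names are likewise assumptions.

```python
def quantize(value, step):
    """Lossy step: divide by the step size, truncating toward zero."""
    sign = -1 if value < 0 else 1
    return sign * (abs(value) // step)


def dequantize(level, step):
    """Inverse Quantize: scale the level back up (lost detail stays lost)."""
    return level * step


def encode_and_reconstruct(actual, predicted, step):
    difference = actual - predicted                        # Subtract
    level = quantize(difference, step)                     # Quantize
    reconstructed = predicted + dequantize(level, step)    # Inverse Quantize, Add
    return level, reconstructed


level, reconstructed = encode_and_reconstruct(actual=107, predicted=100, step=4)
```

The reconstructed value (104) differs from the original (107) because quantization is lossy. Using the reconstructed value as the reference keeps the encoder's predictions consistent with what the decoder 70 is able to produce.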
FIG. 2E schematically shows an example of conventional end-to-end encoding and decoding in the context of an Internet user. Visual content 10 is digitized and sent to an encoder 40. The encoded content is then uploaded 55 to the Internet 65. The encoded content can then be downloaded 67 by a user. The user's device decodes 70 the encoded content, and the user's display device 90 displays 100 the decoded visual content 10. For a good user experience, it is advantageous for downloading 67 and decoding 70 to be as fast as possible, and for the displayed 100 visual content 10 to be of as high quality as possible.