This invention generally relates to encoding and decoding content, and more specifically to encoding and decoding content for videos.
Various online systems transmit information to and from one another over a network. The information may be in the form of images, videos that include a sequence of frames, or text. A sender typically encodes the information using an encoder system into a compressed form, and the compressed information is transmitted to the receiver. The receiver can then decode the compressed information using a decoder system to reconstruct the original information. A video typically includes a sequence of image frames that capture the motion of objects and of the background of a scene, which occurs due to movement of the camera or movement of the objects themselves. Compared to compressing other types of information, video compression can be challenging due to large file sizes and issues such as video and audio synchronization. Video compression for lower-power devices, such as smartphones, can be even more challenging.
One way to encode each target frame in the sequence is to take advantage of redundant information in “reference frames.” Reference frames for a target frame are frames in the video that are reconstructed before the target frame. In a process termed “P-frame compression,” an encoder system identifies blocks of pixels in a reference frame. For each block in the reference frame, the encoder system determines the displacement between the block in the reference frame and a corresponding block in the target frame that contains the same portion of the scene. The displacement reflects the movement of the portion of the scene from the reference frame to the target frame. Typically, the displacements are represented in the form of motion vectors that indicate the direction and magnitude of the change from the reference frame to the target frame.
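By way of illustration only, the determination of a motion vector for a single block may be sketched as an exhaustive block-matching search. The sketch below is not a description of any particular encoder: the function names, the sum-of-absolute-differences cost, the search radius, and the toy 6x6 frames are all assumptions introduced for this example.

```python
def sad(ref, tgt, ry, rx, ty, tx, size):
    """Sum of absolute differences between a reference block and a target block."""
    return sum(
        abs(ref[ry + i][rx + j] - tgt[ty + i][tx + j])
        for i in range(size)
        for j in range(size)
    )


def find_motion_vector(ref, tgt, ry, rx, size, search=2):
    """Return the (dy, dx) that best maps the reference block at (ry, rx) into tgt.

    Hypothetical helper: tries every displacement within +/- search pixels
    and keeps the one with the lowest matching cost.
    """
    height, width = len(tgt), len(tgt[0])
    best_cost, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ty, tx = ry + dy, rx + dx
            if ty < 0 or tx < 0 or ty + size > height or tx + size > width:
                continue  # candidate block would fall outside the target frame
            cost = sad(ref, tgt, ry, rx, ty, tx, size)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dy, dx)
    return best_mv


# Toy frames (assumed data): a bright 2x2 patch moves one pixel down
# and two pixels right between the reference and the target.
reference = [[0] * 6 for _ in range(6)]
target = [[0] * 6 for _ in range(6)]
for i in range(2):
    for j in range(2):
        reference[1 + i][1 + j] = 200
        target[2 + i][3 + j] = 200

motion_vector = find_motion_vector(reference, target, ry=1, rx=1, size=2)
print(motion_vector)
```

In this toy case the returned vector is the vertical and horizontal displacement of the patch. Practical encoders generally use faster search strategies and sub-pixel precision, which this sketch omits.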
During the encoding process, the encoder system repeatedly determines motion vectors for a sequence of target frames in the video, each with respect to a reference frame that was reconstructed before the target frame. The encoder system generates a compensated frame by displacing the blocks of pixels in the reference frame based on the determined motion vectors. The compensated frame may resemble the target frame at a high level, but may not include all of the details in the target frame. Thus, the encoder system also determines a residual frame that describes the difference between the target frame and the compensated frame. The encoder system compresses the motion vectors and the residual frame for each target frame for transmission to the receiver.
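As a minimal sketch of the compensation and residual steps described above, the following example displaces each block of a reference frame by its motion vector and subtracts the result from the target frame. The helper names, the dictionary layout of the motion vectors, and the choice to leave uncovered pixels at zero are assumptions made for this example, and the subsequent compression of the motion vectors and residual is not shown.

```python
def motion_compensate(ref, motion_vectors, size):
    """Build a compensated frame by copying each reference block to its displaced position.

    motion_vectors maps a block's top-left (row, col) in the reference frame
    to its (dy, dx) displacement. Pixels that no block lands on stay at 0
    (an assumption of this sketch); pixels displaced off the frame are dropped.
    """
    height, width = len(ref), len(ref[0])
    compensated = [[0] * width for _ in range(height)]
    for (by, bx), (dy, dx) in motion_vectors.items():
        for i in range(size):
            for j in range(size):
                y, x = by + dy + i, bx + dx + j
                if 0 <= y < height and 0 <= x < width:
                    compensated[y][x] = ref[by + i][bx + j]
    return compensated


def residual_frame(tgt, compensated):
    """Per-pixel difference between the target frame and the compensated frame."""
    return [[t - c for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(tgt, compensated)]


# Toy 4x4 frames (assumed data): the scene pans one pixel to the right,
# new content (value 9) enters on the left, and one pixel changes by 2.
reference = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
target = [
    [9, 1, 2, 3],
    [9, 5, 8, 7],
    [9, 9, 10, 11],
    [9, 13, 14, 15],
]
# Every 2x2 block shares the same one-pixel shift to the right.
pan_right = {(0, 0): (0, 1), (0, 2): (0, 1), (2, 0): (0, 1), (2, 2): (0, 1)}

compensated = motion_compensate(reference, pan_right, size=2)
residual = residual_frame(target, compensated)
for row in residual:
    print(row)
```

In this example the residual is zero wherever the displaced reference already matches the target, so only the newly revealed column and the single changed pixel carry nonzero values.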
The decoder system at the receiver can repeatedly reconstruct each target frame by applying the motion vectors to a reconstructed reference frame to generate the compensated frame. The residual frame is combined with the compensated frame to generate the reconstructed frame. The reconstructed frame in turn can be used as the reference for the next frame in the video. By encoding the video frame using motion vectors and a residual frame, the encoder system may transmit a significantly smaller number of bits to the receiver compared to encoding the actual pixel data of the target frame from scratch.
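The decoder-side loop described above may be sketched as follows, under stated assumptions. The sketch assumes that the first frame is already available to the decoder by some other means, that each received packet is a pair of uncompressed motion vectors and a residual frame, and that uncovered pixels in the compensated frame are zero. None of the names below come from an actual decoder.

```python
def motion_compensate(ref, motion_vectors, size):
    """Copy each reference block to its displaced position; uncovered pixels stay 0."""
    height, width = len(ref), len(ref[0])
    compensated = [[0] * width for _ in range(height)]
    for (by, bx), (dy, dx) in motion_vectors.items():
        for i in range(size):
            for j in range(size):
                y, x = by + dy + i, bx + dx + j
                if 0 <= y < height and 0 <= x < width:
                    compensated[y][x] = ref[by + i][bx + j]
    return compensated


def reconstruct(ref, motion_vectors, residual, size):
    """Reconstruct one frame: compensated frame plus residual frame."""
    compensated = motion_compensate(ref, motion_vectors, size)
    return [[c + r for c, r in zip(c_row, r_row)]
            for c_row, r_row in zip(compensated, residual)]


def decode_sequence(first_frame, packets, size):
    """Decode frames in order, using each reconstruction as the next reference."""
    frames = [first_frame]
    for motion_vectors, residual in packets:
        frames.append(reconstruct(frames[-1], motion_vectors, residual, size))
    return frames


# Toy 2x4 example (assumed data): two consecutive one-pixel pans to the right.
first_frame = [[1, 2, 3, 4], [5, 6, 7, 8]]
pan_right = {(0, 0): (0, 1), (0, 2): (0, 1)}
packets = [
    (pan_right, [[9, 0, 0, 0], [9, 0, 2, 0]]),  # residual adds a new column and a detail
    (pan_right, [[0, 0, 0, 0], [0, 0, 0, 0]]),  # motion alone describes this frame
]

decoded = decode_sequence(first_frame, packets, size=2)
for frame in decoded:
    print(frame)
```

Because each reconstructed frame serves as the reference for the next, the second packet in this example is applied to the output of the first rather than to the original frame.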
However, P-frame compression can be difficult because representing the target frame in terms of motion vectors and residual frames alone may be too rigid and constraining. For example, some blocks may contain partial occlusions, in which two superimposed objects are each moving in a different direction. Describing the motion of both objects with a single motion vector for the block may be inappropriate, resulting in low reconstruction quality. As another example, while it may be advantageous to encode a frame using reference frames in the distant past, this is computationally infeasible in practice, and typically, reference frames temporally closest to the target frame are used to encode the frame. Moreover, while the relative amount of information spent on motion vectors and the residual frame typically remains relatively constant, it may be advantageous to adjust the relative amount between the two types of information depending on the content of the target frame.