The present invention relates to parallel video compression.
Video data is often compressed in order to efficiently transmit the data or to reduce the storage requirements of the data. The compression process may require considerable processing time. It is often desirable to reduce the compression time or encoding time. Different methods can be used to achieve such reduction. One method is to parallelize the compression process. Parallelizing the compression task may also be used to improve the quality of the compressed video that can be obtained within a given amount of time.
Within the overall compression process, there are various compression steps including, for example, estimation, compensation and encoding. Estimation involves identifying similarities between portions of the same image or different images. Similarities may be identified in the form of similarity vectors, such as motion vectors. Compensation subtracts from a current region the reference region that was determined to be most similar to it, thereby obtaining “residuals.” The reference region may be, for example, a reconstructed reference region (the version of the reference region that will be available to the decoder during decoding). Encoding uses similarity vectors and residuals to construct a video file or video stream. Although directional estimation expressed in terms of estimation vectors is commonly used, other forms of estimation have no associated direction. For example, one form of spatial estimation averages nearby values. In the following description, the term “estimation results” is used to include both estimation vectors and other estimation results having no associated direction.
A time-consuming step in video compression is motion estimation, in which a current region within an image frame is compared to regions in other image frames, which may occur temporally before or after the current image frame. When a good match is found, the data for that region can be compressed by including a reference to the location of the match. The compressed data will include information about differences between the current region and the matched region. Typically, such differences will be in relation to the decompressed, or decoded, version of the matched region; in other instances, the differences may be between the current region and the original matched region. Similarly, a region within an image frame can be compared to other regions within the same image frame. In this case, a good match is found within the same image frame because of recurring spatial image patterns. “Estimation” is used herein to refer to the finding of similar regions within different image frames, or to the finding of similar regions within both different image frame(s) and the same frame.
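The estimation and compensation steps described above can be illustrated with a minimal sketch of exhaustive block matching. The function names, the sum-of-absolute-differences (SAD) cost, the square search window and the tiny synthetic frames are all assumptions chosen for illustration; they are not taken from any particular encoder.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between the block of `cur` at (cx, cy)
    and the block of `ref` at (rx, ry). Frames are lists of rows."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def estimate_block(cur, ref, cx, cy, size, search_range):
    """Exhaustive search over a square window; returns (dx, dy, cost)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall partly outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


def compensate_block(cur, ref, cx, cy, size, dx, dy):
    """Residual: current block minus the matched reference block."""
    return [
        [cur[cy + j][cx + i] - ref[cy + dy + j][cx + dx + i] for i in range(size)]
        for j in range(size)
    ]


# Hypothetical 8x8 reference frame, and a current frame whose content is the
# reference shifted so that cur[y][x] == ref[y + 2][x + 1] where defined.
ref_frame = [[7 * x + 13 * y for x in range(8)] for y in range(8)]
cur_frame = [
    [ref_frame[y + 2][x + 1] if y + 2 < 8 and x + 1 < 8 else 0 for x in range(8)]
    for y in range(8)
]

vector = estimate_block(cur_frame, ref_frame, cx=2, cy=2, size=2, search_range=3)
residual = compensate_block(cur_frame, ref_frame, 2, 2, 2, vector[0], vector[1])
```

In this sketch the search tries every candidate position in the window, which is why the cost of estimation grows quickly with the search range and the number of reference frames.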
Motion estimation and spatial estimation are often very time consuming and computationally intensive and may involve various block matching algorithms (BMAs). For example, motion estimation or spatial estimation may be used to find two or more matches to a region, where an interpolation or other combination of the matches may be used to further improve the image compression. Also by way of example, motion estimation or spatial estimation may find that a current region best matches another region after the other region has been interpolated to generate video samples between the originally existing pixels. In both examples, these algorithms may improve the quality of the compressed data, but the increased complexity of the algorithms requires increased execution time.
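As a rough illustration of the first example, the following sketch compares the matching cost of two individual matches against the cost of their combination. The rounded integer average used here is only one possible way to combine matches, and the sample values and function names are hypothetical.

```python
def block_sad(block_a, block_b):
    """Sum of absolute differences between two flattened blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def combine(match_a, match_b):
    """Rounded integer average of two candidate matches (an assumed combination rule)."""
    return [(a + b + 1) // 2 for a, b in zip(match_a, match_b)]


# Hypothetical flattened 2x2 blocks.
current = [10, 20, 30, 40]
match_a = [8, 18, 28, 38]    # slightly darker than the current block
match_b = [12, 22, 32, 42]   # slightly brighter than the current block

combined = combine(match_a, match_b)
costs = {
    "a": block_sad(current, match_a),
    "b": block_sad(current, match_b),
    "combined": block_sad(current, combined),
}
```

With these values the combined prediction leaves a smaller residual than either match alone, at the price of searching for two matches instead of one.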
Encoding is typically done serially at the frame level since there exists a potential dependency on previously encoded frames that may be used as a reference for the current frame. That is, a block of video data within a frame may be encoded by a reference to a portion of a previous frame, plus the differences between the current block and the decoded version of that reference. This process requires that any reference frames that an encoder may want to use for encoding a current frame be previously processed.
By way of example, consider that for a given encoder, each frame takes X seconds for motion estimation, so that a piece of content having Y frames will take X*Y seconds for the estimation portion. Current motion estimation processes use previously encoded frames as references and thus, at the frame level, estimation is done in series. Due to this causality, compression algorithms may attempt to reduce the execution time of the motion estimation component by reducing the number of trial matches within each frame, using a “best guess” approach and not necessarily a best match approach. This kind of solution is not readily scalable in terms of encoding time, and potentially results in reduced encoding efficiency in the encoded result. Encoding efficiency may be defined in terms of execution time, subjective quality, objective quality, bitrate, overall file size, etc. This same observation applies to subframes as well, where a subframe is defined as a subset of a frame.
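The arithmetic in this example can be worked through with a short sketch. The serial figure follows directly from the X*Y relationship stated above. The second function is a purely hypothetical, idealized bound: it assumes that frames could be estimated independently across a number of workers, which the frame-level dependency described above does not permit in the serial processes being discussed. The numeric values are invented for illustration.

```python
import math


def serial_estimation_seconds(x_seconds_per_frame, y_frames):
    """Total estimation time when frames are processed one after another."""
    return x_seconds_per_frame * y_frames


def idealized_parallel_seconds(x_seconds_per_frame, y_frames, workers):
    """Hypothetical lower bound if frames could be split evenly across
    `workers` with no dependencies and no communication overhead."""
    return x_seconds_per_frame * math.ceil(y_frames / workers)


# Assumed values: X = 2 seconds per frame, Y = 900 frames.
serial = serial_estimation_seconds(2, 900)
ideal_with_4 = idealized_parallel_seconds(2, 900, workers=4)
```

Under these assumed values, the serial estimation portion takes 1800 seconds, against an idealized 450 seconds with four workers.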
Compared to motion estimation, spatial estimation within a frame has a smaller set of potential candidates, all of which are within the frame itself. Spatial estimation is also considered to be computationally intensive, but not as much as motion estimation. The challenges described in the previous paragraph also apply to spatial estimation.
Following estimation and compensation, encoding takes place. In MPEG video compression, three types of coding frames are commonly used. I-frames are “intra-coded”, meaning that they refer only to themselves and do not rely on information from other video frames. P-frames are “predictive frames.” I-frames and P-frames are called anchor frames, because they are used as references in the coding of other frames using motion compensation. The first P-frame after an I-frame uses that I-frame as a reference. Subsequent P-frames use the previous P-frame as a reference. Additionally, B-frames (bi-directional frames) are coded using the previous anchor (I or P) frame as a reference for forward prediction, and the following I- or P-frame for backward prediction. An important point to note is that any errors in reference frames can propagate to frames which use the reference frames. For example, errors in a given P-frame can propagate to later P-frames and B-frames. A group of pictures (GOP) may consist, for example, of a series of frames such as: I-B-B-B-P-B-B-B-P-B-B-B-P. A GOP in which no prediction refers to a frame outside the GOP is considered a closed GOP. Conversely, if any prediction refers to a frame outside the GOP, it is considered an open GOP.
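The reference rules in this paragraph can be sketched as a small dependency map over the example GOP. The sketch assumes frames are listed in display order and follows only the simple scheme described here (each P-frame references the previous anchor, each B-frame references the previous and following anchors). The function names are hypothetical, and a value of None marks a reference that would have to come from outside the GOP.

```python
def reference_map(gop):
    """Map each frame index to the anchor frame indices it references.
    `gop` is a string of frame types in display order, e.g. "IBBBP"."""
    anchors = [i for i, kind in enumerate(gop) if kind in "IP"]
    refs = {}
    for i, kind in enumerate(gop):
        prev_anchor = max((a for a in anchors if a < i), default=None)
        next_anchor = min((a for a in anchors if a > i), default=None)
        if kind == "I":
            refs[i] = []
        elif kind == "P":
            refs[i] = [prev_anchor]
        else:  # B-frame: forward and backward prediction
            refs[i] = [prev_anchor, next_anchor]
    return refs


def is_closed(refs):
    """Closed GOP: no frame needs a reference from outside the GOP."""
    return all(r is not None for frame_refs in refs.values() for r in frame_refs)


def affected_by(refs, source):
    """Frames that depend, directly or indirectly, on frame `source`."""
    affected = set()
    frontier = [source]
    while frontier:
        k = frontier.pop()
        for i, frame_refs in refs.items():
            if k in frame_refs and i not in affected:
                affected.add(i)
                frontier.append(i)
    return sorted(affected)


refs = reference_map("IBBBPBBBPBBBP")
closed = is_closed(refs)
open_example = is_closed(reference_map("IBBBPBB"))  # trailing B-frames lack a following anchor
hit_by_frame_8 = affected_by(refs, 8)
```

In this sketch an error in the P-frame at index 8 reaches the B-frames on either side of it as well as every later frame in the GOP, which illustrates the propagation noted above.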
By using multiple processors concurrently to perform motion estimation, the video compression processing time can be reduced. However, previous attempts to improve processing speeds by performing parallel motion estimation have had undesirable limitations. In particular, parallel video encoding arrangements are known in which each of multiple slaves receives a frame to be encoded together with its reference frames, where the references are the raw (original) frames. Upon receipt, each slave performs motion encoding (including both estimation and compensation) of the frame using the raw reference frames, and then returns the encoded frame to a master. For example, Jeyakumar and Sundaravadivelu (Proceedings of the 2008 International Conference on Computing, Communication and Networking, “Implementation of Parallel Motion Estimation Model for Video Sequence Compression”) describe one such approach to parallelizing video sequence compression by using multiple processors. Nang and Kim (1997 IEEE, “An Effective Parallelizing Scheme of MPEG-1 Video Encoding on Ethernet-Connected Workstations”) propose a similar method of parallelizing video compression.
In the foregoing arrangements, each slave performs the compensation based on the raw reference frames and not the reconstructed reference frames. For lossy compression, since the decoder does not have access to the raw frames, the decoder will use reconstructed frames as references during reconstruction (using the residuals). The result is a reconstruction mismatch between the encoder and decoder. This type of mismatch will propagate errors further in the decoder to other frames, continually using mismatched results as references until the next keyframe. This mismatch can become problematic where the GOP size is large or when more aggressive lossy compression is applied.
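The mismatch can be made concrete with a toy sketch. It rests on several simplifying assumptions that are not drawn from the cited arrangements: each frame is a short list of integer samples, every motion vector is zero (each sample is predicted from the co-located sample of the previous frame), the keyframe is transmitted exactly, and lossy compression is modeled as a uniform quantizer applied to the residuals. All names and values are hypothetical.

```python
def quantize(value, step):
    """Uniform integer quantizer: round to the nearest multiple of `step`."""
    return step * ((value + step // 2) // step)


def encode_with_raw_references(frames, step):
    """Residuals computed against the RAW previous frame."""
    return [
        [quantize(c - p, step) for c, p in zip(cur, prev)]
        for prev, cur in zip(frames, frames[1:])
    ]


def encode_with_reconstructed_references(frames, step):
    """Residuals computed against the frame the decoder will actually hold."""
    rec_prev = list(frames[0])
    residuals = []
    for cur in frames[1:]:
        res = [quantize(c - p, step) for c, p in zip(cur, rec_prev)]
        residuals.append(res)
        rec_prev = [p + r for p, r in zip(rec_prev, res)]
    return residuals


def decode(keyframe, residuals):
    """The decoder only ever has its own reconstructed frames as references."""
    rec = [list(keyframe)]
    for res in residuals:
        rec.append([p + r for p, r in zip(rec[-1], res)])
    return rec


def max_abs_error(frames, recon):
    """Largest per-sample reconstruction error in each frame."""
    return [max(abs(a - b) for a, b in zip(f, r)) for f, r in zip(frames, recon)]


# Six hypothetical two-sample frames with a slow, steady change.
frames = [[100 + 3 * n, 50 - 3 * n] for n in range(6)]
step = 4

raw_errors = max_abs_error(
    frames, decode(frames[0], encode_with_raw_references(frames, step))
)
rec_errors = max_abs_error(
    frames, decode(frames[0], encode_with_reconstructed_references(frames, step))
)
```

In this toy case the error with raw references grows with every frame after the keyframe, while the error with reconstructed references stays within half a quantization step, which is consistent with the observation that the problem worsens with larger GOPs and coarser quantization.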