The technology described herein relates to methods of and apparatus for encoding video data, and in particular to differential encoding techniques.
As is known in the art, differential encoding involves comparing portions of data with one another and using information relating to the differences between the portions of data rather than the entire data portions themselves to represent the “original” data. This has the advantage that a smaller volume of data is required to encode a given amount of original data, which can be important where, for example, the data transmission capacity is restricted.
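The following is a minimal sketch of the general principle, not of any particular encoder. It assumes the data portions are equal-length lists of integers and uses hypothetical helper names (`encode_differences`, `decode_differences`). One portion is stored as (position, difference) pairs relative to another, so that only the positions that differ need to be recorded.

```python
from typing import List, Tuple

def encode_differences(reference: List[int], current: List[int]) -> List[Tuple[int, int]]:
    """Represent `current` as (index, delta) pairs relative to `reference`.

    Only positions where the two data portions differ are recorded, so when
    the portions are similar the encoded form is much smaller than `current`.
    """
    if len(reference) != len(current):
        raise ValueError("data portions must have the same length")
    return [(i, c - r) for i, (r, c) in enumerate(zip(reference, current)) if c != r]

def decode_differences(reference: List[int], diffs: List[Tuple[int, int]]) -> List[int]:
    """Rebuild the original data from the reference portion and the differences."""
    result = list(reference)
    for i, delta in diffs:
        result[i] += delta
    return result
```

For example, with reference `[10, 10, 10, 10, 12, 12]` and current `[10, 10, 11, 10, 12, 15]`, only two pairs, `(2, 1)` and `(5, 3)`, need to be kept, and the reference plus those pairs reproduces the current data exactly.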
Such differential encoding techniques are particularly suitable for the compression of (digital) video data because, although there may be 25 to 30 video frames per second, within a given scene in a video sequence each frame will typically be very similar to the adjacent frames, with the differences often being due only to “objects” in the frames moving to different positions. This means that much of the video data necessary to reproduce successive frames in a video sequence is substantially identical from one frame to the next.
The MPEG video compression standards and other related algorithms, for example, therefore use differential encoding to compress video data, e.g. for transmission or storage purposes.
Generally, in differentially encoded video data each video frame is divided into a plurality of blocks (16×16 pixel blocks in the case of MPEG encoding) and each block of the frame is encoded individually. Three types of data “block” are usually used (e.g. stored or transmitted). These are commonly referred to as INTRA (I) blocks, INTER (P) blocks and bi-directionally predicted (B) blocks.
INTRA (I) blocks are coded frame blocks that are predicted from the same frame only, i.e. are not dependent on previous (or future) frame blocks. INTER (P) blocks and bi-directionally predicted (B) blocks are differentially coded frame blocks that describe the differences between the “current” block and a “prediction” frame block created from video data in frames before the current frame, and, in the case of B blocks, also video data in frames generated after the current frame. The “prediction” frame block that the differences encoded in P and B blocks are referenced to could, for example, simply comprise a preceding I frame block, or could be a more complex frame block predicted, e.g., from an I block and one or more preceding P blocks.
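As an illustration of how a B block's prediction might draw on both a preceding and a following frame, the sketch below averages a forward and a backward prediction and then forms the differences that would be encoded. The rounding rule `(a + b + 1) // 2` and the function names are assumptions for illustration; the text above does not specify how the two predictions are combined.

```python
from typing import List

Block = List[List[int]]

def bidirectional_prediction(forward: Block, backward: Block) -> Block:
    """Combine a forward (past-frame) and backward (future-frame) prediction.

    Assumed rule: per-pixel average with halves rounded upwards.
    """
    return [
        [(a + b + 1) // 2 for a, b in zip(row_f, row_b)]
        for row_f, row_b in zip(forward, backward)
    ]

def residual(current: Block, prediction: Block) -> Block:
    """Differences a P or B block would describe relative to its prediction."""
    return [
        [c - p for c, p in zip(row_c, row_p)]
        for row_c, row_p in zip(current, prediction)
    ]
```

A P block would use `residual` against a single (forward) prediction; a B block would use it against the combined prediction.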
Because, in such arrangements, P and B blocks contain only data relating to differences between blocks in frames in the original video data, they are considerably smaller than I blocks, and so the overall amount of data that must be transmitted or stored can be reduced by using P and/or B blocks to encode the data. (However, I blocks must still be stored or transmitted at intervals to allow the complete original data to be reconstructed.)
As is known in the art, an important aspect of such differential encoding of video data is identifying which areas of the video frames being compared are most similar to each other (such that there is then a reduced or minimum number of differences to be encoded). This process is complicated by the fact that, typically, the area of the “prediction” (reference) frame that most closely matches a given block or area in the current frame will not be in the same position within the reference frame as that area is in the current frame. This is because the most closely matching areas in the video frames will tend to move between frames, as objects in the video sequence move around.
Differential encoding of video data typically therefore involves two aspects: firstly identifying the location in a “reference” video frame of the area in that frame that most closely matches the area (block) of the video frame currently being encoded, and then determining the differences between the two areas in the two frames (i.e. the current and the reference frame).
The encoded data accordingly usually comprises a vector value pointing to the area of a given reference frame to be used to construct the appropriate area (block) of the frame currently being constructed, and data describing the differences between the two areas. This thereby allows the video data for the area of the frame currently being constructed to be constructed from video data describing the area in the reference frame pointed to by the vector value and the difference data describing the differences between that area and the area of the video frame currently being constructed.
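The decoder-side reconstruction described above can be sketched as follows, assuming frames are lists of rows of integer sample values, the vector is given as a (row, column) offset `(dy, dx)`, and the difference data has already been decoded into per-pixel values (current minus reference). The function name and argument layout are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]

def reconstruct_block(reference: Frame, top: int, left: int,
                      vector: Tuple[int, int], diff: Frame) -> Frame:
    """Rebuild a block of the current frame from a reference frame.

    `top`, `left` locate the block in the current frame; `vector` = (dy, dx)
    points to the matching area in the reference frame; `diff` holds the
    per-pixel differences between the current block and that area.
    """
    dy, dx = vector
    size_y, size_x = len(diff), len(diff[0])
    return [
        [reference[top + dy + y][left + dx + x] + diff[y][x] for x in range(size_x)]
        for y in range(size_y)
    ]
```

For example, with a 4×4 reference whose sample at row r, column c is `10 * r + c`, a 2×2 block at (0, 0) with vector `(1, 2)` and differences `[[0, 1], [0, 0]]` is rebuilt as `[[12, 14], [22, 23]]`.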
The process of identifying which areas in different video frames most (or sufficiently) closely match and accordingly determining the vector to be stored to point to the relevant area in the reference video frame is usually referred to as “motion estimation”. This process is usually carried out by comparing video data values (usually luminance values) for each pixel in a given area or block (typically a 16×16 pixel block in MPEG systems) of the video frame currently being encoded with a succession of corresponding-sized pixel blocks in the reference video frame until the closest (or a sufficiently close) match in terms of the relevant video data values is found. The vector pointing to the so-identified pixel block in the reference frame is then recorded and included in the encoded data stream. The relative closeness or match between relevant video data for the pixel blocks being compared is assessed using difference comparison or cost functions, such as a mean-squared difference (MSD) function.
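A minimal sketch of exhaustive ("full search") block matching with an MSD cost is shown below. It assumes the current and reference frames have the same dimensions, that candidate blocks must lie entirely inside the reference frame, and that `search_range` limits the offset on each axis; the function names are hypothetical, and practical encoders typically use faster search strategies than this brute-force loop.

```python
from typing import List, Tuple

Frame = List[List[int]]

def msd(cur: Frame, ref: Frame, top: int, left: int,
        ry: int, rx: int, n: int) -> float:
    """Mean-squared difference between the n x n block of `cur` at (top, left)
    and the n x n block of `ref` at (ry, rx)."""
    total = 0
    for y in range(n):
        for x in range(n):
            d = cur[top + y][left + x] - ref[ry + y][rx + x]
            total += d * d
    return total / (n * n)

def full_search(cur: Frame, ref: Frame, top: int, left: int,
                n: int, search_range: int) -> Tuple[Tuple[int, int], float]:
    """Test every offset within +/- search_range pixels and return the motion
    vector (dy, dx) with the lowest MSD, preferring (0, 0) on ties."""
    height, width = len(ref), len(ref[0])
    best_vec = (0, 0)
    best_cost = msd(cur, ref, top, left, top, left, n)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ry, rx = top + dy, left + dx
            if ry < 0 or rx < 0 or ry + n > height or rx + n > width:
                continue  # candidate block would fall outside the reference frame
            cost = msd(cur, ref, top, left, ry, rx, n)
            if cost < best_cost:
                best_vec, best_cost = (dy, dx), cost
    return best_vec, best_cost
```

If the current frame is the reference frame shifted so that each current block's content sits one row down and two columns right in the reference, the search returns the vector `(1, 2)` with zero cost.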
However, because they require a comparison between a large number of pixel video data values (e.g. 256 pixel values for each candidate position where 16×16 pixel blocks are being tested), such “motion estimation” processes are computationally intensive, even if the range of the search over the reference frame (i.e. the region of the reference frame over which the search for the closest matching frame area is carried out) is deliberately limited.
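The scale of the cost can be estimated with a short calculation, assuming an exhaustive search over every integer offset within ±R pixels on each axis and ignoring candidates clipped at the frame edges. The search range used in the example is an assumption chosen for illustration.

```python
def pixel_comparisons_per_block(block_size: int, search_range: int) -> int:
    """Pixel differences evaluated by an exhaustive search for one block.

    Every offset in [-search_range, +search_range] on both axes is a candidate,
    and each candidate compares block_size * block_size pixels.
    """
    candidates = (2 * search_range + 1) ** 2
    return candidates * block_size * block_size
```

With 16×16 blocks and an assumed range of ±16 pixels there are 33 × 33 = 1,089 candidate positions, giving 1,089 × 256 = 278,784 pixel comparisons for a single block. A 720×576 frame contains 45 × 36 = 1,620 such blocks, or roughly 4.5 × 10^8 comparisons per frame before any frame-rate multiplier. Halving the range to ±7 still leaves 225 × 256 = 57,600 comparisons per block.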
The data that describes the differences between the two areas in the two frames (i.e. the current and the reference frame) is typically represented in the encoded data stream as a set of frequency coefficients. The process of calculating the appropriate frequency coefficients typically involves performing a frequency transform operation, such as a discrete cosine transform. Again, this operation can be computationally expensive.
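To illustrate the transform step, the sketch below computes an orthonormal two-dimensional DCT-II of a square block of difference values directly from its definition. This direct form costs O(N^4) operations per block and is shown only to make the computation concrete; real encoders use fast factorised transforms. The block size is left as a parameter (8×8 is the transform size used in MPEG-1 and MPEG-2), and the function name is hypothetical.

```python
import math
from typing import List

Block = List[List[float]]

def dct2(block: Block) -> Block:
    """Orthonormal 2-D DCT-II of an N x N block, computed from the definition.

    coeffs[u][v] is the coefficient for vertical frequency u and horizontal
    frequency v; coeffs[0][0] is the DC (average) term.
    """
    n = len(block)

    def alpha(k: int) -> float:
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for y in range(n):
                for x in range(n):
                    s += (block[y][x]
                          * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = alpha(u) * alpha(v) * s
    return coeffs
```

A flat 8×8 block of ones produces a single non-zero coefficient, the DC term, equal to 8. This concentration of energy into a few coefficients is what makes the frequency representation of small differences compact.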
These computationally expensive operations can be disadvantageous in general, but are particularly so where the processing power of the encoding system is limited. This may be the case, for example, where it is desired to encode “real time” video data using a mobile device, which may have limited processing capacity.
The Applicants believe therefore that there remains scope for improvements to differential encoding techniques.