This invention relates to flowfield-based motion compensation for video compression of moving images.
Macroblock Motion Compensation
Block-based motion compensation began with MPEG-1, and has continued into MPEG-2, MPEG-4_part2, MPEG-4_AVC_part10, H.264, and SMPTE VC1 (also known as “Windows Media Player”, a Microsoft product).
The general principle used by these conventional motion compensation systems is that there is one motion vector for a square or rectangular group of pixels. The group of pixels associated with the motion vector is generally called a “macroblock”. It is common in MPEG and other compression systems to use reduced resolution for Blue-Y (U) and Red-Y (V) compared to Y (luminance), where Y usually equals approximately 0.59Green+0.29Red+0.12Blue. Intra-coded frames and the motion-compensated-difference frames use a common quantized Discrete Cosine Transform (DCT). The macroblock structure provides a common boundary for the Y, U, and V DCT-coded regions. The DCT size is usually 8×8 pixels, but may be anywhere from 4×4 up to 16×16 (e.g., in MPEG-4AVC_part10 and H.264). The nature of the DCT is that it is a self-contained regional transform which does not extend to any pixel outside its block (usually 8×8). In a sense, quantization errors in the DCT wrap from one edge to the other (i.e., left to right, and top to bottom) of the DCT block. Differing quantization errors in adjacent DCT blocks yield block edge discontinuities. Thus, since macroblock boundaries coincide with DCT block boundaries, they share a common edge. When a motion vector differs between adjacent macroblocks, the inherent edge discontinuity does not appear within the DCT block, but rather at its edge. Since the DCT block edge “wraps around” to see its opposite edges, the DCT is quite tolerant of the edges inherent in macroblock-based motion compensation.
However, for non-block-based transforms, such as DWT9/7 bi-orthogonal subband wavelets (e.g., as used by JPEG-2000), any macroblock edges due to motion compensation will appear as sharp edges. Such sharp edges will have significant large coefficients at all spatial frequencies, thus being inefficient for quantized transform compression.
Overlapped Block Motion Compensation
One way that motion-vector-displaced macroblock edge discontinuities have been reduced is to use Overlapped Block Motion Compensation (OBMC) as described in MPEG-4_part2. OBMC overlaps the edges of the macroblocks by a weighted blend, usually using a straight line weighting (linearly varying proportion of the non-linear pixels). It is easily seen that a weighted blend of two adjacent macroblocks will have two displaced copies of the image superimposed, similar to “double vision”. At the corners of the macroblock, four adjacent macroblocks are weighted together and superimposed.
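The “double vision” effect of a straight-line weighted blend can be illustrated in one dimension. The following is a minimal sketch, not the MPEG-4_part2 weighting window: the function name `obmc_blend_1d`, the ramp shape, and the eight-pixel overlap are assumptions chosen for illustration.

```python
def obmc_blend_1d(pred_left, pred_right):
    """Blend two overlapping predictions with a straight-line weighting.

    pred_left and pred_right are equal-length lists of pixel values that
    two neighboring macroblocks predict for the same overlap region.
    The weight of pred_left ramps down linearly from left to right.
    """
    n = len(pred_left)
    if n != len(pred_right):
        raise ValueError("overlap regions must have the same length")
    blended = []
    for i in range(n):
        w_right = (i + 0.5) / n      # weight of the right block's prediction
        w_left = 1.0 - w_right       # the two weights always sum to 1
        blended.append(w_left * pred_left[i] + w_right * pred_right[i])
    return blended


# The same sharp edge, predicted at two positions by differing motion vectors.
left_prediction = [0, 0, 100, 100, 100, 100, 100, 100]
right_prediction = [0, 0, 0, 0, 0, 100, 100, 100]
print(obmc_blend_1d(left_prediction, right_prediction))
```

In this example neither prediction's sharp edge survives the blend. The pixels between the two edge positions take intermediate values, which is the superimposition described above.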
While OBMC smoothes the discontinuity due to motion compensation at macroblock edges, the displaced superimposition is clearly not optimal. No sharp image details can be correctly produced at macroblock edges using OBMC when motion vectors are not exactly the same (which they usually are not). Thus, image details are only preserved clearly in the center of macroblocks when using OBMC at their edges.
Restriction to Block Displacement
MPEG-style macroblock motion compensation is a pure block displacement. There is no provision for image zooming, rotation, warping, or changing shape due to motion (like a person running). Block displacement cannot account for natural visual effects like atmospheric distortion on telephoto moving images which change size and position locally. Not only is block displacement prone to inaccurate reproduction of many types of motion, but the algorithms which are used to determine the displacement, such as the Sum of Absolute Difference (SAD), often fail, resulting in wildly disparate motion vectors which point to parts of the image having no relationship to the intended macroblock.
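A minimum-SAD block match can be sketched as follows. This is an illustrative exhaustive search over whole-pixel displacements only. The function names, the frame layout (lists of rows), and the test pattern are assumptions, and no particular encoder's search is implied.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block in cur and a displaced block in ref."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def best_vector(cur, ref, bx, by, size, search):
    """Test every displacement within +/-search pixels; return (cost, dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip displacements that would read outside the reference frame.
            if not (0 <= bx + dx and bx + dx + size <= width):
                continue
            if not (0 <= by + dy and by + dy + size <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Synthetic reference frame, and a current frame whose content is the
# reference displaced by 2 pixels horizontally and 1 pixel vertically.
ref = [[x * x + 3 * y * y + x * y for x in range(12)] for y in range(12)]
cur = [[ref[min(y + 1, 11)][min(x + 2, 11)] for x in range(12)] for y in range(12)]
print(best_vector(cur, ref, bx=4, by=4, size=4, search=3))
```

The search returns only the displacement with the lowest cost. It carries no indication of whether that lowest cost reflects a true match, which is one way the disparate vectors described above can arise.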
B Frame Pixel Weighting
In MPEG-2 and MPEG-4 part 2, an equal macroblock pixel weighting of the previous and subsequent reference frames is used in B frames (one of three types of frames in those systems; see further discussion in next section). U.S. Pat. No. 6,816,552, “Interpolation of Video Compression Frames” by the present inventor, introduced the idea of alternative weighted proportional blends, such as weighting by the frame distance (e.g., ⅓ if there are two intervening B frames), or by a blend of equal weight (½) and the frame distance (e.g., ⅓) to yield something in between (e.g., ⅜). This latter blend is beneficial since the equal weight has the benefit of lower noise, but the frame distance is often a better prediction in the absence of noise.
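The weighting arithmetic can be worked through with exact fractions. This sketch is illustrative only: the function `b_frame_weights` and its `mix` parameter are hypothetical names, with `mix` selecting a point between the equal weighting and the frame-distance weighting.

```python
from fractions import Fraction


def b_frame_weights(position, m, mix=Fraction(1, 2)):
    """Return (weight_previous, weight_subsequent) for one B frame.

    position: 1-based index of the B frame after the previous reference frame.
    m: spacing between reference frames (m - 1 intervening B frames).
    mix: 0 gives pure equal weighting, 1 gives pure frame-distance weighting.
    """
    if not 0 < position < m:
        raise ValueError("position must lie strictly between the reference frames")
    # Frame-distance weighting: the nearer reference frame gets the larger weight.
    distance_subsequent = Fraction(position, m)
    equal = Fraction(1, 2)
    w_subsequent = (1 - mix) * equal + mix * distance_subsequent
    return 1 - w_subsequent, w_subsequent


# First of two intervening B frames (M=3).
print(b_frame_weights(1, 3, mix=Fraction(1)))     # pure frame distance
print(b_frame_weights(1, 3, mix=Fraction(3, 4)))  # blend of 1/2 and 1/3
```

With two intervening B frames, a `mix` of ¾ gives the subsequent reference frame a weight of ⅜, which lies between the frame-distance weight of ⅓ and the equal weight of ½.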
B Frame Direct Mode
In MPEG-4 Part 2 video compression, B frame macroblocks have four motion compensation displacement modes. They may be coded from a previous reference frame (forward), from a subsequent reference frame (backward), from the average of the previous and subsequent reference frames (bi-directional, with two motion vectors), or via “direct” mode. In “direct” mode, the subsequent reference frame is a P frame wherein the macroblock at the same location has either 1 or 4 vectors, with one vector for 16×16 or 8×8 pixels, respectively, within a 16×16 macroblock. A single delta for this B frame's macroblock is added to the 1 or 4 motion vectors from the subsequent P frame macroblock. The vectors are then scaled by the frame distance and applied bi-directionally (with equal weight forward and backward). A confusing issue concerning direct mode is that the motion vector from the subsequent P frame's macroblock points to a motion vector delta, such that this macroblock would be displaced in the current B frame time, and not at the same location as the current macroblock in the B frame to which it is being applied. The combination of the delta in direct mode, plus direct mode being only one of four coding mode choices for B frame macroblock motion compensation, allows this problem to be avoided in cases where this displacement in reference might prove a hindrance. Direct mode has proven statistically beneficial, especially when used with the subsequent 8×8 P frame mode (thus applying a single delta to four 8×8 motion vectors to yield both four forward and four backward motion vectors).
Note that OBMC is not applied to any B frame mode. OBMC is only applied to P frame macroblocks (in 16×16, or 8×8 modes, or 16×8 mode for interlace).
Adaptation to Layering
In U.S. Pat. No. 6,728,317, “Moving Image Compression Quality Enhancement Using Displacement Filters with Negative Lobes” by the present inventor, a base layer using macroblock motion-compensation and DCT is extended in resolution by a spatial resolution enhancement layer. The enhancement layer uses the base layer motion vectors as “guide vectors” for delta vectors for macroblock displacement of a resolution enhancement delta layer. The motion compensated delta layer difference from the current resolution delta layer is coded using either DCT or wavelet transforms. This system provides not only resolution layering but also efficiency improvement, since the lower resolution layer can be given a proportionally higher number of bits per pixel compared to the resolution enhancing layer(s). Every layer's macroblock edges correspond to DCT edges (if all layers use the DCT), which are somewhat tolerant of macroblock discontinuities. A similar use of base and enhancement layers is described in U.S. Pat. No. 6,510,177, “System and Method for Layered Video Coding Enhancement” by Jeremy de Bonet and Gary J. Sullivan.
Use of macroblock displacement as guide vectors and higher layer motion vector deltas is inefficient for non-DCT transform systems, such as wavelets, since block edge discontinuities are present at all resolution layers.
Hierarchical Search for Motion Vectors
It has long been a common practice to search for motion vectors for macroblock motion compensation using a hierarchy of one or more reduced resolution images. Although the reference software implementations perform an exhaustive search for minimum SAD (Sum of Absolute Difference) at full resolution, many practical hardware and software implementations use a guided search over a restricted region of the image (such as searching within a small range limit from the adjacent motion vectors). Further, many practical hardware and software implementations filter to a reduced resolution prior to searching, reducing the number of required computations. Since a common macroblock size is 16×16 in Y (luminance), and 8×8 in U and V (e.g., in MPEG-2), a reduced resolution of half would still have 8×8 in Y (and 4×4 in U and V, although some matching algorithms only match Y and unwisely ignore U and V). Similarly, a reduced resolution for motion block matching of one quarter would have 4×4 in Y (and 2×2 in U and V, if present).
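A two-level version of such a hierarchical search can be sketched as follows. The sketch matches Y only, uses a simple 2×2 box filter for the reduced resolution, and refines by ±1 pixel at full resolution. These choices, and all function names, are assumptions made for illustration rather than a description of any particular implementation.

```python
def downsample(img):
    """Halve resolution by averaging 2x2 pixel groups (a simple box filter)."""
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1] +
              img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(len(img[0]) // 2)]
            for y in range(len(img) // 2)]


def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences for one candidate displacement."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(size) for x in range(size))


def search(cur, ref, bx, by, size, center, radius):
    """Best (dx, dy) within +/-radius of center, ignoring out-of-frame candidates."""
    best_cost, best_vec = None, center
    for dy in range(center[1] - radius, center[1] + radius + 1):
        for dx in range(center[0] - radius, center[0] + radius + 1):
            if not (0 <= bx + dx <= len(ref[0]) - size and
                    0 <= by + dy <= len(ref) - size):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best_cost is None or cost < best_cost:
                best_cost, best_vec = cost, (dx, dy)
    return best_vec


def hierarchical_search(cur, ref, bx, by, size=16, coarse_radius=4):
    """Search at half resolution, then refine by +/-1 pixel at full resolution."""
    coarse = search(downsample(cur), downsample(ref),
                    bx // 2, by // 2, size // 2, (0, 0), coarse_radius)
    guess = (2 * coarse[0], 2 * coarse[1])   # scale the vector back up
    return search(cur, ref, bx, by, size, guess, 1)


# Synthetic frames: cur shows the content of ref displaced by (6, -4).
ref = [[x * x + 3 * y * y + x * y for x in range(48)] for y in range(48)]
cur = [[ref[min(max(y - 4, 0), 47)][min(max(x + 6, 0), 47)]
        for x in range(48)] for y in range(48)]
print(hierarchical_search(cur, ref, bx=16, by=16))
```

With the defaults shown, the coarse stage tests at most 81 candidates on 8×8 blocks and the refinement tests at most 9 candidates on 16×16 blocks, instead of testing every full-resolution displacement over the same reach.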
Motion-Compensated Frame Rate Conversion and Noise Reduction
U.S. Pat. No. 6,442,203, “System and Method for Motion Compensation and Frame Rate Conversion” by the present inventor, described the use of per-pixel motion compensation (as opposed to block-based) for the purpose of noise reduction or frame rate conversion.
For noise reduction, one or more frames previous and subsequent to the current frame are examined for displacement and used to reduce per-frame noise. When applied to noise reduction, the motion compensation system was used as a separate preprocessor for compression, or as a moving image improvement system (by reducing noise). The motion compensation was not applied to use within compression coding.
When applied to frame-rate-conversion, one or more previous and subsequent frames are examined by the system and motion vectors are interpolated to the new frame time.
A confidence value may be used to determine how much to rely on the motion-vector-displaced pixels versus using a simple blend of adjacent frames. If there is a low confidence value, then the current pixel of the current frame can be used with no noise reduction. If confidence match for a pixel for one or more frames is reasonably high, then those frames can be used to reduce the noise of that pixel. Similar rules may be applied with frame rate conversion. If the confidence value for a given pixel is low, then a proportional blend of adjacent frames can be used for that pixel. If the confidence value is high with respect to one or more nearby frames (previous and/or subsequent), then the motion vector(s) can be used to create a new pixel in a new location for the new frame time.
The projection of pixels from one or more nearby previous and/or subsequent frames onto a new frame time using motion vectors may not completely cover the image area, in which case the proportional blend of adjacent frames is used to fill in incomplete image areas. Further, even if the projection of pixels does cover a region of the new frame, the confidence may be low, and thus the proportional blend is again selected. Using the motion vectors and confidence, a new frame is created for noise reduction or frame rate conversion. Every processing step using the confidence and motion vectors from adjacent frames, as well as deciding not to use them, is potentially visible in some region of each frame. Thus, for these purposes, the very best motion matching must be performed, down to one or more motion vectors for each pixel, determined independently for each previous and subsequent frame referenced, with independent computation of confidence values. The process is therefore computationally intensive.
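The per-pixel choice between a motion-projected value and a proportional blend, as used for frame rate conversion, might be sketched as follows. The hard `threshold` cutoff, the 0..1 confidence scale, and the function name are assumptions for illustration. The referenced patent may weigh confidence differently.

```python
def interpolate_pixel(prev_value, next_value, t, projected=None,
                      confidence=0.0, threshold=0.5):
    """Choose one output pixel for a new frame at fractional time t (0..1).

    projected: motion-compensated value, or None when no projected pixel
               covers this location in the new frame.
    confidence: match confidence for the projected value, in 0..1.
    threshold: hypothetical cutoff below which the projection is not trusted.
    """
    if projected is not None and confidence >= threshold:
        return projected
    # Fallback: proportional blend of the adjacent frames.
    return (1.0 - t) * prev_value + t * next_value


# New frame one quarter of the way from the previous to the subsequent frame.
print(interpolate_pixel(100, 200, 0.25, projected=180, confidence=0.9))
print(interpolate_pixel(100, 200, 0.25, projected=180, confidence=0.2))
print(interpolate_pixel(100, 200, 0.25))   # no projected pixel covers this location
```

The second and third calls both fall back to the proportional blend, the former because the confidence is low and the latter because the projection left a hole.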
Inherent Imperfection of Frame-Based Motion Matching
The lack of a match, and a correspondingly low level of confidence, is a common occurrence in moving images. Each image is a frame with an open-shutter and closed-shutter duration. Such a square-wave temporal sample is theoretically suboptimal and will be prone to aliasing and other artifacts. During the shutter open time, moving objects will smear, known as “motion blur”. During the shutter closed time, no image observation occurs, allowing some object motions to be hidden between frames (such as a light which flashes on only when the shutter is closed).
Objects which obscure each other (such as a person walking behind a post, or entering or exiting a door) will not have a corresponding match between adjacent frames. There will be aspects of the object observable in one frame that were not present in a previous and/or subsequent frame. The frame edges during a pan or tilt also inherently reveal or remove portions of the image.
There are also many types of moving image which do not exhibit much similarity from one frame to the next, such as ocean waves or blowing leaves.
Image fade ups or fade outs, and scene cross-dissolves, also tend to gradually obscure or reveal scene changes, sometimes yielding a poor match. This is especially true when exclusively using a minimum SAD for matching, since that algorithm is not effective at handling fades and dissolves.
There is also the practical limitation that most motion compensation searches are limited to a small fraction of a frame. For example, a typical search range is between 5% and 15% of the frame's width and height. A moving ball in the sports of tennis, soccer, baseball, or football is likely to move further than this within a frame time, such that no motion match would be found within the preset limited search range. When B frames are placed between I and/or P frames, the distance between the P frame and previous reference P or I frames is often further increased by the number of frames. For example, with M=3 such that there are 2 intervening B frames, an object will travel 3 frames in time to the next P frame. Thus, there are many cases where practical search range limitations result in an inability to match macroblocks.
With digital sensors, such as CCD and CMOS imagers, it is common to have some degree of sensor sample aliasing. This is due to the deviation of practical square pixel sensor sites from more optimal sinc-like (sine(x)/x) or Gaussian-like filters. An optical filter (usually a bi-refringent quarter-wave layer) helps limit spatial frequencies to the pixel spacing, but the net filter is still imperfect to some degree. Motion of detailed image regions will therefore alias, which is a confounding factor (a “confound”) to motion matching. Also, imagers typically have regional sensor imperfections which impose a fixed pattern on each image frame. Fixed pattern noise does not move with respect to the sensor pixels, even if an image moves across the sensor. Thus, fixed pattern noise represents an additional confound in motion compensation. In U.S. patent application Ser. No. 11/225,665, “High Quality Wide-Range Multi-Layer Compression Coding System” by the present inventor, a system is presented for accumulating the fixed pattern noise on pure image black (e.g., with a capped lens) and subtracting it out to reduce the level of this confound with respect to motion compensation.
Thus, as can be seen from all of these various frame-based issues, motion matching is inherently imperfect. When performing noise reduction or frame rate conversion, the goal is the creation of a sequence of image frames with reduced noise, or having a different frame rate. In the case of noise reduction, the inherent motion matching imperfections result in an inability to reduce noise for some regions of some frames. In the case of frame rate conversion, inherent motion matching imperfections result in an inability to improve upon a simple weighted blending of adjacent frames for some regions of some frames.
Aliasing in DWT 9/7 Low Bands
The DWT 9/7 bi-orthogonal subband wavelet algorithm (as used in intra-coded JPEG-2000) builds a resolution pyramid with factor-of-two resolution layers. However, the low-low band of the DWT 9/7 bi-orthogonal subband wavelet differs somewhat from an optimal windowed-sinc filter. This difference will result in some aliasing artifacts in each factor-of-two resolution layer. The more layers of resolution reduction, the more aliasing will be present in the image.
Optimal Filters for Layers
In U.S. patent application Ser. No. 11/225,665, “High Quality Wide-Range Multi-Layer Compression Coding System” by the present inventor, an optimal (windowed-sinc-based) filter is used to create a resolution pyramid with quantized deltas at each resolution level. Such optimal filtered resolution layers are not prone to the aliasing effects of DWT 9/7 low-low subband filters. The use of block-displacement motion compensation is also described at full resolution, or at a layer below full resolution (such as at half or quarter of full resolution).
In practice, one or more optimal filter plus delta layers can be matched with one or more DWT 9/7 bi-orthogonal subband wavelet layers. In particular, when applying block-displacement motion compensation, it is useful to minimize aliasing by using optimal filters for layers immediately adjacent to the motion compensation layer (above, below, or both, even though other layers may use the DWT 9/7 wavelet), since aliasing interferes with block-displacement matching.
In addition, the compression coding system is extended to arbitrarily high precision and dynamic range by utilizing ubiquitous floating point in computations, excluding the quantization step and corresponding variable-length coding of the resulting integers.
Optimal Windowed Sinc Filter
The theoretically optimal filter function for resampling is a sinc function (sine(x)/x) of infinite extent. The theory assumes that the samples themselves were taken with the same sinc filter of infinite extent. However, no real pixel samples correspond to such sampling; they are rather more similar to a Gaussian sample or a box sample. In light of this, a more useful variant of the sinc function is truncated by being weighted with a “window”. A typical practical windowed sinc function has an extent of ±3 times Pi (6 total pixels). A common window function is cosine(Pi·x/6), where x is the position in pixels, which has its first zeros at x = ±3. The central Gaussian-like positive lobe extends to zeros at ±1 times Pi (2 total pixels).
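The windowed sinc can be written out directly. In this sketch the position x is measured in pixels, so the sinc argument is Pi·x and the first zeros fall at x = ±1. The cosine window is assumed to take the form cosine(Pi·x/6) for the ±3 extent, and the function name is illustrative.

```python
import math


def windowed_sinc(x, extent=3.0):
    """Sinc with first zeros at +/-1 pixel, windowed by a cosine reaching zero at +/-extent."""
    if abs(x) >= extent:
        return 0.0
    if x == 0.0:
        return 1.0
    sinc = math.sin(math.pi * x) / (math.pi * x)
    window = math.cos(math.pi * x / (2.0 * extent))   # cos(pi*x/6) when extent is 3
    return sinc * window


for position in (0.0, 0.5, 1.0, 1.5, 2.5, 3.0):
    print(position, windowed_sinc(position))
```

The value at 1.5 is negative (the first negative lobe) and the value at 2.5 is a small positive number (the next positive lobe), both attenuated by the window.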
When applying the windowed sinc to resizing filters, the scale is set to the lower of the two resolutions. When applying the windowed sinc to a displacement filter, the size of both resolutions is the same.
U.S. Pat. No. 6,728,317, “Moving Image Compression Quality Enhancement Using Displacement Filters with Negative Lobes” by the present inventor, describes the use of negative lobed windowed sinc filters for displacement of macroblocks. For each macroblock, a selection can be made of one of a number of displacement filter functions of varying levels of sharpness. Such selection is beneficial since noise levels and sharpness will vary over an image, such that the best match for one macroblock may be sharper or softer than the best match for another macroblock.
In addition to fully-sharp windowed sinc scaled with the first zeros at ±1 times Pi, it is possible to “soften” an image, blurring it slightly, when resampling for displacement or resizing. Softening is achieved by linearly increasing the scale of the sinc function (thus widening it) to greater than ±1 to the first zeros, while retaining the ±3 window extent. The amount that the sinc function scale is increased above ±1 controls the amount of softening (blurring) during the resample.
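The softening described here can be sketched by widening only the sinc term while leaving the window untouched. The parameter name `soften` (the position of the first zeros, in pixels) and the cosine window form are assumptions for illustration.

```python
import math


def softened_sinc(x, soften=1.0, extent=3.0):
    """Windowed sinc whose first zeros sit at +/-soften pixels (soften >= 1).

    The cosine window keeps its +/-extent reach regardless of soften.
    """
    if abs(x) >= extent:
        return 0.0
    if x == 0.0:
        return 1.0
    arg = math.pi * x / soften                 # widen the sinc only
    return (math.sin(arg) / arg) * math.cos(math.pi * x / (2.0 * extent))


print(softened_sinc(1.0))               # fully sharp: first zero at 1 pixel
print(softened_sinc(1.0, soften=1.25))  # softened: still positive at 1 pixel
```

Because the widened kernel no longer places its zeros on the neighboring pixel centers, its taps do not sum to one on their own, so the weights are normalized when the filter is applied.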
There are other common and useful filters applied to resampling which do not use negative lobes. These include simple Gaussian, box, and triangle filters, as well as spline and bicubic smooth curves. It is also sometimes useful to add the inner all-positive center portion of the sinc in the range ±1 with full amplitude and then zero beyond ±1. This form of sinc is truncated to its all positive central region over the range ±1.
Arbitrary or variable resizing can be achieved by placing a windowed sinc (or other useful sampling function) over the center of each new pixel, applied by normalized weighting to the pixels of the source image. Normalization is necessary to ensure that the sum of coefficients of the resampling function adds to 1.0 (unity), since the number of taps applied within the resampling function (e.g., windowed sinc of ±3 extent) during variable resizing can vary widely.
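A one-dimensional sketch of normalized resampling follows. The pixel-center convention, the clamping at the row ends, and the names `kernel` and `resample_1d` are assumptions. The kernel is a cosine-windowed sinc used only as an example sampling function.

```python
import math


def kernel(x, extent=3.0):
    """Cosine-windowed sinc, used here only as an example sampling function."""
    if abs(x) >= extent:
        return 0.0
    if x == 0.0:
        return 1.0
    return (math.sin(math.pi * x) / (math.pi * x)) * math.cos(math.pi * x / (2.0 * extent))


def resample_1d(src, new_length, extent=3.0):
    """Resize a row of pixels to new_length samples with normalized weights."""
    scale = len(src) / new_length          # source pixels per output pixel
    stretch = max(scale, 1.0)              # filter scale follows the lower resolution
    out = []
    for i in range(new_length):
        center = (i + 0.5) * scale - 0.5   # output pixel center in source coordinates
        lo = int(math.floor(center - extent * stretch))
        hi = int(math.ceil(center + extent * stretch))
        total = weight_sum = 0.0
        for j in range(lo, hi + 1):
            w = kernel((j - center) / stretch, extent)
            sample = src[min(max(j, 0), len(src) - 1)]   # clamp at the row ends
            total += w * sample
            weight_sum += w
        out.append(total / weight_sum)     # normalize so the weights sum to 1.0
    return out


print(resample_1d([50.0] * 10, 7))
print(resample_1d([0.0, 0.0, 100.0, 100.0], 8))
```

Dividing by `weight_sum` is what keeps a flat region flat: without it, the varying number of taps under the kernel would change the overall brightness from one output pixel to the next.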
Amplitude Reduction of Resolution Enhancing Layers Near Frame Edges
In U.S. Pat. No. 5,852,565, “Temporal and Resolution Layering in Advanced Television” by the present inventor, resolution enhancing layers are used to improve an MPEG-2 block motion compensated DCT base layer. The resolution enhancing layers are difference images, which may be motion-compensated from previous frame resolution enhancing layer difference images. The resulting difference to the correct resolution enhancement is then coded using either wavelets or DCT. Near the edges of the frame, the amplitude of this resolution enhancing difference layer may be reduced, since regions of interest within a frame are not usually at the very edge of the frame.
Integration of Noise Reduction with Encoding when Using Lossless Residual(s)
In U.S. Patent Application No. 60/758,490, “Efficient Bit-Exact Lossless Image Coding Residual System” by the present inventor, the requirement is established that noise reduction must be integrated with encoding when using a lossless bit-exact residual coding.
Computation of Correlation and Autocorrelation
Correlation and autocorrelation are widely used in signal processing.
Autocorrelation of a region of pixels is computed from the difference between each pixel value in the region and the average (DC) pixel value of that region. These signed differences are squared, and the squares are then summed over the pixels within the region to yield the autocorrelation.
Correlation between two signals, or in this application between two regions of digital image pixels, is also known by the name “cross correlation”. Correlation between two regions of pixels first subtracts the average of each region from each value (to create signed values), and then multiplies the values for each location between the two regions. The result is summed over the pixels within the region to yield the correlation.
Note that correlations can be zero and negative, although autocorrelations are always positive (or zero if all pixels are equal).
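The two computations, as defined in this section, can be sketched as follows. Regions are assumed to be flattened into equal-length lists of pixel values, and the function names are illustrative.

```python
def autocorrelation(region):
    """Sum of squared differences from the region's average (DC) value."""
    mean = sum(region) / len(region)
    return sum((p - mean) ** 2 for p in region)


def correlation(region_a, region_b):
    """Sum of products of the mean-removed values at matching locations."""
    if len(region_a) != len(region_b):
        raise ValueError("regions must contain the same number of pixels")
    mean_a = sum(region_a) / len(region_a)
    mean_b = sum(region_b) / len(region_b)
    return sum((a - mean_a) * (b - mean_b) for a, b in zip(region_a, region_b))


ramp = [1, 2, 3, 4]
print(autocorrelation(ramp))             # positive
print(correlation(ramp, [4, 3, 2, 1]))   # negative: the regions vary oppositely
print(autocorrelation([7, 7, 7, 7]))     # zero: all pixels are equal
```

Under these definitions the correlation of a region with itself equals its autocorrelation.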