This invention relates to processing of digitally encoded video and movie signals, and more particularly to a system and method for de-interlacing, motion compensation and/or frame rate conversion of digitally encoded video and movie signals.
At the present time, the world's "standard definition" and "high definition" television systems have parameters which are relatively incompatible internationally. Even within the U.S., the Advanced Television Systems Committee ("ATSC") is proposing a variety of formats which are relatively incompatible with each other, as well as incompatible with other international standards. Of all of the parameters of television systems, the most problematic and incompatible are frame-rate and interlace.
Most video camera and film images are captured with a single picture output rate. The common output rates are 24 frames-per-second (fps) for film, 25 fps (film in Europe for TV), 50 Hz interlaced, and 60 Hz interlaced. It would also be desirable to have 72 Hz and/or 75 Hz display rates in order to eliminate flicker on CRTs and other flicker-type display devices. (Computer displays most commonly use 75 Hz display to eliminate flicker.) A 60 Hz display rate (U.S. and Japan NTSC TV) and 50 Hz display (European PAL and SECAM TV) have substantial flicker, which becomes intolerable on large bright screens.
One approach to resolving the problem of multiple incompatible frame rates is disclosed in U.S. Pat. No. 5,737,027, entitled PIXEL INTERLACING APPARATUS AND METHOD, assigned to the assignee of the present invention (hereby incorporated by reference). That system used a special camera pixel pattern to generate multiple frame rates, which are otherwise incompatible, from a common signal using a "Pixelace" technique. The "Pixelace" technique uses sub-groups at a high Least Common Multiple (LCM) frame rate of all desired output frame rates in order to allow output at all of the otherwise incompatible frame rates. However, high frame rate cameras which can perform at 1800 fps, which is the LCM of the rates of 24, 25, 30, 50, 60, 72, and 75 fps, are not yet available. Thus, while this system is indeed a solution to the frame rate problem, it requires custom cameras which generate pixels in the "Pixelace" format.
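The 1800 fps figure can be checked with a short calculation. The following Python sketch is illustrative only; the names `common_rate` and `subframes` are hypothetical and do not come from the referenced patent.

```python
from functools import reduce
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def common_rate(rates):
    """Lowest rate (in fps) that every rate in `rates` divides evenly."""
    return reduce(lcm, rates)


RATES_FPS = [24, 25, 30, 50, 60, 72, 75]
lcm_rate = common_rate(RATES_FPS)

# Number of LCM-rate sample periods spanned by one frame at each output rate.
subframes = {rate: lcm_rate // rate for rate in RATES_FPS}

print(lcm_rate)
print(subframes)
```

Under these definitions, one 24 fps frame spans 75 sample periods at the LCM rate, while one 75 fps frame spans only 24.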
Interlace, the sequential display of a field of even raster lines and a field of odd raster lines to make a single frame, makes any form of video conversion difficult. Thus, re-sizing, speed adjustment, frame rate conversion, or resolution change all become very difficult, and the converted results are usually poor in quality.
For a decade or two, "standards converters" have been offered to convert between 50 Hz interlaced PAL and 60 Hz/59.94 Hz interlaced NTSC. These standards converters have been used for some live international sports events coverage such as the Olympics. Such converters often provide poor results, such as a soft, blurry image and peculiar artifacts (such as gymnasts with three legs and four arms during their transient acrobatics).
Some of the artifacts from frame rate conversion are theoretically incapable of being properly detected or repaired. Both interlace and standard image frame capture leave "holes" (i.e., no video information) in their observation of a subject image over time. In particular, interlaced fields have holes between the odd or even scanlines. Thus, for example, small horizontal objects can actually be present in a scene but fall into the unobserved gaps between the scanlines, and thus not appear as part of the video information of any field or frame.
Frame capture on film and video has a duration when a scene is being observed, but also has a time, when the shutter is closed, when there is no observation. This occurs in film because of the need to close the shutter in order to advance to the next frame of film. This occurs in video cameras in order to allow time for the sensor (usually a CCD circuit) to pass the image electrons to the readout electronics. A "short shutter" is also sometimes used to reduce blur in some types of scene, where the amount of time the shutter is closed is manually increased during the capture of a particular scene. For film, the largest duty cycle of open shutter is usually 205 degrees out of 360 degrees for a rotary shutter (57% duty cycle). For CCD sensors, the largest duty cycle is about 80%, depending upon the particular sensor and electronic shutter. FIG. 1 shows an example of a temporal (time) sampling filter for film and CCD cameras. When the shutter is closed (e.g., between the end of Frame n-1 and the beginning of Frame n), no image information is being recorded.
A correct temporal filter cannot be achieved even by a 100% duty cycle (which is a box filter, still subject to some types of aliasing), but would require a time sample for each frame which extended well into the time of previous and subsequent frames. The problem of "not looking" during some of the frame (this is known as "temporal undersampling"), as well as the "box filter" shape of a "constant look" during the shutter-open time, results in theoretically incorrect time filters. This leads to unavoidable "temporal aliasing".
In particular, during the time a shutter is closed, crucial information may occur which is not observed. For example, if at frame "n" a football is to the right of a goalpost, and at frame "n+1" the football is to the left of the goalpost, the crucial information about whether the field-goal was good or not is missing because the shutter was closed during the time the football was passing by the goalpost.
A more optimal temporal sampling pattern would modulate the sensor's sensitivity over time using a function which extends well into neighboring frames. This is not possible with existing 3-CCD sensor cameras (or inexpensive single CCD cameras). Overlap in time implies that multiple CCDs for each color must be used. In addition, current CCDs and their on-off shutters do not allow modulation of sensitivity over time, and would need to be modified to support such sensing patterns. FIG. 2 shows an example of the theoretical shutter characteristics that would result in such a more optimal temporal sampling filter.
It is worthy of note that the Pixelace technique cited above allows such modulated temporal sampling filters to be simulated, by applying scale factors to pixel values within pixel plates based upon their temporal relationship to the filter center time. Further, pixel plates can be applied to construct multiple frames, thereby supporting the overlap necessary for more optimal filters. However, care must be taken to "normalize" the pixel values based upon pixel plate overlap and temporal filter function position. Longer frame times (such as with 24 fps) allow more accurate construction of the filter shape using Pixelace, since more pixel groups are available at the LCM rate to support more data points within the filter shape.
In the absence of new sensor structures, high speed CCDs, or Pixelace compatible cameras, conventional CCD cameras and motion picture film cameras will produce frame (or interlaced field) samples which have inherent temporal undersampling and aliasing. The aliasing will result in artifacts, such as backward-rotating wagon wheels. Aliasing due to undersampling and use of a box filter also makes it difficult to de-interlace or make frame rate conversions. Artifacts which occur from such aliasing are harmonically related to the frame rate conversion relationships. For example, a factor of two or three increase or decrease in frame rate (such as 48 Hz or 72 Hz display of 24 fps movies) gives better results than non-integral relationships (such as 3-2 pulldown for 60 Hz display of 24 fps movies).
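The difference between an integral and a non-integral rate relationship can be seen by listing how many times each source frame is shown. The Python sketch below is a simplified illustration: it treats each 60 Hz field as a whole repeat of a film frame and ignores field parity, and the function names are hypothetical.

```python
def pulldown_32(frames):
    """Map 24 fps frames onto 60 Hz fields with the 3-2 cadence:
    frames are held alternately for three fields and two fields."""
    fields = []
    for index, frame in enumerate(frames):
        repeat = 3 if index % 2 == 0 else 2
        fields.extend([frame] * repeat)
    return fields


def integer_repeat(frames, factor):
    """Show every frame the same number of times (e.g. 72 Hz from 24 fps)."""
    return [frame for frame in frames for _ in range(factor)]


film = list("ABCD")
print("".join(pulldown_32(film)))        # uneven holds: 3, 2, 3, 2
print("".join(integer_repeat(film, 3)))  # even holds: 3, 3, 3, 3
```

With the 3-2 cadence, alternate frames are displayed for unequal durations, which is one reason the non-integral case shows more visible motion artifacts than the even hold of the integral case.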
U.S. Pat. No. 5,852,565, entitled TEMPORAL AND RESOLUTION LAYERING IN ADVANCED TELEVISION (assigned to the assignee of the present invention and hereby incorporated by reference), teaches that some of the frame-rate and resolution incompatibilities may be handled by restricting frame rate capture and display to specific frame rates and resolutions. These formats are preferably matched to the capabilities of a conventional encoding scheme, such as the MPEG-2 and MPEG-4 standards.
The problem of arbitrary frame rate conversion and de-interlacing still remains as a challenge when utilizing the relatively incompatible common TV system parameters at 24, 25, 50, and 59.94/60 Hz. The international television community remains divided into camps, each favoring television format parameters which are incompatible with those of other camps.
Full correction of the spatio-temporal aliasing caused by interlace, and the temporal aliasing caused by temporal undersampling, will remain ever elusive due to absolute theoretical limitations. The best that can be done is to attempt to determine some information about the movement of objects within the scene, and use that information in the most appropriate ways.
Another key concept in temporal sampling is that of "motion blur". During the time a film or CCD shutter is open, a moving object will "smear" across a number of pixels as the object moves. With a temporal box filter, this smear forms a uniform blur in the direction of motion during the time that the shutter is open. For example, FIG. 3 shows the smeared image that a ball would make moving across a scene from point A to point B while a shutter is open.
At higher resolutions, the number of pixels crossed during the shutter-open blur time is greater than at the lower resolutions of existing standard definition NTSC and PAL television. At lower frame rates, such as film's 24 fps, the shutter is open longer (at 1/40th of a second for a 205 degree shutter) than at 60 fps or 72 fps (where a 75% duty cycle = 1/100th of a second shutter), thereby creating larger blur areas for the lower frame rates.
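The shutter-open times quoted above are rounded. The Python sketch below works them out exactly from the duty cycle and frame rate, and estimates the resulting blur length. The function names and the example object speed (one 1920-pixel picture width per second) are assumptions chosen for illustration.

```python
from fractions import Fraction


def exposure_seconds(fps, duty_cycle):
    """Shutter-open time per frame, as an exact fraction of a second."""
    return Fraction(duty_cycle) / fps


def blur_pixels(speed_px_per_s, fps, duty_cycle):
    """Pixels crossed by an object while the shutter is open."""
    return speed_px_per_s * exposure_seconds(fps, duty_cycle)


film_24 = exposure_seconds(24, Fraction(205, 360))  # 205 degree rotary shutter
video_60 = exposure_seconds(60, Fraction(3, 4))     # 75% duty cycle
video_72 = exposure_seconds(72, Fraction(3, 4))     # 75% duty cycle

for name, t in [("24 fps", film_24), ("60 fps", video_60), ("72 fps", video_72)]:
    print(f"{name}: {t} s (about 1/{float(1 / t):.1f} s)")

# Assumed example: an object crossing 1920 pixels per second.
print(float(blur_pixels(1920, 24, Fraction(205, 360))))
print(float(blur_pixels(1920, 72, Fraction(3, 4))))
```

The exact values are close to the rounded figures in the text: roughly 1/42 of a second for the 205 degree film shutter and 1/96 of a second at 72 fps, so the film frame smears the same object over more than twice as many pixels.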
U.S. Pat. No. 5,852,565, entitled TEMPORAL AND RESOLUTION LAYERING IN ADVANCED TELEVISION, teaches a key relationship between motion blur, frame rate, and human visual perception. Based on experimentation, for short shutters (e.g., 20 to 40% duty cycle), a frame rate of 36 fps is much more acceptable to the eye than is 24 fps. 30 fps is found to be on the border of acceptability. These facts become important in temporal layering, since MPEG and other image coders and processors process frames (or fields) as the basic unit of information. Thus, if a subset of frames is to be decoded in order to provide temporal layering, the relationship of frame rate and motion blur becomes a central issue.
It is also worthy of note that the blur from a box-filter (open/closed) shutter is equal over its extent. Thus, a ball moving across the frame will have a uniform dim appearance as a smeared semi-transparent sausage with soft blurry ends and well defined sides, as in FIG. 3. For a more correct temporal filter, however, the smear would be much longer, but would be much more centrally concentrated. FIG. 4 shows how the ball of FIG. 3 would appear if a more correct temporal filter were applied. A semi-transparent ball 40 moving from A to B would appear at the center of a longer "streamer" with ends which faded out on both sides of the central ball 40.
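The contrast between the two smear shapes can be illustrated in one dimension. The Python sketch below accumulates a one-pixel object moving along a scanline under two temporal weightings. The triangular weighting is only an assumed stand-in for a "more correct" filter (the text does not specify its shape), and all names and parameters are hypothetical.

```python
def render_smear(width, center_x, velocity, half_span, weight_fn):
    """Accumulate a 1-pixel object over time samples -half_span..half_span.

    `velocity` is in whole pixels per time sample; weights are normalized
    so that the total recorded energy is 1.
    """
    times = range(-half_span, half_span + 1)
    weights = [weight_fn(t, half_span) for t in times]
    total = sum(weights)
    line = [0.0] * width
    for t, w in zip(times, weights):
        x = center_x + velocity * t
        if 0 <= x < width:
            line[x] += w / total
    return line


def box(t, half_span):
    """Open/closed shutter: constant sensitivity while open."""
    return 1.0


def triangle(t, half_span):
    """Assumed tapered sensitivity: highest at the center time, fading out."""
    return float(half_span + 1 - abs(t))


box_line = render_smear(33, 16, 1, 4, box)       # short, uniform smear
tri_line = render_smear(33, 16, 1, 8, triangle)  # longer, centrally weighted

print([round(v, 3) for v in box_line[8:25]])
print([round(v, 3) for v in tri_line[8:25]])
```

The box weighting gives equal values across its nine pixels and an abrupt cutoff, while the tapered weighting spreads over seventeen pixels with its largest value at the center and ends that fade toward zero.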
While various types of multi-frame filters and noise filters can be used to reduce noise and/or smooth the problems of interlace, static temporal filters can soften certain patterns which can cancel each other due to their pattern and movement (such as stripes which move into their opposite in a field time), or reduce the amplitude of quick events and details. Although these techniques have proven helpful in reducing image noise and interlace artifacts, all comparison, filtering, medians, and processing are done at the same location in the current frame or field with respect to previous and subsequent frames or fields. This results in low effectiveness under normal conditions of motion, since the pixel at the same location in previous and subsequent frames does not represent the same location on a moving object.
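The weakness of same-location filtering can be shown with a three-frame median on a single scanline. The Python sketch below is a minimal illustration: the object velocity is assumed to be known and uniform across the line, which a real system would have to estimate per pixel, and the function names are hypothetical.

```python
from statistics import median


def static_median(prev, cur, nxt):
    """Median of the three frames taken at the same pixel location."""
    return [median(values) for values in zip(prev, cur, nxt)]


def shift(line, dx, fill=0):
    """Shift a scanline right by dx pixels (left if dx is negative)."""
    n = len(line)
    return [line[i - dx] if 0 <= i - dx < n else fill for i in range(n)]


def aligned_median(prev, cur, nxt, velocity):
    """Median taken along the object's path: neighbors are shifted so that
    the moving object lines up with its position in the current frame."""
    return static_median(shift(prev, velocity), cur, shift(nxt, -velocity))


# A one-pixel bright object moving 2 pixels per frame.
prev = [0, 0, 100, 0, 0, 0, 0, 0]
cur = [0, 0, 0, 0, 100, 0, 0, 0]
nxt = [0, 0, 0, 0, 0, 0, 100, 0]

print(static_median(prev, cur, nxt))      # the object is filtered out
print(aligned_median(prev, cur, nxt, 2))  # the object is preserved
```

At a fixed location the object appears in only one of the three frames, so the median discards it as if it were noise. Following the object's motion keeps it intact.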
It has been known for some time that computation is reduced when determining motion vectors by utilizing a hierarchical motion search. For example, the MPEG algorithms attempt to find a match between "macroblock" regions, usually having a size of 16×16 or 8×8 pixels. MPEG, and other motion compensated DCT (discrete cosine transform) coders, attempt to match each macroblock region in a current frame with a position in a previous frame (P frame) or previous and subsequent frame (B frame). However, it is not necessary to find a good match, since MPEG can code a new macroblock as a fresh stand-alone ("intra") macroblock without using previous or subsequent frames. In such motion compensated DCT systems, one macroblock motion vector is needed for each macroblock region.
FIG. 5 is a diagram showing how a current frame macroblock 50 is compared to similarly-sized regions 52 in a previous frame in an attempt to find a good match, and thus define a corresponding motion vector 54. One region 56 will be a best match, and the corresponding vector 58 defining that region's XY (Cartesian) offset from the current frame macroblock 50 will be the macroblock motion vector for the current frame macroblock 50 with respect to the previous frame.
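The comparison of FIG. 5 can be sketched as an exhaustive block search. The Python example below is a minimal sketch under stated assumptions: it uses the sum of absolute differences (SAD) as the match measure, which is a common choice but is not named in the text, and it builds a synthetic random test frame. All function and variable names are hypothetical.

```python
import random


def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for j in range(size):
        for i in range(size):
            total += abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
    return total


def full_search(cur, ref, cx, cy, size, radius):
    """Try every offset within +/-radius; return (dx, dy, cost) of the best."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate region falls outside the previous frame
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2], best[0]


# Synthetic test: the current frame is the previous frame moved by
# (MX, MY) pixels, with wrap-around at the edges.
rng = random.Random(0)
H = W = 32
prev = [[rng.randrange(256) for _ in range(W)] for _ in range(H)]
MX, MY = 3, -2
cur = [[prev[(y - MY) % H][(x - MX) % W] for x in range(W)] for y in range(H)]

dx, dy, cost = full_search(cur, prev, cx=12, cy=12, size=8, radius=4)
print(dx, dy, cost)
```

The returned offset points from the current block back to where its content sat in the previous frame, so content that moved by (3, -2) yields an offset of (-3, 2).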
A hierarchical motion search attempts to match a reduced resolution picture with a wide search range, and then do a "fine" match at a higher resolution with a narrower search range, thus optimizing computation. This is accomplished by filtering large macroblocks down (for example, from 16×16 down to 8×8 and 4×4). A coarse match is accomplished using the lowest resolution (e.g., 4×4) blocks and a wide search region. Using the motion vector from the best match from the lowest resolution blocks, a finer resolution (e.g., 8×8) search is made, moving locally around the center defined by the "tip" of the best match motion vector found from the lowest resolution search. Then a final high resolution (e.g., 16×16) search is made, moving locally around the center defined by the "tip" of the best match motion vector found from the finer resolution search, to find the final best match motion vector.
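The coarse-to-fine procedure can be sketched as follows. This Python example makes several assumptions not stated in the text: the filtering step is a plain 2×2 average, the match measure is the sum of absolute differences, and the refinement range is ±1 pixel at each finer level. The test displacement is chosen as a multiple of 4 so that an exact match exists at every level of this synthetic example. All names are hypothetical.

```python
import random


def downsample(img):
    """Halve resolution by averaging each 2x2 group of pixels."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(size) for i in range(size))


def search(cur, ref, cx, cy, size, center, radius):
    """Best (dx, dy) within +/-radius of `center`, ignoring candidates
    that fall outside the reference frame."""
    h, w = len(ref), len(ref[0])
    best_cost, best_vec = None, center
    for dy in range(center[1] - radius, center[1] + radius + 1):
        for dx in range(center[0] - radius, center[0] + radius + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + size > w or ry + size > h:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_cost, best_vec = cost, (dx, dy)
    return best_vec


def hierarchical_search(cur, ref, cx, cy, size=16,
                        coarse_radius=8, refine_radius=1):
    """Wide search at quarter resolution, then local refinement at half
    and full resolution around the doubled vector from the level below."""
    cur_pyr, ref_pyr = [cur], [ref]
    for _ in range(2):
        cur_pyr.append(downsample(cur_pyr[-1]))
        ref_pyr.append(downsample(ref_pyr[-1]))
    vec = search(cur_pyr[2], ref_pyr[2], cx // 4, cy // 4, size // 4,
                 (0, 0), coarse_radius)
    for level in (1, 0):
        scale = 2 ** level
        center = (vec[0] * 2, vec[1] * 2)  # the "tip" of the coarser vector
        vec = search(cur_pyr[level], ref_pyr[level], cx // scale, cy // scale,
                     size // scale, center, refine_radius)
    return vec


# Synthetic test frame moved by (MX, MY) with wrap-around.
rng = random.Random(1)
H = W = 64
prev = [[rng.randrange(256) for _ in range(W)] for _ in range(H)]
MX, MY = 12, -8
cur = [[prev[(y - MY) % H][(x - MX) % W] for x in range(W)] for y in range(H)]

vector = hierarchical_search(cur, prev, cx=24, cy=24)
print(vector)
```

A coarse radius of 8 at quarter resolution covers roughly ±32 pixels at full resolution, so the wide range is searched only on the smallest blocks and each finer level examines just nine candidates.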
This form of hierarchical motion search is used to reduce the amount of motion vector computation in hardware and software MPEG encoders. The following computation table shows the amount of computations that can be saved for a typical search region scenario (single 16×16 macroblock search computations for a ±31 pixel search region):
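The order of magnitude of the saving can be estimated with a short calculation. The Python sketch below counts pixel comparisons only, and rests on assumptions that are not taken from the table: a ±8 search at quarter resolution standing in for ±31 at full resolution, a ±1 refinement at each finer level, and no charge for the filtering that builds the reduced resolution blocks. Its figures are therefore illustrative and should not be read as the values of the table.

```python
def positions(radius):
    """Number of candidate offsets in a square +/-radius search."""
    return (2 * radius + 1) ** 2


def full_search_ops(block=16, radius=31):
    """Pixel comparisons for an exhaustive full-resolution search."""
    return positions(radius) * block * block


def hierarchical_ops(block=16, radius=31, refine_radius=1):
    """Pixel comparisons for a three-level search (quarter, half, full)."""
    coarse_radius = -(-radius // 4)  # ceiling of radius / 4
    coarse = positions(coarse_radius) * (block // 4) ** 2
    middle = positions(refine_radius) * (block // 2) ** 2
    fine = positions(refine_radius) * block ** 2
    return coarse + middle + fine


full = full_search_ops()
hier = hierarchical_ops()
print(full, hier, round(full / hier, 1))
```

Under these assumptions the exhaustive search needs about one million pixel comparisons per macroblock, and the three-level search needs several thousand.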
The macroblock difference technique of MPEG and similar DCT coding schemes has proven reasonably effective for compression coding. MPEG only requires a statistical benefit from motion vector matches, since it can code any amount of difference using new DCT coefficients. Thus, if one time out of a hundred comparisons the best match is fairly poor, there will be many extra coefficients in the DCT difference, but only for that 1% case. Further, if a match is sufficiently bad in comparison to some threshold, it may be "cheaper" to code an MPEG "intra" macroblock, and thus not depend at all upon any previous frame. These techniques allow MPEG to statistically provide excellent compression, without relying on every macroblock having a good match.
In attempting motion analysis, the entire picture must look good. For motion blur and frame rate conversion, it is unacceptable to have 1% of the picture flying off in the wrong direction, although it may be acceptable in noise reduction and deinterlacing to gain no benefit on a small percentage of the picture. The macroblock technique described above is limited in its usefulness to applications such as compression coding. In particular, 16×16 blocks can yield blocking artifacts if the matches are poor, and these artifacts are among the worst in appearance that MPEG produces, especially since most moving objects do not fit neatly into 16×16 squares. Moreover, the edges of moving objects, along their direction of motion, should be clearly defined down to the pixel. It is also apparent that a stationary object in front of a moving background must also have a clearly defined edge. Further, using only one motion vector per pixel can yield jagged edges and other aliasing artifacts.
This invention teaches techniques for implementing a variety of temporal functions such as de-interlacing, frame rate conversion, and multi-frame noise reduction. A key element in performing such temporal processing is pixel-level motion analysis. Motion analysis attempts to identify where each pixel, which represents a point on a potentially moving object, might be found in previous and subsequent frames. A set of such identified pixels defines a dynamic "pixel trajectory". A "pixel trajectory" is preferably represented as a set of motion vectors, which collectively indicate where each pixel seemingly has moved from or seemingly will move to from frame to frame.
The invention utilizes multiple motion vectors per pixel of the final image. In a preferred embodiment, this is accomplished by increasing the size of the image, with the amount of size increase depending upon the degree of sub-pixel accuracy desired. In a preferred embodiment, image size is doubled. Thus, four motion vectors are generated for each pixel. Each motion vector is found by searching independently for the best match with previous and subsequent frames.
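The relationship between image doubling and four vectors per pixel can be sketched as a regrouping step. The Python example below assumes that a motion vector has already been found for every pixel of the size-doubled image, and that vectors are divided by the scale factor to express them in final-image pixel units (giving half-pixel steps when the size is doubled). Both the data layout and the function name are hypothetical.

```python
def vectors_per_output_pixel(field_up, scale=2):
    """Group a motion-vector field computed on an upsized image into a list
    of scale*scale vectors for each pixel of the final image.

    `field_up[y][x]` is an (dx, dy) vector in upsized-image pixels; the
    returned vectors are in final-image pixels.
    """
    height = len(field_up) // scale
    width = len(field_up[0]) // scale
    grouped = []
    for y in range(height):
        row = []
        for x in range(width):
            group = []
            for j in range(scale):
                for i in range(scale):
                    dx, dy = field_up[scale * y + j][scale * x + i]
                    group.append((dx / scale, dy / scale))
            row.append(group)
        grouped.append(row)
    return grouped


# A 4x4 vector field for a doubled 2x2 image; one sub-pixel moves differently.
field = [[(2, 0) for _ in range(4)] for _ in range(4)]
field[1][0] = (3, -1)

result = vectors_per_output_pixel(field)
print(result[0][0])
print(result[1][1])
```

Because each of the four vectors is searched independently, a final pixel that straddles the edge of a moving object can carry vectors that disagree, as the first output pixel does here.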
In one aspect, the invention includes a method of temporal processing of motion picture image frames each comprising a plurality of pixels, including the steps of comparing each pixel of a current frame to at least one previous or subsequent image frame; determining at least one motion vector corresponding to each such pixel relative to such at least one previous or subsequent image frame; and saving the determined motion vectors. The invention also includes the steps of applying motion vectors corresponding to multiple image frames to define a new pixel for each pixel of the current frame, and outputting all of such new pixels as a constructed image frame.
The invention achieves a high resolution and high quality result, with better motion conversion, de-interlacing, motion blur, and noise reduction results than have heretofore been practically achieved.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.