The present invention relates to the field of video processing, and more particularly to methods and apparatuses for detecting the presence of progressive frames in a sequence of video fields.
A telecine is a well-known apparatus that converts a motion picture film into a video format for display on a device such as a television. Both motion picture film and video create the illusion of moving pictures by sequentially displaying a series of still image frames that represent the image at corresponding sequential instants of time. The conversion process must take into account differences in display format as well as differences in image frame rate.
Considering display format first, each portion of a motion picture film frame is displayed simultaneously to the user. By contrast, video images are created by sequentially “painting” dots, called “pixels”, onto a suitable screen, such as a cathode ray tube (CRT). The pixels are supplied in an order that draws horizontal lines on the screen, one line at a time. This is performed at a fast enough rate such that the viewer does not experience the individual pixels, but rather sees the combination of displayed pixels as a single image. The lines of horizontal pixels may be drawn in several different ways. If a progressive scan order is used, the lines are supplied in sequence from, for example, top to bottom. Alternatively, an interlaced scan order can be used, wherein the image frame (which comprises the totality of scan lines to be displayed for the given frame) is divided into even and odd fields. The even field comprises all of the even numbered scan lines, and the odd field comprises all of the odd numbered scan lines. In an interlaced video display system, an entire even field is supplied to the screen, followed by the odd field. This pattern is then repeated for each frame to be displayed.
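The division of a frame into even and odd fields can be illustrated with a minimal Python sketch. The function names `split_fields` and `weave` are hypothetical, and the sketch assumes that scan lines are numbered from 0 (so that line 0 belongs to the even field); the text above does not fix a numbering origin.

```python
def split_fields(frame):
    """Split a frame (a list of scan lines) into (even_field, odd_field).

    Assumption: scan lines are numbered from 0, so line 0 is an even line.
    """
    even_field = frame[0::2]
    odd_field = frame[1::2]
    return even_field, odd_field


def weave(even_field, odd_field):
    """Interleave an even field and an odd field back into a full frame."""
    frame = []
    for i in range(max(len(even_field), len(odd_field))):
        if i < len(even_field):
            frame.append(even_field[i])
        if i < len(odd_field):
            frame.append(odd_field[i])
    return frame


frame = ["line0", "line1", "line2", "line3", "line4"]
even, odd = split_fields(frame)
print(even)                       # ['line0', 'line2', 'line4']
print(odd)                        # ['line1', 'line3']
print(weave(even, odd) == frame)  # True
```

Weaving two fields that were captured at the same instant restores the original frame exactly, which is the property that progressive frame detection relies upon.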
Considering now differences in display rates, standard motion picture film is shot at a rate of 24 frames per second (fps). By contrast, existing television systems, such as those operating in accordance with National Television System Committee (NTSC), Phase Alternating Line (PAL) and High Definition (HD) television standards, have video frame rates that include 24, 30 and 25 fps.
In converting from a 24 fps film image to a 30 fps video image, the frame rate must increase by 25% so that when the film frames are played back as video they transpire in the same 1 second that they would have on film. This can be accomplished by outputting 2.5 video frames for every 2 film frames. Since a telecine typically needs to generate an interlaced video output comprising alternating odd and even fields, this rate difference equates to outputting 5 video fields for every 4 film fields. One way to accomplish this is by extracting 2 fields from one film frame, and 3 fields from the next. In the 3-field sequence (henceforth referred to as the field “triplet”), the first and third fields are derived from the same film frame, and are therefore identical. The specific conversion from 24 to 30 fps is called 2:3 pulldown (also referred to as 3:2 pulldown). This process is illustrated in FIG. 1(a). The top strip shows a film sequence 101 of a ball moving from left to right across the frame. Each of these frames may be considered to be in a “progressive” format since, if separated into odd and even fields, both fields will have been captured at the same instant in time. In contrast, the interlaced NTSC video format has odd and even fields that are captured 1/60 of a second apart.
The second strip in FIG. 1(a) shows the output 103 of the 3:2 pulldown telecine process. In the figure, the label “Ao” denotes the first odd video field, the label “Ae” denotes the first even video field, the label “Bo” denotes the second odd video field, and so on. Each successive pair of odd and even fields constitutes one video frame, capable of being displayed on an interlaced video display device. Note that as a result of the 3:2 pulldown process which selectively duplicates certain fields, the field “Co” is not in the same video frame as that constituted by the fields “Bo” and “Be” even though the field “Co” originated from the same film frame as the fields “Bo” and “Be”. Likewise, although “Co” and “Ce” are in the same video frame, they originated in different film frames.
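The field sequence produced by 3:2 pulldown can be sketched in a few lines of Python. This is an illustrative model only: the function name `pulldown_2_3` is hypothetical, film frames are represented by integer indices, and the labels follow the “Ao”, “Ae”, “Bo” naming used for FIG. 1(a).

```python
def pulldown_2_3(num_film_frames):
    """Return the video field sequence as (label, film_frame_index) pairs.

    Film frames alternately contribute 2 fields and 3 fields. Labels name
    the video frame (A, B, C, ...) and the field parity (o = odd, e = even).
    The letter scheme is only meant for short illustrative sequences.
    """
    fields = []
    for film_frame in range(num_film_frames):
        repeat = 2 if film_frame % 2 == 0 else 3
        for _ in range(repeat):
            index = len(fields)
            parity = "o" if index % 2 == 0 else "e"
            video_frame = chr(ord("A") + index // 2)
            fields.append((video_frame + parity, film_frame))
    return fields


for label, film_frame in pulldown_2_3(4):
    print(label, "<- film frame", film_frame)
```

Four film frames yield ten video fields. In this model, “Bo”, “Be” and “Co” all come from film frame 1, while “Ce” comes from film frame 2, which matches the observation that “Co” and “Ce” share a video frame but not a film frame.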
It is useful to have the capability of detecting whether telecine processing or other processing (e.g., computer-generated video, which would also be in the form of video fields that can be combined to form progressive video frames) has been employed in the generation of video material, and if so, to be able to identify those fields in the sequence that have been “pulled down”. How this information is utilized depends on the type of application that is to take place. For example, when a telecine processed video image is to be compressed (i.e., so that the image can be represented in fewer digital bits) the repeated fields are simply discarded and the compression routine supplies the appropriate field replication markings. In another example, when a telecine processed video image is to undergo interlace-to-progressive format conversion, no processing to generate a synthetic field (either via interpolation or motion compensation techniques) takes place, and the action is merely to bundle back together the appropriate fields into their original progressive frame state. Thus a progressive frame may sometimes be reconstructed by pairing a source field with the field before, or with the field after, or sometimes with either. This is illustrated in FIG. 1(b). The first strip in FIG. 1(b) pairs each field with the immediately preceding field, and is thus labeled “Field−1”. The next strip in FIG. 1(b) pairs each field with the immediately succeeding field, and is thus labeled “Field+1”. Note that while some pairings yield the original progressive film frames (e.g., “AeAo” and “BeBo”), other pairings yield incorrect results (e.g., “BoAe” and “CeCo”). By correctly selecting the pairings which yield the original progressive film frames, a 60 fps progressive output can be achieved as shown in the last strip in FIG. 1(b).
Note in the last strip in FIG. 1(b) that there are two frames that have two correct pairings: a first frame that could either be “BeBo” or “BeCo”, and a second frame that could either be “EoDe” or “EoEe”. This is a characteristic of a field triplet. The center field can be paired with either the immediately preceding field or with the immediately succeeding field, since both are identical. In the video compression application, the second of the identical fields (“Co” and “Ee”) would be labeled as being replicated and would not be subjected to the lengthy compression algorithm.
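The pairings described for FIG. 1(b) can be reproduced with a small Python sketch. It assumes that the originating film frame of every field is already known (a list of hypothetical integer frame indices), which a real detector would not have; it only illustrates why the center field of a triplet has two correct pairings.

```python
def correct_pairings(sources):
    """For each field, report whether Field-1 and Field+1 are correct mates.

    `sources[n]` is the (assumed known) film frame index that field n came
    from. Returns a list of (prev_ok, next_ok) tuples.
    """
    result = []
    for n, src in enumerate(sources):
        prev_ok = n > 0 and sources[n - 1] == src
        next_ok = n + 1 < len(sources) and sources[n + 1] == src
        result.append((prev_ok, next_ok))
    return result


labels = ["Ao", "Ae", "Bo", "Be", "Co", "Ce", "Do", "De", "Eo", "Ee"]
sources = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]  # 3:2 pulldown of four film frames

for label, (prev_ok, next_ok) in zip(labels, correct_pairings(sources)):
    print(label, "Field-1 ok:", prev_ok, "Field+1 ok:", next_ok)
```

Only “Be” and “Eo”, the center fields of the two triplets, report both pairings as correct.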
In a perfect (unedited) 3:2 sequence, replicated fields occur once every fifth field. Prior techniques for pulldown detection utilize this sequence repetition for identifying the field triplet location, and hence the 3:2 pattern. If the 3:2 sequence is not perfect, as is the case with edited material, the pattern is broken. The break in the sequence pattern cannot be detected until the location of the next field triplet arrives and the expected field replication is not found. Thus these conventional techniques must buffer the fields between the triplets or suffer the consequences of incorrect pairing. Decisions have to then be made on how to treat these buffered fields without the knowledge of where they fall in the 3:2 sequence, all of which leads to processing latency.
An example of edited 3:2 material is shown in FIGS. 2(a)-(e). FIG. 2(a) illustrates a telecined 3:2 pulldown sequence 201 with edited frames (Bo, Be) and (Do, De) shown in cross-hatch. FIG. 2(b) shows the same sequence 201′ with the edited frames removed. As in FIG. 1(b), the FIGS. 2(c) and 2(d) show the results of pairings with the immediately preceding field (Field−1) and the immediately succeeding field (Field+1). Again, the strip depicted in FIG. 2(e) shows the pairing which would yield the original progressive film frame. Note that there is no appropriate pairing for the center two frames 203 and 205. This is because the fields that would have yielded the correct pairing with “Co” and “Ce” (i.e., Be and Do) were edited out. Fields “Co” and “Ce” are commonly referred to as “hanging fields”. To produce a progressive field mate for these hanging fields, techniques such as field interpolation are typically employed.
The above-described 3:2 pulldown is just one type of conversion from motion picture film to a video format. For example, another film-to-video conversion process, called “2:2 pulldown”, operates by extracting one odd and one even field from every film frame. This is illustrated in FIGS. 3(a) and 3(b) with the five strips representing the same stages as are depicted in FIGS. 1(a) and 1(b). Note that in contrast with 3:2 pulldown, there are no instances of two correct pairings for one video frame, due to there being no triplets in 2:2 pulldown video. Thus video frames can be edited out without disrupting the 2:2 sequence; however, hanging fields will occur if an edit occurs on a field boundary within a frame. Without the triplet, there is no easy and obvious key upon which to rely in determining when the 2:2 pulldown has begun, or when an edit has occurred in the middle of the 2:2 pulldown sequence.
As a point of comparison, FIG. 4(a) illustrates a strip of so-called “native video” 401, which is a sequence of video frames that did not originate from a film source or other progressive frame generator. FIG. 4(b) shows the field pairing associated with the “native video” 401. Note that for a scene with motion, there are no pairings that yield a correct progressive video frame. This is because each successive field is captured at a slightly later instant in time (e.g., at 1/60th of a second later than the immediately preceding field). In this case, as is the case with hanging fields, other techniques such as field interpolation, spatial-temporal filtering, motion adaptive, and motion compensation deinterlacing are necessary to provide the complementary field for pairing.
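The effect of the edit in FIGS. 2(a)-(e) can be modeled with a short Python sketch. As an assumption for illustration, each field carries a known film frame index; the helper name `hanging_fields` is hypothetical.

```python
def hanging_fields(sources):
    """Return indices of fields with no mate from the same film frame."""
    hanging = []
    for n, src in enumerate(sources):
        has_prev = n > 0 and sources[n - 1] == src
        has_next = n + 1 < len(sources) and sources[n + 1] == src
        if not has_prev and not has_next:
            hanging.append(n)
    return hanging


labels = ["Ao", "Ae", "Bo", "Be", "Co", "Ce", "Do", "De", "Eo", "Ee"]
sources = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]

# Remove the edited video frames (Bo, Be) and (Do, De).
removed = {"Bo", "Be", "Do", "De"}
kept = [(lab, src) for lab, src in zip(labels, sources) if lab not in removed]

kept_labels = [lab for lab, _ in kept]
kept_sources = [src for _, src in kept]
print([kept_labels[n] for n in hanging_fields(kept_sources)])  # ['Co', 'Ce']
```

In this model, the fields left without any correct neighbor after the edit are exactly “Co” and “Ce”.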
In accordance with conventional techniques, knowledge of the video type (e.g., 3:2 pulldown, 2:2 pulldown, computer-generated progressive, native) was required to accurately convert the interlaced fields to progressive frames. Since a large portion of source material has been edited, two or more of the video types are often combined. To cope with this possibility, ancillary information (e.g., an in-the-loop workstation operator, or a complete edit list supplied by an operator) was required, defining which video type was to be expected in order to correctly determine the field pairings for generation of the progressive frames. Thus, there is a need for an autonomous technique for detection of progressive frames in a mixed media film/video sequence that is independent of any a priori knowledge of the video type, and the frequency and location of edits. There is a further need for such a technique to be applicable to video processes such as interlace-to-progressive conversion and video compression.
In accordance with one aspect of the present invention, the foregoing and other objects are achieved in methods and apparatuses that detect a progressive video frame in a sequence of video fields, wherein the sequence of video fields includes a target video field. This may be accomplished by generating a first metric by comparing the target video field with an immediately preceding video field. Alternatively, the first metric may be generated by comparing the target video field with an immediately succeeding video field. The first metric is then compared with a first threshold value. If the first metric is less than the first threshold value, then the immediately preceding video field (or alternatively, the immediately succeeding video field) is found to have been derived from a same progressive video frame as the target video field.
In another aspect of the invention, where the first metric is generated by comparing the target video field with an immediately preceding video field, progressive video frame detection may further include generating a second metric by comparing the target video field with an immediately succeeding video field; comparing the second metric with a second threshold value; and determining that the immediately succeeding video field is derived from the same progressive video frame as the target video field if the second metric is less than the second threshold value. In this way, fields are considered three-at-a-time. The first threshold value may be equal to the second threshold value, but this need not be the case in all embodiments.
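The three-at-a-time decision can be sketched as follows. The function name and the string return values are hypothetical; how the metrics themselves are computed is left open here, and the threshold values in the usage lines are arbitrary illustrative numbers.

```python
def progressive_mates(metric_prev, metric_next, threshold_prev, threshold_next):
    """Decide which neighboring fields share a progressive frame with the target.

    A metric below its threshold indicates that the corresponding neighbor
    is derived from the same progressive frame as the target field.
    """
    prev_match = metric_prev < threshold_prev
    next_match = metric_next < threshold_next
    if prev_match and next_match:
        return "both"
    if prev_match:
        return "prev"
    if next_match:
        return "next"
    return "none"


print(progressive_mates(3, 250, 50, 50))    # prev
print(progressive_mates(4, 2, 50, 50))      # both
print(progressive_mates(300, 280, 50, 50))  # none
```

A result of "both" corresponds to the center field of a triplet, and "none" corresponds to a hanging field or to native video with motion.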
In either or both of the above aspects, the first and second metrics may be indicative of a quantity of interlace artifacts.
In another aspect of the invention, generation of each metric may be accomplished by, for each of a group of target pixels comprising one or more pixels in the target video field, generating an inflection indicator, by comparing the target pixel with at least one neighboring pixel in the immediately preceding video field; for each of one or more of the target pixels, generating an artifact detection indicator by determining whether a pattern formed by the inflection indicator of the target pixel and the inflection indicators of one or more neighboring pixels matches at least one of one or more artifact-defining patterns; and generating the metric by combining the artifact detection indicators.
In yet another aspect of the invention, generating the artifact detection indicators includes, for each of one or more of the target pixels, performing an artifact detection indicator operation that comprises: first determining whether the target pixel has an inflection. If the target pixel has an inflection, then determining whether there is a first vertically displaced pixel in the line above the target pixel and also a second vertically displaced pixel in the line below the target pixel, wherein the first vertically displaced pixel lies either directly in line with the target pixel or else is horizontally displaced by no more than one pixel location from the target pixel, wherein the second vertically displaced pixel lies either directly in line with the target pixel or else is horizontally displaced by no more than one pixel location from the target pixel, and wherein the first and second vertically displaced pixels each have an inflection indicator of the opposite polarity to that of the target pixel. If the target pixel has an inflection, then it is determined whether there is a horizontally adjacent pixel having an inflection indicator of the same polarity. Additionally, if the target pixel has an inflection, it is determined whether there is not a horizontally adjacent pixel with an inflection indicator of opposite polarity to that of the target pixel.
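One possible reading of the artifact detection indicator operation is sketched below in Python. The sketch makes several assumptions that the text does not state: inflection indicators are encoded as +1 (positive), −1 (negative) or 0 (none); all of the listed conditions must hold together for an artifact to be flagged; and pixels on the top and bottom lines are never flagged. The function name `artifact_indicator` is hypothetical.

```python
def artifact_indicator(infl, x, y):
    """Return True if the pixel at (x, y) matches the artifact pattern.

    `infl[y][x]` holds +1, -1 or 0 (assumed encoding of the inflection
    indicator). Assumption: every condition below must be satisfied.
    """
    height, width = len(infl), len(infl[0])
    polarity = infl[y][x]
    if polarity == 0 or y == 0 or y == height - 1:
        return False

    def opposite_near(row):
        # Opposite-polarity inflection directly in line with the target,
        # or displaced horizontally by at most one pixel location.
        lo, hi = max(0, x - 1), min(width, x + 2)
        return any(infl[row][xx] == -polarity for xx in range(lo, hi))

    horizontal = [infl[y][xx] for xx in (x - 1, x + 1) if 0 <= xx < width]
    return (opposite_near(y - 1)
            and opposite_near(y + 1)
            and polarity in horizontal
            and -polarity not in horizontal)


combed = [[-1, -1, -1],
          [+1, +1, +1],
          [-1, -1, -1]]
isolated = [[0, 0, 0],
            [0, 1, 0],
            [0, 0, 0]]
print(artifact_indicator(combed, 1, 1))    # True
print(artifact_indicator(isolated, 1, 1))  # False
```

The horizontal conditions reject isolated single-pixel inflections, which are more likely to be noise or fine detail than interlace artifacts.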
In still another aspect of the invention, the inflection indicator comprises a positive contrast inflection flag and a negative contrast inflection flag that are generated in accordance with:
I+(x,y) = ((i(x,y) − i(x,y−1)) > +T) ∩ ((i(x,y) − i(x,y+1)) > +T)
I−(x,y) = ((i(x,y) − i(x,y−1)) < −T) ∩ ((i(x,y) − i(x,y+1)) < −T)
where:
I+(x,y) is the positive contrast inflection flag at pixel location (x,y);
I−(x,y) is the negative contrast inflection flag at pixel location (x,y);
i(x,y) is an intensity value at pixel location (x,y); and
T is an inflection intensity threshold.
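The two inflection flags can be computed as in the following Python sketch. It assumes that the target field and the neighboring field have been woven into a single frame, so that lines y−1 and y+1 belong to the neighboring field; it also assumes that the top and bottom lines, which lack one vertical neighbor, are left unflagged. The function name is hypothetical.

```python
def inflection_flags(frame, T):
    """Compute positive and negative contrast inflection flags.

    `frame[y][x]` is the intensity i(x, y) of a woven frame (assumption).
    Returns (pos, neg), two 2-D lists of booleans of the same shape.
    """
    height, width = len(frame), len(frame[0])
    pos = [[False] * width for _ in range(height)]
    neg = [[False] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(width):
            up = frame[y][x] - frame[y - 1][x]
            down = frame[y][x] - frame[y + 1][x]
            pos[y][x] = up > T and down > T      # brighter than both neighbors
            neg[y][x] = up < -T and down < -T    # darker than both neighbors
    return pos, neg


combed = [[10, 10],
          [50, 50],
          [10, 10],
          [50, 50]]
pos, neg = inflection_flags(combed, T=20)
print(pos[1])  # [True, True]
print(neg[2])  # [True, True]
```

Alternating bright and dark lines, as produced by weaving two fields captured at different instants, yield rows of alternating positive and negative flags.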
In yet another aspect of the invention, the first metric may be generated by summing the artifact detection indicators. Alternatively, the first metric may be generated by, for each of one or more of the target pixels, computing a local average of the artifact detection indicators, whereby a set of local averages is generated; and selecting a highest local average from the set of local averages for use as the first metric. This latter technique is useful for detecting artifacts when the video fields have only one or more very small portions that are representative of motion.
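Both ways of combining the artifact detection indicators can be sketched in Python. The indicators are assumed to be stored as a 2-D list of 0/1 values, and the local average is assumed to be taken over a square sliding window whose size (`window`) is a hypothetical parameter; the text does not specify the neighborhood.

```python
def metric_sum(indicators):
    """Metric formed by summing all artifact detection indicators."""
    return sum(sum(row) for row in indicators)


def metric_max_local_average(indicators, window=2):
    """Metric formed by the highest local average of the indicators.

    Assumption: local averages are taken over window x window blocks that
    lie fully inside the field; returns 0.0 if no such block exists.
    """
    height, width = len(indicators), len(indicators[0])
    best = 0.0
    for y in range(height - window + 1):
        for x in range(width - window + 1):
            total = sum(indicators[yy][xx]
                        for yy in range(y, y + window)
                        for xx in range(x, x + window))
            best = max(best, total / (window * window))
    return best


# A small moving region in an otherwise static field.
indicators = [[0, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 0]]
print(metric_sum(indicators))                # 4
print(metric_max_local_average(indicators))  # 1.0
```

The sum is small relative to the field size, whereas the highest local average saturates, which is why the latter suits fields in which only a small region moves.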
In yet other embodiments of the invention, a progressive video frame is detected in a sequence of video fields, wherein the sequence of video fields includes a target video field, and wherein each of the video fields comprises a plurality of pixels. This is accomplished by, for each of a group of target pixels comprising one or more pixels in the target video field, generating a first metric by comparing the target pixel with a corresponding pixel in an immediately preceding video field, whereby a set of first metrics is generated; for each of the target pixels, generating a second metric by comparing the target pixel with a corresponding pixel in an immediately succeeding video field, whereby a set of second metrics is generated; and using the set of first metrics and the set of second metrics to determine which, if any, of the immediately preceding and immediately succeeding video fields is derived from a same progressive video frame as the target video field.
In another aspect of the invention, the act of using the set of first metrics and the set of second metrics to determine which, if any, of the immediately preceding and immediately succeeding video fields is derived from the same progressive frame as the target video field includes, for each of the target pixels, forming a ratio of the first metric with respect to the second metric, whereby a set of ratios is formed, and wherein each ratio is an indicator of whether none, one or both of the immediately preceding and immediately succeeding video fields are progressive matches to the target video field. Then, a first value is generated that represents how many pixels have a ratio that indicates that none of the immediately preceding and immediately succeeding video fields are progressive matches to the target video field; a second value is generated that represents how many pixels have a ratio that indicates that one of the immediately preceding and immediately succeeding video fields is a progressive match to the target video field; a third value is generated that represents how many pixels have a ratio that indicates that both of the immediately preceding and immediately succeeding video fields are progressive matches to the target video field. Then, it is determined which, if any, of the immediately preceding and immediately succeeding video fields is derived from the same progressive frame as the target video field based on which of the first, second, and third values is largest.
In still another aspect of the invention, for each ratio, the immediately succeeding video field is derived from a same progressive frame as the target video field if the ratio is greater than an upper threshold value; for each ratio, the immediately preceding video field is derived from the same progressive frame as the target video field if the ratio is less than a lower threshold value; and for each ratio, both the immediately preceding and immediately succeeding video fields are derived from the same progressive frame as the target video field if the first metric and the second metric are both equal to zero.
In yet another aspect of the invention, for each ratio, neither of the immediately preceding and immediately succeeding video fields are derived from the same progressive frame as the target video field if the ratio is both greater than the lower threshold value and less than the upper threshold value.
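The per-pixel ratio test and the subsequent vote can be sketched in Python as follows. The threshold values (0.5 and 2.0) are hypothetical. Several behaviors are assumptions not stated in the text: a ratio exactly equal to a threshold is treated as "none"; a zero second metric with a nonzero first metric is treated as an unbounded ratio; ties between counts are resolved in the order none, both, one; and when "one" wins, the neighbor with more per-pixel votes is chosen.

```python
from collections import Counter


def classify_pixel(d_prev, d_next, lower=0.5, upper=2.0):
    """Classify one pixel from its first (d_prev) and second (d_next) metric."""
    if d_prev == 0 and d_next == 0:
        return "both"
    if d_next == 0:
        return "next"  # ratio is unbounded, hence above the upper threshold
    ratio = d_prev / d_next
    if ratio > upper:
        return "next"
    if ratio < lower:
        return "prev"
    return "none"


def vote(d_prev_values, d_next_values, lower=0.5, upper=2.0):
    """Pick the field-level decision from the largest of the three counts."""
    counts = Counter(classify_pixel(p, n, lower, upper)
                     for p, n in zip(d_prev_values, d_next_values))
    none_count = counts["none"]
    one_count = counts["prev"] + counts["next"]
    both_count = counts["both"]
    largest = max(none_count, one_count, both_count)
    if largest == none_count:
        return "none"
    if largest == both_count:
        return "both"
    return "prev" if counts["prev"] >= counts["next"] else "next"


print(vote([0, 1, 0, 1], [10, 12, 9, 0]))  # prev
```

In the usage line, three of the four pixels differ far less from the preceding field than from the succeeding field, so the preceding field is selected as the progressive match.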
In still another aspect of the invention, the target video field is synthesized by interpolating vertically aligned pixels in a source video field that is between the immediately preceding and immediately succeeding video fields.
In yet another aspect of the invention, the first metric is determined in accordance with:
D−1[x,y] = |FIELDn−1[x,y] − FIELDn′[x,y]|;
and the second metric is determined in accordance with:
D+1[x,y] = |FIELDn+1[x,y] − FIELDn′[x,y]|,
wherein:
FIELDn′[x,y] is a pixel located at location x,y in the synthesized target video field;
FIELDn−1[x,y] is a pixel located at location x,y in the immediately preceding video field; and
FIELDn+1[x,y] is a pixel located at location x,y in the immediately succeeding video field.
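The synthesized-field metrics can be illustrated with a short Python sketch. The interpolation used here, a plain average of each source line with the line below it (with the last line repeated at the bottom edge), is an assumption; the text only states that vertically aligned pixels of the source field are interpolated. Fields are represented as 2-D lists of intensities, and the function names are hypothetical.

```python
def synthesize_target(field_n):
    """Synthesize FIELDn' by averaging vertically adjacent source lines.

    Assumption: row y of the result is the mean of source rows y and y+1,
    with the last source row repeated at the bottom edge.
    """
    rows = len(field_n)
    synthesized = []
    for y in range(rows):
        below = field_n[min(y + 1, rows - 1)]
        synthesized.append([(a + b) / 2 for a, b in zip(field_n[y], below)])
    return synthesized


def abs_diff(field_a, field_b):
    """Per-pixel absolute difference between two fields of equal size."""
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(field_a, field_b)]


field_n = [[10, 10], [20, 20]]
field_prev = [[15, 15], [20, 20]]
field_next = [[25, 15], [20, 30]]

target = synthesize_target(field_n)
d_prev = abs_diff(field_prev, target)  # D-1[x,y]
d_next = abs_diff(field_next, target)  # D+1[x,y]
print(d_prev)  # [[0.0, 0.0], [0.0, 0.0]]
print(d_next)  # [[10.0, 0.0], [0.0, 10.0]]
```

In this toy example the preceding field agrees with the synthesized target everywhere, whereas the succeeding field differs at two pixels, which points to the preceding field as the progressive mate.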
In still another aspect of the invention, the progressive video frame detection technique further includes detecting a set of pixels in the target video field that are representative of motion in the video input. In such embodiments, the step of using the set of first metrics and the set of second metrics to determine which, if any, of the immediately preceding and immediately succeeding video fields is derived from a same progressive video frame as the target video field may be performed by utilizing only the set of pixels in the target video field that are representative of motion in the video input.
In yet another aspect of the invention, the step of detecting the set of pixels in the target video field that are representative of motion in the video input includes comparing each of one or more pixels in the immediately preceding video field with a corresponding one of one or more pixels in the immediately succeeding video field. Here, for each of the one or more pixels, a representation of motion is detected if an absolute value of a difference between the pixel in the immediately preceding video field and the pixel in the immediately succeeding video field is greater than a threshold amount.
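The motion detection step can be sketched in Python as a per-pixel thresholded difference between the two neighboring fields. The function name `motion_mask` and the threshold value in the usage lines are hypothetical.

```python
def motion_mask(field_prev, field_next, threshold):
    """Flag pixels whose intensity changes between fields n-1 and n+1.

    A pixel is marked as representing motion when the absolute difference
    between the preceding and succeeding fields exceeds `threshold`.
    """
    return [[abs(p - n) > threshold for p, n in zip(row_p, row_n)]
            for row_p, row_n in zip(field_prev, field_next)]


field_prev = [[10, 10], [10, 10]]
field_next = [[10, 40], [12, 10]]
print(motion_mask(field_prev, field_next, threshold=5))
# [[False, True], [False, False]]
```

Because the preceding and succeeding fields have the same parity, their pixels cover the same scan lines and can be compared directly without interpolation.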
In still another aspect of the invention, the first metric is determined in accordance with:
D−1[x,y] = |FIELDn−1[x,y] − FIELDn[x,y]|;
and the second metric is determined in accordance with:
D+1[x,y] = |FIELDn+1[x,y] − FIELDn[x,y]|,
wherein:
FIELDn[x,y] is a pixel located at location x,y in the target video field;
FIELDn−1[x,y] is a pixel located at location x,y in the immediately preceding video field; and
FIELDn+1[x,y] is a pixel located at location x,y in the immediately succeeding video field.